[OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

James Fournie jfournie at sitka.bclibraries.ca
Tue Mar 6 18:13:25 EST 2012


>> 
>> * Giving greater weight to a record if the search terms appear in the title
>> or subject (ideally, we would like these fields to be configurable). This is
>> something that is tweakable in search.relevance_ranking, but my
>> understanding is that the use of these tweaks results in a major reduction
>> in search performance.
>> 
> 
> Indeed they do, however rewriting them in C to be super-fast would
> improve this situation.  It's primarily a matter of available time and
> effort.  It's also, however, pretty specialized work as you're dealing
> with Postgres at a very intimate level.

Mike, could you elaborate on which bits of code you're talking about here that could be rewritten in C?

Some of my colleagues at Sitka and I were trying to find out why broad searches are unusually slow, and we eventually found that our adjustments in search.relevance_adjustment were slowing things down. Months earlier, the CD (cover density) patch had been added to trunk to circumvent this problem, unbeknownst to us, so we tried backporting that code and testing it; however, in our initial tests we weren't entirely satisfied with the CD modifiers' ability to rank items.

Doing some digging into the SQL logs and QueryParser.pm, we observed that the naco_normalize function appears to be what slows the use of relevance_adjustment down. While naco_normalize is quite fast for a single call, the cost adds up sharply when it is run across many rows:

EXPLAIN ANALYZE SELECT naco_normalize(value) FROM metabib.keyword_field_entry LIMIT 10000;

When the relevance adjustments are in use, naco_normalize is run on every metabib.x_entry.value retrieved in the initial result set, which for a broad search can be thousands of rows. You can raise the LIMIT in the query above to see how the runtime grows with the result set. It is also run for each relevance_adjustment, though I'm assuming the query parser treats it properly as IMMUTABLE so that it is only evaluated once per adjustment.
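
(For reference, one way to check how the function is declared on the Postgres side is to look at pg_proc:)

SELECT proname, provolatile FROM pg_proc WHERE proname = 'naco_normalize';
-- provolatile: 'i' = immutable, 's' = stable, 'v' = volatile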

Anyway, I'm not entirely sure how this analysis holds up in trunk, as our testing was done on Postgres 8.4 and Evergreen 2.0; there is new code in trunk in O:A:Storage:Driver:Pg:QueryParser.pm, but no changes to these particular bits.

I've attached some sample SQL showing part of a 2.0 query and the same query without naco_normalize run on the metabib table. In my testing against our production dataset, this query -- a search for "Canada" -- went from over 80 seconds to under 10 once naco_normalize was removed from the metabib side (it's still run on the incoming term, though that's probably unavoidable).
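
To give a sense of the shape of the change -- this is not the actual generated SQL, just an illustration with a made-up alias:

-- before: every candidate row is normalized at query time
... WHERE naco_normalize(mkfe.value) ~ naco_normalize('canada') ...
-- after: only the incoming term is normalized
... WHERE mkfe.value ~ naco_normalize('canada') ...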

My thought for a solution is that we could run naco_normalize as an INSERT trigger on that field (rough sketch below). Obviously the whole tables would need to be updated, which is no small task. I'm also not sure whether it would impact anything else, i.e. wherever else metabib.x_field_entry.value is used, but generally I'd expect that we almost always use that value in a comparison of some kind and want it in normalized form. Another option would be to skip normalization in those comparisons, though that's slightly less attractive IMO. Anyway, I'd be interested to hear your thoughts.
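
A rough sketch of what I mean, for one table only (the function and trigger names are made up, and every metabib.*_field_entry table would need the same treatment plus a one-time backfill):

CREATE OR REPLACE FUNCTION metabib.naco_normalize_value() RETURNS TRIGGER AS $$
BEGIN
    -- store the normalized form so searches can compare the column
    -- as-is instead of calling naco_normalize on every candidate row
    NEW.value := naco_normalize(NEW.value);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER naco_normalize_value_trg
    BEFORE INSERT OR UPDATE ON metabib.keyword_field_entry
    FOR EACH ROW EXECUTE PROCEDURE metabib.naco_normalize_value();

-- one-time backfill of existing rows:
UPDATE metabib.keyword_field_entry SET value = naco_normalize(value);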


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: bad-queries.sql.txt
URL: <http://libmail.georgialibraries.org/pipermail/open-ils-general/attachments/20120306/49cee521/attachment.txt>

~James Fournie
BC Libraries Cooperative

