[OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

Mike Rylander mrylander at gmail.com
Tue Mar 6 22:55:24 EST 2012


On Tue, Mar 6, 2012 at 6:13 PM, James Fournie
<jfournie at sitka.bclibraries.ca> wrote:
>>>
>>> * Giving greater weight to a record if the search terms appear in the title
>>> or subject (ideally, we would like these fields to be configurable). This is
>>> something that is tweakable in search.relevance_ranking, but my
>>> understanding is that the use of these tweaks results in a major reduction
>>> in search performance.
>>>
>>
>> Indeed they do, however rewriting them in C to be super-fast would
>> improve this situation.  It's primarily a matter of available time and
>> effort.  It's also, however, pretty specialized work as you're dealing
>> with Postgres at a very intimate level.
>
> Mike, could you elaborate what bits of code you're talking about here that could be rewritten in C?
>

I mean specifically the elaborate COALESCE/NULLIF/regexp (aka ~) parts
of the SELECT clauses that implement the first-word, word-order and
full-phrase relevance bumps that come from
search.relevance_adjustment.  There's also the option of attempting to
rewrite naco_normalize and search_normalize (see below) in C.  That's
a lot of string mangling of the sort Perl is particularly suited to,
but it's not impossible in C by any means, and there are Postgres
components (the 'unaccent' contrib/extension, for instance) that we
could probably build on.
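
For the curious, the generated expressions are roughly of this shape
(illustrative only -- not the exact SQL QueryParser.pm emits, and the
weight of 5 is just a placeholder):

  -- A "first word" style bump: a match multiplies the factor to 5, a
  -- miss leaves it at 1.
  SELECT fe.source, fe.value,
         COALESCE(
           NULLIF(
             (naco_normalize(fe.value) ~ ('^' || naco_normalize('canada')))::INT,
             0
           ) * 5,
           1
         ) AS first_word_bump
    FROM metabib.keyword_field_entry fe
   LIMIT 100;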

> Some of my colleagues at Sitka and I were trying to find out why broad searches are unusually slow, and we eventually found that our adjustments in search.relevance_adjustment were slowing things down.  Months earlier, unbeknownst to us, the CD patch had been added to trunk to circumvent this problem, so we tried backporting that code and testing it; however, in our initial tests, we weren't entirely satisfied with the CD modifiers' ability to rank items.
>

Right.  These are more subtle than the heavy-handed
search.relevance_adjustment settings, and therefore have a less
drastic effect.  But they also reduce the need for some of the
search.relevance_adjustment entries, so in combination we should be
able to find a good balance, especially if some of the rel_adjustment
effects can be rewritten in C.
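
(For anyone following along: the CD modifiers lean on the cover-density
ranking that ships with Postgres full-text search, ts_rank_cd.  Very
roughly, and assuming the index_vector tsvector column on the
field-entry tables, that flavor of ranking looks like this:)

  -- The third argument is a normalization bitmask; 4 divides the rank
  -- by the mean harmonic distance between extents, which rewards
  -- tighter groupings of the search terms.
  SELECT fe.source,
         ts_rank_cd(fe.index_vector, to_tsquery('canada'), 4) AS rank
    FROM metabib.keyword_field_entry fe
   WHERE fe.index_vector @@ to_tsquery('canada')
   ORDER BY rank DESC
   LIMIT 10;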

> Doing some digging into the SQL logs and QueryParser.pm, we observed that the naco_normalize function appears to be what's slowing the use of relevance_adjustment down.  While the naco_normalize function itself is quite fast on its own, the cumulative cost climbs steeply when it is run over many records:
>
> explain analyze select naco_normalize(value) from metabib.keyword_field_entry limit 10000;
>
> When using the relevance adjustments, it is run on each metabib.x_entry.value that is retrieved in the initial result set, which in many cases would be thousands of records.  You can adjust the LIMIT in the above query to see how it slows down as the result set gets larger.  It is also run for each relevance_adjustment; however, I'm assuming that the query parser is treating it properly as IMMUTABLE and only running it once for each adjustment.
>

Indeed, and naco_normalize is not necessarily the only normalizer that
will be applied to each and every field!  If you search a class or
field that uses other (pos >= 0) normalizers, all of those will also
be applied to both the column value and the user input.

There's some good news on this front, though.  Galen recently
implemented a trimmed down version of naco_normalize, called
search_normalize, that should be a bit faster.  That should lower the
total cost by a noticeable amount over many thousands of rows.
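
An easy way to see the difference on your own data is to time the two
side by side, following your earlier example (the numbers will vary,
but the relative cost is what matters):

  EXPLAIN ANALYZE SELECT naco_normalize(value)
    FROM metabib.keyword_field_entry LIMIT 10000;
  EXPLAIN ANALYZE SELECT search_normalize(value)
    FROM metabib.keyword_field_entry LIMIT 10000;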

> Anyway, I'm not entirely sure how this analysis holds up in trunk, as we've done this testing on Postgres 8.4 and Eg 2.0, and it looks like there's new code in trunk in O:A:Storage:Driver:Pg:QueryParser.pm, but no changes to those bits.
>
> I've attached some sample SQL of part of a 2.0 query and the same query without naco_normalize run on the metabib table.  In my testing on our production dataset, this query -- a search for "Canada" -- went from over 80 seconds to less than 10 by removing the naco_normalize (it's still being run on the incoming term, though, which is probably unavoidable).
>

It is unavoidable, but the normalizers should only be run once on the
user input and the result cached.  EXPLAIN will tell the tale, and if
they're not being evaluated just once then the normalizer functions
aren't properly marked STABLE.
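
If you want to check that on your end, pg_proc shows how the functions
are flagged, and a mis-flagged one can be corrected in place (the
ALTER below is only a sketch -- the argument list has to match the
signature you actually have installed):

  -- 'i' = IMMUTABLE, 's' = STABLE, 'v' = VOLATILE
  SELECT proname, provolatile
    FROM pg_proc
   WHERE proname IN ('naco_normalize', 'search_normalize');

  -- ALTER FUNCTION public.naco_normalize(TEXT, TEXT) STABLE;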

Hrm... and looking at your example, I spotted a chance for at least
one optimization.  If we recognize that there is only one term in a
search (as in your "canada" example), we can skip the word-order
rel_adjustment even if we're told to apply it, saving ~1/3 of the cost
of that particular chunk.

> My thought for a solution would be that we could have naco_normalize run as an INSERT trigger on that field.  Obviously the whole of those tables would need to be updated, which is no small task.  I'm also not sure if that would impact other things, i.e. where else the metabib.x_field_entry.value field is used, but generally I'd think we'd almost always be using that value for a comparison of some kind and want it in a normalized form.  Another option may be to not normalize in those comparisons, but that's slightly less attractive IMO.  Anyway, I'd be interested to hear your thoughts on that.
>

We need the pre-normalized form for some things, but there are options
(that all have tradeoffs, of course):
  * We could find those things, other than search, for which we use
m.X_entry.value and move them elsewhere.  The tradeoff would be that
any change in config.metabib_field or normalizer configuration would
have to cause a rewrite of that column.
  * We could store a normal_value version that is fully normalized,
but that will nearly double the table size, and that column would
still have to be rewritten every time a config.metabib_field row or
normalizer configuration changes.  In this case, though, we'd at least
still have the original value and wouldn't have to go back to the
MARC.  (See the sketch after this list.)
  * Alternatively, I mentioned off-hand the option of direct indexing.
 The idea is to use an expression index defined by each row in
config.metabib_field and have a background process keep the indexes in
sync with the configuration as things in that table (and the related
normalizer configuration, etc.) change.  I fear that's the path to
madness, but it would be the most space-efficient way to handle
things.
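
To make the second option a bit more concrete, a minimal sketch might
look like the following.  The names are hypothetical, a real version
would have to apply the full per-field normalizer chain rather than
bare naco_normalize, and the existing rows would still need a one-time
rewrite:

  ALTER TABLE metabib.keyword_field_entry ADD COLUMN normal_value TEXT;

  CREATE OR REPLACE FUNCTION metabib.set_normal_value () RETURNS TRIGGER AS $$
  BEGIN
      -- Sketch only: the real thing needs the whole normalizer chain
      -- configured for NEW.field, not just naco_normalize.
      NEW.normal_value := naco_normalize(NEW.value);
      RETURN NEW;
  END;
  $$ LANGUAGE PLPGSQL;

  CREATE TRIGGER set_normal_value_tgr
      BEFORE INSERT OR UPDATE ON metabib.keyword_field_entry
      FOR EACH ROW EXECUTE PROCEDURE metabib.set_normal_value();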

IMO, since we have a couple of angles of attack under the current
schema for optimizing what we're doing, those are the safer courses to
start with.

[Side note: if you don't need the language-based relevance bump
(because, say, the vast majority of your collection is English),
remove the default_preferred_language[_weight] elements from your
opensrf.xml -- you should save a good bit from that alone.]

-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com

