[OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

Mike Rylander mrylander at gmail.com
Wed Mar 7 07:36:19 EST 2012


On Wed, Mar 7, 2012 at 2:28 AM, Dan Scott <dan at coffeecode.net> wrote:
> Lots of <snip>s implied below; also note that I'm running James' 2.0
> query on a 2.1 system.
>
> On Tue, Mar 06, 2012 at 10:55:24PM -0500, Mike Rylander wrote:
>> On Tue, Mar 6, 2012 at 6:13 PM, James Fournie
>> <jfournie at sitka.bclibraries.ca> wrote:
>> >>>
>> >>> * Giving greater weight to a record if the search terms appear in the title
>> >>> or subject (ideally, we would like these fields to be configurable). This is
>> >>> something that is tweakable in search.relevance_ranking, but my
>> >>> understanding is that the use of these tweaks results in a major reduction
>> >>> in search performance.
>> >>>
>> >>
>> >> Indeed they do, however rewriting them in C to be super-fast would
>> >> improve this situation.  It's primarily a matter of available time and
>> >> effort.  It's also, however, pretty specialized work as you're dealing
>> >> with Postgres at a very intimate level.
>
> Hmm. For sure, C beats Perl for performance and would undoubtedly offer an
> improvement, but it looks like another bottleneck for broad searches is
> having to visit hundreds of thousands of rows so that they can be
> sorted by rank, with the added I/O cost of using a disk merge for
> these broad searches rather than an in-memory quicksort.
>
> For comparison, I swapped out 'canada' for 'paraguay' and explain
> analyzed the results; 'canada' uses a disk merge because it needs to
> deal with 482,000 rows of data and sort 596,000 KB of data, while
> 'paraguay' (which only has to sort 322 rows) used an in-memory quicksort
> at 582 KB.
>
> This is on a system where work_mem is set to 288 MB - much higher than
> one would generally want, particularly for the number of physical
> connections that could potentially get ramped up. That high work_mem
> helps with reasonably broad searches, but searching for "Canada" in a
> Canadian academic library, you might as well be searching for "the"...
>

All true, but also not something we can do much about (without a
precalculated rank, a la PageRank); also, testing on 9.0 around its
release showed that pre-limiting as we used to do was slower than what
we do today.  I don't have the details in front of me, but there it is.
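
For illustration, the spill Dan describes is visible directly in the
plan output, and work_mem can be probed per-session rather than
server-wide (table and column names below are hypothetical, not
Evergreen's actual schema):

```sql
-- EXPLAIN ANALYZE reports either "Sort Method: external merge  Disk: ..."
-- or "Sort Method: quicksort  Memory: ..." under the Sort node, which is
-- how you tell a disk merge from an in-memory quicksort.
EXPLAIN ANALYZE
SELECT record, rank
  FROM search_hits          -- hypothetical intermediate result set
 ORDER BY rank DESC;

-- Raising work_mem for a single session shows where the crossover happens
-- without committing to a risky server-wide setting:
SET work_mem = '288MB';
```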

>
>> Indeed, and naco_normalize is not necessarily the only normalizer that
>> will be applied to each and every field!  If you search a class or
>> field that uses other (pos >= 0) normalizers, all of those will also
>> be applied to both the column value and the user input.
>>
>> There's some good news on this front, though.  Galen recently
>> implemented a trimmed down version of naco_normalize, called
>> search_normalize, that should be a bit faster.  That should lower the
>> total cost by a noticeable amount over many thousands of rows.
>
> You might be thinking of something else? I'm pretty sure that
> 2bc4e97f72b shows that I implemented search_normalize() simply to avoid
> problems with apostrophe mangling in the strict naco_normalize()
> function - and I doubt there will be any observable difference in
> performance.

Sorry, I probably am.  My apologies, Dan, I didn't intend to misdirect
your credit.  That said, I'd be surprised if anything that shortened
the pl/perl we use didn't help some, in aggregate, on very large
queries.  It's testable...
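
To make the cost shape concrete (this is illustrative only -- the real
matching is tsvector-based, and the function and table names here are
assumptions rather than Evergreen's exact schema):

```sql
-- Every configured normalizer runs once per candidate row on the column
-- side and once on the user input, so with N rows and K normalizers you
-- pay roughly N*K per-row pl/perl calls.
SELECT id
  FROM metabib.title_field_entry
 WHERE naco_normalize(value) = naco_normalize('Canada');
-- Shaving anything off each call therefore pays off in aggregate on
-- broad searches, even if a single call is barely faster.
```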

>
>> Hrm... and looking at your example, I spotted a chance for at least
>> one optimization.  If we recognize that there is only one term in a
>> search (as in your "canada" example) we can skip the word-order
>> rel_adjustment if we're told to apply it, saving ~1/3 of the cost of
>> that particular chunk.
>
> I can confirm this; running the same query on our system with word order
> removed carved response times down to 390 seconds from 580 seconds.
> Still unusable, but better. (EXPLAIN ANALYZE of the inner query
> attached).
>
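The single-term shortcut could look something like this sketch (the
helper function and intermediate table are hypothetical, just to show
where the branch would sit):

```sql
-- With a single search term the word-order factor is always 1, so the
-- adjustment can be skipped outright rather than computed per row.
SELECT id,
       rank * CASE
                WHEN array_length(string_to_array('canada', ' '), 1) > 1
                THEN word_order_multiplier(index_vector, 'canada')  -- hypothetical helper
                ELSE 1.0
              END AS adjusted_rank
  FROM candidate_hits;  -- hypothetical intermediate result
```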
>>   * Alternatively, I mentioned off-hand the option of direct indexing.
>>  The idea is to use an expression index defined by each row in
>> config.metabib_field and have a background process keep the indexes in
>> sync with configuration as things in that table (and the related
>> normalizer configuration, etc) changes.  I fear that's the path to
>> madness, but it would be the most space efficient way to handle
>> things.
>
> I don't think that's the path to madness; it appeals to me, at least.
> (Okay, it's probably insane then.)
>

In a broad sense it appeals to me, too.  You and I were the ones who
discussed this in the long-long-ago, IYR.  It's when I start digging
into the details of implementation and the implications for
config-change-based thrashing that I start going a little mad ...
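
A sketch of what one such per-field index might look like (index name,
field choice, and normalizer call are illustrative):

```sql
-- One expression index per config.metabib_field row, kept in sync by a
-- background process.  Note that PostgreSQL requires every function in
-- an expression index to be IMMUTABLE.
CREATE INDEX metabib_title_entry_search_idx
    ON metabib.title_field_entry
 USING gin (to_tsvector('english', search_normalize(value)));

-- The thrashing problem: any change to a field's normalizer chain
-- invalidates its index, forcing a DROP/CREATE and a full rebuild
-- behind the scenes.
```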

OH!  Ranking via ts_rank[_cd].  We can't do it without the tsvector in hand.

But if we can work around that somehow (a table that stores only that
value, reducing the sort size you mention above?), it's not
impossible.
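
Roughly, the workaround might look like this (table and column names
are hypothetical):

```sql
-- Keep only the tsvector in a narrow side table so ts_rank()/ts_rank_cd()
-- has it in hand at query time and the sort input stays small.
CREATE TABLE metabib.rank_vector (
    source  BIGINT PRIMARY KEY,   -- bib record id
    vector  tsvector NOT NULL
);

SELECT source, ts_rank_cd(vector, q) AS rank
  FROM metabib.rank_vector,
       to_tsquery('english', 'canada') AS q
 WHERE vector @@ q
 ORDER BY rank DESC;
```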

>> [Side note: if you don't need the language-based relevance bump
>> (because, say, the vast majority of your collection is English),
>> remove the default_preferred_language[_weight] elements from your
>> opensrf.xml -- you should save a good bit from that alone.]
>
> You may also want to remove that if you have a collection with an evenly
> distributed mix of languages and a corresponding user base. With our
> bilingual population & collection, and languages (English & French) that
> share a lot of similar roots (particularly when an English stemming
> algorithm is applied), the added bump for English was quite disturbing
> for francophones & quickly disabled.
>

Good point!

> Also, it appears that removing the pertinent clause didn't affect
> response times at all. But it's 2:30 am, so I should stop testing and
> trying to draw conclusions at this point!

That should be looked into.  Removing the weight setting, in
particular, should have caused the whole language check to go away.

-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com

