[OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

James Fournie jfournie at sitka.bclibraries.ca
Wed Mar 7 13:29:03 EST 2012


Thanks guys for your feedback, just a few more thoughts I had...

On 2012-03-06, at 11:28 PM, Dan Scott wrote:
>>>> Indeed they do, however rewriting them in C to be super-fast would
>>>> improve this situation.  It's primarily a matter of available time and
>>>> effort.  It's also, however, pretty specialized work as you're dealing
>>>> with Postgres at a very intimate level.
> 
> Hmm. For sure, C beats Perl for performance and would undoubtedly offer an
> improvement,

For me, I am not terribly excited about the prospect of another bit of C floating around, particularly if it's in the Postgres bits.  I personally am not familiar with C beyond being able to compile it, run it, and maybe vaguely get an idea of what it does; I can't really debug it very well or do anything terribly useful with it.  My concern is that adding more "specialized work" to the system makes things harder to maintain and less accessible to newcomers with typical Linux+Perl+SQL or JS knowledge.   We could write open-ils.circ in C as well, but it's just not a good idea, so I think in general we should dismiss "rewrite it in C" as a solution to anything.

>>> Doing some digging into the SQL logs and QueryParser.pm, we observed that the naco_normalize function appears to be what's slowing the use of relevance_adjustment down.  While the naco_normalize function itself is quite fast on its own, it slows down exponentially when run on many records:
>>> 
>>> explain analyze select naco_normalize(value) from metabib.keyword_field_entry limit 10000;
> 
> To quibble, the increase in the number of records to the time to process
> doesn't appear to be an exponential slowdown; it's linear (at least on
> our system); 10 times the records = (roughly) 10 times as long to
> retrieve, which is what I would expect:

Yes, sorry, my mistake.  I had done the same measurement myself, so I don't know why I used that word; a bit like using the word "literally" when you don't mean "literally", I guess :)
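
For the record, the comparison I ran (and I gather Dan ran something similar) was just the same query at different LIMITs; the timings grow roughly in step with the row count:

  explain analyze select naco_normalize(value) from metabib.keyword_field_entry limit 1000;
  explain analyze select naco_normalize(value) from metabib.keyword_field_entry limit 10000;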


>>> When using the relevance adjustments, it is run on each metabib.x_entry.value that is retrieved in the initial resultset, which in many cases would be thousands of records.  You can adjust the LIMIT in the above query to see how it slows down as the result set gets larger.  It is also run for each relevance_adjustment, however I'm assuming that the query parser is treating it properly as IMMUTABLE and only running it once for each adjustment.
>>> 
> 
> Have you tried giving the function a different cost estimate, per
> https://bugs.launchpad.net/evergreen/+bug/874603/comments/3 for a
> different but related problem? It's quite possible that something like:
> ....
> That said, some quick testing suggests that it doesn't make a difference
> to the plan, at least for the inner query that's being sent to
> search.query_parser_fts().

Yeah, I tried it, but it seemed to make no difference.   I suspect that it is in fact being treated as IMMUTABLE.
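
For anyone following along, what I tried was along these lines -- the exact argument signature of naco_normalize may differ in your schema, so treat this as a sketch rather than a recipe:

  -- see how the function is currently declared ('i' in provolatile means IMMUTABLE)
  SELECT proname, provolatile, procost
    FROM pg_proc
   WHERE proname = 'naco_normalize';

  -- bump the planner's cost estimate, per the Launchpad comment above
  ALTER FUNCTION naco_normalize(text) COST 100;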

> 
>>  * Alternatively, I mentioned off-hand the option of direct indexing.
>> The idea is to use an expression index defined by each row in in
>> config.metabib_field and have a background process keep the indexes in
>> sync with configuration as things in that table (and the related
>> normalizer configuration, etc) changes.  I fear that's the path to
>> madness, but it would be the most space efficient way to handle
>> things.
> 
> I don't think that's the path to madness; it appeals to me, at least.
> (Okay, it's probably insane then.)

I am going to be totally frank and say that I think it honestly might be the path to madness.  Why do we need more moving parts like a background process?  I'm a little confused: how would a background process keep the indexes in sync when the configuration changes?  Can you flesh this out a bit more?  I don't feel I fully understand how this would work.
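
Just so I'm sure I'm picturing the same thing: I'm assuming "direct indexing" means roughly one expression index per config.metabib_field row, something like this (the index name is made up, and the real thing would presumably have to fold in whatever normalizers are configured for that field):

  CREATE INDEX mkfe_value_naco_idx
      ON metabib.keyword_field_entry (naco_normalize(value));

If so, Postgres will only accept that if naco_normalize is declared IMMUTABLE, and the index would have to be dropped and rebuilt whenever that field's normalizer configuration changed -- which I guess is what the background process would be watching for.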

To go back to my original idea of normalizing the indexes -- snip back to Mike's first response -- I'm just wondering:

> We need the pre-normalized form for some things
> * We could find those things, other than search, for which we use
> m.X_entry.value and move them elsewhere.  The tradeoff would be that
> any change in config.metabib_field or normalizer configuration would
> have to cause a rewrite of that column.

Realistically, how often would one change normalizers or metabib_fields?   I don't think it's done lightly, so this seems like a simple solution with a reasonable tradeoff: you'd rarely change your normalizers, so you'd only be running naco_normalize over the table once in a blue moon, versus running it on much of the table every time a search happens.   In Solr, and I'm sure in other search engines, you have to reindex if you change these kinds of things.
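
To make the tradeoff concrete, a normalizer change would mean a one-off bulk pass, something like the sketch below (simplified -- a real rebuild would presumably re-extract the values from the bib records rather than re-normalizing in place):

  -- one-off pass when normalizer configuration changes
  UPDATE metabib.keyword_field_entry
     SET value = naco_normalize(value);

versus what the relevance_adjustment path effectively does today, which is to run naco_normalize over every candidate row's value at search time, for every search.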

The question is: where is the m.X_entry.value actually used?  It doesn't seem to be used anywhere when I skim the code, but everything is so dynamically generated that it's hard to tell.  Any thoughts on where it might be used?   I just have a hard time thinking of a use case for that field.

~James Fournie
BC Libraries Cooperative



