[OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

Dan Scott dan at coffeecode.net
Wed Mar 7 02:28:27 EST 2012


Lots of <snip>s implied below; also note that I'm running James' 2.0
query on a 2.1 system.

On Tue, Mar 06, 2012 at 10:55:24PM -0500, Mike Rylander wrote:
> On Tue, Mar 6, 2012 at 6:13 PM, James Fournie
> <jfournie at sitka.bclibraries.ca> wrote:
> >>>
> >>> * Giving greater weight to a record if the search terms appear in the title
> >>> or subject (ideally, we would like these fields to be configurable). This is
> >>> something that is tweakable in search.relevance_ranking, but my
> >>> understanding is that the use of these tweaks results in a major reduction
> >>> in search performance.
> >>>
> >>
> >> Indeed they do; however, rewriting them in C to be super-fast would
> >> improve this situation.  It's primarily a matter of available time and
> >> effort.  It's also, however, pretty specialized work as you're dealing
> >> with Postgres at a very intimate level.

Hmm. For sure, C beats Perl for performance and would undoubtedly offer an
improvement, but it looks like another bottleneck for broad searches is
having to visit hundreds of thousands of rows so that they can be sorted
by rank, with the added I/O cost of a disk merge rather than an in-memory
quicksort for those broad searches.

For comparison, I swapped 'canada' out for 'paraguay' and compared the
EXPLAIN ANALYZE output: 'canada' used a disk merge because it had to
deal with 482,000 rows and sort 596,000 kB of data, while 'paraguay'
(which only had to sort 322 rows) used an in-memory quicksort at 582 kB.

This is on a system where work_mem is set to 288 MB - much higher than
one would generally want, particularly given the number of physical
connections that could potentially get ramped up. That high work_mem
helps with reasonably broad searches, but when you search for "Canada"
in a Canadian academic library, you might as well be searching for
"the"...

(Aside: we're working towards connection pooling to prevent total
consumption of resources during busy periods).

> > Doing some digging into the SQL logs and QueryParser.pm, we observed that the naco_normalize function appears to be what's slowing the use of relevance_adjustment down.  While the naco_normalize function itself is quite fast on its own, it slows down exponentially when run on many records:
> >
> > explain analyze select naco_normalize(value) from metabib.keyword_field_entry limit 10000;

To quibble, the relationship between the number of records and the time
to process them doesn't appear to be exponential; it's linear (at least
on our system): 10 times the records = (roughly) 10 times as long to
retrieve, which is what I would expect:

conifer=# explain analyze select naco_normalize(value) from
metabib.keyword_field_entry limit 100;
                    QUERY PLAN                                                             
---------------------------------------------------
 Limit  (cost=0.00..12.15 rows=100 width=198) (actual time=0.797..36.126
rows=100 loops=1)
   ->  Seq Scan on keyword_field_entry  (cost=0.00..783765.80
rows=6453080 width=198) (actual time=0.795..36.077 rows=100 loops=1)
 Total runtime: 36.196 ms
(3 rows)

conifer=# explain analyze select naco_normalize(value) from
metabib.keyword_field_entry limit 1000;
                     QUERY PLAN                                                              
-----------------------------------------------------
 Limit  (cost=0.00..121.46 rows=1000 width=198) (actual
time=0.601..392.982 rows=1000 loops=1)
   ->  Seq Scan on keyword_field_entry  (cost=0.00..783765.80
rows=6453080 width=198) (actual time=0.599..392.427 rows=1000 loops=1)
 Total runtime: 393.324 ms
(3 rows)

conifer=# explain analyze select naco_normalize(value) from
metabib.keyword_field_entry limit 10000;
                     QUERY PLAN                                                               
-------------------------------------------------------
 Limit  (cost=0.00..1214.56 rows=10000 width=198) (actual
time=0.132..2408.393 rows=10000 loops=1)
   ->  Seq Scan on keyword_field_entry  (cost=0.00..783765.80
rows=6453080 width=198) (actual time=0.131..2405.298 rows=10000 loops=1)
 Total runtime: 2410.517 ms
(3 rows)

> >
> > When using the relevance adjustments, it is run on each metabib.x_entry.value that is retrieved in the initial result set, which in many cases would be thousands of records.  You can adjust the LIMIT in the above query to see how it slows down as the result set gets larger.  It is also run for each relevance_adjustment; however, I'm assuming that the query parser is treating it properly as IMMUTABLE and only running it once for each adjustment.
> >

Have you tried giving the function a different cost estimate, as in
https://bugs.launchpad.net/evergreen/+bug/874603/comments/3 (a different
but related problem)? It's quite possible that something like the
following is worth a try:

-- The single-arg version just calls the two-arg version with a default
-- joiner
ALTER FUNCTION naco_normalize(TEXT) COST 2;
ALTER FUNCTION naco_normalize(TEXT, TEXT) COST 10;

-- And for master / 2.2
ALTER FUNCTION search_normalize(TEXT) COST 2;
ALTER FUNCTION search_normalize(TEXT, TEXT) COST 10;

These hints can help the planner create better plans, and if real-world
testing with big data sets bears them out, we should probably set those
costs (or something like them) in the default schema (along with cost
estimates for all of the normalization functions while we're at it).
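
For what it's worth, you can check the costs the planner is currently
assuming (and confirm that the ALTER FUNCTION statements took) with a
quick peek at the system catalog:

-- procost defaults to 100 for functions in non-C languages and to 1
-- for C-language / internal functions
SELECT proname, pronargs, procost
  FROM pg_proc
 WHERE proname IN ('naco_normalize', 'search_normalize');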

That said, some quick testing suggests that it doesn't make a difference
to the plan, at least for the inner query that's being sent to
search.query_parser_fts().

> Indeed, and naco_normalize is not necessarily the only normalizer that
> will be applied to each and every field!  If you search a class or
> field that uses other (pos >= 0) normalizers, all of those will also
> be applied to both the column value and the user input.
> 
> There's some good news on this front, though.  Galen recently
> implemented a trimmed down version of naco_normalize, called
> search_normalize, that should be a bit faster.  That should lower the
> total cost by a noticeable amount over many thousands of rows.

You might be thinking of something else? I'm pretty sure that
2bc4e97f72b shows that I implemented search_normalize() simply to avoid
problems with apostrophe mangling in the strict naco_normalize()
function - and I doubt there will be any observable difference in
performance.
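
For a quick side-by-side check on a system that has both functions
(master / 2.2), comparing the two normalizers on a value with an
apostrophe shows where they part ways; a sketch, with the test string
picked arbitrarily:

-- Output will depend on your version; the point is just to eyeball
-- how each normalizer treats the apostrophe
SELECT naco_normalize('L''Anse aux Meadows') AS naco,
       search_normalize('L''Anse aux Meadows') AS search;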

> Hrm... and looking at your example, I spotted a chance for at least
> one optimization.  If we recognize that there is only one term in a
> search (as in your "canada" example) we can skip the word-order
> rel_adjustment if we're told to apply it, saving ~1/3 of the cost of
> that particular chunk.

I can confirm this; running the same query on our system with the
word-order adjustment removed cut response time from 580 seconds to 390
seconds. Still unusable, but better. (EXPLAIN ANALYZE of the inner query
attached).

>   * Alternatively, I mentioned off-hand the option of direct indexing.
>  The idea is to use an expression index defined by each row in
> config.metabib_field and have a background process keep the indexes in
> sync with configuration as things in that table (and the related
> normalizer configuration, etc.) change.  I fear that's the path to
> madness, but it would be the most space-efficient way to handle
> things.

I don't think that's the path to madness; it appeals to me, at least.
(Okay, it's probably insane then.)
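
Just to make that concrete (for my own benefit), I picture something
along these lines for each row in config.metabib_field, with a
background job creating and dropping the indexes as the field or its
normalizer list changes. A rough sketch only - the field id and index
name are invented, and it assumes the normalizer is IMMUTABLE:

-- Hypothetical per-field expression index; field id 15 is made up
CREATE INDEX metabib_keyword_field_15_idx
    ON metabib.keyword_field_entry
 USING GIN (to_tsvector('keyword', naco_normalize(value)))
 WHERE field = 15;

The space savings would presumably come from eventually not having to
store a separate index_vector per entry, but as you say, keeping a herd
of these in sync with the configuration is where the madness creeps in.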
 
> [Side note: if you don't need the language-based relevance bump
> (because, say, the vast majority of your collection is english),
> remove the default_preferred_language[_weight] elements from your
> opensrf.xml -- you should save a good bit from that alone.]

You may also want to remove that if you have a collection with an evenly
distributed mix of languages and a corresponding user base. With our
bilingual population & collection, and two languages (English & French)
that share a lot of similar roots (particularly once an English stemming
algorithm is applied), the added bump for English was quite disturbing
for francophones, and we disabled it quickly.
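
(If you're not sure how lopsided your own collection is, a quick census
along these lines gives a rough picture; metabib.rec_descriptor is the
same view the ranking query pulls item_lang from.)

-- Rough language census; a heavily skewed result is the case where the
-- default_preferred_language bump actually buys you something
SELECT item_lang, COUNT(*) AS records
  FROM metabib.rec_descriptor
 GROUP BY item_lang
 ORDER BY records DESC;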

Also, it appears that removing the pertinent clause didn't affect
response times at all. But it's 2:30 am, so I should stop testing and
trying to draw conclusions at this point!
-------------- next part --------------
EXPLAIN ANALYZE
SELECT  m.source AS id,
         ARRAY_ACCUM(DISTINCT m.source) AS records,
         (AVG(
   (rank(x7b52820_keyword.index_vector, x7b52820_keyword.tsq) * x7b52820_keyword.weight
       *  /* first_word */ COALESCE(NULLIF( (naco_normalize(x7b52820_keyword.value) ~ ('^'||naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 5, 1)
       *  /* full_match */ COALESCE(NULLIF( (naco_normalize(x7b52820_keyword.value) ~ ('^'||naco_normalize($_25078$canada$_25078$)||'$')), FALSE )::INT * 5, 1))
   ) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_25078$eng$_25078$ , FALSE )::INT * 5, 1))::NUMERIC AS rel,
          (AVG(
    (rank(x7b52820_keyword.index_vector, x7b52820_keyword.tsq) * x7b52820_keyword.weight
       *  /* first_word */ COALESCE(NULLIF( (naco_normalize(x7b52820_keyword.value) ~ ('^'||naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 5, 1)
       *  /* full_match */ COALESCE(NULLIF( (naco_normalize(x7b52820_keyword.value) ~ ('^'||naco_normalize($_25078$canada$_25078$)||'$')), FALSE )::INT * 5, 1))
   ) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_25078$eng$_25078$ , FALSE )::INT * 5, 1))::NUMERIC AS rank, 
          FIRST(mrd.date1) AS tie_break
    FROM  metabib.metarecord_source_map m
          JOIN metabib.rec_descriptor mrd ON (m.source = mrd.record)
          
   LEFT JOIN (
    SELECT fe.*, fe_weight.weight, x.tsq /* search */
      FROM  metabib.keyword_field_entry AS fe
     JOIN config.metabib_field AS fe_weight ON (fe_weight.id = fe.field)
     JOIN (SELECT 
     to_tsquery('keyword', COALESCE(NULLIF( '(' || btrim(regexp_replace(split_date_range(naco_normalize($_25078$canada$_25078$)),E'(?:\\s+|:)','&','g'),'&|')  || ')', '()'), '')) AS tsq ) AS x ON (fe.index_vector @@ x.tsq)
   ) AS x7b52820_keyword ON (m.source = x7b52820_keyword.source)
    WHERE 1=1
          AND ((x7b52820_keyword.id IS NOT NULL))
    GROUP BY 1
    ORDER BY 4 DESC NULLS LAST, 5 DESC NULLS LAST, 3 DESC
    LIMIT 10000
;

 Limit  (cost=182348.66..182373.66 rows=10000 width=895) (actual time=397585.651..397587.978 rows=10000 loops=1)
   ->  Sort  (cost=182348.66..182429.32 rows=32265 width=895) (actual time=397585.650..397586.679 rows=10000 loops=1)
         Sort Key: (((avg((((rank(fe.index_vector, (to_tsquery('keyword'::text, COALESCE(NULLIF('(canada)'::text, '()'::text), ''::text)))) * (fe_weight.weight)::double precision) * (COALESCE(((NULLIF((naco_normalize(fe.value, ''::text) ~ '^canada'::text), false))::integer * 5), 1))::double precision) * (COALESCE(((NULLIF((naco_normalize(fe.value, ''::text) ~ '^canada$'::text), false))::integer * 5), 1))::double precision)) * (COALESCE(((NULLIF((first((populate_record(NULL::metabib.rec_desc_type, record_attr.attrs)).item_lang) = 'eng'::text), false))::integer * 5), 1))::double precision))::numeric), (first((populate_record(NULL::metabib.rec_desc_type, record_attr.attrs)).date1))
         Sort Method:  top-N heapsort  Memory: 2175kB
         ->  GroupAggregate  (cost=171574.13..180043.69 rows=32265 width=895) (actual time=17865.073..396509.549 rows=320560 loops=1)
               ->  Sort  (cost=171574.13..171654.79 rows=32265 width=895) (actual time=17864.049..19182.819 rows=482839 loops=1)
                     Sort Key: m.source
                     Sort Method:  external merge  Disk: 596000kB
                     ->  Hash Join  (cost=5023.65..169157.85 rows=32265 width=895) (actual time=3227.715..13491.037 rows=482839 loops=1)
                           Hash Cond: (fe.field = fe_weight.id)
                           ->  Nested Loop  (cost=5019.82..168710.38 rows=32265 width=895) (actual time=3227.635..13141.832 rows=482839 loops=1)
                                 ->  Nested Loop  (cost=5019.82..127948.09 rows=32265 width=488) (actual time=3227.556..9447.231 rows=482839 loops=1)
                                       ->  Nested Loop  (cost=5019.82..108877.78 rows=32265 width=480) (actual time=3227.494..6528.834 rows=499103 loops=1)
                                             ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.053..0.054 rows=1 loops=1)
                                             ->  Bitmap Heap Scan on keyword_field_entry fe  (cost=5019.82..108474.44 rows=32265 width=448) (actual time=3227.433..6394.146 rows=499103 loops=1)
                                                   Recheck Cond: (fe.index_vector @@ (to_tsquery('keyword'::text, COALESCE(NULLIF('(canada)'::text, '()'::text), ''::text))))
                                                   Filter: (fe.id IS NOT NULL)
                                                   ->  Bitmap Index Scan on metabib_keyword_field_entry_index_vector_idx  (cost=0.00..5011.75 rows=32265 width=0) (actual time=2960.927..2960.927 rows=568979 loops=1)
                                                         Index Cond: (fe.index_vector @@ (to_tsquery('keyword'::text, COALESCE(NULLIF('(canada)'::text, '()'::text), ''::text))))
                                       ->  Index Scan using metabib_metarecord_source_map_source_record_idx on metarecord_source_map m  (cost=0.00..0.58 rows=1 width=8) (actual time=0.005..0.005 rows=1 loops=499103)
                                             Index Cond: (m.source = fe.source)
                                 ->  Index Scan using record_attr_pkey on record_attr  (cost=0.00..1.25 rows=1 width=423) (actual time=0.007..0.007 rows=1 loops=482839)
                                       Index Cond: (record_attr.id = m.source)
                           ->  Hash  (cost=3.37..3.37 rows=37 width=8) (actual time=0.055..0.055 rows=37 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 2kB
                                 ->  Seq Scan on metabib_field fe_weight  (cost=0.00..3.37 rows=37 width=8) (actual time=0.009..0.030 rows=37 loops=1)
 Total runtime: 397748.838 ms
(27 rows)

