[OPEN-ILS-DEV] Questions about keyword search indexing

Thu Jan 29 12:20:09 EST 2015

Kathy,

On the bug you linked to, in the IRC paste, I mentioned creating an
indexing definition for each field you want to search.  That would
certainly get you closer to what you want, but (as you've seen) adding many
index definitions has a cost in performance.

One tactic would be to create a new XSLT for config.xml_transform that
extracts exactly the "blob" you want from the MARCXML.  It could be very
simple: drop the fields you don't want (parts of the 260, all control
fields, many 0xx datafields) and return the stripped-down MARCXML
document.  Then, just make use of that transform in your
keyword|keyword definition.  You could put that in place while the original
was still active, reingest, swap the names of the new and old definitions,
remove the old definition (and the backing data in the metabib table), and
be done.  (NOTE: you'll want to vacuum and reindex the table afterwards,
overnight, if you do that.)
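
As a rough illustration of the filtering such a transform would do, here is
a minimal sketch in Python rather than XSLT, purely to show the logic.  The
excluded tags and subfields, the function name, and the sample record are
all assumptions made up for the example, not a recommended configuration.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("", MARC_NS)  # serialize without an ns0: prefix

# Hypothetical exclusion rules, for illustration only.
EXCLUDED_TAGS = {"010", "020", "035"}      # a few 0xx datafields
EXCLUDED_SUBFIELDS = {"260": {"a", "c"}}   # drop place and date, keep 260$b


def strip_record(marcxml: str) -> str:
    """Return one MARCXML <record> with the excluded parts removed."""
    record = ET.fromstring(marcxml)
    for field in list(record):
        kind = field.tag.split("}")[-1]    # local name without namespace
        if kind == "controlfield":
            record.remove(field)
        elif kind == "datafield":
            tag = field.get("tag")
            if tag in EXCLUDED_TAGS:
                record.remove(field)
                continue
            unwanted = EXCLUDED_SUBFIELDS.get(tag, set())
            for subfield in list(field):
                if subfield.get("code") in unwanted:
                    field.remove(subfield)
            if len(field) == 0:            # nothing left worth indexing
                record.remove(field)
    return ET.tostring(record, encoding="unicode")


# Made-up sample record; none of these values come from a real catalog.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">12345</controlfield>
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0000000000</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="3">
    <subfield code="a">An example title</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">Anytown</subfield>
    <subfield code="b">Example Press</subfield>
    <subfield code="c">2015</subfield>
  </datafield>
</record>"""

if __name__ == "__main__":
    print(strip_record(SAMPLE))
```

In Evergreen itself the equivalent logic would live in an XSLT stylesheet
stored in config.xml_transform; the Python above only mirrors the selection
rules.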

Another possibility, which would be core development (but not much), would
be to allow a stack of pre-indexing transforms.  For your specific case,
the idea would be to create a simple XSLT that excludes fields you never
want exposed (the non-publisher parts of the originInfo element, in MODS
context), and put that in front of the stock mods32 transform used for
keyword|keyword.  There would be some insert/update-time impact from that
if used liberally and with many different pre-transformations -- we try
hard to use each transform only once per record per insert/update to avoid
wasted time on transformations -- but that should not be noticeable to
staff editing MARC one record at a time.  It would have some impact on
batch loading, but, if used only where necessary, or in the same way for
all index definitions, the impact would be minimal.
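
To make the "once per record" point concrete, here is a minimal sketch of
how a stack of transforms could be cached per record.  This is not existing
Evergreen code; the class, the transform names, and the stand-in transform
functions are hypothetical and exist only to show why sharing the same
stack across index definitions keeps the cost low.

```python
from typing import Callable, Dict, Tuple

Transform = Callable[[str], str]


class TransformCache:
    """Caches the output of each transform chain for a single record."""

    def __init__(self, transforms: Dict[str, Transform]) -> None:
        self.transforms = transforms
        self._cache: Dict[Tuple[str, ...], str] = {}
        self.calls = 0  # number of transforms actually executed

    def apply(self, source: str, chain: Tuple[str, ...]) -> str:
        """Run `chain` left to right, reusing any cached prefix."""
        if not chain:
            return source
        if chain not in self._cache:
            upstream = self.apply(source, chain[:-1])
            self.calls += 1
            self._cache[chain] = self.transforms[chain[-1]](upstream)
        return self._cache[chain]


# Stand-in transforms: real ones would be XSLT, these are string hacks.
def strip_origin(doc: str) -> str:
    return doc.replace("<place>Anytown</place>", "")


def to_mods(doc: str) -> str:
    return "<mods>" + doc + "</mods>"


TRANSFORMS = {"strip_origin": strip_origin, "mods32": to_mods}

if __name__ == "__main__":
    record = "<publisher>Example Press</publisher><place>Anytown</place>"
    cache = TransformCache(TRANSFORMS)   # one cache per record
    # Two index definitions sharing one stack cost two transform runs total.
    cache.apply(record, ("strip_origin", "mods32"))
    cache.apply(record, ("strip_origin", "mods32"))
    print(cache.calls)
```

Each distinct stack adds work per record, which is why many different
pre-transformations would show up in batch loads while a single shared
stack would not.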

To answer your question about whether the number of index definitions can
impact search speed: yes, absolutely.  There are trade-offs involving table
size, index size, row count, row size, and row uniqueness.

The main reason for using MODS instead of the straight MARC is that MODS
handles the AACR2 interpretation for us; it stitches together the
appropriate fields, applies non-filing indicators (NFI) to titles, and the
like.  Otherwise we would need to invent and implement our own plugin
system for all those rules (instead of letting LoC do that work for us via
MODS), or hard-code them in the indexing code.  In the case of the "blob"
specifically, it almost perfectly excludes everything we want to exclude
... with the exception of making originInfo harder to deal with for those
sites that want to include publisher name.

HTH,

--
Mike Rylander
 | President
 | Equinox Software, Inc. / The Open Source Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com

On Thu, Jan 29, 2015 at 10:58 AM, Kathy Lussier <klussier at masslnc.org>
wrote:

> Hi all,
>
> Since implementing Evergreen, the consortia I work with have added several
> custom indexes, mostly for the keyword class, to address the following
> issues:
>
> * Add weights for particular indexes (title, subject, author) for a
> keyword search.
> * Add MARC fields that are not covered by MODS or are excluded from the
> default keyword index (aka the blob).
> * Add indexes for 880 fields.
>
> Although we love the way the system is flexible enough to allow us to make
> these adjustments, we've come across a few problems that have us re-thinking
> the way we set up our indexes. I wanted to send along some of the problems
> we've encountered and the solutions we're considering to see if you all
> have feedback on our potential solutions, if you see drawbacks we hadn't
> considered, or if you have alternative approaches that might work.
>
> Some of the issues we've encountered:
>
> * When adding an index to add a MARC tag that is not covered by the
> keyword blob, we run into the problem where these indexes are not
> "combined." For example, NOBLE added a local field (962) for captions of
> digital images to its records and then added this tag to an entry for
> config.metabib_field.  If you do a keyword search for "ship george west,"
> you will successfully find this record -
> http://evergreen.noblenet.org/eg/opac/record/1622058?query=ship%20george%20west;qtype=keyword;locg=1 -
> based on records in the 962 field, but if you do a keyword search for "ship
> george west marblehead," you will get no results because marblehead is in a
> different entry of the metabib.keyword_field_entry table than the other
> keywords are.
>
> For a brief period of 2.4, a new combined indexes feature was turned on by
> default to address this problem, but it was subsequently turned off due to
> the bug reported at https://bugs.launchpad.net/evergreen/+bug/1169693.
> Since we use additional indexes for the keyword class for the purpose of
> weighting, we don't want to turn them on because we're concerned about how
> it would affect relevance ranking.
>
> * When adding an index to add a MARC tag that is not covered by the
> keyword blob, we also run into a problem of unintentionally adding weight
> to that index even when we don't want to give it additional weight. As an
> example, two of our Evergreen sites added the 260b tag to the keyword index
> so that the publisher could be searchable. We then had an issue where a
> keyword search for "free spirit" pulled up lots of results for books by a
> publisher called Free Spirit ahead of the book that was being sought (Free
> Spirit: Growing Up On the Road and Off the Grid).
> http://bark.cwmars.org/eg/opac/results?query=free+spirit&qtype=title&fg%3Aformat_filters=&locg=1&sort=
> The weight for the publisher field is 1,
> and the weight for the title field is 10. However, those records come up
> sooner because they score so highly when it comes to coverage density.
>
> If there were an easier way to include the publisher field in the blob,
> then we could have avoided this problem. However, there didn't seem to be
> an easy way to include it while continuing to exclude the rest of the
> fields that are contained within the MODS originInfo element.
>
> * We're also aware of the fact that, since we have to add an index every
> time we need to add a MARC tag, we're adding more entries to the
> metabib.keyword_field_entry table. In addition, one of our sites has added
> 880 indexes using the method I described here -
> http://markmail.org/message/dlzvcxezaycychd4 , which indexes all records,
> not just those with an 880 field. We wonder if the addition of all these
> entries ultimately impacts
> performance. We are also looking at ways of indexing just those records
> with the 880 field when using the marc21expand880 format.
>
> * With the addition of our custom indexes, one question that frequently
> arises is whether we're hurting overall search performance by adding indexes
> that result in more entries for our metabib.keyword_field_entry tables.
> Since I work with three different Evergreen sites, we're able to do some
> comparison. The site that has added the fewest (if any) custom indexes does
> indeed seem to have faster keyword searches.
>
> If a site is only using the default keyword blob and no other keyword
> indexes, the number of entries in the metabib.keyword_field_entry table
> should be the same as the number of bib records in the system. However,
> when looking at the sites that have added more custom indexes, we see as
> many as 10 times as many entries in the metabib.keyword_field_entry table
> as there are bib records in the database. At what point does the number of
> metabib entries begin to seriously impact search performance?
>
> One of our sites is considering a different approach to configuring its
> search indexes, and I wanted to put our ideas out here to see if you have
> feedback on it.
>
> Instead of using the keyword blob based on MODS, they are experimenting
> with setting up all of their keyword indexes to be based on MARC tags.
> We've talked about two ways of making this happen:
>
> * Creating another keyword blob, similar to the current default index,
> that includes all of the MARC tags and subfields that we want included in
> the keyword index. Under this approach, we could easily add tags like the
> 260b or the 962 without creating an additional config.metabib_field entry.
> Those tags would just be incorporated in the main blob and, therefore,
> wouldn't be adding unintended weight to keyword searches or running up
> against the combined indexes problem. Under this scenario, there would
> still be cases where we would need to configure additional keyword indexes
> whenever we wanted to apply additional weight to a specific MARC field.
>
> * The other approach is to create multiple config.metabib_field entries
> for individual MARC tags, or perhaps for groups of tags. We would then turn
> the combined indexes on for keyword indexes. Based on Mike's comments at
> https://bugs.launchpad.net/evergreen/+bug/1169693/comments/2, since we no
> longer would have the large keyword blob, turning combined indexes on
> shouldn't lead to the relevance-ranking problems that arose when 2.4 was
> first released. We could then adjust the weight for those entries that
> include the more relevant MARC tags. However, I do have concerns that we'll
> continue to come across problems like the "Free Spirit" example I mentioned
> above where an exact match on a 260b tag will continue to float above other
> more relevant results.
>
> While this site continues looking at the indexes, can you tell me if there
> is anything we would be losing by going from indexes based on MODS to ones
> based purely on MARC tags? What were the original reasons for basing the
> default indexes on MODS?
>
> Also, are there any pros or cons to implementing either of the approaches
> I mentioned above? Do you see other ways to address some of the issues
> we've come across in creating custom indexes?
>
> Thanks in advance for your insight!
>
> Kathy
>
>
> --
> Kathy Lussier
> Project Coordinator
> Massachusetts Library Network Cooperative
> (508) 343-0128
> klussier at masslnc.org
> Twitter:http://www.twitter.com/kmlussier
> #evergreen IRC: kmlussier
>
>