[OPEN-ILS-DEV] Questions about keyword search indexing

Thu Jan 29 10:58:34 EST 2015

Hi all,

Since implementing Evergreen, the consortia I work with have added 
several custom indexes, mostly for the keyword class, to address the 
following issues:

* Add weights for particular indexes (title, subject, author) for a 
keyword search.
* Add MARC fields that are not covered by MODS or are excluded from the 
default keyword index (aka the blob).
* Add indexes for 880 fields.

Although we love the way the system is flexible enough to allow us to 
make these adjustments, we've come across a few problems that have us 
re-thinking the way we set up our indexes. I wanted to send along some 
of the problems we've encountered and the solutions we're considering to 
see if you all have feedback on our potential solutions, if you see 
drawbacks we hadn't considered, or if you have alternative approaches 
that might work.

Some of the issues we've encountered:

* When adding an index to add a MARC tag that is not covered by the 
keyword blob, we run into the problem where these indexes are not 
"combined." For example, NOBLE added a local field (962) for captions of 
digital images to its records and then added this tag to an entry for 
config.metabib_field.  If you do a keyword search for "ship george 
west," you will successfully find this record - 
http://evergreen.noblenet.org/eg/opac/record/1622058?query=ship%20george%20west;qtype=keyword;locg=1 
- based on text in the 962 field, but if you do a keyword search for 
"ship george west marblehead," you will get no results because 
marblehead is in a different entry of the metabib.keyword_field_entry 
table than the other keywords are.

For a brief period of 2.4, a new combined indexes feature was turned on 
by default to address this problem, but it was subsequently turned off 
due to the bug reported at 
https://bugs.launchpad.net/evergreen/+bug/1169693. Since we use 
additional indexes for the keyword class for the purpose of weighting, 
we don't want to turn combined indexes on because we're concerned about 
how doing so would affect relevance ranking.
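
To make the "not combined" behavior concrete, here is a minimal Python 
sketch. It is a toy model, not Evergreen's search code: the two strings 
stand in for two metabib.keyword_field_entry rows belonging to one 
record, the text in them is invented, and both helper functions are 
hypothetical names used only for illustration.

```python
# Toy model: each string stands in for one keyword entry row for the SAME record.
# The text is invented for illustration.
entries = [
    "ship george west at anchor",      # hypothetical entry built from the 962 tag
    "marblehead harbor photographs",   # hypothetical entry from the default blob
]

def matches_per_entry(entries, query):
    """True only if a single entry contains every query term."""
    terms = query.lower().split()
    return any(all(t in e.lower().split() for t in terms) for e in entries)

def matches_combined(entries, query):
    """True if every term is present once the entries are treated as one document."""
    terms = query.lower().split()
    words = set(" ".join(entries).lower().split())
    return all(t in words for t in terms)

print(matches_per_entry(entries, "ship george west"))             # True
print(matches_per_entry(entries, "ship george west marblehead"))  # False
print(matches_combined(entries, "ship george west marblehead"))   # True
```

The second call fails because no single entry holds all four terms, 
which is the situation described for the 962 example above.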

* When adding an index to add a MARC tag that is not covered by the 
keyword blob, we also run into a problem of unintentionally adding 
weight to that index even when we don't want to give it additional 
weight. As an example, two of our Evergreen sites added the 260b tag to 
the keyword index so that the publisher could be searchable. We then had 
an issue where a keyword search for "free spirit" pulled up lots of 
results for books by a publisher called Free Spirit ahead of the book 
that was being sought (Free Spirit: Growing Up On the Road and Off the 
Grid). 
http://bark.cwmars.org/eg/opac/results?query=free+spirit&qtype=title&fg%3Aformat_filters=&locg=1&sort=. 
The weight for the publisher field is 1, and the weight for the title 
field is 10. However, those records come up sooner because they score so 
highly when it comes to coverage density.

If there were an easier way to include the publisher field in the blob, 
then we could have avoided this problem. However, there didn't seem to 
be an easy way to include it while continuing to exclude the rest of the 
fields that are contained within the MODS originInfo element.

* We're also aware of the fact that, since we have to add an index every 
time we need to add a MARC tag, we're adding more fields to the 
metabib.keyword_field_entry table. In addition, one of our sites has added 880 
indexes using the method I described here - 
http://markmail.org/message/dlzvcxezaycychd4 , which indexes all 
records, not just those with an 880 field. We wonder if the addition of 
all these entries ultimately impacts performance. We are also looking at 
ways of indexing just those records with the 880 field when using the 
marc21expand880 format.

* With the addition of our custom indexes, one question that frequently 
arises is whether we're hurting overall search performance by adding indexes 
that result in more entries for our metabib.keyword_field_entry tables. 
Since I work with three different Evergreen sites, we're able to do some 
comparison. The site that has added the fewest (if any) custom indexes 
does indeed seem to have faster keyword searches.

If a site is only using the default keyword blob and no other keyword 
indexes, the number of entries in the metabib.keyword_field_entry table 
should be the same as the number of bib records in the system. However, 
when looking at the sites that have added more custom indexes, we see as 
many as 10 times as many entries in the metabib.keyword_field_entry 
table as there are bib records in the database. At what point does the 
number of metabib entries begin to seriously impact search performance?
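
For anyone who wants to make the same comparison on their own system, 
the counting can be sketched as follows. This is only an illustration 
under stated assumptions: it uses an in-memory SQLite database with 
stand-in table names, because the real tables (biblio.record_entry and 
metabib.keyword_field_entry, as I understand the schema) live in 
PostgreSQL. The sample rows and field ids are invented, and the two 
function names are hypothetical.

```python
import sqlite3

# Stand-in tables in an in-memory SQLite database. The sample rows and
# field ids below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE record_entry (id INTEGER PRIMARY KEY);
    CREATE TABLE keyword_field_entry (
        id INTEGER PRIMARY KEY,
        source INTEGER,   -- bib record id
        field INTEGER     -- id of the index definition that produced the row
    );
""")
conn.executemany("INSERT INTO record_entry (id) VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO keyword_field_entry (source, field) VALUES (?, ?)",
    [(1, 15), (1, 101), (1, 102), (2, 15), (2, 101)],
)

def entries_per_bib(conn):
    """Ratio of keyword entry rows to bib records."""
    bibs = conn.execute("SELECT COUNT(*) FROM record_entry").fetchone()[0]
    rows = conn.execute("SELECT COUNT(*) FROM keyword_field_entry").fetchone()[0]
    return rows / bibs

def entries_by_field(conn):
    """Number of keyword entry rows contributed by each index definition."""
    return conn.execute(
        "SELECT field, COUNT(*) FROM keyword_field_entry "
        "GROUP BY field ORDER BY field"
    ).fetchall()

print(entries_per_bib(conn))   # 2.5
print(entries_by_field(conn))  # [(15, 2), (101, 2), (102, 1)]
```

A ratio near 1 would correspond to the blob-only case described above. 
The per-field breakdown shows which custom indexes account for most of 
the extra rows.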

One of our sites is considering a different approach to configuring its 
search indexes, and I wanted to put our ideas out here to see if you 
have feedback on it.

Instead of using the keyword blob based on MODS, they are experimenting 
with setting up all of their keyword indexes to be based on MARC tags. 
We've talked about two ways of making this happen:

* Creating another keyword blob, similar to the current default index, 
that includes all of the MARC tags and subfields that we want included 
in the keyword index. Under this approach, we could easily add tags like 
the 260b or the 962 without creating an additional config.metabib_field 
entry. Those tags would just be incorporated in the main blob and, 
therefore, wouldn't be adding unintended weight to keyword searches or 
running up against the combined indexes problem. Under this scenario, 
there would still be cases where we would need to configure additional 
keyword indexes whenever we wanted to apply additional weight to a 
specific MARC field.

* The other approach is to create multiple config.metabib_field entries 
for individual MARC tags, or perhaps for groups of tags. We would then 
turn the combined indexes on for keyword indexes. Based on Mike's 
comments at 
https://bugs.launchpad.net/evergreen/+bug/1169693/comments/2, since we 
no longer would have the large keyword blob, turning combined indexes on 
shouldn't lead to the relevance-ranking problems that arose when 2.4 was 
first released. We could then adjust the weight for those entries that 
include the more relevant MARC tags. However, I do have concerns that 
we'll continue to come across problems like the "Free Spirit" example I 
mentioned above where an exact match on a 260b tag will continue to 
float above other more relevant results.
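
As a rough illustration of the first approach, the sketch below builds 
one blob from a chosen list of MARC tags and subfields. It is not how 
Evergreen does its indexing. In Evergreen the selection would live in 
the index definition rather than in Python. The sample record, the 
publisher name, the BLOB_FIELDS list, and the build_blob function are 
all invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record; "Example Press" is a placeholder publisher.
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Free spirit :</subfield>
    <subfield code="b">growing up on the road and off the grid</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">New York :</subfield>
    <subfield code="b">Example Press,</subfield>
    <subfield code="c">2014.</subfield>
  </datafield>
  <datafield tag="962" ind1=" " ind2=" ">
    <subfield code="a">Ship George West</subfield>
  </datafield>
</record>"""

# Hypothetical selection: tag plus the subfield codes to keep
# (None means keep every subfield of that tag).
BLOB_FIELDS = [("245", {"a", "b"}), ("260", {"b"}), ("962", None)]

def build_blob(marcxml, fields):
    """Concatenate the selected subfields into a single keyword string."""
    record = ET.fromstring(marcxml)
    parts = []
    for tag, codes in fields:
        for datafield in record.findall(f"marc:datafield[@tag='{tag}']", NS):
            for subfield in datafield.findall("marc:subfield", NS):
                if codes is None or subfield.get("code") in codes:
                    parts.append((subfield.text or "").strip())
    return " ".join(p for p in parts if p)

blob = build_blob(MARCXML, BLOB_FIELDS)
print(blob)
```

Here the 260b and 962 text end up in the same entry as the title, while 
the 260a and 260c are left out. The second approach would instead 
produce one string per tag or group of tags and rely on combined 
indexes to search across them.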

While this site continues looking at the indexes, can you tell me if 
there is anything we would be losing by going from indexes based on MODS 
to ones based purely on MARC tags? What were the original reasons for 
basing the default indexes on MODS?

Also, are there any pros or cons to implementing either of the 
approaches I mentioned above? Do you see other ways to address some of 
the issues we've come across in creating custom indexes?

Thanks in advance for your insight!

Kathy

-- 
Kathy Lussier
Project Coordinator
Massachusetts Library Network Cooperative
(508) 343-0128
klussier at masslnc.org
Twitter:http://www.twitter.com/kmlussier
#evergreen IRC: kmlussier