[OPEN-ILS-DEV] Questions about keyword search indexing
Kathy Lussier
klussier at masslnc.org
Thu Jan 29 10:58:34 EST 2015
Hi all,
Since implementing Evergreen, the consortia I work with have added
several custom indexes, mostly for the keyword class, to address the
following issues:
* Add weights for particular indexes (title, subject, author) for a
keyword search.
* Add MARC fields that are not covered by MODS or are excluded from the
default keyword index (aka the blob).
* Add indexes for 880 fields.
Although we love the way the system is flexible enough to allow us to
make these adjustments, we've come across a few problems that have us
re-thinking the way we set up our indexes. I wanted to send along some
of the problems we've encountered and the solutions we're considering to
see if you all have feedback on our potential solutions, if you see
drawbacks we hadn't considered, or if you have alternative approaches
that might work.
Some of the issues we've encountered:
* When adding an index to add a MARC tag that is not covered by the
keyword blob, we run into the problem where these indexes are not
"combined." For example, NOBLE added a local field (962) for captions of
digital images to its records and then added this tag to an entry for
config.metabib_field. If you do a keyword search for "ship george
west," you will successfully find this record -
http://evergreen.noblenet.org/eg/opac/record/1622058?query=ship%20george%20west;qtype=keyword;locg=1
- based on records in the 962 field, but if you do a keyword search for
"ship george west marblehead," you will get no results because
marblehead is in a different entry of the metabib.keyword_field_entry
table than the other keywords are.
For a brief period in 2.4, a new combined indexes feature was turned on
by default to address this problem, but it was subsequently turned off
due to the bug reported at
https://bugs.launchpad.net/evergreen/+bug/1169693. Since we use
additional indexes for the keyword class for the purpose of weighting,
we don't want to turn combined indexes on because we're concerned about how it would
affect relevance ranking.
* When adding an index to add a MARC tag that is not covered by the
keyword blob, we also run into a problem of unintentionally adding
weight to that index even when we don't want to give it additional
weight. As an example, two of our Evergreen sites added the 260b tag to
the keyword index so that the publisher could be searchable. We then had
an issue where a keyword search for "free spirit" pulled up lots of
results for books by a publisher called Free Spirit ahead of the book
that was being sought (Free Spirit: Growing Up On the Road and Off the
Grid).
http://bark.cwmars.org/eg/opac/results?query=free+spirit&qtype=title&fg%3Aformat_filters=&locg=1&sort=.
The weight for the publisher field is 1, and the weight for the title
field is 10. However, those records come up sooner because they score so
highly when it comes to coverage density.
If there were an easier way to include the publisher field in the blob,
then we could have avoided this problem. However, there didn't seem to
be an easy way to include it while continuing to exclude the rest of the
fields that are contained within the MODS originInfo element.
* We're also aware of the fact that, since we have to add an index every
time we need to add a MARC tag, we're adding more fields to the
metabib.keyword_field_entry table. In addition, one of our sites has added 880
indexes using the method I described here -
http://markmail.org/message/dlzvcxezaycychd4 , which indexes all
records, not just those with an 880 field. We wonder if the addition of
all these entries ultimately impacts performance. We are also looking at
ways of indexing just those records with the 880 field when using the
marc21expand880 format.
* With the addition of our custom indexes, one question that frequently
arises is whether we're hurting overall search performance by adding indexes
that result in more entries for our metabib.keyword_field_entry tables.
Since I work with three different Evergreen sites, we're able to do some
comparison. The site that has added the fewest (if any) custom indexes
does indeed seem to have faster keyword searches.
If a site is only using the default keyword blob and no other keyword
indexes, the number of entries in the metabib.keyword_field_entry table
should be the same as the number of bib records in the system. However,
when looking at the sites that have added more custom indexes, we see as
many as 10 times as many entries in the metabib.keyword_field_entry
table as there are bib records in the database. At what point does the
number of metabib entries begin to seriously impact search performance?
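
To compare sites on this point, it may help to compute how many keyword entries each bib record carries and which index definitions contribute them. What follows is only a minimal Python sketch over made-up data. The pairs stand in for rows of metabib.keyword_field_entry, and the record ids, index definition ids, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical sample: (bib record id, index definition id) pairs standing in
# for rows of metabib.keyword_field_entry. All ids are invented.
entries = [
    (1, 15), (1, 30), (1, 31),
    (2, 15), (2, 30),
    (3, 15), (3, 30), (3, 31), (3, 32),
]
bib_record_count = 3


def entries_per_record(rows, record_count):
    """Average number of keyword entries per bib record."""
    return len(rows) / record_count


def entries_by_field(rows):
    """Number of entries contributed by each index definition."""
    return Counter(field_id for _, field_id in rows)


ratio = entries_per_record(entries, bib_record_count)
by_field = entries_by_field(entries)
```

A site using only the default blob would show a ratio of 1.0 here. A breakdown by index definition could show which custom indexes account for most of the growth, though it says nothing by itself about where search performance starts to suffer.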
One of our sites is considering a different approach to configuring its
search indexes, and I wanted to put our ideas out here to see if you
have feedback on it.
Instead of using the keyword blob based on MODS, they are experimenting
with setting up all of their keyword indexes to be based on MARC tags.
We've talked about two ways of making this happen:
* Creating another keyword blob, similar to the current default index,
that includes all of the MARC tags and subfields that we want included
in the keyword index. Under this approach, we could easily add tags like
the 260b or the 962 without creating an additional config.metabib_field
entry. Those tags would just be incorporated in the main blob and,
therefore, wouldn't be adding unintended weight to keyword searches or
running up against the combined indexes problem. Under this scenario,
there would still be cases where we would need to configure additional
keyword indexes whenever we wanted to apply additional weight to a
specific MARC field.
* The other approach is to create multiple config.metabib_field entries
for individual MARC tags, or perhaps for groups of tags. We would then
turn the combined indexes on for keyword indexes. Based on Mike's
comments at
https://bugs.launchpad.net/evergreen/+bug/1169693/comments/2, since we
no longer would have the large keyword blob, turning combined indexes on
shouldn't lead to the relevance-ranking problems that arose when 2.4 was
first released. We could then adjust the weight for those entries that
include the more relevant MARC tags. However, I do have concerns that
we'll continue to come across problems like the "Free Spirit" example I
mentioned above where an exact match on a 260b tag will continue to
float above other more relevant results.
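
To make the difference between separate entries and a single blob concrete, here is a small Python sketch. It assumes, as a simplification, that a keyword search matches a record only when one index entry contains every search term, which is the behavior described in the 962 example earlier when combined indexes are off. The record contents, tag groupings, and function names are invented for illustration and are not Evergreen code. Weighting and relevance ranking are not modeled.

```python
# Toy record: MARC tag+subfield -> text. Contents are invented for illustration.
record = {
    "245a": "Photographs of Marblehead harbor",
    "962a": "Ship George West",
}


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real normalization."""
    return set(text.lower().split())


def build_entries(rec, tag_groups):
    """Build one index entry (a set of terms) per group of tags."""
    entries = []
    for tags in tag_groups:
        terms = set()
        for tag in tags:
            terms |= tokenize(rec.get(tag, ""))
        entries.append(terms)
    return entries


def matches(entries, query):
    """True if any single entry contains every query term."""
    terms = tokenize(query)
    return any(terms <= entry for entry in entries)


# Separate entries: one for the existing text, one for the added 962 index.
separate = build_entries(record, [["245a"], ["962a"]])
# Single blob: every wanted tag feeds the same entry.
blob = build_entries(record, [["245a", "962a"]])

found_962_only = matches(separate, "ship george west")
found_across_separate = matches(separate, "ship george west marblehead")
found_across_blob = matches(blob, "ship george west marblehead")
```

In this toy model the four-word query fails against the separate entries and succeeds against the blob, because all of the terms end up in one entry. It does not address the "Free Spirit" ranking concern, which depends on how the entries are scored.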
While this site continues looking at the indexes, can you tell me if
there is anything we would be losing by going from indexes based on MODS
to ones based purely on MARC tags? What were the original reasons for
basing the default indexes on MODS?
Also, are there any pros or cons to implementing either of the
approaches I mentioned above? Do you see other ways to address some of
the issues we've come across in creating custom indexes?
Thanks in advance for your insight!
Kathy
--
Kathy Lussier
Project Coordinator
Massachusetts Library Network Cooperative
(508) 343-0128
klussier at masslnc.org
Twitter: http://www.twitter.com/kmlussier
#evergreen IRC: kmlussier