[OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

Mike Rylander mrylander at gmail.com
Tue Mar 6 17:00:10 EST 2012


On Tue, Mar 6, 2012 at 4:42 PM, Kathy Lussier <klussier at masslnc.org> wrote:
> Hi all,
>
> I mentioned this during an e-mail discussion on the list last month, but I
> just wanted to hear from others in the Evergreen community about whether
> there is a desire to improve the relevance ranking for search results in
> Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can
> look at things like the document length, word proximity, and unique word
> count. We've found that we had to remove the modifiers for document length
> and unique word count to prevent a problem where brief bib records were
> ranked way too high in our search results.

FWIW, there is a library testing some new combinations of cover
density (CD) modifiers and having some success.  As soon as I know
more I will share (if they don't share first).

>
> In our local discussions, we've thought the following enhancements could
> improve the ranking of search results:
>
> * Giving greater weight to a record if the search terms appear in the title
> or subject (ideally, we would like these fields to be configurable). This is
> something that is tweakable in search.relevance_ranking, but my
> understanding is that the use of these tweaks results in a major reduction
> in search performance.
>

Indeed they do; however, rewriting them in C to be super-fast would
improve this situation.  It's primarily a matter of available time and
effort.  It's also pretty specialized work, though, as you're dealing
with Postgres at a very intimate level.
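
To make the field-weighting idea concrete, here is a minimal Python
sketch.  It is an illustration only, not Evergreen's actual ranking
code: the field names, the weights, and the functions weighted_score()
and rank() are all assumptions made up for this example.

```python
# Hypothetical per-field weights (assumed values, not Evergreen defaults).
FIELD_WEIGHTS = {"title": 4.0, "subject": 2.0, "keyword": 1.0}


def weighted_score(record, terms, weights=FIELD_WEIGHTS):
    """Sum weight * (number of query terms found in that field)."""
    score = 0.0
    for field, weight in weights.items():
        # Naive whitespace tokenization; a real system would stem and normalize.
        words = set(record.get(field, "").lower().split())
        hits = sum(1 for term in terms if term.lower() in words)
        score += weight * hits
    return score


def rank(records, query):
    """Return records ordered from highest to lowest weighted score."""
    terms = query.split()
    return sorted(records, key=lambda r: weighted_score(r, terms), reverse=True)
```

A record that matches in the title outranks one that matches only in a
keyword field.  Doing this per row at query time is the expensive part,
which is why a C implementation inside Postgres would help.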

> * Using some type of popularity metric to boost relevancy for popular
> titles. I'm not sure what this metric should be (number of copies attached
> to record? Total circs in last x months? Total current circs?), but we
> believe some type of popularity measure would be particularly helpful in a
> public library where searches will often be for titles that are popular. For
> example, a search for "twilight" will most likely be for the Stephenie
> Meyer novel and not this
> http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike
> Rylander had indicated in a previous e-mail
> (http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to
> handle this through an overnight cron job without a negative impact on
> search speeds.

Right ... A regular stats-gathering job could certainly allow this,
and (if the "QueryParser explain" branch gets merged to master so we
have a standard search canonicalization function) logged query
analysis is another option as well.
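
As a rough sketch of how precomputed stats could feed into ranking,
assume a nightly job has already produced a mapping of record id to
circulation count.  Everything here is hypothetical: the function
names, the max_boost parameter, and the log damping are assumptions
for illustration, not an existing Evergreen feature.

```python
import math


def popularity_boost(circ_count, max_boost=0.5):
    """Map a circulation count to a multiplier in [1, 1 + max_boost)."""
    # log1p damps the effect so a blockbuster does not swamp text relevance.
    damped = math.log1p(max(circ_count, 0))
    return 1.0 + max_boost * damped / (1.0 + damped)


def rerank(hits, circ_stats):
    """Re-sort (record_id, text_score) pairs using precomputed circ counts.

    circ_stats is a dict {record_id: count} built offline, so the only
    query-time cost is a lookup and one multiplication per hit.
    """
    scored = [
        (rid, score * popularity_boost(circ_stats.get(rid, 0)))
        for rid, score in hits
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the boost is bounded, a popular record can overtake a slightly
better text match but cannot leapfrog a much stronger one.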

>
> Do others think these two enhancements would improve the search results in
> Evergreen? Do you think there are other things we could do to improve
> relevancy? My main concern would be that any changes might slow down search
> speeds, and I would want to make sure that we could do something to retrieve
> better search results without a slowdown.
>

I would prefer better results with a speed /increase/! :)  But who wouldn't?

I can offer at least one lower-hanging fruit idea: switch from GIST
indexes to GIN indexes by default, as they're much faster these days.
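
For illustration, a small Python helper that emits the kind of DDL such
a switch would involve.  The schema, table, column, and index names
passed in are placeholders supplied by the caller, and the "_gin"
suffix is an assumption of this sketch; check the real names in your
own database before running anything it prints.

```python
def gin_migration_sql(schema, table, column, old_index):
    """Return SQL statements that replace a GiST index with a GIN one.

    All identifiers come from the caller; nothing here is looked up
    from a live database.
    """
    new_index = old_index + "_gin"  # naming convention assumed for this sketch
    return [
        # Build the new index without blocking writes to the table.
        "CREATE INDEX CONCURRENTLY {0} ON {1}.{2} USING GIN ({3});".format(
            new_index, schema, table, column
        ),
        # Drop the old GiST index only after the GIN index exists.
        "DROP INDEX {0}.{1};".format(schema, old_index),
    ]


if __name__ == "__main__":
    # Placeholder identifiers, not taken from a real installation.
    for stmt in gin_migration_sql("myschema", "mytable", "vec", "mytable_vec_idx"):
        print(stmt)
```

Creating the GIN index first and dropping the GiST index second keeps
searches indexed throughout the change.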

> Also, I was wondering if this type of project might be a good candidate for
> a Google Summer of Code project.
>

The fairly mechanical change from GIST to GIN indexing is definitely a
small-effort thing. I think the other ideas listed here (and still
others from the past, like direct MARC indexing, and use of tsearch
weighting classes) are probably worth trying -- particularly the
relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
turn out to be too big.  It's worth listing them as ideas for
candidates to propose, though.

-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com

