[OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

Thu May 10 12:58:48 EDT 2012

On Mon, May 7, 2012 at 3:12 PM, Kathy Lussier <klussier at masslnc.org> wrote:
> Hi Mike,
>
>
>> FWIW, there is a library testing some new combinations of CD modifiers
>> and having some success.  As soon as I know more I will share (if they
>> don't first).
>
> Did anything ever come of this? I would be interested in seeing any examples
> that resulted in improved relevancy.
>

The testing occurred, but I haven't heard the outcome yet.  I'll
dig for it ASAP.

>
>> The fairly mechanical change from GIST to GIN indexing is definitely a
>> small-effort thing. I think the other ideas listed here (and still
>> others from the past, like direct MARC indexing, and use of tsearch
>> weighting classes) are probably worth trying -- particularly the
>> relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
>> turn out to be too big.  It's worth listing them as ideas for
>> candidates to propose, though.
>
> I was happy to see that "Optimize Evergreen: Convert PL/Perl-based
> PostgreSQL stored procedures to PL/SQL or PL/C" was one of the accepted GSoC
> projects. However, since I got a little lost in the technical details of
> this discussion, I was curious if, when this GSoC project is complete, we
> can feel more comfortable about using search.relevance_ranking to tweak
> the relevancy without adversely affecting search performance.
>

Short version: yes

Longer version: that's exactly one of the goals, and there are some
other avenues of attack as well that should speed search and are
related to (but not strictly inside) the GSoC project.

--miker

> I know there were two related GSoC ideas listed, and I wasn't sure if both
> needed to be done together to ultimately improve search speeds.
>
> Thanks!
>
> Kathy
>
>
> --
>
> Kathy Lussier
> Project Coordinator
> Massachusetts Library Network Cooperative
> (508) 756-0172
> (508) 755-3721 (fax)
> klussier at masslnc.org
> Twitter: http://www.twitter.com/kmlussier
>
>
> On 3/6/2012 5:00 PM, Mike Rylander wrote:
>>
>> On Tue, Mar 6, 2012 at 4:42 PM, Kathy Lussier <klussier at masslnc.org> wrote:
>>>
>>> Hi all,
>>>
>>> I mentioned this during an e-mail discussion on the list last month, but I
>>> just wanted to hear from others in the Evergreen community about whether
>>> there is a desire to improve the relevance ranking for search results in
>>> Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can
>>> look at things like the document length, word proximity, and unique word
>>> count. We've found that we had to remove the modifiers for document length
>>> and unique word count to prevent a problem where brief bib records were
>>> ranked way too high in our search results.
>>
>>
>> FWIW, there is a library testing some new combinations of CD modifiers
>> and having some success.  As soon as I know more I will share (if they
>> don't first).
>>
>>>
>>> In our local discussions, we've thought the following enhancements could
>>> improve the ranking of search results:
>>>
>>> * Giving greater weight to a record if the search terms appear in the title
>>> or subject (ideally, we would like these fields to be configurable). This is
>>> something that is tweakable in search.relevance_ranking, but my
>>> understanding is that the use of these tweaks results in a major reduction
>>> in search performance.
>>>
>>
>> Indeed they do, however rewriting them in C to be super-fast would
>> improve this situation.  It's primarily a matter of available time and
>> effort.  It's also, however, pretty specialized work as you're dealing
>> with Postgres at a very intimate level.
>>
>>> * Using some type of popularity metric to boost relevancy for popular
>>> titles. I'm not sure what this metric should be (number of copies attached
>>> to record? Total circs in last x months? Total current circs?), but we
>>> believe some type of popularity measure would be particularly helpful in a
>>> public library where searches will often be for titles that are popular. For
>>> example, a search for "twilight" will most likely be for the Stephenie
>>> Meyer novel and not this
>>> http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike
>>> Rylander had indicated in a previous e-mail
>>> (http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to
>>> handle this through an overnight cron job without a negative impact on
>>> search speeds.
>>
>>
>> Right ... A regular stats-gathering job could certainly allow this,
>> and (if the "QueryParser explain" branch gets merged to master so we
>> have a standard search canonicalization function) logged query
>> analysis is another option as well.
>>
>>>
>>> Do others think these two enhancements would improve the search results in
>>> Evergreen? Do you think there are other things we could do to improve
>>> relevancy? My main concern would be that any changes might slow down search
>>> speeds, and I would want to make sure that we could do something to retrieve
>>> better search results without a slowdown.
>>>
>>
>> I would prefer better results with a speed /increase/! :)  But, who
>> wouldn't.
>>
>> I can offer at least one lower-hanging fruit idea: switch from GIST
>> indexes to GIN indexes by default, as they're much faster these days.
>>
>>> Also, I was wondering if this type of project might be a good candidate for
>>> a Google Summer of Code project.
>>>
>>
>> The fairly mechanical change from GIST to GIN indexing is definitely a
>> small-effort thing. I think the other ideas listed here (and still
>> others from the past, like direct MARC indexing, and use of tsearch
>> weighting classes) are probably worth trying -- particularly the
>> relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
>> turn out to be too big.  It's worth listing them as ideas for
>> candidates to propose, though.
>>
>
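
For anyone following the popularity idea in the quoted thread, here is a
rough sketch of how a precomputed popularity figure might be folded into
a text-relevance score. It is only an illustration, not how Evergreen
does (or would) implement it: the function names, the max_boost
parameter, and the choice of a log-damped multiplier are all assumptions
made up for this example. The circ counts are assumed to come from a
nightly stats-gathering job, so nothing expensive happens at search time.

```python
import math


def popularity_boost(circ_count, max_boost=0.5):
    """Map a circulation count to a multiplier in [1.0, 1.0 + max_boost).

    Hypothetical helper: log1p dampens runaway bestsellers, and the
    x / (1 + x) squash keeps the boost bounded so popularity can never
    completely swamp text relevance.
    """
    damped = math.log1p(max(circ_count, 0))
    return 1.0 + max_boost * damped / (1.0 + damped)


def rerank(hits, circ_counts):
    """Re-sort search hits by text relevance times popularity boost.

    hits        -- list of (record_id, text_relevance) pairs
    circ_counts -- dict of record_id -> circ count, assumed to be
                   precomputed by an overnight job (hypothetical input)
    """
    scored = [
        (record_id, relevance * popularity_boost(circ_counts.get(record_id, 0)))
        for record_id, relevance in hits
    ]
    # Highest adjusted score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    hits = [("obscure-twilight", 1.0), ("popular-twilight", 0.9)]
    circ_counts = {"popular-twilight": 500}
    for record_id, score in rerank(hits, circ_counts):
        print(record_id, round(score, 3))
```

With these made-up numbers the heavily circulated record moves ahead of
the record with the slightly better text match, which is the "twilight"
behaviour described above. Which metric to feed in (copies attached,
circs in the last x months, current circs) is left open, as it is in the
thread.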

-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com