[OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

Mon May 7 15:12:41 EDT 2012

Hi Mike,

 > FWIW, there is a library testing some new combinations of CD modifiers
 > and having some success.  As soon as I know more I will share (if they
 > don't first).

Did anything ever come of this? I would be interested in seeing any 
examples that resulted in improved relevancy.

 > The fairly mechanical change from GIST to GIN indexing is definitely a
 > small-effort thing. I think the other ideas listed here (and still
 > others from the past, like direct MARC indexing, and use of tsearch
 > weighting classes) are probably worth trying -- particularly the
 > relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
 > turn out to be too big.  It's worth listing them as ideas for
 > candidates to propose, though.

I was happy to see that "Optimize Evergreen: Convert PL/Perl-based 
PostgreSQL stored procedures to PL/SQL or PL/C" was one of the accepted 
GSoC projects. However, since I got a little lost in the technical 
details of this discussion, I was curious if, when this GSoC project is 
complete, we can feel more comfortable about using 
search.relevance_ranking to tweak the relevancy without adversely 
affecting search performance.

I know there were two related GSoC ideas listed, and I wasn't sure if 
both needed to be done together to ultimately improve search speeds.

Thanks!

Kathy

--

Kathy Lussier
Project Coordinator
Massachusetts Library Network Cooperative
(508) 756-0172
(508) 755-3721 (fax)
klussier at masslnc.org
Twitter: http://www.twitter.com/kmlussier

On 3/6/2012 5:00 PM, Mike Rylander wrote:
> On Tue, Mar 6, 2012 at 4:42 PM, Kathy Lussier <klussier at masslnc.org> wrote:
>> Hi all,
>>
>> I mentioned this during an e-mail discussion on the list last month, but I
>> just wanted to hear from others in the Evergreen community about whether
>> there is a desire to improve the relevance ranking for search results in
>> Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can
>> look at things like the document length, word proximity, and unique word
>> count. We've found that we had to remove the modifiers for document length
>> and unique word count to prevent a problem where brief bib records were
>> ranked way too high in our search results.
>
> FWIW, there is a library testing some new combinations of CD modifiers
> and having some success.  As soon as I know more I will share (if they
> don't first).
>
>>
>> In our local discussions, we've thought the following enhancements could
>> improve the ranking of search results:
>>
>> * Giving greater weight to a record if the search terms appear in the title
>> or subject (ideally, we would like these fields to be configurable). This is
>> something that is tweakable in search.relevance_ranking, but my
>> understanding is that the use of these tweaks results in a major reduction
>> in search performance.
>>
>
> Indeed they do, however rewriting them in C to be super-fast would
> improve this situation.  It's primarily a matter of available time and
> effort.  It's also, however, pretty specialized work as you're dealing
> with Postgres at a very intimate level.
>
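To make the title/subject weighting idea concrete, here is a minimal Python sketch of field-weighted scoring. It is only an illustration of the concept, not how search.relevance_ranking is implemented in Evergreen; the field names, the weight values, and the function name are all assumptions made up for this example.

```python
from typing import Dict, List

# Hypothetical per-field multipliers (assumed values, not Evergreen defaults).
FIELD_WEIGHTS: Dict[str, float] = {"title": 3.0, "subject": 2.0, "keyword": 1.0}


def field_weighted_score(record: Dict[str, str], terms: List[str],
                         weights: Dict[str, float] = FIELD_WEIGHTS) -> float:
    """Count term hits per field and scale each hit by that field's weight."""
    score = 0.0
    for field, text in record.items():
        words = text.lower().split()
        weight = weights.get(field, 1.0)  # unlisted fields get a neutral weight
        for term in terms:
            score += weight * words.count(term.lower())
    return score


record = {"title": "Twilight", "keyword": "twilight vampires twilight"}
print(field_weighted_score(record, ["twilight"]))
```

A hit in the title counts three times as much as a keyword hit here, so records with the term in the title rise. Keeping the weights in a plain mapping is what would make the fields configurable.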
>> * Using some type of popularity metric to boost relevancy for popular
>> titles. I'm not sure what this metric should be (number of copies attached
>> to record? Total circs in last x months? Total current circs?), but we
>> believe some type of popularity measure would be particularly helpful in a
>> public library where searches will often be for titles that are popular. For
>> example, a search for "twilight" will most likely be for the Stephenie
>> Meyer novel and not this
>> http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike
>> Rylander had indicated in a previous e-mail
>> (http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to
>> handle this through an overnight cron job without a negative impact on
>> search speeds.
>
> Right ... A regular stats-gathering job could certainly allow this,
> and (if the "QueryParser explain" branch gets merged to master so we
> have a standard search canonicalization function) logged query
> analysis is another option as well.
>
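As a rough sketch of the popularity idea, the Python below assumes a nightly job has already produced a table of circulation counts per record, and applies a logarithmic boost to the existing relevance score at search time. The function names, the `strength` parameter, and the choice of a log scale are assumptions for illustration, not anything Evergreen provides.

```python
import math
from typing import Dict, List, Tuple


def popularity_boost(circ_count: int, strength: float = 0.1) -> float:
    """Return a multiplier >= 1.0 that grows slowly with circulation count."""
    # log1p keeps very popular titles from swamping textual relevance.
    return 1.0 + strength * math.log1p(max(circ_count, 0))


def rerank(hits: List[Tuple[int, float]],
           circ_counts: Dict[int, int]) -> List[Tuple[int, float]]:
    """Scale each (record_id, score) hit by its boost and sort best-first."""
    boosted = [(rid, score * popularity_boost(circ_counts.get(rid, 0)))
               for rid, score in hits]
    return sorted(boosted, key=lambda hit: hit[1], reverse=True)


# Hypothetical data: record 2 scores slightly lower on text but circulates a lot.
hits = [(1, 1.0), (2, 0.9)]
circ_counts = {2: 500}
print([rid for rid, _ in rerank(hits, circ_counts)])
```

Because the counts are precomputed, the only search-time cost in this sketch is one lookup and one multiplication per hit, which is the reason a stats-gathering job should not slow searches down.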
>>
>> Do others think these two enhancements would improve the search results in
>> Evergreen? Do you think there are other things we could do to improve
>> relevancy? My main concern would be that any changes might slow down search
>> speeds, and I would want to make sure that we could do something to retrieve
>> better search results without a slowdown.
>>
>
> I would prefer better results with a speed /increase/! :)  But who wouldn't?
>
> I can offer at least one lower-hanging fruit idea: switch from GIST
> indexes to GIN indexes by default, as they're much faster these days.
>
>> Also, I was wondering if this type of project might be a good candidate for
>> a Google Summer of Code project.
>>
>
> The fairly mechanical change from GIST to GIN indexing is definitely a
> small-effort thing. I think the other ideas listed here (and still
> others from the past, like direct MARC indexing, and use of tsearch
> weighting classes) are probably worth trying -- particularly the
> relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
> turn out to be too big.  It's worth listing them as ideas for
> candidates to propose, though.
>