[OPEN-ILS-DEV] Automatic stemming in Evergreen

Tue Aug 14 09:59:40 EDT 2012

On Tue, Aug 14, 2012 at 9:16 AM, Dan Scott <dan at coffeecode.net> wrote:
> On Tue, Aug 14, 2012 at 06:22:15AM -0400, Kathy Lussier wrote:
>> Hi all,
>>
>> We've had difficulty finding records in our catalog due to the
>> automatic stemming that occurs when records are indexed in
>> Evergreen. As an example, a title on one of our summer reading
>> lists was "The Assist" by Neil Swidey. However, when users were
>> searching for "the assist" as a title search with the phrase
>> enclosed in quotations, they still had to page through several pages
>> of results before finding the title they needed. Many of the records
>> that ranked higher contained words like "assistance", "assistive",
>> "assisted", etc. because they were automatically stemmed at
>> indexing, and the stemmed version of the word (assist) was what was
>> stored in the index vector column. We've had many other examples
>> where this stemming has made it difficult to conduct searches.
>
> This particular example is quite a concern! I haven't noticed anything
> similar yet, since we moved to Evergreen 2.3ish last week, and nobody
> has brought a similar problem to my attention, but it might just be
> early days for us.

The parts of search relating to this haven't changed for 2.3, AFAIK.
Dan, were you thinking of something in particular that changed?

Relating particularly to this example, though, is the (now generally
unused) relevance adjustment code that targets exactly this problem by
boosting the score of (without restricting to) exact word matches.
One of the hopes with the "convert some stored procedures from perl or
plpgsql to C" GSoC project was to make that faster, but the speedup
didn't materialize as expected (which, I want to make clear, is not the
fault of the student). It's something I (and, it seems, Dan, too, at
least tangentially) will be looking at for Evergreen.NEXT, along with
other things like trigram indexing and some other fun stuff.
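
To illustrate the idea of boosting (without restricting to) exact word
matches, here is a rough sketch in plain PostgreSQL. It is not
Evergreen's actual relevance adjustment code: the demo_titles table, the
flat bonus of 1, and the use of the stock 'english' and 'simple'
configurations are all assumptions made for the example.

```sql
-- Hypothetical demo table; not part of the Evergreen schema.
CREATE TEMP TABLE demo_titles (id serial PRIMARY KEY, title text);

INSERT INTO demo_titles (title) VALUES
    ('The Assist'),
    ('Assistance for the Elderly'),
    ('Assistive Technology');

-- Match on the stemmed ('english') form so that related word forms are
-- still retrieved, then add a bonus when the unstemmed ('simple') form
-- of the title also matches the query as typed.
SELECT title,
       ts_rank(to_tsvector('english', title),
               plainto_tsquery('english', 'the assist'))
       + (CASE WHEN to_tsvector('simple', title)
                    @@ plainto_tsquery('simple', 'the assist')
               THEN 1 ELSE 0 END)::real AS score
  FROM demo_titles
 WHERE to_tsvector('english', title)
       @@ plainto_tsquery('english', 'the assist')
 ORDER BY score DESC;
```

With a bonus like this, "The Assist" should sort ahead of the stemmed
matches while those matches stay in the result set.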

>
>> In digging through IRC logs and other list messages regarding
>> stemming, people have mentioned that this stemming can be turned off
>> so that the full words are indexed rather than the stemmed versions
>> of a word. Can anybody tell me how this is done? I understand that
>> the records would need to be reingested, but is there a flag that
>> needs to be disabled to turn off the stemming or does it require
>> something else?
>
> The simplest way to do this in a new Evergreen instance is to change the
> configuration of the text search dictionary in
> Open-ILS/src/sql/Pg/000.english.pg91.fts-config.sql - for example,
> instead of using the snowball stemming algorithm as a basis for the full
> text search, just use the "simple" dictionary which returns the
> lowercase version of the incoming text:
>
> CREATE TEXT SEARCH DICTIONARY english_nostop (TEMPLATE=pg_catalog.simple);
>
> Note, however, that this is likely to cause other problems for
> searchers; in the default "concerto" sample set of records, for example,
> people will have to search for "concertos" to get matches for
> "concertos"; "concerto" won't result in a match (and vice versa).
>
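
A quick way to see the trade-off described above is to ask PostgreSQL
what each dictionary does with a given word. This sketch assumes the
stock english_stem (Snowball) and simple dictionaries that ship with
PostgreSQL, not the Evergreen-specific english_nostop dictionary.

```sql
-- Compare the lexemes produced by the Snowball stemmer and by the
-- "simple" dictionary for a handful of words from this thread.
SELECT word,
       ts_lexize('english_stem', word) AS snowball_lexeme,
       ts_lexize('simple', word)       AS simple_lexeme
  FROM unnest(ARRAY['assist', 'assistance', 'assistive', 'assisted',
                    'concerto', 'concertos']) AS word;
```

Where two words share a snowball_lexeme they will match each other in a
stemmed index; with the simple dictionary each distinct spelling is kept
as its own lexeme.
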
>> Also, is there a way to use another dictionary for
>> the stemmer so that the stemming is somewhat less aggressive than is
>> used by the snowball stemmer? Overall, we like the concept of
>> stemming, particularly when it retrieves results for both singular
>> and plural versions of a word, but we've had many examples where
>> stemming seems to be throwing users off course.
>
> ispell support was added in the last few versions of PostgreSQL, which
> might be worth exploring. I plan to dig into the current state of PostgreSQL
> full-text search over the next few weeks, so the timing of your question
> is quite good!
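
For anyone who wants to experiment with ispell, the setup looks roughly
like the following. This is a sketch under assumptions: it presumes
english.dict and english.affix files have already been installed in
PostgreSQL's $SHAREDIR/tsearch_data directory (they are not shipped by
default), and the english_ispell and english_ispell_demo names are made
up for the example rather than taken from Evergreen.

```sql
-- Dictionary based on the ispell template; DictFile and AffFile name
-- files in $SHAREDIR/tsearch_data (english.dict and english.affix).
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE  = ispell,
    DictFile  = english,
    AffFile   = english,
    StopWords = english
);

-- Hypothetical configuration copied from the built-in English one.
CREATE TEXT SEARCH CONFIGURATION public.english_ispell_demo (
    COPY = pg_catalog.english
);

-- Try ispell first; words it does not recognize fall through to the
-- Snowball stemmer.
ALTER TEXT SEARCH CONFIGURATION public.english_ispell_demo
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_ispell, english_stem;
```

Replacing english_stem with simple at the end of the mapping would leave
words unknown to ispell unstemmed, which may be closer to the "less
aggressive" behaviour asked about.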

-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com