[OPEN-ILS-DEV] Automatic stemming in Evergreen

Tue Aug 14 09:16:09 EDT 2012

On Tue, Aug 14, 2012 at 06:22:15AM -0400, Kathy Lussier wrote:
> Hi all,
> 
> We've had difficulty finding records in our catalog due to the
> automatic stemming that occurs when records are indexed in
> Evergreen. As an example, a title on one of our summer reading
> lists was "The Assist" by Neil Swidey. However, when users were
> searching for "the assist" as a title search with the phrase
> enclosed in quotations, they still had to page through several pages
> of results before finding the title they needed. Many of the records
> that ranked higher contained words like "assistance", "assistive",
> "assisted", etc. because they were automatically stemmed at
> indexing, and the stemmed version of the word (assist) was what was
> stored in the index vector column. We've had many other examples
> where this stemming has made it difficult to conduct searches.

This particular example is quite a concern! I haven't noticed anything
similar yet, since we moved to Evergreen 2.3ish last week, and nobody
has brought a similar problem to my attention, but it might just be
early days for us.

> In digging through IRC logs and other list messages regarding
> stemming, people have mentioned that this stemming can be turned off
> so that the full words are indexed rather than the stemmed versions
> of a word. Can anybody tell me how this is done? I understand that
> the records would need to be reingested, but is there a flag that
> needs to be disabled to turn off the stemming or does it require
> something else? 

The simplest way to do this in a new Evergreen instance is to change the
text search dictionary configuration in
Open-ILS/src/sql/Pg/000.english.pg91.fts-config.sql. For example,
instead of basing the full-text search on the snowball stemming
algorithm, use the "simple" dictionary template, which just returns the
lowercase version of the incoming text:

CREATE TEXT SEARCH DICTIONARY english_nostop (TEMPLATE=pg_catalog.simple);

Note, however, that this is likely to cause other problems for
searchers; in the default "concerto" sample set of records, for example,
people will have to search for "concertos" to get matches for
"concertos"; "concerto" won't result in a match (and vice versa).

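To see the difference before reingesting anything, you can ask
PostgreSQL directly what a dictionary does with a given word. This is
only a sketch: it uses the stock english_stem and simple dictionaries
that ship in pg_catalog rather than Evergreen's english_nostop, so treat
it as an illustration of the behaviour, not a dump of what your index
holds:

```sql
-- Snowball stemmer: the variants from the "The Assist" example should
-- all collapse to the same lexeme.
SELECT word, ts_lexize('pg_catalog.english_stem', word) AS stemmed
  FROM unnest(ARRAY['assist', 'assistance', 'assistive', 'assisted']) AS word;

-- "simple" dictionary: only lowercases, so singular and plural forms
-- stay distinct and will not match each other.
SELECT word, ts_lexize('pg_catalog.simple', word) AS unstemmed
  FROM unnest(ARRAY['Concerto', 'Concertos']) AS word;
```
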
> Also, is there a way to use another dictionary for
> the stemmer so that the stemming is somewhat less aggressive than is
> used by the snowball stemmer? Overall, we like the concept of
> stemming, particularly when it retrieves results for both singular
> and plural versions of a word, but we've had many examples where
> stemming seems to be throwing users off course.

PostgreSQL's full-text search also supports ispell dictionaries (the
ispell template has been in core since 8.3), which might be worth
exploring. I plan to dig into the current state of PostgreSQL
full-text search over the next few weeks, so the timing of your question
is quite good!
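
For anyone who wants to experiment in the meantime, an ispell dictionary
is defined roughly as follows. This is an untested sketch, not an
Evergreen recipe: it assumes that english.dict and english.affix files
have been installed in PostgreSQL's tsearch_data directory (they are not
shipped by default), and the names english_ispell and ispell_demo are
made up for illustration:

```sql
-- Hypothetical dictionary name; the ispell template itself is part of
-- PostgreSQL.
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,   -- assumes $SHAREDIR/tsearch_data/english.dict
    AffFile = english     -- assumes $SHAREDIR/tsearch_data/english.affix
);

-- Hypothetical throwaway configuration, copied from the stock English
-- one so the real Evergreen configuration is left untouched.
CREATE TEXT SEARCH CONFIGURATION ispell_demo (COPY = pg_catalog.english);

-- Try the ispell dictionary first; words it does not recognize fall
-- through to "simple" and are indexed unstemmed.
ALTER TEXT SEARCH CONFIGURATION ispell_demo
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_ispell, simple;

-- Inspect what would be indexed for a sample title.
SELECT to_tsvector('ispell_demo', 'The Assist');
```

Because dictionaries are consulted in the order listed, replacing
"simple" with english_stem at the end of the mapping would fall back to
snowball stemming for words the ispell dictionary does not know.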