[OPEN-ILS-DEV] Automatic stemming in Evergreen
Mike Rylander
mrylander at gmail.com
Tue Aug 14 14:42:06 EDT 2012
On Aug 14, 2012 11:58 AM, "Benjamin Kalish" <bkalish at forbeslibrary.org>
wrote:
>
> As a librarian who frequently uses Evergreen at the reference desk, I
> would prefer to have stemming used to retrieve results, but have exact
> matches ranked highest, then close matches, and finally distant matches.
> The search results for "assist" would then include the words "assist",
> "assists", and "assistance" and all other derivative words, but matches for
> "assist" would appear higher in the list.
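
A rough sketch of the ranking idea described above, in Python, purely for illustration. It is not how Evergreen or PostgreSQL implements ranking; `toy_stem`, its suffix list, and the sample titles are all hypothetical stand-ins. Records are retrieved on the stemmed form, and records containing the exact query word are sorted ahead of the rest.

```python
# Toy illustration only: retrieve on stems, rank exact-word matches first.
# `toy_stem` is a hypothetical stand-in for a real stemmer such as Snowball.

SUFFIXES = ("ance", "ive", "ed", "s")

def toy_stem(word: str) -> str:
    """Lowercase the word and strip at most one known suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def rank(query: str, titles: list[str]) -> list[str]:
    """Return titles whose stems match the query, exact matches first."""
    q_exact = query.lower()
    q_stem = toy_stem(query)
    scored = []
    for title in titles:
        words = title.lower().split()
        if q_stem not in {toy_stem(w) for w in words}:
            continue  # no stemmed match, so the title is not retrieved
        tier = 0 if q_exact in words else 1  # tier 0 sorts before tier 1
        scored.append((tier, title))
    # sorted() is stable, so input order is preserved inside each tier
    return [title for _, title in sorted(scored, key=lambda pair: pair[0])]

titles = ["Assistive Technology", "The Assist", "Financial Assistance", "Gardening"]
results = rank("assist", titles)
```

With these made-up titles, "The Assist" is listed first, the two stem-only matches follow, and "Gardening" is not retrieved at all.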
>
> I imagine it would take some ingenuity to make this work, but it would be
> a far better user experience than simply turning stemming off or on.
>
That's actually what I referred to earlier. It has a performance impact,
particularly for large result sets, but it exists today. The trick will be
making it faster.
--miker
> Benjamin Kalish
> Forbes Library / 413-587-1012 / bkalish at forbeslibrary.org
>
> Currently reading: The Big Sleep by Raymond Chandler
> Just Finished: Parable of the Talents by Octavia E. Butler.
>
>
>
> On Tue, Aug 14, 2012 at 9:16 AM, <
open-ils-dev-request at list.georgialibraries.org> wrote:
>>
>> 4. Automatic stemming in Evergreen (Kathy Lussier)
>> 5. Re: Automatic stemming in Evergreen (Dan Scott)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Mon, 13 Aug 2012 12:09:06 -0400
>> From: Mike Rylander <mrylander at gmail.com>
>> Subject: Re: [OPEN-ILS-DEV] EG 2.3.beta2 this Friday
>> To: Evergreen Development Discussion List
>> <open-ils-dev at list.georgialibraries.org>
>> Message-ID:
>> <CAO8ar=mbG3_BE5udC8Q+kq4m6F9TBrJM=
gunVQrHdCMk06ZLHg at mail.gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> On Mon, Aug 13, 2012 at 11:36 AM, Bill Erickson <erickson at esilibrary.com> wrote:
>> > On Mon, Aug 13, 2012 at 11:02:17AM -0400, Mike Rylander wrote:
>> >> On Mon, Aug 13, 2012 at 9:35 AM, Bill Erickson <erickson at esilibrary.com> wrote:
>> >> >
>> >> > Hi All,
>> >> >
>> >> > I'm planning on cutting 2.3.beta2 this Friday 8/17. Keep the bug
>> >> > fixes rolling!
>> >> >
>> >> > After that comes the 2.3.rc1 release candidate on August 31st. I'm
>> >> > not anticipating an RC2, but we can certainly squeeze one in if
>> >> > necessary.
>> >> >
>> >> > <policy>
>> >> > How do we characterize bug fixes that are allowed to be merged
>> >> > between RC1 and GA? I assume we should limit to showstoppers only?
>> >> > </policy>
>> >>
>> >> IMO (and likely debatable), we shouldn't cut RC until showstoppers are
>> >> in, and post-RC bug fixes should be limited to low-impact or
>> >> high-priority fixes. Thoughts?
>> >
>> > I agree any showstoppers we are aware of should be addressed pronto,
>> > before any RC is cut (and ideally before beta2).
>> >
>> > I'm treating RC as "this is what you are getting, unless we find
>> > something truly messed up in there" ;). It occurs to me the interval
>> > between RC1 and GA is fairly large, though, and there will be a
>> > (well-intentioned) desire to fill that space with bug fixes. Given that
>> > every bug fix is a potential new bug, though, I'm inclined to lean
>> > toward only allowing very high priority fixes.
>> >
>> > If it would help solidify that RC /is effectively/ GA, we could move
>> > the RC date up closer to the GA date to allow for a little longer beta
>> > bug fixing period.
>> >
>>
>> I like the sound of that, personally ... as long, I suppose, as
>> there's sufficient (for some definition thereof) time left for
>> general testing of the RC. I don't know how that might be defined, though.
>> Is 1 week enough? 2? 2 weeks is nearly the same as the current plan,
>> though, so if the answer is 2 then there may not be much point in
>> changing it.
>>
>>
>> --
>> Mike Rylander
>> | Director of Research and Development
>> | Equinox Software, Inc. / Your Library's Guide to Open Source
>> | phone: 1-877-OPEN-ILS (673-6457)
>> | email: miker at esilibrary.com
>> | web: http://www.esilibrary.com
>>
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Mon, 13 Aug 2012 16:28:06 +0000
>> From: Kivilahti Olli-Antti <olli-antti.kivilahti at jns.fi>
>> Subject: Re: [OPEN-ILS-DEV] Feature proposal: SSN-censoring
>> functionality
>> To: "open-ils-dev at list.georgialibraries.org"
>> <open-ils-dev at list.georgialibraries.org>
>> Message-ID:
>> <
8E329C36330E814B9DB19D099155DA3C3CEA143E at KS0105.ad.pohjoiskarjala.net>
>>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> The more I look at the list of issues you present, the more I am
>> convinced that storing SSNs as they currently are is a *bad* thing.
>> Perhaps storing them in general is a bad thing?
>>
>> I am not sure we can avoid storing SSNs. Here they are collected wherever
>> you do business: given over the phone to marketers, and asked for pretty
>> much wherever your identity is in question.
>> Currently every staff member here has access to every patron's SSN.
>>
>> Patron Editing/Registration - Do you show the full value by default,
>> or provide a change/update interface that can load the full value
>> (provided you have permissions to see the un-sanitized current value)?
>>
>> We can access the full SSN along with the rest of the patron data in the
>> edit view of our current ILS. We were planning for Evergreen to display
>> the full SSN in the patron edit view as well (as is). The sanitized SSN
>> would also be displayed, with a description of the filter used to
>> sanitize it.
>>
>> A better solution may be to define a new table for storing one or more
>> pieces of data for identifying the patron. Permissions can then be
>> used to allow enforcement of who is allowed to see the information,
>> complete with whether or not they can see the un-sanitized data. (This
>> could, in theory, allow Massachusetts libraries to store this data and
>> still comply with the law.)
>>
>> That definitely is an idea worth considering. To me it looks like an
>> ugly solution where we have multiple columns representing the same
>> identification data. But considering that it might be a significantly
>> easier approach to compatibility and security, it of course sounds great.
>>
>> "Sanitizing" the data as it comes out requires flagging it as to be
>> ignored going back in, unless non-sanitized data comes back. This will
>> really screw with the code paths for patron editing and retrieval.
>> Note that currently the *only* place I know of with protection for
>> US-style SSNs is the patron sidebar view. Hitting "Edit" shows you the
>> full thing anyway.
>>
>> If we take non-sanitized data for the patron editing view, we will push
>> it back as well.
>>
>> Reporter - It tends to bypass permissions.
>>
>> Maybe we could make another Evergreen database user, evergreen-secure,
>> to have access to the separate SSN-table?
>>
>>
>>
>> On 08/13/2012 06:36 PM, open-ils-dev-request at list.georgialibraries.org wrote:
>>
>> Re: Feature proposal: SSN-censoring functionality
>>
>>
>>
>> --
>> Olli-Antti Kivilahti
>> Open Library 2013
>> Library of Joensuu
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Mon, 13 Aug 2012 13:39:45 -0500
>> From: Justin Hopkins <justin at mobiusconsortium.org>
>> Subject: Re: [OPEN-ILS-DEV] Feature proposal: SSN-censoring
>> functionality
>> To: Evergreen Development Discussion List
>> <open-ils-dev at list.georgialibraries.org>
>> Message-ID:
>> <CFB3F535-B739-41D2-B8F5-34416230E848 at mobiusconsortium.org>
>> Content-Type: text/plain; charset=iso-8859-1
>>
>> I also believe that storing SSNs is a Very Bad Thing. This may be
>> because I'm coming from an overwhelmingly enterprise/academic environment
>> where FERPA is taken very seriously, but in general I treat SSNs as I
>> would credit card numbers or personal medical information. I don't see
>> a reason to use them in an ILS. Doing so immediately makes your
>> system a worthwhile target for any potential attacker and puts your
>> organization in the position of having to make a painful disclosure to
>> your public should there be a breach.
>>
>> Regards,
>> Justin Hopkins
>> IT & Web Services Coordinator
>> 573-808-2309
>> justin at mobiusconsortium.org
>>
>>
>>
>>
>> On Aug 13, 2012, at 10:30 AM, Thomas Berezansky wrote:
>>
>> > The more I look at the list of issues you present, the more I am
>> > convinced that storing SSNs as they currently are is a *bad* thing.
>> > Perhaps storing them in general is a bad thing?
>> >
>> > Disclaimer: MVLC has local laws preventing us from even considering
>> > storing SSNs or Drivers License numbers, so we don't store that
>> > information at this time.
>> >
>> > "Sanitizing" the data as it comes out requires flagging it as to be
>> > ignored going back in, unless non-sanitized data comes back. This will
>> > really screw with the code paths for patron editing and retrieval. Note
>> > that currently the *only* place I know of with protection for US-style
>> > SSNs is the patron sidebar view. Hitting "Edit" shows you the full thing
>> > anyway.
>> >
>> > A better solution may be to define a new table for storing one or more
>> > pieces of data for identifying the patron. Permissions can then be used
>> > to allow enforcement of who is allowed to see the information, complete
>> > with whether or not they can see the un-sanitized data. (This could, in
>> > theory, allow Massachusetts libraries to store this data and still
>> > comply with the law.)
>> >
>> > Each piece of information to be stored in that manner could then have
>> > one or more regular expressions for sanitizing. I would implement that
>> > as one or more search/replace sets, with capture group(s) being used in
>> > the replacement.
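
One possible reading of the search/replace sets described above, sketched in Python. Every name here (`US_SSN_RULES`, `sanitize`) is hypothetical, the patterns are simplified guesses at US-style formats, the sample number is made up, and none of it is existing Evergreen code. Each rule pairs a pattern with a replacement, and the capture group carries the part of the value that stays visible.

```python
import re

# Hypothetical search/replace sets for one kind of identifier. Each pair is
# (pattern, replacement); the capture group holds the digits left visible.
US_SSN_RULES = [
    (re.compile(r"^\d{3}-\d{2}-(\d{4})$"), r"XXX-XX-\1"),  # dashed form
    (re.compile(r"^\d{5}(\d{4})$"), r"XXXXX\1"),           # undashed form
]

def sanitize(value: str, rules) -> str:
    """Apply the first rule whose pattern matches; otherwise hide everything."""
    for pattern, replacement in rules:
        if pattern.match(value):
            return pattern.sub(replacement, value)
    return "X" * len(value)  # no rule matched: fail closed, show nothing

masked = sanitize("123-45-6789", US_SSN_RULES)
```

Masking the whole value when no rule matches is a design choice in this sketch, so an unexpected format never leaks by default.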
>> >
>> > If these data pieces are never exposed via things like pcrud then they
>> > will be very difficult to get out of the database.
>> >
>> > Two remaining issues I see off the top of my head are:
>> >
>> > Reporter - It tends to bypass permissions.
>> >
>> > Patron Editing/Registration - Do you show the full value by default,
>> > or provide a change/update interface that can load the full value
>> > (provided you have permissions to see the un-sanitized current value)?
>> >
>> > Thomas Berezansky
>> > Merrimack Valley Library Consortium
>> >
>> >
>> > Quoting Kivilahti Olli-Antti <olli-antti.kivilahti at jns.fi>:
>> >
>> >> Goood moorning Evergreeners!
>> >>
>> >> As a fruit of our migration planning work, I present to you... the
>> >> schematics to an internationalizable SSN-censorer!
>> >>
>> >> Currently Evergreen only hides the sensitive characters of a
>> >> US-standard SSN. It seems that almost every nation has its own SSN
>> >> format. While looking for ways to address this limitation in
>> >> Evergreen, we have taken great care to make this modification as
>> >> internationalizable as is reasonable.
>> >> A challenge in designing a solution is that even inside one
>> >> installation, we can have SSNs in multiple formats.
>> >> In our case such cases seem to be rare. Maybe <2% of our clientele
>> >> has an SSN in a format other than Finnish (mostly Swedish or Russian,
>> >> plus fringe cases). I believe the situation is the same in the States
>> >> as well. For this reason I am planning to have filtering done
>> >> according to SSN-format relevancy.
>> >> We have a list of SSN filters, and the top one is the most used one.
>> >> If it doesn't match the given SSN, then we try the next most relevant
>> >> one, until a match is found or an error is raised asking for a new
>> >> filter to be added.
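
To make the ordered filter list concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the plan in the linked task: `SsnFilter`, `FILTERS`, `censor`, and `NoMatchingFilter` are hypothetical names, both patterns are deliberately simplified, and the sample values are made up. Filters are tried most-used first, and an error is raised when none matches so that a new filter can be added.

```python
import re
from typing import NamedTuple

class SsnFilter(NamedTuple):
    name: str
    pattern: re.Pattern
    replacement: str

# Hypothetical, simplified formats, ordered with the most-used filter first.
FILTERS = [
    SsnFilter("finnish-like", re.compile(r"^(\d{6})[-+A]\d{3}[0-9A-Z]$"), r"\1-XXXX"),
    SsnFilter("us-like", re.compile(r"^\d{3}-\d{2}-(\d{4})$"), r"XXX-XX-\1"),
]

class NoMatchingFilter(ValueError):
    """Raised so that someone knows a new filter has to be added."""

def censor(ssn: str, filters=FILTERS) -> tuple[str, str]:
    """Return (censored value, name of the filter that matched)."""
    for ssn_filter in filters:
        if ssn_filter.pattern.match(ssn):
            censored = ssn_filter.pattern.sub(ssn_filter.replacement, ssn)
            return censored, ssn_filter.name
    raise NoMatchingFilter("no SSN filter matches this value; add a new filter")

censored, used = censor("010190-123A")
```

Returning the filter name alongside the censored value would let an interface describe which filter was applied.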
>> >> Here are the more detailed steps to complete the task, in other
>> >> words, the implementation plan.
>> >> http://193.65.112.189:8080/browse/EGDEV-28
>> >> The task is waiting for planning review and feedback from the
>> >> community.
>> >>
>> >> This functionality will be implemented by The Regional Library of
>> >> Joensuu as a high priority modification during our migration period to
>> >> Evergreen.
>> >>
>> >> --
>> >> Olli-Antti Kivilahti
>> >> Open Library 2013
>> >> Library of Joensuu
>> >>
>> >
>> >
>>
>>
>>
>> ------------------------------
>>
>> Message: 4
>> Date: Tue, 14 Aug 2012 06:22:15 -0400
>> From: Kathy Lussier <klussier at masslnc.org>
>> Subject: [OPEN-ILS-DEV] Automatic stemming in Evergreen
>> To: Evergreen Development Discussion List
>> <open-ils-dev at list.georgialibraries.org>
>> Message-ID: <502A26D7.3090202 at masslnc.org>
>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>
>>
>> Hi all,
>>
>> We've had difficulty finding records in our catalog due to the automatic
>> stemming that occurs when records are indexed in Evergreen. As an
>> example, a title on one of our summer readings lists was "The Assist" by
>> Neil Swidey. However, when users were searching for "the assist" as a
>> title search with the phrase enclosed in quotations, they still had to
>> page through several pages of results before finding the title they
>> needed. Many of the records that ranked higher contained words like
>> "assistance", "assistive", "assisted", etc. because they were
>> automatically stemmed at indexing, and the stemmed version of the word
>> (assist) was what was stored in the index vector column. We've had many
>> other examples where this stemming has made it difficult to conduct
>> searches.
>>
>> In digging through IRC logs and other list messages regarding stemming,
>> people have mentioned that this stemming can be turned off so that the
>> full words are indexed rather than the stemmed versions of a word. Can
>> anybody tell me how this is done? I understand that the records would
>> need to be reingested, but is there a flag that needs to be disabled to
>> turn off the stemming or does it require something else? Also, is there
>> a way to use another dictionary for the stemmer so that the stemming is
>> somewhat less aggressive than is used by the snowball stemmer? Overall,
>> we like the concept of stemming, particularly when it retrieves results
>> for both singular and plural versions of a word, but we've had many
>> examples where stemming seems to be throwing users off course.
>>
>> Has anybody else had similar issues?
>>
>> Thanks!
>> Kathy
>>
>> --
>> Kathy Lussier
>> Project Coordinator
>> Massachusetts Library Network Cooperative
>> (508) 343-0128
>> klussier at masslnc.org
>> Twitter: http://www.twitter.com/kmlussier
>>
>>
>>
>> ------------------------------
>>
>> Message: 5
>> Date: Tue, 14 Aug 2012 09:16:09 -0400
>> From: Dan Scott <dan at coffeecode.net>
>> Subject: Re: [OPEN-ILS-DEV] Automatic stemming in Evergreen
>> To: Evergreen Development Discussion List
>> <open-ils-dev at list.georgialibraries.org>
>> Message-ID: <20120814131607.GA21888 at denials.laurentian.ca>
>> Content-Type: text/plain; charset=us-ascii
>>
>>
>> On Tue, Aug 14, 2012 at 06:22:15AM -0400, Kathy Lussier wrote:
>> > Hi all,
>> >
>> > We've had difficulty finding records in our catalog due to the
>> > automatic stemming that occurs when records are indexed in
>> > Evergreen. As an example, a title on one of our summer readings
>> > lists was "The Assist" by Neil Swidey. However, when users were
>> > searching for "the assist" as a title search with the phrase
>> > enclosed in quotations, they still had to page through several pages
>> > of results before finding the title they needed. Many of the records
>> > that ranked higher contained words like "assistance", "assistive",
>> > "assisted", etc. because they were automatically stemmed at
>> > indexing, and the stemmed version of the word (assist) was what was
>> > stored in the index vector column. We've had many other examples
>> > where this stemming has made it difficult to conduct searches.
>>
>> This particular example is quite a concern! I haven't noticed anything
>> similar yet, since we moved to Evergreen 2.3ish last week, and nobody
>> has brought a similar problem to my attention, but it might just be
>> early days for us.
>>
>> > In digging through IRC logs and other list messages regarding
>> > stemming, people have mentioned that this stemming can be turned off
>> > so that the full words are indexed rather than the stemmed versions
>> > of a word. Can anybody tell me how this is done? I understand that
>> > the records would need to be reingested, but is there a flag that
>> > needs to be disabled to turn off the stemming or does it require
>> > something else?
>>
>> The simplest way to do this in a new Evergreen instance is to change the
>> configuration of the text search dictionary in
>> Open-ILS/src/sql/Pg/000.english.pg91.fts-config.sql - for example,
>> instead of using the snowball stemming algorithm as a basis for the full
>> text search, just use the "simple" dictionary which returns the
>> lowercase version of the incoming text:
>>
>> CREATE TEXT SEARCH DICTIONARY english_nostop (TEMPLATE=pg_catalog.simple);
>>
>> Note, however, that this is likely to cause other problems for
>> searchers; in the default "concerto" sample set of records, for example,
>> people will have to search for "concertos" to get matches for
>> "concertos"; "concerto" won't result in a match (and vice versa).
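
The trade-off described above can be shown with a toy model in Python. This does not reproduce PostgreSQL's dictionaries: `simple_lexeme` only mimics lowercase-only normalization, and `plural_stripping_lexeme` is a hypothetical, much cruder stand-in for a stemmer. The sample title is made up.

```python
# Toy model of two indexing strategies; not PostgreSQL's implementation.

def simple_lexeme(word: str) -> str:
    """Mimic a lowercase-only dictionary: no stemming at all."""
    return word.lower()

def plural_stripping_lexeme(word: str) -> str:
    """Hypothetical stand-in for a stemmer: lowercase, drop a final 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def matches(query: str, title: str, normalize) -> bool:
    """True if the normalized query equals any normalized word of the title."""
    indexed = {normalize(w) for w in title.split()}
    return normalize(query) in indexed

title = "Piano Concertos"
simple_hit = matches("concerto", title, simple_lexeme)
stemmed_hit = matches("concerto", title, plural_stripping_lexeme)
```

Under the lowercase-only model the singular query misses the plural title, while the plural-stripping model finds it.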
>>
>> > Also, is there a way to use another dictionary for
>> > the stemmer so that the stemming is somewhat less aggressive than is
>> > used by the snowball stemmer? Overall, we like the concept of
>> > stemming, particularly when it retrieves results for both singular
>> > and plural versions of a word, but we've had many examples where
>> > stemming seems to be throwing users off course.
>>
>> ispell support was added in the last few versions of PostgreSQL, which
>> might be worth exploring. I plan to dig into the current state of
>> PostgreSQL full-text search over the next few weeks, so the timing of
>> your question is quite good!
>>
>>
>> End of Open-ils-dev Digest, Vol 77, Issue 13
>> ********************************************
>
>