[OPEN-ILS-DEV] Normalization Concerns, Present and Future

Tue Mar 8 11:24:09 EST 2011

Hello all,

We have recently accelerated our timetable for moving to 2.0, and this sudden immersion is what is causing our recent increase in bug reporting.  Yesterday morning I was exploring a problem with ISSN normalization, and I have spent the last day trying to understand not only the immediate issue, but also what I think is a more serious issue going forward.

First, the immediate problem.  There is currently code in the OpenILS::Application::Storage::Driver::Pg::QueryParser::query_plan::node::atom package which attempts to apply normalizations to the incoming query term.  Because the normalizations are mapped to specific fields but can be applied generically (more on that later), the code does not allow the "same" normalization to be applied more than once.  Currently, sameness is determined by the normalizer "name" alone, but now that a normalizer can have "params", comparing the name is no longer enough to determine sameness.  In the case of ISSNs, the default config maps two "replace" normalizations to the ISSN entries, one to remove ' ' (space), the next to remove '-'.  As it stands, only one of the two gets applied to the incoming query, as the other is discarded for being the "same".  Both are properly applied when the record is ingested, so the query term may not match the indexed value and the query can fail, even when specifying 'identifier|issn:'.
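To make the name-vs-params point concrete, here is a rough Python sketch (not the actual Perl in QueryParser).  The (name, params) tuples, the helper names, and the choice of which duplicate gets dropped are all my own assumptions for illustration:

```python
# Hypothetical stand-ins for the two "replace" normalizers mapped to ISSN
# in the default config, modeled as (name, params) pairs.
ISSN_NORMALIZERS = [
    ("replace", (" ", "")),  # remove spaces
    ("replace", ("-", "")),  # remove hyphens
]

def dedupe(normalizers, key):
    """Keep the first normalizer seen for each distinct key."""
    seen = set()
    kept = []
    for norm in normalizers:
        k = key(norm)
        if k not in seen:
            seen.add(k)
            kept.append(norm)
    return kept

def apply_all(term, normalizers):
    """Apply each normalizer in order; only "replace" is modeled here."""
    for name, params in normalizers:
        if name == "replace":
            old, new = params
            term = term.replace(old, new)
    return term

# Sameness by name only: the second "replace" is discarded.
by_name = dedupe(ISSN_NORMALIZERS, key=lambda n: n[0])
# Sameness by name plus params: both "replace" normalizers survive.
by_name_and_params = dedupe(ISSN_NORMALIZERS, key=lambda n: n)

query = "0028-0836"
print(apply_all(query, by_name))             # 0028-0836 (hyphen survives)
print(apply_all(query, by_name_and_params))  # 00280836
```

With name-only comparison the query term keeps its hyphen while the ingested value would not, which is the mismatch described above.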

While there are a number of ways to solve the immediate problem, I think we should not do so without considering the larger problem of how normalizations are being applied to the incoming query.  As it stands, if you do a 'class' level query (e.g. subject:foo), *all* the normalizations for that class are applied in sequence.  In the default configuration this doesn't cause immediate problems, as they all get the same normalizations anyway, with one exception.  That exception is the 'identifier' class.  In a stock config, if you do a search for 'identifier:foo', you get *both* the ISBN and the ISSN normalizations applied to your search term.  If you are searching for an identifier which is *not* one of these but happens to look like one, your query will fail, as your input will get these incorrect normalizations, while your index did not.
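A small Python sketch of that class-level behavior, as I understand it.  The 'local_id' subclass, the sample value, and the helper names are made up for the example, and the ISBN normalizer is only a do-nothing placeholder since I am not modeling what it really does:

```python
def strip_spaces(term):
    return term.replace(" ", "")

def strip_hyphens(term):
    return term.replace("-", "")

def isbn_placeholder(term):
    # Placeholder only: the real ISBN normalization is NOT modeled here.
    return term

# Hypothetical per-subclass mapping for the 'identifier' class.
FIELD_NORMALIZERS = {
    "isbn": [isbn_placeholder],
    "issn": [strip_spaces, strip_hyphens],
    "local_id": [],  # assumed subclass with no special normalization
}

def normalize(term, normalizers):
    for fn in normalizers:
        term = fn(term)
    return term

def normalize_for_index(subclass, value):
    """At ingest, only the subclass's own normalizers are applied."""
    return normalize(value, FIELD_NORMALIZERS[subclass])

def normalize_class_query(term):
    """For a class-level query, every normalizer in the class is applied."""
    combined = []
    for fns in FIELD_NORMALIZERS.values():
        for fn in fns:
            if fn not in combined:
                combined.append(fn)
    return normalize(term, combined)

indexed = normalize_for_index("local_id", "AB-1234")
queried = normalize_class_query("AB-1234")
print(indexed, queried, indexed == queried)  # AB-1234 AB1234 False
```

The indexed value keeps its hyphen, but the 'identifier:' query term loses it to the ISSN normalization, so the two never meet.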

Add more normalizations for other subclasses and they would all get applied as well.  The potential for bad interactions seems pretty clear.

I am new to this code and may be missing some subtleties, but if this is more or less correct, what are our best options for fixing this?  I'll start by at least stating two of the more obvious options:

1) create a specific superfield for each class (e.g. subject|subject or title|title).  This field will get only the most generic normalizations, and queries without a subclass (e.g. subject: or title:) would use only these fields with the same generic normalizations applied to the term.  This is pretty simple, but could cause some significant bloat to an already heavy scheme, and would also mean that these more generic queries do not benefit from any particularly beneficial normalizations (e.g. an identifier:foo search where foo actually *is* an ISBN would not get the ISBN10/13 interchangeability).

2) apply all the distinct "normalization-groups" separately, then OR the resulting terms.  So, for example, a search for identifier:foo in the default setup would result in (as pseudo-query) "foo | translate_isbn1013('foo') | replace(replace('foo', '-',''), ' ', '')", that is, the unnormalized, or the ISBN normalized, or the ISSN normalized.  Cases where all the normalizations can be determined to be the "same" would be unchanged (e.g. subject:foo would still be "split_date_range(naco_normalize('foo'))"), as all the subject subclasses use this same "normalization-group", so there is nothing to OR together.  Under this scheme we maintain the special treatment of particular field_entries at the expense of query-time overhead, particularly as the normalizations diverge within a class.
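Option 2 could be sketched roughly as follows.  This is Python building the pseudo-query as a string, not a proposal for the real implementation; the config layout, the subclass names ('other', 'topic', 'name'), and the function names are assumptions, and the unnormalized alternative is modeled as a subclass with an empty normalizer list:

```python
# Each subclass maps to its "normalization-group": an ordered tuple of
# (name, params) pairs.  Layout and subclass names are hypothetical.
IDENTIFIER = {
    "other": (),  # assumed subclass with no normalizers (the raw term)
    "isbn": (("translate_isbn1013", ()),),
    "issn": (("replace", (" ", "")), ("replace", ("-", ""))),
}
NACO_GROUP = (("naco_normalize", ()), ("split_date_range", ()))
SUBJECT = {
    "topic": NACO_GROUP,
    "name": NACO_GROUP,
}

def distinct_groups(class_config):
    """Collect each distinct normalization-group once, in config order."""
    groups = []
    for group in class_config.values():
        if group not in groups:
            groups.append(group)
    return groups

def render(term, group):
    """Render one group as a nested pseudo-query expression."""
    expr = repr(term)
    for name, params in group:
        args = "".join(", " + repr(p) for p in params)
        expr = f"{name}({expr}{args})"
    return expr

def or_query(term, class_config):
    """OR together the term as normalized by each distinct group."""
    return " | ".join(render(term, g) for g in distinct_groups(class_config))

print(or_query("foo", IDENTIFIER))
print(or_query("foo", SUBJECT))  # split_date_range(naco_normalize('foo'))
```

Note that groups are compared by name and params together, so this would also sidestep the immediate ISSN problem.  The cost is one OR branch per distinct group, which grows as the normalizations within a class diverge.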

Thoughts?

Thanks,
Dan

-- 
*********************************************************************************
Daniel Wells, Library Programmer Analyst dbw2 at calvin.edu
Hekman Library at Calvin College
616.526.7133