[OPEN-ILS-DEV] Normalization Concerns, Present and Future

Tue Mar 8 12:57:51 EST 2011

On Tue, Mar 8, 2011 at 11:24 AM, Dan Wells <dbw2 at calvin.edu> wrote:
> Hello all,
>
> We have recently accelerated our timetable for moving to 2.0, and this sudden immersion is what is causing our recent increase in bug reporting.  Yesterday morning I was exploring a problem with ISSN normalization, and I have spent the last day trying to understand not only the immediate issues, but also what I think is a more serious issue going forward.
>
> First, the immediate problem.  There is currently code in the OpenILS::Application::Storage::Driver::Pg::QueryParser::query_plan::node::atom package which attempts to apply normalizations to the incoming query term.  Because of the way the normalizations are mapped specifically but can be applied generically (more on that later), the code does not allow the "same" normalization to be applied more than once.  Here, currently, sameness is defined by the normalizer "name", but since we can now have "params" for the normalizer, comparing the name is no longer enough to determine sameness.  In the case of ISSNs, the default config maps two "replace" normalizations to the ISSN entries, one to remove ' ' (space), the next to remove '-'.  As it stands, only one of the two gets applied to the incoming query, as the other is discarded for being the "same", so the query can fail, even when specifying 'identifier|issn:', as both are properly applied when the record is ingested.
>

This should indeed be addressed.  Including params in the "sameness"
key seems the obvious way to fix it.
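
To make that concrete, here is a rough sketch of the idea.  It is
Python rather than the actual Perl in the QueryParser driver, and the
function names and the (name, params) tuple layout are assumptions
made up for the example, not the real data structures:

```python
def dedupe_normalizers(normalizers):
    """Drop repeated normalizers, keyed on (name, params) instead of name alone."""
    seen = set()
    unique = []
    for name, params in normalizers:
        key = (name, tuple(params))  # params are part of the "sameness" key
        if key in seen:
            continue
        seen.add(key)
        unique.append((name, params))
    return unique


def apply_normalizers(term, normalizers):
    """Apply each normalizer in order; only a stand-in 'replace' is modeled here."""
    for name, params in normalizers:
        if name == "replace":
            term = term.replace(params[0], params[1])
    return term


# Hypothetical stand-in for the stock ISSN mapping: two "replace" normalizers
# that share a name but differ in params.
issn_normalizers = [("replace", [" ", ""]), ("replace", ["-", ""])]
```

With the name-only key, the second "replace" would be dropped; with
the combined key, both survive and the query term is normalized the
same way the indexed value was.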

> While there are a number of ways to solve the immediate problem, I think we should not do so without considering the larger problem of how normalizations are being applied to the incoming query.  As it stands, if you do a 'class' level query (e.g. subject:foo), *all* the normalizations for that class are applied in sequence.  In the default configuration this doesn't cause immediate problems, as they all get the same normalizations anyway, with one exception.  That exception is the 'identifier' class.  In a stock config, if you do a search for 'identifier:foo', you get *both* the ISBN and the ISSN normalizations applied to your search term.  If you are searching for an identifier which is *not* one of these but happens to look like one, your query will fail, as your input will get these incorrect normalizations, while your index did not.
>
> Add more normalizations for other subclasses and they would all get applied as well.  The potential for bad interactions seems pretty clear.
>
> I am new to this code and may be missing some subtleties, but if this is more or less correct, what are our best options for fixing this?  I'll start by at least stating two of the more obvious options:
>
> 1) create a specific superfield for each class (e.g. subject|subject or title|title).  This field will get only the most generic normalizations, and queries without a subclass (e.g. subject: or title:) would use only these fields with the same generic normalizations applied to the term.  This is pretty simple, but could cause some significant bloat to an already heavy scheme, and would also mean that these more generic queries do not benefit from any particularly beneficial normalizations (e.g. an identifier:foo search where foo actually *is* an ISBN would not get the ISBN10/13 interchangeability).
>
> 2) apply all the distinct "normalization-groups" separately, then OR the resulting terms.  So, for example, a search for identifier:foo in the default setup would result in (as pseudo-query) "foo | translate_isbn1013('foo') | replace(replace('foo', '-',''), ' ', '')", that is, the unnormalized, or the ISBN normalized, or the ISSN normalized.  Cases where all the normalizations can be determined to be the "same" would be unchanged (e.g. subject:foo would still be "split_date_range(naco_normalize('foo'))"), as all the subject subclasses use this same "normalization-group", so there is nothing to OR together.  Under this scheme we maintain the special treatment of particular field_entries at the expense of query-time overhead, particularly as the normalizations diverge within a class.
>
> Thoughts?

The drawbacks you mention for both of the options above are among
the reasons we don't do this currently.  In particular, a tsquery
containing many ORed conditions can be significantly slower than
the currently constructed queries.
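
For reference, option 2 amounts to something like the sketch below.
Again this is illustrative Python, not Evergreen code: the
NORMALIZERS table, build_or_terms, and the per-field chains are
hypothetical stand-ins, and the real ISBN normalizer is left out.
Each distinct normalizer chain in a class yields one more alternative
to OR into the query:

```python
# Stand-in normalizer functions, keyed by name (illustrative only).
NORMALIZERS = {
    "replace": lambda s, old, new: s.replace(old, new),
    "lowercase": lambda s: s.lower(),
}


def build_or_terms(term, field_chains):
    """Apply each distinct normalizer chain to term; return the unique results in order."""
    seen_chains = set()
    results = []
    for chain in field_chains.values():
        key = tuple((name, tuple(params)) for name, params in chain)
        if key in seen_chains:
            continue  # identical chains contribute nothing new
        seen_chains.add(key)
        value = term
        for name, params in chain:
            value = NORMALIZERS[name](value, *params)
        if value not in results:
            results.append(value)
    return results


# Hypothetical per-field chains for two classes.
identifier_chains = {
    "other": [],  # no normalization
    "issn": [("replace", [" ", ""]), ("replace", ["-", ""])],
}
subject_chains = {
    "topic": [("lowercase", [])],
    "name": [("lowercase", [])],
}
```

A class whose fields all share one chain collapses to a single term,
while every divergent chain adds another ORed condition, which is
where the query-time cost comes from.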

So, an obvious option is to not use whole classes in those cases.
Now, if the concern is having to know to say "identifier|isbn" or
"identifier|issn" instead of simply "identifier", there's a mechanism
for providing aliases such that "identifier|isbn" can be spelled
"isbn": the config.metabib_search_alias table.  In fact, the alias is
already there, just spelled "eg.isbn" (along with "eg.issn").  If you
want to search both at the same time, using the same value and only
the appropriate normalizations for the specific fields, you can write:
eg.isbn: foobar || eg.issn: foobar

With the understood restriction that class-wide searches apply all
unique normalizations (definition of "unique" to be addressed, as you
point out above), is this an unacceptable documentation-based
solution?

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com