[OPEN-ILS-DEV] Normalization Concerns, Present and Future

Tue Mar 8 16:14:07 EST 2011

-- 
*********************************************************************************
Daniel Wells, Library Programmer Analyst dbw2 at calvin.edu
Hekman Library at Calvin College
616.526.7133
>>> On 3/8/2011 at 12:57 PM, Mike Rylander <mrylander at gmail.com> wrote:
> 
> The drawbacks you mention in both of the above options you list are
> among the reasons we don't do this currently.  In particular, a
> tsquery containing many ORed conditions can be significantly slower than
> the currently constructed queries.
> 
> So, an obvious option is to not use whole classes in those cases.
> Now, if the concern is knowing that you need to say "identifier|isbn" or
> "identifier|issn" instead of simply "identifier", there's a mechanism
> for providing aliases such that "identifier|isbn" can be spelled
> "isbn" in the config.metabib_search_alias table.  In fact, it's
> already there, just spelled "eg.isbn" (along with "eg.issn").  If you
> want to search both at the same time, using the same value and only
> the appropriate normalizations for the specific fields, you can write:
> eg.isbn: foobar || eg.issn: foobar
> 
> With the understood restriction that class-wide searches apply all
> unique normalizations (definition of "unique" to be addressed, as you
> point out above), is this an unacceptable documentation-based
> solution?

I understand that the specific searches will work in any case, but it makes me more than a little uncomfortable that adding normalizers to a specific field will silently break searches at the class level.  For instance, using the default config, *any* identifier which contains a '-' but is *not* an ISBN or an ISSN will no longer be findable when doing an 'identifier' class search.  Consider a more specific example: a search for an ISMN using the term 'identifier:979-0-060-11561-5' will not find the record containing that number, because the dashes will be removed from the query but will still be present in the DB (ISMN fields are not currently normalized), while 'identifier|ismn:979-0-060-11561-5' works fine.  In my opinion this is more than confusing; it's broken.
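
To make the mismatch concrete, here is a rough sketch in Python. It is not Evergreen code: the field table, the strip_dashes normalizer, and the exact-string comparison (standing in for the real full-text match) are all assumptions made for illustration.

```python
def strip_dashes(value):
    """Hypothetical normalizer: remove hyphens from an identifier."""
    return value.replace("-", "")

# Assumed field-level config: normalizers applied to each field at index time.
FIELD_NORMALIZERS = {
    "isbn": [strip_dashes],
    "issn": [strip_dashes],
    "ismn": [],  # ISMN values are stored as entered
}

def normalize(value, normalizers):
    for fn in normalizers:
        value = fn(value)
    return value

def index_value(field, value):
    """What ends up stored for a value indexed under one field."""
    return normalize(value, FIELD_NORMALIZERS[field])

def field_query(field, value):
    """A field-specific search uses only that field's normalizers."""
    return normalize(value, FIELD_NORMALIZERS[field])

def class_query(value):
    """A class-wide search applies every unique normalizer in the class."""
    unique = []
    for fns in FIELD_NORMALIZERS.values():
        for fn in fns:
            if fn not in unique:
                unique.append(fn)
    return normalize(value, unique)

ismn = "979-0-060-11561-5"
stored = index_value("ismn", ismn)

# Class search: dashes stripped from the query, but not from the stored value.
class_hit = class_query(ismn) == stored
# Field search: query and stored value are treated identically.
field_hit = field_query("ismn", ismn) == stored
```

In this sketch class_hit comes out False and field_hit comes out True, which is the silent breakage described above.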

In addition to the two options I mentioned previously, a third simple option would be to only allow normalizers at the class level, not the field level.  This would require a bit more care, perhaps cause some mild to moderate bloating, and would lead to some false positives in some cases, but false positives beat false negatives every time (especially if the false negatives amount to eliminating entire subclasses, as in the ISMN case).  Since we are already essentially normalizing at the class level for the vast majority of the fields, any transition should be pretty painless.
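
As a rough sketch of that third option (again Python for illustration only, with hypothetical names and an exact-string comparison standing in for the real full-text match), the same normalizers run over every field in the class at both index time and query time:

```python
def strip_dashes(value):
    """Hypothetical normalizer: remove hyphens from an identifier."""
    return value.replace("-", "")

# Assumed class-level config: one normalizer list shared by the whole class.
CLASS_NORMALIZERS = {"identifier": [strip_dashes]}

def normalize(search_class, value):
    for fn in CLASS_NORMALIZERS[search_class]:
        value = fn(value)
    return value

# Hypothetical raw values; "local_id" is a made-up field for the example.
raw_values = {
    "ismn": "979-0-060-11561-5",
    "local_id": "12-34",
}

# Every field in the class is normalized the same way when indexed.
index = {field: normalize("identifier", v) for field, v in raw_values.items()}

def search(search_class, term):
    """Return the fields whose stored value matches the normalized term."""
    needle = normalize(search_class, term)
    return sorted(field for field, v in index.items() if v == needle)

ismn_hits = search("identifier", "979-0-060-11561-5")
loose_hits = search("identifier", "1-234")
```

Here the ISMN search finds its record because both sides lose their dashes. The price is that '1-234' also matches the stored '12-34', which is the kind of false positive I would accept over a false negative.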

Dan