[OPEN-ILS-DEV] Normalization Concerns, Present and Future

Tue Mar 8 16:44:30 EST 2011

On Tue, Mar 8, 2011 at 4:14 PM, Dan Wells <dbw2 at calvin.edu> wrote:
>>>> On 3/8/2011 at 12:57 PM, Mike Rylander <mrylander at gmail.com> wrote:
>>
>> The drawbacks you mention in both of the above options you list are
>> among the reasons we don't do this currently.  In particular, a
>> tsquery containing many ORed conditions can be significantly slower than
>> the currently constructed queries.
>>
>> So, an obvious option is to not use whole classes in those cases.
>> Now, if the concern is having to know to say "identifier|isbn" or
>> "identifier|issn" instead of simply "identifier", there's a mechanism
>> for providing aliases, the config.metabib_search_alias table, such
>> that "identifier|isbn" can be spelled "isbn".  In fact, it's
>> already there, just spelled "eg.isbn" (along with "eg.issn").  If you
>> want to search both at the same time, using the same value and only
>> the appropriate normalizations for the specific fields, you can write:
>> eg.isbn: foobar || eg.issn: foobar
>>
>> With the understood restriction that class-wide searches apply all
>> unique normalizations (definition of "unique" to be addressed, as you
>> point out above), is this an unacceptable documentation-based
>> solution?
>
> I understand that the specific searches will work in any case, but it makes me more than a little uncomfortable that adding normalizers to a specific field will silently break searches at the class level.  For instance, using the default config, *any* identifier which contains a '-' but is *not* an ISBN or an ISSN will no longer be findable when doing an 'identifier' class search.  Consider a more specific example, a search for an ISMN using the term 'identifier:979-0-060-11561-5' will not find the record containing that number, as the dashes will be removed from the query but still be present in the DB (as ISMN fields are not currently normalized), while 'identifier|ismn:979-0-060-11561-5' works fine.  In my opinion this is more than confusing, it's broken.
>

That's where we disagree.  Especially in the case of the identifier
class (which, given, means nothing special to the /software/, but what
we as humans put in there /is/ special) I think a strong case can be
made for the documentation approach.

What this really comes down to is intended use, remembering that the
compromises were all made with full knowledge that class-wide searches
would over-normalize some fields.  What, exactly, are you trying to do
that can't be done with the prescribed spelling of searches I note
above?
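
For reference, the over-normalization being discussed can be modelled
with a toy sketch.  This is not Evergreen code: the function names, the
exact-match "index", and the per-field normalizer table are all
assumptions made for illustration (the real search goes through
tsquery/tsvector).  It only shows why a dash-stripping normalizer on
isbn/issn makes a class-wide search miss an un-normalized ISMN while
the field-specific search still finds it.

```python
def strip_dashes(value):
    """Stand-in for a dash-removing normalizer (hypothetical name)."""
    return value.replace("-", "")

# Hypothetical per-field normalizer config for the "identifier" class.
# ISMN has no normalizers, mirroring the default config described above.
FIELD_NORMALIZERS = {
    "isbn": [strip_dashes],
    "issn": [strip_dashes],
    "ismn": [],
}

def normalize(value, normalizers):
    """Apply each normalizer in order."""
    for fn in normalizers:
        value = fn(value)
    return value

def build_index(records):
    """Store each (field, value) using only that field's own normalizers."""
    index = {}
    for field, value in records:
        stored = normalize(value, FIELD_NORMALIZERS[field])
        index.setdefault(field, set()).add(stored)
    return index

def class_search(index, term):
    """Class-wide search: apply every unique normalizer in the class."""
    unique = []
    for fns in FIELD_NORMALIZERS.values():
        for fn in fns:
            if fn not in unique:
                unique.append(fn)
    needle = normalize(term, unique)
    return any(needle in values for values in index.values())

def field_search(index, field, term):
    """Field-specific search: apply only that field's normalizers."""
    needle = normalize(term, FIELD_NORMALIZERS[field])
    return needle in index.get(field, set())

index = build_index([("ismn", "979-0-060-11561-5")])
print(class_search(index, "979-0-060-11561-5"))          # class-wide
print(field_search(index, "ismn", "979-0-060-11561-5"))  # identifier|ismn
```

In this model the class-wide search strips the dashes from the query
but the stored ISMN keeps them, so only the field-specific search
matches.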

> In addition to the two options I mentioned previously, a third simple option would be to only allow normalizers at the class level, not the field level.  This would require a bit more care, perhaps cause some mild to moderate bloating, and would lead to some false positives in some cases, but false positives beat false negatives every time (especially if the false negatives amount to eliminating entire subclasses, as in the ISMN case).

I don't actually think you can do what you need to for all cases with
only class-level normalizations.  That was the conclusion I came to
when designing this bit, and without evidence to the contrary I'm
inclined to work on additive solutions.  Specifically, what about when
we need some normalizations for some fields in a class but explicitly
don't want those same normalizations for others?  Also, facets are not
classes, but fields, and will absolutely require field-level
normalizations.

>  Since we are already essentially normalizing at the class level for the vast majority of the fields, any transition should be pretty painless.

Except where it isn't, and for facets ... ;)

Actually, there's a fourth way, which (if not conceptually,
physically) removes existing bloat: allow /both/ class and field
normalization; only apply class normalization to class-wide searches
and all applicable normalizations to specific fields.

Of course, this falls down in the other direction: if you don't supply
all the field-level normalizers at the class level then you miss some
normalizations and therefore miss hits.

And, so ... here's a fifth way.  You can currently say things like
"identifier|isbn|issn:foobar" and get an ORed search across the two
fields, but(!) only one join clause is generated.  We basically do a
class-wide join/normalization and then restrict to those rows where
the field is in the named list.  However(!!), it should be relatively
simple to have QP transmute that example into the parse-tree
equivalent of "identifier|isbn:foobar || identifier|issn:foobar",
which does exactly what you want.  Now, if we add the ability to
assign multiple fields to an alias (almost trivial), we can have an
alias (say, "dbwells_idents") that translates to a list of fields, and
then you can write "dbwells_idents:foobar" and the right thing is done
in all cases /but/ the class-wide search.  Is the original "problem"
still there?  Yes, but we haven't removed functionality (field
normalizers) or caused pain elsewhere (facets), and we have provided a
workaround for a known issue.
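
A rough sketch of the rewrite described above, working on query strings
rather than on QP's parse tree.  Everything here is an assumption for
illustration: the expand() function, the in-memory ALIASES dict standing
in for config.metabib_search_alias, and the multi-field alias (which
does not exist yet).  It is not the QueryParser implementation.

```python
# Hypothetical alias table: alias -> list of (class, field) targets.
# "eg.isbn" and "eg.issn" exist today; the multi-field alias
# "dbwells_idents" is the proposed extension.
ALIASES = {
    "eg.isbn": [("identifier", "isbn")],
    "eg.issn": [("identifier", "issn")],
    "dbwells_idents": [("identifier", "isbn"), ("identifier", "issn")],
}

def expand(query):
    """Rewrite 'class|f1|f2:term' or 'alias:term' into ORed
    single-field searches, so each field gets its own normalizers."""
    spec, _, term = query.partition(":")
    term = term.strip()
    if spec in ALIASES:
        targets = ALIASES[spec]
    else:
        cls, *fields = spec.split("|")
        if not fields:
            # Class-wide search: left untouched, so the known
            # over-normalization compromise still applies here.
            return f"{cls}:{term}"
        targets = [(cls, field) for field in fields]
    return " || ".join(f"{c}|{f}:{term}" for c, f in targets)

print(expand("identifier|isbn|issn:foobar"))
print(expand("dbwells_idents:foobar"))
```

Both calls produce "identifier|isbn:foobar || identifier|issn:foobar";
a bare "identifier:foobar" passes through unchanged.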

But, I think it's important to note again, what exists now is a known
and understood design compromise, and except in the case of the
identifier class, is nearly a non-issue in practice.  And, for the
identifier class, class-wide searches don't make much sense, at least
IMO.

Finally, having said all that, we still do need to address the way
"unique normalizer" is defined in the code.

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com