[OPEN-ILS-DEV] Normalization Concerns, Present and Future

Tue Mar 8 18:28:20 EST 2011

>> I understand that the specific searches will work in any case, but it makes 
>> me more than a little uncomfortable that adding normalizers to a specific 
>> field will silently break searches at the class level.  For instance, using 
>> the default config, *any* identifier which contains a '-' but is *not* an ISBN 
>> or an ISSN will no longer be findable when doing an 'identifier' class 
>> search.  Consider a more specific example, a search for an ISMN using the 
>> term 'identifier:979-0-060-11561-5' will not find the record containing that 
>> number, as the dashes will be removed from the query but still be present in 
>> the DB (as ISMN fields are not currently normalized), while 
>> 'identifier|ismn:979-0-060-11561-5' works fine.  In my opinion this is more than 
>> confusing, it's broken.
>>
> 
> That's where we disagree.  Especially in the case of the identifier
> class (which, given, means nothing special to the /software/, but what
> we as humans put in there /is/ special) I think a strong case can be
> made for the documentation approach.
> 
> What this really comes down to, remembering that the compromises made
> were all intended with full knowledge that class-wide searches would
> over-normalize some fields, is intended uses.  What, exactly, are you
> trying to do that can't be done with the prescribed spelling of
> searches I note above?
> 

The capabilities of the system as-is are powerful, but it really seems to me to be a poor end-user experience.  When a search for 'foo:bar' systematically excludes (not just misses, but excludes*) results found by 'foo|baz:bar' (something which by all accounts looks like a more specific version of the first), and does so via invisible alteration of the query term, no amount of documentation is going to make that sensible to our patrons.  If we do not end up tweaking how the query works, we should strongly consider hiding or somehow disabling the class-level 'identifier' search OOTB.

* By 'excludes' I mean that *no* version of the term will bring the record up through that index, because there is no way to get past the non-applicable normalizer.
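
To make the failure concrete, here is a toy model in Python.  It is not Evergreen's actual code: the names (FIELD_NORMALIZERS, class_search, and so on) are invented for illustration, and it assumes a class-level query gets every normalizer configured on any field in the class, which is how I understand the behavior described above.

```python
def strip_dashes(value):
    """Remove dashes, as a stand-in for an ISBN/ISSN-style normalizer."""
    return value.replace("-", "")

# Hypothetical config: field name -> normalizers applied to that field.
FIELD_NORMALIZERS = {
    "isbn": [strip_dashes],
    "issn": [strip_dashes],
    "ismn": [],  # ISMN values are stored as-is
}

def normalize(value, normalizers):
    for fn in normalizers:
        value = fn(value)
    return value

def build_index(records):
    """records: iterable of (record_id, field, raw_value)."""
    return [(rid, field, normalize(value, FIELD_NORMALIZERS[field]))
            for rid, field, value in records]

def field_search(index, field, term):
    """Query one field; the term gets only that field's normalizers."""
    wanted = normalize(term, FIELD_NORMALIZERS[field])
    return {rid for rid, f, v in index if f == field and v == wanted}

def class_search(index, term):
    """Query the whole class; the term gets every field's normalizers (assumed)."""
    combined = []
    for fns in FIELD_NORMALIZERS.values():
        for fn in fns:
            if fn not in combined:
                combined.append(fn)
    wanted = normalize(term, combined)
    return {rid for rid, f, v in index if v == wanted}

index = build_index([
    (1, "isbn", "978-0-306-40615-7"),
    (2, "ismn", "979-0-060-11561-5"),
])
```

In this model field_search(index, "ismn", "979-0-060-11561-5") finds record 2, but class_search returns nothing for that record whether the term is typed with dashes or without them, since the stored value keeps its dashes and the query never does.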

>> In addition to the two options I mentioned previously, a third simple option 
>> would be to only allow normalizers at the class level, not the field level.  
>> This would require a bit more care, perhaps cause some mild to moderate 
>> bloating, and would lead to some false positives in some cases, but false 
>> positives beat false negatives every time (especially if the false negatives 
>> amount to eliminating entire subclasses, as in the ISMN case).
> 
> I don't actually think you can do what you need to for all cases with
> only class-level normalizations.  That was the conclusion I came to
> when designing this bit, and without evidence to the contrary I'm
> inclined to work on additive solutions.  Specifically, what of when we
> need some normalizations for some fields in a class but explicitly
> don't want those same normalizations for others?

When normalizing for the purposes of search, all we are really trying to do is remove meaningless differences while preserving as much meaning as we can, right?  As evidenced by our current setup, this is usually best done in as general a way as possible.  For many other situations, normalizations which add to the index (rather than remove) will always be safe to apply on the index side and are unnecessary on the query side.  If the contents of two fields are truly so different that a reasonable normalization cannot be achieved via generalized methods, index additions, or simple 'duck-typing' of the data, then perhaps they are not in the same class to begin with.
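
As a rough sketch of what I mean by a normalization which adds to the index, here is a small Python example.  The function names are hypothetical and the exact-match lookup is a simplification; the point is only that the original form is kept alongside the added one, so the query term needs no alteration.

```python
def index_forms(value):
    """Return every form under which a value is indexed.

    The original form is always kept; the dashless form is an addition.
    """
    return {value, value.replace("-", "")}

def matches(indexed_value, query_term):
    """The query term is used exactly as typed; only the index is expanded."""
    return query_term in index_forms(indexed_value)
```

With this, both matches("979-0-060-11561-5", "979-0-060-11561-5") and matches("979-0-060-11561-5", "9790060115615") are true, at the cost of a somewhat larger index.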

> Actually, there's a fourth way, which (if not conceptually,
> physically) removes existing bloat: allow /both/ class and field
> normalization; only apply class normalization to class-wide searches
> and all applicable normalizations to specific fields.
> 
> Of course, this falls down in the other direction: if you don't supply
> all the field-level normalizers at the class level then you miss some
> normalizations and therefore miss hits.
> 

I think this would be a huge step in the right direction.  This makes it at least *possible* to use the class-level queries and find everything in the subclasses.  If we could consider combining this with a convention that field-level normalizers (for purposes of search) do not /remove/ the class-normalized version from their index entry (but add whatever they like), then I really think we would be golden for the vast majority of user expectations (with a big one being "if I see 'foo-bar' in the record and type it in exactly, at least that record will come up").
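
Here is a minimal Python sketch of that combination, under my own assumptions: the class-level normalizer is just trim-and-lowercase, field-level normalizers may only add forms, and all the names (class_normalize, FIELD_ADDITIONS, index_entry) are made up for the example rather than taken from Evergreen.

```python
def class_normalize(value):
    """Class-level normalization, applied to index and query alike (assumed)."""
    return value.strip().lower()

def dashless(value):
    """Field-level addition: contributes a dashless form."""
    return {value.replace("-", "")}

# Hypothetical config: field name -> additive normalizers.
FIELD_ADDITIONS = {
    "isbn": [dashless],
    "ismn": [],
}

def index_entry(field, value):
    """Class-normalized form is always kept; field normalizers only add."""
    base = class_normalize(value)
    forms = {base}
    for fn in FIELD_ADDITIONS[field]:
        forms |= fn(base)
    return forms

def class_search(index, term):
    """Class-wide search applies only the class-level normalization."""
    wanted = class_normalize(term)
    return {rid for rid, field, forms in index if wanted in forms}

def field_search(index, field, term):
    """Field search applies class plus field normalizations to the term."""
    wanted = index_entry(field, term)
    return {rid for rid, f, forms in index if f == field and wanted & forms}

records = [
    (1, "isbn", "0-306-40615-2"),      # sample value
    (2, "ismn", "979-0-060-11561-5"),
]
index = [(rid, field, index_entry(field, value)) for rid, field, value in records]
```

Here class_search(index, "979-0-060-11561-5") returns record 2, so typing the identifier exactly as it appears in the record always finds it.  A dashless ISMN query would still miss in this sketch, but only because the hypothetical config gives the ISMN field no additions; nothing at the class level blocks it.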

Thoughts?

And, as always, thank you for your consideration.

Dan

-- 
*********************************************************************************
Daniel Wells, Library Programmer Analyst dbw2 at calvin.edu
Hekman Library at Calvin College
616.526.7133