[OPEN-ILS-DEV] Re: direct_ingest.pl, biblio_fingerprint.js and Unicode chars

Mike Rylander mrylander at gmail.com
Mon Dec 14 09:42:36 EST 2009


On Mon, Dec 14, 2009 at 12:07 AM, Dan Scott <dan at coffeecode.net> wrote:
> On Sun, 2009-12-13 at 23:36 -0500, Warren Layton wrote:
>> On Sun, Dec 13, 2009 at 8:50 PM, Dan Scott <dan at coffeecode.net> wrote:
>> > That issue notwithstanding, I would be in favour of applying this patch
>> > to trunk at this time, and with a little more testing and confirmation
>> > of the fingerprinting goals, I would like to see it backported to the
>> > 1.6 series.
>>
>> Thanks for testing this patch, Dan, and suggesting it for trunk.
>>
>> (And I, too, am curious about the goals of fingerprinting and whether
>> non-ASCII is acceptable.)
>
> In my opinion, it has to be acceptable if we want to support metarecord
> grouping for Armenian and Czech and Russian and Nepalese - all languages
> that we've either had some translations contributed for, or which have
> had people working on getting Evergreen running (or both).
>

The purpose of removing characters outside the ASCII range (well,
actually, the original design was to remove non-spacing marks from
NFD characters, but that seems impossible in JS) is to reduce names to
the lowest common denominator -- think Chávez vs Chavez, the likes of
which are extremely common in public library catalogs, especially when
merging records from institutions with different cataloging standards.

Since we're not aware of anyone actually making use of the mutability
of the fingerprinter (well, beyond me), I don't have too strong an
argument against reimplementing in Perl.  However, in order to make it
nominally possible to retain the functionality, I do feel pretty
strongly that the main body of the fingerprinting and weighting
"quality" logic should be segregated into its own file.

As for the default algorithm, I think removal of non-spacing combining
marks is pretty important.  Replacing the tr/// with lc() and adding
s/\p{M}+//g on the NFD-normalized string will take care of that.
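
As a rough sketch of that folding step (the sub name is made up for
illustration, and this assumes the input has already been decoded into
Perl characters and that decomposing with the core Unicode::Normalize
module is acceptable here):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper, not taken from direct_ingest.pl.
# Folds a decoded character string for fingerprint comparison.
sub fold_for_fingerprint {
    my ($text) = @_;
    $text = NFD($text);      # decompose: U+00E1 becomes "a" + U+0301
    $text =~ s/\p{M}+//g;    # drop every run of combining marks
    return lc($text);        # then lowercase
}
```

With this, "Ch\x{e1}vez" and "Chavez" fold to the same string, while
scripts such as Cyrillic are only lowercased rather than stripped.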

The quality metric includes a bump for language so that records in the
primary language of the catalog will end up (more often than not)
being used as the lead record in metarecords -- without the bump,
non-primary-language records would have an advantage simply because
they have more tags (Romanizations) and that would be suboptimal for
patrons.  So that, too, is pretty important, and one of the original
design reasons for the JS implementation.  Leaving English as the
default seems sane to me, as most adoption of Evergreen is still in
primarily-English-speaking countries, and for Armenian, Czech,
Russian, Nepalese and French-Canadian ;) catalogs, that quality
adjustment can be removed or adjusted appropriately -- local
modification being a main driver in the current JS implementation.
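
To illustrate the idea only (this is not the actual logic in
biblio_fingerprint.js; the sub name, the use of a raw tag count as the
base score, and the bump size are all assumptions made for the
example):

```perl
use strict;
use warnings;

# Hypothetical quality score, for illustration only.
# Base score grows with the number of tags; records in the catalog's
# primary language get a fixed bump on top.
sub record_quality {
    my ($tag_count, $record_lang, $primary_lang, $bump) = @_;
    my $quality = $tag_count;
    $quality += $bump if lc($record_lang) eq lc($primary_lang);
    return $quality;
}

# Made-up numbers: an English record with 40 tags vs. a Russian record
# with 55 tags (extra Romanization fields), in an English catalog.
my $eng = record_quality(40, 'eng', 'eng', 20);
my $rus = record_quality(55, 'rus', 'eng', 20);
```

Here the English record scores higher despite having fewer tags, so it
would lead the metarecord; a site with a different primary language
would change the $primary_lang argument or drop the bump.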

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com

