[OPEN-ILS-DEV] Re: direct_ingest.pl, biblio_fingerprint.js and Unicode chars

Mike Rylander mrylander at gmail.com
Mon Dec 14 15:07:55 EST 2009


On Mon, Dec 14, 2009 at 1:36 PM, Dan Scott <dan at coffeecode.net> wrote:
> On Mon, 2009-12-14 at 09:42 -0500, Mike Rylander wrote:
>> On Mon, Dec 14, 2009 at 12:07 AM, Dan Scott <dan at coffeecode.net> wrote:
>> > On Sun, 2009-12-13 at 23:36 -0500, Warren Layton wrote:
>> >> On Sun, Dec 13, 2009 at 8:50 PM, Dan Scott <dan at coffeecode.net> wrote:
>> >> > That issue notwithstanding, I would be in favour of applying this patch
>> >> > to trunk at this time, and with a little more testing and confirmation
>> >> > of the fingerprinting goals, I would like to see it backported to the
>> >> > 1.6 series.
>> >>
>> >> Thanks for testing this patch, Dan, and suggesting it for trunk.
>> >>
>> >> (And I, too, am curious about the goals of fingerprinting and whether
>> >> non-ASCII is acceptable.)
>> >
>> > In my opinion, it has to be acceptable if we want to support metarecord
>> > grouping for Armenian and Czech and Russian and Nepalese - all languages
>> > that we've either had some translations contributed for, or which have
>> > had people working on getting Evergreen running (or both).
>> >
>>
>> The purpose for removing characters outside the ascii range (well,
>> actually, the original design was for removing non-spacing marks in
>> NFD characters, but that seems impossible in JS) is to thunk to the
>> lowest common denominator -- think Chávez vs Chavez, the likes of which
>> is extremely common in public library catalogs, especially when
>> merging records from institutions with different cataloging standards.
>
> Right, getting to plain ASCII fingerprints cleanly (e.g. not dropping
> characters / strings entirely) where possible makes sense in that
> context, and I think it's a reasonable default for Evergreen.
>
>> Since we're not aware of anyone actually making use of the mutability
>> of the fingerprinter (well, beyond me), I don't have too strong an
>> argument against reimplementing in Perl.  However, in order to make it
>> nominally possible to retain the functionality, I do feel pretty
>> strongly that the main body of the fingerprinting and weighting
>> "quality" logic should be segregated into its own file.
>
> Any opinions on where/how this file would be located? Should it just be
> a separate Perl module that defines the appropriate subroutines that
> then get called by Ingest.pm - something like
> OpenILS::Application::Ingest::English.pm - and then we could provide a
> sample non-English configuration file / module that could be swapped in
> for the less Anglo-centric? Or perhaps language maintainers could
> maintain language-specific versions, or (more likely) versions with
> common requirements.
>

There needs to be just one entry point for this, so
OpenILS::Application::Ingest::Fingerprinter::default seems as good a
module name as any to me.  If we wanted to make that a config file
setting, we could append the value of that setting (or 'default' if it
is unset) to OpenILS::Application::Ingest::Fingerprinter::, stick the
result in a var called $fp_module, and then $fp_module->use() it.  Then
call $fp_module->fingerprint($xml) and $fp_module->quality($xml) as
needed.
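
Roughly, and only as a sketch: the loading described above could look
something like the following.  The load_fingerprinter() helper, the
name check, and the fallback-on-load-failure behaviour are my
assumptions, not existing Evergreen code, and the default module is
stubbed inline with placeholder method bodies so the example runs on
its own.  It uses a plain require in an eval rather than
$fp_module->use(), which would need UNIVERSAL::require.

```perl
use strict;
use warnings;

# Stand-in for OpenILS/Application/Ingest/Fingerprinter/default.pm, defined
# inline so this sketch is self-contained. Method bodies are placeholders,
# not the real fingerprint/quality logic.
package OpenILS::Application::Ingest::Fingerprinter::default;
sub fingerprint { my ($class, $xml) = @_; return lc $xml }
sub quality     { my ($class, $xml) = @_; return length $xml }
# Mark the inline package as loaded so require() does not go looking for a file.
BEGIN { $INC{'OpenILS/Application/Ingest/Fingerprinter/default.pm'} = __FILE__ }

package main;

my $FP_BASE = 'OpenILS::Application::Ingest::Fingerprinter::';

# $setting is the (hypothetical) config-file value; undef or '' means 'default'.
sub load_fingerprinter {
    my ($setting) = @_;
    my $name = (defined $setting && length $setting) ? $setting : 'default';

    # Only accept a bare word, so the setting cannot name arbitrary code.
    die "bad fingerprinter name: $name\n" unless $name =~ /\A\w+\z/;

    my $fp_module = $FP_BASE . $name;
    (my $file = "$fp_module.pm") =~ s{::}{/}g;
    return $fp_module if eval { require $file; 1 };

    # Assumption: an unloadable module falls back to 'default' with a warning.
    die "cannot load fallback $fp_module: $@" if $name eq 'default';
    warn "cannot load $fp_module, using default\n";
    return load_fingerprinter('default');
}

my $fp_module = load_fingerprinter(undef);
my $xml       = '<record>Chavez</record>';
print $fp_module->fingerprint($xml), "\n";
print $fp_module->quality($xml), "\n";
```

The callers only ever see $fp_module, so Ingest.pm would not need to
know which implementation is behind it.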

>> As for the default algorithm, I think removal of non-spacing combining
>> marks is pretty important.  Replacing the tr/// with lc() and adding
>> s/\p{M}+//g will take care of that.
>
> That's okay, as long as it's a configurable option. Icelandic apparently
> treats such characters quite differently: o ó and ö are entirely
> different characters and shouldn't be folded together. Of course, I
> haven't heard anything from any libraries in Iceland yet about adopting
> Evergreen so that's an academic concern :)
>

If I'm understanding what you want (swappable modules), that's
supported by the config file option or the "local modification"
mechanism.  IOW, if an Icelandic library came up with a good
fingerprinter module, the above config-file-based dynamic loading
would make it a drop-in thing, and changing
OpenILS::Application::Ingest::Fingerprinter::default
(OpenILS/Application/Ingest/Fingerprinter/default.pm) locally as
needed would still be possible in the meantime.
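
To make the drop-in idea concrete, here is a minimal sketch of two
interchangeable modules.  Both are illustrative only: the keep_marks
module name is hypothetical, and fingerprint() here takes a plain
string rather than MARCXML so the example stays short.  The default
variant decomposes to NFD before stripping \p{M}, since combining
marks only exist as separate characters after decomposition.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # Unicode rules for lc() on all strings
use Unicode::Normalize ();

# Sketch of the proposed default: lowercase, decompose, drop combining marks.
package OpenILS::Application::Ingest::Fingerprinter::default;
sub fingerprint {
    my ($class, $text) = @_;
    my $s = lc(Unicode::Normalize::NFD($text));
    $s =~ s/\p{M}+//g;           # /g so every run of marks is removed
    return $s;
}

# Hypothetical variant for catalogs where accented letters are distinct
# characters: lowercase only, marks kept in composed (NFC) form.
package OpenILS::Application::Ingest::Fingerprinter::keep_marks;
sub fingerprint {
    my ($class, $text) = @_;
    return lc(Unicode::Normalize::NFC($text));
}

package main;

binmode STDOUT, ':encoding(UTF-8)';

# Same call site, different module: only the name changes.
for my $name (qw(default keep_marks)) {
    my $fp_module = "OpenILS::Application::Ingest::Fingerprinter::$name";
    printf "%-10s %s\n", $name, $fp_module->fingerprint("Ch\x{E1}vez");
}
```

With the default module, Chávez and Chavez collapse to the same
string; with the mark-preserving one, o, ó and ö stay distinct.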

Sane?

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com
