[OPEN-ILS-DEV] direct_ingest.pl, biblio_fingerprint.js and Unicode chars

Sun Dec 13 20:50:25 EST 2009

On Fri, 2009-12-11 at 11:18 -0500, Warren Layton wrote:
> On Tue, Dec 1, 2009 at 10:45 AM, Dan Scott <dan at coffeecode.net> wrote:
> >> If not, and if rolling the try/catch blocks of the script into the
> >> Perl Ingest function is fine, I can go ahead with that and post a
> >> patch here soon.
> >
> > I'd love to see that patch.
> 
> Hello again (and sorry for the delay).
> 
> I'm attaching my patch to include the biblio_fingerprint code directly
> in Ingest.pm. As suggested by Dan, I've included a
> "legacy_script_support" option in opensrf.xml that lets you run the
> old biblio_fingerprint.js script instead.
> 
> A few notes:
> * The Perl code is structured very similarly to the JavaScript code
> (lots of try/catch blocks). There may be a better way to write it...
> * For the quality value, I left most of the increments in (length of
> datafields, matching certain values of the 039 field, etc). The
> exception is the quality bump for language. For some reason that I
> didn't understand, the old script incremented the quality value for
> English records only. If that's still needed, it can be added to the
> Perl code.
> * The other script called by Ingest.pm, biblio_descriptor.js, will
> still be called, regardless of the legacy_script_support setting.
> 
> This is my first stab at this. It has solved the problems I was having
> with direct_ingest.pl for our records but comments and feedback are
> definitely welcome.

Thanks a ton, Warren.

First and foremost, I can confirm that running the patched version of
Ingest.pm against the problem record provided by Dan Wells succeeds,
whereas the biblio_fingerprint.js version breaks. Huzzah!

Initial results from running the records in Open-ILS/tests/datasets/
through the pre-patch and post-patch code (OpenSRF / Evergreen trunk on
Ubuntu Karmic 9.10) show no significant difference in processing speed.

There is a significant difference in the fingerprints that are
generated, however: the Perl version creates fingerprints that contain
characters outside the ASCII range. A few examples:

From hebrew.marc:
Via biblio_fingerprint.js: meromeadehberlin
Via pure Perl ingest: מרומישדהberlin

Via biblio_fingerprint.js: seferamribitshraiber
Via pure Perl ingest: seferḳitsurdineribithametsuyimshṭernbukh

From lul_fre_500.mrc:
Via biblio_fingerprint.js: oeuvresdeflchierfl
Via pure Perl ingest: oeuvresdefléchierfléchier

From music_5k.mrc:
Via biblio_fingerprint.js: destjlandriessen
Via pure Perl ingest: destïjlandriessen

In some cases, like the lul_fre_500.mrc example, the fingerprint from
biblio_fingerprint.js appears to be truncated at a non-ASCII character.
I think the pure Perl approach gives us the results we want, even with
the non-ASCII characters - but if we wanted to strip accents and
maintain romanized versions, we could apply the transformations found
in _data_tag_to_full_rows().
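
As a rough illustration of what stripping accents could look like, here
is a minimal sketch in Python. It is not taken from
_data_tag_to_full_rows() and is not part of the patch; strip_accents is
a hypothetical helper, and it assumes that decomposing to NFD and
dropping combining marks is the transformation we would want:

```python
import unicodedata


def strip_accents(fingerprint: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    # NFD splits a character such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", fingerprint)
    # Category "Mn" (nonspacing mark) covers the combining accents and dots.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Two of the pure Perl fingerprints quoted above.
for sample in ("oeuvresdefléchierfléchier", "destïjlandriessen"):
    print(strip_accents(sample))
```

Under that assumption the two samples come out as
oeuvresdeflechierflechier and destijlandriessen. Hebrew base letters
have no decomposition, so they would pass through unchanged.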

That issue notwithstanding, I would be in favour of applying this patch
to trunk at this time, and with a little more testing and confirmation
of the fingerprinting goals, I would like to see it backported to the
1.6 series.

Thanks again, Warren!