[OPEN-ILS-DEV] direct_ingest.pl, biblio_fingerprint.js and Unicode chars

Tue Dec 1 10:45:33 EST 2009

On Tue, 2009-12-01 at 10:24 -0500, Warren Layton wrote:
> On Mon, Nov 30, 2009 at 7:25 PM, Dan Scott <dan at coffeecode.net> wrote:
> > I suppose the next logical step would be to rewrite the
> > OpenILS::Application::Ingest fingerprint methods to avoid the
> > ScriptRunner server-side JavaScript biblio_fingerprint.js fun and see
> > if that resolves the problem.
> 
> Hi Dan,
> 
> Thanks for all of your messages about this issue -- I wasn't quite
> sure what to try next.
> 
> I've now done a very basic rewrite of the
> OpenILS::Application::Ingest::biblio_fingerprint function so that it
> doesn't use the server-side JavaScript. So far, it looks like it works
> with double-wide character codes. My fix is not yet as complete as the
> biblio_fingerprint.js script, though, so I'll wait until it's in a
> better state before submitting it.

Cool!

> And that's if replacing biblio_fingerprint.js with a Perl solution is
> acceptable. I'm worried that there's an advantage to having it run as
> an external script that I'm missing and that I might be throwing the
> baby out with the bath water. While the biblio_fingerprint.js script
> doesn't seem to be called from outside of Application::Ingest, I keep
> asking myself if there's a good reason for it to be separate that I
> have overlooked.

I believe the design point behind biblio_fingerprint.js and friends was
that they were thought to be librarian-friendly because they were
written in JavaScript, and therefore could be modified to reflect local
policies (e.g. generating a title fingerprint based on 245abc instead of
just 245a). I'm not sure how many libraries have taken advantage of
this, however.
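
To make the 245a versus 245abc point concrete, here is a minimal sketch of
what such a local-policy change could look like. It is not the actual
biblio_fingerprint.js: the record shape, the titleFingerprint name and the
normalization steps (stripping diacritics, lowercasing, dropping everything
but letters and digits) are all assumptions made for illustration.

```javascript
// Hypothetical sketch, NOT the real biblio_fingerprint.js.
// A record is modelled as a plain object: MARC tag -> subfield code -> value.
function titleFingerprint(record, subfieldCodes) {
  const field = record['245'] || {};
  // Concatenate the requested subfields in the order given, e.g. 'a' or 'abc'.
  const raw = subfieldCodes
    .split('')
    .map(code => field[code] || '')
    .join(' ');
  return raw
    .normalize('NFKD')                 // decompose accented characters
    .replace(/\p{M}/gu, '')            // drop the combining marks
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]+/gu, '');  // keep only letters and digits
}

// Made-up sample record for illustration.
const record = {
  '245': { a: 'Les Misérables :', b: 'a novel /', c: 'Victor Hugo.' }
};

const stock = titleFingerprint(record, 'a');    // 'lesmiserables'
const local = titleFingerprint(record, 'abc');  // 'lesmiserablesanovelvictorhugo'
```

In this sketch the local policy is a single argument, which is roughly the
kind of tweak the JavaScript scripts were meant to make easy.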

> If not, and if rolling the try/catch blocks of the script into the
> Perl Ingest function is fine, I can go ahead with that and post a
> patch here soon.

I'd love to see that patch.

The ideal way might be to make it (yet another) configurable option,
something like legacy_scripts in opensrf.xml, to enable the old
behaviour for sites that did adopt modified JS scripts - but with a
warning that the server-side JavaScript approach might not correctly
handle all Unicode characters.
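
Roughly, the switch could behave like the sketch below. Only the
legacy_scripts name comes from the suggestion above; the settings object
(standing in for values already read from opensrf.xml), the function name
and the backend labels are hypothetical, and the real change would live in
the Perl Ingest code rather than in JavaScript.

```javascript
// Hypothetical sketch; only the name "legacy_scripts" comes from this thread.
// `settings` stands in for values already parsed out of opensrf.xml.
function pickFingerprintBackend(settings, warn) {
  // Accept either a boolean or the string 'true'; anything else means "off".
  const legacy =
    String(settings.legacy_scripts || 'false').toLowerCase() === 'true';
  if (legacy) {
    warn('legacy_scripts is enabled: server-side JavaScript fingerprinting ' +
         'might not correctly handle all Unicode characters');
    return 'javascript';
  }
  return 'native';
}

const warnings = [];
const defaultBackend = pickFingerprintBackend({}, msg => warnings.push(msg));
const legacyBackend = pickFingerprintBackend(
  { legacy_scripts: 'true' },
  msg => warnings.push(msg)
);
```

Sites that never touched the JS scripts would get the new code path by
default, and only sites that opt in would see the warning.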

And then phase 2 might be to push the configuration of the title /
author fingerprinting and quality calculations into the database. And
phase 3 would be to write a minimal configuration interface for it.

Thanks, Warren!