[OPEN-ILS-DEV] direct_ingest.pl, biblio_fingerprint.js and Unicode chars

Dan Scott dan at coffeecode.net
Mon Nov 30 14:20:12 EST 2009


On Fri, 2009-11-27 at 15:33 -0500, Warren Layton wrote:
> I'm trying to import a number of bib records with "special" characters
> in the MARC fields. I've gotten as far as running direct_ingest.pl but
> I'm noticing that biblio_fingerprint.js chokes on a few of them.
> 
> Looking a bit closer, I noticed that biblio_fingerprint.js chops
> character codes down to the two least significant hex digits. For
> example, biblio_fingerprint.js turns "Č" (U+010C) into "&#x0c"
> ("form feed"), which causes direct_ingest.pl to skip the record and
> output the following error:
> 
>   "Couldn't process record: invalid character encountered while
> parsing JSON string"
> 
> Attached is a sample record that causes this problem for me (the
> tarball includes both the original MARCXML and the BRE file generated
> from it by marc2bre.pl). Any help would be appreciated! I can open a
> bug on Launchpad, too, if needed.
> 
> Cheers,
>  Warren
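For anyone following along, the failure mode Warren describes can be sketched in a few lines of JavaScript. The "& 0xFF" mask below is my guess at the mechanism (keeping only the low byte of the code point), not the actual code in biblio_fingerprint.js:

```javascript
// "Č" is U+010C; keeping only the two least significant hex digits
// yields U+000C (form feed), a control character that is illegal
// inside a JSON string.
const ch = "Č";                       // LATIN CAPITAL LETTER C WITH CARON
const code = ch.charCodeAt(0);        // 0x010C
const chopped = code & 0xFF;          // 0x000C -- hypothetical mask

console.log(code.toString(16));       // "10c"
console.log(chopped.toString(16));    // "c"

// A raw control character in a JSON string makes the parse fail,
// matching the "invalid character encountered while parsing JSON
// string" error from direct_ingest.pl:
try {
  JSON.parse('"' + String.fromCharCode(chopped) + '"');
} catch (e) {
  console.log("parse failed: " + e.name);   // "parse failed: SyntaxError"
}
```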

Thanks for putting together a reproducible test case, Warren.

One way of avoiding the full choke that direct_ingest.pl experiences is
to revert the following change, which made marc2bre.pl use composed
characters in preference to decomposed characters:
http://svn.open-ils.org/trac/ILS/changeset/12985/trunk/Open-ILS/src/extras/import/marc2bre.pl
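For context, the composed/decomposed distinction that changeset deals with looks like this in a current JavaScript engine (String.prototype.normalize postdates the SpiderMonkey build in question, so this is purely illustrative):

```javascript
// Composed form: a single code point, U+010C.
const composed = "\u010C";
// Decomposed form: base letter "C" plus U+030C COMBINING CARON.
const decomposed = "C\u030C";

console.log(composed.length);                           // 1
console.log(decomposed.length);                         // 2
console.log(composed === decomposed);                   // false
console.log(composed.normalize("NFD") === decomposed);  // true
console.log(decomposed.normalize("NFC") === composed);  // true
```

The two forms render identically but compare as different strings, which is why a pipeline that fingerprints records can behave differently depending on which form marc2bre.pl emits.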

I don't think this is the right way to address the problem, though.
You've traced the real corruption to biblio_fingerprint.js; from there,
one could look at whether it's introduced by calling toLowerCase() (a
non-locale-safe function, and where my immediate suspicion lies) on
"Č", or whether it's a native limitation of the SpiderMonkey JavaScript
interpreter / JavaScript::SpiderMonkey Perl module...
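As a point of comparison, a current Unicode-aware engine lowercases "Č" correctly; whether the SpiderMonkey build we embed does the same is exactly the thing to check:

```javascript
const upper = "\u010C";                 // Č
const lower = upper.toLowerCase();      // expect č (U+010D) from a
                                        // Unicode-aware toLowerCase()
console.log(lower === "\u010D");                // true
console.log(lower.charCodeAt(0).toString(16));  // "10d"
```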

And then I wonder whether it might be better to fast-forward this part
of the in-database ingest implementation so that we can define an
ordered list of author sources, title sources, and quality measurements
and run the import through the database, which would have the added
bonus of not heavily favouring English sources in an out-of-the-box
implementation. I'm not as keen on reimplementing the record_type
identification portion of biblio_fingerprint.js, mind you.

But given the existing framework, direct_ingest.pl would hit the
database once for every record that we subsequently import via
pg_loader.pl, rather than pushing the records into the database and
saying "ingest!" - which might be a better way of approaching the
problem.

Okay, I've diverged significantly from the original topic, in a
not-very-precise way. The next concrete thing I'll do is pull the
toLowerCase() call from biblio_fingerprint.js to see whether that's
where the corruption is really being introduced.

Dan


