[OPEN-ILS-DEV] Re: direct_ingest.pl, biblio_fingerprint.js and Unicode chars

Dan Scott dan at coffeecode.net
Mon Nov 30 15:00:39 EST 2009


On Mon, 2009-11-30 at 14:20 -0500, Dan Scott wrote:
> On Fri, 2009-11-27 at 15:33 -0500, Warren Layton wrote:
> > I'm trying to import a number of bib records with "special" characters
> > in the MARC fields. I've gotten as far running direct_ingest.pl but
> > I'm noticing that biblio_fingerprint.js chokes on a few of them.
> > 
> > Looking a bit closer, I noticed that biblio_fingerprint.js chops
> > character codes down to the two least significant hex digits. For
> > example, biblio_fingerprint.js turns "Č" (U+010C) into U+000C ("form
> > feed"), which causes direct_ingest.pl to skip the record and output
> > the following error:
> > 
> >   "Couldn't process record: invalid character encountered while
> > parsing JSON string"
> > 
> > Attached is a sample record that causes this problem for me (the
> > tarball includes both the original MARCXML and the BRE file generated
> > from it by marc2bre.pl). Any help would be appreciated! I can open a
> > bug on Launchpad, too, if needed.
> > 
> > Cheers,
> >  Warren
> 
> Thanks for putting together a reproducible test case, Warren.
> 
> One way of avoiding the full choke that direct_ingest.pl experiences is
> to revert the following change, which made marc2bre.pl use composed
> characters in preference to decomposed characters:
> http://svn.open-ils.org/trac/ILS/changeset/12985/trunk/Open-ILS/src/extras/import/marc2bre.pl
> 
> I don't think this is the right way to address the problem, though.
> You've traced the real corruption to biblio_fingerprint.js; from there,
> one could look at whether it's introduced by calling toLowerCase() (a
> non-locale-safe function) on "Č", which is where my immediate suspicion
> lies, or whether it's a native limitation of the SpiderMonkey JavaScript
> interpreter / JavaScript::SpiderMonkey Perl module...
> 
> And then I wonder whether it just might be better to fast-forward this
> part of the in-database ingest implementation so that we can define an
> ordered list of author sources, title sources, and quality measurements
> and run the import through the database - which would have the added
> bonus of not always heavily favouring English sources in an
> out-of-the-box implementation. I'm not as keen about reimplementing the
> record_type identification portion of biblio_fingerprint.js, mind you.
> 
> But given the existing framework, we would be hitting the database once
> in direct_ingest.pl for every record that we're subsequently going to
> import via pg_loader.pl, rather than pushing the records into the
> database and saying "ingest!" - which might be a better way of
> approaching the problem.
> 
> Okay, I've diverged significantly from the original topic, in a
> not-very-precise way. The next concrete thing I'll do is try pulling the
> toLowerCase() call from biblio_fingerprint.js just to see if that's
> where the processing death really is being introduced.


Unfortunately, just removing the toLowerCase() call from
biblio_fingerprint.js isn't enough. We still get an error thrown.
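To make the failure mode concrete: the truncation Warren observed can be reproduced in plain JavaScript. If some code path masks a code point down to its low byte (an assumption about what biblio_fingerprint.js is effectively doing; the real loss may happen elsewhere), "Č" maps straight onto the form feed control character:

```javascript
// Sketch of the suspected truncation. The "& 0xFF" mask is an
// assumption used for illustration, not the actual fingerprint code.
var code = "Č".charCodeAt(0);          // 0x010C
var truncated = code & 0xFF;           // 0x0C: only the low byte survives
var result = String.fromCharCode(truncated);

console.log(code.toString(16));        // "10c"
console.log(truncated.toString(16));   // "c"
console.log(result === "\f");          // true: form feed, which is not
                                       // legal unescaped in a JSON string
```

Any character whose code point shares a low byte with a control character would trip the same "invalid character encountered while parsing JSON string" error.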

Back to the drawing board: the JSON::XS Perl module documentation
strongly warns against relying on eval() to parse JSON in JavaScript,
and instead recommends using Douglas Crockford's json2.js parser [1]. Of
course, we're using eval() in the decodeJS() function in JSON_v1.js. So
perhaps plugging in json2.js at this point would be a quick test case
for trying to fix the existing biblio_fingerprint.js approach...

1.
http://search.cpan.org/~mlehmann/JSON-XS-2.26/XS.pm#JSON_and_ECMAscript
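For what it's worth, the swap could be as small as guarding the eval() in the decode path. This is only a sketch; the function name and structure below are assumptions for illustration, not the actual decodeJS() / JSON_v1.js code:

```javascript
// Hypothetical decode wrapper: prefer JSON.parse (supplied by json2.js
// when the engine lacks a native implementation) over eval().
function safeDecodeJSON(text) {
    if (typeof JSON !== "undefined" && typeof JSON.parse === "function") {
        return JSON.parse(text);      // strict parser from json2.js
    }
    return eval("(" + text + ")");    // legacy fallback; unsafe on bad input
}

var obj = safeDecodeJSON('{"title": "\u010Cesk\u00e1"}');
console.log(obj.title);               // Česká
```

A strict parser would at least reject the corrupted string with a clear error instead of whatever eval() does with an embedded control character, which should make the underlying truncation easier to pin down.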


