[OPEN-ILS-DEV] Re: direct_ingest.pl, biblio_fingerprint.js and Unicode chars

Dan Scott dan at coffeecode.net
Mon Nov 30 16:46:58 EST 2009


On Mon, 2009-11-30 at 15:00 -0500, Dan Scott wrote:
> On Mon, 2009-11-30 at 14:20 -0500, Dan Scott wrote:
> > On Fri, 2009-11-27 at 15:33 -0500, Warren Layton wrote:
> > > I'm trying to import a number of bib records with "special" characters
> > > in the MARC fields. I've gotten as far as running direct_ingest.pl but
> > > I'm noticing that biblio_fingerprint.js chokes on a few of them.
> > > 
> > > Looking a bit closer, I noticed that biblio_fingerprint.js chops
> > > character codes down to two least significant hex digits. For example,
> > > biblio_fingerprint.js turns "&#x010c;" (Č) into "&#x0c" ("form feed"),
> > > which causes the direct_ingest.pl to skip the record and output the
> > > following error:
> > > 
> > >   "Couldn't process record: invalid character encountered while
> > > parsing JSON string"
> > > 
> > > Attached is a sample record that causes this problem for me (the
> > > tarball includes both the original MARCXML and the BRE file generated
> > > from it by marc2bre.pl). Any help would be appreciated! I can open a
> > > bug on Launchpad, too, if needed.
> > > 
> > > Cheers,
> > >  Warren
> > 
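The arithmetic behind the symptom Warren describes can be sketched in JavaScript. This is a hypothetical illustration only, not the code in biblio_fingerprint.js: the helper name lowByte is an assumption, and nothing here establishes where in the ingest chain the truncation actually happens.

```javascript
// Hypothetical illustration; NOT taken from biblio_fingerprint.js.
// Keeping only the two least significant hex digits of a code point
// maps U+010C (Č) onto U+000C (form feed).
function lowByte(ch) {
    return ch.charCodeAt(0) & 0xff;
}

var original = "\u010c";   // Č, LATIN CAPITAL LETTER C WITH CARON
var truncated = String.fromCharCode(lowByte(original));

console.log(original.charCodeAt(0).toString(16));   // hex value of the original code point
console.log(truncated.charCodeAt(0).toString(16));  // hex value after truncation

// A raw control character inside a JSON string is not valid JSON,
// so a strict parser is expected to reject it.
var strictParseFailed = false;
try {
    JSON.parse('"' + truncated + '"');
} catch (e) {
    strictParseFailed = true;
}
console.log(strictParseFailed);
```

An unescaped form feed inside a JSON string would be consistent with the "invalid character encountered while parsing JSON string" error quoted above.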
> > Thanks for putting together a reproducible test case, Warren.
> > 
> > One way of avoiding the full choke that direct_ingest.pl experiences is
> > to revert the following change that made marc2bre.pl use composed
> > characters in preference to decomposed characters:
> > http://svn.open-ils.org/trac/ILS/changeset/12985/trunk/Open-ILS/src/extras/import/marc2bre.pl
> > 
> > I don't think this is the right way to address the problem, though.
> > You've traced the real corruption to biblio_fingerprint.js; from there,
> > one could look at whether it's introduced by calling toLowerCase() (a
> > non-locale-safe function) on Č (which is where my immediate suspicion
> > lies) or whether it's a native limitation of the SpiderMonkey JavaScript
> > interpreter / JavaScript::SpiderMonkey Perl module... 
> > 
> > And then I wonder whether it just might be better to fast-forward this
> > part of the in-database ingest implementation so that we can define an
> > ordered list of author sources, title sources, and quality measurements
> > and run the import through the database - which would have the added
> > bonus of not always heavily favouring English sources in an
> > out-of-the-box implementation. I'm not as keen about reimplementing the
> > record_type identification portion of biblio_fingerprint.js, mind you.
> > 
> > But given the existing framework, we would be hitting the database once
> > in direct_ingest.pl for every record that we're subsequently going to
> > import via pg_loader.pl, rather than pushing the records into the
> > database and saying "ingest!" - which might be a better way of
> > approaching the problem.
> > 
> > Okay, I've diverged significantly from the original topic, in a
> > not-very-precise way. The next concrete thing I'll do is try pulling the
> > toLowerCase() call from biblio_fingerprint.js just to see if that's
> > where the processing death really is being introduced.
> 
> 
> Unfortunately, just removing the toLowerCase() call from
> biblio_fingerprint.js isn't enough. We still get an error thrown.
> 
> Back to the drawing board: the JSON::XS Perl module documentation
> strongly warns against relying on eval() to convert JSON to JavaScript,
> and instead recommends using Douglas Crockford's json2.js parser [1]. Of
> course, we're using eval() in the decodeJS() function in JSON_v1.js. So
> perhaps plugging in json2.js at this point would be a quick test case
> for trying to fix the existing biblio_fingerprint.js approach...
> 
> 1.
> http://search.cpan.org/~mlehmann/JSON-XS-2.26/XS.pm#JSON_and_ECMAscript
> 
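A minimal sketch of the difference between eval()-based decoding and a strict JSON parser follows. It assumes an engine that provides JSON.parse (the interface json2.js defines); decodeWithEval and decodeStrict are hypothetical stand-ins, not the decodeJS() function from JSON_v1.js, and the sample inputs are invented for illustration.

```javascript
// Hypothetical stand-in for an eval()-based decoder; not the real decodeJS().
function decodeWithEval(text) {
    return eval("(" + text + ")");
}

// Strict decoding: anything that is not valid JSON is reported, not executed.
function decodeStrict(text) {
    try {
        return { ok: true, value: JSON.parse(text) };
    } catch (e) {
        return { ok: false, error: String(e) };
    }
}

var valid = '{"title":"\\u010casopis"}';   // Č written as a JSON escape
var notJson = '{"n": 1 + 1}';              // a JavaScript expression, not JSON

// Both decoders agree on well-formed input.
console.log(decodeWithEval(valid).title === decodeStrict(valid).value.title);

// eval() silently evaluates the expression; the strict parser refuses it.
console.log(decodeWithEval(notJson).n);
console.log(decodeStrict(notJson).ok);
```

The point of the comparison is that eval() accepts anything that happens to be valid JavaScript, so malformed input can slip through or fail in unexpected places, whereas a strict parser fails at the decoding step with a clear error.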

For those still paying attention, I should note that we last switched
from composed to decomposed Unicode characters in marc2bre.pl back at
http://svn.open-ils.org/trac/ILS/changeset/10283/trunk/Open-ILS/src/extras/import/marc2bre.pl  
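For anyone unfamiliar with the composed/decomposed distinction, here is a small sketch using the same character from Warren's record. It assumes a JavaScript engine with String.prototype.normalize (ES2015 or later); the codePoints helper is hypothetical and only exists to print the result. It says nothing about how marc2bre.pl itself performs the conversion.

```javascript
var composed = "\u010c";      // Č as a single code point (NFC form)
var decomposed = "C\u030c";   // "C" followed by COMBINING CARON (NFD form)

// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(s) {
    return Array.from(s).map(function (c) {
        return "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    });
}

// The two forms render identically but are different strings.
console.log(composed === decomposed);

// Normalization converts between them.
console.log(composed.normalize("NFD") === decomposed);
console.log(decomposed.normalize("NFC") === composed);

console.log(codePoints(composed));
console.log(codePoints(decomposed));
```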


