[OPEN-ILS-DEV] Issues with direct_ingest.pl

Warren Layton warren.layton at gmail.com
Mon Nov 17 21:46:39 EST 2008


I recently imported nearly 300,000 Unicorn records into Evergreen,
many with accented characters (we have a lot of French and bilingual
English-French material).

However, direct_ingest.pl rejected 17 of these records. A handful of
the rejects are our fault, but I think I'm hitting some sort of
corner case with most of the others.

What I suspect is happening is that direct_ingest.pl rejects records
that have an accented character between square brackets ("[" and "]")
in a field. For example, a record with the following 260 subfield will
be rejected:

  <subfield code=\"b\">[Bibliothe&#x300;que nationale du Canada],</subfield>

However, if I remove _either_ the square brackets _or_ the "&#x300;",
the record is processed successfully. My first guess was that the
square brackets ("[" and "]") would need to be escaped for JSON, but
since all of these tags are inside a string, the only characters here
that need to be escaped are double quotes (").
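
As a sanity check on the JSON side, here is a minimal sketch in Python
(not the Perl that direct_ingest.pl itself uses, and not its actual
code path). It assumes the subfield is simply stored as a JSON string
value, and shows that a standard JSON encoder leaves the square
brackets and the "&#x300;" entity alone and only escapes the double
quotes:

```python
import json

# The subfield as it reads once the \" escapes from the .bre file are undone.
subfield = (
    '<subfield code="b">'
    "[Bibliothe&#x300;que nationale du Canada],"
    "</subfield>"
)

# Encode it the way any JSON string value would be encoded.
encoded = json.dumps(subfield)
print(encoded)

# Decoding gives back exactly the original text.
decoded = json.loads(encoded)
print(decoded == subfield)
```

If that holds for the JSON library the ingest scripts use, the brackets
by themselves should not be what breaks the record.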

I should also note that other records with accented characters (e.g.,
"e&#x300;") and/or square brackets were processed just fine, but I
think those with accented characters _between_ square brackets, as
above, are all in the reject pile (not 100% sure, but I'll be
verifying this ASAP).
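
One way to check that hypothesis against the reject pile is a small
filter like the sketch below. It is only an illustration: the regular
expression and the has_bracketed_entity name are my own assumptions,
and it only looks for hexadecimal character references such as
"&#x300;" that appear between a "[" and the next "]".

```python
import re

# Hypothetical pattern: "[", then any non-"]" text, then a hexadecimal
# character reference such as &#x300;, then any non-"]" text, then "]".
BRACKETED_ENTITY = re.compile(r"\[[^\]]*&#x[0-9A-Fa-f]+;[^\]]*\]")


def has_bracketed_entity(record_text):
    """Return True if a character reference sits between "[" and "]"."""
    return BRACKETED_ENTITY.search(record_text) is not None


# Strings modelled on the cases described in this message.
samples = [
    "[Bibliothe&#x300;que nationale du Canada],",  # entity inside brackets
    "Bibliothe&#x300;que nationale du Canada,",    # entity, no brackets
    "[Ottawa] : Bibliothe&#x300;que,",             # entity outside brackets
]

for text in samples:
    print(has_bracketed_entity(text), text)
```

Running each rejected and each accepted record through a check like
this would show whether the pattern really separates the two piles.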

I'm attaching a single problem record that's already been passed
through marc2bre.pl in case anyone wants to try reproducing the
problem. Field 260, subfield b contains the problem. I can provide
others if necessary.

I'll continue debugging but I would appreciate any insights!

Cheers,
  Warren
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eg_sample.bre.gz
Type: application/x-gzip
Size: 834 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20081117/846e83b9/attachment-0001.bin 

