[OPEN-ILS-DEV] Enabling UTF8 data in SIP2 (patch and RFC)

Mon Jan 4 13:50:45 EST 2010

I want to put this out there for discussion/inspection, as this is my
first time touching the SIP code.

The basic problem is that we purchased a brand-spanking-new 3M
self-check unit just before Christmas, plugged it in, and were happy to
see it working with the OpenNCIP / Evergreen SIP server right out of the
box.

That was the case, at least, until we ran some items through that had
accented characters in the title field (e.g. "Présentation et synthèse
d'une évaluation romande") - where we found the self-check threw an
error about "bad checksum - too many retries" and turned itself off.

Digging into the code, I found a few things:

OpenILS::SIP::Item.pm contained a line that tried to remove accents from
characters before returning the title; however, it assumed that the data
was in Normalization Form D (NFD), and our data is in Normalization
Form C (NFC). The simple fix to that was just to wrap the text in a
Unicode::Normalize::NFD() call and things were okay, at least for text
that could be represented as ASCII characters with accents.
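
To illustrate why the normalization form matters here, this is a
minimal sketch in Python rather than the Perl from the patch. The
function name strip_accents() and its normalize_first flag are
hypothetical; the sketch only shows that stripping combining marks does
nothing to NFC data unless the text is decomposed to NFD first.

```python
import unicodedata


def strip_accents(text: str, normalize_first: bool = True) -> str:
    """Drop combining marks, optionally decomposing to NFD first.

    Hypothetical helper for illustration; not the code from the patch.
    """
    if normalize_first:
        # NFD splits a precomposed "é" (U+00E9) into "e" + U+0301.
        text = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters, non-zero for marks.
    return "".join(ch for ch in text if not unicodedata.combining(ch))


# Force the sample title into NFC, as our data was.
nfc_title = unicodedata.normalize("NFC", "Présentation et synthèse")

# Without NFD there are no separate marks to remove, so nothing changes.
print(strip_accents(nfc_title, normalize_first=False))

# With NFD first, the accents are separated and can be dropped.
print(strip_accents(nfc_title))
```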

There were many other text fields that could be returned that would
possibly contain non-ASCII characters - the user's own name, user
address, library names, etc - and these were not being escaped the same
way as the book titles.

It bugged me more, however, that our self-check unit offered UTF8 as a
communication encoding option in the SIP configuration interface, yet
the Evergreen code was trying to dumb everything down to plain ASCII. In
the self-check unit trace logs, I could see the Unicode data coming
through, but the checksum was bad.

So I dug into the OpenNCIP code and found that it calculates the
checksum using Unicode characters (%U in the unpack() format) rather
than bytes (%C). Trying to travel back in time and jumping into the
minds of the original developers of the 3M spec, I guessed that they
probably didn't anticipate multi-byte characters (hey, it was all the
way back in 1997, according to atz) and tried changing the checksum
algorithm implementation to just calculate the checksum on a byte-wise
basis instead.
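
The difference between the two approaches can be sketched as follows,
in Python rather than Perl. This assumes the usual description of the
SIP2 checksum (sum the characters, keep the low 16 bits, take the
two's complement, print four hex digits); the function names and the
sample message are hypothetical and this is not the OpenNCIP code.

```python
def sip2_checksum(values) -> str:
    """Negated sum of the given integers, low 16 bits, as four hex digits."""
    return format(-sum(values) & 0xFFFF, "04X")


def checksum_by_code_point(message: str) -> str:
    # Roughly what summing Unicode characters does: one value per
    # code point, regardless of how many bytes go over the wire.
    return sip2_checksum(ord(ch) for ch in message)


def checksum_by_byte(message: str, encoding: str = "utf-8") -> str:
    # Sum the bytes that are actually transmitted.
    return sip2_checksum(message.encode(encoding))


ascii_msg = "AJSome plain title|AY1AZ"      # hypothetical message tail
accent_msg = "AJPr\u00e9sentation|AY1AZ"    # same shape, with an accent

# For pure ASCII the two methods agree; with a multi-byte character
# they diverge, which a byte-counting client would report as a bad
# checksum.
print(checksum_by_code_point(ascii_msg), checksum_by_byte(ascii_msg))
print(checksum_by_code_point(accent_msg), checksum_by_byte(accent_msg))
```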

In addition, I opted to try to treat all of the text as UTF8 by
explicitly decoding & encoding it as UTF8, rather than trying to just
strip combining characters from the data. (I did a lot of reading on
Perl + Unicode over the holidays, good holiday fun!)

This combination made our self-check unit quite happy. It prints out
receipts with Unicode character data and doesn't complain about bad
checksums. So this code is already live at Laurentian University, for
what that's worth.

Attached, please find a patch for the OpenILS::SIP::* code that defaults
to enabling UTF8 data in a new subroutine clean_text() in OpenILS::SIP;
uncommenting two lines reverts to the previous "strip combining
characters" approach (with the addition of the NFD() call). I have tried
to apply the clean_text() call to all of the areas where one would
possibly pass Unicode data over the wire.

Comments, suggestions, and enhancements are quite welcome. For example,
one could conceivably define a target encoding in oils_sip.xml (e.g.
"<encoding>iso-8859-1</encoding>") and have clean_text() invoke
encode($target_encoding, $text) rather than assuming UTF-8, as I imagine
there is a need to support older self-check units. An alternate
implementation of the NFD + strip combining characters approach for
really old SIP clients and gnarly data could be something like
encode('ascii', $text).
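
A rough sketch of the configurable-encoding idea, in Python rather than
Perl and not taken from the attached patch. The target_encoding
parameter stands in for a value that would be read from oils_sip.xml;
the NFC step and the fallback that decomposes and drops whatever the
target cannot represent are my assumptions about one reasonable
behaviour, not something the patch does.

```python
import unicodedata


def clean_text(text: str, target_encoding: str = "utf-8") -> bytes:
    """Encode text for the wire in the configured encoding.

    Hypothetical Python analogue of the clean_text() idea above.
    """
    # Assumption: compose first so that precomposed characters the
    # target encoding supports (e.g. "é" in iso-8859-1) are kept.
    text = unicodedata.normalize("NFC", text)
    try:
        return text.encode(target_encoding)
    except UnicodeEncodeError:
        # Fallback: decompose the whole string and drop everything the
        # target cannot represent, which removes combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        return decomposed.encode(target_encoding, errors="ignore")


title = "Pr\u00e9sentation"
print(clean_text(title))                 # UTF-8 bytes
print(clean_text(title, "iso-8859-1"))   # single-byte accented form
print(clean_text(title, "ascii"))        # accent stripped
```

Note that the fallback is coarse: once any character fails to encode,
every accent in the string is stripped, including ones the target
encoding could have represented.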

I opened a tracker artifact for the checksum algorithm in the OpenNCIP
project at https://sourceforge.net/support/tracker.php?aid=2925760 with
a patch and the gory details about the checksums and logs.

Some good Perl/Unicode reading, if you're interested in digging into
this subject:
1. http://juerd.nl/site.plp/perluniadvice
2. http://perldoc.perl.org/perlunitut.html
3. http://perldoc.perl.org/perlunicode.html
4. http://perldoc.perl.org/Unicode/Normalize.html
5. http://perldoc.perl.org/Encode.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clean_text.diff
Type: text/x-patch
Size: 6382 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20100104/95b1bc8b/attachment.bin 
