[OPEN-ILS-DEV] Re: Enabling UTF8 data in SIP2 (patch and RFC)

Mon Jan 4 20:01:25 EST 2010

It's worse than all that.  Despite the compatibility of your shiny new 3M
hardware, the spec as written says:

> "In general the packet contains *only ASCII characters*... the default
> character set will be *English 850* (as defined in the Microsoft MS DOS
> manual)."

[page 14 of the v2.12 (April 11, 2006) document]

This is part of the bizarre underpinnings of SIP, relying on an obsolete
Microsoft codepage.  It offers coverage for common Western European
languages, but in my experience it breaks on, say, Lithuanian or Arabic.
The spec goes on to say "If another character set is required, the SC and
the ACS must mutually define the character set," but it doesn't say how to
establish that.  So Evergreen isn't dumbing things down on its own account.
It actually is following the (dumb) spec.

As a result, anything we do that is not ASCII-only needs to be the
configurable exception, in order to avoid breakage with any *other* poor
bastards dutifully implementing the spec.  As your intuition suggested, I
would recommend doing any character conversion in exactly one place, and not
out in the leaf objects like Item.pm.
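
To make the "one place, ASCII by default" idea concrete, here is a rough
sketch.  It is in Python purely for illustration and is not the patch: the
name clean_text echoes the subroutine in the quoted message, but the body,
the DEFAULT_ENCODING constant, and the choice to substitute "?" for
unrepresentable characters are all my assumptions.

```python
import unicodedata

# Hypothetical default: stay ASCII-only unless a site opts in to something else.
DEFAULT_ENCODING = "ascii"


def clean_text(text: str, target_encoding: str = DEFAULT_ENCODING) -> bytes:
    """Convert outgoing text to wire bytes in one place (illustrative only)."""
    if target_encoding.lower() in ("ascii", "us-ascii"):
        # Decompose, then drop combining marks so an accented "e" degrades to "e".
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    else:
        # Assumption: compose first so single-byte code pages can map
        # precomposed letters.
        text = unicodedata.normalize("NFC", text)
    # Anything the target encoding still cannot represent becomes "?".
    return text.encode(target_encoding, errors="replace")
```

With this shape, every field going over the wire passes through the same
call, and a site that needs "cp850" or "utf-8" changes one setting rather
than every leaf object.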

A subsequent page in the spec also says "Only displayable characters (no
control characters) should be included in print or display messages from the
ACS".  The question of what is displayable obviously depends on the
character set, so that seems to rule out things like the zero-width
non-joiner used in Arabic and Hebrew.  So what character set would you use
there?  UTF-8 minus some random pieces?  Obviously this is a point of
failure in the design.

It gets worse if you try to follow along with the terminal session-setting
for "language" and give the data different conversions based on that, since
the language doesn't directly affect the character set, but you really would
want something better than 850 for Arabic, right?  Without a means to
specify that, it sure feels a lot like making it up as you go along, and we
don't need a spec for that.

The NFD vs. NFC question seems tricky too, since it seems that both forms of
data exist in the wild.
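
A small illustration of why the two forms behave differently when you strip
combining characters (Python for convenience; the function names are
hypothetical and this is not the Evergreen code):

```python
import unicodedata


def strip_combining(text: str) -> str:
    """Drop combining marks without normalizing first."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def strip_accents(text: str) -> str:
    """Normalize to NFD first, then drop combining marks."""
    return strip_combining(unicodedata.normalize("NFD", text))


nfc_title = "Pr\u00e9sentation"    # precomposed e-acute (NFC)
nfd_title = "Pre\u0301sentation"   # "e" followed by a combining acute (NFD)
```

Stripping alone only helps the NFD string, because the NFC string has no
combining mark to remove; normalizing to NFD first handles both.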

--Joe Atzberger,
Equinox Software, Inc.

On Mon, Jan 4, 2010 at 1:50 PM, Dan Scott <dan at coffeecode.net> wrote:
>
> The basic problem is that we purchased a brand-spanking-new 3M
> self-check unit just before Christmas, plugged it in, and were happy to
> see it working with the OpenNCIP / Evergreen SIP server right out of the
> box.
>
> That was the case, at least, until we ran some items through that had
> accented characters in the title field (e.g. "Présentation et synthèse
> d'une évaluation romande") - where we found the self-check threw an
> error about "bad checksum - too many retries" and turned itself off.
>
> Digging into the code, I found a few things:
>
> OpenILS::SIP::Item.pm contained a line that tried to remove accents from
> characters before returning the title; however, it assumed that the data
> was in Normalization Form D (NFD) format, and our data is in
> Normalization Form C (NFC) format. The simple fix to that was just to
> wrap the text in a Unicode::Normalize::NFD() call and things were
> okay, at least for text that could be represented as ASCII characters
> with accents.
>
> There were many other text fields that could be returned that would
> possibly contain non-ASCII characters - the user's own name, user
> address, library names, etc - and these were not being escaped the same
> way as the book titles.
>
> It bugged me more, however, that our self-check unit offered UTF8 as a
> communication encoding option in the SIP configuration interface, yet
> the Evergreen code was trying to dumb everything down to plain ASCII. In
> the self-check unit trace logs, I could see the Unicode data coming
> through, but the checksum was bad.
>
> So I dug into the OpenNCIP code and found that it calculates the
> checksum using Unicode characters (%U in the unpack() format) rather
> than bytes (%C). Trying to travel back in time and jumping into the
> minds of the original developers of the 3M spec, I guessed that they
> probably didn't anticipate multi-byte characters (hey, it was all the
> way back in 1997, according to atz) and tried changing the checksum
> algorithm implementation to just calculate the checksum on a byte-wise
> basis instead.
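
For readers following along, a minimal sketch of the difference between the
two ways of summing.  It is in Python rather than Perl, it assumes the usual
reading of the spec (16-bit two's complement of the sum, sent as four hex
digits), and it leaves out the rest of the message framing.  The function
names are hypothetical.

```python
def sip_checksum(payload: bytes) -> str:
    """16-bit two's complement of the byte sum, as four uppercase hex digits."""
    total = sum(payload) & 0xFFFF
    return format(-total & 0xFFFF, "04X")


def codepoint_checksum(text: str) -> str:
    """Same arithmetic, but summing code points instead of encoded bytes."""
    total = sum(ord(ch) for ch in text) & 0xFFFF
    return format(-total & 0xFFFF, "04X")
```

For pure ASCII the two agree; once the text contains a multi-byte UTF-8
character the sums diverge, which would be consistent with the "bad
checksum" error described above.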
>
> In addition, I opted to try to treat all of the text as UTF8 by
> explicitly decoding & encoding it as UTF8, rather than trying to just
> strip combining characters from the data. (I did a lot of reading on
> Perl + Unicode over the holidays, good holiday fun!)
>
> This combination made our self-check unit quite happy. It prints out
> receipts with Unicode character data and doesn't complain about bad
> checksums. So this code is already live at Laurentian University, for
> what that's worth.
>
> Attached, please find a patch for the OpenILS::SIP::* code that defaults
> to enabling UTF8 data in a new subroutine clean_text() in OpenILS::SIP,
> but uncommenting two lines will revert to the previous "strip
> combining characters" approach (with the addition of the NFD() call). I
> have tried to apply the clean_text() call to all of the areas where one
> would possibly pass Unicode data over the wire.
>
> Comments, suggestions, and enhancements are quite welcome. For example,
> one could conceivably define a target encoding in oils_sip.xml (e.g.
> "<encoding>iso-8859-1</encoding>") and have clean_text() invoke
> encode($target_encoding, $text) rather than assuming UTF-8, as I imagine
> there is a need to support older self-check units. An alternate
> implementation of the NFD + strip combining characters approach for
> really old SIP clients and gnarly data could be something like
> encode('ascii', $text).
>
> I opened a tracker artifact for the checksum algorithm in the OpenNCIP
> project at https://sourceforge.net/support/tracker.php?aid=2925760 with
> a patch and the gory details about the checksums and logs.
>
> Some good Perl/Unicode reading, if you're interested in digging into
> this subject:
> 1. http://juerd.nl/site.plp/perluniadvice
> 2. http://perldoc.perl.org/perlunitut.html
> 3. http://perldoc.perl.org/perlunicode.html
> 4. http://perldoc.perl.org/Unicode/Normalize.html
> 5. http://perldoc.perl.org/Encode.html
>