[OPEN-ILS-DEV] Enabling UTF8 data in SIP2 (patch and RFC)

Dan Scott dan at coffeecode.net
Mon Jan 4 20:34:05 EST 2010


On Mon, 2010-01-04 at 20:01 -0500, Joe Atzberger wrote:
> It's worse than all that.  Despite the compatibility of your shiny new
> 3M hardware, the spec as written says:
> 
>         "In general the packet contains only ASCII characters... the
>         default character set will be English 850 (as defined in the
>         Microsoft MS DOS manual)."
>  
> [page 14 of the v2.12 (April 11, 2006) document]
> 

Right, I quoted that same bit of the spec in the openncip tracker
artifact that I opened and linked to in the previous email:
https://sourceforge.net/support/tracker.php?aid=2925760

> This is part of the bizarre underpinnings of SIP, relying on an
> obsolete Microsoft codepage.  It offers coverage for common Western
> European languages, but in my experience it breaks on, say, Lithuanian
> or Arabic.  The spec goes on to say "If another character set is
> required, the SC and the ACS must mutually define the character set,"
> but it doesn't say how to establish that.  So Evergreen isn't dumbing
> things down on its own account.  It actually is following the (dumb)
> spec.  

Right, the current Evergreen code is (incompletely, in that it is only
mangling the title, and not ensuring that the incoming data is NFD)
following the strictest interpretation of the SIP2 spec. However, my
"shiny new 3M hardware" has defined UTF8 (and many other options) as an
acceptable character set, and I would wager that other shiny new SIP
clients from other manufacturers have also extended the SIP2 spec to
support UTF8.

> As a result, anything we do that is not ASCII-only needs to be the
> configurable exception, in order to avoid breakage with any *other*
> poor bastards dutifully implementing the spec.  As your intuition
> suggested, I would recommend doing any character conversion in exactly
> one place, and not out in the leaf objects like Item.pm.  

Fair enough. So if I uncomment the two lines in
OpenILS::SIP::clean_text() in my patch (the NFD call and the s/\pM+//og
substitution) so that they are the default for now, then we:

1. Get a closer-to-working strict interpretation of the spec (although
the encode('ascii', $text) route is probably the right direction to
pursue for a truly strict approach); and

2. Have an easy UTF8 option for sites that are using modern SIP clients
and that don't want to mangle their data, by commenting those lines
again.
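
To make the two options concrete, here is a rough sketch of the idea. It
is written in Python purely for illustration and is not the Perl from
the patch. The name clean_text mirrors the function discussed above, but
the strip_marks flag and the strict_ascii helper are hypothetical
stand-ins for "comment or uncomment those two lines" and for the
encode('ascii', $text) route.

```python
import unicodedata


def clean_text(text, strip_marks=True):
    """Optionally decompose text and drop combining marks.

    strip_marks is a hypothetical switch standing in for the two
    lines (NFD + s/\\pM+//og) being commented or uncommented.
    """
    if strip_marks:
        # Decompose first so precomposed letters split into base + mark.
        decomposed = unicodedata.normalize("NFD", text)
        # Drop every character in a Mark category (Mn, Mc, Me),
        # which is what \pM matches in a Perl regex.
        text = "".join(
            ch for ch in decomposed
            if not unicodedata.category(ch).startswith("M")
        )
    return text


def strict_ascii(text):
    """Hypothetical stricter variant: anything still non-ASCII becomes '?'."""
    return clean_text(text).encode("ascii", errors="replace").decode("ascii")
```

With stripping on, "Émile Zola" comes out as "Emile Zola"; with it off,
the title passes through untouched for clients that accept UTF8. Letters
with no decomposition (for example "Ł") survive the mark stripping, which
is why a truly strict site would still need the final ASCII step.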

> A subsequent page in the spec also says "Only displayable characters
> (no control characters) should be included in print or display
> messages from the ACS".  The question of what is displayable obviously
> depends on the character set, so that seems to rule out things like
> the zero-width non-joiner used in Arabic and Hebrew.  So what
> character set would you use there?  UTF-8 minus some random pieces?
>  Obviously this is a point of failure in the design.

Yep, I read that too. There are edge cases in any spec, and this spec
seems to be made up of razors. Given that 3M sells these units with a
Chinese localization, it wouldn't surprise me if they ended up ignoring
this clause entirely. That's probably an inherent problem with specs
controlled by a single entity that also happens to be one of the primary
vendors of products that implement the spec; they get to ignore and
extend whatever they like. We can still try to make our implementation
work with the hardware that we have on hand though, right?

> It gets worse if you try to follow along with the terminal
> session-setting for "language" and give the data different conversions
> based on that, since the language doesn't directly affect the
> character set, but you really would want something better than 850 for
> Arabic, right?  Without a means to specify that, it sure feels a lot
> like making it up as you go along, and we don't need a spec for that. 

Right. Which must be why the spec left open the possibility for the
client and server to use other character sets, and which is why I put in
the work to enable UTF8 in the first place.

> The NFD vs. NFC question seems tricky too, since it seems that both
> forms of data exist in the wild.  
> 

It shouldn't be too tricky, as long as we don't make assumptions about
the form of the incoming data and instead use
Unicode::Normalize::NFD() to convert it to the necessary form whenever
we're doing things like s/\pM+//og - right?
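
A small illustration of why the normalization step matters, again as a
Python sketch rather than the actual Perl (strip_marks_only is a
hypothetical helper that does only what s/\pM+//og does, with no
normalization):

```python
import unicodedata


def strip_marks_only(text):
    """Remove combining marks without normalizing first (hypothetical helper)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


nfc_title = "caf\u00e9"    # precomposed: one code point for the accented e
nfd_title = "cafe\u0301"   # decomposed: 'e' followed by a combining acute

# Precomposed input has no separate mark, so nothing is stripped.
nfc_result = strip_marks_only(nfc_title)

# Decomposed input loses its combining acute.
nfd_result = strip_marks_only(nfd_title)

# Normalizing to NFD first makes both inputs behave the same way.
normalized_result = strip_marks_only(unicodedata.normalize("NFD", nfc_title))
```

Here nfc_result still contains the accented character, while nfd_result
and normalized_result are both plain "cafe". Since both forms exist in
the wild, calling NFD unconditionally before the substitution is the
safe choice.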

So -- if I uncomment those two lines in OpenILS::SIP::clean_text() so
that the combining characters are stripped by default, would you find
this patch acceptable?


