[OPEN-ILS-DEV] Encoding UTF-8
Dan Scott
dan at coffeecode.net
Sun Nov 23 13:38:19 EST 2008
On Fri, 2008-11-21 at 12:58 -0800, Scott McKellar wrote:
> I don't think that the way uescape() encodes UTF-8 characters is correct.
> It creates ambiguities for anybody trying to reverse the encoding.
>
> Consider the following strings:
>
> const unsigned char utf_2a[] = { 0xCF, 0xBF, '\0' };
> const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF, '\0' };
>
> The first is a two-byte UTF-8 character, and the second is a three-byte
> UTF-8 character. I don't know what kind of characters they represent,
> if any, because I hand-crafted them for my example. However each has
> a valid UTF-8 format, so far as I can tell from the Wikipedia page at:
>
> http://en.wikipedia.org/wiki/UTF-8
>
Ah, the table in Wikipedia appears to be wrong, or at least seriously
incomplete: it gives the raw bit patterns but omits the well-formedness
constraints. The corresponding table from the UTF-8 section of the
Unicode standard (Table 3-7, "Well-Formed UTF-8 Byte Sequences") is:
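    Code Points          First Byte  Second Byte  Third Byte  Fourth Byte
    U+0000..U+007F       00..7F
    U+0080..U+07FF       C2..DF      80..BF
    U+0800..U+0FFF       E0          A0..BF       80..BF
    U+1000..U+CFFF       E1..EC      80..BF       80..BF
    U+D000..U+D7FF       ED          80..9F       80..BF
    U+E000..U+FFFF       EE..EF      80..BF       80..BF
    U+10000..U+3FFFF     F0          90..BF       80..BF      80..BF
    U+40000..U+FFFFF     F1..F3      80..BF       80..BF      80..BF
    U+100000..U+10FFFF   F4          80..8F       80..BF      80..BF

Note the second-byte restrictions: after a leading 0xE0 the next byte
must fall in A0..BF, so the hand-crafted three-byte string above
(0xE0 0x8F 0xBF) is an ill-formed "overlong" encoding of U+03FF.
Well-formed UTF-8 gives each code point exactly one encoding, so a
strict parser would reject that input before uescape() ever saw it.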
> uescape() encodes each of these strings as "\u03ff". If I'm trying to
> decode "\u03ff", how am I supposed to tell which was the original UTF-8
> character? No doubt there's also a four-byte version that yields the
> same encoding, though I haven't tried to construct it.
>
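A decoder that just masks off the length bits maps both strings to the
same code point. A minimal sketch of the arithmetic (my own, not the
OpenSRF source):

    #include <stdio.h>

    int main(void) {
        const unsigned char utf_2a[] = { 0xCF, 0xBF, '\0' };
        const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF, '\0' };

        /* two-byte form: 110xxxxx 10xxxxxx carries 11 payload bits */
        unsigned cp2 = ((utf_2a[0] & 0x1F) << 6) | (utf_2a[1] & 0x3F);

        /* three-byte form: 1110xxxx 10xxxxxx 10xxxxxx carries 16 bits */
        unsigned cp3 = ((utf_3a[0] & 0x0F) << 12)
                     | ((utf_3a[1] & 0x3F) << 6)
                     |  (utf_3a[2] & 0x3F);

        printf("U+%04X U+%04X\n", cp2, cp3);  /* prints U+03FF U+03FF */
        return 0;
    }

A strict decoder never faces the choice, since the overlong form is
rejected up front (per the table above).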
> Next, consider the following two strings:
>
> const unsigned char utf_3b[] = { 0xE3, 0x8C, 0xB3, '0', '\0' };
> const unsigned char utf_4b[] = { 0xF0, 0xB3, 0x8C, 0xB0, '\0' };
>
> The first is a three-byte UTF-8 character followed by a plain ASCII zero.
> The second is a four-byte UTF-8 character. Again, both strings appear to
> be valid UTF-8, but uescape() encodes them the same: "\u33330".
>
> The latter is a different kind of ambiguity, but it stems from the same
> cause. When we encode a multibyte UTF-8 character, we strip off the
> length bits. Then no one trying to decode the results has any way to
> guess what the original length was.
>
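Mimicking that description in a few lines of C makes the collision easy
to reproduce. This is my guess at what uescape() effectively does,
written from the description above rather than from the OpenSRF source:

    #include <stdio.h>

    /* Pass ASCII through; for a multibyte sequence, strip the length
     * bits and print the payload as hex after "\u". This mimics the
     * behaviour described above, not the actual uescape() code. */
    static void mock_uescape(const unsigned char *s) {
        while (*s) {
            unsigned char c = *s++;
            unsigned long cp;
            int extra;

            if (c < 0x80) { putchar(c); continue; }

            if      ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
            else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
            else                         { cp = c & 0x07; extra = 3; }

            while (extra-- > 0 && *s)
                cp = (cp << 6) | (*s++ & 0x3F);

            printf("\\u%04lx", cp);
        }
        putchar('\n');
    }

    int main(void) {
        const unsigned char utf_3b[] = { 0xE3, 0x8C, 0xB3, '0', '\0' };
        const unsigned char utf_4b[] = { 0xF0, 0xB3, 0x8C, 0xB0, '\0' };
        mock_uescape(utf_3b);  /* prints \u33330 */
        mock_uescape(utf_4b);  /* also prints \u33330 */
        return 0;
    }

The same harness maps both utf_2a and utf_3a from the first example to
\u03ff: once the length bits are gone, the original byte count is
unrecoverable.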
> The site I have been relying on for information about JSON:
>
> http://www.json.org/
>
> ...says that a UTF-8 character should be encoded as "\uxxxx", where each
> x is a hexadecimal digit. Specifically it says FOUR hexadecimal digits.
>
> The restriction to four hexadecimal characters works only for two-byte
> UTF-8 characters. It works for three-byte UTF-8 characters if you discard
> the length bits, leading to the kind of ambiguity described above. It
> can't possibly work for four-byte UTF-8 characters, unless you break up
> the four bytes into two two-byte sequences and encode them separately.
>
> The same site is notably silent on exactly how the encoding is to be done.
>
> Supposedly JSON is defined by RFC 4627:
>
> http://tools.ietf.org/html/rfc4627
>
> In section 2.5 it says:
>
> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a twelve-character sequence,
> encoding the UTF-16 surrogate pair. So, for example, a string
> containing only the G clef character (U+1D11E) may be represented as
> "\uD834\uDD1E".
>
> I haven't figured out exactly what that means, but apparently a UTF-8
> character longer than two bytes needs to be represented as two successive
> sets of four hex digits, each prefaced by "\u". That's not what we're
> doing.
>
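What the RFC is saying is that only characters outside the Basic
Multilingual Plane (exactly the ones that take four bytes in UTF-8)
need the twelve-character form; code points up through U+FFFF, which
covers every two- and three-byte UTF-8 sequence, still fit in a single
\uxxxx. The conversion is mechanical. A sketch of it (mine, not from
the RFC or from OpenSRF):

    #include <stdio.h>

    /* Print a code point as a JSON escape: a single \uxxxx inside the
     * Basic Multilingual Plane, a surrogate pair above it. */
    static void json_escape_codepoint(unsigned long cp) {
        if (cp <= 0xFFFF) {
            printf("\\u%04lx", cp);
        } else {
            cp -= 0x10000;  /* leaves a 20-bit value */
            printf("\\u%04lx", 0xD800 + (cp >> 10));    /* high surrogate */
            printf("\\u%04lx", 0xDC00 + (cp & 0x3FF));  /* low surrogate  */
        }
    }

    int main(void) {
        json_escape_codepoint(0x1D11E);  /* G clef: \ud834\udd1e */
        putchar('\n');
        json_escape_codepoint(0x33330);  /* utf_4b above: \ud88c\udf30 */
        putchar('\n');
        return 0;
    }

Fed U+1D11E this prints the RFC's G clef example (in lowercase hex),
and fed U+33330, the four-byte character from utf_4b, it prints
\ud88c\udf30: an unambiguous form that uescape() could emit instead of
\u33330.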
> Scott McKellar
> http://home.swbell.net/mck9/ct/
>
I'm just driving past this topic long enough to point out a similar
discussion from about a year ago that Sam Ruby kicked off (albeit
concerning JSON parsing & decoding in other languages):
http://www.intertwingly.net/blog/2007/11/15/Astral-Plane-Characters-in-Json
The nice thing is that he offers an example and a reasonable test case
(round-trip the beast).
Would it make sense to adopt an existing JSON parser/decoder C library
rather than trying to maintain our own library? There are a few with
GPL-compatible licenses that appear to have recent activity:
* json-c (http://oss.metaparadigm.com/json-c/) (MIT license) - this one
comes with a few tests, which I was able to modify to test out the Faihu
character Sam Ruby talked about; the output doesn't look great though
* mjson (http://sourceforge.net/projects/mjson/) (GPL license)
* YAJL (http://lloydforge.org/projects/yajl/) (New BSD license) - comes
with an extensive test suite, but a quick test of the json_reformat
binary with { "test": "\ud800\udf46" } results in { "test": "!" } (where
"!" stands in for the literal character U+10346; I would have expected
it to keep the escaped Unicode notation). That surrogate pair should
decode to U+10346, as the sketch below shows.
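For completeness, the decode half of that round trip is just as
mechanical; a sketch (mine, not taken from YAJL or any of the libraries
above) that folds Sam Ruby's surrogate pair back into a code point and
re-emits the UTF-8 bytes:

    #include <stdio.h>

    int main(void) {
        /* the pair from { "test": "\ud800\udf46" } */
        unsigned long hi = 0xD800, lo = 0xDF46;

        /* combine high and low surrogates into one code point */
        unsigned long cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);

        /* cp is U+10346, which takes the four-byte UTF-8 form */
        printf("U+%05lX = %02lX %02lX %02lX %02lX\n", cp,
               0xF0 | (cp >> 18),
               0x80 | ((cp >> 12) & 0x3F),
               0x80 | ((cp >> 6) & 0x3F),
               0x80 | (cp & 0x3F));
        return 0;
    }

So a faithful round trip takes { "test": "\ud800\udf46" } to the UTF-8
bytes F0 90 8D 86 and back, which is the round trip the blog post above
proposes.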
If we keep going down the road of an OpenSRF-specific JSON
implementation - or if we want to contribute some code to one of the
other libraries:
* simplejson (C implementation of a Python extension) (Modified BSD
license) -
http://simplejson.googlecode.com/svn/tags/simplejson-2.0.4/simplejson/_speedups.c
has code for explicitly dealing with surrogate pairs; this might be the
most promising template for encoding and decoding Unicode outside the
Basic Multilingual Plane.
Dan