[OPEN-ILS-DEV] Encoding UTF-8
Dan Scott
dan at coffeecode.net
Sun Nov 23 13:38:19 EST 2008
On Fri, 2008-11-21 at 12:58 -0800, Scott McKellar wrote:
> I don't think that the way uescape() encodes UTF-8 characters is correct.
> It creates ambiguities for anybody trying to reverse the encoding.
>
> Consider the following strings:
>
> const unsigned char utf_2a[] = { 0xCF, 0xBF, '\0' };
> const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF, '\0' };
>
> The first is a two-byte UTF-8 character, and the second is a three-byte
> UTF-8 character. I don't know what kind of characters they represent,
> if any, because I hand-crafted them for my example. However each has
> a valid UTF-8 format, so far as I can tell from the Wikipedia page at:
>
> http://en.wikipedia.org/wiki/UTF-8
>
Ah, the table in Wikipedia appears to be wrong, or at least seriously
incomplete: it gives the raw bit patterns but omits the well-formedness
constraints. The corresponding table from the UTF-8 section of the
Unicode standard (Table 3-7, "Well-Formed UTF-8 Byte Sequences") is:
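    Code Points          First Byte  Second Byte  Third Byte  Fourth Byte
    U+0000..U+007F       00..7F
    U+0080..U+07FF       C2..DF      80..BF
    U+0800..U+0FFF       E0          A0..BF       80..BF
    U+1000..U+CFFF       E1..EC      80..BF       80..BF
    U+D000..U+D7FF       ED          80..9F       80..BF
    U+E000..U+FFFF       EE..EF      80..BF       80..BF
    U+10000..U+3FFFF     F0          90..BF       80..BF      80..BF
    U+40000..U+FFFFF     F1..F3      80..BF       80..BF      80..BF
    U+100000..U+10FFFF   F4          80..8F       80..BF      80..BF

Note the second-byte restrictions: after a leading 0xE0 the next byte
must fall in A0..BF, so the hand-crafted three-byte string above
(0xE0 0x8F 0xBF) is an ill-formed "overlong" encoding of U+03FF.
Well-formed UTF-8 gives each code point exactly one encoding, so a
strict parser would reject that input before uescape() ever saw it.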
> uescape() encodes each of these strings as "\u03ff". If I'm trying to
> decode "\u03ff", how am I supposed to tell which was the original UTF-8
> character? No doubt there's also a four-byte version that yields the
> same encoding, though I haven't tried to construct it.
>
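A decoder that just masks off the length bits maps both strings to the
same code point. A minimal sketch of the arithmetic (my own, not the
OpenSRF source):

    #include <stdio.h>

    int main(void) {
        const unsigned char utf_2a[] = { 0xCF, 0xBF, '\0' };
        const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF, '\0' };

        /* two-byte form: 110xxxxx 10xxxxxx carries 11 payload bits */
        unsigned cp2 = ((utf_2a[0] & 0x1F) << 6) | (utf_2a[1] & 0x3F);

        /* three-byte form: 1110xxxx 10xxxxxx 10xxxxxx carries 16 bits */
        unsigned cp3 = ((utf_3a[0] & 0x0F) << 12)
                     | ((utf_3a[1] & 0x3F) << 6)
                     |  (utf_3a[2] & 0x3F);

        printf("U+%04X U+%04X\n", cp2, cp3);  /* prints U+03FF U+03FF */
        return 0;
    }

A strict decoder never faces the choice, since the overlong form is
rejected up front (per the table above).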
> Next, consider the following two strings:
>
> const unsigned char utf_3b[] = { 0xE3, 0x8C, 0xB3, '0', '\0' };
> const unsigned char utf_4b[] = { 0xF0, 0xB3, 0x8C, 0xB0, '\0' };
>
> The first is a three-byte UTF-8 character followed by a plain ASCII zero.
> The second is a four-byte UTF-8 character. Again, both strings appear to
> be valid UTF-8, but uescape() encodes them the same: "\u33330".
>
> The latter is a different kind of ambiguity, but it stems from the same
> cause. When we encode a multibyte UTF-8 character, we strip off the
> length bits. Then no one trying to decode the results has any way to
> guess what the original length was.
>
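Mimicking that description in a few lines of C makes the collision easy
to reproduce. This is my guess at what uescape() effectively does,
written from the description above rather than from the OpenSRF source:

    #include <stdio.h>

    /* Pass ASCII through; for a multibyte sequence, strip the length
     * bits and print the payload as hex after "\u". This mimics the
     * behaviour described above, not the actual uescape() code. */
    static void mock_uescape(const unsigned char *s) {
        while (*s) {
            unsigned char c = *s++;
            unsigned long cp;
            int extra;

            if (c < 0x80) { putchar(c); continue; }

            if      ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
            else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
            else                         { cp = c & 0x07; extra = 3; }

            while (extra-- > 0 && *s)
                cp = (cp << 6) | (*s++ & 0x3F);

            printf("\\u%04lx", cp);
        }
        putchar('\n');
    }

    int main(void) {
        const unsigned char utf_3b[] = { 0xE3, 0x8C, 0xB3, '0', '\0' };
        const unsigned char utf_4b[] = { 0xF0, 0xB3, 0x8C, 0xB0, '\0' };
        mock_uescape(utf_3b);  /* prints \u33330 */
        mock_uescape(utf_4b);  /* also prints \u33330 */
        return 0;
    }

The same harness maps both utf_2a and utf_3a from the first example to
\u03ff: once the length bits are gone, the original byte count is
unrecoverable.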
> The site I have been relying on for information about JSON:
>
> http://www.json.org/
>
> ...says that a UTF-8 character should be encoded as "\uxxxx", where each
> x is a hexadecimal digit. Specifically it says FOUR hexadecimal digits.
>
> The restriction to four hexadecimal characters works only for two-byte
> UTF-8 characters. It works for three-byte UTF-8 characters if you discard
> the length bits, leading to the kind of ambiguity described above. It
> can't possibly work for four-byte UTF-8 characters, unless you break up
> the four bytes into two two-byte sequences and encode them separately.
>
> The same site is notably silent on exactly how the encoding is to be done.
>
> Supposedly JSON is defined by RFC 4627:
>
> http://tools.ietf.org/html/rfc4627
>
> In section 2.5 it says:
>
> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a twelve-character sequence,
> encoding the UTF-16 surrogate pair. So, for example, a string
> containing only the G clef character (U+1D11E) may be represented as
> "\uD834\uDD1E".
>
> I haven't figured out exactly what that means, but apparently a UTF-8
> character longer than two bytes needs to be represented as two successive
> sets of four hex digits, each prefaced by "\u". That's not what we're
> doing.
>
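What the RFC is saying is that only characters outside the Basic
Multilingual Plane (exactly the ones that take four bytes in UTF-8)
need the twelve-character form; code points up through U+FFFF, which
covers every two- and three-byte UTF-8 sequence, still fit in a single
\uxxxx. The conversion is mechanical. A sketch of it (mine, not from
the RFC or from OpenSRF):

    #include <stdio.h>

    /* Print a code point as a JSON escape: a single \uxxxx inside the
     * Basic Multilingual Plane, a surrogate pair above it. */
    static void json_escape_codepoint(unsigned long cp) {
        if (cp <= 0xFFFF) {
            printf("\\u%04lx", cp);
        } else {
            cp -= 0x10000;  /* leaves a 20-bit value */
            printf("\\u%04lx", 0xD800 + (cp >> 10));    /* high surrogate */
            printf("\\u%04lx", 0xDC00 + (cp & 0x3FF));  /* low surrogate  */
        }
    }

    int main(void) {
        json_escape_codepoint(0x1D11E);  /* G clef: \ud834\udd1e */
        putchar('\n');
        json_escape_codepoint(0x33330);  /* utf_4b above: \ud88c\udf30 */
        putchar('\n');
        return 0;
    }

Fed U+1D11E this prints the RFC's G clef example (in lowercase hex),
and fed U+33330, the four-byte character from utf_4b, it prints
\ud88c\udf30: an unambiguous form that uescape() could emit instead of
\u33330.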
> Scott McKellar
> http://home.swbell.net/mck9/ct/
>
I'm just driving past this topic long enough to point out a similar
discussion from about a year ago that Sam Ruby kicked off (albeit
concerning JSON parsing & decoding in other languages):
http://www.intertwingly.net/blog/2007/11/15/Astral-Plane-Characters-in-Json
The nice thing is that he offers an example and a reasonable test case
(round-trip the beast).
Would it make sense to adopt an existing JSON parser/decoder C library
rather than trying to maintain our own library? There are a few with
GPL-compatible licenses that appear to have recent activity:
* json-c (http://oss.metaparadigm.com/json-c/) (MIT license) - this one
comes with a few tests, which I was able to modify to test out the Faihu
character Sam Ruby talked about; the output doesn't look great though
* mjson (http://sourceforge.net/projects/mjson/) (GPL license)
* YAJL (http://lloydforge.org/projects/yajl/) (New BSD license) - comes
with an extensive test suite, but a quick test of the json_reformat
binary with { "test": "\ud800\udf46" } results in { "test": "!" } (where
"!" stands in for the literal character U+10346; I would have expected
it to keep the escaped Unicode notation). That surrogate pair should
decode to U+10346, as the sketch below shows.
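For completeness, the decode half of that round trip is just as
mechanical; a sketch (mine, not taken from YAJL or any of the libraries
above) that folds Sam Ruby's surrogate pair back into a code point and
re-emits the UTF-8 bytes:

    #include <stdio.h>

    int main(void) {
        /* the pair from { "test": "\ud800\udf46" } */
        unsigned long hi = 0xD800, lo = 0xDF46;

        /* combine high and low surrogates into one code point */
        unsigned long cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);

        /* cp is U+10346, which takes the four-byte UTF-8 form */
        printf("U+%05lX = %02lX %02lX %02lX %02lX\n", cp,
               0xF0 | (cp >> 18),
               0x80 | ((cp >> 12) & 0x3F),
               0x80 | ((cp >> 6) & 0x3F),
               0x80 | (cp & 0x3F));
        return 0;
    }

So a faithful round trip takes { "test": "\ud800\udf46" } to the UTF-8
bytes F0 90 8D 86 and back, which is the round trip the blog post above
proposes.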
If we keep going down the road of an OpenSRF-specific JSON
implementation - or if we want to contribute some code to one of the
other libraries:
* simplejson (C implementation of a Python extension) (Modified BSD
license) -
http://simplejson.googlecode.com/svn/tags/simplejson-2.0.4/simplejson/_speedups.c
has code for explicitly dealing with surrogate pairs; this might be the
most promising template for encoding and decoding Unicode outside the
Basic Multilingual Plane.
Dan