[OPEN-ILS-DEV] Encoding UTF-8
Scott McKellar
mck9 at swbell.net
Fri Nov 21 15:58:43 EST 2008
I don't think that the way uescape() encodes UTF-8 characters is correct.
It creates ambiguities for anybody trying to reverse the encoding.
Consider the following strings:
const unsigned char utf_2a[] = { 0xCF, 0xBF, '\0' };
const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF, '\0' };
The first is a two-byte UTF-8 character, and the second is a three-byte
UTF-8 character. I don't know what kind of characters they represent,
if any, because I hand-crafted them for my example. However, each has
a valid UTF-8 format, so far as I can tell from the Wikipedia page at:
http://en.wikipedia.org/wiki/UTF-8
uescape() encodes each of these strings as "\u03ff". If I'm trying to
decode "\u03ff", how am I supposed to tell which was the original UTF-8
character? No doubt there's also a four-byte version that yields the
same encoding, though I haven't tried to construct it.
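To illustrate, here is a little throwaway program of my own (a sketch of
what I assume uescape() is effectively doing when it strips the length
bits; it is not the code from uescape() itself):

#include <stdio.h>

/* Hypothetical helper, not uescape(): concatenate the payload bits of one
   UTF-8 sequence, discarding the length bits in the lead byte and the
   markers in the continuation bytes. */
static unsigned long payload_bits( const unsigned char* s ) {
    unsigned long codepoint;
    int extra;          /* number of continuation bytes */

    if( (s[0] & 0x80) == 0x00 )      { codepoint = s[0];        extra = 0; }
    else if( (s[0] & 0xE0) == 0xC0 ) { codepoint = s[0] & 0x1F; extra = 1; }
    else if( (s[0] & 0xF0) == 0xE0 ) { codepoint = s[0] & 0x0F; extra = 2; }
    else                             { codepoint = s[0] & 0x07; extra = 3; }

    while( extra-- )
        codepoint = ( codepoint << 6 ) | ( *++s & 0x3F );

    return codepoint;
}

int main( void ) {
    const unsigned char utf_2a[] = { 0xCF, 0xBF, '\0' };
    const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF, '\0' };

    printf( "\\u%04lx\n", payload_bits( utf_2a ) );   /* \u03ff */
    printf( "\\u%04lx\n", payload_bits( utf_3a ) );   /* \u03ff */
    return 0;
}

Both lines of output are "\u03ff", so the escaped text alone gives a
decoder no way to recover the original byte sequence.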
Next, consider the following two strings:
const unsigned char utf_3b[] = { 0xE3, 0x8C, 0xB3, '0', '\0' };
const unsigned char utf_4b[] = { 0xF0, 0xB3, 0x8C, 0xB0, '\0' };
The first is a three-byte UTF-8 character followed by a plain ASCII zero.
The second is a four-byte UTF-8 character. Again, both strings appear to
be valid UTF-8, but uescape() encodes them both the same way: "\u33330".
The latter is a different kind of ambiguity, but it stems from the same
cause. When we encode a multibyte UTF-8 character, we strip off the
length bits. Then no one trying to decode the results has any way to
guess what the original length was.
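To spell out the arithmetic (my own breakdown, not output from any of our
code), here are the four bytes of the second string with the length and
continuation markers separated from the payload bits:

byte:      0xF0       0xB3       0x8C       0xB0
bits:      11110000   10110011   10001100   10110000
payload:        000     110011     001100     110000

000 110011 001100 110000 binary  =  0x33330

The concatenated payload is 0x33330, whose hex digits happen to read as
"3333" followed by "0", exactly what the three-byte character plus an
ASCII zero produces. Hence both strings come out as "\u33330".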
The site I have been relying on for information about JSON:
http://www.json.org/
...says that a UTF-8 character should be encoded as "\uxxxx", where each
x is a hexadecimal digit. Specifically, it says FOUR hexadecimal digits.
The restriction to four hexadecimal digits works only for two-byte
UTF-8 characters. It works for three-byte UTF-8 characters if you discard
the length bits, leading to the kind of ambiguity described above. It
can't possibly work for four-byte UTF-8 characters, unless you break up
the four bytes into two two-byte sequences and encode them separately.
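For reference, if I am reading the ranges correctly, the code points that
each UTF-8 sequence length can carry are:

1 byte:   U+0000  - U+007F
2 bytes:  U+0080  - U+07FF    (at most 3 hex digits)
3 bytes:  U+0800  - U+FFFF    (at most 4 hex digits)
4 bytes:  U+10000 - U+10FFFF  (5 or 6 hex digits)

Anything above U+FFFF simply doesn't fit in a single "\uxxxx" escape.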
The same site is notably silent on exactly how the encoding is to be done.
Supposedly JSON is defined by RFC 4627:
http://tools.ietf.org/html/rfc4627
In section 2.5 it says:
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
I haven't figured out exactly what that means, but apparently a character
outside the Basic Multilingual Plane (i.e. one that takes four bytes in
UTF-8) needs to be represented as two successive sets of four hex digits,
each prefaced by "\u". That's not what we're doing.
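For whatever it's worth, here is my attempt at the arithmetic the RFC
seems to have in mind, applied to its own G clef example (this is just a
sketch of my own, not anything currently in uescape()):

#include <stdio.h>

int main( void ) {
    unsigned long codepoint = 0x1D11E;          /* U+1D11E, the G clef   */
    unsigned long v  = codepoint - 0x10000;     /* leaves a 20-bit value */
    unsigned int  hi = 0xD800 + ( v >> 10 );    /* high surrogate        */
    unsigned int  lo = 0xDC00 + ( v & 0x3FF );  /* low surrogate         */

    printf( "\\u%04X\\u%04X\n", hi, lo );       /* prints \uD834\uDD1E   */
    return 0;
}

That matches the "\uD834\uDD1E" in the RFC's example, so presumably that
is the sort of thing uescape() ought to be doing for characters above
U+FFFF.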
Scott McKellar
http://home.swbell.net/mck9/ct/