[OPEN-ILS-DEV] Encoding UTF-8
Scott McKellar
mck9 at swbell.net
Fri Nov 21 15:58:43 EST 2008
I don't think that the way uescape() encodes UTF-8 characters is correct.
It creates ambiguities for anybody trying to reverse the encoding.
Consider the following strings:
const unsigned char utf_2a[] = { 0xCF, 0xBF, '\0' };
const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF, '\0' };
The first is a two-byte UTF-8 character, and the second is a three-byte
UTF-8 character. I don't know what kind of characters they represent,
if any, because I hand-crafted them for my example. However, each has
a valid UTF-8 format, so far as I can tell from the Wikipedia page at:
http://en.wikipedia.org/wiki/UTF-8
uescape() encodes each of these strings as "\u03ff". If I'm trying to
decode "\u03ff", how am I supposed to tell which was the original UTF-8
character? No doubt there's also a four-byte version that yields the
same encoding, though I haven't tried to construct it.
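To illustrate, here is a little throwaway program of my own (a sketch of
what I assume uescape() is effectively doing when it strips the length
bits; it is not the code from uescape() itself):

#include <stdio.h>

/* Hypothetical helper, not uescape(): concatenate the payload bits of one
   UTF-8 sequence, discarding the length bits in the lead byte and the
   markers in the continuation bytes. */
static unsigned long payload_bits( const unsigned char* s ) {
    unsigned long codepoint;
    int extra;          /* number of continuation bytes */

    if( (s[0] & 0x80) == 0x00 )      { codepoint = s[0];        extra = 0; }
    else if( (s[0] & 0xE0) == 0xC0 ) { codepoint = s[0] & 0x1F; extra = 1; }
    else if( (s[0] & 0xF0) == 0xE0 ) { codepoint = s[0] & 0x0F; extra = 2; }
    else                             { codepoint = s[0] & 0x07; extra = 3; }

    while( extra-- )
        codepoint = ( codepoint << 6 ) | ( *++s & 0x3F );

    return codepoint;
}

int main( void ) {
    const unsigned char utf_2a[] = { 0xCF, 0xBF, '\0' };
    const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF, '\0' };

    printf( "\\u%04lx\n", payload_bits( utf_2a ) );   /* \u03ff */
    printf( "\\u%04lx\n", payload_bits( utf_3a ) );   /* \u03ff */
    return 0;
}

Both lines of output are "\u03ff", so the escaped text alone gives a
decoder no way to recover the original byte sequence.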
Next, consider the following two strings:
const unsigned char utf_3b[] = { 0xE3, 0x8C, 0xB3, '0', '\0' };
const unsigned char utf_4b[] = { 0xF0, 0xB3, 0x8C, 0xB0, '\0' };
The first is a three-byte UTF-8 character followed by a plain ASCII zero.
The second is a four-byte UTF-8 character. Again, both strings appear to
be valid UTF-8, but uescape() encodes them both the same way: "\u33330".
The latter is a different kind of ambiguity, but it stems from the same
cause. When we encode a multibyte UTF-8 character, we strip off the
length bits. Then no one trying to decode the results has any way to
guess what the original length was.
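To spell out the arithmetic (my own breakdown, not output from any of our
code), here are the four bytes of the second string with the length and
continuation markers separated from the payload bits:

byte:      0xF0       0xB3       0x8C       0xB0
bits:      11110000   10110011   10001100   10110000
payload:        000     110011     001100     110000

000 110011 001100 110000 binary  =  0x33330

The concatenated payload is 0x33330, whose hex digits happen to read as
"3333" followed by "0", exactly what the three-byte character plus an
ASCII zero produces. Hence both strings come out as "\u33330".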
The site I have been relying on for information about JSON:
http://www.json.org/
...says that a UTF-8 character should be encoded as "\uxxxx", where each
x is a hexadecimal digit. Specifically, it says FOUR hexadecimal digits.
The restriction to four hexadecimal digits works only for two-byte
UTF-8 characters. It works for three-byte UTF-8 characters if you discard
the length bits, leading to the kind of ambiguity described above. It
can't possibly work for four-byte UTF-8 characters, unless you break up
the four bytes into two two-byte sequences and encode them separately.
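For reference, if I am reading the ranges correctly, the code points that
each UTF-8 sequence length can carry are:

1 byte:   U+0000  - U+007F
2 bytes:  U+0080  - U+07FF    (at most 3 hex digits)
3 bytes:  U+0800  - U+FFFF    (at most 4 hex digits)
4 bytes:  U+10000 - U+10FFFF  (5 or 6 hex digits)

Anything above U+FFFF simply doesn't fit in a single "\uxxxx" escape.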
The same site is notably silent on exactly how the encoding is to be done.
Supposedly JSON is defined by RFC 4627:
http://tools.ietf.org/html/rfc4627
In section 2.5 it says:
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
I haven't figured out exactly what that means, but apparently a character
outside the Basic Multilingual Plane (i.e. one that takes four bytes in
UTF-8) needs to be represented as two successive sets of four hex digits,
each prefaced by "\u". That's not what we're doing.
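For whatever it's worth, here is my attempt at the arithmetic the RFC
seems to have in mind, applied to its own G clef example (this is just a
sketch of my own, not anything currently in uescape()):

#include <stdio.h>

int main( void ) {
    unsigned long codepoint = 0x1D11E;          /* U+1D11E, the G clef   */
    unsigned long v  = codepoint - 0x10000;     /* leaves a 20-bit value */
    unsigned int  hi = 0xD800 + ( v >> 10 );    /* high surrogate        */
    unsigned int  lo = 0xDC00 + ( v & 0x3FF );  /* low surrogate         */

    printf( "\\u%04X\\u%04X\n", hi, lo );       /* prints \uD834\uDD1E   */
    return 0;
}

That matches the "\uD834\uDD1E" in the RFC's example, so presumably that
is the sort of thing uescape() ought to be doing for characters above
U+FFFF.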
Scott McKellar
http://home.swbell.net/mck9/ct/