[OPEN-ILS-DEV] Encoding UTF-8

Mike Rylander mrylander at gmail.com
Sun Nov 23 16:08:11 EST 2008


Dan beat me to some of this, but I figure I'll finish up the draft.
In any case, most of this is superseded by the new implementation.

On Fri, Nov 21, 2008 at 3:58 PM, Scott McKellar <mck9 at swbell.net> wrote:
> I don't think that the way uescape() encodes UTF-8 characters is correct.
> It creates ambiguities for anybody trying to reverse the encoding.
>
> Consider the following strings:
>
> const unsigned char utf_2a[] = { 0xCF, 0xBF, '\0' };
> const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF, '\0' };
>
> The first is a two-byte UTF-8 character, and the second is a three-byte
> UTF-8 character.  I don't know what kind of characters they represent,
> if any, because I hand-crafted them for my example.  However each has
> a valid UTF-8 format, so far as I can tell from the Wikipedia page at:
>
>    http://en.wikipedia.org/wiki/UTF-8
>
> uescape() encodes each of these strings as "\u03ff".  If I'm trying to
> decode "\u03ff", how am I supposed to tell which was the original UTF-8
> character?  No doubt there's also a four-byte version that yields the
> same encoding, though I haven't tried to construct it.
>

This is OK and expected.  The bytes used to transfer the data are not
important (from the perspective of Unicode); only the decoded
codepoint values are.  It's entirely acceptable to transmit 0xCF 0xBF
or 0xE0 0x8F 0xBF for codepoint U+03FF, as long as both are properly
constructed UTF-8 byte sequences (which, as you point out, they are).
All of the UTF-8 encoding routines I've poked at use the shortest
possible encoding for a given codepoint, simply because the easiest
way to determine how many bytes to use is to consider how many bits
are needed to encode the codepoint.  Given that, it's just a matter
of masking off the proper number of most-significant bits and
splitting the data bits over the remaining space.  Of course, at the
byte level the round trip is lossy, but conceptually it's the same as
using entity replacements in XML instead of direct byte encoding.
It's all about the codepoint.
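
To illustrate (a standalone sketch, not code from uescape()), here is
a decoder that cares only about the codepoint.  It masks off the
length bits and folds in the payload bits, so both of your example
byte sequences come out identical:

  #include <stdio.h>

  /* Sketch only: decode one UTF-8 sequence into a codepoint.  It
     deliberately ignores whether the sequence was the shortest
     possible encoding, which is exactly why both inputs below
     produce the same result. */
  static unsigned int decode_utf8( const unsigned char* s ) {
      unsigned int cp = 0;
      int extra = 0;

      if( *s < 0x80 ) {                    /* 0xxxxxxx : 1 byte  */
          return *s;
      } else if( (*s & 0xE0) == 0xC0 ) {   /* 110xxxxx : 2 bytes */
          cp = *s & 0x1F;  extra = 1;
      } else if( (*s & 0xF0) == 0xE0 ) {   /* 1110xxxx : 3 bytes */
          cp = *s & 0x0F;  extra = 2;
      } else if( (*s & 0xF8) == 0xF0 ) {   /* 11110xxx : 4 bytes */
          cp = *s & 0x07;  extra = 3;
      }

      while( extra-- )           /* 6 payload bits per trailing byte */
          cp = (cp << 6) | (*(++s) & 0x3F);

      return cp;
  }

  int main( void ) {
      const unsigned char utf_2a[] = { 0xCF, 0xBF, '\0' };
      const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF, '\0' };

      printf( "\\u%04x\n", decode_utf8( utf_2a ) );  /* \u03ff */
      printf( "\\u%04x\n", decode_utf8( utf_3a ) );  /* \u03ff */
      return 0;
  }

Both calls print \u03ff; the byte-level difference is gone before the
escaping ever happens.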

> Next, consider the following two strings:
>
> const unsigned char utf_3b[] = { 0xE3, 0x8C, 0xB3, '0', '\0' };
> const unsigned char utf_4b[] = { 0xF0, 0xB3, 0x8C, 0xB0, '\0' };
>
> The first is a three-byte UTF-8 character followed by a plain ASCII zero.
> The second is a four-byte UTF-8 character.  Again, both strings appear to
> be valid UTF-8, but uescape() encodes them the same: "\u33330".
>
> The latter is a different kind of ambiguity, but it stems from the same
> cause.  When we encode a multibyte UTF-8 character, we strip off the
> length bits.  Then no one trying to decode the results has any way to
> guess what the original length was.
>

While I realize the example is contrived, the codepoint you are
constructing is not a valid codepoint according to

http://www.zvon.org/other/charSearch/PHP/search.php?request=033330&searchType=3
and
http://www.alanwood.net/unicode/fontsbyrange.html

That's not to say that other, valid codepoints can't have collisions
of the 4-vs-5 hex digit type.  I went hunting for one and found
U+1236/U+1236E in some codepoint tables (though Zvon doesn't know
about U+1236E ...).  But we have the specs and our own conditions on
our side.  We are working with bibliographic data, which is
restricted to a subset of the Unicode codepoint range (specifically,
within the 16-bit limitation of Javascript (and UTF-16)), and it is
required to use combining characters instead of precomposed
characters.  Because of the codepoint layout in Unicode, that data
will always fall within the range our tools (uescape et al, and
outside code) impose.  Now, that's a bit of a fib -- it's true for
USMARC, but it might not be true for UNIMARC; in any case, there is a
lot more work than just this required to support UNIMARC in
Evergreen.

> The site I have been relying on for information about JSON:
>
>    http://www.json.org/
>
> ...says that a UTF-8 character should be encoded as "\uxxxx", where each
> x is a hexadecimal digit.  Specifically it says FOUR hexadecimal digits.
>
> The restriction to four hexadecimal characters works only for two-byte
> UTF-8 characters.  It works for three-byte UTF-8 characters if you discard
> the length bits, leading to the kind of ambiguity described above.  It
> can't possibly work for four-byte UTF-8 characters, unless you break up
> the four bytes into two two-byte sequences and encode them separately.
>
> The same site is notably silent on exactly how the encoding is to be done.
>
> Supposedly JSON is defined by RFC 4627:
>
>    http://tools.ietf.org/html/rfc4627
>
> In section 2.5 it says:
>
>   To escape an extended character that is not in the Basic Multilingual
>   Plane, the character is represented as a twelve-character sequence,
>   encoding the UTF-16 surrogate pair.  So, for example, a string
>   containing only the G clef character (U+1D11E) may be represented as
>   "\uD834\uDD1E".
>
> I haven't figured out exactly what that means, but apparently a UTF-8
> character longer than two bytes needs to be represented as two successive
> sets of four hex digits, each prefaced by "\u".  That's not what we're
> doing.
>

Specifically, it's characters that need more than 16 bits of
codepoint space that have to be written as surrogate pairs; in UTF-8
terms those are the 4-byte sequences (3-byte sequences top out at
exactly 16 bits, i.e. U+FFFF).
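
For reference, the arithmetic RFC 4627 describes is straightforward
(again, a sketch; this is not something uescape() currently does):
subtract 0x10000 from the codepoint and split the remaining 20 bits
across a high and a low surrogate.

  #include <stdio.h>

  int main( void ) {
      unsigned int cp = 0x1D11E;  /* the G clef example from RFC 4627 */

      if( cp > 0xFFFF ) {
          unsigned int v    = cp - 0x10000;          /* 20 bits remain */
          unsigned int high = 0xD800 + (v >> 10);    /* top 10 bits    */
          unsigned int low  = 0xDC00 + (v & 0x3FF);  /* bottom 10 bits */
          printf( "\\u%04X\\u%04X\n", high, low );   /* \uD834\uDD1E   */
      } else {
          printf( "\\u%04X\n", cp );  /* BMP: one escape is enough */
      }
      return 0;
  }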

Another thing to consider is that, because we're working with XML in
most cases, characters that fall outside the ASCII range are turned
into entities before being stored.  That steps right around most of
the problems with direct UTF-8 encoding and JSON.  The exception is
non-bibliographic data, which could be made to contain high
codepoints.  Short of building a surrogate pair map for the
high-codepoint Unicode planes, I don't see a way to support such
characters.  While it's conceivable that someone would want to put
characters from "bad" planes into a library name label (or other
non-bib string), that's a corner-case restriction I'm personally OK
living with, lacking another solution.
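
To make the entity point concrete (hypothetical data, not pulled from
Evergreen code): once a high codepoint has been converted to a
numeric character reference, the string our JSON escaper sees is
plain ASCII, so the surrogate question never comes up.

  /* The same character (U+1D11E) stored two ways.  The first form
     contains a 4-byte UTF-8 sequence that would need a surrogate
     pair in JSON; the second is already 7-bit ASCII, so a JSON
     escaper passes it through untouched. */
  const char raw_utf8[]   = "\xF0\x9D\x84\x9E";  /* raw UTF-8 bytes        */
  const char xml_entity[] = "&#x1D11E;";         /* numeric char reference */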

Thanks, Scott, for digging into this.  And thanks for the new
implementation of the encoding routine!  It's much more readable and
direct.  I've always disliked the existing routine, and your
replacement seems like a big improvement.

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com

