[OPEN-ILS-DEV] Encoding UTF-8

Scott McKellar mck9 at swbell.net
Sun Nov 23 19:55:36 EST 2008


I didn't follow all of this, but I think you're saying:

1. The ambiguities between 2-byte and 3-byte characters are present,
but harmless because of some built-in redundancy.

2. Characters that require more than four hex digits to encode can lead
to ambiguities, but they are rare and improbable in practice, in part
for reasons that are specific to Evergreen.

3. Doing UTF-8 correctly, accounting even for the oddball cases, would
be harder than it would be worth.

I shall address these in more detail below.

--- On Sun, 11/23/08, Mike Rylander <mrylander at gmail.com> wrote:

> From: Mike Rylander <mrylander at gmail.com>
> Subject: Re: [OPEN-ILS-DEV] Encoding UTF-8
> To: mck9 at swbell.net, "Evergreen Development Discussion List" <open-ils-dev at list.georgialibraries.org>
> Date: Sunday, November 23, 2008, 3:08 PM
> Dan beat me to some of this, but I figure I'll finish up
> the draft.
> In any case, most of this is superseded by the new
> implementation.
> 
> On Fri, Nov 21, 2008 at 3:58 PM, Scott McKellar
> <mck9 at swbell.net> wrote:
> > I don't think that the way uescape() encodes UTF-8
> characters is correct.
> > It creates ambiguities for anybody trying to reverse
> the encoding.
> >
> > Consider the following strings:
> >
> > const unsigned char utf_2a[] = { 0xCF, 0xBF,
> '\0' };
> > const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF,
> '\0' };
> >
> > The first is a two-byte UTF-8 character, and the
> second is a three-byte
> > UTF-8 character.  I don't know what kind of
> characters they represent,
> > if any, because I hand-crafted them for my example. 
> However each has
> > a valid UTF-8 format, so far as I can tell from the
> Wikipedia page at:
> >
> >    http://en.wikipedia.org/wiki/UTF-8
> >
> > uescape() encodes each of these strings as
> "\u03ff".  If I'm trying to
> > decode "\u03ff", how am I supposed to
> tell which was the original UTF-8
> > character?  No doubt there's also a four-byte
> version that yields the
> > same encoding, though I haven't tried to construct
> it.
> >
> 
> This is OK and expected. The bytes used to transfer the
> data are not
> important (from the perspective of Unicode), only the
> decoded
> codepoint values.  It's entirely acceptable to transmit
> 0xCF 0xBF or
> 0xE0 0x8F 0xBF for codepoint U+033F as long as they are
> both properly
> constructed UTF-8 byte arrays (which, as you point out,
> they are).
> All UTF-8 encoding routines which I've poked at always
> use the
> shortest possible encoding for a given codepoint, simply
> because the
> easiest way to detect the number of bytes to use is to
> consider how
> many bits are needed to encode the codepoint.  Given that,
> it's just a
> matter of masking the proper amount of most-significant
> bits and
> splitting the data bits over the remaining space.  Of
> course, at a
> binary level, that's lossy, but is conceptually the
> same as using
> entity replacements in XML instead of direct byte encoding.
>  It's all
> about the codepoint

I had assumed that 0xCF 0xBF and 0xE0 0x8F 0xBF would represent two
different and distinct characters.  For example, the first might be
the Hebrew aleph, and the second might be the Mandarin symbol for
"lampshade".  If so, then whoever tries to decode "\u3FFF" runs the
risk of turning an aleph into a lampshade, or vice versa.

Your argument seems to assert that those two byte sequences, if they
mean anything, must mean the *same* thing, so that it doesn't matter
which one you choose when you decode "\u3FFF".

This makes sense to me under the following consideration.  If you strip
off the various overhead bits, and any leading zero bits that remain,
you are left with the same bit pattern: 1111111111.  That's what you
call the code point, the underlying numerical payload, although it may
be packaged in more than one way.

Have I got that right?

The fact remains that four hex characters can encode no more than
65536 distinct values.  More than half of those are not valid two-byte
UTF-8 characters, so really there are only about 16K distinct values 
available.  UTF-8 has more multibyte characters than that.  That's why it 
uses three- and four-byte sequences.  Those long characters cannot be
expressed in four hex digits.  If we ever see one, we'll mangle it.

> > Next, consider the following two strings:
> >
> > const unsigned char utf_3b[] = { 0xE3, 0x8C, 0xB3,
> '0', '\0' };
> > const unsigned char utf_4b[] = { 0xF0, 0xB3, 0x8C,
> 0xB0, '\0' };
> >
> > The first is a three-byte UTF-8 character followed by
> a plain ASCII zero.
> > The second is a four-byte UTF-8 character.  Again,
> both strings appear to
> > be valid UTF-8, but uescape() encodes them the same:
> "\u33330".
> >
> > The latter is a different kind of ambiguity, but it
> stems from the same
> > cause.  When we encode a multibyte UTF-8 character, we
> strip off the
> > length bits.  Then no one trying to decode the results
> has any way to
> > guess what the original length was.
> >
> 
> While I realize the example is contrived, the codepoint you
> are
> constructing is not a valid codepoint according to
> 
> http://www.zvon.org/other/charSearch/PHP/search.php?request=033330&searchType=3
> and
> http://www.alanwood.net/unicode/fontsbyrange.html
> 
> That's not to say that other codepoints that are valid
> can't have
> collisions of the 4/5 hex character type.  I went hunting
> for one and
> found U+1236/U+1236E in some codepoint tables (though Zvon
> doesn't
> know about U+1236E ...).  But, we have specs and conditions
> on our
> side.  First, we are working with bibliographic data which
> is
> restricted to a subset of the Unicode codepoint range
> (specifically,
> within the 16b limitation of Javascript (and UTF-16)), and
> it's
> required to always use combining characters, instead of
> precomposed
> characters, which because of the codepoint layout in
> Unicode will
> always be within the range our tools (uescape et al, and
> outside code)
> impose.  Now, that's a bit of a fib -- it's true
> for USMARC, but might
> not be true for UNIMARC, but there is a lot more work than
> just this
> required to support UNIMARC in Evergreen.

I would like to think that OSRF would be usable in a context other than
just Evergreen.  I would also like to think that a library could use
Evergreeb in Beijing, or Osaka, or Seoul, or wherever it is that they
use characters from the astral planes.

> > The site I have been relying on for information about
> JSON:
> >
> >    http://www.json.org/
> >
> > ...says that a UTF-8 character should be encoded as
> "\uxxxx", where each
> > x is a hexadecimal digit.  Specifically it says FOUR
> hexadecimal digits.
> >
> > The restriction to four hexadecimal characters works
> only for two-byte
> > UTF-8 characters.  It works for three-byte UTF-8
> characters if you discard
> > the length bits, leading to the kind of ambiguity
> described above.  It
> > can't possibly work for four-byte UTF-8
> characters, unless you break up
> > the four bytes into two two-byte sequences and encode
> them separately.
> >
> > The same site is notably silent on exactly how the
> encoding is to be done.
> >
> > Supposedly JSON is defined by RFC 4627:
> >
> >    http://tools.ietf.org/html/rfc4627
> >
> > In section 2.5 it says:
> >
> >   To escape an extended character that is not in the
> Basic Multilingual
> >   Plane, the character is represented as a
> twelve-character sequence,
> >   encoding the UTF-16 surrogate pair.  So, for
> example, a string
> >   containing only the G clef character (U+1D11E) may
> be represented as
> >   "\uD834\uDD1E".
> >
> > I haven't figured out exactly what that means, but
> apparently a UTF-8
> > character longer than two bytes needs to be
> represented as two successive
> > sets of four hex digits, each prefaced by
> "\u".  That's not what we're
> > doing.
> >
> 
> Specifically, it's characters that need more than 16
> bits of data
> space, which for UTF-8 means some 3-byte characters, but
> not all.
> 
> Another thing to consider is that because we're working
> with XML in
> most cases, characters that fall outside the ASCII range
> are turned
> into entities before being stored.  This steps right around
> most of
> the problems with direct UTF-8 encoding and JSON, but the
> exception to
> this is non-bibliographic data which could be made to
> contain high
> codepoints.  Short of building a surrogate pair map for
> high-codepoint
> Unicode planes I don't see a way to support such
> characters.  While
> it's conceivable that someone would want to put
> characters from "bad"
> planes into a library name label (or other non-bib string),
> it is a
> corner case restriction that I'm personally OK living
> with lacking
> another solution.

I haven't figured out exactly how surrogate pairs work, but I wouldn't
expect to need some kind of monster map from one character set to 
another.  I would expect that surrogate pairs are just a way to package
arbitrary bit patterns (well, arbitrary within certain syntactic 
constraints) into a series of hex digits.  I would expect to do no more
than a little bit-twiddling.  Would we really need to look things up in a
map?

> Thanks, Scott, for digging into this.  And thanks for the
> new
> implementation of the encoding routine!  It's much more
> readable and
> direct.  I've always disliked the existing routine, and
> your
> replacement seems like a big improvement.
> 
> -- 
> Mike Rylander

In a separate reply, Dan Scott suggested that we look at other people's
libraries for dealing with UTF-8.  That's not a bad idea, if we want to
change things.  However for performance reasons I'd like to continue
to append the escaped text to a growing_buffer.  Nobody else's code will
do that without modification or adaptation on our part.

One library he identified was mjson, at:

    http://sourceforge.net/projects/mjson/

I looked briefly at it.  So far as I could tell from the source code,
it makes no attempt to escape UTF-8 characters.  It just copies them
without translation.  I haven't looked at the other sites yet.

Of course a change to the encoding may require corresponding changes to
the decoding.  I'm not sure where that happens.

I don't think I'm ready to volunteer for a full-bore implementation of
UTF-8.  It's clearly not an immediate priority.  However I'll keep it in
the back of my mind.

Scott McKellar
http://home.swbell.net/mck9/ct/



More information about the Open-ils-dev mailing list