[OPEN-ILS-DEV] Encoding UTF-8

Mike Rylander mrylander at gmail.com
Sun Nov 23 21:25:06 EST 2008


Welcome to the world of character encoding! ;) (fun stuffs inline)

On Sun, Nov 23, 2008 at 7:55 PM, Scott McKellar <mck9 at swbell.net> wrote:
> I didn't follow all of this, but I think you're saying:
>
> 1. The ambiguities between 2-byte and 3-byte characters are present,
> but harmless because of some built-in redundancy.
>
> 2. Characters that require more than four hex digits to encode can lead
> to ambiguities, but they are rare and improbable in practice, in part
> for reasons that are specific to Evergreen.
>
> 3. Doing UTF-8 correctly, accounting even for the oddball cases, would
> be harder than it would be worth.
>
> I shall address these in more detail below.
>
> --- On Sun, 11/23/08, Mike Rylander <mrylander at gmail.com> wrote:
>
>> From: Mike Rylander <mrylander at gmail.com>
>> Subject: Re: [OPEN-ILS-DEV] Encoding UTF-8
>> To: mck9 at swbell.net, "Evergreen Development Discussion List" <open-ils-dev at list.georgialibraries.org>
>> Date: Sunday, November 23, 2008, 3:08 PM
>> Dan beat me to some of this, but I figure I'll finish up
>> the draft.
>> In any case, most of this is superseded by the new
>> implementation.
>>
>> On Fri, Nov 21, 2008 at 3:58 PM, Scott McKellar
>> <mck9 at swbell.net> wrote:
>> > I don't think that the way uescape() encodes UTF-8
>> characters is correct.
>> > It creates ambiguities for anybody trying to reverse
>> the encoding.
>> >
>> > Consider the following strings:
>> >
>> > const unsigned char utf_2a[] = { 0xCF, 0xBF,
>> '\0' };
>> > const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF,
>> '\0' };
>> >
>> > The first is a two-byte UTF-8 character, and the
>> second is a three-byte
>> > UTF-8 character.  I don't know what kind of
>> characters they represent,
>> > if any, because I hand-crafted them for my example.
>> However each has
>> > a valid UTF-8 format, so far as I can tell from the
>> Wikipedia page at:
>> >
>> >    http://en.wikipedia.org/wiki/UTF-8
>> >
>> > uescape() encodes each of these strings as
>> "\u03ff".  If I'm trying to
>> > decode "\u03ff", how am I supposed to
>> tell which was the original UTF-8
>> > character?  No doubt there's also a four-byte
>> version that yields the
>> > same encoding, though I haven't tried to construct
>> it.
>> >
>>
>> This is OK and expected. The bytes used to transfer the
>> data are not
>> important (from the perspective of Unicode), only the
>> decoded
>> codepoint values.  It's entirely acceptable to transmit
>> 0xCF 0xBF or
>> 0xE0 0x8F 0xBF for codepoint U+03FF as long as they are
>> both properly
>> constructed UTF-8 byte arrays (which, as you point out,
>> they are).
>> All UTF-8 encoding routines which I've poked at always
>> use the
>> shortest possible encoding for a given codepoint, simply
>> because the
>> easiest way to detect the number of bytes to use is to
>> consider how
>> many bits are needed to encode the codepoint.  Given that,
>> it's just a
>> matter of masking the proper amount of most-significant
>> bits and
>> splitting the data bits over the remaining space.  Of
>> course, at a
>> binary level, that's lossy, but is conceptually the
>> same as using
>> entity replacements in XML instead of direct byte encoding.
>>  It's all
>> about the codepoint
>
> I had assumed that 0xCF 0xBF and 0xE0 0x8F 0xBF would represent two
> different and distinct characters.  For example, the first might be
> the Hebrew aleph, and the second might be the Mandarin symbol for
> "lampshade".  If so, then whoever tries to decode "\u3FFF" runs the
> risk of turning an aleph into a lampshade, or vice versa.
>

As long as they both boil down to 0b1111111111 then it doesn't
matter how many bytes you use.  I suppose you could even say 0xFC 0x80
0x80 0x80 0x8F 0xBF and it should decode to 0x3FF.  Numerically, after
decoding, they are the same.
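
To make that concrete, here's a rough decoder sketch (illustration
only, not the osrf_json/objson code) that collapses both of your byte
sequences down to the same codepoint:

    #include <stdio.h>

    /* Decode one UTF-8 sequence into a codepoint, trusting the length
     * bits and not worrying about overlong forms.  Sketch only. */
    static unsigned long utf8_codepoint( const unsigned char* s ) {
        unsigned long cp;
        int n, i;

        if(      (*s & 0x80) == 0x00 ) { cp = *s;        n = 1; }  /* 0xxxxxxx */
        else if( (*s & 0xE0) == 0xC0 ) { cp = *s & 0x1F; n = 2; }  /* 110xxxxx */
        else if( (*s & 0xF0) == 0xE0 ) { cp = *s & 0x0F; n = 3; }  /* 1110xxxx */
        else                           { cp = *s & 0x07; n = 4; }  /* 11110xxx */

        for( i = 1; i < n; i++ )
            cp = (cp << 6) | (s[i] & 0x3F);  /* 6 data bits per 10xxxxxx byte */

        return cp;
    }

    int main( void ) {
        const unsigned char utf_2a[] = { 0xCF, 0xBF, '\0' };
        const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF, '\0' };

        printf( "%04lX\n", utf8_codepoint( utf_2a ) );  /* 03FF */
        printf( "%04lX\n", utf8_codepoint( utf_3a ) );  /* 03FF -- same codepoint */
        return 0;
    }

Once the length bits and continuation markers are stripped off, the
two strings are indistinguishable.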

> Your argument seems to assert that those two byte sequences, if they
> mean anything, must mean the *same* thing, so that it doesn't matter
> which one you choose when you decode "\u03FF".
>

Yep.  UTF-8 is just the way the codepoint number is packaged.
Specifically, it's meant to avoid the need for an extra byte (as with
UTF-16, aka wchar, aka UCS-2) per character, or 3 in the case of
UTF-32.
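
Just to put rough numbers on the size trade-off (standard Unicode
sizes, nothing Evergreen-specific):

    codepoint range      UTF-8     UTF-16             UTF-32
    U+0000 - U+007F      1 byte    2 bytes            4 bytes
    U+0080 - U+07FF      2 bytes   2 bytes            4 bytes
    U+0800 - U+FFFF      3 bytes   2 bytes            4 bytes
    U+10000 and up       4 bytes   4 bytes (a pair)   4 bytes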

> This makes sense to me under the following consideration.  If you strip
> off the various overhead bits, and any leading zero bits that remain,
> you are left with the same bit pattern: 1111111111.  That's what you
> call the code point, the underlying numerical payload, although it may
> be packaged in more than one way.
>
> Have I got that right?
>

Exactly.
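
And in code, the shortest-form packaging described above looks roughly
like this (a sketch, not the uescape()/objson internals):

    /* Pick the byte count from the number of bits the codepoint needs,
     * then split the data bits across the continuation bytes. */
    static int utf8_encode( unsigned long cp, unsigned char* out ) {
        if( cp < 0x80 ) {                              /*  7 bits -> 1 byte  */
            out[0] = (unsigned char) cp;
            return 1;
        } else if( cp < 0x800 ) {                      /* 11 bits -> 2 bytes */
            out[0] = (unsigned char) (0xC0 | (cp >> 6));
            out[1] = (unsigned char) (0x80 | (cp & 0x3F));
            return 2;
        } else if( cp < 0x10000 ) {                    /* 16 bits -> 3 bytes */
            out[0] = (unsigned char) (0xE0 | (cp >> 12));
            out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char) (0x80 | (cp & 0x3F));
            return 3;
        } else {                                       /* 21 bits -> 4 bytes */
            out[0] = (unsigned char) (0xF0 | (cp >> 18));
            out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char) (0x80 | (cp & 0x3F));
            return 4;
        }
    }

Feed it 0x3FF and you get back exactly the two bytes 0xCF 0xBF from
your first example -- the shortest form, every time.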

> The fact remains that four hex digits can encode no more than 65536
> distinct values, and a couple thousand of those are reserved for
> surrogates, so the usable range is smaller still.  Unicode has more
> characters than that.  That's why UTF-8 uses three- and four-byte
> sequences.  Characters beyond the four-digit range cannot be expressed
> in a single escape.  If we ever see one, we'll mangle it.
>

Well, there are also spacing and non-spacing combining marks, which
act as modifiers to the preceding character.  These, of course, can be
combined in various and sundry ways, and are how many characters and
glyphs are actually built.
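
A familiar example (standard Unicode, nothing Evergreen-specific):
"e with acute" can arrive either precomposed or built from a base
letter plus a combining mark --

    const unsigned char precomposed[] = { 0xC3, 0xA9, '\0' };       /* U+00E9                          */
    const unsigned char decomposed[]  = { 'e', 0xCC, 0x81, '\0' };  /* U+0065 + U+0301 combining acute */

MARC21 data is supposed to use the second, decomposed form, which is
part of what keeps the codepoints we actually see in the low range.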

Lacking a way to produce surrogate pairs (bit sequences that live in a
reserved range and say "further decode me by sticking me to an
adjacent pile of similarly decoded bits") we would mangle any legal
character we see with a codepoint above 0xFFFF (the surrogate pair
reserved range itself sits at 0xD800-0xDFFF).  Dan found some code for
a python module that seems to provide this capability, which is
probably what we need to do in the long run.  Or we can just say "use
surrogate pairs instead of direct encoding beyond codepoint 0xFFFF"
... which is essentially what we do today, actually.  While UTF-8 can
directly encode codepoints up to 31 bits long, our data sources
(Postgres from the database end, Javascript et al from the client
side) all use surrogate pairs.
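
To put concrete numbers on that (a sketch, not library code):

    /* Which codepoints fit in a single four-hex-digit escape. */
    static int needs_surrogate_pair( unsigned long cp ) {
        return cp > 0xFFFF;                   /* only representable as \uXXXX\uXXXX */
    }

    static int is_reserved_surrogate( unsigned long cp ) {
        return cp >= 0xD800 && cp <= 0xDFFF;  /* reserved -- never a character on its own */
    }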

Hrm... I'm starting to wonder now if this is a problem at all ...

>> > Next, consider the following two strings:
>> >
>> > const unsigned char utf_3b[] = { 0xE3, 0x8C, 0xB3,
>> '0', '\0' };
>> > const unsigned char utf_4b[] = { 0xF0, 0xB3, 0x8C,
>> 0xB0, '\0' };
>> >
>> > The first is a three-byte UTF-8 character followed by
>> a plain ASCII zero.
>> > The second is a four-byte UTF-8 character.  Again,
>> both strings appear to
>> > be valid UTF-8, but uescape() encodes them the same:
>> "\u33330".
>> >
>> > The latter is a different kind of ambiguity, but it
>> stems from the same
>> > cause.  When we encode a multibyte UTF-8 character, we
>> strip off the
>> > length bits.  Then no one trying to decode the results
>> has any way to
>> > guess what the original length was.
>> >
>>
>> While I realize the example is contrived, the codepoint you
>> are
>> constructing is not a valid codepoint according to
>>
>> http://www.zvon.org/other/charSearch/PHP/search.php?request=033330&searchType=3
>> and
>> http://www.alanwood.net/unicode/fontsbyrange.html
>>
>> That's not to say that other codepoints that are valid
>> can't have
>> collisions of the 4/5 hex character type.  I went hunting
>> for one and
>> found U+1236/U+1236E in some codepoint tables (though Zvon
>> doesn't
>> know about U+1236E ...).  But, we have specs and conditions
>> on our
>> side.  First, we are working with bibliographic data which
>> is
>> restricted to a subset of the Unicode codepoint range
>> (specifically,
>> within the 16b limitation of Javascript (and UTF-16)), and
>> it's
>> required to always use combining characters, instead of
>> precomposed
>> characters, which because of the codepoint layout in
>> Unicode will
>> always be within the range our tools (uescape et al, and
>> outside code)
>> impose.  Now, that's a bit of a fib -- it's true
>> for USMARC, but might
>> not be true for UNIMARC, but there is a lot more work than
>> just this
>> required to support UNIMARC in Evergreen.
>
> I would like to think that OSRF would be usable in a context other than
> just Evergreen.  I would also like to think that a library could use
> Evergreen in Beijing, or Osaka, or Seoul, or wherever it is that they
> use characters from the astral planes.

Oh, we can deal with Kanji, Simplified Chinese and Korean now.  Those
languages have a surprisingly small number of glyphs, and they all fit
in the lower 64k.  They achieve this by using a base glyph plus
non-spacing combining marks for added strokes.

For instance, try out http://0xcc.net/jsescape/ with

什麽是 or 什么是  (Those may not display correctly unless your mail client
deals with unicode properly, natch.)

I've no idea what that text says (I suspect "Unicode" since it seems
to be from a translation list) but the codepoints are all well within
the 16b range.

>
>> > The site I have been relying on for information about
>> JSON:
>> >
>> >    http://www.json.org/
>> >
>> > ...says that a UTF-8 character should be encoded as
>> "\uxxxx", where each
>> > x is a hexadecimal digit.  Specifically it says FOUR
>> hexadecimal digits.
>> >
>> > The restriction to four hexadecimal characters works
>> only for two-byte
>> > UTF-8 characters.  It works for three-byte UTF-8
>> characters if you discard
>> > the length bits, leading to the kind of ambiguity
>> described above.  It
>> > can't possibly work for four-byte UTF-8
>> characters, unless you break up
>> > the four bytes into two two-byte sequences and encode
>> them separately.
>> >
>> > The same site is notably silent on exactly how the
>> encoding is to be done.
>> >
>> > Supposedly JSON is defined by RFC 4627:
>> >
>> >    http://tools.ietf.org/html/rfc4627
>> >
>> > In section 2.5 it says:
>> >
>> >   To escape an extended character that is not in the
>> Basic Multilingual
>> >   Plane, the character is represented as a
>> twelve-character sequence,
>> >   encoding the UTF-16 surrogate pair.  So, for
>> example, a string
>> >   containing only the G clef character (U+1D11E) may
>> be represented as
>> >   "\uD834\uDD1E".
>> >
>> > I haven't figured out exactly what that means, but
>> apparently a UTF-8
>> > character longer than two bytes needs to be
>> represented as two successive
>> > sets of four hex digits, each prefaced by
>> "\u".  That's not what we're
>> > doing.
>> >
>>
>> Specifically, it's characters that need more than 16
>> bits of data
>> space, which for UTF-8 means the 4-byte (and longer)
>> sequences.
>>
>> Another thing to consider is that because we're working
>> with XML in
>> most cases, characters that fall outside the ASCII range
>> are turned
>> into entities before being stored.  This steps right around
>> most of
>> the problems with direct UTF-8 encoding and JSON, but the
>> exception to
>> this is non-bibliographic data which could be made to
>> contain high
>> codepoints.  Short of building a surrogate pair map for
>> high-codepoint
>> Unicode planes I don't see a way to support such
>> characters.  While
>> it's conceivable that someone would want to put
>> characters from "bad"
>> planes into a library name label (or other non-bib string),
>> it is a
>> corner case restriction that I'm personally OK living
>> with lacking
>> another solution.
>
> I haven't figured out exactly how surrogate pairs work, but I wouldn't
> expect to need some kind of monster map from one character set to
> another.  I would expect that surrogate pairs are just a way to package
> arbitrary bit patterns (well, arbitrary within certain syntactic
> constraints) into a series of hex digits.  I would expect to do no more
> than a little bit-twiddling.  Would we really need to look things up in a
> map?
>

Right.  "Map" above implied the wrong thing ... It's a "double
encoding" of sorts, born of the limitation built into UTF-16, and
codified in many languages (most unfortunately for us: Javascript, the
birthplace of JSON).  If you have a codepoint (after turning the UTF-8
into an unsigned long) that won't fit in the valid 16-bit range then
you take the bit sequence and repackage it as 2 codepoints in a
particular way.
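
The repackaging itself is just the bit-twiddling you suspected.
Something like this (a sketch, checked against the RFC 4627 example
rather than lifted from any of our code):

    /* Split a codepoint above 0xFFFF into a UTF-16 surrogate pair. */
    static void to_surrogate_pair( unsigned long cp,
                                   unsigned int* hi, unsigned int* lo ) {
        cp -= 0x10000;                          /* 20 bits remain        */
        *hi = 0xD800 | ((cp >> 10) & 0x3FF);    /* high (lead) surrogate */
        *lo = 0xDC00 | (cp & 0x3FF);            /* low (trail) surrogate */
    }

Run the G clef (U+1D11E) through that and you get 0xD834 and 0xDD1E --
the "\uD834\uDD1E" from the RFC.  Going the other way is the same math
in reverse, if we ever do need the decode side.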

But, again, because OpenSRF is just shuffling bits around, I wonder if
just being able to emit surrogate pairs when we happen to run into a
larger-than-"allowed" UTF-8 sequence would be enough.  My gut is
telling me that we won't need the decoding end.  Testing will be
required.

>> Thanks, Scott, for digging into this.  And thanks for the
>> new
>> implementation of the encoding routine!  It's much more
>> readable and
>> direct.  I've always disliked the existing routine, and
>> your
>> replacement seems like a big improvement.
>>
>> --
>> Mike Rylander
>
> In a separate reply, Dan Scott suggested that we look at other people's
> libraries for dealing with UTF-8.  That's not a bad idea, if we want to
> change things.  However for performance reasons I'd like to continue
> to append the escaped text to a growing_buffer.  Nobody else's code will
> do that without modification or adaptation on our part.
>

Indeed.  I can't see us moving away from custom code, given the
momentum of and our dependency on objson.

> One library he identified was mjson, at:
>
>    http://sourceforge.net/projects/mjson/
>
> I looked briefly at it.  So far as I could tell from the source code,
> it makes no attempt to escape UTF-8 characters.  It just copies them
> without translation.  I haven't looked at the other sites yet.

I looked at http://simplejson.googlecode.com/svn/tags/simplejson-2.0.4/simplejson/_speedups.c
(search for "surrogate") and it looks promising.

>
> Of course a change to the encoding may require corresponding changes to
> the decoding.  I'm not sure where that happens.
>

I'm not sure that's true.  It may be, but I'm not certain.
Decomposing any high codepoints we output into surrogate pairs may be
all that's needed, working under the assumption that consumers will
handle surrogate pairs correctly if they need to process the data
textually.

> I don't think I'm ready to volunteer for a full-bore implementation of
> UTF-8.  It's clearly not an immediate priority.  However I'll keep it in
> the back of my mind.
>

Awesome, thanks!

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com

