[OPEN-ILS-DEV] Encoding UTF-8

Sun Nov 23 13:49:26 EST 2008

Ugh, my half-finished reply was sent accidentally. Please see below for
the addition of the UTF-8 valid encoding table from the Unicode standard
(which is rather different from the Wikipedia entry).

On Fri, 2008-11-21 at 12:58 -0800, Scott McKellar wrote:
> I don't think that the way uescape() encodes UTF-8 characters is correct.
> It creates ambiguities for anybody trying to reverse the encoding.
> 
> Consider the following strings:
> 
> const unsigned char utf_2a[] = { 0xCF, 0xBF, '\0' };
> const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF, '\0' };
> 
> The first is a two-byte UTF-8 character, and the second is a three-byte
> UTF-8 character.  I don't know what kind of characters they represent,
> if any, because I hand-crafted them for my example.  However each has
> a valid UTF-8 format, so far as I can tell from the Wikipedia page at:
> 
>     http://en.wikipedia.org/wiki/UTF-8

Wikipedia led you astray; the canonical table from the Unicode 5.0
standard (page 103, chapter 3, available in PDF form from
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf) says:

            Table 3-7. Well-Formed UTF-8 Byte Sequences
Code Points         First Byte  Second Byte  Third Byte  Fourth Byte
U+0000..U+007F      00..7F
U+0080..U+07FF      C2..DF      80..BF
U+0800..U+0FFF      E0          A0..BF       80..BF
U+1000..U+CFFF      E1..EC      80..BF       80..BF
U+D000..U+D7FF      ED          80..9F       80..BF
U+E000..U+FFFF      EE..EF      80..BF       80..BF
U+10000..U+3FFFF    F0          90..BF       80..BF      80..BF
U+40000..U+FFFFF    F1..F3      80..BF       80..BF      80..BF
U+100000..U+10FFFF  F4          80..8F       80..BF      80..BF

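So "E08FBF" is not a well-formed UTF-8 sequence, in any case: after a
first byte of E0, the second byte must fall in A0..BF, and 8F does not.
("CFBF", on the other hand, is fine; it matches the C2..DF row.)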
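
For anyone who wants to check sequences against Table 3-7 mechanically,
here is a rough sketch of a checker in C. The function name
utf8_seq_len() is made up for illustration; it is not part of
uescape() or anything else in the tree, and it only looks at one
sequence at a time.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the existing code base.
 * Returns the length (1..4) of the well-formed UTF-8 sequence that
 * starts at s, or 0 if the bytes match no row of Table 3-7.
 * len is the number of bytes available at s. */
static size_t utf8_seq_len(const unsigned char *s, size_t len)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for byte 2 */
    size_t need, i;

    if (len == 0)
        return 0;

    if (s[0] <= 0x7F) {
        return 1;                        /* U+0000..U+007F */
    } else if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        need = 2;                        /* U+0080..U+07FF */
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        need = 3;
        if (s[0] == 0xE0) lo = 0xA0;     /* E0 row: second byte A0..BF */
        if (s[0] == 0xED) hi = 0x9F;     /* ED row: second byte 80..9F */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        need = 4;
        if (s[0] == 0xF0) lo = 0x90;     /* F0 row: second byte 90..BF */
        if (s[0] == 0xF4) hi = 0x8F;     /* F4 row: second byte 80..8F */
    } else {
        return 0;                        /* 80..C1 and F5..FF never lead */
    }

    if (len < need)
        return 0;                        /* truncated sequence */
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < need; i++)           /* remaining bytes: 80..BF */
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;

    return need;
}
```

With Scott's two examples, utf8_seq_len(utf_2a, 2) returns 2 and
utf8_seq_len(utf_3a, 3) returns 0, which agrees with the table.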