[OPEN-ILS-DEV] Encoding UTF-8
Dan Scott
dan at coffeecode.net
Sun Nov 23 13:49:26 EST 2008
Ugh, my half-finished reply was sent accidentally. Please see below for
the addition of the UTF-8 valid encoding table from the Unicode standard
(which is rather different from the Wikipedia entry).
On Fri, 2008-11-21 at 12:58 -0800, Scott McKellar wrote:
> I don't think that the way uescape() encodes UTF-8 characters is correct.
> It creates ambiguities for anybody trying to reverse the encoding.
>
> Consider the following strings:
>
> const unsigned char utf_2a[] = { 0xCF, 0xBF, '\0' };
> const unsigned char utf_3a[] = { 0xE0, 0x8F, 0xBF, '\0' };
>
> The first is a two-byte UTF-8 character, and the second is a three-byte
> UTF-8 character. I don't know what kind of characters they represent,
> if any, because I hand-crafted them for my example. However each has
> a valid UTF-8 format, so far as I can tell from the Wikipedia page at:
>
> http://en.wikipedia.org/wiki/UTF-8
Wikipedia led you astray; the canonical table from the Unicode 5.0
standard (page 103, chapter 3, available in PDF form from
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf) says:
Table 3-7. Well-Formed UTF-8 Byte Sequences

Code Points          First Byte  Second Byte  Third Byte  Fourth Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF      80..BF
U+0800..U+0FFF       E0          A0..BF       80..BF
U+1000..U+CFFF       E1..EC      80..BF       80..BF
U+D000..U+D7FF       ED          80..9F       80..BF
U+E000..U+FFFF       EE..EF      80..BF       80..BF
U+10000..U+3FFFF     F0          90..BF       80..BF      80..BF
U+40000..U+FFFFF     F1..F3      80..BF       80..BF      80..BF
U+100000..U+10FFFF   F4          80..8F       80..BF      80..BF
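So "E0 8F BF" is not a well-formed UTF-8 sequence in any case: a first
byte of E0 requires a second byte in the range A0..BF, and 8F falls
outside it. (It would be an overlong encoding of U+03FF, the code point
that "CF BF" already encodes.)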
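For what it's worth, here is a rough sketch of how Table 3-7 could be
turned into a check in C. The names utf8_is_well_formed() and in_range()
are made up for this example and are not part of the existing code; it
only validates a byte buffer and says nothing about how uescape() should
encode its output.

```c
#include <stddef.h>

/* Return 1 if lo <= b <= hi, 0 otherwise. */
static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/*
 * Hypothetical helper: return 1 if the first len bytes of s are
 * well-formed UTF-8 according to Table 3-7 of the Unicode 5.0 standard,
 * 0 otherwise.
 */
int utf8_is_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b0 = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range, second byte */
        size_t trail;                       /* number of bytes after b0 */
        size_t k;

        if (b0 <= 0x7F) {                   /* U+0000..U+007F */
            i++;
            continue;
        } else if (in_range(b0, 0xC2, 0xDF)) {
            trail = 1;                      /* U+0080..U+07FF */
        } else if (b0 == 0xE0) {
            trail = 2; lo = 0xA0;           /* U+0800..U+0FFF */
        } else if (in_range(b0, 0xE1, 0xEC)) {
            trail = 2;                      /* U+1000..U+CFFF */
        } else if (b0 == 0xED) {
            trail = 2; hi = 0x9F;           /* U+D000..U+D7FF */
        } else if (in_range(b0, 0xEE, 0xEF)) {
            trail = 2;                      /* U+E000..U+FFFF */
        } else if (b0 == 0xF0) {
            trail = 3; lo = 0x90;           /* U+10000..U+3FFFF */
        } else if (in_range(b0, 0xF1, 0xF3)) {
            trail = 3;                      /* U+40000..U+FFFFF */
        } else if (b0 == 0xF4) {
            trail = 3; hi = 0x8F;           /* U+100000..U+10FFFF */
        } else {
            return 0;   /* 80..C1 and F5..FF never start a sequence */
        }

        if (len - i <= trail)
            return 0;   /* sequence is cut off by the end of the buffer */

        /* Only the second byte has a range that depends on the first. */
        if (!in_range(s[i + 1], lo, hi))
            return 0;

        /* Third and fourth bytes, if any, are always 80..BF. */
        for (k = 2; k <= trail; k++) {
            if (!in_range(s[i + k], 0x80, 0xBF))
                return 0;
        }

        i += trail + 1;
    }

    return 1;
}
```

With Scott's two examples, the bytes CF BF should pass this check and the
bytes E0 8F BF should be rejected at the second byte.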