[OPEN-ILS-DEV] Encoding UTF-8

Scott McKellar mck9 at swbell.net
Sun Nov 23 23:57:20 EST 2008


--- On Sun, 11/23/08, Mike Rylander <mrylander at gmail.com> wrote:

<snip>
 
> I looked at
> http://simplejson.googlecode.com/svn/tags/simplejson-2.0.4/simplejson/_speedups.c
> (search for "surrogate") and it looks promising.

I looked at it briefly -- it looks like a more mature effort than
mjson.  But I haven't yet looked closely enough to figure out how it
works.

<snip>

> Decomposing all data we output to surrogate pairs may be all that's
> needed, working under the assumption that consumers will handle
> surrogate pairs correctly if they need to process the data textually.

I know how to do the following -- in fact I'm already doing it:

1. Recognize a header byte.

2. Determine from the header byte how many continuation bytes there are.

3. Eliminate the length bits from the header, and the most significant
two bits from each of the continuation bytes, and reassemble the
payload as a binary number.  That's the code point.

4. If the code point is small enough to fit into four hex digits,
then render it in the format "\uxxxx".
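
The four steps above can be sketched in C roughly as follows.  This is
only an illustration, not the code currently in the tree: the names
decode_utf8 and format_bmp_escape are hypothetical, the input is assumed
to be a NUL-terminated byte string, and overlong encodings are not
rejected.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper for steps 1-3: decode one UTF-8 sequence at s.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the header byte or a continuation byte is invalid. */
static size_t decode_utf8(const unsigned char *s, unsigned long *cp)
{
    size_t extra, i;

    /* Steps 1 and 2: classify the header byte and strip its length bits. */
    if (s[0] < 0x80)                { *cp = s[0];        extra = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; extra = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; extra = 3; }
    else return 0;  /* stray continuation byte or invalid header */

    /* Step 3: append the low six bits of each continuation byte. */
    for (i = 1; i <= extra; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;  /* not a continuation byte (also catches NUL) */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return extra + 1;
}

/* Hypothetical helper for step 4: write "\uxxxx" into buf when the code
 * point fits in four hex digits.  Returns 1 on success, 0 otherwise. */
static int format_bmp_escape(unsigned long cp, char *buf, size_t size)
{
    if (cp > 0xFFFF || size < 7)
        return 0;
    snprintf(buf, size, "\\u%04lx", cp);
    return 1;
}
```

For example, the two bytes 0xC3 0xA9 decode to code point 0xE9, which
renders as "\u00e9".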

What I don't know is what to do if the code point is too big to fit into
four hex digits.  What we do today is mangle it into five hex digits,
in the format "\uxxxxx", or conceivably even six hex digits.  The result
is malformed JSON, as I understand it.

What we *should* do is to format it as two groups of four hex digits each,
i.e. "\uxxxx\uxxxx", which is called a "surrogate pair."

I expect that it's just a matter of bit twiddling, and it may be very
simple.  If the code point is 0x12345678, then the surrogate pair may
just be "\u1234\u5678", or conceivably "\u5678\u1234", or maybe something
else entirely.  I just need to know what the rules are.
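
For reference, the rule comes from UTF-16, which JSON borrows for its
escapes (RFC 4627, section 2.5).  Code points stop at 0x10FFFF, so a
value like 0x12345678 cannot occur.  For a code point above 0xFFFF,
subtract 0x10000 to get a 20-bit value; the high ten bits added to
0xD800 form the first escape, and the low ten bits added to 0xDC00 form
the second.  Below is a minimal sketch of that arithmetic in C; the
function name format_json_escape is hypothetical and is not part of any
existing library.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write the JSON escape for code point cp into buf.
 * Returns the number of characters written, or 0 if cp cannot be encoded
 * (above U+10FFFF, or inside the surrogate range) or buf is too small. */
static int format_json_escape(unsigned long cp, char *buf, size_t size)
{
    int n;

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp <= 0xFFFF) {
        /* Fits in four hex digits: a single escape. */
        n = snprintf(buf, size, "\\u%04lx", cp);
    } else {
        unsigned long v  = cp - 0x10000;          /* 20-bit value  */
        unsigned long hi = 0xD800 + (v >> 10);    /* high ten bits */
        unsigned long lo = 0xDC00 + (v & 0x3FF);  /* low ten bits  */
        n = snprintf(buf, size, "\\u%04lx\\u%04lx", hi, lo);
    }
    return (n > 0 && (size_t)n < size) ? n : 0;
}
```

With this, code point 0x1D11E (the example used in RFC 4627) comes out
as "\ud834\udd1e".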

Scott McKellar
http://home.swbell.net/mck9/ct/


