[OPEN-ILS-DEV] Yet another function for uescaping UTF-8

Scott McKellar mck9 at swbell.net
Fri Nov 28 00:31:43 EST 2008


A few days ago I submitted some experimental code for formatting UTF-8
characters into JSON.  It was a drop-in replacement for the
buffer_append_uescape function, and produced almost identical results.

Now I have a new version of that code.  Since the first version is not
in the repository trunk, I am attaching a full file rather than a patch.
The associated header file doesn't need to change.

This new version differs in the following way:

When it encounters a code point too big to fit into 16 bits (after
stripping out the packaging bits), it formats it as a surrogate pair,
each half written as four hex digits, rather than as a single set of
five or six hex digits.
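
To illustrate what stripping out the packaging bits involves, here is a
minimal sketch, not taken from osrf_utf8.c.  The names utf8_decode and
needs_surrogate_pair are hypothetical, and the sketch assumes a
NUL-terminated input and does not reject overlong encodings.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from osrf_utf8.c.
 * Decodes the UTF-8 sequence starting at s and returns its code point,
 * or -1 if the lead byte or a continuation byte is malformed.
 * If len is not NULL, it receives the number of bytes consumed.
 * Overlong encodings and encoded surrogates are not rejected here. */
static long utf8_decode(const unsigned char *s, size_t *len)
{
    size_t n, i;
    long cp;

    if (s[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII */
        n = 1; cp = s[0];
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 5 payload bits */
        n = 2; cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 4 payload bits */
        n = 3; cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 3 payload bits */
        n = 4; cp = s[0] & 0x07;
    } else {
        return -1;                        /* stray continuation or bad lead */
    }

    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)        /* continuation must be 10xxxxxx */
            return -1;
        cp = (cp << 6) | (s[i] & 0x3F);   /* append 6 payload bits */
    }

    if (len != NULL)
        *len = n;
    return cp;
}

/* Nonzero when the code point does not fit into 16 bits and therefore
 * needs a surrogate pair in a JSON \u escape. */
static int needs_surrogate_pair(long cp)
{
    return cp > 0xFFFF;
}
```

For the G clef, whose UTF-8 encoding is the four bytes F0 9D 84 9E, this
sketch yields 0x1D11E, which is too big for a single \uXXXX escape.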

In addition, this new version no longer uses buffer_fadd() to format
hex values.
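
One way to format four hex digits without a printf-style call is a
lookup table indexed by each nibble.  The sketch below is an assumption
about the general approach, not the code in osrf_utf8.c; the name
format_u_escape is hypothetical, and the choice of uppercase digits
simply matches the example escape quoted later in this message.

```c
/* Hypothetical helper, not taken from osrf_utf8.c.
 * Writes the six characters of a JSON escape ("\uXXXX") for a 16-bit
 * value into out, followed by a terminating NUL.  The caller must
 * supply at least 7 bytes.  Returns out for convenience. */
static char *format_u_escape(unsigned int unit, char *out)
{
    static const char hex_digits[] = "0123456789ABCDEF";

    out[0] = '\\';
    out[1] = 'u';
    out[2] = hex_digits[(unit >> 12) & 0xF];  /* most significant nibble */
    out[3] = hex_digits[(unit >>  8) & 0xF];
    out[4] = hex_digits[(unit >>  4) & 0xF];
    out[5] = hex_digits[ unit        & 0xF];  /* least significant nibble */
    out[6] = '\0';
    return out;
}
```

Calling it once for each half of a surrogate pair produces the two
escapes back to back.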

The code for constructing surrogate pairs is a slightly simplified version
of a code snippet found at:

    http://www.unicode.org/faq/utf_bom.html

The code snippet seems to come from a pretty authoritative source, and
my modifications were minimal, consisting mostly of collecting a couple
of constant expressions into constant values.

In the case of the G clef character (U+1D11E), I verified that my code
translates it to the correct surrogate pair ("\uD834\uDD1E").
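
The arithmetic can be sketched as follows.  This is modeled on the
Unicode FAQ approach, with the expression 0xD800 - (0x10000 >> 10)
folded into the single constant 0xD7C0; it is not the attached code,
and the names LEAD_OFFSET, lead_surrogate and trail_surrogate are
chosen here for illustration.  It assumes the input lies in the range
0x10000 through 0x10FFFF.

```c
/* 0xD800 - (0x10000 >> 10), folded into one constant. */
#define LEAD_OFFSET 0xD7C0UL

/* Hypothetical helper: first (high) half of the surrogate pair.
 * Adding LEAD_OFFSET both subtracts the 0x10000 bias (shifted down
 * by 10 bits) and adds the 0xD800 base in a single step. */
static unsigned int lead_surrogate(unsigned long cp)
{
    return (unsigned int)(LEAD_OFFSET + (cp >> 10));
}

/* Hypothetical helper: second (low) half of the surrogate pair.
 * The low 10 bits are unaffected by the 0x10000 bias, so they are
 * simply added to the 0xDC00 base. */
static unsigned int trail_surrogate(unsigned long cp)
{
    return (unsigned int)(0xDC00UL + (cp & 0x3FFUL));
}
```

For U+1D11E this gives 0xD7C0 + 0x74 = 0xD834 for the lead and
0xDC00 + 0x11E = 0xDD1E for the trail.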

Unfortunately that's the only character for which I know both the code
point and the corresponding surrogate pair.  My Google fu has failed me.
If someone can provide a sample of code points and the corresponding
surrogate pairs, I can do some more testing to make sure that I'm getting
the right answers.
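
Additional test vectors can be derived by hand from the UTF-16
definition: subtract 0x10000, then split the remaining 20 bits into two
10-bit halves.  The sketch below holds a few such pairs, including the
two boundary cases, and checks each one in both directions.  The
structure and function names are hypothetical, and the check uses the
subtraction form of the arithmetic rather than folded constants.

```c
#include <stddef.h>

/* One code point with its expected UTF-16 surrogate pair. */
struct surrogate_vector {
    unsigned long cp;
    unsigned int  lead;
    unsigned int  trail;
};

static const struct surrogate_vector vectors[] = {
    { 0x10000UL,  0xD800, 0xDC00 },   /* lowest supplementary code point */
    { 0x10437UL,  0xD801, 0xDC37 },
    { 0x1D11EUL,  0xD834, 0xDD1E },   /* G clef */
    { 0x24B62UL,  0xD852, 0xDF62 },
    { 0x10FFFFUL, 0xDBFF, 0xDFFF }    /* highest code point */
};

/* Returns the number of table entries that fail either direction:
 * code point to pair, or pair back to code point. */
static size_t count_bad_vectors(void)
{
    size_t i, bad = 0;

    for (i = 0; i < sizeof vectors / sizeof vectors[0]; i++) {
        const struct surrogate_vector *t = &vectors[i];

        /* Forward: remove the bias, then split into 10-bit halves. */
        unsigned long v     = t->cp - 0x10000UL;
        unsigned int  lead  = (unsigned int)(0xD800UL + (v >> 10));
        unsigned int  trail = (unsigned int)(0xDC00UL + (v & 0x3FFUL));

        /* Reverse: rebuild the code point from the expected pair. */
        unsigned long back = 0x10000UL
            + ((unsigned long)(t->lead - 0xD800U) << 10)
            + (unsigned long)(t->trail - 0xDC00U);

        if (lead != t->lead || trail != t->trail || back != t->cp)
            bad++;
    }
    return bad;
}
```

A return value of zero means every entry agrees in both directions; the
same table could be fed to any other implementation as a cross-check.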

Scott McKellar
http://home.swbell.net/mck9/ct/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: osrf_utf8.c
Type: text/x-csrc
Size: 19295 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20081127/2b028aca/attachment.c 