[OPEN-ILS-DEV] Yet another function for uescaping UTF-8
Scott McKellar
mck9 at swbell.net
Fri Nov 28 00:31:43 EST 2008
A few days ago I submitted some experimental code for formatting UTF-8
characters into JSON. It was a drop-in replacement for the
buffer_append_uescape function, and produced almost identical results.
Now I have a new version of that code. Since the first version is not
in the repository trunk, I am attaching a full file rather than a patch.
The associated header file doesn't need to change.
This new version differs in the following way:
When it encounters a code point too big to fit into 16 bits (after
stripping out the packaging bits), it formats it into a surrogate pair
of four hex digits each, rather than a single set of five or six hex
digits.
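For readers without the attachment, the surrogate-pair arithmetic can be sketched roughly as follows. This is a minimal illustration, not the code in osrf_utf8.c; the function name to_surrogate_pair and its return convention are made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper (not from osrf_utf8.c): split a code point that
 * does not fit in 16 bits into a UTF-16 surrogate pair.
 * Returns 0 on success, -1 if cp is outside U+10000..U+10FFFF. */
int to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return -1;

    uint32_t v = cp - 0x10000;                 /* 20 significant bits remain */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits    */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
    return 0;
}
```

Each half is then written out as its own four-hex-digit \u escape.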
In addition, this new version no longer uses buffer_fadd() to format
hex values.
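One plausible way to format the hex digits without a printf-style call is a small lookup table. The sketch below only illustrates the idea; the name format_uescape, the fixed seven-byte output buffer, and the choice of upper-case digits are assumptions for this example and may not match what the attached file does.

```c
#include <stdint.h>

/* Hypothetical helper (not from osrf_utf8.c): write the six characters
 * "\uXXXX" followed by a terminating NUL into out, which must hold
 * at least 7 bytes. */
void format_uescape(uint16_t unit, char out[7])
{
    static const char hex[] = "0123456789ABCDEF";

    out[0] = '\\';
    out[1] = 'u';
    out[2] = hex[(unit >> 12) & 0xF];  /* most significant nibble first */
    out[3] = hex[(unit >> 8) & 0xF];
    out[4] = hex[(unit >> 4) & 0xF];
    out[5] = hex[unit & 0xF];
    out[6] = '\0';
}
```

The caller would append the resulting string to the output buffer once per 16-bit unit, so a surrogate pair takes two calls.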
The code for constructing surrogate pairs is a slightly simplified version
of a code snippet found at:
http://www.unicode.org/faq/utf_bom.html
The code snippet seems to come from a pretty authoritative source, and
my modifications were minimal, consisting mostly of collecting a couple
of constant expressions into constant values.
In the case of the G clef character (U+1D11E), I verified that my code
translates it to the correct surrogate pair ("\uD834\uDD1E").
Unfortunately, that's the only character for which I know both the code
point and the corresponding surrogate pair. My Google-fu has failed me.
If someone can provide a sample of code points and the corresponding
surrogate pairs, I can do some more testing to make sure that I'm getting
the right answers.
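Short of a reference table, one check that might help is a round trip: encode every code point from U+10000 through U+10FFFF, confirm each half lands in its surrogate range, and decode the pair back to the original value. The sketch below assumes the standard UTF-16 arithmetic; the names encode_pair, decode_pair and count_round_trip_failures are invented for this example and do not appear in osrf_utf8.c.

```c
#include <stdint.h>

/* Hypothetical encoder: cp is assumed to be in U+10000..U+10FFFF. */
void encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000;
    *high = (uint16_t)(0xD800 + (v >> 10));
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));
}

/* Hypothetical decoder: the inverse of encode_pair. */
uint32_t decode_pair(uint16_t high, uint16_t low)
{
    return 0x10000
         + (((uint32_t)high - 0xD800) << 10)
         + ((uint32_t)low - 0xDC00);
}

/* Count the code points that fail the round trip or whose halves
 * fall outside the high (D800..DBFF) or low (DC00..DFFF) range. */
long count_round_trip_failures(void)
{
    long failures = 0;
    for (uint32_t cp = 0x10000; cp <= 0x10FFFF; cp++) {
        uint16_t hi, lo;
        encode_pair(cp, &hi, &lo);
        if (hi < 0xD800 || hi > 0xDBFF ||
            lo < 0xDC00 || lo > 0xDFFF ||
            decode_pair(hi, lo) != cp)
            failures++;
    }
    return failures;
}
```

A round trip only shows that the encoder and decoder agree with each other, so it complements rather than replaces a comparison against independently known pairs. The two endpoints follow directly from the arithmetic: U+10000 should give D800 DC00, and U+10FFFF should give DBFF DFFF.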
Scott McKellar
http://home.swbell.net/mck9/ct/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: osrf_utf8.c
Type: text/x-csrc
Size: 19295 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20081127/2b028aca/attachment.c