[OPEN-ILS-DEV] Yet another function for uescaping UTF-8

Dan Scott dan at coffeecode.net
Fri Nov 28 11:52:09 EST 2008


On Thu, 2008-11-27 at 21:31 -0800, Scott McKellar wrote:
> A few days ago I submitted some experimental code for formatting UTF-8
> characters into JSON.  It was a drop-in replacement for the
> buffer_append_uescape function, and produced almost identical results.
> 
> Now I have a new version of that code.  Since the first version is not
> in the repository trunk, I am attaching a full file rather than a patch.
> The associated header file doesn't need to change.

Very cool, Scott. I'll drop it into a local build and give it a run in
the current OpenSRF / Evergreen trunk environment.

> 
> This new version differs in the following way:
> 
> When it encounters a code point too big to fit into 16 bits (after
> stripping out the packaging bits), it formats it into a surrogate pair
> of four hex digits each, rather than a single set of five or six hex
> digits.
> 
> In addition, this new version no longer uses buffer_fadd() to format
> hex values.
> 
> The code for constructing surrogate pairs is a slightly simplified version
> of a code snippet found at:
> 
>     http://www.unicode.org/faq/utf_bom.html
> 
> The code snippet seems to come from a pretty authoritative source, and
> my modifications were minimal, consisting mostly of collecting a couple
> of constant expressions into constant values.
> 
> In the case of the G clef character (U+1D11E), I verified that my code
> translates it to the correct surrogate pair ("\uD834\uDD1E").
> 
> Unfortunately that's the only character for which I know both the code
> point and the corresponding surrogate pair.  My Google fu has failed me.
> If someone can provide a sample of code points and the corresponding
> surrogate pairs, I can do some more testing to make sure that I'm getting
> the right answers.
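
For reference, the surrogate pair arithmetic described above can be
sketched in a few lines of Python 3. This follows the formula given in
the Unicode FAQ linked above; the function and constant names are my own
and are not taken from the attached C file, so treat it as an
illustration rather than the actual implementation:

```python
# Sketch only: names are illustrative, not from the attached C source.
# Constant folded from the Unicode FAQ expression 0xD800 - (0x10000 >> 10).
LEAD_OFFSET = 0xD800 - (0x10000 >> 10)  # 0xD7C0


def surrogate_pair(code_point):
    """Return the (lead, trail) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    lead = LEAD_OFFSET + (code_point >> 10)   # high 10 bits, offset into D800..DBFF
    trail = 0xDC00 + (code_point & 0x3FF)     # low 10 bits, offset into DC00..DFFF
    return lead, trail


def uescape(code_point):
    """Format the surrogate pair as two JSON-style \\uXXXX escapes."""
    lead, trail = surrogate_pair(code_point)
    return "\\u%04X\\u%04X" % (lead, trail)


if __name__ == "__main__":
    print(uescape(0x1D11E))  # the G clef case mentioned above
```

For the G clef, 0x1D11E >> 10 is 0x74 and 0x1D11E & 0x3FF is 0x11E, which
gives 0xD834 and 0xDD1E, matching the pair quoted above.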

I generated some examples for you using Python. The attached script may
help you generate other ranges; here's a snippet of what it produces for
the Ancient Greek Numbers range
(http://www.utf8-chartable.de/unicode-utf8-table.pl gives lots of
alternate representations):

65856 "\ud800\udd40" GREEK ACROPHONIC ATTIC ONE QUARTER
65857 "\ud800\udd41" GREEK ACROPHONIC ATTIC ONE HALF
65858 "\ud800\udd42" GREEK ACROPHONIC ATTIC ONE DRACHMA
65859 "\ud800\udd43" GREEK ACROPHONIC ATTIC FIVE
65860 "\ud800\udd44" GREEK ACROPHONIC ATTIC FIFTY
65861 "\ud800\udd45" GREEK ACROPHONIC ATTIC FIVE HUNDRED
65862 "\ud800\udd46" GREEK ACROPHONIC ATTIC FIVE THOUSAND
65863 "\ud800\udd47" GREEK ACROPHONIC ATTIC FIFTY THOUSAND
65864 "\ud800\udd48" GREEK ACROPHONIC ATTIC FIVE TALENTS
65865 "\ud800\udd49" GREEK ACROPHONIC ATTIC TEN TALENTS
65866 "\ud800\udd4a" GREEK ACROPHONIC ATTIC FIFTY TALENTS
65867 "\ud800\udd4b" GREEK ACROPHONIC ATTIC ONE HUNDRED TALENTS
65868 "\ud800\udd4c" GREEK ACROPHONIC ATTIC FIVE HUNDRED TALENTS
65869 "\ud800\udd4d" GREEK ACROPHONIC ATTIC ONE THOUSAND TALENTS
65870 "\ud800\udd4e" GREEK ACROPHONIC ATTIC FIVE THOUSAND TALENTS
65871 "\ud800\udd4f" GREEK ACROPHONIC ATTIC FIVE STATERS
65872 "\ud800\udd50" GREEK ACROPHONIC ATTIC TEN STATERS
65873 "\ud800\udd51" GREEK ACROPHONIC ATTIC FIFTY STATERS
65874 "\ud800\udd52" GREEK ACROPHONIC ATTIC ONE HUNDRED STATERS
65875 "\ud800\udd53" GREEK ACROPHONIC ATTIC FIVE HUNDRED STATERS
65876 "\ud800\udd54" GREEK ACROPHONIC ATTIC ONE THOUSAND STATERS
65877 "\ud800\udd55" GREEK ACROPHONIC ATTIC TEN THOUSAND STATERS
65878 "\ud800\udd56" GREEK ACROPHONIC ATTIC FIFTY THOUSAND STATERS
65879 "\ud800\udd57" GREEK ACROPHONIC ATTIC TEN MNAS
65880 "\ud800\udd58" GREEK ACROPHONIC HERAEUM ONE PLETHRON
65881 "\ud800\udd59" GREEK ACROPHONIC THESPIAN ONE
65882 "\ud800\udd5a" GREEK ACROPHONIC HERMIONIAN ONE
65883 "\ud800\udd5b" GREEK ACROPHONIC EPIDAUREAN TWO
65884 "\ud800\udd5c" GREEK ACROPHONIC THESPIAN TWO
65885 "\ud800\udd5d" GREEK ACROPHONIC CYRENAIC TWO DRACHMAS
65886 "\ud800\udd5e" GREEK ACROPHONIC EPIDAUREAN TWO DRACHMAS
65887 "\ud800\udd5f" GREEK ACROPHONIC TROEZENIAN FIVE
65888 "\ud800\udd60" GREEK ACROPHONIC TROEZENIAN TEN
65889 "\ud800\udd61" GREEK ACROPHONIC TROEZENIAN TEN ALTERNATE FORM
65890 "\ud800\udd62" GREEK ACROPHONIC HERMIONIAN TEN
65891 "\ud800\udd63" GREEK ACROPHONIC MESSENIAN TEN
65892 "\ud800\udd64" GREEK ACROPHONIC THESPIAN TEN
65893 "\ud800\udd65" GREEK ACROPHONIC THESPIAN THIRTY
65894 "\ud800\udd66" GREEK ACROPHONIC TROEZENIAN FIFTY
65895 "\ud800\udd67" GREEK ACROPHONIC TROEZENIAN FIFTY ALTERNATE FORM
65896 "\ud800\udd68" GREEK ACROPHONIC HERMIONIAN FIFTY
65897 "\ud800\udd69" GREEK ACROPHONIC THESPIAN FIFTY
65898 "\ud800\udd6a" GREEK ACROPHONIC THESPIAN ONE HUNDRED
65899 "\ud800\udd6b" GREEK ACROPHONIC THESPIAN THREE HUNDRED
65900 "\ud800\udd6c" GREEK ACROPHONIC EPIDAUREAN FIVE HUNDRED
65901 "\ud800\udd6d" GREEK ACROPHONIC TROEZENIAN FIVE HUNDRED
65902 "\ud800\udd6e" GREEK ACROPHONIC THESPIAN FIVE HUNDRED
65903 "\ud800\udd6f" GREEK ACROPHONIC CARYSTIAN FIVE HUNDRED
65904 "\ud800\udd70" GREEK ACROPHONIC NAXIAN FIVE HUNDRED
65905 "\ud800\udd71" GREEK ACROPHONIC THESPIAN ONE THOUSAND
65906 "\ud800\udd72" GREEK ACROPHONIC THESPIAN FIVE THOUSAND
65907 "\ud800\udd73" GREEK ACROPHONIC DELPHIC FIVE MNAS
65908 "\ud800\udd74" GREEK ACROPHONIC STRATIAN FIFTY MNAS
65909 "\ud800\udd75" GREEK ONE HALF SIGN
65910 "\ud800\udd76" GREEK ONE HALF SIGN ALTERNATE FORM
65911 "\ud800\udd77" GREEK TWO THIRDS SIGN
65912 "\ud800\udd78" GREEK THREE QUARTERS SIGN
65913 "\ud800\udd79" GREEK YEAR SIGN
65914 "\ud800\udd7a" GREEK TALENT SIGN
65915 "\ud800\udd7b" GREEK DRACHMA SIGN
65916 "\ud800\udd7c" GREEK OBOL SIGN
65917 "\ud800\udd7d" GREEK TWO OBOLS SIGN
65918 "\ud800\udd7e" GREEK THREE OBOLS SIGN
65919 "\ud800\udd7f" GREEK FOUR OBOLS SIGN
65920 "\ud800\udd80" GREEK FIVE OBOLS SIGN
65921 "\ud800\udd81" GREEK METRETES SIGN
65922 "\ud800\udd82" GREEK KYATHOS BASE SIGN
65923 "\ud800\udd83" GREEK LITRA SIGN
65924 "\ud800\udd84" GREEK OUNKIA SIGN
65925 "\ud800\udd85" GREEK XESTES SIGN
65926 "\ud800\udd86" GREEK ARTABE SIGN
65927 "\ud800\udd87" GREEK AROURA SIGN
65928 "\ud800\udd88" GREEK GRAMMA SIGN
65929 "\ud800\udd89" GREEK TRYBLION BASE SIGN
65930 "\ud800\udd8a" GREEK ZERO SIGN


Also, as a sanity check, I threw in a chunk of the musical symbols
range:

119060 "\ud834\udd14" MUSICAL SYMBOL BRACE
119061 "\ud834\udd15" MUSICAL SYMBOL BRACKET
119062 "\ud834\udd16" MUSICAL SYMBOL ONE-LINE STAFF
119063 "\ud834\udd17" MUSICAL SYMBOL TWO-LINE STAFF
119064 "\ud834\udd18" MUSICAL SYMBOL THREE-LINE STAFF
119065 "\ud834\udd19" MUSICAL SYMBOL FOUR-LINE STAFF
119066 "\ud834\udd1a" MUSICAL SYMBOL FIVE-LINE STAFF
119067 "\ud834\udd1b" MUSICAL SYMBOL SIX-LINE STAFF
119068 "\ud834\udd1c" MUSICAL SYMBOL SIX-STRING FRETBOARD
119069 "\ud834\udd1d" MUSICAL SYMBOL FOUR-STRING FRETBOARD
119070 "\ud834\udd1e" MUSICAL SYMBOL G CLEF
119071 "\ud834\udd1f" MUSICAL SYMBOL G CLEF OTTAVA ALTA
119072 "\ud834\udd20" MUSICAL SYMBOL G CLEF OTTAVA BASSA
119073 "\ud834\udd21" MUSICAL SYMBOL C CLEF
119074 "\ud834\udd22" MUSICAL SYMBOL F CLEF
119075 "\ud834\udd23" MUSICAL SYMBOL F CLEF OTTAVA ALTA
119076 "\ud834\udd24" MUSICAL SYMBOL F CLEF OTTAVA BASSA
119077 "\ud834\udd25" MUSICAL SYMBOL DRUM CLEF-1
119078 "\ud834\udd26" MUSICAL SYMBOL DRUM CLEF-2
119079 "\ud834\udd27" 

The "G CLEF" matches up, so it looks trustworthy to me.
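
In case the attachment doesn't survive the trip through the list, here
is a minimal Python 3 sketch that produces lines in the same format. It
is not the attached surrogates.py itself; the function name is made up
for illustration. It leans on Python's own UTF-16 codec rather than
doing the arithmetic by hand, so it works as an independent cross-check,
and the character names come from whatever Unicode database ships with
the interpreter:

```python
import unicodedata


def surrogate_line(code_point):
    """Return 'decimal "escaped pair" NAME' for one supplementary code point.

    Hypothetical helper for illustration; relies on the built-in
    utf-16-be codec to split the code point into surrogates.
    """
    units = chr(code_point).encode("utf-16-be")
    # Each 16-bit code unit is two bytes, big-endian.
    escaped = "".join(
        "\\u%02x%02x" % (units[i], units[i + 1])
        for i in range(0, len(units), 2)
    )
    # Unassigned code points have no name; fall back to an empty string.
    name = unicodedata.name(chr(code_point), "")
    return '%d "%s" %s' % (code_point, escaped, name)


if __name__ == "__main__":
    # Part of the musical symbols range shown above.
    for cp in range(0x1D114, 0x1D128):
        print(surrogate_line(cp))
```

Code points that have no assigned name come out with an empty name
field.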


-------------- next part --------------
A non-text attachment was scrubbed...
Name: surrogates.py
Type: text/x-python
Size: 477 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20081128/4fd902c7/attachment.py 

