[OPEN-ILS-DEV] Another new function for uescaping UTF-8 strings

Sun Nov 23 00:30:59 EST 2008

The attached files contain a drop-in replacement for the 
buffer_append_uescape function that I submitted a few days ago.  I regard
this new one as experimental, at least for now.

They also offer some byte-testing functions, and the equivalent macros, 
that may be useful in other code that deals with UTF-8 strings.

The new function buffer_append_utf8() differs from buffer_append_uescape()
in the following ways:

1. It treats 0x7F (DEL) as a control character, which it is.

2. It is more finicky about recognizing the header byte of multibyte
characters.  For example, 0xF6 is not a valid UTF-8 header byte.

3. When it sees a nul byte in the middle of a multibyte character, it 
stops.  In the same situation, the older buffer_append_uescape() and
uescape() functions accumulate the nul byte into the hex codes they
build and then keep going, risking not only misbehavior but undefined
behavior.

4. When it finds invalid UTF-8 characters in the input string, it skips
over the invalid UTF-8 until it finds a valid character, and then
continues to translate the rest.  In other words it excises the garbage
and translates the rest intact.
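
To make points 1 through 3 concrete, here is a minimal sketch of the kind
of byte tests involved.  It is not the code in the attached files: the
sk_ names are hypothetical, and the header-byte ranges are an assumption
based on the current UTF-8 definition (RFC 3629).  It also ignores finer
points such as overlong encodings and surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Point 1: 0x7F (DEL) counts as a control character. */
int sk_is_control(unsigned char c)
{
    return c < 0x20 || c == 0x7F;
}

/* Point 2: return the total length implied by a header byte,
   or 0 if the byte is not a valid header (assumed ranges). */
int sk_header_len(unsigned char c)
{
    if (c >= 0xC2 && c <= 0xDF) return 2;
    if (c >= 0xE0 && c <= 0xEF) return 3;
    if (c >= 0xF0 && c <= 0xF4) return 4;
    return 0;   /* includes 0xF6 */
}

/* Point 3: return the length in bytes of the character at s, or 0 if
   it is invalid or cut short by a nul.  A nul fails the continuation
   test, so we stop there and never read past the terminator. */
size_t sk_char_len(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;

    if (p[0] == '\0') return 0;
    if (p[0] < 0x80)  return 1;

    int len = sk_header_len(p[0]);
    if (len == 0) return 0;

    for (int i = 1; i < len; ++i)
        if (p[i] < 0x80 || p[i] > 0xBF)
            return 0;
    return (size_t)len;
}
```

With these definitions, sk_char_len() reports 0 for a three-byte
character whose string ends after the second byte, rather than folding
the nul into the result.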

---------

The file osrf_utf8.c includes an array of bitmasks that it uses to look
up the characteristics of each byte.  Not trusting myself to do that
much tedious typing by hand, I wrote a program to write the list of 
bitmasks.  The byte-testing macros are broadly similar to the standard C
functions isprint(), isalpha(), and so forth.
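
A table generator of this sort might look roughly like the sketch below.
This is an illustration only, not the program actually used: the bit
names, the bit assignments, and the table name sk_utf8_mask are all
hypothetical, and the byte ranges are assumed from RFC 3629.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical bit assignments; the real osrf_utf8.c may differ. */
enum {
    SK_CONTROL  = 0x01,   /* ASCII control character, including 0x7F */
    SK_PRINT    = 0x02,   /* printable ASCII, 0x20 through 0x7E */
    SK_CONTINUE = 0x04,   /* continuation byte, 0x80 through 0xBF */
    SK_HEAD2    = 0x08,   /* header of a 2-byte character */
    SK_HEAD3    = 0x10,   /* header of a 3-byte character */
    SK_HEAD4    = 0x20    /* header of a 4-byte character */
};

/* Compute the bitmask for one byte value. */
unsigned sk_mask_for(unsigned c)
{
    assert(c < 256);
    if (c < 0x20 || c == 0x7F)  return SK_CONTROL;
    if (c < 0x7F)               return SK_PRINT;
    if (c <= 0xBF)              return SK_CONTINUE;
    if (c >= 0xC2 && c <= 0xDF) return SK_HEAD2;
    if (c >= 0xE0 && c <= 0xEF) return SK_HEAD3;
    if (c >= 0xF0 && c <= 0xF4) return SK_HEAD4;
    return 0;   /* never valid in UTF-8 under these assumptions */
}

/* Write the table as C source, eight entries per line. */
void sk_write_table(FILE *out)
{
    fprintf(out, "static const unsigned char sk_utf8_mask[256] = {\n");
    for (unsigned c = 0; c < 256; ++c) {
        if (c % 8 == 0) fputc('\t', out);
        fprintf(out, "0x%02X", sk_mask_for(c));
        if (c != 255)   fputc(',', out);
        fputc(c % 8 == 7 ? '\n' : ' ', out);
    }
    fprintf(out, "};\n");
}
```

Calling sk_write_table(stdout) and pasting the output into a source file
avoids typing 256 entries by hand.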

There is also a collection of functions, equivalent to the macros, with
the same names except using double underscores.  These may never find a
use, but they're there in case anyone ever needs a function pointer for
some reason.
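
As an illustration of why a function form can be handy, consider the
sketch below.  The names are hypothetical stand-ins, and the table is
filled at run time here only to keep the example short.  A macro has no
address, so only the function can be passed to code that expects a
predicate.

```c
#include <assert.h>
#include <stddef.h>

#define SK_CONTROL 0x01           /* hypothetical bit assignment */

static unsigned char sk_mask[256];  /* stand-in for the generated table */

/* Fill the stand-in table; only the control bit is modeled here. */
void sk_init_mask(void)
{
    for (int c = 0; c < 256; ++c)
        sk_mask[c] = (c < 0x20 || c == 0x7F) ? SK_CONTROL : 0;
}

/* Macro form: a table lookup with no call overhead. */
#define sk_is_utf8_control(c) (sk_mask[(c) & 0xFF] & SK_CONTROL)

/* Function form: same test, named with double underscores. */
int sk_is__utf8__control(int c)
{
    return sk_is_utf8_control(c);
}

/* Count the bytes of s that satisfy a predicate passed by pointer. */
size_t sk_count_if(const char *s, int (*pred)(int))
{
    assert(s != NULL && pred != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s)
        if (pred((unsigned char)*s))
            ++n;
    return n;
}
```

After sk_init_mask(), a call such as
sk_count_if(text, sk_is__utf8__control) counts the control characters
in text.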

The logic uses a finite state machine (FSM) to examine and dispatch each
byte in the input stream.  Because it needs to branch on the current
state as well as the type of each character, this logic is a little
slower than buffer_append_uescape().  However, almost any implementation
of the same behavior would probably incur similar overhead in some
form.
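
For readers unfamiliar with the technique, the following is a much
simplified sketch of such a state machine.  It is not the attached
implementation: the names are hypothetical, the header-byte ranges are
assumed from RFC 3629, and it does not reject overlong encodings or
surrogates.  It only decodes code points into an array and leaves the
escaping step out.

```c
#include <assert.h>
#include <stddef.h>

/* The state records how many continuation bytes are still expected. */
enum sk_state { SK_START, SK_NEED1, SK_NEED2, SK_NEED3 };

/* Decode at most max code points from the nul-terminated string in,
   skipping invalid bytes.  Returns the number of code points stored. */
size_t sk_utf8_decode(const char *in, unsigned long *out, size_t max)
{
    assert(in != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)in;
    enum sk_state state = SK_START;
    unsigned long acc = 0;
    size_t n = 0;

    /* A nul always ends the loop, even in mid-character; any partial
       character accumulated so far is simply dropped. */
    while (*p != '\0' && n < max) {
        unsigned char c = *p;

        if (state == SK_START) {
            if (c < 0x80) {
                out[n++] = c;                       /* plain ASCII */
            } else if (c >= 0xC2 && c <= 0xDF) {
                acc = c & 0x1F; state = SK_NEED1;   /* 2-byte header */
            } else if (c >= 0xE0 && c <= 0xEF) {
                acc = c & 0x0F; state = SK_NEED2;   /* 3-byte header */
            } else if (c >= 0xF0 && c <= 0xF4) {
                acc = c & 0x07; state = SK_NEED3;   /* 4-byte header */
            }
            /* Anything else is garbage and is skipped. */
            ++p;
        } else if (c >= 0x80 && c <= 0xBF) {
            /* Expected continuation byte: fold in its six bits. */
            acc = (acc << 6) | (c & 0x3F);
            state = (enum sk_state)(state - 1);
            if (state == SK_START)
                out[n++] = acc;
            ++p;
        } else {
            /* Broken character: discard it and re-examine this
               byte from the start state without advancing. */
            state = SK_START;
        }
    }
    return n;
}
```

Each pass through the loop branches first on the state and then on the
kind of byte, which is where the extra cost comes from.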

-------------

Please note that this new function does *not* address the concerns I wrote
about in my previous post -- namely the fact that both uescape() and
buffer_append_uescape() create ambiguities that no decoding scheme can
resolve.  However, because the FSM logic systematically recognizes
every possible situation, it should be a straightforward matter to 
implement a new set of rules, once we decide what those rules should be.

Scott McKellar
http://home.swbell.net/mck9/ct/

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license indicated
in the file; or

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source license
and I have the right under that license to submit that work with
modifications, whether created in whole or in part by me, under the
same open source license (unless I am permitted to submit under a
different license), as indicated in the file; or

(c) The contribution was provided directly to me by some other person
who certified (a), (b) or (c) and I have not modified it; and

(d) In the case of each of (a), (b), or (c), I understand and agree
that this project and the contribution are public and that a record
of the contribution (including all personal information I submit
with it, including my sign-off) is maintained indefinitely and may
be redistributed consistent with this project or the open source
license indicated in the file.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: osrf_utf8.c
Type: text/x-csrc
Size: 17248 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20081122/06069e9d/attachment-0001.c 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: osrf_utf8.h
Type: text/x-chdr
Size: 1833 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20081122/06069e9d/attachment-0001.h