[OPEN-ILS-DEV] PATCH: utils.[ch] (buffer_append_uescape)

Scott McKellar mck9 at swbell.net
Mon Nov 17 19:20:06 EST 2008


This patch adds a new function buffer_append_uescape(), a streamlined 
replacement for uescape().  The optimizations are as follows:

1. As described in an earlier post, the new function appends the uescaped
text to an existing growing_buffer, rather than into an internal one that
it must create and destroy.  It thereby saves two mallocs and two frees.
In addition, the calling code doesn't have to do a buffer_add().

2. Because it doesn't create an internal growing_buffer, it doesn't need 
to do a strlen() to determine how big a buffer to allocate.

3. Since the new version doesn't have a boolean parameter, it doesn't have 
to test the boolean.

4. For characters < 128 (i.e. ASCII characters), I rearranged the order of
the tests so that non-control characters are recognized immediately.  In 
uescape() we first go through a switch/case looking for several specific 
control characters like '\b' and '\n'.  In practice most characters are
not control characters, so this rearrangement saves a few CPU cycles.

5. For control characters, uescape() uses buffer_fadd() to format the
character into hex.  Now, buffer_fadd is slow because it uses vsnprintf() 
twice, once to determine a buffer size and once to do the formatting.  In
turn vsnprintf() is slow because it has to parse the format string.  In
this case we don't need vsnprintf() because we already know exactly how
big the buffer needs to be, and the formatting is simple.  I eliminated 
buffer_fadd() and formatted the hex manually with a little bit-twiddling.

Some of these optimizations could be applied to uescape(), but I haven't
bothered, partly because I wanted a clean comparison for benchmarking
purposes and partly because I expect uescape() to disappear from use
(though I am leaving it available).

--------------

Benchmark results:

For a six-character text string, the new function is faster than the old
one by a factor of about 2.7.  This speedup is attributable mostly to the
elimination of mallocs and frees.

For a long text string (276 characters) the new function is faster by a
factor of about 1.7.  Here the saving in mallocs and frees is diluted by
the extra work of processing more bytes.  I haven't tried to determine
how much of the speedup is attributable to the rearrangement of 
character tests.

For a string consisting entirely of 31 control characters, the new function
is faster by a factor of about 6.2, attributable in part to the elimination
of buffer_fadd().  This figure probably overstates the speedup, due to an
artifact of the benchmark.  Because the control characters expand to six times their original size, uescape() needs to expand its internal
growing_buffer, incurring a malloc and free on every call.  The new
function reuses the same growing_buffer, which once expanded doesn't need
to be expanded again.  I haven't tried to compensate for this artifact to
reflect a more realistic input string.

In case you want to play around with benchmarking, I attach my test
program tuencode.c.

--------------

buffer_append_uescape() returns:

    0 if successful;

    -1 if the pointer to growing_buffer is NULL;

    A positive integer if the function finds a character that is not valid
    UTF-8.  In this case the return value is the position of the invalid
    character in the input string, counting from 1.

The first byte of a valid UTF-8 character should begin with a binary
110, 1110, or 1111.  Bytes starting with binary 10 are reserved for the
second, third, or fourth byte.  If a byte starting with binary 10 occurs
where we expect the first byte of a multibyte sequence, that's an error.
Both uescape() and buffer_append_uescape() will detect it.

uescape() responds by returning NULL, thus discarding all of the input
processed so far, even though it was valid.  buffer_append_uescape()
responds a little differently -- it just stops, leaving the previously
translated characters in place.  I could nake it behave more like the
original by truncating the growing_buffer back to the starting point, but
it hardly seems worth the trouble.

Probably the best response to invalid UTF-8 is to skip over the invalid 
bytes and resume the translation with the first valid character.  At least
in this round I haven't been that ambitious.

There other kinds of invalid UTF-8 that neither function will catch.
For example, a first byte of 11000000 or 11110111 is invalid, according
to Wikipedia.

Please note that I haven't tried to test the error handling.  In fact I
haven't tested the UTF-8 handling at all,  It should work the same as
before, because I didn't change any of it except for the response to an
error.  You probably have some UTF-8 text readily to hand for testing.
I would have to tediously build test cases by hand, and it would be hard
to be sure that I had covered all the possibilities.  However I will do so
if you ask.

-------------

Scott McKellar
http://home.swbell.net/mck9/ct/

The following applies not only to these patches but to the patches I 
submitted over the last few days (when I forgot to include the DCO).

Developer's Certificate of Origin 1.1 By making a contribution to
this project, I certify that:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license indicated
in the file; or

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source license
and I have the right under that license to submit that work with
modifications, whether created in whole or in part by me, under the
same open source license (unless I am permitted to submit under a
different license), as indicated in the file; or

(c) The contribution was provided directly to me by some other person
who certified (a), (b) or (c) and I have not modified it; and

(d) In the case of each of (a), (b), or (c), I understand and agree
that this project and the contribution are public and that a record
of the contribution (including all personal information I submit
with it, including my sign-off) is maintained indefinitely and may
be redistributed consistent with this project or the open source
license indicated in the file.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: utils_c_10.patch
Type: text/x-patch
Size: 3112 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20081117/b7e43c50/attachment.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: utils_h_7.patch
Type: text/x-patch
Size: 408 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20081117/b7e43c50/attachment-0001.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tuencode.c
Type: text/x-csrc
Size: 2715 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20081117/b7e43c50/attachment.c 


More information about the Open-ils-dev mailing list