[OPEN-ILS-DEV] More unicode patches for SIPServer

Tue Jun 21 00:27:08 EDT 2011

On Mon, Jun 20, 2011 at 03:12:25PM -0400, Joe Atzberger wrote:
> On Fri, Jun 17, 2011 at 4:06 PM, Dan Scott <dan at coffeecode.net> wrote:
> 
> > I offer two more patches (being used here in production at Laurentian
> > University) for SIPServer:
> >
> > 1) Takes a bit more care in using decode_utf8() / encode_utf8()
> > consistently; the generally recommended approach is to decode input and
> > encode output.
> >
> > 2) Restore the old OpenNCIP checksum algorithm for handling Unicode over
> > the wire. This algorithm is necessary for our 3M V-series self-check to
> > work with the Unicode encoding enabled on the unit; I rather
> > painstakingly worked it out in late 2009 and submitted it to OpenNCIP
> > back then.
> >
> 
> 
> After spending many an hour on both the Koha and EG versions of SIP checksum
> code, the problems that exist are quite tricky and afaict, dependent on the
> SIPserver OS/perl setup.  Therefore saying a given code works with 3M
> hardware is necessary but not sufficient.

And your work has been greatly appreciated. Luckily, for checksums,
you've provided the basis of how we can shake out the remaining OS/Perl
differences in t/0001_checksum.t. Right now all test cases fail
miserably, but that's simply because Sip::Checksum::checksum() is
returning the integer value of the checksum rather than the hex value;
I've pushed a branch user/dbs/move_checksum_hexification to the
SIPServer working repo to move the hexification out of Sip::write_msg()
and back into Sip::Checksum::checksum() where it really belongs.
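
To make the integer-versus-hex point concrete, here is a minimal sketch
of the calculation as I understand it from the SIP2 definition: sum the
bytes up to and including the "AZ" identifier, keep the low 16 bits,
take the two's complement, and return four hex digits. It is written in
Python purely for illustration; it is not the code on the branch, and
the function name sip2_checksum is hypothetical.

```python
def sip2_checksum(message: bytes) -> str:
    """Return a SIP2-style checksum as four uppercase hex digits.

    `message` is assumed to be the raw bytes of the message up to and
    including the "AZ" field identifier (hypothetical helper, not the
    SIPServer implementation).
    """
    total = sum(message) & 0xFFFF      # 16-bit sum of the raw bytes
    negated = (-total) & 0xFFFF        # two's complement, kept to 16 bits
    return format(negated, "04X")      # hexify here, not in the caller


if __name__ == "__main__":
    # A caller (or a unit test) can compare the result directly with the
    # four characters that follow "AZ" on the wire.
    print(sip2_checksum(b"A"))
```

Returning the hex string from the checksum routine itself means a test
can compare the result with logged values without going through the
message-writing code.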

Unfortunately, even with that change, most of the existing test cases
still fail. Not surprisingly, if I add testcases from our production
server logs (64-bit Debian Lenny talking to a 3M V-series self-check),
those new testcases pass with flying colours. The new testcases also
pass on 32-bit Debian Squeeze and 64-bit Fedora 15, where the old
testcases fail as well.

Even when I roll all of the code back to the last commit that you made
to the repo, those old testcases still fail on 64-bit Debian Lenny,
32-bit Debian Lenny, 32-bit Debian Squeeze, and 64-bit Fedora 15. Your
commit message mentions that the testcases appear to be invalid and that
they came from some "guide"; I don't see any samples in the SIP2
Protocol Definition so I guess that's some other resource? In any case,
I think I concur with your original commit message, and would go further
to suggest that they should absolutely be removed from the unit tests
and replaced with actual examples from the wild - ideally identifying
what make and model of self-check or SIP client the examples came from.

> The main problems for us are:
> 
>    - 3M specifications in the Implementer's Handbook explicitly depend on
>    "ASCII values" and assume to know the underlying representation of values in
>    binary, including representation depth.

Right, it seems clear that 3M, at least, simply chose to use byte
semantics to calculate the checksum in the case of canonically
decomposed UTF-8 Unicode as a "mutually defined character set".
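
A small sketch of why byte semantics versus character semantics matters
for the checksum. It is in Python for illustration only, and both
function names are hypothetical. It sums a canonically decomposed "é"
once as UTF-8 bytes and once as code points; the two agree for pure
ASCII and diverge as soon as a non-ASCII character appears.

```python
import unicodedata


def checksum_bytes(text: str, encoding: str = "utf-8") -> str:
    """Checksum over the encoded bytes (byte semantics)."""
    total = sum(text.encode(encoding))
    return format((-total) & 0xFFFF, "04X")


def checksum_codepoints(text: str) -> str:
    """Checksum over code point values (character semantics)."""
    total = sum(ord(ch) for ch in text)
    return format((-total) & 0xFFFF, "04X")


if __name__ == "__main__":
    # NFD turns U+00E9 into 'e' followed by U+0301 (combining acute).
    decomposed = unicodedata.normalize("NFD", "\u00e9")
    print(len(decomposed), len(decomposed.encode("utf-8")))
    print(checksum_bytes(decomposed), checksum_codepoints(decomposed))
    # For plain ASCII the two approaches give the same answer.
    print(checksum_bytes("abc") == checksum_codepoints("abc"))
```

If the client sums bytes and the server sums characters (or the other
way round), every message containing a non-ASCII character is rejected
while ASCII-only traffic keeps working.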

>    - we have an insufficient body of actual tests (examples) for known-good
>    checksum calculation on strings like those in actual use: long ones, with
>    Unicode.
> 
> More tests, including long lines with Unicode are required.  I would be most
> happy to have some provided or verified by 3M, if possible.  The sad thing
> is that it *should* be possible to just have a webpage calculate in
> javascript even, but we end up still not knowing if it is right-enough.

Happily I can add plenty of examples from our logs where this is working
perfectly. I'll need to use a suitable dummy account to run through the
operations, but so far anything that I've pulled from our logs passes
with flying colours (not surprisingly).
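
As a sketch of how logged messages could be turned into checks, the
following recomputes the checksum of a captured line and compares it
with the four characters after "AZ". It is Python for illustration only;
the function names are hypothetical, byte semantics are assumed, and the
sample message is made up rather than taken from any log.

```python
def sip2_checksum(data: bytes) -> str:
    """Two's complement of the 16-bit byte sum, as four hex digits."""
    return format((-sum(data)) & 0xFFFF, "04X")


def checksum_ok(raw: bytes) -> bool:
    """Return True if the trailing four hex digits match the message.

    `raw` is assumed to be one message as captured on the wire, ending
    in "AZ" plus four hex digits, optionally followed by CR/LF.
    """
    raw = raw.rstrip(b"\r\n")
    body, claimed = raw[:-4], raw[-4:]
    if not body.endswith(b"AZ"):
        return False  # no checksum field where we expect one
    return sip2_checksum(body) == claimed.decode("ascii", "replace").upper()


if __name__ == "__main__":
    # Synthetic example: a made-up field string, NOT a real log entry.
    body = "AEJos\u00e9 Example|AY1AZ".encode("utf-8")
    sample = body + sip2_checksum(body).encode("ascii") + b"\r"
    print(checksum_ok(sample))  # True by construction
```

Each real captured message, recorded along with the make and model of
the client that sent it, would then become one test case.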

> > I have also attached "test_checksum.pl" to demonstrate the
> > observable difference between the old checksum and the new (I suppose
> > the right thing to do would be to roll this into the actual unit tests,
> > if there is general agreement that the checksum matches the reality of
> > more than just 3M V-series self-checks with Unicode encoding enabled).
> >
> 
> Right, my approach would be to begin building a proper CPAN module (in a
> namespace like Business::3MSIP) that would start out with just the
> dependencies and yet-to-be-established checksumming tests.  That way, we get
> the benefit of *all* the CPAN-testers different OS and perl configurations,
> without having to set them up ourselves or rely on users reporting.  (The
> nice thing about testing checksums is that it doesn't require a full running
> SIPserver.)

It would be good to have this available for a number of reasons, not
just testing. Not sure if anyone's got the time to do it, though. (And
hear, hear on the unit tests not requiring a full running SIPServer -
they're the only unit tests I have ever run, for exactly that reason).

> > If we discover that the Unicode checksum-handling differs between
> > various self-checks, then we may need to add yet another configuration
> > file option (sigh) to enable switching between the appropriate
> > algorithms. But hopefully %16C just works :)
> >
> 
> Certainly some old janky SIP clients do not speak Unicode.  One problem is
> that they won't have a way of reporting ASCII-centricness, because they
> expect it to be the default.  For EG users, I think these relics may be
> negligible since EG has always used Unicode throughout.

The option already exists for Evergreen users to force everything to
ASCII for those janky SIP clients via the SIP config file (and that is
in fact the default value in the Evergreen example SIP config file). The
onus remains on the system admin to set the encoding to "UTF-8".
Perhaps, for broader usage, that option needs to move into the core SIP
code.
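
A rough sketch of the "decode input, encode output" pattern with a
configurable wire encoding. It is Python for illustration only: the
function names are hypothetical, and the ASCII fallback shown here
(decompose, drop combining marks, replace whatever is left) is one
possible approach, not a description of what the Evergreen SIP code
does.

```python
import unicodedata


def read_field(raw: bytes, wire_encoding: str = "utf-8") -> str:
    """Decode bytes from the client once, at the input boundary."""
    return raw.decode(wire_encoding)


def write_field(text: str, wire_encoding: str = "ascii") -> bytes:
    """Encode text for the client once, at the output boundary."""
    if wire_encoding.lower() in ("ascii", "us-ascii"):
        # Assumed fallback for ASCII-only clients: strip accents via
        # NFD, then replace anything that still is not ASCII with '?'.
        decomposed = unicodedata.normalize("NFD", text)
        stripped = "".join(
            ch for ch in decomposed if not unicodedata.combining(ch)
        )
        return stripped.encode("ascii", errors="replace")
    return text.encode(wire_encoding)


if __name__ == "__main__":
    name = read_field(b"Jos\xc3\xa9")          # UTF-8 bytes in, text inside
    print(write_field(name))                    # ASCII-only client
    print(write_field(name, "utf-8"))           # Unicode-capable client
```

Any checksum would then be computed over the bytes returned by the
output step, so the encoding choice and the checksum cannot drift apart.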

> I'm also guessing that the timing of encode/decode relative to checksum
> calculation is pretty important.  Since the specs were written without
> regard for combining vs. composed characters, we are sorta on our own here.
>  All this trouble is for a data-integrity feature designed for damn serial
> cables and plainly unnecessary over already checksummed TCP.

Yep, you certainly have a point. We get TCP checksumming for free, and
that helps to ensure that line noise didn't introduce any corruption,
but I think you might be putting too much faith in TCP; see for example
http://www.evanjones.ca/tcp-checksums.html. The SIP spec recognizes that
error-checking may occur at other layers of the stack when it says "The
protocol allows extra error detection to be enabled, over and above any
error detection provided by the communications medium’s protocol." I
think it's still worthwhile trying to get this right.

> In short, Dan, I'm not arguing these changes are wrong.  I trust they are
> right *on your systems*.  But I'm certain we don't have enough test coverage
> to conclude they are right for all systems currently in production, let
> alone all systems we intend to support.

It's not clear to me what you're suggesting then. I feel like I'm being
asked to prove that my changes don't break any self-check system or SIP
client in existence, when all that I can prove is that my changes fix
utter breakage on the one system to which I actually have access. I can
add unit tests that match our environment, but I certainly can't
guarantee that those mirror every other system that has opted to offer
Unicode over SIP.