[OPEN-ILS-GENERAL] Accession numbers, copy notes and diacritics in call numbers

Tue Aug 2 03:18:48 EDT 2011

Dear Dan and Thomas,

Thank you very much for your recommendations!

As to the sorting issues - I believe that now we have a number of hints 
to investigate, so we can further proceed in our efforts :-)... I do 
appreciate you have taken the trouble to test things for us and thus 
facilitated testing in our own installation. Thanks a lot for that :-)!

(And, yes, you are right - I apologize for not mentioning it in my 
message - we are on EG 2.0.)

Linda

Dne 28.7.2011 16:19, Dan Scott napsal(a):
> Hi Linda:
>
> 2011/7/18 Linda Jansova<skolkova at chello.cz>:
>
> <snip>
>
>> 3) In case we use diacritics in call numbers, is there a way to make the
>> Shelf Browser show first the particular letters letters without diacritics
>> and then those with diacritics? E.g. to have the letter "Č" after the letter
>> "C" instead of somewhere at the beginning of the virtual shelf? (We are
>> especially interested in Czech alphabet - the correct order of letters is
>> available at http://en.wikipedia.org/wiki/Czech_alphabet.) Or maybe is it
>> better do do without and use only call numbers with decent ;-) letters?
> You would think that the answer to this would be simple, but it looks
> like the answer will require us to check a number of points.
>
> First, as of Evergreen 2.0 we generate call number "sort keys" that
> are used for sorting purposes. (I can't remember but I think you're on
> 2.0 by now?). I'm guessing that you still have your library set to the
> default call number classification scheme, so the first thing we'll
> need to do is check to ensure that the sort keys that we generate
> aren't doing something bad to diacritic characters.
>
> Second, sorting depends on the glibc locale environment on your
> database server - the collating sequence for the locale your database
> has been created in should match the behaviour of the "sort" command
> from the command line on the database server for each given locale.
>
> For example, if I create the text file "sortme" that contains the
> following lines:
>
> C
> Č
> C
> Č
>
> I can then test the collation behaviour of different locales. On my
> laptop running Fedora 15, for example, using the 'C' locale that we
> expect the database to be using:
>
> $ LANG=C sort sortme
> C
> C
> Č
> Č
>
> And then if I switch to the cs_CZ UTF8 locale:
>
> $ LANG=cs_CZ.utf8 sort sortme
> C
> C
> Č
> Č
>
> But if I switch to the cs_CZ ISO locale:
>
> $ LANG=cs_CZ.iso88592 sort sortme
> Č
> Č
> C
> C
>
> So - it looks like on my environment I would expect to get the
> appropriate sorting sequence for call numbers using the recommended
> 'C' locale for the database. It's possible that your database server's
> locale environment does not match this behaviour, of course.
>
> Finally, the actual call number sort key is then wrapped in an
> oils_text_as_bytea(label_sortkey) function when we sort the results of
> a call number browse, which converts the sortkey at run time to a
> bytea data type. Checking out the definition of bytea strings, I had
> the sinking feeling that this was the reason for your problems. To
> paste from the PostgreSQL documentation:
>
> "operations on binary strings process the actual bytes, whereas the
> processing of character strings depends on locale settings. In short,
> binary strings are appropriate for storing data that the programmer
> thinks of as "raw bytes", whereas character strings are appropriate
> for storing text."
>
> In addition,
>
> But then, when I tested this theory on a PostgreSQL 9.0 database
> created with the C locale, that theory of doom doesn't seem to hold
> up:
>
> -- Show that we are using the right database locale
> SHOW LC_COLLATE;
>
>   lc_collate
> ------------
>   C
> (1 row)
>
> -- Create a test table to try out our theory
> CREATE TABLE test_bytea(input TEXT, output BYTEA);
>
> -- Insert our sample data
> INSERT INTO test_bytea(input) VALUES ('C'), ('Č'), ('C'), ('Č');
>
> -- Create the BYTEA version of the text in the output column
> UPDATE test_bytea SET output = oils_text_as_bytea(input);
>
> -- Get the table as sorted by the untouched text
> SELECT input, output FROM test_bytea ORDER BY input ASC;
>
>   input | output
> -------+--------
>   C     | \x43
>   C     | \x43
>   Č     | \xc48c
>   Č     | \xc48c
> (4 rows)
>
> -- Now get the table as sorted by the BYTEA version of the strings
> SELECT input, output FROM test_bytea ORDER BY output ASC;
>
>   input | output
> -------+--------
>   C     | \x43
>   C     | \x43
>   Č     | \xc48c
>   Č     | \xc48c
> (4 rows)
>
> And... as you can see, I'm still getting the expected sort order that
> you want - even though the string has been converted to a BYTEA column
> (and the oils_text_as_bytea() function throws in an UPPER() call that
> also normally raises a warning flag that non-ASCII data may be
> destroyed).
>
> So - I'm having trouble reproducing the problem that you're seeing
> with these simple tests. Maybe you can have your systems people try
> out these tests on your database server to see if they get the same
> results?
>