[OPEN-ILS-GENERAL] Accession numbers, copy notes and diacritics in call numbers

Dan Scott dan at coffeecode.net
Thu Jul 28 10:19:37 EDT 2011


Hi Linda:

2011/7/18 Linda Jansova <skolkova at chello.cz>:

<snip>

> 3) In case we use diacritics in call numbers, is there a way to make the
> Shelf Browser show first the particular letters letters without diacritics
> and then those with diacritics? E.g. to have the letter "Č" after the letter
> "C" instead of somewhere at the beginning of the virtual shelf? (We are
> especially interested in Czech alphabet - the correct order of letters is
> available at http://en.wikipedia.org/wiki/Czech_alphabet.) Or maybe is it
> better do do without and use only call numbers with decent ;-) letters?

You would think that the answer to this would be simple, but it looks
like the answer will require us to check a number of points.

First, as of Evergreen 2.0 we generate call number "sort keys" that
are used for sorting purposes. (I can't remember but I think you're on
2.0 by now?). I'm guessing that you still have your library set to the
default call number classification scheme, so the first thing we'll
need to do is check to ensure that the sort keys that we generate
aren't doing something bad to diacritic characters.

Second, sorting depends on the glibc locale environment on your
database server - the collating sequence for the locale your database
has been created in should match the behaviour of the "sort" command
from the command line on the database server for each given locale.

For example, if I create the text file "sortme" that contains the
following lines:

C
Č
C
Č

I can then test the collation behaviour of different locales. On my
laptop running Fedora 15, for example, using the 'C' locale that we
expect the database to be using:

$ LANG=C sort sortme
C
C
Č
Č

And then if I switch to the cs_CZ UTF8 locale:

$ LANG=cs_CZ.utf8 sort sortme
C
C
Č
Č

But if I switch to the cs_CZ ISO locale:

$ LANG=cs_CZ.iso88592 sort sortme
Č
Č
C
C

So - it looks like on my environment I would expect to get the
appropriate sorting sequence for call numbers using the recommended
'C' locale for the database. It's possible that your database server's
locale environment does not match this behaviour, of course.

Finally, the actual call number sort key is then wrapped in an
oils_text_as_bytea(label_sortkey) function when we sort the results of
a call number browse, which converts the sortkey at run time to a
bytea data type. Checking out the definition of bytea strings, I had
the sinking feeling that this was the reason for your problems. To
paste from the PostgreSQL documentation:

"operations on binary strings process the actual bytes, whereas the
processing of character strings depends on locale settings. In short,
binary strings are appropriate for storing data that the programmer
thinks of as "raw bytes", whereas character strings are appropriate
for storing text."

In addition,

But then, when I tested this theory on a PostgreSQL 9.0 database
created with the C locale, that theory of doom doesn't seem to hold
up:

-- Show that we are using the right database locale
SHOW LC_COLLATE;

 lc_collate
------------
 C
(1 row)

-- Create a test table to try out our theory
CREATE TABLE test_bytea(input TEXT, output BYTEA);

-- Insert our sample data
INSERT INTO test_bytea(input) VALUES ('C'), ('Č'), ('C'), ('Č');

-- Create the BYTEA version of the text in the output column
UPDATE test_bytea SET output = oils_text_as_bytea(input);

-- Get the table as sorted by the untouched text
SELECT input, output FROM test_bytea ORDER BY input ASC;

 input | output
-------+--------
 C     | \x43
 C     | \x43
 Č     | \xc48c
 Č     | \xc48c
(4 rows)

-- Now get the table as sorted by the BYTEA version of the strings
SELECT input, output FROM test_bytea ORDER BY output ASC;

 input | output
-------+--------
 C     | \x43
 C     | \x43
 Č     | \xc48c
 Č     | \xc48c
(4 rows)

And... as you can see, I'm still getting the expected sort order that
you want - even though the string has been converted to a BYTEA column
(and the oils_text_as_bytea() function throws in an UPPER() call that
also normally raises a warning flag that non-ASCII data may be
destroyed).

So - I'm having trouble reproducing the problem that you're seeing
with these simple tests. Maybe you can have your systems people try
out these tests on your database server to see if they get the same
results?


More information about the Open-ils-general mailing list