[OPEN-ILS-GENERAL] Language problems in Evergreen
Dan Scott
dan at coffeecode.net
Fri Mar 11 22:34:12 EST 2011
On 11 March 2011 11:53, Dan Scott <dan at coffeecode.net> wrote:
> On 10 March 2011 10:48, Dan Scott <dan at coffeecode.net> wrote:
>> On 10 March 2011 02:36, Tigran Zargaryan <tigran at flib.sci.am> wrote:
>>> Hi Dan,
>>> reporting on the current situation with patron registration and search
>>> using non-Latin scripts.
>>> After installing OpenSRF 2.0 and EG 2.0 we are able to register and save
>>> patrons in Armenian.
>>> But when searching for patrons by first or last name (e.g. using Armenian
>>> script), the system replies 'No patrons found matching search criteria'.
>>> The same search by patron barcode gives a positive result.
>>> So now we can save the patrons, but retrieval by non-Latin script
>>> returns nothing.
>
> A further update on this: when I built a Debian Squeeze image using
> Evergreen 2.0.3 and OpenSRF 2.0.0-rc2, I was able to reproduce your
> problem.
>
> We traced the specific problem in this instance down to the Unicode
> data not being properly encoded in the Perl module before being sent
> to the database. Using the Encode Perl module and its encode_utf8()
> function to properly encode the incoming arguments resolved the problem.
> So that fix will go into 2.0.4 - or, if you want to apply it to your
> local server now (and I wouldn't blame you!), you can grab the latest
> version of Open-ILS/src/perlmods/OpenILS/Application/Storage/Publisher/actor.pm
> from SVN and copy it to
> /openils/lib/perl5/OpenILS/Application/Storage/Publisher/actor.pm,
> then restart the open-ils.storage service and all should suddenly
> start working for you.
>
> Note to my future self: encode_utf8() (normally paired with
> decode_utf8()) is your friend!
>
> Note that my inability to reproduce this problem on Fedora suggests
> that Fedora might be doing some automatic encoding of the data that
> the Debian-based distributions don't do. Interesting...
>
And one more update. Thanks to the assistance of Mike Rylander, Dan
Wells, and others in the #evergreen IRC channel, we eventually
determined that the database locale (as determined by the "SHOW
LC_CTYPE;" command at the psql prompt) was behind some of the
inconsistent testing results.

As of PostgreSQL 8.4, you can set the database locale when you create
the database, using the --lc-ctype and --lc-collate flags to the
createdb command. The recommended database locale for performance
reasons is "C", essentially meaning no locale. However, the default
locale is typically en_US.UTF-8 (or some other variation; the key
point is that it is a UTF-8 locale). When the database locale was
*.UTF-8, the problems that you reported occurred because the database
LOWER() function was unable to correctly convert the input string to
lowercase; the encode_utf8() call turned out to fix the problem in
that case, but for the wrong reasons. Introducing the encode_utf8()
call broke the search results on databases with locale = C.

At the suggestion of Mike Rylander, I created a new database function
that uses the Perl lc() function to convert incoming text to lowercase
and redefined the indexes that used the built-in database LOWER()
function to use the new evergreen.lowercase() function. This means
that when we use the Perl lc() function in business logic, we can
count on the database providing the same results when we call the
corresponding evergreen.lowercase() database function to provide a
match. I'll admit to still being a little nervous about this
direction, but it will at the very least provide a consistent result
and solve the problem that we're seeing.
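
For anyone curious what that approach looks like, here is a rough
sketch rather than the exact upgrade script. It assumes the PL/Perl
(plperlu) language is installed in the database and that the database
is named "evergreen"; the index name is hypothetical and the
actor.usr.family_name column is used purely for illustration:

```shell
# Sketch only: the database name and index name below are assumptions.
psql -d evergreen <<'SQL'
-- Lowercase via Perl's lc() so that the database agrees with the
-- lc() calls made in the Perl business logic.
CREATE OR REPLACE FUNCTION evergreen.lowercase(TEXT) RETURNS TEXT AS $$
    return lc(shift);
$$ LANGUAGE plperlu STRICT IMMUTABLE;

-- Hypothetical index: an expression index built on the new function
-- instead of the built-in LOWER().
CREATE INDEX example_usr_family_name_idx
    ON actor.usr (evergreen.lowercase(family_name));
SQL
```

The function has to be declared IMMUTABLE for PostgreSQL to accept it
in an index expression, and a query only uses such an index when it
calls the same evergreen.lowercase() function on the column.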

As an addendum, I have updated the install documentation in the wiki
to include the commands required to set the LC_CTYPE and LC_COLLATE
variables to "C" when the database is created.
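
Roughly, and assuming a database named "evergreen" (adjust to suit
your site), the commands look like the following. template0 is used
because PostgreSQL will not let you pick a locale that differs from
the template database's unless you start from template0:

```shell
# Assumes you are running as a PostgreSQL superuser (e.g. the postgres
# user) and that "evergreen" is the database name you want.
createdb --template template0 --encoding UTF8 \
    --lc-ctype=C --lc-collate=C evergreen

# Check the result with the same command mentioned earlier; on a
# database created as above it should report C.
psql -d evergreen -c "SHOW LC_CTYPE;"
```

The locale is fixed when the database is created, so an existing
database with a *.UTF-8 locale has to be dumped and restored into a
newly created one to change it.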

And hopefully that's the last of these problems :)