[OPEN-ILS-DEV] Locale/Collation problems during import

Chris Roosendaal christian.roosendaal at gmail.com
Thu Jul 21 11:29:06 EDT 2011


Hi all,

We are having problems importing MARC records into the Evergreen
database. The process takes a very long time, so we need your
advice on how to improve it.

Below is the initial configuration of our servers:
Production Server:
OS: Centos 5.4 (kernel 2.6.18-238.12.1.el5)
Processors: 8
RAM: 24Gb
HDD: 864Gb with drbd mirroring

Testing Server:
OS: Red Hat Enterprise Linux (kernel 2.6.18-164.el5)
Processors: 4
RAM: 8Gb
HDD: 882Gb without mirroring

We use Evergreen version 2.0.3.
Number of MARC records: 1001542
The records are split into 9 batches.

First of all, we started the import on the testing server and the
production server at the same time to compare performance.
The PostgreSQL configuration and the source MARC XML data are the same
in both cases (which is important, I guess).

Testing server:
Timesheet: We needed 4 days to import the non-serial records and about
5 hours to import the serial records into the Evergreen database.
Results: After the import finished, we checked its quality and found
that everything is all right except for search with diacritics: it is
not possible to find a record with accents in the staff client, although
it can be retrieved from the database by id in the normal way.

Production server:
Timesheet: We needed 7 days for the non-serial records and 24 hours for
the serial records for exactly the same import operations, even though
this server is more powerful than the testing server.
Results: After the import completed on this server, we tested search
with diacritics, and it works for us!
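
To put the two timesheets side by side, the slowdown can be worked out
directly from the figures quoted above. This is only a back-of-the-envelope
sketch: the variable names are illustrative, and the durations are the
approximate ones given in this message.

```python
# Durations quoted in this message, converted to hours (approximate).
testing = {"non_serial": 4 * 24, "serial": 5}
production = {"non_serial": 7 * 24, "serial": 24}


def slowdown(slow_hours, fast_hours):
    """Return how many times longer the slower run took."""
    return slow_hours / fast_hours


ratios = {kind: slowdown(production[kind], testing[kind]) for kind in testing}

for kind, ratio in ratios.items():
    print(f"{kind}: production took {ratio:.2f}x as long as testing")
```

With these inputs the non-serial import took 1.75 times as long on
production, and the serial import 4.8 times as long.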

Presumably, the only differences in this case are in the search index
creation process, so we investigated how the Evergreen search engine
works and found some interesting points.
Evergreen uses the internal search solution tsearch2, which depends on
the LOCALE/COLLATE settings when processing search queries, so we
checked these settings first:

Production Server:
-bash-3.2$ psql -l
                                    List of databases
      Name      |  Owner   | Encoding |  Collation  |    Ctype    |   Access privileges
----------------+----------+----------+-------------+-------------+-----------------------
 evergreen      | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |

Testing Server:
bash$ psql -l
                                    List of databases
      Name      |  Owner   | Encoding |  Collation  |    Ctype    |   Access privileges
----------------+----------+----------+-------------+-------------+-----------------------
 evergreen      | postgres | UTF8     | C           | C           |
 evergreen.work | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
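
The comparison of the two listings can also be done mechanically. The
sketch below parses rows in the `psql -l` layout shown above and reports
databases whose encoding, collation, or ctype differ between servers. The
function names and the embedded sample rows are illustrative only; in
practice the text would come from running `psql -l` on each server.

```python
# Sample rows copied from the two listings in this message.
LISTINGS = {
    "production": """
 evergreen      | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
""",
    "testing": """
 evergreen      | postgres | UTF8     | C           | C           |
 evergreen.work | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
""",
}


def parse_listing(text):
    """Map database name -> (encoding, collation, ctype) from `psql -l` rows."""
    databases = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Data rows have at least five columns; skip the header and ruler lines.
        if len(fields) < 5 or fields[0] in ("", "Name"):
            continue
        databases[fields[0]] = (fields[2], fields[3], fields[4])
    return databases


def locale_differences(a, b):
    """Return databases present in both listings whose settings differ."""
    return {
        name: (a[name], b[name])
        for name in a.keys() & b.keys()
        if a[name] != b[name]
    }


production = parse_listing(LISTINGS["production"])
testing = parse_listing(LISTINGS["testing"])
differences = locale_differences(production, testing)
print(differences)
```

For the rows above, only the `evergreen` database is reported: en_US.UTF-8
on production versus C on testing.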

I guess the simplest way to solve this problem is to create a
completely new database with UTF-8 support for Locale/Collation, like
this:
createdb -T template0 --lc-ctype=en_US.UTF-8 --lc-collate=en_US.UTF-8 -E UNICODE evergreen

and then to import the complete dump from the production server:
psql -U evergreen -d evergreen -f ./evergreen.production.full.dump
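
The two steps above, plus a check that the new database really has the
intended locale, could be scripted roughly as follows. This is a sketch
under assumptions: the verification query against `pg_database` is an
addition not present in the original procedure, the helper name
`rebuild_commands` is hypothetical, and the script only prints the
commands (a dry run) rather than executing them.

```python
import shlex


def rebuild_commands(dbname, locale, dump_path, user):
    """Return argv lists for: create database, restore dump, verify locale."""
    # Assumption: this verification query is an extra step added for illustration.
    verify_sql = (
        "SELECT datcollate, datctype FROM pg_database "
        "WHERE datname = current_database();"
    )
    return [
        ["createdb", "-T", "template0", f"--lc-ctype={locale}",
         f"--lc-collate={locale}", "-E", "UNICODE", dbname],
        ["psql", "-U", user, "-d", dbname, "-f", dump_path],
        # -A (unaligned) and -t (tuples only) keep the output easy to compare.
        ["psql", "-U", user, "-d", dbname, "-A", "-t", "-c", verify_sql],
    ]


commands = rebuild_commands(
    dbname="evergreen",
    locale="en_US.UTF-8",
    dump_path="./evergreen.production.full.dump",
    user="evergreen",
)

# Dry run: print the commands instead of executing them.
for argv in commands:
    print(" ".join(shlex.quote(arg) for arg in argv))
```

To actually run the steps, each list could be passed to
`subprocess.run(argv, check=True)` so that a failure stops the sequence.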

Once the dump import has finished, the Evergreen staff client can find
all documents with diacritics without any difficulty.
Is this the right way to solve the problem, and is it normal for
diacritics support to be missing if we do not specify a locale when
initializing the database? We also have concerns about the long-term
effects of this solution. Will future upgrades to new versions of
Evergreen and PostgreSQL cause problems if we keep using a
different Locale/Collation?

Can it be true that the LOCALE settings slow the import down by a
factor of 2-3, because tsearch2 needs to find each character with
diacritics and replace it with the same character without them?
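
For readers unfamiliar with the kind of replacement described in the
question above, here is a minimal sketch of accent folding using Unicode
decomposition. It is not the code used by tsearch2 or Evergreen, and it
says nothing about where the import time is actually spent; the function
names are hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_key(text):
    """Hypothetical index key: accent-free and case-folded."""
    return strip_diacritics(text).casefold()


print(search_key("Révolution française"))
```

This prints `revolution francaise`, so a query typed with or without
accents would map to the same key.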

As we use version 2.0.3, have these problems already been solved in
newer versions of Evergreen? We found this thread about a similar
problem, though not for the tsearch2 search engine, only for simple
selects:
http://georgialibraries.markmail.org/search/?q=collate#query:collate+page:2+mid:bbqa66nt7ukvawzg+state:results
We are not sure that this solution can solve our problem with diacritics
search in bibliographic records. Or maybe you have another solution
along the same lines?


Thank you very much!

Regards,
Vyacheslav Tykhonov and Chris Roosendaal
IISH Amsterdam.