[OPEN-ILS-DEV] slow(!) import

Jason Stephenson jstephenson at mvlc.org
Thu Jun 6 11:07:41 EDT 2013


Quoting Joe Thornton <jthornton at uhls.lib.ny.us>:

> Three things:
>
> --  The process I started three days ago to import 160,000 records using
> the method on the Evergreen site is still running.

I found it took days to load a large number of bibliographic records  
using the regular tools, particularly if you're trying to load them  
all at once.

> --  Maybe an unfair comparison, but we use VuFind as an alternative
> interface to Horizon, and a full import of all 550k records takes about 45
> minutes.

I suspect VuFind is not doing as much with the records as it imports  
them as Evergreen does.


> --  It's surprising to me that there isn't a faster method. We're looking
> seriously at Evergreen as a replacement for Horizon, but this would be a
> problem. I'll try Dan's and then Jason's methods (again, thank you very
> much) and hope that they're significantly faster. If I had the time and
> ability (unfortunately I have neither) I'd take a shot at it myself.

As I recall, reindexing bib records for the OPAC on Horizon was quite a
slow process. We'd often use special scripts to do a full index in a
temporary directory so that the old indexes could still be
searched. Also, there was the auto indexer that had to be run to pick
up changes in bibliographic records for OPAC searching. Evergreen has
no separate processes for that; it is all handled in the database.

With Evergreen, I found some interesting things that would improve  
performance of large bib loads. These may be related to our  
hardware/network configuration but here they are:

1. Setting the internal flags ingest.metarecord_mapping.skip_on_insert,
ingest.disable_authority_linking, and ingest.assume_inserts_only to
enabled = TRUE helps. You'll find these in the
config.internal_flag table.

2. Doing the load from a computer other than the database server  
seemed to be faster.

3. JDBC batch inserts seemed faster than doing the equivalent with Perl DBI.

4. Break the bib records up into batches of 10,000 records. (The  
Horizon bib export program has options that make this relatively  
easy.) Get the number of cores on your database server and subtract 1.  
Run that number of batches simultaneously into the database server.  
(The software that I shared the link to will help with this.)
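
The batching in item 4 can be sketched roughly as follows. This is only an illustration in Python, not the software mentioned above: split_into_batches, worker_count, load_batch, and load_all are hypothetical names, and load_batch is a placeholder that merely counts records where a real loader would insert them into the database.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 10000  # batch size suggested in item 4


def split_into_batches(records, batch_size=BATCH_SIZE):
    """Split a list of records into consecutive batches of batch_size."""
    return [records[i:i + batch_size]
            for i in range(0, len(records), batch_size)]


def worker_count(db_server_cores):
    """Cores on the database server minus 1, but never fewer than 1."""
    return max(1, db_server_cores - 1)


def load_batch(batch):
    """Hypothetical placeholder: a real loader would insert the batch
    into the database. Here it only reports how many records it saw."""
    return len(batch)


def load_all(records, db_server_cores, loader=load_batch):
    """Run the batches through the loader, worker_count() at a time."""
    batches = split_into_batches(records)
    with ThreadPoolExecutor(max_workers=worker_count(db_server_cores)) as pool:
        return sum(pool.map(loader, batches))


if __name__ == "__main__":
    # 25,000 dummy records against an assumed 8-core database server.
    print(load_all(list(range(25000)), db_server_cores=8))
```

Note that the core count is that of the database server, not of the machine running the load, so it is passed in rather than detected locally.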

With those changes, loading our 900,000+ bib records went from taking
days to finishing overnight.
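
For item 1, the flag change amounts to a single UPDATE on config.internal_flag. The sketch below just builds that statement as a string so it can be reviewed before being run in psql. The helper name flag_update_sql is made up for illustration, and it assumes the table's flag column is called name, so check that against your schema.

```python
# Flags named in item 1 of the list above.
INGEST_FLAGS = [
    "ingest.metarecord_mapping.skip_on_insert",
    "ingest.disable_authority_linking",
    "ingest.assume_inserts_only",
]


def flag_update_sql(flags, enabled=True):
    """Build an UPDATE that sets enabled for the given internal flags.

    Assumes config.internal_flag has columns 'name' and 'enabled'.
    """
    names = ", ".join("'{}'".format(f) for f in flags)
    state = "TRUE" if enabled else "FALSE"
    return ("UPDATE config.internal_flag SET enabled = {} "
            "WHERE name IN ({});".format(state, names))


if __name__ == "__main__":
    print(flag_update_sql(INGEST_FLAGS))
```

Calling flag_update_sql(INGEST_FLAGS, enabled=False) builds the matching statement to switch the flags back off once the load is done.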

HtH,
Jason

>
> Thanks again.
> Joe
>
> Joe Thornton
> Manager, Automation Services
> Upper Hudson Library System
> 28 Essex Street
> Albany, NY 12206
> 518-437-9880 x230
>
>
>
> On Tue, Jun 4, 2013 at 3:26 PM, Joe Thornton <jthornton at uhls.lib.ny.us> wrote:
>
>> I'm new to Evergreen and to this list so I apologize in advance if this
>> issue has been discussed already (I did look).
>>
>> I installed Evergreen successfully on a test server with 16GB RAM and
>> about 200GB of disk -- in two partitions.
>>
>> We have:
>>
>> Debian 7
>> Postgres 9.1 (not on a remote server)
>> Evergreen 2.4
>>
>> To migrate bib records from our SirsiDynix Horizon database I used this
>> document:
>> http://docs.evergreen-ils.org/2.4/_migrating_your_bibliographic_records.html
>>
>> The process was interrupted a few times by serious errors, but eventually
>> I ended up with 550k bib records in the staging_records_import table.
>>
>> The real problems started when I ran SELECT staging_importer();
>>
>> The first time it stopped after many hours because it ran out of disk
>> space. Postgres was using the smaller partition for data so I changed it to
>> use the larger partition (~135GB) and restarted the job. This time it ran
>> over the weekend and then ran out of disk space again.
>>
>> Although this seems very strange to me, I started it again and this time
>> the staging_records_import table has about 160k records in it.
>>
>> I started SELECT staging_importer(); yesterday (about 24 hours ago) and
>> it's still running and has used more than 50GB of disk so far.
>>
>> Am I missing a step (or steps), or is this normal?
>>
>> Thanks,
>>
>> Joe Thornton
>> Manager, Automation Services
>> Upper Hudson Library System
>> 28 Essex Street
>> Albany, NY 12206
>> 518-437-9880 x230
>>
>



-- 
Jason Stephenson
Assistant Director for Technology Services
Merrimack Valley Library Consortium

