[OPEN-ILS-DEV] slow(!) import

Tue Jun 4 16:24:53 EDT 2013

On Tue, Jun 04, 2013 at 03:26:40PM -0400, Joe Thornton wrote:
> I'm new to Evergreen and to this list so I apologize in advance if this
> issue has been discussed already (I did look).
> 
> I installed Evergreen successfully on a test server with 16GB RAM and about
> 200GB of disk -- in two partitions.
> 
> We have:
> 
> Debian 7
> Postgres 9.1 (not on a remote server)
> Evergreen 2.4
> 
> To migrate bib records from our SirsiDynix Horizon database I used this
> document:
> http://docs.evergreen-ils.org/2.4/_migrating_your_bibliographic_records.html
> 
> The process was interrupted a few times by serious errors, but eventually I
> ended up with 550k bib records in the staging_records_import table.
> 
> The real problems started when I ran SELECT staging_importer();
> 
> The first time it stopped after many hours because it ran out of disk
> space. Postgres was using the smaller partition for data so I changed it to
> use the larger partition (~135GB) and restarted the job. This time it ran
> over the weekend and then ran out of disk space again.
> 
> Although this seems very strange to me, I started it again and this time
> the staging_records_import table has about 160k records in it.
> 
> I started SELECT staging_importer(); yesterday (about 24 hours ago) and
> it's still running and has used more than 50GB of disk so far.
> 
> Am I missing a step (or steps), or is this normal?

I haven't run a migration of that size with those exact tools for a
number of years, but have run some smaller migrations over the past
years. One thing that's probably killing your import is that the
staging_importer() function defined in the docs imports all 550K records
in one shot. That's likely to build up a lot of bloat; for the last
migration that I did (last summer), I ran the imports in chunks of 10K
records each, and Galen Charlton in the IRC channel said that he
typically works with chunks of 1K records at a time.

Something like the following might work for you:

CREATE OR REPLACE FUNCTION staging_importer(from_rec INT, to_rec INT)
RETURNS INT AS $$
DECLARE stage RECORD;
BEGIN
FOR stage IN SELECT * FROM staging_records_import WHERE id BETWEEN
from_rec AND to_rec ORDER BY id LOOP
      INSERT INTO biblio.record_entry (marc, last_xact_id) VALUES
(stage.marc, 'IMPORT');
      UPDATE staging_records_import SET dest =
currval('biblio.record_entry_id_seq');
    END LOOP;
END;
$$ LANGUAGE plpgsql;

And then you can call it from a SQL script like:

SELECT staging_importer(0, 1000);
SELECT staging_importer(1001, 2000);
... etc

Worth a shot! Let us know if you try it and it helps out, we could tweak
the docs accordingly.