[OPEN-ILS-DEV] slow(!) import

Dan Scott dan at coffeecode.net
Thu Jun 6 13:02:19 EDT 2013


On Thu, Jun 06, 2013 at 10:48:32AM -0400, Joe Thornton wrote:
> Thanks very much to everyone who replied to my plea for help.
> 
> Elliot  -- Thanks for pointing me to the right list.
> 
> Jason: I'm just kicking the tires. I was sure we'd need a much bigger
> server if we decide to use Evergreen in production, but I thought 16GB of
> RAM and 200GB of disk would be okay for testing. And at this point it's
> just 160k bib records. I haven't even gotten to items or patrons yet.
> Thanks for sharing your git project. I'm not a git user but I'll give it a
> try if I don't find a more familiar solution.
> 
> Dan: Thanks -- I'll probably give your suggestion a try, but 550,000
> records 1,000 at a time will certainly take a while.

I think it will take a lot less time than importing them all in one
shot. Just generate an SQL file that wraps each block of 1,000 records
in its own transaction, as a sequence of statements.
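
As a rough sketch (assuming a hypothetical inserts.sql that holds one
INSERT statement per line -- not something Evergreen generates for
you), you could chunk the file with something like:

    # hypothetical sketch: wrap each run of 1,000 INSERT statements in
    # its own transaction; assumes inserts.sql has one INSERT per line
    awk 'NR % 1000 == 1 { print "BEGIN;" }
         { print }
         NR % 1000 == 0 { print "COMMIT;" }
         END { if (NR % 1000 != 0) print "COMMIT;" }' \
        inserts.sql > chunked.sql

Feed the resulting chunked.sql to psql and each 1,000-record block
commits on its own.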

The problem is that the database can get pretty bloated during one huge
transaction, because vacuum can't reclaim any dead rows while that
transaction is still open. Smaller transactions give autovacuum a
chance to run in parallel with the import. Also, have you tweaked the
database configuration at all for settings like maintenance_work_mem
and work_mem?
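
For what it's worth -- and these are guesses for a 16GB test box, not
official Evergreen recommendations -- prepending session-level settings
to the import can make a real difference:

    # hedged starting points only; adjust for your hardware.
    # synchronous_commit = off just risks losing the last few
    # transactions if the server crashes mid-import.
    printf '%s\n' \
        "SET work_mem = '64MB';" \
        "SET maintenance_work_mem = '512MB';" \
        'SET synchronous_commit = off;' > settings.sql

    # one psql session, so the SETs stay in effect for the whole run
    # (the evergreen user/database names are just placeholders)
    cat settings.sql chunked.sql | psql -U evergreen -d evergreen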
 
> Three things:
> 
> --  The process I started three days ago to import 160,000 records using
> the method on the Evergreen site is still running.

Yeah. That's ridiculous and not representative of a normal Evergreen
experience. We imported our ~2 million records more than 4 years ago in
less than a weekend. Things have changed since then: ingest now happens
entirely inside the database rather than outside it, which might
introduce some more overhead; but PostgreSQL has also gotten faster,
and there's certainly much better hardware available now...

> --  Maybe an unfair comparison, but we use VuFind as an alternative
> interface to Horizon, and a full import of all 550k records takes about 45
> minutes.

To some extent that's Solr vs PostgreSQL, and in some ways it's not so
much PostgreSQL's fault as it is that we do an awful lot to shred each
record out into potentially hundreds of rows across many tables, all of
which are indexed. For example, you can search by specific
field/subfield in Evergreen, which adds 25-125 rows per record
(an average of about 40 rows per record in our sample data). VuFind doesn't
have to meet the same requirements as Evergreen does for searching
records; but yeah, Solr is also just plain fast.
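
If you're curious, you can watch that shredding happen: assuming the
stock metabib.real_full_rec table (the flattened field/subfield rows
that back that kind of search), something like

    # hedged illustration: count the flattened rows a few bib records
    # expand into (user/database names are placeholders again)
    psql -U evergreen -d evergreen -c "
      SELECT record, count(*) AS shredded_rows
        FROM metabib.real_full_rec
       GROUP BY record
       ORDER BY record
       LIMIT 10;"

gives a feel for the row counts sitting behind each record.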

> --  It's surprising to me that there isn't a faster method. We're looking
> seriously at Evergreen as a replacement for Horizon, but this would be a
> problem. I'll try Dan's and then Jason's methods (again, thank you very
> much) and hope that they're significantly faster. If I had the time and
> ability (unfortunately I have neither) I'd take a shot at it myself.

Parallelizing the import, as Jason suggested, certainly can help. I
haven't tried this for quite a while, but a really simplistic way to
parallelize the import would be to take that script with its many
1,000-record chunks and split it into n-1 separate scripts (where "n"
is the number of cores on your database server--again, credit due to
Jason), then run each script from a separate psql session. You might
run into deadlocks this way, but it's worth a shot.
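
Something like this (untested; chunked.sql and the credentials are the
placeholder names from earlier) captures the idea for a four-core
server, dealing whole BEGIN/COMMIT blocks out round-robin so no
transaction gets split across workers:

    # hypothetical sketch: split chunked.sql into n-1 = 3 worker
    # scripts (assumes the file starts with a BEGIN; line)
    awk -v n=3 '/^BEGIN;/ { f = sprintf("part_%d.sql", ++c % n) }
                { print > f }' chunked.sql

    for f in part_*.sql; do
        psql -U evergreen -d evergreen -f "$f" &   # one session each
    done
    wait    # pause here until every import session has finished

Each worker would also want the session settings from earlier prepended
to its part file.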

