[OPEN-ILS-GENERAL] How to have multiple CPU's used, importing a million bib records in EG 2.0?

Mon May 2 19:37:16 EDT 2011

Hi Repke:

On Mon, May 2, 2011 at 3:40 PM, Repke de Vries <repke at xs4all.nl> wrote:
> Hi,
>
> though we worked out our combination of calling marc2bre followed by pg_loader followed by psql as successive, separate steps - we are dead in the water 'cause PostgreSQL alone (last step) takes four hours for 100K records: meaning 40 hours for all of our one million bib records.

Right, you don't want to run each of those steps separately; writing
out a complete intermediate file at every step lengthens the whole
process.

<snip>

> Would connecting the steps with UNIX pipes and feeding it the big chunk of one million records, do it?
> So: marc2sre [our calling parameters] [input = the one million bib records] | pg_loader [our calling parameters] | psql
>
> We had that "Unix pipes" advice a couple of times but it seems counter-intuitive: isn't the net result still one large file that goes into PostgreSQL and therefore using one single instead of multiple CPU's ?

Well - rather than having to finish creating each big file at each
step of the process, psql can import each record immediately as it
works its way through the pipe. So there's no delay when you pipe the
commands; the import into the database begins immediately.

To make it parallel, if you have 4 CPUs available, open 4 terminal
sessions and run marc2*re against distinct subsets of the records in
each terminal session. And as each psql connection will (more or less)
tie up one of your database server's CPU cores, you make better use of
your database server's processing power this way. So if bibs1.mrc
contains the first 100,000 MARC records, and bibs2.mrc contains the
next 100,000 MARC records, you would run the following commands (with
additional parameters to your taste of course):

marc2bre bibs1.mrc | pg_loader | psql
marc2bre bibs2.mrc | pg_loader | psql
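
If you would rather not keep several terminal sessions open, the same
idea can be sketched as one small shell script that starts each pipeline
in the background and waits for all of them. This is only a sketch: the
function name run_parallel_import and the MARC2BRE / PG_LOADER / PSQL
variables are made up for illustration, and you would still need to add
your own calling parameters to each command.

```shell
#!/bin/sh
# Sketch only: one import pipeline per MARC file, all running at once.
# The three variables are illustrative placeholders; put your own calling
# parameters for each command into them.
MARC2BRE=${MARC2BRE:-marc2bre}
PG_LOADER=${PG_LOADER:-pg_loader}
PSQL=${PSQL:-psql}

run_parallel_import() {
    for f in "$@"; do
        # Each pipeline gets its own psql connection, so each one can
        # keep roughly one database server core busy.
        $MARC2BRE "$f" | $PG_LOADER | $PSQL &
    done
    # Block until every background pipeline has finished.
    wait
}

# Example call, one file per available core:
# run_parallel_import bibs1.mrc bibs2.mrc bibs3.mrc bibs4.mrc
```

Pass one file per core you want to use; starting more pipelines than
you have cores is unlikely to help.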

I'll admit, though, that I haven't done a major import in the 2.x
series and there are several configuration flags that you might want
to consider changing during import that I don't have a complete handle
on. For example, during import, you'll want to set the 'enabled' value
of the 'ingest.assume_inserts_only' config.internal_flag to TRUE to
avoid some DELETE statements that do nothing but waste time during the
initial import (but set it to FALSE after!).
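
A rough sketch of flipping that flag from the shell follows. It assumes
the flag row in config.internal_flag is identified by a column called
"name" (check your schema first), and the helper name set_inserts_only
is invented here; psql connection parameters are left out.

```shell
#!/bin/sh
# Sketch only: toggle the flag before and after the import.
# Assumption: config.internal_flag identifies the flag in a "name" column.
PSQL=${PSQL:-psql}

set_inserts_only() {
    # $1 should be TRUE (before the import) or FALSE (after it).
    $PSQL -c "UPDATE config.internal_flag SET enabled = $1 WHERE name = 'ingest.assume_inserts_only';"
}

# set_inserts_only TRUE     # before starting the import pipelines
# ... run the import ...
# set_inserts_only FALSE    # once the import is done
```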

I don't think there is yet a cohesive list of all of these flags, what
effect they have, and what you should set them to for the best import
performance; that would be a great contribution to the documentation.
(And there are probably other functions that we devs could teach to
make use of 'ingest.assume_inserts_only' to improve performance as
well - for example, biblio.extract_located_uris() would be able to
skip two DELETE statements per bib record).

Dan