[OPEN-ILS-GENERAL] How to have multiple CPU's used, importing a million bib records in EG 2.0?

Repke de Vries repke at xs4all.nl
Tue May 3 13:39:45 EDT 2011


Hi Dan

Eight piped processes are running fine right now, as per your suggestion.

However, on tying the steps together with a Unix pipe:

> So there's no delay when you pipe the
> commands; the import into the database begins immediately.


the evidence is somewhat circumstantial, but our observation is that importing does *not* begin immediately, and pg_loader seems to be the culprit: it does *not* pass rows on to psql but keeps piling them up until marc2bre has finished, and only then starts feeding the database through psql.

The evidence: although the pg_loader counter starts writing to the screen right away, it is only at the end of the MARC bib data chunk that (what we assume to be) the pg_loader message "SET .. BEGIN .. Writing file" appears.
Judging by CPU activity on the database server, only after that does the stream of COPY statements start getting the data into the database.  //Times eight, because of running in parallel. For the moment we are not fine-tuning any of the flags you mention.//

Question: are we overlooking some extra parameter to pg_loader that would make it start passing rows on to psql, importing into the database right away?
Here is how we are calling the pipeline [1]. Thanks,

Repke, IISH

[1]
PERLLIB=/openils/lib/perl5 perl /usr/src/Evergreen-ILS-2.0.3/Open-ILS/src/extras/import/marc2bre.pl \
    --marctype XML --db_host xx --db_name xx --db_user xx --db_pw xx \
    bibliographical.xml.1.xml \
  | perl /usr/src/Evergreen-ILS-2.0.3/Open-ILS/src/extras/import/pg_loader.pl -or bre -a bre \
  | psql -h xx -d xx -U xx
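For reference, the fan-out we run eight of looks roughly like the sketch below. The real marc2bre.pl | pg_loader.pl | psql stages are replaced here by placeholder commands (tr and cat) so the parallel structure itself can be run anywhere; the chunk count and the bibliographical.xml.N.xml naming are assumptions matching our setup.

```shell
#!/bin/sh
# Hypothetical sketch: one three-stage pipeline per chunk, run in the
# background. In the real invocation each pipeline is
#   marc2bre.pl ... bibliographical.xml.$n.xml | pg_loader.pl -or bre -a bre | psql ...
# here tr and cat stand in for the stages so the fan-out is runnable.
for n in 1 2 3 4; do
  ( echo "chunk $n" | tr 'a-z' 'A-Z' | cat ) &
done
wait            # block until all background pipelines have finished
echo "all chunks done"
```

Each `( ... ) &` backgrounds a whole pipeline, so with enough CPUs the chunks are processed concurrently; `wait` holds the script until the last one exits.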


On 3 May 2011, at 01:37, Dan Scott wrote:

> Hi Repke:
> 
> On Mon, May 2, 2011 at 3:40 PM, Repke de Vries <repke at xs4all.nl> wrote:
>> Hi,
>> 
>> though we worked out our combination of calling marc2bre, followed by pg_loader, followed by psql, as successive, separate steps - we are dead in the water because PostgreSQL alone (the last step) takes four hours for 100K records: meaning 40 hours for all of our one million bib records.
> 
> Right, you don't want to do each of those steps separately; that will
> drag out the process.
> 
> <snip>
> 
>> Would connecting the steps with UNIX pipes and feeding it the big chunk of one million records, do it?
>> So: marc2bre [our calling parameters] [input = the one million bib records] | pg_loader [our calling parameters] | psql
>> 
>> We had that "Unix pipes" advice a couple of times, but it seems counter-intuitive: isn't the net result still one large file going into PostgreSQL, and therefore using a single CPU instead of multiple CPUs?
> 
> Well - rather than having to finish creating each big file at each
> step of the process, psql can import each record immediately as it
> works its way through the pipe. So there's no delay when you pipe the
> commands; the import into the database begins immediately.
> 
> To make it parallel, if you have 4 CPUs available, open 4 terminal
> sessions and run marc2*re against distinct subsets of the records in
> each terminal session. 
<snip>
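For what it's worth, the streaming-versus-buffering distinction Dan describes is easy to see with two standard tools: `cat` passes each line on as soon as it arrives, while `tac` (like a loader that assembles its whole batch before writing anything) cannot emit a single line until its input ends. A minimal illustration:

```shell
#!/bin/sh
# A producer that emits three lines with a pause between them.
producer() {
  for i in 1 2 3; do
    echo "rec$i"
    sleep 0.2
  done
}

# cat streams: "rec1" appears on the terminal almost immediately.
producer | cat

# tac must read ALL of its input before it can print the last line
# first, so nothing appears until the producer has finished.
producer | tac
```

A pipeline streams end to end only if every stage streams; one buffering stage in the middle delays the whole chain, which would match what we see with pg_loader.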


