[OPEN-ILS-DEV] Problems with batch loading bibs in 2.0 RC3

Mike Rylander mrylander at gmail.com
Fri Jan 28 12:50:25 EST 2011


On Fri, Jan 28, 2011 at 10:33 AM, John Craig
<jc-mailinglist at alphagconsulting.com> wrote:
> Hi Folks,
>
> With the new in-DB ingest functionality, I'm having a devil of a time
> getting a batch load of bibs to go in without problems of one sort or
> another. I've processed the records using the marc2bre.pl and
> pg_loader.pl utilities. (A side question: is parallel_pg_loader.pl no
> longer an option? Keeping all the SQL in RAM means that, in the past,
> I've hit a limit with files above a certain size.)

This still works fine; you're just "parallelizing" a single output file
per run (see the sketch below).

> So I've got my SQL (which is now, of course, just COPY commands to load
> biblio.record_entry rows).
>
> When I tried to do it all in one big file, the ~230K records did not go
> in within a reasonable amount of time. This is a process test, so I
> looked for ways to speed it up.
>
> Since the whole process seemed to be CPU-bound, I tried breaking the file up
> into five chunks of 50K each (the last one somewhat smaller, obviously).
> Trying to load them all at once seemed to go all right for a while
> (two to four CPUs were fully utilized), but in the long run it resulted
> in two different problems that were both rather surprising. So,
> overnight, I got only one of four batches of 50K records to load.
>
> o   Duplicate TCNs
>      My assumption was that the marc2bre.pl utility would ensure
> uniqueness across multiple MARC files, but as I thought more about it, I
> don't see how it could--unless it saved the TCNs out to a file to be
> read by the next batch, which would be of little use when processing in
> parallel. I find the TCN to be a pain for data loading, and a field that
> provides little utility, so my thought was I'd consider just dropping
> the unique index and then reassigning each record a unique value based
> on biblio.record_entry.id after the load.
>


After the marc2bre.pl step, use the Unix split command to break the bre
file into chunks, then run each chunk through [parallel_]pg_loader.pl
separately. That way TCN assignment happens in a single marc2bre.pl pass
instead of several independent ones.
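
Something like this, for example. This is only a sketch: the *_OPTS
variables stand in for whatever options you already pass to the two
scripts at your site, and the 50000-line chunk size is just the figure
from your test.

  # Run marc2bre.pl once over the whole set, so TCN assignment is
  # handled in a single pass, then split the output for loading.
  BRE_OPTS=""      # your usual marc2bre.pl options go here
  LOADER_OPTS=""   # your usual [parallel_]pg_loader.pl options go here

  perl marc2bre.pl $BRE_OPTS records.mrc > records.bre

  # marc2bre.pl emits one bre object per line, so a line-based split
  # keeps records intact.
  split -l 50000 records.bre bre_chunk_

  # One loader run per chunk, one SQL file per run.
  for f in bre_chunk_*; do
      perl pg_loader.pl $LOADER_OPTS < "$f" > "$f.sql"
  done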

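And if you do end up going the route you describe--dropping the unique
index and backfilling TCNs from the record id--the post-load step could
be as small as the following sketch. The 'batch-' prefix and the id
cutoff are made-up values, and you'd want to look up the real name of
the unique index (\d biblio.record_entry in psql) before dropping it.

  -- Drop the unique TCN index first (name varies; check your schema).
  -- Then, after the load, backfill unique TCNs from the record id,
  -- assuming the batch occupied ids >= 1000000 (illustrative value):
  UPDATE biblio.record_entry
     SET tcn_value = 'batch-' || id
   WHERE id >= 1000000;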


> o   Deadlocks
>      This occurred multiple times in various tests with two to four files loading
> simultaneously. The process is apparently complex enough that it can
> deadlock even though the function calls are presumably doing the processing
> in the same order for each record. Has anyone done tests with multiple
> sessions using some kind of configuration that avoided this problem? I don't
> see how this is going to be much different if people are using the Vandelay
> tools to do multiple bib loads simultaneously.
>
>     Based on some info from the IRC channel, I had set
> ingest.assume_inserts_only & ingest.metarecord_mapping.skip_on_insert to
> true. That may have made the deadlocking scenarios less likely, but it
> did not prevent a deadlock from killing one of the four processes that
> ran overnight. Moreover, in a production environment, you can't very
> well leave either of these options set that way, can you?
>

Edit each output file from the pg_loader tool to set the appropriate
internal flags at the top, and to reset them at the bottom, before the
commit.  This will make those flags active for only those loading
transactions.
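
Concretely, the top and bottom of each edited loader file might look
like the following. This assumes, as in stock 2.0 schemas, that the
flags live in config.internal_flag, and that the loader output already
wraps everything in a single BEGIN/COMMIT:

  BEGIN;

  -- Top of the file: enable the ingest shortcuts for this
  -- transaction's work only.
  UPDATE config.internal_flag SET enabled = TRUE
   WHERE name IN ('ingest.assume_inserts_only',
                  'ingest.metarecord_mapping.skip_on_insert');

  -- ... the generated COPY biblio.record_entry ... data goes here ...

  -- Bottom of the file: disable them again before committing, so the
  -- committed state other sessions see never has them enabled.
  UPDATE config.internal_flag SET enabled = FALSE
   WHERE name IN ('ingest.assume_inserts_only',
                  'ingest.metarecord_mapping.skip_on_insert');

  COMMIT;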

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com

