[OPEN-ILS-DEV] Updating a large set of bib records

Jason Stephenson jstephenson at mvlc.org
Fri Apr 25 08:58:33 EDT 2014


On 04/24/2014 06:12 PM, Kyle Tomita wrote:
> Hi Everyone,
> 
> I am tasked with importing a large set of bibliographic marc records
> (under 1 million). 
> 
> I have leveraged work from Jason Stephenson,
> http://git.mvlcstaff.org/?p=jason/backstage.git;a=summary. 
> 
> The import script has been modified so that, instead of doing the
> update directly, it creates SQL files with the update commands.  These
> files have about 10,000 records per file.  This bypasses checking with
> the database and just creates the update scripts from the MARC records.

That's a common approach, but not one that I recommend.

When I do batches, I either use fork() in Perl or Java threads, along
with the Perl DBI module or Java's JDBC, respectively. Using batch SQL
statements with threads in Java has given me the best load performance,
even faster than running COPY statements with files on the server.

I typically have the data files and software on a server other than the
database server. That is, I do the load over the network rather than on
the database server itself.

If the OpenSRF services are stopped, or the load happens in the middle
of the night when the system is not busy, I'll have the software
simultaneously run a number of batches equal to the number of CPU cores
on the database server minus 1. If I'm doing this during the day,
particularly when libraries are open, I'll either not run the updates
in parallel at all or run no more than half as many batches as the
database server has cores.

Massive bib loads or updates are best done when the system is not busy.
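
Here is roughly what the fork() approach looks like in Perl with DBI.
This is only a sketch, not the Backstage code: the connection details
(db.example.org, the evergreen user and password), the child count, and
the body of load_file() are placeholders you would replace with your
own.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# One input file per child; size $max_children to the database server,
# e.g. cores - 1 off hours, half that or less during the day.
my @files        = @ARGV;
my $max_children = 7;

my %kids;
for my $file (@files) {
    # Throttle: if we're at the limit, wait for a child to finish.
    if (keys(%kids) >= $max_children) {
        my $done = wait();
        delete $kids{$done};
    }
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        load_file($file);    # the child does the actual work
        exit 0;
    }
    $kids{$pid} = $file;
}
# Reap any remaining children.
while ((my $done = wait()) > 0) {
    delete $kids{$done};
}

sub load_file {
    my ($file) = @_;
    # Each child opens its own connection over the network to the
    # database server.
    my $dbh = DBI->connect(
        'dbi:Pg:dbname=evergreen;host=db.example.org',
        'evergreen', 'secret',
        { AutoCommit => 0, RaiseError => 1 }
    );
    # ... read $file and run its updates here ...
    $dbh->commit;
    $dbh->disconnect;
}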

> 
> These files are then batch processed. 
> 
> This process ignores overlay profiles, which was deemed not needed for
> this process. 

Depending on what's in the incoming records, you may be able to match
on 901$c or whatever. I add this mainly for the benefit of others
reading the thread. The Backstage program you linked above will match
on 901$c, since the files Backstage sends are cleaned-up versions of
records that originated in our database.

I have done loads with matches on ISBNs and such, but I never used
overlay profiles for that. I wrote my own matching code using the
metabib identifier field tables and others. Because of the way many
records get cataloged, such matching is inexact. Even overlay profiles
can match more than one record.
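
For the 901$c case specifically, the check itself is small. This is a
hedged sketch rather than anything from the Backstage script; it
assumes a $dbh like the one above and a MARC file name in $file:

use MARC::Batch;

# Pull the Evergreen bib ID out of 901$c and make sure the record
# still exists before generating an update for it.
my $batch = MARC::Batch->new('USMARC', $file);
my $check = $dbh->prepare(
    'SELECT id FROM biblio.record_entry WHERE id = ? AND NOT deleted');

while (my $record = $batch->next) {
    my $bib_id = $record->subfield('901', 'c');
    next unless defined $bib_id && $bib_id =~ /^\d+$/;
    $check->execute($bib_id);
    if ($check->fetchrow_array) {
        # Match: queue an update to biblio.record_entry.marc here.
    }
    else {
        # No match: skip it, or treat it as a new record, depending
        # on the project.
    }
}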

What I've done when matching records is to skip importing the incoming
record when it matches an existing one. That was for the case of
importing records from a new member library joining our consortium, so
if you have updates for your database, you'll want to handle matches
differently.

> 
> Before the update, triggers on the biblio.record_entry are turned off,
> particularly the reingest.  We run a full reingest after all the records
> have been updated. 

I did that for our migration. I don't do that for most of the updates
from Backstage that I run today, since they are typically only a few
thousand records at a time.
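
For what it's worth, if you do disable the triggers for a big load the
way Kyle describes, the core of it is just a couple of statements, run
as a database user that owns biblio.record_entry. This is only a
sketch (assuming the $dbh from above with AutoCommit off, hence the
explicit commits), and it still leaves you on the hook for the full
reingest afterward:

# Turn off the user-level triggers on biblio.record_entry for the
# duration of the load, then turn them back on.
$dbh->do('ALTER TABLE biblio.record_entry DISABLE TRIGGER USER');
$dbh->commit;

# ... run the big update here ...

$dbh->do('ALTER TABLE biblio.record_entry ENABLE TRIGGER USER');
$dbh->commit;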

There are some other flags you can adjust that will further improve the
load speed. For instance, I set enabled to true on the following flags
during our migration in 2011:

ingest.metarecord_mapping.skip_on_insert
ingest.disable_authority_linking
ingest.assume_inserts_only

Running the query below in your Evergreen database will reveal all of
the ingest-related settings that you might want to turn on or turn off
depending on your situation:

select name from config.internal_flag where name like 'ingest.%';

If you do mess with any of those flags, you'll need to be sure to do
the appropriate follow-up steps after your load finishes, for instance
the metarecord mapping and authority linking that the flags above skip.
If you forget, search will not give the results you want.

You will also want to remember to set the flags back to their original
values when you're done.
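
Flipping the flags around the load can be as simple as the following.
It's a sketch that assumes the flags start out disabled; check
config.internal_flag first and restore whatever values you actually
find there:

my @flags = qw(
    ingest.metarecord_mapping.skip_on_insert
    ingest.disable_authority_linking
    ingest.assume_inserts_only
);

# Enable the flags before the load starts...
$dbh->do('UPDATE config.internal_flag SET enabled = TRUE WHERE name = ?',
    undef, $_) for @flags;
$dbh->commit;

# ... run the load ...

# ...and put them back when the load is done.
$dbh->do('UPDATE config.internal_flag SET enabled = FALSE WHERE name = ?',
    undef, $_) for @flags;
$dbh->commit;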

> 
> My reason for posting this is to get feedback from others who are
> charged with updating a large set of bib records (over 500,000) about
> the way in which they succeeded and also pitfalls. 

It is a fairly common question. Unfortunately, there isn't really a
one-size-fits-all solution. The changes to marc_stream_importer to
allow it to load records from files instead of just over a network
port are a step in that direction.

I still typically end up writing something unique for each batch of
records that I have to deal with. I find that I very often have to do
special scrubbing of each set of records. Records from different
sources often have their own quirks that have to be matched up with
what we do at MVLC.

Depending on the time that I have to devote to the project and the
number of incoming records, I usually do not split the load up into
batches at all and just run the whole thing through in sequence. I
will typically do this such that each record is its own database
transaction, with appropriate error handling so that one failed record
doesn't stop the whole batch.
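
In DBI terms, the per-record transaction pattern is roughly the
following. The UPDATE statement and the $rec fields (bib_id, marc_xml)
are stand-ins for whatever your load actually produces, and it assumes
a handle opened with AutoCommit off and RaiseError on, as in the
sketch above:

# One transaction per record, so a bad record gets logged and rolled
# back without taking the rest of the run down with it.
my $update = $dbh->prepare(
    'UPDATE biblio.record_entry SET marc = ? WHERE id = ?');

for my $rec (@records) {
    eval {
        $update->execute($rec->{marc_xml}, $rec->{bib_id});
        $dbh->commit;
    };
    if ($@) {
        warn "record $rec->{bib_id} failed: $@";
        $dbh->rollback;
    }
}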

The timeliness of getting the records loaded (i.e. "by next Wednesday"
or some such) typically overrides the raw speed of loading the records.
It is rare that I need to worry about loading X number of records per
minute.


> 
> Kyle Tomita
> 
> Developer II, Catalyst IT Services
> 
> Beaverton Office


-- 
Jason Stephenson
Assistant Director for Technology Services
Merrimack Valley Library Consortium
1600 Osgood ST, Suite 2094
North Andover, MA 01845
Phone: 978-557-5891
Email: jstephenson at mvlc.org

