[OPEN-ILS-GENERAL] Marc_stream_importer for batch loading

Martha Driscoll driscoll at noblenet.org
Fri Mar 14 16:29:30 EDT 2014


We have recently come up with a good way to load electronic resource 
records that I wanted to share.

We have been struggling with how to load our electronic resource MARC 
records into Evergreen.  We constantly receive files from vendors and 
our cataloger loads them through Vandelay.  Some records match 
on-file records and just add an 856 link.  Other records are new and 
need to be added.  Vandelay is a great tool because you can set up 
match criteria and overlay profiles.

The only problem is that Vandelay will time out on a file of more than 
500 records.  We have tried splitting the files into 500-record chunks, 
but the overhead of queuing up the files, especially when you split a 
20,000-record file into 40 pieces, adds up.

The solution we have been happy with is an updated version of 
marc_stream_importer.pl that Bill Erickson recently worked on (LP# 
1279998).  Bill added support for overlay 1 match, overlay best match, 
and import non-matching records.  By default marc_stream_importer 
assumes you have supplied a record ID in a 901 $c.  This version now 
supports all the Vandelay options but can be run from the command line, 
which also means you can script the loading of records.

Here is how I load a file:

marc_stream_importer.pl --spoolfile /home/opensrf/file-7 --user xxx \
  --password xxx --source 102 --merge-profile 2 --queue 11391 \
  --auto-overlay-best-match --import-no-match --nodaemon

The record source and merge profile are specified on the command line. 
The queue contains the record match set.  If there are no errors, 
marc_stream_importer will empty the queue.

I can find the record IDs of records added or updated in the log files:

#!/usr/bin/perl
use strict;
use warnings;

# Pull the activity log lines for our Vandelay queue.
my @imported = `grep queue=11391 /var/log/evergreen/prod/2014/03/14/activity.log`;

foreach my $line (@imported) {
    # Skip entries that did not get a new record ID.
    next if $line =~ /imported_as= ischanged/;
    # Keep just the "imported_as=<record id>" portion.
    $line =~ s/.*(imported_as=[0-9]+) .*/$1/;
    print $line;
}

Marc_stream_importer, like Vandelay, still has problems loading more 
than 500 records at a time.  I was getting 'out of shared memory' 
errors (see LP#1271661).  The good news is that files can be easily 
split using yaz-marcdump and the commands can then be stacked in a 
shell script.

Here is how to split a file into 500-record files:

yaz-marcdump -i marc -o marc -s file- -C 500 mybigfile.mrc > /dev/null

Then it's just a matter of creating a shell script to run through the 
files one at a time, piping the output to a log file so I can verify 
that the records loaded.  Over the last 4 nights I was able to load 4 
files of 5,900 records each.
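Such a wrapper script might look like the sketch below.  The chunk 
prefix (file-), the spool and log paths, and the credentials are 
placeholders carried over from the earlier example; adjust them to 
match your yaz-marcdump prefix and site settings.

```shell
#!/bin/sh
# load_chunks: run marc_stream_importer.pl over every yaz-marcdump
# chunk in a directory, one file at a time, appending all output to a
# single log for later review.
load_chunks() {
    spool_dir=$1
    log=$2
    for f in "$spool_dir"/file-*; do
        [ -e "$f" ] || continue   # no chunks present; skip literal glob
        echo "=== $f ===" >> "$log"
        marc_stream_importer.pl --spoolfile "$f" --user xxx --password xxx \
            --source 102 --merge-profile 2 --queue 11391 \
            --auto-overlay-best-match --import-no-match --nodaemon \
            >> "$log" 2>&1
    done
}

# Placeholder paths; only runs if the spool directory exists.
if [ -d /home/opensrf ]; then
    load_chunks /home/opensrf /home/opensrf/load.log
fi
```

Keeping all the output in one log file means the record-ID grep shown 
above (or a quick scan for errors) can be run once over the whole 
night's load.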

-- 
Martha Driscoll
Systems Manager
North of Boston Library Exchange
Danvers, Massachusetts
www.noblenet.org

