[OPEN-ILS-GENERAL] Marc_stream_importer for batch loading

Janet Schrader jschrader at cwmars.org
Fri Mar 14 17:46:31 EDT 2014


Hi Martha,

This is good news. I hope I can accomplish this.

I have been loading files of 1,000 records each. It takes about 35-40 minutes per file, but they don't time out. I do the queue first to get some idea of how many will match and on what, because sometimes I want to overlay and preserve the 856s and sometimes I want to just add the new 856s. Then I start the load. While that load is processing, I do another queue. Still, I was only loading 5 or 6 files a day, so this will definitely speed up the process. Two libraries in our consortium want me to load EBSCO records, 112,000 for each library.

If you select to overlay 1 match and import non-matching, do you get a list of the records that didn't load? Those would be ones with 2 or more matches.

Thanks,
Janet

Janet Schrader
C/W MARS Inc.
Supervisor of Bibliographic Services
67 Millbrook Street, Suite 201
Worcester, MA 01606
tel: 508-755-3323 ext. 25
fax: 508-757-7801
jschrader at cwmars.org



-----Original Message-----
From: open-ils-general-bounces at list.georgialibraries.org [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Martha Driscoll
Sent: Friday, March 14, 2014 4:30 PM
To: open-ils-general at list.georgialibraries.org
Subject: [OPEN-ILS-GENERAL] Marc_stream_importer for batch loading

We have recently come up with a good way to load electronic resource records that I wanted to share.

We have been struggling with how to load our electronic resource MARC records into Evergreen.  We constantly receive files from vendors and our cataloger loads them through Vandelay.  Sometimes the incoming records match on-file records and just need an 856 link added.  Other records are new and need to be added.  Vandelay is a great tool because you can set up match criteria and overlay profiles.

The only problem is that Vandelay will time out with a file of more than 500 records.  We have tried splitting the files into 500-record chunks, but the overhead of queuing up the files, especially when you split a 20,000-record file into 40 pieces, can add up.

The solution we have been happy with is an updated version of marc_stream_importer.pl that Bill Erickson recently worked on (LP# 1279998).  Bill added support for overlay 1 match, overlay best match, and import non-matching records.  By default marc_stream_importer assumes you have supplied a record ID in a 901 $c.  This version now supports all the Vandelay options but can be run from the command line, which also means you can script the loading of records.

Here is how I load a file:

marc_stream_importer.pl --spoolfile /home/opensrf/file-7 --user xxx \
  --password xxx --source 102 --merge-profile 2 --queue 11391 \
  --auto-overlay-best-match --import-no-match --nodaemon

The record source and merge profile are specified on the command line. 
The queue contains the record match set.  If there are no errors, marc_stream_importer will empty the queue.

I can find the record IDs of records added or updated in the log files:

#!/usr/bin/perl
use strict;
use warnings;

# Pull this queue's lines out of the Evergreen activity log.
my @imported = `grep queue=11391 /var/log/evergreen/prod/2014/03/14/activity.log`;

foreach my $line (@imported) {
    # Skip lines where imported_as= carries no record ID.
    next if $line =~ /imported_as= ischanged/;
    # Reduce the line to just the "imported_as=<record id>" token.
    $line =~ s/.*(imported_as=[0-9]+) .*/$1/;
    print $line;
}

Marc_stream_importer, like Vandelay, still has problems loading more than 500 records at a time.  I was getting 'out of shared memory' errors (see LP# 1271661).  The good news is that files can be easily split using yaz-marcdump, and then the commands can be stacked in a shell script.

Here is how to split a file into 500-record files:

yaz-marcdump -i marc -o marc -s file- -C 500 mybigfile.mrc > /dev/null

Then it's just a matter of creating a shell script to run through the files one at a time, piping the output to a log file so I can verify the records loaded.  Over the last 4 nights I was able to load 4 files of 5,900 records each.
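The shell script itself isn't shown above, so here is a minimal sketch of what the "stacked commands" could look like.  It assumes three chunks named file-1 through file-3 (the kind of names yaz-marcdump produces with "-s file-") and reuses the importer flags from the command earlier in this message; the load.sh and load.log names are assumptions.  It writes one importer invocation per chunk into load.sh, which can then be run overnight:

```shell
#!/bin/sh
# Sketch only: chunk names, load.sh, and load.log are assumptions;
# the marc_stream_importer.pl flags are the ones shown above.
# Emit one importer invocation per 500-record chunk into load.sh.
: > load.sh
for f in file-1 file-2 file-3; do
    echo "marc_stream_importer.pl --spoolfile /home/opensrf/$f --user xxx --password xxx --source 102 --merge-profile 2 --queue 11391 --auto-overlay-best-match --import-no-match --nodaemon >> load.log 2>&1" >> load.sh
done
```

Running "sh load.sh" then works through the chunks one at a time, with each run's output appended to load.log for later verification.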

--
Martha Driscoll
Systems Manager
North of Boston Library Exchange
Danvers, Massachusetts
www.noblenet.org

