[OPEN-ILS-DEV] More on marc2bre.pl

Wed Oct 1 08:52:07 EDT 2008

On Tue, Sep 30, 2008 at 12:31 PM, Andrews, Mark J.
<MarkAndrews at creighton.edu> wrote:
> Picking up on Dan Wells's note about revising marc2bre, I've been playing
> with an older VMWare image of Evergreen (v1.2.1.3 or something like that).
> Why?  'Cause this image was built with 20 GB of (virtual) disk space, plenty
> to handle a copy of Creighton's bibs (669,000) and associated items (1.1
> million).  There are other, newer VMWare images, but they don't have so much
> space.
>
>
>
> Lots of space gives me room to run scripts against, say, a 1 GB input file
> and get ginormous (1, 2, 4, 5, 7 or more GB) output files out the other
> end.
>
>
>
> My problem at the moment is (I'm guessing the file name, but you'll know
> what I mean) pg_loader_bre.ql (or something like that) contains duplicate
> "bre" records.  The target table in PostgreSQL has at least one column set
> to "no dups," so the import fails.  Dan Scott suggested I grep around the
> duplicate records.  That's always an option, but I reasoned it'd be quicker
> to create a clean export file from the source system, and then have that
> clean file to process on the target side.
>
>
>
> I found a way to tell the export program on the source side to put an
> integer into the tag and subfield of my choice.  This integer value simply
> numbers the bib records in the output file from first to last.  That way I
> have a guaranteed (sic?) unique ID number in the source file.  However, I
> discovered on import that there is still some other field declared as
> unique, which causes PostgreSQL to do what it does, and stop the import when
> it finds a duplicate key.  Hmmm, what to do?
>
>
>
> I suggest the processing script (somehow) identify duplicate records, write
> them to an exception file, and skip to the next record.  This is potentially
> difficult because the import scripts, several *.sql files, contain related
> records to *.bre records.  So a duplicate *.bre record would be skipped,
> along with any related records in other files.  I wonder how to do this?
>

There are command line options meant to help with that by allowing you
to supply a file that contains the list of TCNs that are spoken for,
and to offset the start of a particular file's IDs by a set amount, but
since I just committed Dan's new version of marc2bre.pl I would
recommend you grab a copy of that.  It's more explicit in the options,
generally cleaner and all around better than the old (serviceable, but
cantankerous) version that has evolved over the last 3 years or so.

The copy in Dan's recent email is the same as what was committed to
trunk, minus a few formatting changes (more of which are needed ... it
was decided to move to space-indenting instead of tabs, and many files
are still suffering).  Take a look inside the new script at the option
comments.  I believe a combination of idfield/idsubfield and
tcnfield/tcnsubfield should be all you need, assuming that the TCNs
are unique in the source system.  You should be able to use the same
value for idfield and tcnfield.
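
If pre-filtering is still wanted, the skip-and-log idea from the quoted
message can be sketched roughly as below.  This is Python rather than
Perl and is not part of marc2bre.pl; the function name
split_duplicates, the dict-shaped records, and the "tcn" and "id" keys
are all hypothetical stand-ins for whatever the export really contains.

```python
def split_duplicates(records, key, taken=None):
    """Separate records into (kept, dupes) by a uniqueness key.

    records: iterable of record objects (hypothetical dicts here)
    key:     function returning the value that must be unique (e.g. a TCN)
    taken:   optional iterable of key values already spoken for
    """
    seen = set(taken or ())
    kept, dupes = [], []
    for rec in records:
        k = key(rec)
        if k in seen:
            # Duplicate key: route to the exception list and move on.
            dupes.append(rec)
        else:
            seen.add(k)
            kept.append(rec)
    return kept, dupes


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        {"id": 1, "tcn": "ocm001"},
        {"id": 2, "tcn": "ocm002"},
        {"id": 3, "tcn": "ocm001"},  # repeats the TCN of record 1
    ]
    kept, dupes = split_duplicates(sample, key=lambda r: r["tcn"])
    print("kept ids:", [r["id"] for r in kept])
    print("exception ids:", [r["id"] for r in dupes])
```

The IDs of the records that land in the exception list could then be
used to drop the matching rows from the related loader files, so the
skipped bibs do not leave orphans behind.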

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com