[OPEN-ILS-DEV] More on marc2bre.pl
Mike Rylander
mrylander at gmail.com
Wed Oct 1 08:52:07 EDT 2008
On Tue, Sep 30, 2008 at 12:31 PM, Andrews, Mark J.
<MarkAndrews at creighton.edu> wrote:
> Picking up on Dan Wells's note about revising marc2bre, I've been playing
> with an older VMWare image of Evergreen (v1.2.1.3 or something like that).
> Why? 'Cause this image was built with 20 GB of (virtual) disk space, plenty
> to handle a copy of Creighton's bibs (669,000) and associated items (1.1
> million). There are other, newer VMWare images, but they don't have so much
> space.
>
> Lots of space gives me room to run scripts against, say, a 1 GB input file
> and get ginormous (1, 2, 4, 5, 7 or more GB) output files out the other
> end.
>
> My problem at the moment is (I'm guessing the file name, but you'll know
> what I mean) pg_loader_bre.ql (or something like that) contains duplicate
> "bre" records. The target table in PostgreSQL has at least one column set
> to "no dups," so the import fails. Dan Scott suggested I grep around the
> duplicate records. That's always an option, but I reasoned it'd be quicker
> to create a clean export file from the source system, and then have that
> clean file to process on the target side.
>
> I found a way to tell the export program on the source side to put an
> integer into the tag and subfield of my choice. This integer value simply
> numbers the bib records in the output file from first to last. That way I
> have a guaranteed (sic?) unique ID number in the source file. However, I
> discovered on import that there is still some other field declared as
> unique, which causes PostgreSQL to do what it does, and stop the import when
> it finds a duplicate key. Hmmm, what to do?
>
> I suggest the processing script (somehow) identify duplicate records, write
> them to an exception file, and skip to the next record. This is potentially
> difficult because the import scripts, several *.sql files, contain related
> records to *.bre records. So a duplicate *.bre record would be skipped,
> along with any related records in other files. I wonder how to do this?
>
There are command line options meant to help with that by allowing you
to supply a file that contains the list of TCNs that are spoken for,
and to offset the start of a particular file's IDs by a set amount, but
since I just committed Dan's new version of marc2bre.pl I would
recommend you grab a copy of that. It's more explicit in the options,
generally cleaner and all around better than the old (serviceable, but
cantankerous) version that has evolved over the last 3 years or so.
The copy in Dan's recent email is the same as what was committed to
trunk, minus a few formatting changes (more of which are needed ... it
was decided to move to space-indenting instead of tabs, and many files
are still suffering). Take a look inside the new script at the option
comments. I believe a combination of idfield/idsubfield and
tcnfield/tcnsubfield should be all you need, assuming that the TCNs
are unique in the source system. You should be able to use the same
value for idfield and tcnfield.
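
For what it's worth, the skip-and-log idea you describe can be sketched
roughly as below. This is only an illustration in Python, not what
marc2bre.pl actually does: the function names, the (id, TCN) tuples and
the sample values are all made up, and a real run would read the IDs and
TCNs out of the MARC records instead of a hard-coded list.

```python
def split_duplicates(records, used_tcns=()):
    """Split (record_id, tcn) pairs into kept records and exceptions.

    A record is treated as an exception when its TCN is already spoken
    for, either by the used_tcns collection or by an earlier record in
    the same input. (Hypothetical helper, for illustration only.)
    """
    seen = set(used_tcns)
    kept, exceptions = [], []
    for record_id, tcn in records:
        if tcn in seen:
            # Duplicate TCN: log it for the exception file and move on.
            exceptions.append((record_id, tcn))
        else:
            seen.add(tcn)
            kept.append((record_id, tcn))
    return kept, exceptions


def filter_related(rows, skipped_ids):
    """Drop related rows (e.g. items) whose bib ID was skipped."""
    return [row for row in rows if row[0] not in skipped_ids]


# Made-up sample data: three bibs, one TCN already taken in the target.
records = [(1, "ocm001"), (2, "ocm002"), (3, "ocm001")]
kept, exceptions = split_duplicates(records, used_tcns={"ocm002"})

# Related records keyed by bib ID are skipped along with their bib.
skipped_ids = {record_id for record_id, _ in exceptions}
items = [(1, "barcode-a"), (2, "barcode-b"), (3, "barcode-c")]
kept_items = filter_related(items, skipped_ids)
```

Keeping the set of skipped bib IDs around is what lets the related
files be filtered in a second pass, so a dropped bib does not leave
orphaned rows behind.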
--
Mike Rylander
| VP, Research and Design
| Equinox Software, Inc. / The Evergreen Experts
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker at esilibrary.com
| web: http://www.esilibrary.com