[OPEN-ILS-DEV] Keeping bib / auth / MFHD / record identifiers in sync

Dan Scott dan at coffeecode.net
Thu Nov 19 11:01:27 EST 2009


Thanks for the reply, Mike. The brakes are on until we can discuss this
further. A first attempt, below...

On Thu, 2009-11-19 at 08:34 -0500, Mike Rylander wrote:
<snip>
> >
> > 1a) Is there general interest in having such a thing (setting the 001
> > for authority records, bib records, and MFHD records to the system
> > record ID as stored in the database) as an option in Evergreen? My
> > suspicion is a strong "yes".
> >
> 
> Actually, no.  Setting /a/ field would be good -- pinning it to 001
> would be bad.  Some on why is interspersed below.

<snip>

> > 3) Are there any obvious technical flaws in the record-ID-as-TCN-value
> > approach? The TCN doesn't seem to be used much internally in Evergreen,
> > other than as a limited means of trying to prevent duplicate bib
> > records. I could see this as being a separate option for the database
> > schema, again, as sites that are entirely dependent on OCLC for their
> > TCN values probably don't want the record-ID-as-TCN-value approach.
> >
> 
> It's far from sites that are entirely dependent.  And, it's
> essentially every public library installation.  I share your distaste
> for TCN, but most non-Conifer libraries are making use of it
> specifically.

Okay. It would be useful to know what other sites are doing with their
TCNs. Does ILL hinge on this, for example?

> > 4) If I do implement this, I assume a corresponding nicety would be to
> > make marc2?re.pl modify incoming records in the same way, so that the
> > triggers could be dropped for large imports.
> >
> 
> I disagree.  This is part of Evergreen ingest if it exists, so must be
> done even for batches of records.

Okay - so the existing code that's in marc2bre.pl to create the 901
should really be moved into Ingest.pm?

> > 5) Are there alternate implementation approaches to consider?
> >
> 
> Well, the first alternate approach is to extend the mechanism that is
> already in place -- use the 901.  We modify the 901 on export today
> and have specialized tools to use it on import.  Setting up the
> existing code to manage this on import/update would be simple and
> allow you to accomplish what you want (or, what I understand you to
> want: an in-record copy of the internal EG id) without forcing all
> sites to choose between having the id or the tcn.  If we do this, we
> should make it required and just do it.

I don't see how that helps us (where "us" == Conifer, in this case) with
the TCN conflicts we frequently experience when we import or edit a
record (based on the 001 and/or 035 I suppose). This is what prompted my
thinking in the first place. Perhaps the problem is entirely of our own
making, and the answer for us is to simply set the 001 to the bib record
ID across our database.

For the MFHD records, the 004 of the MFHD record needs to point to the
001 of the bib record on export. But during our migration in May, the
001 was stripped from approximately half of our bib records when
ingested via the marc2bre / direct_ingest / parallel_pg_loader route,
probably due to the following line in marc2bre.pl:

$rec->delete_field($_)
    for ($rec->field('901', $tcn_field, $id_field, @trash_fields));

There was undoubtedly some way to avoid this fate (I can't say that I
understand why that script deletes the TCN and ID fields), but we had
conflicting TCN values within a single set of incoming records, let
alone the collisions between record IDs for two sets of sequentially
numbered records that both started at 1.
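
To make the effect of that line concrete, here is a rough sketch in
Python rather than the Perl of marc2bre.pl. It is only a toy model: the
record is a plain list of (tag, value) pairs, strip_fields is a made-up
name, and the assumption that both the TCN field and the ID field were
pointed at 001 is my guess at what happened, not something I have
confirmed.

```python
# Toy model of a MARC record: an ordered list of (tag, value) pairs.
# Illustration only; this is not the MARC::Record API.

def strip_fields(record, tcn_field, id_field, trash_fields=()):
    """Drop every field tagged 901, the TCN field, the ID field, or a
    trash field, mirroring what the delete_field() line does."""
    doomed = {"901", tcn_field, id_field, *trash_fields}
    return [(tag, value) for tag, value in record if tag not in doomed]


record = [
    ("001", "ocm12345678"),      # control number (hypothetical value)
    ("245", "An example title"),
    ("901", "old export data"),
]

# Assumption: both options name 001, so the 001 disappears with the 901.
stripped = strip_fields(record, tcn_field="001", id_field="001")
print(stripped)  # [('245', 'An example title')]
```

If the TCN and ID fields are set to tags other than 001, the 001
survives in this model, which would be consistent with only some of the
records losing it.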

Geez, maybe we should have hired somebody to migrate our data. /me ducks

> In addition to the code essentially existing today in another place,
> we already make use of the 901 elsewhere, and if it's there we assume
> certain semantics.  So, working under the assumption that any field is
> as good as another for storing the internal id (for your purposes),
> what would be the argument against continuing to use the 901?

Hmm. I think that's an incorrect assumption. The 901 proposal doesn't
help us with MFHD export (where "us" == the Evergreen community
interested in MFHD conformance, in this case) - or at least I don't
understand how it helps us. To conform to the MFHD standard, we have to
set the 004 of the MFHD record to match the system control number (035,
often matched to 001) of the corresponding bib record. But that's
talking about manipulation of the 004 for the purposes of conformance to
the MARC and MFHD standards. 
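
As a rough sketch of the conformance rule, here is a Python check that
an MFHD record's 004 points at its bib record. It simplifies by
comparing against the bib 001 only (ignoring the 035), the records are
toy lists of (tag, value) pairs, and the function names are invented for
this example.

```python
# Toy records: ordered lists of (tag, value) pairs, not a real MARC API.

def control_number(record):
    """Return the value of the first 001 field, or None if there is none."""
    for tag, value in record:
        if tag == "001":
            return value
    return None


def mfhd_matches_bib(mfhd, bib):
    """True if the MFHD has a 004 equal to the bib record's 001."""
    bib_id = control_number(bib)
    if bib_id is None:
        # A bib with its 001 stripped can never be matched by a 004.
        return False
    return any(tag == "004" and value == bib_id for tag, value in mfhd)


bib = [("001", "1001"), ("245", "An example serial")]
good_mfhd = [("004", "1001"), ("852", "Main stacks")]
stale_mfhd = [("004", "ocm12345678"), ("852", "Main stacks")]

print(mfhd_matches_bib(good_mfhd, bib))
print(mfhd_matches_bib(stale_mfhd, bib))
```

The first call reports a match and the second does not. A bib record
that has lost its 001 fails the check no matter what the 004 says.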

The 901 approach also doesn't help the institution that needs to act as an
authority source, and needs to populate the 001 for the authority
records that they create and maintain with a sequential identifier. Yes,
there will be plenty more work required to support that use case, but
nailing down the 001 is an extremely important piece.

I suppose we could make the exporter look up the 901 of the bib record
and set the 001 of the bib record to the 901 value on export, and
likewise on MFHD export do a lookup of the 901 from the corresponding
bib record to set the 004, but then we're carefully avoiding touching
the 004 of the MFHD as it exists in the serial.record_entry column only
to rewrite it any time it's accessed. And... I don't understand why we
would do that, instead of just having it be correct in the database in
the first place.
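
Sketched in Python, the export-time rewrite would look something like
the following. The records are toy dicts keyed by tag, the function
names are hypothetical, and I am assuming the internal record ID lives
in subfield c of the 901; treat that subfield as an assumption of this
example.

```python
# Toy records: dicts keyed by MARC tag; the 901 is a dict of subfields.

def export_bib(bib):
    """Return a copy of the bib with 001 overwritten from 901$c."""
    out = dict(bib)
    out["001"] = bib["901"]["c"]   # assumed home of the internal ID
    return out


def export_mfhd(mfhd, bib):
    """Return a copy of the MFHD with 004 set from the bib's 901$c."""
    out = dict(mfhd)
    out["004"] = bib["901"]["c"]
    return out


stored_bib = {"001": "ocm12345678", "245": "An example serial",
              "901": {"c": "42"}}
stored_mfhd = {"004": "ocm12345678", "852": "Main stacks"}

exported_bib = export_bib(stored_bib)
exported_mfhd = export_mfhd(stored_mfhd, stored_bib)

# The exported pair agrees with each other...
print(exported_bib["001"], exported_mfhd["004"])
# ...but the stored copies still carry the old values.
print(stored_bib["001"], stored_mfhd["004"])
```

The stored records are never corrected in this model; every export has
to repeat the lookup, which is the cost I am questioning.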

> > * It could be added to Ingest.pm instead as part of the ingest methods,
> > which would keep all of the pertinent code in one place, and possibly
> > allow us to modify the behaviour based on actor.org_unit_settings
> 
> This is almost certainly the first place you should put any
> ingest-related code.

Okay, I can buy that.

> > (although records aren't owned by any given org unit).
> 
> Coming soon enough (next 6 months or so).

Oh! Good to know.

>  Direct
> > modifications to the database wouldn't automatically result in the
> > corresponding changes to the records, though, and I have a gut feeling
> > that these sorts of options are better implemented across the entire
> > database rather than at a library-by-library level.
> >
> 
> I disagree.  With record ownership, they should be implemented however
> the owner wants them.  If they should be the same across the board
> then the top of the org tree should own the records.

Right. I had no idea that record ownership was in the works, so that
changes this consideration, uh, considerably.

> > * There might be a more natural place to do this as part of in-database
> > ingest; I'm not sure how far Mike is planning on taking in-database
> > ingest in the near future. I could always implement this as a set of
> > optional triggers and then it could get rolled into a future in-database
> > implementation.
> >
> 
> As far as possible ... all the way ... ???  Part of it is, indeed,
> already in place and active in trunk.

Right, that's why I referenced it. But intentions and near future are
not necessarily compatible (I have a long track record that proves
that).

> All that's left is to handle what the javascript fingerprinter does in
> some way (which may be "run that first, in a vestigial stub of
> ingest.pm"), reimplement the Located URI extraction logic, and gut
> ingest.pm.  If you have a strong desire to start developing now,
> please (asking as the only person to have touched all of that code,
> and the nominal maintainer of the ingest/indexing process) put the
> code in ingest.pm for the time being.
> 

Mmm, I would definitely like to get rid of the javascript
fingerprinter :) But I'll cool my heels until I'm sure I understand the
agreed upon approach.



