[OPEN-ILS-DEV] Keeping bib / auth / MFHD / record identifiers in sync

Mike Rylander mrylander at gmail.com
Thu Nov 19 08:34:53 EST 2009


More later, but I have to put the brakes on this a bit ... (my)
answers to your questions are inline.

On Thu, Nov 19, 2009 at 6:46 AM, Dan Scott <dan at coffeecode.net> wrote:
> One of the topics that came up during a discussion with a potential
> adopter of Evergreen was the apparent lack of concern about keeping the
> 001 / 004 fields in the MARC records in the databases in sync with the
> internal system ID (or TCN value, in the case of bibs).
>
> For many of us, this hasn't been an issue because it doesn't affect our
> users.  However, for this institution, it's a pretty serious issue
> because they are a supplier of authority records. We also determined
> that there are cases in Conifer where libraries had practices that
> relied on being able to identify records by their TCN or record ID
> (there was an earlier discussion about TCN values and the 001 at
> http://groups.google.ca/group/conifer-discuss/browse_thread/thread/d0d6528e5a92f781/5d10bb4f93323c6f?#5d10bb4f93323c6f)
>
> (Aside for those contemplating their own migrations: Most of the TCN
> conflicts in Conifer arose because we merged two sets of bib records
> that were numbered sequentially starting at 1; and while I adjusted the
> record IDs for one set of records by adding 1000000 to their IDs, I
> didn't adjust the 001 correspondingly... and hilarity ensued. Learn from
> my mistakes!)
>
> My seat-of-the-pants suggestion was that we could add a database
> INSERT/UPDATE trigger to ensure that the MARC record was always kept in
> sync with its assigned record ID and/or TCN value. I also opined that,
> in our case at least, we would be happy to always have the TCN value set
> to match the record ID (it would, at least on the face of it, avoid
> "conflicting TCN" issues).
>
> Of course, suggesting something as a potential solution and actually
> implementing that solution are different things. Putting one toe in the
> water, I created and tested a trigger for keeping the MFHD records' 001
> in sync with their record ID:
>
> http://evergreen-ils.org/dokuwiki/doku.php?id=scratchpad:random_magic_spells#sync_the_001_field_of_your_serials_records_to_the_linked_bibliographic_record_id
>
> Based on a handful of tests with our data, it worked! Yay.
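
(For readers who don't follow the link: a stripped-down sketch of that
kind of trigger -- not the wiki recipe itself -- might look like the
following.  It assumes serial.record_entry's bib link is the "record"
column and that the MFHD record already carries an un-prefixed 001.)

CREATE OR REPLACE FUNCTION serial.sync_mfhd_001 () RETURNS TRIGGER AS $$
BEGIN
    -- Overwrite the existing 001 with the id of the linked bib record.
    NEW.marc := regexp_replace(
        NEW.marc,
        '<controlfield tag="001">[^<]*</controlfield>',
        '<controlfield tag="001">' || NEW.record || '</controlfield>'
    );
    RETURN NEW;
END;
$$ LANGUAGE PLPGSQL;

CREATE TRIGGER sync_mfhd_001_tgr
    BEFORE INSERT OR UPDATE ON serial.record_entry
    FOR EACH ROW EXECUTE PROCEDURE serial.sync_mfhd_001();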
>
> So now, a couple of questions and thoughts for the more
> experienced/world-weary on this topic:
>
> 1a) Is there general interest in having such a thing (setting the 001
> for authority records, bib records, and MFHD records to the system
> record ID as stored in the database) as an option in Evergreen? My
> suspicion is a strong "yes".
>

Actually, no.  Setting /a/ field would be good -- pinning it to the 001
would be bad.  Some thoughts on why are interspersed below.

> 1b) Should this be a default behaviour, or an optional piece of database
> schema that sites would need to apply separately? My suspicion is that
> caution would lean towards making it optional. Perhaps 1a and 1b are
> really questions for open-ils-general...
>

Probably (re default behaviour).  My answer to (5) suggests that it
should just be a standard internal process -- "required" I guess.

> 1c) If optional, should it be packaged in the Open-ILS/src/sql/Pg/
> directory (and a corresponding option added to eg_db_config.pl), or
> should it be a less integrated ILS-Contrib thing?
>

The toggle (if there should be one) should not be the existence of a
trigger, but the value of a setting.  It (whatever it is) should be in
the main code, IMO.
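
To make "toggle by setting" concrete: the trigger would always be
installed, and would simply bail out unless a flag is switched on.
Sketch only -- the function name, flag table and flag name below are
assumptions, not existing schema:

CREATE OR REPLACE FUNCTION biblio.sync_control_fields () RETURNS TRIGGER AS $$
BEGIN
    -- Hypothetical flag lookup; the real setting could live wherever
    -- the community decides (a global flag, an org unit setting, ...).
    IF NOT EXISTS (
        SELECT 1 FROM config.internal_flag
         WHERE name = 'ingest.sync_001' AND enabled
    ) THEN
        RETURN NEW;  -- behaviour switched off; leave the record alone
    END IF;

    -- ... rewrite NEW.marc here ...
    RETURN NEW;
END;
$$ LANGUAGE PLPGSQL;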

> 2) My test implementation used regexp_replace() to replace the contents
> of the 001 field with the record ID. That's a bit brittle, though; for
> example, a record may not have a 001. Would it make more sense to have
> the trigger call a pl/perl function that uses MARC::Record and
> MARC::File::XML to manipulate the record? It would seem to be a more
> robust approach, but I worry a tiny bit about the performance impact of
> a pl/perl approach. Not having done any benchmarking, though, perhaps
> this isn't a real concern.
>

There are bigger costs we pay at insert/update time, so I wouldn't be
concerned about this.
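
For what it's worth, the pl/perl flavour needn't be much code either.
A sketch, assuming plperlu is enabled and MARC::Record /
MARC::File::XML are installed on the database server (the function
name is invented):

CREATE OR REPLACE FUNCTION maintain_001 (marcxml TEXT, new_001 TEXT) RETURNS TEXT AS $func$
    use strict;
    use MARC::Record;
    use MARC::Field;
    use MARC::File::XML ( BinaryEncoding => 'UTF-8' );

    my ($marcxml, $new_001) = @_;
    my $r = MARC::Record->new_from_xml($marcxml, 'UTF-8');

    if (my $f = $r->field('001')) {
        $f->update($new_001);                       # replace existing data
    } else {
        $r->insert_fields_ordered(MARC::Field->new('001', $new_001));
    }

    my $xml = $r->as_xml_record();
    $xml =~ s/^<\?xml[^>]*\?>\s*//o;                # drop any XML declaration
    return $xml;
$func$ LANGUAGE PLPERLU;

A BEFORE INSERT OR UPDATE trigger would then just do
NEW.marc := maintain_001(NEW.marc, NEW.id::TEXT) (or NEW.record for
MFHD rows), which also handles the missing-001 case that makes the
regexp version brittle.  Malformed MARCXML will make new_from_xml()
die, so a production version would want that wrapped in an eval.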

> 3) Are there any obvious technical flaws in the record-ID-as-TCN-value
> approach? The TCN doesn't seem to be used much internally in Evergreen,
> other than as a limited means of trying to prevent duplicate bib
> records. I could see this as being a separate option for the database
> schema, again, as sites that are entirely dependent on OCLC for their
> TCN values probably don't want the record-ID-as-TCN-value approach.
>

It's far from just the sites that are entirely dependent on OCLC --
it's essentially every public library installation.  I share your
distaste for the TCN, but most non-Conifer libraries make specific use
of it.
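
(For concreteness, the optional record-ID-as-TCN behaviour from (3)
would amount to little more than the sketch below; sites that take
their TCNs from OCLC just wouldn't enable it.  Function and trigger
names are made up.)

CREATE OR REPLACE FUNCTION biblio.tcn_from_id () RETURNS TRIGGER AS $$
BEGIN
    -- Column defaults fire before BEFORE ROW triggers, so NEW.id is
    -- already populated from the sequence on INSERT.
    NEW.tcn_value := NEW.id::TEXT;
    RETURN NEW;
END;
$$ LANGUAGE PLPGSQL;

CREATE TRIGGER tcn_from_id_tgr
    BEFORE INSERT OR UPDATE ON biblio.record_entry
    FOR EACH ROW EXECUTE PROCEDURE biblio.tcn_from_id();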

> 4) If I do implement this, I assume a corresponding nicety would be to
> make marc2?re.pl modify incoming records in the same way, so that the
> triggers could be dropped for large imports.
>

I disagree.  If this exists at all, it's part of Evergreen ingest, so it
must be done even for batches of records.

> 5) Are there alternate implementation approaches to consider?
>

Well, the first alternate approach is to extend the mechanism that is
already in place -- use the 901.  We modify the 901 on export today
and have specialized tools to use it on import.  Setting up the
existing code to manage this on import/update would be simple and
allow you to accomplish what you want (or, what I understand you to
want: an in-record copy of the internal EG id) without forcing all
sites to choose between having the id or the tcn.  If we do this, we
should make it required and just do it.

In addition to the code essentially existing today in another place,
we already make use of the 901 elsewhere, and if it's there we assume
certain semantics.  So, working under the assumption that any field is
as good as another for storing the internal id (for your purposes),
what would be the argument against continuing to use the 901?
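
As a sketch of what in-database 901 maintenance might look like
(plperlu again; the function name is invented, and the $a/$b/$c layout
-- tcn_value / tcn_source / internal id -- is my reading of the
existing export convention, so treat it as an assumption):

CREATE OR REPLACE FUNCTION maintain_901 (marcxml TEXT, tcn TEXT, tcn_source TEXT, rec_id BIGINT) RETURNS TEXT AS $func$
    use strict;
    use MARC::Record;
    use MARC::Field;
    use MARC::File::XML ( BinaryEncoding => 'UTF-8' );

    my ($marcxml, $tcn, $tcn_source, $rec_id) = @_;
    my $r = MARC::Record->new_from_xml($marcxml, 'UTF-8');

    # Drop any stale 901s, then append a fresh one.
    $r->delete_field($_) for $r->field('901');
    $r->insert_fields_ordered(
        MARC::Field->new('901', ' ', ' ',
            a => $tcn, b => $tcn_source, c => $rec_id)
    );

    my $xml = $r->as_xml_record();
    $xml =~ s/^<\?xml[^>]*\?>\s*//o;                # drop any XML declaration
    return $xml;
$func$ LANGUAGE PLPERLU;

A BEFORE INSERT OR UPDATE trigger on biblio.record_entry could then
call it as NEW.marc := maintain_901(NEW.marc, NEW.tcn_value,
NEW.tcn_source, NEW.id), giving every site an in-record copy of the
internal id without anyone having to give up their TCN.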

> * It could be added to Ingest.pm instead as part of the ingest methods,
> which would keep all of the pertinent code in one place, and possibly
> allow us to modify the behaviour based on actor.org_unit_settings

This is almost certainly the first place you should put any
ingest-related code.

> (although records aren't owned by any given org unit).

Coming soon enough (next 6 months or so).

> Direct
> modifications to the database wouldn't automatically result in the
> corresponding changes to the records, though, and I have a gut feeling
> that these sorts of options are better implemented across the entire
> database rather than at a library-by-library level.
>

I disagree.  With record ownership, they should be implemented however
the owner wants them.  If they should be the same across the board
then the top of the org tree should own the records.

> * There might be a more natural place to do this as part of in-database
> ingest; I'm not sure how far Mike is planning on taking in-database
> ingest in the near future. I could always implement this as a set of
> optional triggers and then it could get rolled into a future in-database
> implementation.
>

As far as possible ... all the way ... ???  Part of it is, indeed,
already in place and active in trunk.

All that's left is to handle what the javascript fingerprinter does in
some way (which may be "run that first, in a vestigial stub of
ingest.pm"), reimplement the Located URI extraction logic, and gut
ingest.pm.  If you have a strong desire to start developing now,
please (asking as the only person to have touched all of that code,
and the nominal maintainer of the ingest/indexing process) put the
code in ingest.pm for the time being.

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com

