[OPEN-ILS-DEV] Bugs in marc2bre.pl?

Dan Wells dbw2 at calvin.edu
Mon Sep 29 16:05:42 EDT 2008


Hello Mike,

Well, a mere three months and two days later, here is the promised revision to marc2bre.pl.  I have spent the last several days poring over it to weed out any major bugs.  Since so much of it has been rewritten, I am going to assume some bugs remain, but I will also promise that they should now be much easier to fix thanks to the improvements, which are mostly as follows ;)

1) In addition to or instead of specifying an "ID" field, one may now specify a TCN field.  Our utter reliance on TCNs in my library may be out of the ordinary, but maybe not, so I am hoping this option will prove useful to others.
2) Some ambiguously named options have been deprecated and replaced with better ones.  Both the old and the new names are still supported.  I considered trying to standardize the use of underscores in option names, but didn't want to overstep on that.  The new 'tcn*' options are patterned after the 'id*' options (no underscores), but a few other options, old and new, keep their underscores where the name would otherwise be (subjectively) hard to read.
3) Because of this new emphasis on preserving TCNs, any code which assumed the ID and TCN to be related no longer does so.  They can of course still be the same if desired.  Many variables have been renamed to make this distinction much more explicit.
4) A recently added 'use901' flag has been expanded to now skip all ID/TCN processing entirely and simply use the values in the 901.  I am unsure if that was the intention, but it sounded good to me, and I believe many other desired effects can be achieved by now using a combination of idfield and tcnfield values.
5) Rather than defaulting to 'System' for TCN source, 'System' is reserved for TCNs which are set to match the corresponding internal record IDs and 'Unknown' is used for all others.  Also, 'Sirsi_Auto' was added for identifying imported Sirsi auto-generated TCNs (e.g. a1234567).
6) The code is now much more thoroughly commented, including basic explanations of all the options.
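
To illustrate the TCN source rules in item 5, here is a minimal sketch in Python rather than the script's own Perl.  It is not taken from the attached file.  The function name is hypothetical, and the Sirsi pattern (a lowercase "a" followed only by digits) is an assumption based solely on the example a1234567; the real test may well differ.

```python
import re

# Assumption: a Sirsi auto-generated TCN looks like the example "a1234567",
# i.e. a lowercase "a" followed only by digits.
SIRSI_AUTO = re.compile(r"a\d+")


def tcn_source(tcn, record_id):
    """Pick a TCN source label using the rules described in item 5.

    Hypothetical helper for illustration only.
    """
    # 'System' is reserved for TCNs that match the internal record ID.
    if tcn == str(record_id):
        return "System"
    # Imported Sirsi auto-generated TCNs get their own label.
    if SIRSI_AUTO.fullmatch(tcn):
        return "Sirsi_Auto"
    # Everything else is of unknown origin.
    return "Unknown"
```

Under these assumptions, a record with ID 42 and TCN "42" is labelled 'System', while the same record with TCN "a1234567" is labelled 'Sirsi_Auto'.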

There is probably more that I am forgetting about.  Since I started this rewrite a while ago, I also tried very hard to port in any changes made in trunk since I began.

It is with all humility and great respect that I offer this revision, and I do hope you accept it, as otherwise I will have wasted several days of my life.  Well, I suppose it certainly wouldn't be the first time.

Thanks,
DW



>>> "Mike Rylander" <mrylander at gmail.com> 6/27/2008 4:44 PM >>>
On Fri, Jun 27, 2008 at 4:12 PM, Dan Wells <dbw2 at calvin.edu> wrote:
> Hello all,
>
> I have been playing around with record loads of various shapes and sizes over the last week or two, and have come to the conclusion that marc2bre.pl is a bit discombobulated in its current state.  It boils down to a consistent state of confusion within the code between the record id (i.e. database id) and the record title control number.  I believe these should generally not be the same thing, and I would say about half of the code agrees with me :)  In particular, the idfield argument seems to be the source of most of my problems.  I believe it was originally meant to be a way to specify tcns, not database record ids, but since tcns are often alphanumeric, the regular expression which strips out any non-digits flies in the face of this.  The end result is that there is a bunch of code, particularly in the preprocess subroutine, that is supposed to check and intelligently set the tcn but which never gets run under normal circumstances (short of an odd dontuse_file setting).  From what I can tell, there is therefore no good way to get a file out the other end with sane tcns (unless yours happen to be all digits).
>
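
The problem described above, digit-stripping applied to alphanumeric TCNs, can be shown with a short Python sketch.  This is an illustration of the general effect, not the script's actual regular expression, and the function name and sample values are hypothetical.

```python
import re


def strip_non_digits(value):
    """Remove every character that is not a decimal digit."""
    return re.sub(r"\D", "", value)


# Two distinct alphanumeric control numbers (made-up sample values)...
first = strip_non_digits("a1234567")
second = strip_non_digits("b1234567")

# ...become indistinguishable once the letters are stripped.
collapsed = first == second
```

Stripping is harmless for a value that is meant to be a numeric database id, but it loses information when the same value is also expected to serve as the TCN.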

I'm not at a computer where I can look now, and I'm not sure which
branch you're looking at, but here's what is intended:

idfield is, in fact, meant to specify the field (subfield a) from
which to extract the database id.  More on that later.

If there is no available tcn value (as defined in preprocess()) then
the record id will be used.  There could very well have been
short-circuit logic introduced into the trunk of svn that causes the
idfield value to be used, but that is not the intention.  I'll look at
it when I get back to my computer.

The purpose of the dontuse parameter (which could certainly use a
better name) is to inform marc2bre of existing TCN values already in
use in the database, for instance when you are loading new records
into an existing implementation.  That lets it look for alternate TCNs
when there is a collision.
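
As a rough sketch of that idea in Python: treat the dontuse list as a set of TCNs already taken, and pick the first candidate that is free.  The function name, and the notion that candidates arrive as an ordered list, are assumptions made for illustration; how marc2bre actually finds its alternates is not described here.

```python
def pick_tcn(candidates, in_use):
    """Return the first candidate TCN not already in use, or None.

    `in_use` plays the role of the dontuse list: TCNs already present in
    the database. The chosen TCN is added to it so that later records in
    the same load cannot collide with it either.
    """
    for tcn in candidates:
        if tcn not in in_use:
            in_use.add(tcn)
            return tcn
    return None


# Hypothetical sample data: one TCN is already taken in the database.
in_use = {"ocm12345"}
chosen = pick_tcn(["ocm12345", "a7654321"], in_use)
```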

> I have created a new version which hopefully untangles most of this.  I left in the idfield setting for setting the record database id (though I am not sure how useful this actually is) and added tcnfield and tcnsubfield settings which honor common tcn formats and use the preprocess code properly in case of duplicates.  It is currently being tested, but before I post any version of it, I am wondering if I am completely nuts about all of this.
>

You're not nuts, and being able to specify a tcn field (and subfield)
is a great addition!

As for the usefulness of idfield, the point there is to maintain (with
a potential offset supplied by the adjustid parameter) the identifier
that a legacy system uses to address the record, where applicable, at
migration time.  Most ILSs (Evergreen included) use an internal
identifier, because TCN is too human-supplied to be trustworthy as a
unique identifier (and forcing a user to change the TCN to make it
unique seems ... bad).  Item, hold and other records will usually use
this internal identifier to point at a bib record, and moving the old
id (space-shifted by adjustid) is much easier than trying to stitch
things back together using some other means ... impossible in some
cases, in fact.
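
A minimal Python sketch of why the shifted id helps: if bib records and the records that point at them are shifted by the same offset, the links survive the migration with no lookup step.  The function names and the "bib_id" key are hypothetical, and this assumes adjustid is a simple additive offset.

```python
def shift_id(legacy_id, adjustid=0):
    """Map a legacy system's record id into the new id space."""
    return int(legacy_id) + adjustid


def shift_item_links(items, adjustid):
    """Return copies of item records with their bib pointers shifted.

    Each item is a dict with a hypothetical "bib_id" key holding the
    legacy id of the bib record it belongs to.
    """
    return [dict(item, bib_id=shift_id(item["bib_id"], adjustid)) for item in items]


# Hypothetical sample data: one legacy bib and one item attached to it.
offset = 1000000
new_bib_id = shift_id("2501", offset)
new_items = shift_item_links([{"barcode": "X1", "bib_id": "2501"}], offset)
```

Because both sides use the same offset, the item still points at the right bib record after the load.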

In any case, please do post your new version (or you can send it to me
directly if you'd prefer) and I'll go over it as soon as I can.  Any
improvement and cleanup of marc2bre is a good thing, as it's a critical
component.

Thanks Dan!

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: miker at esilibrary.com 
 | web: http://www.esilibrary.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: marc2bre.pl
Type: application/octet-stream
Size: 12381 bytes
Desc: not available
Url : http://list.georgialibraries.org/pipermail/open-ils-dev/attachments/20080929/a987c77b/marc2bre.obj
-------------- next part --------------
Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Signed-off-by: Daniel B. Wells <dbw2 at calvin.edu>



