[OPEN-ILS-DEV] Bugs in marc2bre.pl?

Mike Rylander mrylander at gmail.com
Wed Oct 1 08:43:37 EDT 2008


On Mon, Sep 29, 2008 at 4:05 PM, Dan Wells <dbw2 at calvin.edu> wrote:
> Hello Mike,
>
> Well, a mere three months and two days later, here is the promised revision to marc2bre.pl.  I have spent the last several days poring over it to weed out any major bugs, but since so much of it has been rewritten, I am going to assume some bugs exist, but I am also going to promise that any bugs should now be much easier to fix due to the improvements, which are mostly as follows ;)
>
> 1) In addition to or instead of specifying an "ID" field, one may now specify a TCN field.  Our utter reliance on TCNs in my library may be out of the ordinary, but maybe not, so I am hoping this option will prove useful to others.
> 2) Some ambiguously named options have been deprecated and replaced with better ones.  Both the old and the new names are still supported.  I considered trying to standardize the use of underscores in option names, but didn't want to overstep on that.  The new 'tcn*' options are patterned after the 'id*' options (no underscores), but a few other old and new options both did and still do have underscores where readability is otherwise (subjectively) difficult.
> 3) Because of this new emphasis on preserving TCNs, any code which assumed the ID and TCN to be related no longer does so.  They can of course still be the same if desired.  Many variables have been renamed to make this distinction much more explicit.
> 4) A recently added 'use901' flag has been expanded to now skip all ID/TCN processing entirely and simply use the values in the 901.  I am unsure if that was the intention, but it sounded good to me, and I believe many other desired effects can be achieved by now using a combination of idfield and tcnfield values.
> 5) Rather than defaulting to 'System' for TCN source, 'System' is reserved for TCNs which are set to match the corresponding internal record IDs and 'Unknown' is used for all others.  Also, 'Sirsi_Auto' was added for identifying imported Sirsi auto-generated TCNs (e.g. a1234567).
> 6) The code is now much more thoroughly commented, including basic explanations of all the options.
>
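The TCN source rule in item 5 can be sketched roughly as follows. This is an illustrative Python sketch, not the Perl code in marc2bre.pl: the function name tcn_source and the Sirsi pattern ("a" followed by digits, inferred only from the single a1234567 example) are assumptions.

```python
import re

# Assumption: a Sirsi auto-generated TCN is "a" followed only by digits.
# This is inferred from the one example (a1234567) and may be too narrow.
SIRSI_AUTO_RE = re.compile(r"a\d+")


def tcn_source(tcn, record_id):
    """Return a TCN source label for a record (hypothetical helper)."""
    # 'System' is reserved for TCNs that match the internal record ID.
    if tcn == str(record_id):
        return "System"
    # Imported Sirsi auto-generated TCNs get their own label.
    if SIRSI_AUTO_RE.fullmatch(tcn):
        return "Sirsi_Auto"
    # Everything else is 'Unknown'.
    return "Unknown"
```

Checking the 'System' case first means a TCN that was deliberately set to the record ID is never mislabeled, whatever its shape.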
> There is probably more that I am forgetting about.  Since I started this rewrite a while ago, I also tried very hard to port in any changes made in trunk since I began.
>
> It is with all humility and great respect that I offer this revision, and I do hope you accept it, as otherwise I will have wasted several days of my life.  Well, I suppose it certainly wouldn't be the first time.
>

*sniff,sniff*  It's just so ... beautiful!

Dan, this is really awesome.  I'm sorry for the delay in reviewing the
patch, but we've been pushing towards the first 1.4 RC.  I'm glad I
got to this before we cut the RC, though, because this will certainly
lower the barrier to entry for a lot of folks -- both getting data
into the system and generally understanding (if they "use the source,
Luke") what's going on with the MARC.

Thanks for taking this on, Dan.  It's committed with minor formatting changes.

--miker

> Thanks,
> DW
>
>
>
>>>> "Mike Rylander" <mrylander at gmail.com> 6/27/2008 4:44 PM >>>
> On Fri, Jun 27, 2008 at 4:12 PM, Dan Wells <dbw2 at calvin.edu> wrote:
>> Hello all,
>>
>> I have been playing around with record loads of various shapes and sizes over the last week or two, and have come to the conclusion that marc2bre.pl is a bit discombobulated in its current state.  It boils down to a consistent state of confusion within the code between the record id (i.e. database id) and the record title control number.  I believe these should generally not be the same thing, and I would say about half of the code agrees with me :)  In particular, the idfield argument seems to be the source of most of my problems.  I believe it was originally meant to be a way to specify tcns, not database record ids, but since tcns are often alphanumeric, the regular expression which strips out any non-digits flies in the face of this.  The end result is that there is a bunch of code, particularly in the preprocess subroutine, that is supposed to check and intelligently set the tcn but which never gets run under normal circumstances (short of an odd dontuse_file setting).  From what I can tell, there is therefore no good way to get a file out the other end with sane tcns (unless yours happen to be all digits).
>>
>
> I'm not at a computer where I can look now, and I'm not sure which
> branch you're looking at, but here's what is intended:
>
> idfield is, in fact, meant to specify the field (subfield a) from
> which to extract the database id.  More on that later.
>
> If there is no available tcn value (as defined in preprocess()) then
> the record id will be used.  There could very well have been
> short-circuit logic introduced into the trunk of svn that causes the
> idfield value to be used, but that is not the intention.  I'll look at
> it when I get back to my computer.
>
> The purpose of the dontuse parameter (which could certainly use a
> better name) is to inform marc2bre of existing TCN values already in
> use in the database, for instance when you are loading new records
> into an existing implementation.  That lets it look for alternate TCNs
> when there is a collision.
>
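As a rough illustration of the collision handling described above, here is a minimal Python sketch. It is not the preprocess() code itself: the function name choose_tcn, the idea of an ordered candidate list, and the use of a set for the dontuse values are all assumptions. Only the fallback to the record id comes from the explanation above.

```python
def choose_tcn(candidates, in_use, record_id):
    """Pick the first candidate TCN that is not already taken (hypothetical).

    candidates: TCN strings found in the record, in order of preference
                (how candidates are gathered is an assumption).
    in_use:     set of TCNs already present in the database, i.e. the
                values a dontuse file would supply.
    """
    for tcn in candidates:
        if tcn and tcn not in in_use:
            in_use.add(tcn)  # reserve it so later records cannot reuse it
            return tcn
    # No usable TCN is available, so fall back to the record id.
    # This sketch does not handle a collision on the fallback value.
    fallback = str(record_id)
    in_use.add(fallback)
    return fallback
```

Reserving each chosen value in the same set means duplicates inside the load file are caught the same way as collisions with existing database records.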
>> I have created a new version which hopefully untangles most of this.  I left in the idfield setting for setting the record database id (though I am not sure how useful this actually is) and added tcnfield and tcnsubfield settings which honor common tcn formats and use the preprocess code properly in case of duplicates.  It is currently being tested, but before I post any version of it, I am wondering if I am completely nuts about all of this.
>>
>
> You're not nuts, and being able to specify a tcn field (and subfield)
> is a great addition!
>
> As for the usefulness of idfield, the point there is to maintain (with
> a potential offset supplied by the adjustid parameter) the identifier
> that a legacy system uses to address the record, where applicable, at
> migration time.  Most ILSs (Evergreen included) use an internal
> identifier, because TCN is too human-supplied to be trustworthy as a
> unique identifier (and forcing a user to change the TCN to make it
> unique seems ... bad).  Item, hold and other records will usually use
> this internal identifier to point at a bib record, and moving the old
> id (space-shifted by adjustid) is much easier than trying to stitch
> things back together using some other means ... impossible in some
> cases, in fact.
>
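The idfield and adjustid behaviour discussed in this thread (strip non-digits from the legacy identifier, then shift it by an offset) might look roughly like this. It is a Python sketch under assumptions, not the actual Perl in marc2bre.pl: the function name and the choice to return None when no digits remain are hypothetical.

```python
import re


def record_id_from_field(raw_value, adjustid=0):
    """Derive a database record id from a legacy identifier (hypothetical)."""
    # Strip every non-digit character, as the thread says idfield does.
    digits = re.sub(r"\D", "", raw_value)
    if not digits:
        # Assumption: with no digits left, there is no usable legacy id.
        return None
    # Shift the legacy id by adjustid so it lands in a free id range.
    return int(digits) + adjustid
```

For example, record_id_from_field("1234", adjustid=1000000) gives 1001234, so item and hold records that pointed at legacy id 1234 can be re-linked by applying the same offset.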
> In any case, please do post your new version (or you can send it to me
> directly if you'd prefer) and I'll go over it as soon as I can.  Any
> improvement and cleanup of marc2bre is a good thing, as it's a critical
> component.
>
> Thanks Dan!
>
> --
> Mike Rylander
>  | VP, Research and Design
>  | Equinox Software, Inc. / The Evergreen Experts
>  | phone: 1-877-OPEN-ILS (673-6457)
>  | email: miker at esilibrary.com
>  | web: http://www.esilibrary.com
>



-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com

