[OPEN-ILS-GENERAL] Programmatic Merging of Bibliographic Records

Rogan Hamby rhamby at esilibrary.com
Tue Apr 26 14:37:20 EDT 2016


We will have to agree to disagree.

On Tue, Apr 26, 2016 at 2:32 PM, Elaine Hardy <ehardy at georgialibraries.org>
wrote:

> For cataloging, ISBN is not a match point. For data cleanup and migration,
> it is, at the very least, a bad match point. There are too many potential
> errors with 020s to use it as a main match point. We still have mismatches
> in our catalog where a vendor used it as the main match point -- as a
> result, we have print items on audio records, audio on print, large print
> on regular print, etc. The records will have a mix of audio and print, etc.,
> since some items were on the correct record when the merge occurred.
>
> Using ISBN as your primary match point will cause an unacceptable number
> of false matches, as Blake points out as well.
>
>
>
> J. Elaine Hardy
> PINES & Collaborative Projects Manager
> Georgia Public Library Service/PINES
> 1800 Century Place, Ste. 150
> Atlanta, GA 30045
>
> 404.235.7128 Office
> 404.548.4241 Cell
> 404.235.7201 FAX
>
> On Tue, Apr 26, 2016 at 11:17 AM, Rogan Hamby <rhamby at esilibrary.com>
> wrote:
>
>> I disagree that the 020 can't be used as a match point.  I don't think it
>> should be used as the only match point.  It is possible to generate errors
>> with the method described in that code.  In my experience the benefits of
>> the high number of accurate matches outweighed the bad matches.
>>
>> CiL published an article about it, with numbers on the results, and put
>> the full text online if anyone is interested:
>>
>>
>> http://www.infotoday.com/cilmag/may12/Hamby-A-Practical-Approach-to-Collection-Deduping.shtml
>>
>> This is a very different approach from the default method that was
>> historically used in Evergreen and far less conservative.  Whatever method
>> people use, they should be aware of its pros and cons.
>>
>>
>> On Tue, Apr 26, 2016 at 10:47 AM, Elaine Hardy <
>> ehardy at georgialibraries.org> wrote:
>>
>>> Keep in mind that an ISBN (MARC field 020) is not a match point. It is a
>>> finding aid. Publishers do reuse ISBNs or use a different ISBN for what is
>>> a new printing rather than a new publication (meaning no change in
>>> information). Not only can ISBNs for all formats of a title be present on a
>>> bib record, an incorrect ISBN can be associated with a record, particularly
>>> in a local catalog. Having the same ISBN does not mean that records are
>>> matches and should be merged. Having the same ISBN and title also does not
>>> mean that records are matches and should be merged.
>>>
>>> A matching algorithm should consider matching fields such as Form, Type,
>>> Lang, main title (245|a), publisher (26x|b), date of publication (26x|c),
>>> physical description (300|a), etc. after potential duplicates are
>>> identified with standard numbers such as ISBN.
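As a rough illustration of the two-stage approach described in the quoted paragraph above (ISBN to find candidates, other fields to confirm them), here is a minimal Python sketch. The record layout, the field names in VERIFY_FIELDS, and the sample data are all assumptions made for illustration; they are not Evergreen's schema or the code linked in this thread.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical, simplified record shape: a dict of already-extracted values.
# The keys are illustrative names, not an Evergreen or MARC library API.
VERIFY_FIELDS = ("type", "form", "lang", "title", "publisher", "pub_date", "extent")


def normalize(value):
    """Lowercase and keep only letters, digits and single spaces for comparison."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in str(value).lower())
    return " ".join(cleaned.split())


def candidate_pairs(records):
    """Stage 1: pair up records that share at least one ISBN (020)."""
    by_isbn = defaultdict(set)
    for rec in records:
        for isbn in rec["isbns"]:
            by_isbn[isbn.replace("-", "").upper()].add(rec["id"])
    pairs = set()
    for ids in by_isbn.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


def confirmed_matches(records):
    """Stage 2: keep only candidate pairs whose verification fields all agree."""
    by_id = {rec["id"]: rec for rec in records}
    confirmed = []
    for left, right in candidate_pairs(records):
        a, b = by_id[left], by_id[right]
        if all(normalize(a[f]) == normalize(b[f]) for f in VERIFY_FIELDS):
            confirmed.append((left, right))
    return confirmed


# Made-up sample data: two print records and one audio record sharing an ISBN.
records = [
    {"id": 1, "isbns": ["0000000000"], "type": "a", "form": "", "lang": "eng",
     "title": "An Example Title", "publisher": "Example Press",
     "pub_date": "2010", "extent": "300 p."},
    {"id": 2, "isbns": ["0-00-000000-0"], "type": "a", "form": "", "lang": "eng",
     "title": "An example title /", "publisher": "Example Press,",
     "pub_date": "2010.", "extent": "300 p. ;"},
    {"id": 3, "isbns": ["0000000000"], "type": "i", "form": "", "lang": "eng",
     "title": "An example title", "publisher": "Example Press",
     "pub_date": "2010", "extent": "5 sound discs"},
]

print(candidate_pairs(records))
print(confirmed_matches(records))
```

With this sample data all three records are ISBN candidates, but only the two print records are confirmed; the audio record is left for a cataloger to review. Exact equality after normalization is deliberately strict, and a real run would probably need fuzzier comparison on title and publisher.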
>>>
>>> Prior to merging the records even in a test environment, if you could
>>> provide your catalogers with either a file containing a sample of the
>>> matched records or a list of the TCNs or record IDs of the matched
>>> records, they can help you refine your matching algorithm to maximize
>>> correct matches and minimize incorrect matches.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Apr 25, 2016 at 3:55 PM, Rogan Hamby <rhamby at esilibrary.com>
>>> wrote:
>>>
>>>> That is one thing to point out: when it was originally written,
>>>> electronic records were still fairly rare.  The consortium it was written
>>>> for still only uses them in very small numbers, and I set those up as
>>>> distinct bib sources that I modified the bib selection code to exclude.
>>>> Those are things to look at.
>>>>
>>>> On Mon, Apr 25, 2016 at 3:46 PM, Blake Henderson <
>>>> blake at mobiusconsortium.org> wrote:
>>>>
>>>>> Whatever method you use I heartily recommend doing so on a testing
>>>>> system and having catalogers look over the results first.
>>>>> You may have already done all the due diligence but I say it for
>>>>> anyone reading along as well.  I've never had problems with
>>>>> this method and have heard from others who had success with it
>>>>> as well, but I also heard from at least one whose data
>>>>> was apparently different enough that it was not a clean merge.  Caveat
>>>>> usor, let the user beware.
>>>>>
>>>>>
>>>>> We used this method for identifying the duplicate records. We found
>>>>> that it merged electronic resources with books. It connected other formats
>>>>> as well. We learned the hard way that we need to have better MARC records
>>>>> before we run such a tool. Tons of MARC records from LOC include all of
>>>>> the ISBNs of the related formats, for example. We subsequently wrote an
>>>>> enormous amount of code to "guess" the correct format for all of our bibs
>>>>> before deduping them. It uses phrase matching in the MARC. We presented
>>>>> this at the Evergreen conference 2015. Slides here:
>>>>> http://slides.mobiusconsortium.org/blake/evergreencatclean/#/
>>>>> If you are curious, ping me.
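The format-guessing step Blake mentions could look roughly like the following Python sketch. The phrase lists, format labels, and function names are hypothetical and are not taken from the MOBIUS code; the idea is only to show how phrase matching can keep unlike formats apart before deduping.

```python
# Hypothetical phrase lists; a real catalog would need lists tuned by catalogers.
FORMAT_PHRASES = {
    "audiobook": ("sound disc", "audiobook", "sound recording"),
    "large print": ("large print", "large type"),
    "ebook": ("electronic book", "online resource", "ebook"),
}
DEFAULT_FORMAT = "book"


def guess_format(field_texts):
    """Return the first format whose phrases appear in any of the given field texts."""
    haystack = " ".join(field_texts).lower()
    for fmt, phrases in FORMAT_PHRASES.items():
        if any(phrase in haystack for phrase in phrases):
            return fmt
    return DEFAULT_FORMAT


def split_by_format(candidates):
    """Group candidate record ids by guessed format so only like formats are compared."""
    groups = {}
    for rec_id, field_texts in candidates.items():
        groups.setdefault(guess_format(field_texts), []).append(rec_id)
    return groups


# Made-up candidates that share an ISBN: record id -> text of a few MARC fields.
candidates = {
    101: ["An example title", "300 p. ; 24 cm"],
    102: ["An example title [sound recording]", "5 sound discs"],
    103: ["An example title", "1 online resource"],
}

print(split_by_format(candidates))
```

Here the three candidates land in three different groups, so none of them would be offered for merging with each other even though they share an ISBN.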
>>>>>
>>>>> -Blake-
>>>>> Conducting Magic
>>>>> MOBIUS
>>>>>
>>>>> On 4/25/2016 2:04 PM, Rogan Hamby wrote:
>>>>>
>>>>> Hi Jim,
>>>>>
>>>>> It is available.  To be clear I helped create the de-duplication
>>>>> algorithm but the actual coding was done by Galen Charlton of Equinox.
>>>>> You can find it here:
>>>>>
>>>>>
>>>>> http://git.esilibrary.com/?p=migration-tools.git;h=300a04108fc6a3d14424c6d365329be334114f7d
>>>>>
>>>>> The full scope of the script goes a bit beyond the original question
>>>>> as it also does de-duplication before the merging.  The merging work is
>>>>> done by the merge_record_assets function that Jason referenced.
>>>>>
>>>>>
>>>>> On Mon, Apr 25, 2016 at 2:36 PM, swills beyond-print.com <
>>>>> swills at beyond-print.com> wrote:
>>>>>
>>>>>> Rogan Hamby shared his work with me.  It's a set of SQL procedures
>>>>>> that produce a 'best bib' and then identify the less interesting
>>>>>> duplicate, and it seems to work well.  I modified it so that it produces
>>>>>> the candidates but doesn't actually do the merge since we like to have
>>>>>> that personal touch up in Maine.  I'm not sure whether it is in the
>>>>>> Evergreen repos or not.
>>>>>>
>>>>>> Rogan, can you help?  Thanks again.
>>>>>>
>>>>>> Steve Wills
>>>>>>
>>>>>> On April 25, 2016 at 2:24 PM Jim Taylor <jtaylor at jtdata.com> wrote:
>>>>>>
>>>>>> I raised the question at the conference regarding the ability to
>>>>>> merge records outside the program interface and was told there was a
>>>>>> procedure/function that would allow this to be done.  Does anyone know
>>>>>> where I can find this function?   My searching has availed me naught.  I
>>>>>> found something under the Vandelay tables but am not sure it is what I
>>>>>> need, as the above-mentioned function is supposed to take two TCNs.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> --------------------------------------------------------------
>>>>> Rogan R. Hamby, Data and Project Analyst
>>>>> Equinox - Open Your Library
>>>>> rogan at esilibrary.com
>>>>> 1-877-OPEN-ILS | www.esilibrary.com
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>


-- 
--------------------------------------------------------------
Rogan R. Hamby, Data and Project Analyst
Equinox - Open Your Library
rogan at esilibrary.com
1-877-OPEN-ILS | www.esilibrary.com