[OPEN-ILS-GENERAL] Programmatic Merging of Bibliographic Records

Tue Apr 26 14:32:43 EDT 2016

For cataloging, ISBN is not a match point. For data cleanup and migration,
it is, at the very least, a bad match point. There are too many potential
errors with 020s to use it as a main match point. We still have mismatches
in our catalog where a vendor used it as the main match point -- as a
result, we have print items on audio records, audio on print, large print
on regular print, etc. Some records have a mix of audio and print items,
since some items were on the correct record when the merge occurred.

Using ISBN as your primary match point will cause an unacceptable number of
false matches, as Blake points out as well.
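
As a rough illustration of the two-stage matching described in the earlier
message quoted below (use the ISBN only to find candidates, then confirm on
Form, Type, Lang, title, publisher, date, and physical description), here is
a minimal Python sketch. The record dictionaries, the field names, the
normalize helper, and the placeholder ISBN are all hypothetical; this is not
code from Evergreen or from the migration-tools script, and a real script
would pull these values out of the MARC records.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical field names standing in for Form, Type, Lang,
# 245$a, 26x$b, 26x$c, and 300$a.
CONFIRM_FIELDS = ("form", "type", "lang", "title",
                  "publisher", "pub_date", "extent")


def normalize(value):
    """Lowercase and drop punctuation and extra spaces before comparing."""
    cleaned = "".join(ch for ch in str(value).lower()
                      if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def candidate_groups(records):
    """Stage 1: group record IDs by shared ISBN. These are candidates only."""
    by_isbn = defaultdict(set)
    for rec in records:
        for isbn in rec.get("isbns", []):
            by_isbn[isbn].add(rec["id"])
    return {isbn: ids for isbn, ids in by_isbn.items() if len(ids) > 1}


def confirmed_pairs(records):
    """Stage 2: keep candidate pairs that agree on every confirming field."""
    by_id = {rec["id"]: rec for rec in records}
    pairs = set()
    for ids in candidate_groups(records).values():
        for a, b in combinations(sorted(ids), 2):
            if all(normalize(by_id[a].get(f, "")) == normalize(by_id[b].get(f, ""))
                   for f in CONFIRM_FIELDS):
                pairs.add((a, b))
    return sorted(pairs)


# Made-up sample data: four records sharing one placeholder ISBN.
records = [
    {"id": 1, "isbns": ["9780000000001"], "form": "", "type": "a",
     "lang": "eng", "title": "An Example Title /",
     "publisher": "Example Press,", "pub_date": "2010.", "extent": "300 p."},
    {"id": 2, "isbns": ["9780000000001"], "form": "", "type": "a",
     "lang": "eng", "title": "An example title",
     "publisher": "Example Press", "pub_date": "2010", "extent": "300 p"},
    # Large print edition carrying the same ISBN.
    {"id": 3, "isbns": ["9780000000001"], "form": "d", "type": "a",
     "lang": "eng", "title": "An example title",
     "publisher": "Example Press", "pub_date": "2010", "extent": "450 p."},
    # Audiobook carrying the same ISBN.
    {"id": 4, "isbns": ["9780000000001"], "form": "", "type": "i",
     "lang": "eng", "title": "An example title",
     "publisher": "Example Press", "pub_date": "2010",
     "extent": "8 sound discs"},
]

print(confirmed_pairs(records))
```

With these made-up records, all four share an ISBN, but only records 1 and 2
come back as a confirmed pair. The large print and audio records stay out of
the merge list and can go to a cataloger for review. Exact equality after
normalization is a deliberately strict choice for the sketch; a real matcher
would likely need fuzzier comparison on title and publisher.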

J. Elaine Hardy
PINES & Collaborative Projects Manager
Georgia Public Library Service/PINES
1800 Century Place, Ste. 150
Atlanta, GA 30045

404.235.7128 Office
404.548.4241 Cell
404.235.7201 FAX

On Tue, Apr 26, 2016 at 11:17 AM, Rogan Hamby <rhamby at esilibrary.com> wrote:

> I disagree that the 020 can't be used as a match point.  I don't think it
> should be used as the only match point.  It is possible to generate errors
> with the method described in that code.  In my experience the benefits of
> the high number of accurate matches outweighed the bad matches.
>
> CiL published an article about it with numbers of the results if anyone is
> interested and put the full text online:
>
>
> http://www.infotoday.com/cilmag/may12/Hamby-A-Practical-Approach-to-Collection-Deduping.shtml
>
> This is a very different approach from the default method that was
> historically used in Evergreen, and far less conservative.  People should
> be aware of the pros and cons of any method they use.
>
>
> On Tue, Apr 26, 2016 at 10:47 AM, Elaine Hardy <
> ehardy at georgialibraries.org> wrote:
>
>> Keep in mind that an ISBN (MARC field 020) is not a match point. It is a
>> finding aid. Publishers do reuse ISBNs or use a different ISBN for what is
>> a new printing rather than a new publication (meaning no change in
>> information). Not only can ISBNs for all formats of a title be present on a
>> bib record, an incorrect ISBN can be associated with a record, particularly
>> in a local catalog. Having the same ISBN does not mean that records are
>> matches and should be merged. Having the same ISBN and title also does not
>> mean that records are matches and should be merged.
>>
>> A matching algorithm should consider matching fields such as Form, Type,
>> Lang, main title (245|a), publisher (26x|b), date of publication (26x|c),
>> physical description (300|a), etc. after potential duplicates are
>> identified with standard numbers such as ISBN.
>>
>> Prior to merging the records even in a test environment, if you could
>> provide your catalogers with either a file containing a sample of the
>> matched records  or a list of the TCNs or record IDs of the matched
>> records, they can help you refine your matching algorithm to maximize
>> correct matches and minimize incorrect matches.
>>
>> On Mon, Apr 25, 2016 at 3:55 PM, Rogan Hamby <rhamby at esilibrary.com>
>> wrote:
>>
>>> That is one thing to point out: when it was originally written,
>>> electronic records were still fairly rare.  The consortium it was written
>>> for still only uses them in very small numbers, and I set those up as
>>> distinct bib sources that I modified the bib selection code to exclude.
>>> Those are things to look at.
>>>
>>> On Mon, Apr 25, 2016 at 3:46 PM, Blake Henderson <
>>> blake at mobiusconsortium.org> wrote:
>>>
>>>> Whatever method you use, I heartily recommend doing so on a testing
>>>> system and having catalogers look over the results first.
>>>> You may have already done all the due diligence, but I say it for anyone
>>>> reading along as well.  I've never had problems with this method and
>>>> have heard back from others who had success with it as well, but I also
>>>> heard from at least one whose data was apparently different enough that
>>>> it was not a clean merge.  Caveat usor, let the user beware.
>>>>
>>>>
>>>> We used this method for identifying the duplicate records. We found
>>>> that it merged electronic resources with books. It connected other formats
>>>> as well. We learned the hard way that we need to have better MARC records
>>>> before we run such a tool. Tons of MARC records from LOC, for example,
>>>> include the ISBNs of all the related formats. We subsequently wrote an
>>>> enormous amount of code to "guess" the correct format for all of our bibs
>>>> before deduping them. It uses phrase matching in the MARC. We presented
>>>> this at the Evergreen conference 2015. Slides here:
>>>> http://slides.mobiusconsortium.org/blake/evergreencatclean/#/
>>>> If you are curious, ping me.
>>>>
>>>> -Blake-
>>>> Conducting Magic
>>>> MOBIUS
>>>>
>>>> On 4/25/2016 2:04 PM, Rogan Hamby wrote:
>>>>
>>>> Hi Jim,
>>>>
>>>> It is available.  To be clear I helped create the de-duplication
>>>> algorithm but the actual coding was done by Galen Charlton of  Equinox.
>>>> You can find it here:
>>>>
>>>>
>>>> http://git.esilibrary.com/?p=migration-tools.git;h=300a04108fc6a3d14424c6d365329be334114f7d
>>>>
>>>> The full scope of the script goes a bit beyond the original question as
>>>> it also does de-duplication before the merging.  The merging work is done
>>>> by the merge_record_assets function that Jason referenced.
>>>>
>>>>
>>>> On Mon, Apr 25, 2016 at 2:36 PM, swills beyond-print.com <
>>>> swills at beyond-print.com> wrote:
>>>>
>>>>> Rogan Hamby shared his work with me.  It's a set of SQL procedures
>>>>> that produce a 'best bib' and then identify the less interesting
>>>>> duplicates, and it seems to work well.  I modified it so that it
>>>>> produces the candidates but doesn't actually do the merge, since we
>>>>> like to have that personal touch up in Maine.  I'm not sure whether it
>>>>> is in the Evergreen repos.
>>>>>
>>>>> Rogan, can you help? Thanks again.
>>>>>
>>>>> Steve Wills
>>>>>
>>>>> On April 25, 2016 at 2:24 PM Jim Taylor <jtaylor at jtdata.com> wrote:
>>>>>
>>>>> I raised the question at the conference regarding the ability to merge
>>>>> records outside the program interface and was told there was a
>>>>> procedure/function that would allow this to be done.  Does anyone know
>>>>> where I can find this function?   My searching has availed me naught.  I
>>>>> found something under the Vandelay tables, but I am not sure it is what
>>>>> I need, as the above-mentioned function is supposed to take two TCNs.
>>>>>
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> --------------------------------------------------------------
>>>> Rogan R. Hamby, Data and Project Analyst
>>>> Equinox - Open Your Library
>>>> rogan at esilibrary.com
>>>> 1-877-OPEN-ILS | www.esilibrary.com
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>