[OPEN-ILS-GENERAL] Programmatic Merging of Bibliographic Records

Blake Henderson blake at mobiusconsortium.org
Tue Apr 26 14:57:20 EDT 2016


Jason,

We liked your fingerprinting idea. We expanded it a bit:

  $fingerprints{alternate} = join("\t",
       $marc{item_form}, $marc{date1}, $marc{record_type},
       $marc{bib_lvl}, $marc{title}, $marc{subtitle}.$marc{subtitlep},
       $marc{author} ? $marc{author} : '',
       $marc{audioformat}, $marc{videoformat}, $marc{pubyear},
       $marc{normalizedisbns}
       );

Each of these has been "normalized":
$marc{title}, $marc{subtitle}.$marc{subtitlep}, $marc{author}

The ISBNs are heavily normalized. 13-digit ISBNs are stripped of the
first three characters (978) and the last character (the check digit).
10-digit ISBNs are stripped of the last character (also the check
digit). Then the whole lot is deduped and sorted.
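
A minimal sketch of the ISBN normalization described above. The helper
name normalize_isbns, the hyphen stripping, the skipping of malformed
values, and the comma used to join the result are all assumptions made
for illustration; none of them come from the actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce raw ISBN strings to deduped, sorted
# 9-digit cores so that the ISBN-10 and ISBN-13 forms of the same
# number compare equal.
sub normalize_isbns {
    my @raw = @_;
    my %seen;
    for my $isbn (@raw) {
        # Assumption: hyphens are removed before the length is examined.
        (my $clean = $isbn) =~ s/-//g;
        my $core;
        if ($clean =~ /^\s*(\d{13})\b/) {
            # 13-digit: drop the first three characters and the last one.
            $core = substr($1, 3, 9);
        }
        elsif ($clean =~ /^\s*(\d{9})[\dXx]\b/) {
            # 10-digit: drop the last character.
            $core = $1;
        }
        else {
            next;    # Assumption: anything else is ignored.
        }
        $seen{$core} = 1;    # Hash keys dedupe the cores.
    }
    # Assumption: a comma joins the sorted cores.
    return join(",", sort keys %seen);
}

print normalize_isbns("978-0-306-40615-7", "0306406152 (pbk.)",
                      "0-14-044913-2"), "\n";
# prints 014044913,030640615
```

The first two inputs are the 13-digit and 10-digit forms of one ISBN, so
they collapse to a single core.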

On top of the fingerprinting, we changed the way the quality scoring works.
We ended up with this scoring algorithm:

1. Count the number of subfields in the 245. Give 100 points each for a 
maximum of 400 points
2. Count the number of characters in the 100. Assign 1 point for each 
character for a maximum of 150 points
3. Count the number of characters in the 110. Assign 1 point for each 
character for a maximum of 150 points
4. Count the number of 6XX fields. Assign 50 points to each one for a 
maximum of 200 points
5. Count the number of 02X fields. Assign 50 points to each one for a 
maximum of 100 points
6. Count the number of 246 fields. Assign 100 points to each one for a 
maximum of 200 points
7. Count the number of 130 fields. Assign 100 points to each one for a 
maximum of 100 points
8. Count the number of 010 fields. Assign 100 points to each one for a 
maximum of 100 points
9. Count the number of 490 fields. Assign 100 points to each one for a 
maximum of 200 points
10. Count the number of 830 fields. Assign 10 points to each one for a 
maximum of 50 points
11. Count the number of characters in the 300. Assign 0.5 points for each
character for a maximum of 50 points
12. Count the number of 7XX fields. Assign 1 point to each one for a
maximum of 100 points
13. Count the number of subfields in the 50X. Give 2 points each for a 
maximum of 100 points
14. Count the number of subfields in the 52X. Give 2 points each for a 
maximum of 100 points
15. Count the number of subfields in the 51X, 53X, 54X, 55X, 56X, 57X, 58X.
Give 0.5 points each for a maximum of 500 points

Add the scores together and we have the "quality" of the MARC record.
Within a group of matching records, the one with the higher quality wins.
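
The fifteen rules above can be sketched as a table-driven scorer. This
is an illustration only: the record layout (an array of hashes with
tag and subfields keys), the names @rules and quality_score, and the
reading of "characters" as the combined length of a field's subfield
values are assumptions, not the actual implementation.

```perl
use strict;
use warnings;
use List::Util qw(min);

# One row per rule: [ tag pattern, unit counted, points per unit, cap ]
my @rules = (
    [ qr/^245$/,            'subfields', 100, 400 ],
    [ qr/^100$/,            'chars',       1, 150 ],
    [ qr/^110$/,            'chars',       1, 150 ],
    [ qr/^6\d\d$/,          'fields',     50, 200 ],
    [ qr/^02\d$/,           'fields',     50, 100 ],
    [ qr/^246$/,            'fields',    100, 200 ],
    [ qr/^130$/,            'fields',    100, 100 ],
    [ qr/^010$/,            'fields',    100, 100 ],
    [ qr/^490$/,            'fields',    100, 200 ],
    [ qr/^830$/,            'fields',     10,  50 ],
    [ qr/^300$/,            'chars',     0.5,  50 ],
    [ qr/^7\d\d$/,          'fields',      1, 100 ],
    [ qr/^50\d$/,           'subfields',   2, 100 ],
    [ qr/^52\d$/,           'subfields',   2, 100 ],
    [ qr/^5[1345678]\d$/,   'subfields', 0.5, 500 ],
);

# Hypothetical record layout:
#   [ { tag => '245', subfields => [ [ 'a', 'Title' ], ... ] }, ... ]
sub quality_score {
    my ($record) = @_;
    my $total = 0;
    for my $rule (@rules) {
        my ($tag_re, $unit, $points, $cap) = @$rule;
        my $count = 0;
        for my $field (grep { $_->{tag} =~ $tag_re } @$record) {
            my @subs = @{ $field->{subfields} };
            if    ($unit eq 'fields')    { $count += 1 }
            elsif ($unit eq 'subfields') { $count += scalar @subs }
            else  { $count += length($_->[1]) for @subs }
        }
        # Each rule contributes at most its cap.
        $total += min($count * $points, $cap);
    }
    return $total;
}

my $full = [
    { tag => '245', subfields => [ ['a', 'Emma'], ['b', 'a novel'], ['c', 'by Jane Austen'] ] },
    { tag => '100', subfields => [ ['a', 'Austen, Jane'] ] },
    { tag => '650', subfields => [ ['a', 'Young women'] ] },
    { tag => '650', subfields => [ ['a', 'England'] ] },
    { tag => '020', subfields => [ ['a', '0140449132'] ] },
    { tag => '300', subfields => [ ['a', '300 p.'] ] },
];
my $sparse = [
    { tag => '245', subfields => [ ['a', 'Emma'] ] },
    { tag => '020', subfields => [ ['a', '0140449132'] ] },
];

# The highest-scoring record in a group of matches becomes the lead.
my ($lead) = sort { quality_score($b) <=> quality_score($a) } ($sparse, $full);
printf "%s %s\n", quality_score($full), quality_score($sparse);
# prints 465 150
```

Keeping the rules in a table makes the point values and caps easy to
tune without touching the scoring loop.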

This approach allowed us to dedupe almost 18% of our bibs in the catalog!


-Blake-
Conducting Magic
MOBIUS

On 4/26/2016 1:40 PM, Jason Etheridge wrote:
> For what it's worth, this is the fairly conservative algorithm used by
> the default fingerprinter in the migration-tools repository:
>
> https://docs.google.com/document/d/1tvuA0Os3W0B2Fl_GvO_Z6ZG6ZHecg8JtTRMz3QUktK8/edit?usp=sharing
>
> Comments welcome.
>


