[OPEN-ILS-GENERAL] Programmatic Merging of Bibliographic Records
Blake Henderson
blake at mobiusconsortium.org
Tue Apr 26 14:57:20 EDT 2016
Jason,
We liked your fingerprinting idea. We expanded it a bit:
$fingerprints{alternate} = join("\t",
    $marc{item_form}, $marc{date1}, $marc{record_type},
    $marc{bib_lvl}, $marc{title}, $marc{subtitle}.$marc{subtitlep},
    $marc{author} ? $marc{author} : '',
    $marc{audioformat}, $marc{videoformat}, $marc{pubyear},
    $marc{normalizedisbns}
);
Each of these has been "normalized":
$marc{title}, $marc{subtitle}.$marc{subtitlep}, $marc{author}
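For illustration, here is a minimal sketch of how fingerprints like the one above could be used to find merge candidates: records that share a fingerprint land in the same bucket. The record ids, the sample fingerprint strings, and the names %fingerprint_for, %group and @candidates are made up for this example and are not from our actual script.

```perl
use strict;
use warnings;

# Hypothetical input: record id => fingerprint string (tab-joined fields).
my %fingerprint_for = (
    101 => "a\t2001\ta\tm\tsample title\t\tdoe jane",
    102 => "a\t2001\ta\tm\tsample title\t\tdoe jane",
    103 => "a\t1999\ta\tm\tother title\t\troe richard",
);

# Bucket record ids by fingerprint, visiting ids in numeric order.
my %group;
for my $id (sort { $a <=> $b } keys %fingerprint_for) {
    push @{ $group{ $fingerprint_for{$id} } }, $id;
}

# Only fingerprints shared by two or more records are merge candidates.
my @candidates = grep { @$_ > 1 } values %group;

print join(',', @$_), "\n" for @candidates;
```

In this sample only records 101 and 102 share a fingerprint, so they form the single candidate group.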
The ISBNs are heavily normalized. 13-digit ISBNs are stripped of the
first three characters (978) and the last character. 10-digit ISBNs
are stripped of the last character. Then the whole lot is deduped and
sorted.
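Dropping the prefix and the trailing check character lets the 10- and 13-digit forms of the same ISBN compare as equal. A rough sketch of that normalization follows. The function name normalize_isbns, the hyphen and qualifier handling, and the space used to join the result are assumptions for the example, not necessarily what our script does.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce each ISBN to its 9-character core,
# then dedupe and sort the results.
sub normalize_isbns {
    my @raw = @_;
    my %seen;
    for my $isbn (@raw) {
        # Assumption: take the leading run of digits/hyphens (with an
        # optional X check character) and ignore qualifiers like "(pbk.)".
        my ($token) = $isbn =~ /^\s*([0-9][0-9-]*[0-9Xx])/ or next;
        (my $clean = uc $token) =~ tr/-//d;

        if (length($clean) == 13) {
            # Drop the first three characters and the last character.
            $clean = substr($clean, 3, 9);
        }
        elsif (length($clean) == 10) {
            # Drop the last character.
            $clean = substr($clean, 0, 9);
        }
        else {
            next;    # Assumption: skip anything that is not 10 or 13 long.
        }
        $seen{$clean} = 1;
    }
    return join(' ', sort keys %seen);
}

print normalize_isbns('978-0-306-40615-7', '0306406152 (pbk.)'), "\n";
```

Both sample values reduce to the same core, so the deduped result contains a single entry.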
On top of the fingerprinting, we changed the way the quality scoring works.
We ended up with this scoring algorithm:
1. Count the number of subfields in the 245. Give 100 points each for a
maximum of 400 points
2. Count the number of characters in the 100. Assign 1 point for each
character for a maximum of 150 points
3. Count the number of characters in the 110. Assign 1 point for each
character for a maximum of 150 points
4. Count the number of 6XX fields. Assign 50 points to each one for a
maximum of 200 points
5. Count the number of 02X fields. Assign 50 points to each one for a
maximum of 100 points
6. Count the number of 246 fields. Assign 100 points to each one for a
maximum of 200 points
7. Count the number of 130 fields. Assign 100 points to each one for a
maximum of 100 points
8. Count the number of 010 fields. Assign 100 points to each one for a
maximum of 100 points
9. Count the number of 490 fields. Assign 100 points to each one for a
maximum of 200 points
10. Count the number of 830 fields. Assign 10 points to each one for a
maximum of 50 points
11. Count the number of characters in the 300. Assign .5 points for each
character for a maximum of 50 points
12. Count the number of 7XX fields. Assign 1 point to each one for a
maximum of 100 points
13. Count the number of subfields in the 50X. Give 2 points each for a
maximum of 100 points
14. Count the number of subfields in the 52X. Give 2 points each for a
maximum of 100 points
15. Count the number of subfields in the 51X,53X,54X,55X,56X,57X,58X.
Give .5 points each for a maximum of 500 points
Add the scores together and we have the "quality" of the MARC record. The
record with the higher quality score wins.
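The fifteen rules above can be sketched as a small table-driven scorer. This is only an illustration: the record layout (a list of hashes, each with a tag and code/value subfield pairs), the names @RULES and quality_score, and the choice to count "characters" as the combined length of the subfield values are assumptions, not our production code.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Each rule: [ tag pattern, unit counted, points per unit, maximum ].
my @RULES = (
    [ qr/^245$/,     'subfields', 100, 400 ],
    [ qr/^100$/,     'chars',     1,   150 ],
    [ qr/^110$/,     'chars',     1,   150 ],
    [ qr/^6/,        'fields',    50,  200 ],
    [ qr/^02/,       'fields',    50,  100 ],
    [ qr/^246$/,     'fields',    100, 200 ],
    [ qr/^130$/,     'fields',    100, 100 ],
    [ qr/^010$/,     'fields',    100, 100 ],
    [ qr/^490$/,     'fields',    100, 200 ],
    [ qr/^830$/,     'fields',    10,  50  ],
    [ qr/^300$/,     'chars',     0.5, 50  ],
    [ qr/^7/,        'fields',    1,   100 ],
    [ qr/^50/,       'subfields', 2,   100 ],
    [ qr/^52/,       'subfields', 2,   100 ],
    [ qr/^5[134-8]/, 'subfields', 0.5, 500 ],
);

# Hypothetical scorer: $fields is an array ref of
# { tag => '245', subfields => [ [ code => value ], ... ] }.
sub quality_score {
    my ($fields) = @_;
    my $total = 0;
    for my $rule (@RULES) {
        my ($tag_re, $unit, $points, $cap) = @$rule;
        my $count = 0;
        for my $field (grep { $_->{tag} =~ $tag_re } @$fields) {
            my @subfields = @{ $field->{subfields} };
            if    ($unit eq 'fields')    { $count += 1 }
            elsif ($unit eq 'subfields') { $count += scalar @subfields }
            else                         { $count += length($_->[1]) for @subfields }
        }
        # Apply the per-rule maximum before adding to the total.
        $total += min($count * $points, $cap);
    }
    return $total;
}

# Made-up sample record.
my @record = (
    { tag => '245', subfields => [ [ a => 'Sample title /' ], [ c => 'Jane Doe.' ] ] },
    { tag => '100', subfields => [ [ a => 'Doe, Jane.' ] ] },
    { tag => '650', subfields => [ [ a => 'Sample subject.' ] ] },
    { tag => '020', subfields => [ [ a => '0306406152' ] ] },
);

print quality_score(\@record), "\n";
```

For the sample record this gives 200 for the two 245 subfields, 10 for the ten characters in the 100, 50 for the 650 and 50 for the 020, or 310 in total. Within a group of matching records, the one with the highest score would be kept as the lead.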
This approach allowed us to dedupe almost 18% of our bibs in the catalog!
-Blake-
Conducting Magic
MOBIUS
On 4/26/2016 1:40 PM, Jason Etheridge wrote:
> For what it's worth, this is the fairly conservative algorithm used by
> the default fingerprinter in the migration-tools repository:
>
> https://docs.google.com/document/d/1tvuA0Os3W0B2Fl_GvO_Z6ZG6ZHecg8JtTRMz3QUktK8/edit?usp=sharing
>
> Comments welcome.
>