[OPEN-ILS-GENERAL] dewey call number normalization
Thomas Berezansky
tsbere at mvlc.org
Fri Apr 25 13:51:52 EDT 2014
I haven't run the call numbers from this thread through them, but I
have already written a different set of normalizers. They can be
found, with examples of output and sorting, in this paste:
http://pastebin.com/94U8XBzE
I wasn't sure whether a "make them the normal Evergreen normalizers"
branch was a good idea or not, though.
Thomas Berezansky
Merrimack Valley Library Consortium
Quoting Dan Wells <dbw2 at calvin.edu>:
> Okay, looking more carefully at the sortkey Ben pointed out, you
> really do have two different problems affecting your sort. Sorry
> for focusing on the smaller one!
>
> Everything I advocated earlier still stands, but in the meantime, we
> do need to fix the misplaced padding in the 'no decimal but we have
> a year' case.
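
To make the padding problem concrete: in the sort keys quoted further
down, "720 H74 1979" becomes 720_H74_197900000000000, so the zero
padding lands on the year instead of on the (empty) decimal part.
Below is a rough Python sketch of a key builder that always pads the
decimal field and also drops a period in front of the Cutter. It is
not the code in asset.label_normalizer_dewey; padded_key, PAD, and the
regular expression are illustrative assumptions, and the 15-digit
width is simply read off the existing keys.

```python
import re

PAD = 15  # assumed width, read off the zero-padded field in the quoted keys


def padded_key(label: str) -> str:
    """Hypothetical key builder: pad the Dewey decimal field, never the year."""
    m = re.match(r"\s*(\d+)(?:\.(\d+))?\s*(.*)$", label)
    if m is None:
        # Not a Dewey-looking label; fall back to a simple upper-cased key
        return label.upper()
    whole, frac, rest = m.group(1), m.group(2) or "", m.group(3)
    # Drop a period that directly precedes a token such as a Cutter (".H47")
    tokens = [t.lstrip(".") for t in rest.upper().split()]
    tokens = [t for t in tokens if t]
    # The decimal field is padded even when the label has no decimal part
    return "_".join([whole, frac.ljust(PAD, "0")] + tokens)


labels = ["720.1 .H74 1979", "720 H74 1979", "720 .H47", "720.1 H74 1979", "720 a"]
for label in sorted(labels, key=padded_key):
    print(f"{label:<16} {padded_key(label)}")
```

With keys built this way, every plain 720 label compares lower than
any 720.1 label, whether or not it carries a year or a period before
the Cutter.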
>
> Dan
>
>
> Daniel Wells
> Library Programmer/Analyst
> Hekman Library, Calvin College
> 616.526.7133
>
> -----Original Message-----
> From: open-ils-general-bounces at list.georgialibraries.org
> [mailto:open-ils-general-bounces at list.georgialibraries.org] On
> Behalf Of Dan Wells
> Sent: Friday, April 25, 2014 10:15 AM
> To: Evergreen Discussion Group
> Subject: Re: [OPEN-ILS-GENERAL] dewey call number normalization
>
> Hello Paul,
>
> You've pretty much nailed your problem as being the extra decimals
> before your Cutters. While that's normal for an LC call number,
> I've looked around and found nothing to make me believe that's
> common or accepted practice for DDC.
>
> That said, as far as I can tell, the "standard" format for Dewey is:
>
> [Dewey Decimal Number] [Whatever else you want to make it unique]
>
> In my experience, the second part is *usually* the first few letters
> of the author's last name, or a cutter-ized version of the same.
> Can anyone point to an authoritative source on how to build the
> non-DDC part of the call number? It would be a great help if we
> could at least reference something and say "this is what our
> normalizer supports."
>
> Naturally, if we can cook up a normalizer that works 100% with our
> agreed-upon form (whatever that might be), yet also make it flexible
> enough to accommodate variances, we absolutely should do that. I
> also think the code you included here is on the right track for
> being more flexible. Still, our first step must be to establish a
> canonical supported format before we consider any code to handle
> exceptions.
>
> Thanks,
> Dan
>
>
> Daniel Wells
> Library Programmer/Analyst
> Hekman Library, Calvin College
> 616.526.7133
>
> -----Original Message-----
> From: open-ils-general-bounces at list.georgialibraries.org
> [mailto:open-ils-general-bounces at list.georgialibraries.org] On
> Behalf Of Paul Hoffman
> Sent: Friday, April 25, 2014 4:45 AM
> To: Evergreen Discussion Group
> Subject: Re: [OPEN-ILS-GENERAL] dewey call number normalization
>
> On Thu, Apr 24, 2014 at 05:15:35PM -0400, Ben Shum wrote:
>> This will be a slightly more technical answer that may require some
>> direct database access to ascertain more details.
>>
>> You mention that the call numbers are identified as DDC. So that's
>> label_class of 2, I believe. We are using Dewey (DDC) for all of our
>> materials by default as well in our consortium.
>>
>> I'd be curious to know what the label_sortkey values were for those
>> call numbers you mention. That field is what actually drives the
>> sorting values for a given set.
>
> Here's what our DB shows (Adam and I work together):
>
> SELECT label_class, label, label_sortkey
> FROM asset.call_number
> WHERE label_sortkey like '720%'
> ORDER BY label_sortkey;
>
>  label_class |      label      |         label_sortkey
> -------------+-----------------+-------------------------------
>            1 | 720 H47 1979    | 720 H47 1979
>            2 | 720 a           | 720_000000000000000_A
>            2 | 720 .H47        | 720_000000000000000__H47
>            2 | 720.1 H74 1979  | 720_100000000000000_H74_1979
>            2 | 720.1 .H47 1980 | 720_100000000000000__H47_1980
>            2 | 720.1 .H74 1979 | 720_100000000000000__H74_1979
>            2 | 720 H74 1979    | 720_H74_197900000000000
>            2 | 720 .H47 1980   | 720__H47_198000000000000
>            2 | 720 .H47 1980   | 720__H47_198000000000000
>            2 | 720 .H74 1979   | 720__H74_197900000000000
> (10 rows)
>
> So the problem appears to be caused by the periods that sometimes
> occur before the Cutter number. I don't know if that's kosher or
> not, but I can see that it occurs plenty in our (Voyager) catalog.
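
For what it's worth, the order in the query output above follows
directly from a plain character-by-character comparison of the
label_sortkey values: '1' (ASCII 49) sorts before 'H' (72), which
sorts before '_' (95). A minimal Python sketch, assuming the database
compares these keys byte by byte (the helper name first_difference is
made up for illustration; the keys are copied from the output above):

```python
# (label, label_sortkey) pairs for the label_class 2 rows, copied from the
# query output above and listed here in the order one would expect on a shelf.
rows = [
    ("720 a", "720_000000000000000_A"),
    ("720 .H47", "720_000000000000000__H47"),
    ("720 .H47 1980", "720__H47_198000000000000"),
    ("720 .H47 1980", "720__H47_198000000000000"),
    ("720 H74 1979", "720_H74_197900000000000"),
    ("720 .H74 1979", "720__H74_197900000000000"),
    ("720.1 .H47 1980", "720_100000000000000__H47_1980"),
    ("720.1 H74 1979", "720_100000000000000_H74_1979"),
    ("720.1 .H74 1979", "720_100000000000000__H74_1979"),
]


def first_difference(a, b):
    """Return (index, char_a, char_b) at the first position where a and b differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    return None


# Python compares str values code point by code point, which for these
# ASCII-only keys is the same as a byte-wise sort.
by_key = sorted(rows, key=lambda row: row[1])
for label, key in by_key:
    print(f"{label:<16} {key}")

# Why "720.1 H74 1979" lands ahead of "720 H74 1979":
print(first_difference("720_100000000000000_H74_1979", "720_H74_197900000000000"))
```

The first differing position is right after "720_": a padded decimal
digit on one side, a Cutter letter or a second underscore on the
other. That is why every key with the padded decimal field sorts ahead
of the keys that lack it.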
>
> Looking at the function asset.label_normalizer_dewey it seems to me
> that it can be done much more simply and efficiently if you leverage
> the fact that space (ASCII 32) and tilde (ASCII 126) come before and
> after (respectively) anything else meaningful that might be found in
> a call number. Except periods, which complicate things. Anyhow,
> here's a first stab at it:
>
> use strict;
> use warnings;
>
> sub ddcnorm {
>     local $_ = uc shift;
>     # Strip leading or trailing space and any slashes or apostrophes
>     s/^\s+|\s+$|[\/']//g;
>     # Insert a space at digit/non-digit boundaries
>     s/(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])/ /g;
>     # Replace some punctuation with a space
>     tr/-/ /;  # XXX What else?
>     # Strip extra junk -- XXX make this work on non-ASCII call numbers
>     tr/A-Za-z0-9. //cd;
>     # A period now surrounded by spaces (the decimal point in 720.1)
>     # becomes a tilde
>     s/ \. /~/g;
>     # Any other period next to a space (e.g. before a Cutter) becomes
>     # a plain space
>     s/ \.|\. / /g;
>     # Squeeze runs of spaces down to one
>     tr/ //s;
>     return $_;
> }
>
> When I run our Deweys in the 720s through this, I get what seems to
> be the right order:
>
>            2 | 720 a           | 720 A
>            2 | 720 .H47        | 720 H 47
>            2 | 720 .H47 1980   | 720 H 47 1980
>            2 | 720 .H47 1980   | 720 H 47 1980
>            2 | 720 .H74 1979   | 720 H 74 1979
>            2 | 720 H74 1979    | 720 H 74 1979
>            2 | 720.1 .H47 1980 | 720~1 H 47 1980
>            2 | 720.1 .H74 1979 | 720~1 H 74 1979
>            2 | 720.1 H74 1979  | 720~1 H 74 1979
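
The same steps can be mirrored outside the database for
experimenting. Here is a hedged Python port of the Perl ddcnorm above
(a translation for illustration, not something posted in this thread),
followed by a sort on the normalized keys. It assumes ASCII call
numbers, as the Perl version does.

```python
import re


def ddcnorm(label: str) -> str:
    """Python port of the Perl ddcnorm sketch quoted above."""
    s = label.upper()
    # Strip leading/trailing whitespace and any slashes or apostrophes
    s = re.sub(r"^\s+|\s+$|[/']", "", s)
    # Insert a space at digit/non-digit boundaries
    s = re.sub(r"(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])", " ", s)
    # Replace hyphens with a space
    s = s.replace("-", " ")
    # Delete everything except ASCII letters, digits, periods and spaces
    s = re.sub(r"[^A-Za-z0-9. ]", "", s)
    # A period with a space on both sides (the decimal point) becomes a tilde
    s = s.replace(" . ", "~")
    # Any other period next to a space becomes a plain space
    s = re.sub(r" \.|\. ", " ", s)
    # Squeeze runs of spaces down to one
    return re.sub(r" +", " ", s)


labels = [
    "720.1 H74 1979",
    "720 .H74 1979",
    "720 a",
    "720.1 .H47 1980",
    "720 H74 1979",
    "720 .H47",
    "720.1 .H74 1979",
    "720 .H47 1980",
    "720 .H47 1980",
]

# Space (32) sorts below every letter and digit, tilde (126) above them,
# so a plain string sort puts all of 720 ahead of 720.1.
ordered = sorted(labels, key=ddcnorm)
for label in ordered:
    print(f"{label:<16} {ddcnorm(label)}")
```

Labels that differ only by a period before the Cutter normalize to the
same key, so their relative order depends on the input order.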
>
> If there's any interest, I'll run our entire set of Deweys through
> it and see if I can make sense of the results. Hmm... should
> prefixes like "j" or "C" (juvenile or Canadian) be ignored?
>
> Paul.
>
>> On Thu, Apr 24, 2014 at 3:32 PM, Adam Shire <adam at flo.org> wrote:
>>
>> > Hi Everyone,
>> >
>> > We are testing in Evergreen 2.5.2.
>> >
>> > I'm noticing what I think looks like incorrect behavior when using
>> > the call number browse feature.
>> >
>> > Doing a call number browse search for 720 results in the following
>> > call number sort order:
>> >
>> > 720 H47 1979
>> > 720 .H47
>> > 720.1 H74 1979
>> > 720.1 .H47 1980
>> > 720.1 .H74 1979
>> > 720 H74 1979
>> > 720 .H74 1979
>> >
>> >
>> > It looks like the decimal point might be throwing things off. I
>> > think that should be taken care of in a normalizer, but maybe there
>> > is a reason not to. I think the 720.1's should come at the end of
>> > this list, regardless of the decimal point before the cutter.
>> >
>> > All of the call numbers are identified as DDC.
>> >
>> > You can probably replicate this here:
>> > http://emerson.eg.flo.org/eg/opac/cnbrowse?cn=715&locg=2
>> >
>> >
>> > I didn't see any bug reports that seemed to address this specific
>> > issue, so I'm wondering if there could be something else causing
>> > this behavior.
>> >
>> > thanks,
>> > Adam
>> >
>> > --
>> >
>> > Adam Shire
>> > Member Services Librarian
>> > Fenway Libraries Online <http://flo.org>
>> > 617-442-2384
>> >
>>
>>
>>
>> --
>> Benjamin Shum
>> Evergreen Systems Manager
>> Bibliomation, Inc.
>> 24 Wooster Ave.
>> Waterbury, CT 06708
>> 203-577-4070, ext. 113
>
> --
> Paul Hoffman <paul at flo.org>
> Systems Librarian
> Fenway Libraries Online
> c/o Wentworth Institute of Technology
> 550 Huntington Ave.
> Boston, MA 02115
> (617) 442-2384 (FLO main number)
>