[OPEN-ILS-GENERAL] dewey call number normalization

Thomas Berezansky tsbere at mvlc.org
Fri Apr 25 13:51:52 EDT 2014


I haven't pushed the call numbers in this thread through them, but I  
have written a different version of normalizers already. They, with  
examples of output and sorting, can be found in this paste:

http://pastebin.com/94U8XBzE

Wasn't sure if a "make them the normal Evergreen normalizers" branch  
was a good idea or not, though.

Thomas Berezansky
Merrimack Valley Library Consortium


Quoting Dan Wells <dbw2 at calvin.edu>:

> Okay, looking more carefully at the sortkey Ben pointed out, you  
> really do have two different problems affecting your sort.  Sorry  
> for focusing on the smaller one!
>
> Everything I advocated earlier still stands, but in the meantime, we  
> do need to fix the misplaced padding in the 'no decimal but we have  
> a year' case.
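
The misplaced padding shows up when two of the label_sortkey values Paul posted (quoted further down) are compared as plain strings. This is only a sketch: it assumes byte-wise comparison with cmp, which may differ from the collation the database uses, and the "expected" key is a guess extrapolated from the "720 a" row, not output from any Evergreen function.

```perl
use strict;
use warnings;

# Sort keys copied from the query output quoted later in this thread.
my $no_decimal   = '720_H74_197900000000000';      # label: 720 H74 1979
my $with_decimal = '720_100000000000000_H74_1979'; # label: 720.1 H74 1979

# Hypothetical key for "720 H74 1979" with the padding on the class number,
# following the pattern of the "720 a" row (720_000000000000000_A).
my $expected = '720_000000000000000_H74_1979';

# cmp returns 1 when the left-hand string sorts after the right-hand one.
my $actual_order   = $no_decimal cmp $with_decimal;
my $expected_order = $expected   cmp $with_decimal;

print "actual key sorts ",   ($actual_order   > 0 ? "after" : "before"), " 720.1\n";
print "expected key sorts ", ($expected_order > 0 ? "after" : "before"), " 720.1\n";
```

With the zeros appended to the year, "H" is compared against the "1" of the padded decimal, so the plain 720 label lands after the 720.1 labels.
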
>
> Dan
>
>
> Daniel Wells
> Library Programmer/Analyst
> Hekman Library, Calvin College
> 616.526.7133
>
> -----Original Message-----
> From: open-ils-general-bounces at list.georgialibraries.org  
> [mailto:open-ils-general-bounces at list.georgialibraries.org] On  
> Behalf Of Dan Wells
> Sent: Friday, April 25, 2014 10:15 AM
> To: Evergreen Discussion Group
> Subject: Re: [OPEN-ILS-GENERAL] dewey call number normalization
>
> Hello Paul,
>
> You've pretty much nailed your problem as being the extra decimals  
> before your Cutters.  While that's normal for an LC call number,  
> I've looked around and found nothing to make me believe that's  
> common or accepted practice for DDC.
>
> That said, as far as I can tell, the "standard" format for Dewey is:
>
> [Dewey Decimal Number] [Whatever else you want to make it unique]
>
> In my experience, the second part is *usually* the first few letters  
> of the author's last name, or a cutter-ized version of the same.   
> Can anyone point to an authoritative source on how to build the  
> non-DDC part of the call number?  It would be a great help if we  
> could at least reference something and say "this is what our  
> normalizer supports."
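
The two-part shape Dan describes could be captured with a pattern along these lines. This is a hypothetical sketch, not anything in Evergreen: the name split_ddc and the assumption that the class number is exactly three digits with an optional decimal part are illustrative choices only.

```perl
use strict;
use warnings;

# Hypothetical helper: split a label into (class number, everything else).
# Assumes the class number is three digits plus an optional ".digits" part.
sub split_ddc {
    my ($label) = @_;
    if ($label =~ /^\s*(\d{3}(?:\.\d+)?)(?!\d)\s*(.*?)\s*$/) {
        return ($1, $2);
    }
    return;    # empty list: the label does not start with a Dewey number
}

for my $label ('720 H74 1979', '720.1 .H47 1980', 'FIC SMITH') {
    my ($number, $rest) = split_ddc($label);
    if (defined $number) {
        print "$label => [$number] [$rest]\n";
    } else {
        print "$label => no leading Dewey number\n";
    }
}
```

Whatever ends up in the second bracket is the part that still needs an agreed-upon form.
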
>
> Naturally, if we can cook up a normalizer that works 100% with our  
> agreed upon form (whatever that might be), yet also make it flexible  
> enough to accommodate variances, we absolutely should do that.  I  
> also think the code you included here is on the right track for  
> being more flexible.  Still, our first step must be to establish a  
> canonical support format before we consider any code to handle  
> exceptions.
>
> Thanks,
> Dan
>
>
> Daniel Wells
> Library Programmer/Analyst
> Hekman Library, Calvin College
> 616.526.7133
>
> -----Original Message-----
> From: open-ils-general-bounces at list.georgialibraries.org  
> [mailto:open-ils-general-bounces at list.georgialibraries.org] On  
> Behalf Of Paul Hoffman
> Sent: Friday, April 25, 2014 4:45 AM
> To: Evergreen Discussion Group
> Subject: Re: [OPEN-ILS-GENERAL] dewey call number normalization
>
> On Thu, Apr 24, 2014 at 05:15:35PM -0400, Ben Shum wrote:
>> This will be a slightly more technical answer that may require some
>> direct database access to ascertain more details.
>>
>> You mention that the call numbers are identified as DDC.  So that's
>> label_class of 2, I believe.  We are using Dewey (DDC) for all of our
>> materials by default as well in our consortium.
>>
>> I'd be curious to know what the label_sortkey values were for those
>> call numbers you mention.  That field is what actually drives the
>> sorting values for a given set.
>
> Here's what our DB shows (Adam and I work together):
>
> SELECT   label_class, label, label_sortkey
> FROM     asset.call_number
> WHERE    label_sortkey like '720%'
> ORDER BY label_sortkey;
>
>  label_class |      label      |         label_sortkey
> -------------+-----------------+-------------------------------
>            1 | 720 H47 1979    | 720 H47 1979
>            2 | 720 a           | 720_000000000000000_A
>            2 | 720 .H47        | 720_000000000000000__H47
>            2 | 720.1 H74 1979  | 720_100000000000000_H74_1979
>            2 | 720.1 .H47 1980 | 720_100000000000000__H47_1980
>            2 | 720.1 .H74 1979 | 720_100000000000000__H74_1979
>            2 | 720 H74 1979    | 720_H74_197900000000000
>            2 | 720 .H47 1980   | 720__H47_198000000000000
>            2 | 720 .H47 1980   | 720__H47_198000000000000
>            2 | 720 .H74 1979   | 720__H74_197900000000000
> (10 rows)
>
> So the problem appears to be caused by the periods that sometimes  
> occur before the Cutter number.  I don't know if that's kosher or  
> not, but I can see that it occurs plenty in our (Voyager) catalog.
>
> Looking at the function asset.label_normalizer_dewey it seems to me  
> that it can be done much more simply and efficiently if you leverage  
> the fact that space (ASCII 32) and tilde (ASCII 126) come before and  
> after (respectively) anything else meaningful that might be found in  
> a call number.  Except periods, which complicate things.  Anyhow,  
> here's a first stab at it:
>
> use strict;
> use warnings;
> sub ddcnorm {
>     local $_ = uc shift;
>     # Strip leading or trailing space and any slashes or apostrophes
>     s/^\s+|\s+$|[\/']//g;
>     # Insert a space at digit/non-digit boundaries
>     s/(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])/ /g;
>     # Replace some punctuation with a space
>     tr/-/ /;  # XXX What else?
>     # Strip extra junk -- XXX make this work on non-ASCII call numbers
>     tr/A-Za-z0-9. //cd;
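>     # After the boundary step the Dewey decimal point reads " . "; turn it
>     # into a tilde, which sorts after every character kept above
>     s/ \. /~/g;
>     # Any other period (e.g. one before a Cutter) becomes a plain space
>     s/ \.|\. / /g;
>     # Squeeze runs of spaces down to one
>     tr/ //s;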
>     return $_;
> }
>
> When I run our Deweys in the 720s through this, I get what seems to  
> be the right order:
>
>            2 | 720 a           | 720 A
>            2 | 720 .H47        | 720 H 47
>            2 | 720 .H47 1980   | 720 H 47 1980
>            2 | 720 .H47 1980   | 720 H 47 1980
>            2 | 720 .H74 1979   | 720 H 74 1979
>            2 | 720 H74 1979    | 720 H 74 1979
>            2 | 720.1 .H47 1980 | 720~1 H 47 1980
>            2 | 720.1 .H74 1979 | 720~1 H 74 1979
>            2 | 720.1 H74 1979  | 720~1 H 74 1979
>
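
The ordering above can be reproduced by sorting on the normalized key. The sketch below copies ddcnorm from the message so it runs on its own. The label list is taken from the quoted query output. The tie-break on the raw label is an added assumption to make the result deterministic, and plain cmp stands in for whatever collation the database applies.

```perl
use strict;
use warnings;

# ddcnorm as posted above, reproduced so this sketch runs on its own.
sub ddcnorm {
    local $_ = uc shift;
    s/^\s+|\s+$|[\/']//g;
    s/(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])/ /g;
    tr/-/ /;
    tr/A-Za-z0-9. //cd;
    s/ \. /~/g;
    s/ \.|\. / /g;
    tr/ //s;
    return $_;
}

# DDC labels from the query output quoted above, in no particular order.
my @labels = (
    '720.1 H74 1979', '720 H74 1979', '720 .H47', '720.1 .H47 1980',
    '720 a', '720 .H74 1979', '720.1 .H74 1979', '720 .H47 1980',
);

# Compute each key once and pair it with its label.
my @keyed;
for my $label (@labels) {
    push @keyed, [ ddcnorm($label), $label ];
}

# Sort by key; equal keys fall back to comparing the raw labels.
my @sorted =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] } @keyed;

print "$_\n" for @sorted;
```

Because space sorts before the tilde, every plain 720 label comes out ahead of the 720.1 labels, whether or not a period precedes the Cutter.
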
> If there's any interest, I'll run our entire set of Deweys through  
> it and see if I can make sense of the results.  Hmm... should
> prefixes like "j" or "C" (juvenile or Canadian) be ignored?
>
> Paul.
>
>> On Thu, Apr 24, 2014 at 3:32 PM, Adam Shire <adam at flo.org> wrote:
>>
>> > Hi Everyone,
>> >
>> > We are testing in Evergreen 2.5.2.
>> >
>> > I'm noticing what I think looks like incorrect behavior when using
>> > the call number browse feature.
>> >
>> > Doing a call number browse search for 720 results in the following
>> > call number sort order:
>> >
>> > 720 H47 1979
>> > 720 .H47
>> > 720.1 H74 1979
>> > 720.1 .H47 1980
>> > 720.1 .H74 1979
>> > 720 H74 1979
>> > 720 .H74 1979
>> >
>> >
>> > It looks like the decimal point might be throwing things off. I
>> > think that should be taken care of in a normalizer, but maybe there
>> > is a reason not to. I think the 720.1's should come at the end of
>> > this list, regardless of the decimal point before the cutter.
>> >
>> > All of the call numbers are identified as DDC.
>> >
>> > You can probably replicate this here:
>> > http://emerson.eg.flo.org/eg/opac/cnbrowse?cn=715&locg=2
>> >
>> >
>> > I didn't see any bug reports that seemed to address this specific
>> > issue, so I'm wondering if there could be something else causing
>> > this behavior.
>> >
>> > thanks,
>> > Adam
>> >
>> > --
>> >
>> > Adam Shire
>> > Member Services Librarian
>> > Fenway Libraries Online <http://flo.org>
>> > 617-442-2384
>> >
>>
>>
>>
>> --
>> Benjamin Shum
>> Evergreen Systems Manager
>> Bibliomation, Inc.
>> 24 Wooster Ave.
>> Waterbury, CT 06708
>> 203-577-4070, ext. 113
>
> --
> Paul Hoffman <paul at flo.org>
> Systems Librarian
> Fenway Libraries Online
> c/o Wentworth Institute of Technology
> 550 Huntington Ave.
> Boston, MA 02115
> (617) 442-2384 (FLO main number)
>



