[OPEN-ILS-GENERAL] dewey call number normalization

Dan Wells dbw2 at calvin.edu
Fri Apr 25 13:35:32 EDT 2014


Okay, looking more carefully at the sortkey Ben pointed out, you really do have two different problems affecting your sort.  Sorry for focusing on the smaller one!

Everything I advocated earlier still stands, but in the meantime, we do need to fix the misplaced padding in the 'no decimal but we have a year' case.
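
To make the padding problem concrete, here is a minimal Perl sketch using three of the sort keys from the query output quoted further down. The "assumed fix" key for '720 H74 1979' is my guess at what a corrected key would look like, modeled on how '720 a' is already keyed. It is not output from asset.label_normalizer_dewey, and the helper name labels_by_key is made up for this example.

```perl
use strict;
use warnings;

# Sort keys as they appear in the query output quoted later in this thread.
# The zero runs are built with "x" so the lengths are easy to check.
my %stored = (
    '720 a'          => '720_' . ('0' x 15) . '_A',
    '720.1 H74 1979' => '720_1' . ('0' x 14) . '_H74_1979',
    '720 H74 1979'   => '720_H74_1979' . ('0' x 11),   # padding landed on the year
);

# ASSUMPTION: a corrected key would pad an empty decimal part instead,
# the same way '720 a' is keyed today.
my %assumed_fix = (
    %stored,
    '720 H74 1979' => '720_' . ('0' x 15) . '_H74_1979',
);

# Hypothetical helper: return the labels ordered by their sort keys.
sub labels_by_key {
    my ($keys) = @_;
    my @ordered = sort { $keys->{$a} cmp $keys->{$b} } keys %$keys;
    return @ordered;
}

print "stored keys: ", join(', ', labels_by_key(\%stored)), "\n";
print "assumed fix: ", join(', ', labels_by_key(\%assumed_fix)), "\n";
```

With the stored keys, '720 H74 1979' lands after '720.1 H74 1979' because 'H' compares greater than '1' right after the first underscore. With the assumed key it falls back in between '720 a' and the 720.1 entry.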

Dan


Daniel Wells
Library Programmer/Analyst
Hekman Library, Calvin College
616.526.7133

-----Original Message-----
From: open-ils-general-bounces at list.georgialibraries.org [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Dan Wells
Sent: Friday, April 25, 2014 10:15 AM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] dewey call number normalization

Hello Paul,

You've pretty much nailed your problem as being the extra decimals before your Cutters.  While that's normal for an LC call number, I've looked around and found nothing to make me believe that's common or accepted practice for DDC.

That said, as far as I can tell, the "standard" format for Dewey is:

[Dewey Decimal Number] [Whatever else you want to make it unique]

In my experience, the second part is *usually* the first few letters of the author's last name, or a cutter-ized version of the same.  Can anyone point to an authoritative source on how to build the non-DDC part of the call number?  It would be a great help if we could at least reference something and say "this is what our normalizer supports."
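
As a rough illustration of that two-part shape, here is a small Perl sketch that splits a label into a leading Dewey number and "everything else". The helper name split_dewey_label and the assumption that the number is three digits with an optional decimal extension are mine, not anything Evergreen defines, and the sketch deliberately does not try to interpret the second part.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Evergreen): split a call number label
# into its leading Dewey number and whatever follows it.
sub split_dewey_label {
    my ($label) = @_;
    # ASSUMPTION: three digits, an optional ".digits" extension, then the rest.
    if ($label =~ /^\s*(\d{3}(?:\.\d+)?)\s*(.*?)\s*$/) {
        return ($1, $2);
    }
    return;    # empty list: the label does not start with a Dewey number
}

for my $label ('720.1 H74 1979', '720 .H47', 'j 720 a') {
    my ($ddc, $rest) = split_dewey_label($label);
    if (defined $ddc) {
        print "$label => [$ddc] [$rest]\n";
    }
    else {
        print "$label => no leading Dewey number\n";
    }
}
```

A label with a leading prefix such as 'j 720 a' falls through to the "no leading Dewey number" branch here, which is one of the variances a canonical format would have to rule in or out.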

Naturally, if we can cook up a normalizer that works 100% with our agreed-upon form (whatever that might be), yet also make it flexible enough to accommodate variances, we absolutely should do that.  I also think the code you included here is on the right track for being more flexible.  Still, our first step must be to establish a canonical supported format before we consider any code to handle exceptions.

Thanks,
Dan


Daniel Wells
Library Programmer/Analyst
Hekman Library, Calvin College
616.526.7133

-----Original Message-----
From: open-ils-general-bounces at list.georgialibraries.org [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Paul Hoffman
Sent: Friday, April 25, 2014 4:45 AM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] dewey call number normalization

On Thu, Apr 24, 2014 at 05:15:35PM -0400, Ben Shum wrote:
> This will be a slightly more technical answer that may require some 
> direct database access to ascertain more details.
> 
> You mention that the call numbers are identified as DDC.  So that's 
> label_class of 2, I believe.  We are using Dewey (DDC) for all of our 
> materials by default as well in our consortium.
> 
> I'd be curious to know what the label_sortkey values were for those 
> call numbers you mention.  That field is what actually drives the 
> sorting values for a given set.

Here's what our DB shows (Adam and I work together):

SELECT   label_class, label, label_sortkey
FROM     asset.call_number
WHERE    label_sortkey like '720%'
ORDER BY label_sortkey;

 label_class |      label      |         label_sortkey         
-------------+-----------------+-------------------------------
           1 | 720 H47 1979    | 720 H47 1979
           2 | 720 a           | 720_000000000000000_A
           2 | 720 .H47        | 720_000000000000000__H47
           2 | 720.1 H74 1979  | 720_100000000000000_H74_1979
           2 | 720.1 .H47 1980 | 720_100000000000000__H47_1980
           2 | 720.1 .H74 1979 | 720_100000000000000__H74_1979
           2 | 720 H74 1979    | 720_H74_197900000000000
           2 | 720 .H47 1980   | 720__H47_198000000000000
           2 | 720 .H47 1980   | 720__H47_198000000000000
           2 | 720 .H74 1979   | 720__H74_197900000000000
(10 rows)

So the problem appears to be caused by the periods that sometimes occur before the Cutter number.  I don't know if that's kosher or not, but I can see that it occurs plenty in our (Voyager) catalog.

Looking at the function asset.label_normalizer_dewey it seems to me that it can be done much more simply and efficiently if you leverage the fact that space (ASCII 32) and tilde (ASCII 126) come before and after (respectively) anything else meaningful that might be found in a call number.  Except periods, which complicate things.  Anyhow, here's a first stab at it:

use strict;
use warnings;
sub ddcnorm {
    local $_ = uc shift;
    # Strip leading or trailing space and any slashes or apostrophes
    s/^\s+|\s+$|[\/']//g;
    # Insert a space at digit/non-digit boundaries
    s/(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])/ /g;
    # Replace some punctuation with a space
    tr/-/ /;  # XXX What else?
    # Strip extra junk -- XXX make this work on non-ASCII call numbers
    tr/A-Za-z0-9. //cd;
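    # A period with a space on both sides is a decimal point between
    # digits (e.g. "720 . 1" after the boundary step); mark it with a tilde
    s/ \. /~/g;
    # Any other period next to a space (e.g. before a Cutter) becomes a space
    s/ \.|\. / /g;
    # Squeeze runs of spaces down to one
    tr/ //s;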
    return $_;
}

When I run our Deweys in the 720s through this, I get what seems to be the right order:

           2 | 720 a           | 720 A
           2 | 720 .H47        | 720 H 47
           2 | 720 .H47 1980   | 720 H 47 1980
           2 | 720 .H47 1980   | 720 H 47 1980
           2 | 720 .H74 1979   | 720 H 74 1979
           2 | 720 H74 1979    | 720 H 74 1979
           2 | 720.1 .H47 1980 | 720~1 H 47 1980
           2 | 720.1 .H74 1979 | 720~1 H 74 1979
           2 | 720.1 H74 1979  | 720~1 H 74 1979
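
For anyone who wants to reproduce that listing without database access, here is a self-contained sketch. It repeats ddcnorm from above unchanged so the script runs on its own, and the small driver around it (the @labels sample and the sort) is only an illustration, not how Evergreen would call a normalizer.

```perl
use strict;
use warnings;

# Copy of ddcnorm from above, repeated so this script is self-contained.
sub ddcnorm {
    local $_ = uc shift;
    s/^\s+|\s+$|[\/']//g;
    s/(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])/ /g;
    tr/-/ /;
    tr/A-Za-z0-9. //cd;
    s/ \. /~/g;
    s/ \.|\. / /g;
    tr/ //s;
    return $_;
}

# A few of the labels from the 720s, deliberately out of order.
my @labels = (
    '720.1 H74 1979',
    '720 .H74 1979',
    '720 a',
    '720.1 .H47 1980',
    '720 .H47',
);

# Sort the labels by their normalized form using plain string comparison.
my %key_for = map { ($_ => ddcnorm($_)) } @labels;
my @sorted  = sort { $key_for{$a} cmp $key_for{$b} } @labels;

for my $label (@sorted) {
    printf "%-16s| %s\n", $label, $key_for{$label};
}
```

The ordering depends only on byte-wise string comparison: a space (ASCII 32) sorts before every digit and letter, and a tilde (ASCII 126) sorts after them, so every "720 ..." key comes before every "720~1 ..." key.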

If there's any interest, I'll run our entire set of Deweys through it and see if I can make sense of the results.  Hmm... should prefixes like "j" or "C" (juvenile or Canadian) be ignored?

Paul.

> On Thu, Apr 24, 2014 at 3:32 PM, Adam Shire <adam at flo.org> wrote:
> 
> > Hi Everyone,
> >
> > We are testing in Evergreen 2.5.2
> >
> > I'm noticing what I think looks like incorrect behavior when using 
> > the call number browse feature.
> >
> > Doing a call number browse search for 720 results in the following 
> > call number sort order:
> >
> > 720 H47 1979
> > 720 .H47
> > 720.1 H74 1979
> > 720.1 .H47 1980
> > 720.1 .H74 1979
> > 720 H74 1979
> > 720 .H74 1979
> >
> >
> > It looks like the decimal point might be throwing things off. I 
> > think that should be taken care of in a normalizer, but maybe there 
> > is a reason not to. I think the 720.1's should come at the end of 
> > this list, regardless of the decimal point before the cutter.
> >
> > All of the call numbers are identified as DDC.
> >
> > you can probably replicate this here
> > http://emerson.eg.flo.org/eg/opac/cnbrowse?cn=715&locg=2
> >
> >
> > I didn't see any bug reports that seemed to address this specific 
> > issue, so I'm wondering if there could be something else causing this behavior.
> >
> > thanks,
> > Adam
> >
> > --
> >
> > Adam Shire
> > Member Services Librarian
> > Fenway Libraries Online <http://flo.org>
> > 617-442-2384
> >
> 
> 
> 
> --
> Benjamin Shum
> Evergreen Systems Manager
> Bibliomation, Inc.
> 24 Wooster Ave.
> Waterbury, CT 06708
> 203-577-4070, ext. 113

--
Paul Hoffman <paul at flo.org>
Systems Librarian
Fenway Libraries Online
c/o Wentworth Institute of Technology
550 Huntington Ave.
Boston, MA 02115
(617) 442-2384 (FLO main number)

