[OPEN-ILS-GENERAL] dewey call number normalization

Paul Hoffman paul at flo.org
Fri Apr 25 04:45:17 EDT 2014


On Thu, Apr 24, 2014 at 05:15:35PM -0400, Ben Shum wrote:
> This will be a slightly more technical answer that may require some direct
> database access to ascertain more details.
> 
> You mention that the call numbers are identified as DDC.  So that's
> label_class of 2, I believe.  We are using Dewey (DDC) for all of our
> materials by default as well in our consortium.
> 
> I'd be curious to know what the label_sortkey values were for those call
> numbers you mention.  That field is what actually drives the sorting values
> for a given set.

Here's what our DB shows (Adam and I work together):

SELECT   label_class, label, label_sortkey
FROM     asset.call_number
WHERE    label_sortkey like '720%'
ORDER BY label_sortkey;

 label_class |      label      |         label_sortkey         
-------------+-----------------+-------------------------------
           1 | 720 H47 1979    | 720 H47 1979
           2 | 720 a           | 720_000000000000000_A
           2 | 720 .H47        | 720_000000000000000__H47
           2 | 720.1 H74 1979  | 720_100000000000000_H74_1979
           2 | 720.1 .H47 1980 | 720_100000000000000__H47_1980
           2 | 720.1 .H74 1979 | 720_100000000000000__H74_1979
           2 | 720 H74 1979    | 720_H74_197900000000000
           2 | 720 .H47 1980   | 720__H47_198000000000000
           2 | 720 .H47 1980   | 720__H47_198000000000000
           2 | 720 .H74 1979   | 720__H74_197900000000000
(10 rows)

So the problem appears to be caused by the periods that sometimes occur before
the Cutter number.  I don't know if that's kosher or not, but I can see that it
occurs plenty in our (Voyager) catalog.
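
To see why those sort keys land in that order, it may help to compare them
byte by byte.  This is only a sketch: it assumes a plain code-point
comparison, which may not match the collation the database actually uses,
and the keys are simply copied from the query output above.

```perl
use strict;
use warnings;

# label_sortkey values copied from the query output above (shuffled)
my @keys = (
    '720__H74_197900000000000',
    '720_100000000000000_H74_1979',
    '720_000000000000000__H47',
    '720_H74_197900000000000',
    '720_000000000000000_A',
);

# Perl's default sort compares strings by code point (no "use locale"
# in effect), so '0' (48) < 'A' (65) < 'H' (72) < '_' (95).
my @sorted = sort @keys;
print "$_\n" for @sorted;
```

Under that assumption, a key without the zero-padded decimal field, such as
720_H74_..., compares 'H' or '_' against a digit and so lands after every
720.1 key.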

Looking at the function asset.label_normalizer_dewey, it seems to me that the
normalization can be done much more simply and efficiently if you leverage the
fact that space (ASCII 32) and tilde (ASCII 126) sort before and after
(respectively) anything else meaningful that might be found in a call number.
Except for periods, which complicate things.  Anyhow, here's a first stab at it:

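use strict;
use warnings;
sub ddcnorm {
    local $_ = uc shift;
    # Strip leading or trailing space and any slashes or apostrophes
    s/^\s+|\s+$|[\/']//g;
    # Insert a space at digit/non-digit boundaries
    s/(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])/ /g;
    # Replace some punctuation with a space
    tr/-/ /;  # XXX What else?
    # Strip extra junk -- XXX make this work on non-ASCII call numbers
    tr/A-Za-z0-9. //cd;
    # A period with a space on both sides (e.g. a decimal point, which is
    # now "720 . 1") becomes a tilde so it sorts after everything else
    s/ \. /~/g;
    # A period with a space on one side (e.g. before a Cutter) becomes a space
    s/ \.|\. / /g;
    # Squeeze runs of spaces down to a single space
    tr/ //s;
    return $_;
}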

When I run our Deweys in the 720s through this, I get what seems to be the
right order:

           2 | 720 a           | 720 A
           2 | 720 .H47        | 720 H 47
           2 | 720 .H47 1980   | 720 H 47 1980
           2 | 720 .H47 1980   | 720 H 47 1980
           2 | 720 .H74 1979   | 720 H 74 1979
           2 | 720 H74 1979    | 720 H 74 1979
           2 | 720.1 .H47 1980 | 720~1 H 47 1980
           2 | 720.1 .H74 1979 | 720~1 H 74 1979
           2 | 720.1 H74 1979  | 720~1 H 74 1979
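
For anyone who wants to try this, here is a minimal sketch of how a few of
those labels can be sorted by their normalized keys.  The function is
repeated so the snippet runs on its own; the @labels list and the %key_for
hash are just illustrative names, and the comparison is Perl's plain
code-point "cmp", which may differ from what the database collation does.

```perl
use strict;
use warnings;

# Same normalizer as above, repeated so this sketch is self-contained
sub ddcnorm {
    local $_ = uc shift;
    s/^\s+|\s+$|[\/']//g;
    s/(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])/ /g;
    tr/-/ /;
    tr/A-Za-z0-9. //cd;
    s/ \. /~/g;
    s/ \.|\. / /g;
    tr/ //s;
    return $_;
}

# A few labels from the 720s, deliberately out of order
my @labels = (
    '720.1 H74 1979',
    '720 .H74 1979',
    '720 a',
    '720.1 .H47 1980',
    '720 .H47',
);

# Compute each key once, then sort the labels by key
my %key_for;
for my $label (@labels) {
    $key_for{$label} = ddcnorm($label);
}
my @sorted = sort { $key_for{$a} cmp $key_for{$b} } @labels;

printf "%-16s | %s\n", $_, $key_for{$_} for @sorted;
```

Because space sorts before the tilde, every plain 720 label comes out ahead
of the 720.1 labels, whether or not a period precedes the Cutter.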

If there's any interest, I'll run our entire set of Deweys through it and see
if I can make sense of the results.  Hmm... should prefixes like "j" or "C"
(juvenile or Canadian) be ignored?

Paul.

> On Thu, Apr 24, 2014 at 3:32 PM, Adam Shire <adam at flo.org> wrote:
> 
> > Hi Everyone,
> >
> > We are testing in evergreen 2.5.2
> >
> > I'm noticing what I think looks like incorrect behavior when using the
> > call number browse feature.
> >
> > Doing a call number browse search for 720 results in the following call
> > number sort order:
> >
> > 720 H47 1979
> > 720 .H47
> > 720.1 H74 1979
> > 720.1 .H47 1980
> > 720.1 .H74 1979
> > 720 H74 1979
> > 720 .H74 1979
> >
> >
> > It looks like the decimal point might be throwing things off. I think that
> > should be taken care of in a normalizer, but maybe there is a reason not
> > to. I think the 720.1's should come at the end of this list, regardless of
> > the decimal point before the cutter.
> >
> > All of the call numbers are identified as DDC.
> >
> > you can probably replicate this here
> > http://emerson.eg.flo.org/eg/opac/cnbrowse?cn=715&locg=2
> >
> >
> > I didn't see any bug reports that seemed to address this specific issue,
> > so I'm wondering if there could be something else causing this behavior.
> >
> > thanks,
> > Adam
> >
> > --
> >
> > Adam Shire
> > Member Services Librarian
> > Fenway Libraries Online <http://flo.org>
> > 617-442-2384
> >
> 
> 
> 
> -- 
> Benjamin Shum
> Evergreen Systems Manager
> Bibliomation, Inc.
> 24 Wooster Ave.
> Waterbury, CT 06708
> 203-577-4070, ext. 113

-- 
Paul Hoffman <paul at flo.org>
Systems Librarian
Fenway Libraries Online
c/o Wentworth Institute of Technology
550 Huntington Ave.
Boston, MA 02115
(617) 442-2384 (FLO main number)

