[OPEN-ILS-GENERAL] dewey call number normalization

Ben Shum bshum at biblio.org
Fri Apr 25 07:26:57 EDT 2014


Huh, so the lines you shared seem strange and not what I would have
expected, especially:

2 | 720 H74 1979    | 720_H74_197900000000000

Seeing the year padded with 0's instead of the first group of numbers
(the 720) looks weird, and it surely explains a lot about why sorting seems
off.  I just replicated the same effect on our system using a Dewey class
number with a year at the end of the Dewey string, so it's definitely
starting to feel like this is a new bug in the normalizer function --
asset.label_normalizer_dewey()
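
To make the effect concrete, here is a minimal sketch that sorts a few of the
label_sortkey values from Paul's query output (quoted below) by plain string
comparison.  It assumes Perl's cmp behaves like the database ORDER BY for
these ASCII-only keys, which matches the row order Paul posted but is not
something checked against the server's collation settings.

```perl
use strict;
use warnings;

# [label, label_sortkey] pairs copied from the label_class 2 rows
# in Paul's query output
my @rows = (
    [ '720 a',          '720_000000000000000_A'        ],
    [ '720 .H47',       '720_000000000000000__H47'     ],
    [ '720.1 H74 1979', '720_100000000000000_H74_1979' ],
    [ '720 H74 1979',   '720_H74_197900000000000'      ],
    [ '720 .H47 1980',  '720__H47_198000000000000'     ],
);

# Sort by sortkey using plain string comparison (assumed stand-in for
# the database ORDER BY), then keep only the labels
my @sorted = map { $_->[0] } sort { $a->[1] cmp $b->[1] } @rows;
print "$_\n" for @sorted;
```

When the zero padding lands on the year, the character right after "720_" is
"H" or "_" rather than a digit, so those keys sort after every 720.x key, and
"720.1 H74 1979" ends up ahead of "720 H74 1979".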

Looking back in time, I know that
https://bugs.launchpad.net/evergreen/+bug/1150939 was a recent bug.  It looks
like the test cases we were looking at back then did not include a year
as a potential part at the end of the call number; we were very focused
on testing cases with or without a leading prefix, followed by a three-digit
Dewey number and then a cutter.  This issue might affect Koha too, given that
we share this normalizer routine in both of our projects.

I guess it's time to file a new potential bug and take a closer look at the
asset.label_normalizer_dewey function to see what it's doing wrong...
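
As a possible starting point for test cases on that bug, here is a rough
sketch of an ordering check.  ddcnorm is copied from Paul's message quoted
below and is not the Evergreen function; check_order is a hypothetical helper
made up for this example, and the expected order is simply the one shown in
Paul's sorted output.

```perl
use strict;
use warnings;

# Copied from Paul's message quoted below; NOT asset.label_normalizer_dewey
sub ddcnorm {
    local $_ = uc shift;
    s/^\s+|\s+$|[\/']//g;
    s/(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])/ /g;
    tr/-/ /;
    tr/A-Za-z0-9. //cd;
    s/ \. /~/g;
    s/ \.|\. / /g;
    tr/ //s;
    return $_;
}

# Hypothetical helper: given a normalizer and labels listed in the expected
# shelf order, return the adjacent pairs whose sort keys are out of order
sub check_order {
    my ($normalize, @labels) = @_;
    my @bad;
    for my $i (1 .. $#labels) {
        my $prev = $normalize->($labels[$i - 1]);
        my $cur  = $normalize->($labels[$i]);
        push @bad, "'$labels[$i - 1]' > '$labels[$i]'" if $prev gt $cur;
    }
    return @bad;
}

# Expected order, taken from Paul's sorted output
my @expected = (
    '720 a', '720 .H47', '720 .H47 1980', '720 .H74 1979',
    '720 H74 1979', '720.1 .H47 1980', '720.1 .H74 1979', '720.1 H74 1979',
);

my @bad = check_order(\&ddcnorm, @expected);
print @bad ? "out of order: @bad\n" : "order ok\n";
```

Any other candidate normalizer could be passed in place of \&ddcnorm to run
the same list of labels through it.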

-- Ben


On Fri, Apr 25, 2014 at 4:45 AM, Paul Hoffman <paul at flo.org> wrote:

> On Thu, Apr 24, 2014 at 05:15:35PM -0400, Ben Shum wrote:
> > This will be a slightly more technical answer that may require some
> > direct database access to ascertain more details.
> >
> > You mention that the call numbers are identified as DDC.  So that's
> > label_class of 2, I believe.  We are using Dewey (DDC) for all of our
> > materials by default as well in our consortium.
> >
> > I'd be curious to know what the label_sortkey values were for those call
> > numbers you mention.  That field is what actually drives the sorting
> > values for a given set.
>
> Here's what our DB shows (Adam and I work together):
>
> SELECT   label_class, label, label_sortkey
> FROM     asset.call_number
> WHERE    label_sortkey like '720%'
> ORDER BY label_sortkey;
>
>  label_class |      label      |         label_sortkey
> -------------+-----------------+-------------------------------
>            1 | 720 H47 1979    | 720 H47 1979
>            2 | 720 a           | 720_000000000000000_A
>            2 | 720 .H47        | 720_000000000000000__H47
>            2 | 720.1 H74 1979  | 720_100000000000000_H74_1979
>            2 | 720.1 .H47 1980 | 720_100000000000000__H47_1980
>            2 | 720.1 .H74 1979 | 720_100000000000000__H74_1979
>            2 | 720 H74 1979    | 720_H74_197900000000000
>            2 | 720 .H47 1980   | 720__H47_198000000000000
>            2 | 720 .H47 1980   | 720__H47_198000000000000
>            2 | 720 .H74 1979   | 720__H74_197900000000000
> (10 rows)
>
> So the problem appears to be caused by the periods that sometimes occur
> before the Cutter number.  I don't know if that's kosher or not, but I can
> see that it occurs plenty in our (Voyager) catalog.
>
> Looking at the function asset.label_normalizer_dewey it seems to me that
> it can be done much more simply and efficiently if you leverage the fact
> that space (ASCII 32) and tilde (ASCII 126) come before and after
> (respectively) anything else meaningful that might be found in a call
> number.  Except periods, which complicate things.  Anyhow, here's a first
> stab at it:
>
> use strict;
> use warnings;
> sub ddcnorm {
>     local $_ = uc shift;
>     # Strip leading or trailing space and any slashes or apostrophes
>     s/^\s+|\s+$|[\/']//g;
>     # Insert a space at digit/non-digit boundaries
>     s/(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])/ /g;
>     # Replace some punctuation with a space
>     tr/-/ /;  # XXX What else?
>     # Strip extra junk -- XXX make this work on non-ASCII call numbers
>     tr/A-Za-z0-9. //cd;
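>     # A period with a space on each side (the decimal point, once the
>     # boundary spacing above is applied) becomes a tilde
>     s/ \. /~/g;
>     # Any other period next to a space (e.g. before the Cutter) becomes a space
>     s/ \.|\. / /g;
>     # Squeeze runs of spaces down to one
>     tr/ //s;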
>     return $_;
> }
>
> When I run our Deweys in the 720s through this, I get what seems to be the
> right order:
>
>            2 | 720 a           | 720 A
>            2 | 720 .H47        | 720 H 47
>            2 | 720 .H47 1980   | 720 H 47 1980
>            2 | 720 .H47 1980   | 720 H 47 1980
>            2 | 720 .H74 1979   | 720 H 74 1979
>            2 | 720 H74 1979    | 720 H 74 1979
>            2 | 720.1 .H47 1980 | 720~1 H 47 1980
>            2 | 720.1 .H74 1979 | 720~1 H 74 1979
>            2 | 720.1 H74 1979  | 720~1 H 74 1979
>
> If there's any interest, I'll run our entire set of Deweys through it and
> see if I can make sense of the results.  Hmm... should prefixes like "j" or
> "C" (juvenile or Canadian) be ignored?
>
> Paul.
>
> > On Thu, Apr 24, 2014 at 3:32 PM, Adam Shire <adam at flo.org> wrote:
> >
> > > Hi Everyone,
> > >
> > > We are testing in Evergreen 2.5.2
> > >
> > > I'm noticing what I think looks like incorrect behavior when using the
> > > call number browse feature.
> > >
> > > Doing a call number browse search for 720 results in the following call
> > > number sort order:
> > >
> > > 720 H47 1979
> > > 720 .H47
> > > 720.1 H74 1979
> > > 720.1 .H47 1980
> > > 720.1 .H74 1979
> > > 720 H74 1979
> > > 720 .H74 1979
> > >
> > >
> > > It looks like the decimal point might be throwing things off. I think
> > > that should be taken care of in a normalizer, but maybe there is a
> > > reason not to. I think the 720.1's should come at the end of this list,
> > > regardless of the decimal point before the cutter.
> > >
> > > All of the call numbers are identified as DDC.
> > >
> > > You can probably replicate this here:
> > > http://emerson.eg.flo.org/eg/opac/cnbrowse?cn=715&locg=2
> > >
> > >
> > > I didn't see any bug reports that seemed to address this specific
> > > issue, so I'm wondering if there could be something else causing this
> > > behavior.
> > >
> > > thanks,
> > > Adam
> > >
> > > --
> > >
> > > Adam Shire
> > > Member Services Librarian
> > > Fenway Libraries Online <http://flo.org>
> > > 617-442-2384
> > >
> >
> >
> >
> > --
> > Benjamin Shum
> > Evergreen Systems Manager
> > Bibliomation, Inc.
> > 24 Wooster Ave.
> > Waterbury, CT 06708
> > 203-577-4070, ext. 113
>
> --
> Paul Hoffman <paul at flo.org>
> Systems Librarian
> Fenway Libraries Online
> c/o Wentworth Institute of Technology
> 550 Huntington Ave.
> Boston, MA 02115
> (617) 442-2384 (FLO main number)
>



-- 
Benjamin Shum
Evergreen Systems Manager
Bibliomation, Inc.
24 Wooster Ave.
Waterbury, CT 06708
203-577-4070, ext. 113