[OPEN-ILS-GENERAL] Mangled UTF8 characters with imported MARC records in Z39.50
Elaine Hardy
ehardy at georgialibraries.org
Mon Dec 5 08:47:06 EST 2016
The 264 field is the one most likely to be added locally if the library
is editing hybrid AACR2/RDA records to full RDA in its catalog. The 264
is also likely to be the only field in a MARC record for an English
language title to contain a special character, since we indicate copyright and
phonogram copyright dates with the appropriate symbols. So the character
may have been added by local catalogers once the record was imported, or by
the vendor that created the record for them, using a character set that is
neither MARC-8 nor UTF8.
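
A minimal sketch, assuming Python 3.8 or later, of why these two symbols are
fragile. It prints the UTF8 bytes for the copyright and phonogram symbols and
what a client would display if it wrongly read those bytes as Latin-1. The
describe() helper is hypothetical and is not part of Evergreen or
MARC::Charset.

```python
# Hypothetical helper for illustration only; not an Evergreen or
# MARC::Charset function.
def describe(symbol: str) -> dict:
    utf8 = symbol.encode("utf-8")
    return {
        "codepoint": f"U+{ord(symbol):04X}",
        "utf8_hex": utf8.hex(" "),
        # What shows up if the UTF8 bytes are wrongly read as Latin-1.
        "misread_as_latin1": utf8.decode("latin-1"),
    }

copyright_sign = describe("\u00a9")   # copyright symbol
phonogram_sign = describe("\u2117")   # phonogram (sound recording) symbol

print(copyright_sign)
print(phonogram_sign)
```

Both symbols take more than one byte in UTF8, so any step that guesses the
wrong character set will turn each one into two or three unrelated characters.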
Your catalogers can correct the symbols as they are cataloging their local
items.
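
For anyone who wants to check a suspect record before correcting it, here is
a rough sketch (Python 3.8 or later) using the log message quoted below. It
encodes the opening of that summary as UTF8 and looks for the first 0x80
byte. The assumption, which I have not checked against the MARC::Charset
source, is that the "position" in the log is a 0-based byte offset. The
first_offset() helper is hypothetical.

```python
# Opening of the summary from the quoted log message, up to the word that
# contains the first curly apostrophe (U+2019).
snippet = "Kurt and Joe tangle with the most determined enemy they\u2019ve"

def first_offset(data: bytes, value: int) -> int:
    """Return the 0-based offset of the first byte equal to value, or -1."""
    return data.find(bytes([value]))

utf8_bytes = snippet.encode("utf-8")
offset = first_offset(utf8_bytes, 0x80)

# Show the offset and the three bytes around it.
print(offset, utf8_bytes[offset - 1:offset + 2].hex(" "))
```

If the offset printed matches the position in the log, that is at least
consistent with the field holding UTF8 bytes that were then run through the
MARC-8 converter. It does not prove it, but it is one concrete form the
remote cataloging problem could take.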
Elaine
J. Elaine Hardy
PINES & Collaborative Projects Manager
Georgia Public Library Service/PINES
1800 Century Place, Ste. 150
Atlanta, GA 30045
404.235.7128 Office
404.548.4241 Cell
404.235.7201 FAX
On Sat, Dec 3, 2016 at 2:02 PM, Brent Mills <brent at hoodriverlibrary.org>
wrote:
> Jason and Mike,
>
> Thanks so much for the help! Glad to know that it’s a remote issue and not
> something set up incorrectly on our side.
>
> -Brent
> -----------------------------
>
> Brent Mills
> Systems Librarian | Sage Library System
>
> email: brent at hoodriverlibrary.org
> tickets: https://sagelib.org/support
> phone: 541.610.8384
>
> On Dec 2, 2016, at 2:30 PM, Mike Rylander <mrylander at gmail.com> wrote:
>
> Jason hit on (almost certainly) the answer: bad records from sources that
> don't restrict cataloging to valid character sets. I'll add a couple
> comments below for general clarification, as well...
>
> On Fri, Dec 2, 2016 at 4:52 PM, Brent Mills <brent at hoodriverlibrary.org>
> wrote:
>
>> Hello,
>>
>> I’ve recently noticed some issues with imported MARC records from a
>> specific set of Z39.50 servers.
>>
>> A noticeable number of records imported through
>> Prospector/MaineCat targets have mangled characters when diacritics,
>> symbols, etc. are present in the record.
>>
>> Does anyone have some ideas on what could be causing the character
>> encoding problems from these particular targets? Or run into this at their
>> own site?
>>
>> - dgo.conf has <charset>marc-8</charset>. Changing that to usmarc or utf8
>> has had no effect.
>> - xml2marc-yaz.cfg is set up as described in
>> https://wiki.evergreen-ils.org/doku.php?id=evergreen-admin:sru_and_z39.50
>> Changing the charset options hasn’t had any effect either.
>>
>
> The reason this doesn't change anything is that it's only used to describe
> how Evergreen will serve records to /others/ as a Z39.50 server. Those
> are not client settings.
>
>
>> - the encoding/translation problems do not happen with OCLC and Library
>> of Congress targets, it seems to mainly affect servers with the INNOPAC db
>> type. I’m not sure if that’s related.
>>
>>
> This and the log message below are the smoking guns. OCLC and LoC are
> generally very good about making sure records really are in the character
> set they advertise, and that that character set is one of only MARC-8 or
> UTF8.
>
> So, Jason nailed it -- there are non-UTF8, non-MARC-8 characters in those
> records, as served by the INNOPAC sources. That's a (remote) cataloging
> issue.
>
> HTH,
>
> --Mike
>
>> Going through the logs I can see things like:
>>
>> open-ils.search.z3950.search_class: no mapping found for [0x80] at
>> position 56 in Kurt and Joe tangle with the most determined enemy they’ve
>> ever encountered when a ruthless powerbroker schemes to build a new
>> Egyptian empire as glorious as those of the Pharaohs. Part of his plan
>> rests on the manipulation of a newly discovered aquifer beneath the Sahara,
>> but an even more devastating weapon at his disposal may threaten the entire
>> world: a plant extract known as the black mist, discovered in the City of
>> the Dead and rumored to have the power to take life from the living and
>> restore it to the dead. With the balance of power in Africa and Europe on
>> the verge of tipping, Kurt, Joe, and the rest of the NUMA team will have to
>> fight to discover the truth behind the legends—but to do that, they have
>> to confront in person the greatest legend of them all: Osiris, the ruler of
>> the Egyptian underworld. g0=ASCII_DEFAULT g1=EXTENDED_LATIN at
>> /usr/share/perl5/MARC/Charset.pm line 308.
>>
>>
>> So I’m thinking something is happening in the MARC8 to UTF8 conversion?
>>
>> Attaching a screenshot of what it looks like in the Z39.50 Import screen.
>> The 264s have been the most obvious place to see the issue, but it happens
>> in any field with special characters.
>>
>> Been banging my head trying to figure out what’s causing this. Any help
>> would be appreciated!
>>
>> Thank you,
>>
>> -Brent
>>
>> <bad264.jpg>
>> -----------------------------
>>
>> Brent Mills
>> Systems Librarian | Sage Library System
>>
>> email: brent at hoodriverlibrary.org
>> tickets: https://sagelib.org/support
>>
>>
>
>