[OPEN-ILS-GENERAL] Mangled UTF8 characters with imported MARC records in Z39.50
jason at sigio.com
Fri Dec 2 17:20:55 EST 2016
The records are most likely not MARC-8 or UTF-8. The example you
shared looks like a Windows-1252 "smart" quote. I would not be surprised
if the records have characters from multiple character sets in them.
I've seen that before.
I don't have any useful suggestions for you, other than recommending that
staff not import records from those sources.
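
For anyone who wants to check what a suspect record actually contains
before blaming the MARC-8 conversion, here is a rough Python sketch. It is
not part of Evergreen, YAZ, or MARC::Charset; the function name
guess_encodings and the sample byte strings are made up for illustration.
It only reports which of a few candidate encodings a byte string decodes
under without error. A clean decode does not prove the encoding; it just
rules the others out.

```python
# Hypothetical helper, not part of Evergreen, YAZ, or MARC::Charset.
CANDIDATES = ("ascii", "utf-8", "cp1252")


def guess_encodings(raw: bytes) -> list[str]:
    """Return the candidate encodings under which `raw` decodes without error."""
    matches = []
    for name in CANDIDATES:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


# Made-up samples: the same word with a right single quote in two encodings.
utf8_sample = "they\u2019ve".encode("utf-8")      # quote becomes bytes e2 80 99
cp1252_sample = "they\u2019ve".encode("cp1252")   # quote becomes single byte 92

print(guess_encodings(utf8_sample))
print(guess_encodings(cp1252_sample))
print(utf8_sample.hex(" "))  # hex dump, handy for spotting non-ASCII bytes
```

If only cp1252 comes back, a single-byte "smart" quote is a plausible
culprit. If utf-8 also comes back for a string that has non-ASCII bytes,
the data may already be UTF-8 rather than MARC-8.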
On 12/02/2016 04:52 PM, Brent Mills wrote:
> I’ve recently noticed some issues with imported MARC records from a
> specific set of Z39.50 servers.
> A noticeable number of records imported through
> Prospector/MaineCat targets have mangled characters when diacritics,
> symbols, etc. are present in the record.
> Does anyone have ideas on what could be causing the character
> encoding problems from these particular targets? Or has anyone run
> into this at their own site?
> - dgo.conf has <charset>marc-8</charset>. Changing that to usmarc or
> utf8 has had no effect
> - xml2marc-yaz.cfg is set up as described
> in https://wiki.evergreen-ils.org/doku.php?id=evergreen-admin:sru_and_z39.50;
> changing the charset options hasn’t had any effect either
> - the encoding/translation problems do not happen with OCLC and Library
> of Congress targets; they seem to mainly affect servers with the INNOPAC
> db type. I’m not sure if that’s related.
> Going through the logs I can see things like:
> open-ils.search.z3950.search_class: no mapping found for [0x80] at
> position 56 in Kurt and Joe tangle with the most
> determined enemy theyâve ever encountered when a ruthless
> powerbroker schemes to build a new Egyptian empire as glorious as
> those of the Pharaohs. Part of his plan rests on the manipulation of
> a newly discovered aquifer beneath the Sahara, but an even
> more devastating weapon at his disposal may threaten the entire
> world: a plant extract known as the black mist, discovered in the
> City of the Dead and rumored to have the power to take life from the
> living and restore it to the dead. With the balance of power
> in Africa and Europe on the verge of tipping, Kurt, Joe, and the
> rest of the NUMA team will have to fight to discover the
> truth behind the legendsâbut to do that, they have to confront in
> person the greatest legend of them all: Osiris, the ruler of
> the Egyptian underworld. g0=ASCII_DEFAULT g1=EXTENDED_LATIN at
> /usr/share/perl5/MARC/Charset.pm line 308.
> So I’m thinking something is happening in the MARC8 to UTF8 conversion?
> Attaching a screenshot of what it looks like in the Z39.50 Import
> screen. The 264s have been the most obvious place to see the issue, but
> it happens in any field with special characters.
> Been banging my head trying to figure out what’s causing this. Any help
> would be appreciated!
> Thank you,
> Brent Mills
> Systems Librarian | Sage Library System
> email: brent at hoodriverlibrary.org
> tickets: https://sagelib.org/support