[OPEN-ILS-GENERAL] Mangled UTF8 characters with imported MARC records in Z39.50

Jason Stephenson jason at sigio.com
Fri Dec 2 17:20:55 EST 2016


Brent,

The records are most likely not MARC-8 or UTF-8. The example you
shared looks like a Windows-1252 "smart" quote. I would not be surprised
if the records have characters from multiple character sets in them.
I've seen that before.
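
One way to check, as a rough sketch and not anything Evergreen-specific: take
the raw bytes of a suspect field and see how they decode under a few candidate
encodings. The helper name try_decodings and the candidate list are
illustrative assumptions, and the sample bytes are built from the apostrophe
in "they’ve", not taken from the actual record.

```python
# Sketch: see how a suspect byte sequence decodes under candidate encodings.
# CANDIDATES and try_decodings are illustrative names, not part of any library.
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def try_decodings(raw: bytes) -> dict:
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for name in CANDIDATES:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# U+2019 RIGHT SINGLE QUOTATION MARK, the "smart" apostrophe in "they’ve".
as_utf8 = "they\u2019ve".encode("utf-8")      # b'they\xe2\x80\x99ve'
as_cp1252 = "they\u2019ve".encode("cp1252")   # b'they\x92ve'

for label, raw in (("utf-8 bytes", as_utf8), ("cp1252 bytes", as_cp1252)):
    print(label, raw, try_decodings(raw))
```

The same character is the single byte 0x92 in Windows-1252 and the three
bytes 0xE2 0x80 0x99 in UTF-8, so it is the raw bytes of the record, not the
rendered text, that show which encoding a source actually sent.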

I don't have any useful suggestions for you, other than that staff
avoid importing records from those sources.

Jason

On 12/02/2016 04:52 PM, Brent Mills wrote:
> Hello,
> 
> I’ve recently noticed some issues with imported MARC records from a
> specific set of Z39.50 servers.
> 
> A noticeable number of records imported through
> Prospector/MaineCat targets have mangled characters when diacritics,
> symbols, etc. are present in the record.
> 
> Does anyone have some ideas on what could be causing the character
> encoding problems from these particular targets? Or run into this at
> their own site?
> 
> - dgo.conf has <charset>marc-8</charset>. Changing that to usmarc or
> utf8 has had no effect.
> - xml2marc-yaz.cfg is set up as described
> in https://wiki.evergreen-ils.org/doku.php?id=evergreen-admin:sru_and_z39.50
> and changing the charset options hasn’t had any effect either.
> - The encoding/translation problems do not happen with OCLC and Library
> of Congress targets; they seem to mainly affect servers with the INNOPAC
> db type. I’m not sure if that’s related.
> 
> Going through the logs I can see things like:
> 
>     open-ils.search.z3950.search_class: no mapping found for [0x80] at
>     position 56 in Kurt and Joe tangle with the most
>     determined enemy they’ve ever encountered when a ruthless
>     powerbroker schemes to build a new Egyptian empire as glorious as
>     those of the Pharaohs. Part of his plan rests on the manipulation of
>     a newly discovered aquifer beneath the Sahara, but an even
>     more devastating weapon at his disposal may threaten the entire
>     world: a plant extract known as the black mist, discovered in the
>     City of the Dead and rumored to have the power to take life from the
>     living and restore it to the dead. With the balance of power
>     in Africa and Europe on the verge of tipping, Kurt, Joe, and the
>     rest of the NUMA team will have to fight to discover the
>     truth behind the legends—but to do that, they have to confront in
>     person the greatest legend of them all: Osiris, the ruler of
>     the Egyptian underworld. g0=ASCII_DEFAULT g1=EXTENDED_LATIN at
>     /usr/share/perl5/MARC/Charset.pm line 308.
> 
> 
> So I’m thinking something is happening in the MARC8 to UTF8 conversion?
> 
> Attaching a screenshot of what it looks like in the Z39.50 Import
> screen. The 264s have been the most obvious place to see the issue, but
> it happens in any field with special characters.
> 
> Been banging my head trying to figure out what’s causing this. Any help
> would be appreciated!
> 
> Thank you,
> 
> -Brent
> 
> -----------------------------
> 
> Brent Mills
> Systems Librarian | Sage Library System
> 
> email: brent at hoodriverlibrary.org
> tickets: https://sagelib.org/support
> 

