[OPEN-ILS-GENERAL] HTML entities in MARC Record editor

Elaine Hardy ehardy at georgialibraries.org
Wed Mar 21 15:30:11 EDT 2018


Jason,

Dan is correct -- &amp; appears rather than an & because at some point the
symbol was not interpreted correctly, or someone copied and pasted information
from another source into a bib record and didn't notice that the & didn't
render correctly. &amp; can also be the result of an OCR program not correctly
interpreting the scan. We used to see it much more often. It shows up in 520s
and 505s since they are often copied and pasted or scanned.
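
To make the mechanism concrete, here is a minimal Python sketch (standard
library only; the sample strings are made up for illustration and are not
taken from any real record). It assumes the record is serialized as XML, as
discussed further down the thread, and shows why entity text pasted into a
field survives as visible "&amp;" while a plain "&" round-trips cleanly:

```python
from xml.sax.saxutils import escape, unescape

clean = "Cats & dogs"        # what the cataloger meant to enter
pasted = "Cats &amp; dogs"   # entity text copied from a web page or OCR output

# Serializing as XML escapes every "&" exactly once.
stored_clean = escape(clean)    # "Cats &amp; dogs"
stored_pasted = escape(pasted)  # "Cats &amp;amp; dogs"

# Reading the XML back undoes exactly one level of escaping.
shown_clean = unescape(stored_clean)    # "Cats & dogs"
shown_pasted = unescape(stored_pasted)  # "Cats &amp; dogs" (stray entity survives)

print(stored_pasted)
print(shown_pasted)
```

So seeing "&amp;" in the stored XML is normal on its own; it is the doubled
form in storage (or "&amp;" in the editor) that points to bad source data.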

There are also occasions where utf-8 translation of diacritics and other
special characters doesn't happen correctly and the codes display in the
bib record. OCLC did an upgrade once that caused the problem for records
imported with the Z39.50 gateway. Evergreen had a bug many versions ago as
well that didn't display such characters correctly.
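
One common way this kind of garbling happens (not necessarily what went wrong
in either of the cases above, whose causes are not described here) is that
UTF-8 bytes get read with a different character encoding. A small Python
sketch, with a made-up title and Latin-1 chosen purely as an example of the
wrong encoding:

```python
title = "Les Misérables"

# Correct round trip: encode and decode with the same codec.
round_trip = title.encode("utf-8").decode("utf-8")

# Mismatch: UTF-8 bytes read as Latin-1 turn one accented letter
# into two unrelated characters.
garbled = title.encode("utf-8").decode("latin-1")

# The damage is reversible only while the mis-decoded text is still intact.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```

Once a record has been saved and re-edited in the garbled form, this kind of
reversal is usually no longer reliable, which is one reason overlaying a clean
record tends to be the safer fix.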

When I find records that don't display correctly, I overlay a more current
record from OCLC. If it is a scanning or copy-and-paste issue, I usually
have to correct it in OCLC first (our only source of bib records is OCLC).
While the display isn't as desired, if it isn't causing any search and
retrieval issues, we just handle it serendipitously.
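
For anyone who would rather find these records than stumble on them, here is
a rough Python sketch of a report-only check. It assumes you have the MARCXML
for a record as a string (for example, exported from
biblio.record_entry.marc); the function name, the sample record, and the
choice of tags are all hypothetical, and nothing here touches the database:

```python
import html
import re
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Text that still looks like an entity AFTER the XML has been parsed.
ENTITY = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def find_stray_entities(marcxml, tags=("505", "520")):
    """Return (tag, subfield code, current text, suggested text) tuples."""
    root = ET.fromstring(marcxml)
    hits = []
    for field in root.iter(f"{{{MARC_NS}}}datafield"):
        if field.get("tag") not in tags:
            continue
        for sub in field.iter(f"{{{MARC_NS}}}subfield"):
            text = sub.text or ""
            if ENTITY.search(text):
                # Undo one level of entity encoding as a suggestion only.
                hits.append((field.get("tag"), sub.get("code"),
                             text, html.unescape(text)))
    return hits

# Made-up record: the 520 holds double-encoded text, the 245 is fine.
sample = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<datafield tag="520" ind1=" " ind2=" ">'
    '<subfield code="a">War &amp;amp; peace &amp;#8212; a summary</subfield>'
    '</datafield>'
    '<datafield tag="245" ind1="1" ind2="0">'
    '<subfield code="a">War &amp; peace</subfield>'
    '</datafield>'
    '</record>'
)

for tag, code, before, after in find_stray_entities(sample):
    print(tag, code, repr(before), "->", repr(after))
```

Because a title could legitimately contain entity text, the suggestions are
meant to be reviewed by a cataloger rather than applied blindly.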

Elaine



J. Elaine Hardy
PINES & Collaborative Projects Manager
Georgia Public Library Service/PINES
1800 Century Place, Ste. 150
Atlanta, GA 30045

404.235.7128 Office
404.548.4241 Cell
404.235.7201 FAX

On Wed, Mar 21, 2018 at 10:29 AM, Josh Stompro <stomproj at exchange.larl.org>
wrote:

> Hello Dan,
>
>
>
> We are still on 2.10 using the XUL client, so maybe the 520 display
> anomaly has been fixed in a later version.  I’ll make a note to check back
> once we are on a more modern version.
>
>
>
> Would it be accurate to say that characters like & in the marc editor are
> encoded as html entities in the biblio.record_entry.marc since they are
> stored as marc xml?
>
>
>
> The record that I was looking at was one of the free overdrive records,
> which are very very very rough, so it wouldn’t surprise me that they are
> grabbing the 520 from a web page and not being very careful with encoding.
> I just looked for occurrences of ‘&amp’ and there are only 344 of them
> and all but 3 are from the free overdrive records.  There are also quite a
> few instances (6000) of &#8212; (em dash); again, they are all from the free
> overdrive records.  I guess we get what we pay for.
>
>
>
> I’m tempted to just use regexp_replace against biblio.record_entry to try
> and clean these up, like the example here:
> https://wiki.evergreen-ils.org/doku.php?id=scratchpad:random_magic_spells#how_to_prune_a_tag_under_the_hood
>
>
>
> Josh Stompro - LARL IT Director
>
>
>
> *From:* Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] *On Behalf Of *Dan Scott
> *Sent:* Tuesday, March 20, 2018 4:23 PM
> *To:* Evergreen Discussion Group <open-ils-general at list.georgialibraries.org>
> *Subject:* Re: [OPEN-ILS-GENERAL] HTML entities in MARC Record editor
>
>
>
> Hi Josh:
>
> Quick question: XUL or web staff client? And version?
>
>
>
> In theory, what you see is what you should get - MARC has no idea what
> HTML entities are, so "&" in the editor should be displayed as "&"
> (properly escaped, of course) in the catalogue.
>
> If you see &amp; in the biblio.record_entry.marc, it may be the result
> of corrupted catalogue enrichment efforts (e.g. grabbing the summary for a
> book from a website via a script with a bug), and thus should just be
> corrected directly to "&". Unless it's a deliberately torturous book title
> like "Escaping <HTML> &amp; other Secure Web Practices" :)
>
> If &amp; in the MARC shows up as just & in the 520 catalogue output, it
> sounds like there might be a bug for us to track down...
>
>
>
> Thanks,
>
> Dan
>
>
>
> On Tue, Mar 20, 2018 at 9:50 PM, Josh Stompro <stomproj at exchange.larl.org>
> wrote:
>
> Hello, could someone give me some pointers in regards to html entities in
> marc data?  Sometimes I see &amp; used in 490a data and displayed as &amp;
> in the evergreen marc editor, and in the catalog it is displayed as &amp;
> and not as &.
>
>
>
> We also see things like a 520 that contains &amp; but it does get
> displayed as & in the catalog?
>
>
>
> And when I look at the biblio.record_entry.marc It looks like & in the
> editor gets encoded as &amp;, so is this a double encoding error?
> Should I ever see html entities when looking at marc data in the editor?
>
>
>
> If those should be cleaned up, anyone have any magic spells/queries for
> doing so?
>
> Thanks
>
> Josh
>
>
>
> Lake Agassiz Regional Library - Moorhead MN larl.org
>
> Josh Stompro     | Office 218.233.3757 EXT-139
>
> LARL IT Director | Cell 218.790.2110
>
>
>
>
>