[OPEN-ILS-GENERAL] HTML entities in MARC Record editor

Josh Stompro stomproj at exchange.larl.org
Wed Mar 21 10:29:37 EDT 2018


Hello Dan,

We are still on 2.10 using the XUL client, so maybe the 520 display anomaly has been fixed in a later version.  I’ll make a note to check back once we are on a more modern version.

Would it be accurate to say that characters like & in the MARC editor are encoded as entities (for example &amp;) in biblio.record_entry.marc because the records are stored as MARCXML?

The record that I was looking at was one of the free overdrive records, which are very very very rough, so it wouldn’t surprise me that they are grabbing the 520 from a web page and not being very careful with encoding.  I just looked for occurrences of ‘&amp’ and there are only 344 of them and all but 3 are from the free overdrive records.  There are also quite a few instances(6000) of — (em dash), again they are all the free overdrive records.  I guess we get what we pay for.

I’m tempted to just use regexp_replace against biblio.record_entry to try and clean these up, like the example here: https://wiki.evergreen-ils.org/doku.php?id=scratchpad:random_magic_spells#how_to_prune_a_tag_under_the_hood
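
As a rough illustration of that kind of cleanup, here is a Python sketch rather than the SQL regexp_replace approach from the wiki page. It assumes a record's MARCXML has been pulled out as a string, that only one tag (520 here) should be touched, and that every well-formed entity reference left in the subfield text is unwanted. The function name and sample record are made up for the example; nothing here is Evergreen code, and it would need testing on a copy of the data first.

```python
import html
import re
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Well-formed entity references only: &name; &#1234; &#x1F;
ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities_in_tag(marcxml, tag="520"):
    """Decode leftover entity references in the subfields of one MARC tag.

    Returns (new_marcxml, number_of_subfields_changed).
    Hypothetical helper for illustration only.
    """
    ET.register_namespace("", MARC_NS)  # keep the default namespace on output
    record = ET.fromstring(marcxml)
    changed = 0
    for field in record.iter("{%s}datafield" % MARC_NS):
        if field.get("tag") != tag:
            continue
        for sf in field.iter("{%s}subfield" % MARC_NS):
            if not sf.text:
                continue
            # After XML parsing, a stray "&amp;" in the data is literal text.
            new_text = ENTITY.sub(lambda m: html.unescape(m.group(0)), sf.text)
            if new_text != sf.text:
                sf.text = new_text
                changed += 1
    return ET.tostring(record, encoding="unicode"), changed


# Made-up record: the 520 holds scraped entities, the 245 is left alone.
sample = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<datafield tag="520" ind1=" " ind2=" ">'
    '<subfield code="a">Cats &amp;amp; dogs &amp;#8212; a history</subfield>'
    '</datafield>'
    '<datafield tag="245" ind1="0" ind2="0">'
    '<subfield code="a">Escaping &amp;amp; other practices</subfield>'
    '</datafield>'
    '</record>'
)
fixed, n = decode_entities_in_tag(sample)
```

Limiting the change to one tag keeps a deliberately odd title in the 245 from being rewritten, and going through an XML parser means the result is escaped correctly again when it is serialized.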

Josh Stompro - LARL IT Director

From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Dan Scott
Sent: Tuesday, March 20, 2018 4:23 PM
To: Evergreen Discussion Group <open-ils-general at list.georgialibraries.org>
Subject: Re: [OPEN-ILS-GENERAL] HTML entities in MARC Record editor

Hi Josh:
Quick question: XUL or web staff client? And version?

In theory, what you see is what you should get - MARC has no idea what HTML entities are, so "&" in the editor should be displayed as "&" (properly escaped, of course) in the catalogue.
If you see &amp; in the biblio.record_entry.marc, it may be the result of corrupted catalogue enrichment efforts (e.g. grabbing the summary for a book from a website via a script with a bug), and thus should just be corrected directly to "&". Unless it's a deliberately torturous book title like "Escaping <HTML> &amp; other Secure Web Practices" :)
If &amp; in the MARC shows up as just & in the 520 catalogue output, it sounds like there might be a bug for us to track down...
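
To make the distinction concrete, here is a small Python sketch of how an XML serializer treats the two cases. It uses the standard library only; the helper name and the sample strings are invented for illustration and are not taken from Evergreen.

```python
import xml.etree.ElementTree as ET


def subfield_xml(text):
    """Serialize one made-up MARCXML-style subfield holding the given text."""
    sf = ET.Element("subfield", {"code": "a"})
    sf.text = text
    return ET.tostring(sf, encoding="unicode")


# A plain ampersand in the data is escaped once when written as XML.
clean = subfield_xml("Cats & dogs")

# Data that already contains the literal text "&amp;" (for example, scraped
# from a web page without decoding) is escaped again on top of that.
corrupt = subfield_xml("Cats &amp; dogs")

print(clean)
print(corrupt)
```

Seeing &amp; in the stored XML is therefore normal for a plain ampersand; it is the doubled form that points at bad source data.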

Thanks,
Dan

On Tue, Mar 20, 2018 at 9:50 PM, Josh Stompro <stomproj at exchange.larl.org> wrote:
Hello, could someone give me some pointers regarding HTML entities in MARC data?  Sometimes I see &amp; used in 490a data and displayed as &amp; in the Evergreen MARC editor, and in the catalog it is displayed as &amp; and not as &.

We also see things like a 520 that contains &amp; but it does get displayed as & in the catalog.

And when I look at the biblio.record_entry.marc, it looks like &amp; in the editor gets encoded as &amp;amp;, so is this a double encoding error?  Should I ever see HTML entities when looking at MARC data in the editor?

If those should be cleaned up, anyone have any magic spells/queries for doing so?
Thanks
Josh

Lake Agassiz Regional Library - Moorhead MN larl.org
Josh Stompro     | Office 218.233.3757 EXT-139
LARL IT Director | Cell 218.790.2110

