[OPEN-ILS-GENERAL] HTML entities in MARC Record editor

Dan Scott dan at coffeecode.net
Wed Mar 21 15:47:09 EDT 2018


Ah, of course you're right that the & will get escaped as &amp; in the
MARCXML in biblio.record_entry.marc. But then when it's displayed in the
catalogue or if you load it back up in the editor, you should just see "&"
again.
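
To make that round trip concrete, here is a minimal Python sketch. It is
not Evergreen code, just the standard library's XML module, and the bare
element names (no MARCXML namespace, no indicators) are a simplification
I'm assuming for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down 520 field; real MARCXML is namespaced.
field = ET.Element("datafield", {"tag": "520"})
sub = ET.SubElement(field, "subfield", {"code": "a"})
sub.text = "This & that"  # what the cataloguer typed in the editor

# Serializing escapes the ampersand, as in biblio.record_entry.marc.
stored = ET.tostring(field, encoding="unicode")
print(stored)

# Parsing the stored XML hands back the plain ampersand again.
reloaded = ET.fromstring(stored).find("subfield").text
print(reloaded)
```

The escaping only exists in the serialized XML; anything that parses the
record properly should never see it.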

However, I just tested putting "This &amp; that" into a 520 field in our
2.12 system and noticed that the displayed data is "This & that" -- meaning
it is *not* getting escaped. Which is a Very Bad Thing. (Bug forthcoming).
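
For anyone wondering why that matters, here is a rough Python sketch of
the difference, assuming the field literally contains the five characters
"&amp;". This is not how the catalogue templates are written, it only
illustrates escaping at output time:

```python
import html

value = "This &amp; that"  # literal field text (hypothetical sample)

# The bug: raw text dropped into the page, so a browser treats
# "&amp;" as an entity and renders "This & that".
unescaped = f"<p>{value}</p>"

# The fix: escape on output, so the browser renders the text exactly
# as it was entered, "This &amp; that".
escaped = f"<p>{html.escape(value)}</p>"

print(unescaped)
print(escaped)
```

The same missing escape would let a "<" in a record be treated as markup,
which is the real worry.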

As for your free Overdrive records, yes, I regularly spend time fixing mass
corruption of records that are sourced from OCLC, where someone has
almost-but-not-quite managed to flesh out the table of contents & summary
fields correctly. There is some pretty heinous data out there :)
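
If it helps with the cleanup, the general idea can be sketched in Python
like this. Treat it as a starting point under stated assumptions, not a
tested Evergreen procedure: unescape_subfields is a name I made up, it
works on a MARCXML string rather than on the database, and it would also
"fix" a record where the entity text is deliberate, so review the matches
first:

```python
import html
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # standard MARCXML namespace


def unescape_subfields(marcxml: str, tags=("520",)) -> str:
    """Undo one level of HTML entity encoding in the given fields."""
    ET.register_namespace("", MARC_NS)  # keep output un-prefixed
    record = ET.fromstring(marcxml)
    for field in record.iter(f"{{{MARC_NS}}}datafield"):
        if field.get("tag") not in tags:
            continue
        for sub in field.iter(f"{{{MARC_NS}}}subfield"):
            if sub.text:
                sub.text = html.unescape(sub.text)
    return ET.tostring(record, encoding="unicode")


# Hypothetical sample record with double-encoded entities in the 520.
sample = (
    f'<record xmlns="{MARC_NS}">'
    '<datafield tag="520" ind1=" " ind2=" ">'
    '<subfield code="a">Cats &amp;amp; dogs &amp;#8212; a story</subfield>'
    "</datafield></record>"
)
cleaned = unescape_subfields(sample)
print(cleaned)
```

Parsing and re-serializing avoids the risk of a regular expression
mangling the XML escaping itself.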

On Wed, Mar 21, 2018 at 3:29 PM, Josh Stompro <stomproj at exchange.larl.org>
wrote:

> Hello Dan,
>
>
>
> We are still on 2.10 using the XUL client, so maybe the 520 display
> anomaly has been fixed in a later version.  I’ll make a note to check back
> once we are on a more modern version.
>
>
>
> Would it be accurate to say that characters like & in the marc editor are
> encoded as html entities in the biblio.record_entry.marc since they are
> stored as marc xml?
>
>
>
> The record that I was looking at was one of the free overdrive records,
> which are very very very rough, so it wouldn’t surprise me that they are
> grabbing the 520 from a web page and not being very careful with encoding.
> I just looked for occurrences of ‘&amp’ and there are only 344 of them
> and all but 3 are from the free overdrive records.  There are also quite a
> few instances (6000) of &#8212; (em dash), again they are all the free
> overdrive records.  I guess we get what we pay for.
>
>
>
> I’m tempted to just use regexp_replace against biblio.record_entry to try
> and clean these up, like the example here:
> https://wiki.evergreen-ils.org/doku.php?id=scratchpad:random_magic_spells#how_to_prune_a_tag_under_the_hood
>
>
>
> Josh Stompro - LARL IT Director
>
>
>
> *From:* Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org]
> *On Behalf Of *Dan Scott
> *Sent:* Tuesday, March 20, 2018 4:23 PM
> *To:* Evergreen Discussion Group <open-ils-general at list.georgialibraries.org>
> *Subject:* Re: [OPEN-ILS-GENERAL] HTML entities in MARC Record editor
>
>
>
> Hi Josh:
>
> Quick question: XUL or web staff client? And version?
>
>
>
> In theory, what you see is what you should get - MARC has no idea what
> HTML entities are, so "&" in the editor should be displayed as "&"
> (properly escaped, of course) in the catalogue.
>
> If you see &amp; in the biblio.record_entry.marc, it may be the result
> of corrupted catalogue enrichment efforts (e.g. grabbing the summary for a
> book from a website via a script with a bug), and thus should just be
> corrected directly to "&". Unless it's a deliberately torturous book title
> like "Escaping <HTML> &amp; other Secure Web Practices" :)
>
> If &amp; in the MARC shows up as just & in the 520 catalogue output, it
> sounds like there might be a bug for us to track down...
>
>
>
> Thanks,
>
> Dan
>
>
>
> On Tue, Mar 20, 2018 at 9:50 PM, Josh Stompro <stomproj at exchange.larl.org>
> wrote:
>
> Hello, could someone give me some pointers with regard to html entities in
> marc data?  Sometimes I see &amp; used in 490a data and displayed as &amp;
> in the evergreen marc editor, and in the catalog it is displayed as &amp;
> and not as &.
>
>
>
> We also see things like a 520 that contains &amp; but it does get
> displayed as & in the catalog?
>
>
>
> And when I look at the biblio.record_entry.marc It looks like & in the
> editor gets encoded as &amp;, so is this a double encoding error?
> Should I ever see html entities when looking at marc data in the editor?
>
>
>
> If those should be cleaned up, anyone have any magic spells/queries for
> doing so?
>
> Thanks
>
> Josh
>
>
>
> Lake Agassiz Regional Library - Moorhead MN larl.org
>
> Josh Stompro     | Office 218.233.3757 EXT-139
>
> LARL IT Director | Cell 218.790.2110
>
>
>
>
>