[OPEN-ILS-DEV] AddedContent.pm produces garbled encoding for application/json

Linda Jansova skolkova at chello.cz
Mon Jan 14 11:06:33 EST 2019


Okay :-).

Of course it would be great if a more appropriate place to make 
the encoding correction were found!

Just a little note: at the end of the first message I mentioned:

"However, we are not sure if the issue is in AddedContent.pm or in our
Apache configuration (because our test Perl code run from bash works
okay but AddedContent.pm called from Apache does not)."

We have also tested AddedContent.pm's get_url function when run directly 
from Apache and the encoding has been okay (when viewed in the web 
browser and also when saved to a file).

Linda

PS: I'm adding our developer Jakub - who has come up with the encoding 
tweak - to Cc in case there is something else to add and/or clarify :-).

On 1/14/19 3:38 PM, Jason Stephenson wrote:
> Linda,
>
> Your plan sounds OK to me. I think it can wait for the new features,
> since your site is the main (probably the only) user of the module.
>
> I'll take a look at the Open Library bug you mentioned. We're not using
> it, but maybe there is something that we can do generically to resolve
> this in a way that doesn't require duplicated code.
>
> Cheers,
> Jason
>
> On 1/14/19 9:19 AM, Linda Jansova wrote:
>> Hi Jason,
>>
>> There is not a Launchpad bug for this yet because at this point it has
>> only been fixed locally (in ObalkyKnih.pm module in our Evergreen
>> installation). Currently we are in the process of adding some more
>> features to ObalkyKnih.pm module and testing these features. Of course,
>> we will create a wishlist bug and submit the enriched code (which would
>> also contain the encoding fix) to GIT. Should there be a wider interest
>> in the corrected code at this point, we would submit a patch for the
>> encoding only (and add the new features at a later date).
>>
>> Maybe this (or a similar) solution would also work for encoding
>> issues in added content coming from Open Library - for which a bug has
>> been opened already: https://bugs.launchpad.net/evergreen/+bug/1610678.
>> (Josh did some testing with Open Library data encoding.)
>>
>> Linda
>>
>> On 1/14/19 2:45 PM, Jason Stephenson wrote:
>>> Hi, Linda.
>>>
>>> Good find! That sounds like a bug to me. Could you submit a Launchpad
>>> bug (if there isn't one already) and a patch commit, please?
>>>
>>> Depending on which module this occurs in, the fix may not be as simple
>>> as using utf8::decode. It could be that we need code to determine the
>>> character set from the HTTP response.
>>>
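
One possible shape for such a check, sketched with core modules only. The
helper name charset_from_content_type, the UTF-8 fallback, and the sample
header and body are assumptions made for illustration, not code from
AddedContent.pm. In real code the header value would presumably come from
$res->header('Content-Type').

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: pull the charset parameter out of a Content-Type
# header value, falling back to UTF-8 when none is given.
sub charset_from_content_type {
    my ($content_type) = @_;
    if (defined $content_type
        && $content_type =~ /;\s*charset\s*=\s*"?([^";\s]+)"?/i) {
        return $1;
    }
    return 'UTF-8';
}

# Made-up sample values standing in for a real response.
my $content_type = 'application/json; charset=utf-8';
my $body_bytes   = qq({"name":"N\xC3\xA1rodn\xC3\xAD knihovna"});

# Decode the raw octets into a Perl character string using that charset.
my $charset = charset_from_content_type($content_type);
my $body    = decode($charset, $body_bytes);

printf "charset=%s bytes=%d chars=%d\n",
    $charset, length($body_bytes), length($body);
```

The fallback to UTF-8 is a design choice for JSON payloads; a provider that
sends another charset without declaring it would still need special handling.
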
>>> Cheers,
>>> Jason
>>>
>>> On 1/14/19 2:38 AM, Linda Jansova wrote:
>>>> Hi,
>>>>
>>>> Just letting you know that our developer has finally fixed this by
>>>> decoding the data blob coming from Obalkyknih.cz from UTF-8 into Perl's
>>>> internal representation:
>>>>
>>>> utf8::decode($response_content);
>>>>
>>>> Adding this step before further data processing takes place has
>>>> successfully solved the problem of seeing gibberish characters in our
>>>> catalog :-).
>>>>
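
A minimal sketch of what this decoding step does, assuming the response body
arrives as raw UTF-8 octets. The sample string is made up for illustration
and is not taken from ObalkyKnih.pm.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the response body: the UTF-8 bytes of "práce".
my $response_content = "pr\xC3\xA1ce";

# Before decoding, Perl treats every byte as a separate character.
my $length_before = length $response_content;

# utf8::decode converts the octets in place into a character string and
# returns true if the input was valid UTF-8.
my $ok = utf8::decode($response_content);

# After decoding, the two bytes C3 A1 have become the single character U+00E1.
my $length_after = length $response_content;

printf "valid=%d before=%d after=%d\n",
    ($ok ? 1 : 0), $length_before, $length_after;
```
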
>>>> Linda
>>>>
>>>> On 11/25/18 4:56 PM, Linda Jansova wrote:
>>>>> Hi,
>>>>>
>>>>> We have encountered a problem in AddedContent.pm's get_url function
>>>>> (both in Evergreen 3.1.4 and 2.12.6) - letters with diacritics from
>>>>> Czech added content provider Obalkyknih.cz have started being
>>>>> corrupted in our TPACs. Our added content provider has reported a
>>>>> switch to application/json MIME type.
>>>>>
>>>>> After the switch we seem to be getting strange chars in summary (and
>>>>> table of contents if available as text, not as an image). We have
>>>>> tried to locate the problem in a separate Perl program and have come
>>>>> to the conclusion that the data only get corrupted when fetched by the
>>>>> get_url function from AddedContent.pm.
>>>>>
>>>>> We have added additional logging to the following part of
>>>>> AddedContent.pm:
>>>>>
>>>>> # returns an HTTP::Response object
>>>>> sub get_url {
>>>>>       my( $self, $url ) = @_;
>>>>>
>>>>>       $logger->info("added content getting [timeout=$net_timeout, errors_remaining=$error_countdown] URL = $url");
>>>>>       my $agent = LWP::UserAgent->new(timeout => $net_timeout);
>>>>>
>>>>>       my $res = $agent->get($url);
>>>>>       $logger->info("added content request returned with code " . $res->code);
>>>>>
>>>>>       # VJ: extra logging added locally to inspect the raw response body
>>>>>       $logger->info("added contet res is: " . $res->content);
>>>>>
>>>>>       die "added content request failed: " . $res->status_line ."\n" unless $res->is_success;
>>>>>
>>>>>       return $res;
>>>>> }
>>>>>
>>>>> And a corresponding sample log looks like this:
>>>>>
>>>>> [2018-11-24 22:22:58] /usr/sbin/apache2
>>>>> [INFO:32231:AddedContent.pm:296:1543094555322319] added contet res is:
>>>>> [{"_id":"5bf904c905509b06182848ac","succ_toc_count":"0","cover_preview510_url":"https://cache.obalkyknih.cz/file/cover/1830989/preview510","ean":"9788026203667","uuid":["uuid:6ec863b0-055e-11e6-a611-005056827e51"],"cooperating_with":"https://www.cbdb.cz|CBDB.cz","succ_cover_count":"0","flag_bare_record":0,"csn_iso_690_source":"Národní
>>>>>
>>>>> knihovna Ä<U+008C>eské Republiky
>>>>> 18.11.2018","rating_url":"https://www.obalkyknih.cz/stars?value=100","rating_sum":200,"cover_thumbnail_url":"https://cache.obalkyknih.cz/file/cover/1830989/thumbnail","oclc_other":[],"reviews":[],"part_root":1,"orig_height":"510","backlink_url":"https://www.obalkyknih.cz/view?isbn=9788026203667","toc_thumbnail_url":"https://cache.obalkyknih.cz/file/toc/362647/thumbnail","cover_medium_url":"https://cache.obalkyknih.cz/file/cover/1830989/medium","annotation":{"source":"Web
>>>>>
>>>>> obalkyknih.cz","html":"Encyklopedie sociální práce (ESP)
>>>>> pÅ<U+0099>ináší pÅ<U+0099>es 200 hesel ze vÅ¡ech oblastí
>>>>> sociální práce. ESP je postavena na interakÄ<U+008D>ním pojetí
>>>>> sociální práce. JedineÄ<U+008D>nost sociální práce spoÄ
>>>>> <U+008D>ívá v tom, že operuje v poli mezi klientem a jeho
>>>>> sociálním prostÅ<U+0099>edím; pracovník je v obecném smyslu
>>>>> mediátorem mezi jednotlivcem a spoleÄ<U+008D>ností. Jeho úkolem je
>>>>> napomáhat sociálnímu fungování klientů a pomáhat
>>>>> spoleÄ<U+008D>nosti, aby citlivÄ<U+009B> reagovala na potÅ<U+0099>eby
>>>>> svých Ä<U+008D>lenů. Tato dvojitá mediaÄ<U+008D>ní role je role
>>>>> angažovaná. Je zakotvená hodnotovÄ<U+009B> v náboženství nebo v
>>>>> huma
>>>>> [2018-11-24 22:22:58] /usr/sbin/apache2
>>>>> [INFO:32231:ObalkyKnih.pm:228:1543094555322319] ObalkyKnih.cz for
>>>>> books?isbn=9788026203667 response was
>>>>> [{"_id":"5bf904c905509b06182848ac","succ_toc_count":"0","cover_preview510_url":"https://cache.obalkyknih.cz/file/cover/1830989/preview510","ean":"9788026203667","uuid":["uuid:6ec863b0-055e-11e6-a611-005056827e51"],"cooperating_with":"https://www.cbdb.cz|CBDB.cz","succ_cover_count":"0","flag_bare_record":0,"csn_iso_690_source":"Národní
>>>>>
>>>>> knihovna Ä<U+008C>eské Republiky
>>>>> 18.11.2018","rating_url":"https://www.obalkyknih.cz/stars?value=100","rating_sum":200,"cover_thumbnail_url":"https://cache.obalkyknih.cz/file/cover/1830989/thumbnail","oclc_other":[],"reviews":[],"part_root":1,"orig_height":"510","backlink_url":"https://www.obalkyknih.cz/view?isbn=9788026203667","toc_thumbnail_url":"https://cache.obalkyknih.cz/file/toc/362647/thumbnail","cover_medium_url":"https://cache.obalkyknih.cz/file/cover/1830989/medium","annotation":{"source":"Web
>>>>>
>>>>> obalkyknih.cz","html":"Encyklopedie sociální práce (ESP)
>>>>> pÅ<U+0099>ináší pÅ<U+0099>es 200 hesel ze vÅ¡ech oblastí
>>>>> sociální práce. ESP je postavena na interakÄ<U+008D>ním pojetí
>>>>> sociální práce. JedineÄ
>>>>> <U+008D>nost sociální práce spoÄ<U+008D>ívá v tom, že operuje v
>>>>> poli mezi klientem a jeho sociálním prostÅ<U+0099>edím; pracovník
>>>>> je v obecném smyslu mediátorem mezi jednotlivcem a
>>>>> spoleÄ<U+008D>ností. Jeho úkolem je napomáhat sociálnímu
>>>>> fungování klientů a pomáhat spoleÄ<U+008D>nosti, aby
>>>>> citlivÄ<U+009B> reagovala na potÅ
>>>>> <U+0099>eby svých Ä<U+008D>lenů. Tato dvojitá mediaÄ<U+008D>ní
>>>>> role je role angažovaná. Je zakotvená hodnot
>>>>>
>>>>> Our Perl code used for testing purposes
>>>>>
>>>>> #!/usr/bin/perl
>>>>> use strict;
>>>>> use warnings;
>>>>> use LWP::UserAgent;
>>>>>
>>>>> my $ua = LWP::UserAgent->new;
>>>>> my $response = $ua->get('http://cache.obalkyknih.cz/api/books?isbn=978-80-262-0366-7');
>>>>> # content returns the raw bytes of the response body
>>>>> my $content = $response->content;
>>>>> print $content ;
>>>>>
>>>>> produces data with the correct encoding:
>>>>>
>>>>> [{"_id":"5bf904c905509b06182848ac","succ_toc_count":"0","cover_preview510_url":"https://cache.obalkyknih.cz/file/cover/1830989/preview510","ean":"9788026203667","uuid":["uuid:6ec863b0-055e-11e6-a611-005056827e51"],"cooperating_with":"https://www.cbdb.cz|CBDB.cz","succ_cover_count":"0","flag_bare_record":0,"csn_iso_690_source":"Národní
>>>>>
>>>>> knihovna České Republiky
>>>>> 18.11.2018","rating_url":"https://www.obalkyknih.cz/stars?value=100","rating_sum":200,"cover_thumbnail_url":"https://cache.obalkyknih.cz/file/cover/1830989/thumbnail","oclc_other":[],"reviews":[],"part_root":1,"orig_height":"510","backlink_url":"https://www.obalkyknih.cz/view?isbn=9788026203667","toc_thumbnail_url":"https://cache.obalkyknih.cz/file/toc/362647/thumbnail","cover_medium_url":"https://cache.obalkyknih.cz/file/cover/1830989/medium","annotation":{"source":"Web
>>>>>
>>>>> obalkyknih.cz","html":"Encyklopedie sociální práce (ESP) přináší přes
>>>>> 200 hesel ze všech oblastí sociální práce. ESP je postavena na
>>>>> interakčním pojetí sociální práce. Jedinečnost sociální práce spočívá
>>>>> v tom, že operuje v poli mezi klientem a jeho sociálním prostředím;
>>>>> pracovník je v obecném smyslu mediátorem mezi jednotlivcem a
>>>>> společností. Jeho úkolem je napomáhat sociálnímu fungování klientů a
>>>>> pomáhat společnosti, aby citlivě reagovala na potřeby svých členů.
>>>>> Tato dvojitá mediační role je role angažovaná. Je zakotvená hodnotově
>>>>> v náboženství nebo v humanitních ideálech. V podobě tematicky
>>>>> uspořádaných samostatných hesel poskytuje toto rozsáhlé dílo přehled
>>>>> psychologických a sociologických teorií a přístupů s dopadem do
>>>>> sociální práce, náboženský, filozofický a společenský kontext oboru.
>>>>> Přináší přehled klíčových pojmů, technik a metod sociální práce,
>>>>> ohrožených skupin a poskytovaných služeb. Samostatnou část tvoří hesla
>>>>> charakterizující profesi sociálního pracovníka a hesla zabývající se
>>>>> výzkumem v oblasti sociální práce. ESP reflektuje domácí vývoj oboru v
>>>>> evropském kontextu a zohledňuje i širší mezinárodní zřetel. Hesla
>>>>> popisují daný jev a jeho historii, hodnotová východiska, aplikační
>>>>> možnosti a výzkum.","id":"2391500"},"csn_iso_690":"MATOUŠEK, Oldřich.
>>>>> <i>Encyklopedie sociální práce. </i>Vyd. 1. Editor Alois KŘIŠŤAN.
>>>>> Praha: Portál, 2013. 570
>>>>> s.","nbn":"cnb002436000","rating_avg100":"100","orig_width":"346","rating_avg5":5,"toc_pdf_url":"https://cache.obalkyknih.cz/file/toc/362647/pdf","ean_other":[],"nbn_other":[],"rating_count":2,"cover_icon_url":"https://cache.obalkyknih.cz/file/cover/1830989/icon","dig_obj":{"BOA001":{"public":0,"url":"https://kramerius.mzk.cz/search/i.jsp?pid=uuid:6ec863b0-055e-11e6-a611-005056827e51","uuid":"uuid:6ec863b0-055e-11e6-a611-005056827e51"},"ABA001":{"public":0,"url":"http://kramerius4.nkp.cz/search/i.jsp?pid=uuid:6ec863b0-055e-11e6-a611-005056827e51","uuid":"uuid:6ec863b0-055e-11e6-a611-005056827e51"}},"bib_year":"2013","oclc":"(OCoLC)852382182","bib_title":"Encyklopedie
>>>>>
>>>>> sociální práce","succ_bib_count":"0","book_id":"112038753"}]
>>>>>
>>>>> It confirms that data from Obalkyknih.cz are in UTF-8.
>>>>>
>>>>> Basically, what happens when things go wrong is this (using letter á
>>>>> as an example): the original character U+00E1
>>>>> (http://www.fileformat.info/info/unicode/char/e1/index.htm) ends up
>>>>> as the two characters Ã¡ (\u00c3\u00a1 as represented in memcached log). Ã is
>>>>> U+00C3 (https://www.fileformat.info/info/unicode/char/00c3/index.htm)
>>>>> while ¡ is U+00A1
>>>>> (http://www.fileformat.info/info/unicode/char/00A1/index.htm).
>>>>>
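
The corruption described above can be reproduced in isolation. This is only
a sketch under the assumption that the UTF-8 bytes are being treated as
ISO-8859-1 characters somewhere along the way and then encoded again; it does
not show where in the Evergreen stack that happens.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The character á (U+00E1) and its UTF-8 encoding, the two bytes C3 A1.
my $char  = "\x{E1}";
my $bytes = encode('UTF-8', $char);

# Misreading those two bytes as ISO-8859-1 yields two separate characters.
my $misread     = decode('ISO-8859-1', $bytes);
my @code_points = map { sprintf('U+%04X', ord($_)) } split //, $misread;

# Encoding the misread string as UTF-8 again doubles the damage.
my $double = encode('UTF-8', $misread);
my $hex    = join ' ', map { sprintf('%02X', ord($_)) } split //, $double;

print "@code_points\n";   # U+00C3 U+00A1
print "$hex\n";           # C3 83 C2 A1
```
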
>>>>> A description at
>>>>> https://www.effectiveperlprogramming.com/2011/08/know-the-difference-between-character-strings-and-utf-8-strings/
>>>>>
>>>>> probably gives some idea of how encoding can get broken in Perl.
>>>>>
>>>>> However, we are not sure if the issue is in AddedContent.pm or in our
>>>>> Apache configuration (because our test Perl code run from bash works
>>>>> okay but AddedContent.pm called from Apache does not).
>>>>>
>>>>> Does anybody have any idea where to look next to fix it?
>>>>>
>>>>> Thank you in advance!
>>>>>
>>>>> Linda
>>>>>
>>>>>

