[OPEN-ILS-DEV] Removing URLS from 856

Tim Spindler tjspindler at gmail.com
Wed Apr 3 17:56:47 EDT 2013


I actually created a backup in the staging schema did a test update on that
data first before doing it on the production data using the select
statement based on the bib id I had in the staging table.


On Wed, Apr 3, 2013 at 5:48 PM, Liam Whalen <whalen.ld at gmail.com> wrote:

> I would still use SELECTs to see what you are replacing.  Subfield u is
> repeatable, so if you have a 856 field with two URLs, one with your HCC
> link, and the other with a a different link but an HCC short name, that
> regex will replace that as well.
>
> That is most likely an unusual situation, but if you are not really
> familiar with this data there is no way to know.  I guess there is only so
> much that can be done with bad data, but I always feel better knowing that
> I'm not getting rid of something I am not anticipating deleting.
>
> Liam
>
> On 2013-04-03, at 9:13 AM, Tim Spindler <tjspindler at gmail.com> wrote:
>
> Thanks fro all the replies.  I did get the following regex to work
>
> regexp_replace(marc,E'<datafield tag="856" ind1="4" ind2="0"><subfield
> code="u">
> http://ezhc.ez.cwmars.org:4000/login.url=http://digital.films.com/PortalPlaylists.aspx.aid=9206.+?HCC
> </subfield></datafield>','','g')
>
> The url is unique to Films on Demand  for HCC so that seems to work well
> to identify the beginning part of the 856 and then the HCC part identifies
> the end.
>
>
>
>
> On Wed, Apr 3, 2013 at 6:51 AM, Liam Whalen <whalen.ld at gmail.com> wrote:
>
>> Sorry for the chain replying to my own emails.  But, that code is not
>> exactly right.  using matches between ^ $ will not get you what you want
>> exactly, that will only return the first occurrence of 856 replaced in a
>> record.  It has been over a year since I've done stuff like this, so I'm
>> sorry if the code is not very usable.  However, the code wasn't really my
>> main point.  What I've learned using regex to replace data is I expect
>> regex to get rid of useful data because there is no way to predict the
>> outliers in the data.  Almost definitely, there will be some 856s fields
>> that do not conform to typical pattern in your data, and regex may remove
>> them, in which case you will lose useful data.
>>
>> Liam
>> On 2013-04-03, at 6:35 AM, Liam Whalen <whalen.ld at gmail.com> wrote:
>>
>> I forgot to mention checking the data that was replaced:
>>
>> SELECT id, replaced FROM
>> (SELECT id, regexp_replace(marc,E'^(?:(?!<datafield[^>]+?tag="856").)*(<datafield[^>]+?tag="856".+?<subfield
>> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>').*$,'\1','g')
>> AS replaced FROM biblio.record_entry WHERE marc ~
>> '<datafield[^>]+tag="856"(?:(?!</datafield>).)*</datafield>')
>> WHERE length(replaced) > 0
>>
>> That should give you all the fields that were replaced and the ids for
>> their records.  You can then use further selects to make sure you are only
>> replacing what you think you are replacing.  For instance, you can look for
>> data that was coded improperly.  Perhaps there is a link with an HCC in 9
>> that isn't really an HCC link.
>>
>> I'm not sure what determines if it is HCC, maybe its the port or the
>> aid=#?
>>
>> Something like
>>
>> SELECT id, regexp_replace(replaced, E'^(?:(?!<subfield
>> code="u">).)*([^<]*)</subfield>','\1','g') AS url FROM
>>  (SELECT id, replaced FROM
>> (SELECT id, regexp_replace(marc,E'^(?:(?!<datafield[^>]+?tag="856").)*(<datafield[^>]+?tag="856".+?<subfield
>> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>').*$,'\1','g')
>> AS replaced FROM biblio.record_entry WHERE marc ~
>> '<datafield[^>]+tag="856"(?:(?!</datafield>).)*</datafield>')
>> WHERE length(replaced) > 0))
>>
>> WHERE url !~ '.*aid=9206.*'
>>
>> That should give you urls you are replacing that do not have aid=9206.
>>  I'm not sure what data you need to check for, but the idea is to not
>> replace data that has been entered incorrectly.   While it might have HCC
>> in subfield 9, that could be there incorrectly, in which case you will be
>> eliminating a useful URL that has been coded incorrectly. There are other
>> things to check for. I'm not sure if there are limitations on the data's
>> use of subfield 9.  However it might be repeated in a url indicating that a
>> link belongs to more than one shortname.  In which case the second subfield
>> would get eaten by the negative lookahead, and the 856 tag would be removed
>> when it may still be valid for another shortname.
>>
>> Liam
>>
>>
>> On 2013-04-02, at 11:32 PM, Liam Whalen <whalen.ld at gmail.com> wrote:
>>
>> For the first SELECT you can use something like SELECT
>> regexp_replace(marc,E'<datafield[^>]+?tag="856".+?<subfield
>> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>','','g') FROM
>> biblio.record_entry WHERE marc ~
>> '<datafield[^>]+tag="856"(?:(?!</datafield>).)*</datafield>'
>>
>> That should give you your datafields without HCC.  If you can search
>> through the results for HCC and find any examples of HCC left, then you can
>> improve the regex to catch them.  Or opt for another non-regex route.  If
>> there are no HCC's left, then check to make sure you aren't missing some
>> due to something else in the WHERE regex.
>>
>> SELECT no_hcc FROM
>> (SELECT  regexp_replace(marc,E'<datafield[^>]+?tag="856".+?<subfield
>> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>','','g') AS
>> no_hcc FROM biblio.record_entry WHERE marc ~
>> '<datafield[^>]+tag="856"(?:(?!</datafield>).)*</datafield>' )
>> WHERE no_hcc ~* <subfield code="9">HCC</subfield>'
>>
>> Its been a while since I've written that type SQL, and I don't have an
>> Evergreen database to test it on.  So, that may not work exactly as
>> written, but that is the general idea on how to double check mass db
>> updates.  The idea with the last WHERE clause is to look for the least
>> amount of data that could catch the HCC subfield.  If you are willing to
>> sort through some false positives, it might be worthwhile changing the
>> where to '>HCC<' or even 'HCC', just to make sure you are getting
>> everything.  The idea being to look for malformed MARCXML.  As well, I put
>> a case insensitive regex search (~*) in the last WHERE clause. If you are
>> sure all the HCCs are upper case you will not need that.
>>
>> Liam
>>
>>
>> On 2013-04-02, at 10:20 PM, Liam Whalen <whalen.ld at gmail.com> wrote:
>>
>> You could also use negative lookahead to capture the 856 fields where
>> subfield 9 is not at the end.
>>
>> regexp_replace(marc,E'<datafield[^>]+?tag="856".+?<subfield
>> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>','','g')
>>
>> I did not test this much either, but it does pick up only your HCC fields
>> in the data you provided below.  That being said, I was always surprised
>> when using regex on large sets of data.  A good way to test this is to run
>> it as a SELECT first to try and pull records with that regex, then grep
>> through them to make sure you are only getting rid of the data you want.
>>  Then for completeness, do another select for 856 fields with a WHERE
>> clause using your regex to exclude the items you just verified for
>> completeness, and grep through those results to make sure you are not
>> missing any 9 = HCC fields.
>>
>> Liam
>>
>>
>> On 2013-04-02, at 10:14 AM, Mike Rylander <mrylander at gmail.com> wrote:
>>
>> On Tue, Apr 2, 2013 at 9:43 AM, Tim Spindler <tjspindler at gmail.com>wrote:
>>
>>> This is the regex I was trying but in testing it wasn't working.
>>>
>>> regexp_replace(marc,E'<datafield[^>]*?tag="856"+?<subfield
>>> code="9">HCC</subfield></datafield>','','g')
>>>
>>>
>> Perhaps something like: regexp_replace(marc,E'<datafield[^>]+?tag="856".+
>> ?<subfield code="9">HCC</subfield></datafield>','','g')
>>
>> As long as subfield 9 is at the end of all the datafields, that should
>> work.  It's not heavily tested.
>>
>> --miker
>>
>>
>>> On Tue, Apr 2, 2013 at 9:41 AM, Tim Spindler <tjspindler at gmail.com>wrote:
>>>
>>>> Mike,
>>>>
>>>> For the batch update, it is about 6000 records with bucket, I didn't
>>>> think Evergreen could handle that many in a bucket.
>>>>
>>>> Tim
>>>>
>>>>
>>>> On Tue, Apr 2, 2013 at 9:20 AM, Mike Rylander <mrylander at gmail.com>wrote:
>>>>
>>>>> If your removal code is regexp-based, you'll just need to match on HCC
>>>>> in the middle of the "find for removal" regexp.  Alternatively, you could
>>>>> apply the batch MARC update function to a bucket or CSV file full of the
>>>>> records.  That has the ability to modify records conditionally, based on
>>>>> subfield content within a datafield.
>>>>>
>>>>> --miker
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Apr 2, 2013 at 9:12 AM, Tim Spindler <tjspindler at gmail.com>wrote:
>>>>>
>>>>>> We have a library that wants to reload its Films on Demand records.
>>>>>> These records have multiple 856s for other institutions with the subfield 9
>>>>>> associated with their shortname.  What would the best way be to remove
>>>>>> these for instance.   I have created a list of the records needing updating
>>>>>> in the staging table.  Here is an example of the of he 856.  I have code
>>>>>> that can remove all the 856s but am having trouble removing just one.  We
>>>>>> just want to remove the one for HCC (although I'll take suggests for
>>>>>> removing the duplicate QCC lines also)
>>>>>>
>>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="y">Click here
>>>>>> to access this film online (MCLA students, faculty, and staff
>>>>>> only)</subfield><subfield code="u">
>>>>>> http://libproxy.mcla.edu:2048/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=11580&amp;xtid=152<http://libproxy.mcla.edu:2048/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=11580&xtid=152></subfield><subfield
>>>>>> code="9">MCLA</subfield></datafield>
>>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="u">
>>>>>> http://ezhc.ez.cwmars.org:4000/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=9206&amp;xtid=152<http://ezhc.ez.cwmars.org:4000/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=9206&xtid=152></subfield><subfield
>>>>>> code="y">Click here to access this film online (HCC Students and Staff
>>>>>> only)</subfield><subfield code="9">HCC</subfield></datafield>
>>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="y">Click here
>>>>>> to access this film online (QCC studetns, faculty, &amp; staff
>>>>>> only)</subfield><subfield code="u">
>>>>>> http://ezqc.ez.cwmars.org:5200/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=19277&amp;xtid=152<http://ezqc.ez.cwmars.org:5200/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=19277&xtid=152></subfield><subfield
>>>>>> code="9">QCC</subfield></datafield>
>>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="u">
>>>>>> http://ezhc.ez.cwmars.org:4000/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=9206&amp;xtid=152<http://ezhc.ez.cwmars.org:4000/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=9206&xtid=152></subfield><subfield
>>>>>> code="y">Click here to access this film online (HCC Students and Staff
>>>>>> only)</subfield><subfield code="9">HCC</subfield></datafield>
>>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="y">Click here
>>>>>> to access this film online (QCC studetns, faculty, &amp; staff
>>>>>> only)</subfield><subfield code="u">
>>>>>> http://ezqc.ez.cwmars.org:5200/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=19277&amp;xtid=152<http://ezqc.ez.cwmars.org:5200/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=19277&xtid=152></subfield><subfield
>>>>>> code="9">QCC</subfield></datafield>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Tim Spindler
>>>>>> tjspindler at gmail.com
>>>>>>
>>>>>> *P**   Go Green - **Save a tree! Please don't print this e-mail
>>>>>> unless it's really necessary.*
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mike Rylander
>>>>>  | Director of Research and Development
>>>>>  | Equinox Software, Inc. / Your Library's Guide to Open Source
>>>>>  | phone:  1-877-OPEN-ILS (673-6457)
>>>>>  | email:  miker at esilibrary.com
>>>>>  | web:  http://www.esilibrary.com
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Tim Spindler
>>>> tjspindler at gmail.com
>>>>
>>>> *P**   Go Green - **Save a tree! Please don't print this e-mail unless
>>>> it's really necessary.*
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Tim Spindler
>>> tjspindler at gmail.com
>>>
>>> *P**   Go Green - **Save a tree! Please don't print this e-mail unless
>>> it's really necessary.*
>>>
>>>
>>>
>>
>>
>>
>> --
>> Mike Rylander
>>  | Director of Research and Development
>>  | Equinox Software, Inc. / Your Library's Guide to Open Source
>>  | phone:  1-877-OPEN-ILS (673-6457)
>>  | email:  miker at esilibrary.com
>>  | web:  http://www.esilibrary.com
>>
>>
>>
>>
>>
>>
>
>
> --
> Tim Spindler
> tjspindler at gmail.com
>
> *P**   Go Green - **Save a tree! Please don't print this e-mail unless
> it's really necessary.*
>
>
>
>


-- 
Tim Spindler
tjspindler at gmail.com

*P**   Go Green - **Save a tree! Please don't print this e-mail unless it's
really necessary.*
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20130403/b4b41bd5/attachment-0001.htm>


More information about the Open-ils-dev mailing list