[OPEN-ILS-DEV] ***SPAM*** Re: Re: Re: Removing URLS from 856

Tim Spindler tjspindler at gmail.com
Wed Apr 3 09:13:18 EDT 2013


Thanks fro all the replies.  I did get the following regex to work

regexp_replace(marc,E'<datafield tag="856" ind1="4" ind2="0"><subfield
code="u">
http://ezhc.ez.cwmars.org:4000/login.url=http://digital.films.com/PortalPlaylists.aspx.aid=9206.+?HCC
</subfield></datafield>','','g')

The url is unique to Films on Demand  for HCC so that seems to work well to
identify the beginning part of the 856 and then the HCC part identifies the
end.




On Wed, Apr 3, 2013 at 6:51 AM, Liam Whalen <whalen.ld at gmail.com> wrote:

> Sorry for the chain replying to my own emails.  But, that code is not
> exactly right.  using matches between ^ $ will not get you what you want
> exactly, that will only return the first occurrence of 856 replaced in a
> record.  It has been over a year since I've done stuff like this, so I'm
> sorry if the code is not very usable.  However, the code wasn't really my
> main point.  What I've learned using regex to replace data is I expect
> regex to get rid of useful data because there is no way to predict the
> outliers in the data.  Almost definitely, there will be some 856s fields
> that do not conform to typical pattern in your data, and regex may remove
> them, in which case you will lose useful data.
>
> Liam
> On 2013-04-03, at 6:35 AM, Liam Whalen <whalen.ld at gmail.com> wrote:
>
> I forgot to mention checking the data that was replaced:
>
> SELECT id, replaced FROM
> (SELECT id, regexp_replace(marc,E'^(?:(?!<datafield[^>]+?tag="856").)*(<datafield[^>]+?tag="856".+?<subfield
> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>').*$,'\1','g')
> AS replaced FROM biblio.record_entry WHERE marc ~
> '<datafield[^>]+tag="856"(?:(?!</datafield>).)*</datafield>')
> WHERE length(replaced) > 0
>
> That should give you all the fields that were replaced and the ids for
> their records.  You can then use further selects to make sure you are only
> replacing what you think you are replacing.  For instance, you can look for
> data that was coded improperly.  Perhaps there is a link with an HCC in 9
> that isn't really an HCC link.
>
> I'm not sure what determines if it is HCC, maybe its the port or the aid=#?
>
> Something like
>
> SELECT id, regexp_replace(replaced, E'^(?:(?!<subfield
> code="u">).)*([^<]*)</subfield>','\1','g') AS url FROM
> (SELECT id, replaced FROM
> (SELECT id, regexp_replace(marc,E'^(?:(?!<datafield[^>]+?tag="856").)*(<datafield[^>]+?tag="856".+?<subfield
> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>').*$,'\1','g')
> AS replaced FROM biblio.record_entry WHERE marc ~
> '<datafield[^>]+tag="856"(?:(?!</datafield>).)*</datafield>')
> WHERE length(replaced) > 0))
>
> WHERE url !~ '.*aid=9206.*'
>
> That should give you urls you are replacing that do not have aid=9206.
>  I'm not sure what data you need to check for, but the idea is to not
> replace data that has been entered incorrectly.   While it might have HCC
> in subfield 9, that could be there incorrectly, in which case you will be
> eliminating a useful URL that has been coded incorrectly. There are other
> things to check for. I'm not sure if there are limitations on the data's
> use of subfield 9.  However it might be repeated in a url indicating that a
> link belongs to more than one shortname.  In which case the second subfield
> would get eaten by the negative lookahead, and the 856 tag would be removed
> when it may still be valid for another shortname.
>
> Liam
>
>
> On 2013-04-02, at 11:32 PM, Liam Whalen <whalen.ld at gmail.com> wrote:
>
> For the first SELECT you can use something like SELECT
> regexp_replace(marc,E'<datafield[^>]+?tag="856".+?<subfield
> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>','','g') FROM
> biblio.record_entry WHERE marc ~
> '<datafield[^>]+tag="856"(?:(?!</datafield>).)*</datafield>'
>
> That should give you your datafields without HCC.  If you can search
> through the results for HCC and find any examples of HCC left, then you can
> improve the regex to catch them.  Or opt for another non-regex route.  If
> there are no HCC's left, then check to make sure you aren't missing some
> due to something else in the WHERE regex.
>
> SELECT no_hcc FROM
> (SELECT  regexp_replace(marc,E'<datafield[^>]+?tag="856".+?<subfield
> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>','','g') AS
> no_hcc FROM biblio.record_entry WHERE marc ~
> '<datafield[^>]+tag="856"(?:(?!</datafield>).)*</datafield>' )
> WHERE no_hcc ~* <subfield code="9">HCC</subfield>'
>
> Its been a while since I've written that type SQL, and I don't have an
> Evergreen database to test it on.  So, that may not work exactly as
> written, but that is the general idea on how to double check mass db
> updates.  The idea with the last WHERE clause is to look for the least
> amount of data that could catch the HCC subfield.  If you are willing to
> sort through some false positives, it might be worthwhile changing the
> where to '>HCC<' or even 'HCC', just to make sure you are getting
> everything.  The idea being to look for malformed MARCXML.  As well, I put
> a case insensitive regex search (~*) in the last WHERE clause. If you are
> sure all the HCCs are upper case you will not need that.
>
> Liam
>
>
> On 2013-04-02, at 10:20 PM, Liam Whalen <whalen.ld at gmail.com> wrote:
>
> You could also use negative lookahead to capture the 856 fields where
> subfield 9 is not at the end.
>
> regexp_replace(marc,E'<datafield[^>]+?tag="856".+?<subfield
> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>','','g')
>
> I did not test this much either, but it does pick up only your HCC fields
> in the data you provided below.  That being said, I was always surprised
> when using regex on large sets of data.  A good way to test this is to run
> it as a SELECT first to try and pull records with that regex, then grep
> through them to make sure you are only getting rid of the data you want.
>  Then for completeness, do another select for 856 fields with a WHERE
> clause using your regex to exclude the items you just verified for
> completeness, and grep through those results to make sure you are not
> missing any 9 = HCC fields.
>
> Liam
>
>
> On 2013-04-02, at 10:14 AM, Mike Rylander <mrylander at gmail.com> wrote:
>
> On Tue, Apr 2, 2013 at 9:43 AM, Tim Spindler <tjspindler at gmail.com> wrote:
>
>> This is the regex I was trying but in testing it wasn't working.
>>
>> regexp_replace(marc,E'<datafield[^>]*?tag="856"+?<subfield
>> code="9">HCC</subfield></datafield>','','g')
>>
>>
> Perhaps something like: regexp_replace(marc,E'<datafield[^>]+?tag="856".+
> ?<subfield code="9">HCC</subfield></datafield>','','g')
>
> As long as subfield 9 is at the end of all the datafields, that should
> work.  It's not heavily tested.
>
> --miker
>
>
>> On Tue, Apr 2, 2013 at 9:41 AM, Tim Spindler <tjspindler at gmail.com>wrote:
>>
>>> Mike,
>>>
>>> For the batch update, it is about 6000 records with bucket, I didn't
>>> think Evergreen could handle that many in a bucket.
>>>
>>> Tim
>>>
>>>
>>> On Tue, Apr 2, 2013 at 9:20 AM, Mike Rylander <mrylander at gmail.com>wrote:
>>>
>>>> If your removal code is regexp-based, you'll just need to match on HCC
>>>> in the middle of the "find for removal" regexp.  Alternatively, you could
>>>> apply the batch MARC update function to a bucket or CSV file full of the
>>>> records.  That has the ability to modify records conditionally, based on
>>>> subfield content within a datafield.
>>>>
>>>> --miker
>>>>
>>>>
>>>>
>>>> On Tue, Apr 2, 2013 at 9:12 AM, Tim Spindler <tjspindler at gmail.com>wrote:
>>>>
>>>>> We have a library that wants to reload its Films on Demand records.
>>>>> These records have multiple 856s for other institutions with the subfield 9
>>>>> associated with their shortname.  What would the best way be to remove
>>>>> these for instance.   I have created a list of the records needing updating
>>>>> in the staging table.  Here is an example of the of he 856.  I have code
>>>>> that can remove all the 856s but am having trouble removing just one.  We
>>>>> just want to remove the one for HCC (although I'll take suggests for
>>>>> removing the duplicate QCC lines also)
>>>>>
>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="y">Click here
>>>>> to access this film online (MCLA students, faculty, and staff
>>>>> only)</subfield><subfield code="u">
>>>>> http://libproxy.mcla.edu:2048/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=11580&amp;xtid=152<http://libproxy.mcla.edu:2048/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=11580&xtid=152></subfield><subfield
>>>>> code="9">MCLA</subfield></datafield>
>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="u">
>>>>> http://ezhc.ez.cwmars.org:4000/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=9206&amp;xtid=152<http://ezhc.ez.cwmars.org:4000/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=9206&xtid=152></subfield><subfield
>>>>> code="y">Click here to access this film online (HCC Students and Staff
>>>>> only)</subfield><subfield code="9">HCC</subfield></datafield>
>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="y">Click here
>>>>> to access this film online (QCC studetns, faculty, &amp; staff
>>>>> only)</subfield><subfield code="u">
>>>>> http://ezqc.ez.cwmars.org:5200/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=19277&amp;xtid=152<http://ezqc.ez.cwmars.org:5200/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=19277&xtid=152></subfield><subfield
>>>>> code="9">QCC</subfield></datafield>
>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="u">
>>>>> http://ezhc.ez.cwmars.org:4000/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=9206&amp;xtid=152<http://ezhc.ez.cwmars.org:4000/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=9206&xtid=152></subfield><subfield
>>>>> code="y">Click here to access this film online (HCC Students and Staff
>>>>> only)</subfield><subfield code="9">HCC</subfield></datafield>
>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="y">Click here
>>>>> to access this film online (QCC studetns, faculty, &amp; staff
>>>>> only)</subfield><subfield code="u">
>>>>> http://ezqc.ez.cwmars.org:5200/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=19277&amp;xtid=152<http://ezqc.ez.cwmars.org:5200/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=19277&xtid=152></subfield><subfield
>>>>> code="9">QCC</subfield></datafield>
>>>>>
>>>>>
>>>>> --
>>>>> Tim Spindler
>>>>> tjspindler at gmail.com
>>>>>
>>>>> *P**   Go Green - **Save a tree! Please don't print this e-mail
>>>>> unless it's really necessary.*
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Mike Rylander
>>>>  | Director of Research and Development
>>>>  | Equinox Software, Inc. / Your Library's Guide to Open Source
>>>>  | phone:  1-877-OPEN-ILS (673-6457)
>>>>  | email:  miker at esilibrary.com
>>>>  | web:  http://www.esilibrary.com
>>>>
>>>
>>>
>>>
>>> --
>>> Tim Spindler
>>> tjspindler at gmail.com
>>>
>>> *P**   Go Green - **Save a tree! Please don't print this e-mail unless
>>> it's really necessary.*
>>>
>>>
>>>
>>
>>
>>
>> --
>> Tim Spindler
>> tjspindler at gmail.com
>>
>> *P**   Go Green - **Save a tree! Please don't print this e-mail unless
>> it's really necessary.*
>>
>>
>>
>
>
>
> --
> Mike Rylander
>  | Director of Research and Development
>  | Equinox Software, Inc. / Your Library's Guide to Open Source
>  | phone:  1-877-OPEN-ILS (673-6457)
>  | email:  miker at esilibrary.com
>  | web:  http://www.esilibrary.com
>
>
>
>
>
>


-- 
Tim Spindler
tjspindler at gmail.com

*P**   Go Green - **Save a tree! Please don't print this e-mail unless it's
really necessary.*
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20130403/f4372c46/attachment-0001.htm>


More information about the Open-ils-dev mailing list