[OPEN-ILS-DEV] ***SPAM*** Re: Link Checker Staff Client patch

Whalen, Liam Liam.Whalen at NRCan-RNCan.gc.ca
Mon Dec 5 17:09:54 EST 2011


> -----Original Message-----
> From: open-ils-dev-bounces at list.georgialibraries.org
> [mailto:open-ils-dev-bounces at list.georgialibraries.org]
> On Behalf Of Paul Hoffman
> Sent: December 2, 2011 15:26
> To: open-ils-dev at list.georgialibraries.org
> Subject: [OPEN-ILS-DEV] ***SPAM*** Re: Link Checker Staff Client patch
> 
> On Thu, Dec 01, 2011 at 11:10:44AM -0500, Whalen, Liam wrote:
... 
> > > and you really should randomize the order in which you check
> > > links so that you don't end up unknowingly burying a server in a
> > > flurry of requests.
> > 
> > This is a good point of which I hadn't thought.  Doing this would
> > add a fair bit of size to the database because the URLs to be
> > checked would need to be stored in the database.  [...]
> 
> > > (This last point is where I've been burned in the past.)  Also, I
> > > assume (blithely!) that there are already plenty of good link
> > > checkers out there -- you could even use something as simple as
> > > curl or wget with the proper options.
> > 
> > I'm using Perl's HTTP::Request package to do the link checking.  It
> > would be fairly straightforward to separate the searching of the
> > URLs from the checking.  I could add a new method to LinkChecker.pm
> > that would harvest the URLs, then modify the current checking code
> > to loop over that data.  I'm not sure having the list of URLs
> > stored permanently in a separate table is entirely worthwhile,
> > though.  The data already exists in the MARC records.  Storing it
> > again is duplication.  Storing it to sort it is another matter.
> > Perhaps the best option would be to store it, sort it, check the
> > URLs, then delete the data.
> 
> Could you ORDER BY RAND() or some such?  That way you wouldn't need
> to store the URLs at all (unless I'm missing something).
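
Something like that would work for the ordering; Evergreen runs on
PostgreSQL, where it would be ORDER BY random().  A rough, untested
sketch with DBI, assuming the harvested URLs are sitting somewhere
queryable (the connection settings and the link_checker_urls table
here are made up for illustration):

    use DBI;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(timeout => 10);

    # Hypothetical connection and table name, just to show the shape
    # of the query; Evergreen's real schema differs.
    my $dbh = DBI->connect('dbi:Pg:dbname=evergreen', 'evergreen', '',
                           { RaiseError => 1 });
    my $urls = $dbh->selectcol_arrayref(
        'SELECT url FROM link_checker_urls ORDER BY random()'
    );

    for my $url (@$urls) {
        # HEAD request stands in for the HTTP::Request-based check
        my $res = $ua->head($url);
        print "$url => ", $res->code, "\n";
    }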

That said, I think sorting them is the better option.  Sorting first
makes it possible to ensure that URLs with the same domain do not get
checked consecutively.  I did think of a way to do it without storing
the URLs, though, and without having to pre-sort them.  I can store
the domain of the last checked URL, and when I check a new link I can
see whether its domain matches the last checked domain.  If it does,
I'll cache the new link and move on to another one.  Once all the
URLs from the database have been processed, I can loop over the cache
doing the same domain check.  Eventually the cache will contain only
URLs that belong to a single domain, at which point I can start
pausing after each check.
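
A minimal sketch of that loop, with made-up names (check_url() stands
in for the existing HTTP::Request-based check in LinkChecker.pm, and
@urls would really come from the database):

    use strict;
    use warnings;
    use URI;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(timeout => 10);

    # Stand-in for the existing HTTP::Request-based check.
    sub check_url {
        my ($url) = @_;
        return $ua->head($url)->is_success;
    }

    sub domain_of {
        my ($url) = @_;
        return eval { URI->new($url)->host } // '';
    }

    # Hard-coded here for illustration only.
    my @urls        = ('http://example.org/a', 'http://example.com/b');
    my $last_domain = '';

    while (@urls) {
        my @cache;
        my $checked_any = 0;
        for my $url (@urls) {
            if (domain_of($url) eq $last_domain) {
                push @cache, $url;    # same domain as last check: defer
            } else {
                check_url($url);
                $last_domain = domain_of($url);
                $checked_any = 1;
            }
        }
        if (!$checked_any) {
            # Everything left shares one domain: pause between checks.
            for my $url (@cache) {
                sleep 2;
                check_url($url);
            }
            last;
        }
        @urls = @cache;    # loop over the cache with the same check
    }

The pause length and the exact comparison (full host versus registered
domain) would need some tuning, of course.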

Liam

