[OPEN-ILS-DEV] Link Checker Staff Client patch

Whalen, Liam Liam.Whalen at NRCan-RNCan.gc.ca
Thu Dec 1 11:10:44 EST 2011


> -----Original Message-----
> From: open-ils-dev-bounces at list.georgialibraries.org 
> [mailto:open-ils-dev-bounces at list.georgialibraries.org] On 
> Behalf Of Paul Hoffman
> Sent: December 1, 2011 10:37
> To: open-ils-dev at list.georgialibraries.org
> Subject: Re: [OPEN-ILS-DEV] Link Checker Staff Client patch
> 
> I can see that a lot of effort has gone into this patch.  I 
> hope you won't mind my butting in to comment on this -- we're 
> not running Evergreen (yet), and I haven't read your code 
> closely, but I've run into problems caused by link checkers 
> in the past and have some thoughts on how you might rethink 
> things in order to avoid adding to the problem.
> 
> In my opinion, it's much better to separate the code that "harvests" 
> URLs to check from the code that actually performs the checking.
> 
> Why?  First and foremost, link checking is tricky -- it's 
> best done a little at a time during off-peak hours (whatever 
> those might be!); you have to (or *should*) consult 
> robots.txt once a day (or so) for each host; 

I thought robots.txt was meant to be checked by web crawlers?
Is there a difference between someone manually going to each link
and verifying that it's active and having a machine do it?  I will
have to read more about robots.txt.
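
From a quick look at libwww-perl, something like LWP::RobotUA could
handle this on top of HTTP::Request: it fetches and caches robots.txt
for each host and refuses requests the rules disallow.  A rough sketch,
where the agent name, contact address, and URL are all placeholders
rather than anything from the patch:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTTP::Request;

# LWP::RobotUA consults each host's robots.txt before requesting a URL
# and enforces a minimum delay between requests to the same host.
my $ua = LWP::RobotUA->new(
    agent => 'EvergreenLinkChecker/0.1',   # placeholder agent name
    from  => 'linkchecker@example.org',    # placeholder contact address
);
$ua->delay(1/6);    # delay() is in minutes: wait ~10 seconds per host
$ua->timeout(15);

my $url = 'http://example.com/ebook';      # placeholder URL
my $res = $ua->request(HTTP::Request->new(HEAD => $url));
if ($res->code == 403 && $res->message =~ /robots/i) {
    print "$url is blocked by robots.txt\n";
} else {
    printf "%s => %s\n", $url, $res->status_line;
}
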
I suppose if I am hitting a large number of links from the same host,
then this is an issue, which leads me to your next point.

> and you really 
> should randomize the order in which you check links so that 
> you don't end up unknowingly burying a server in a flurry of 
> requests.  

This is a good point that I hadn't thought of.  Doing this would add a
fair bit of size to the database, because the URLs to be checked would
need to be stored in the database, and I'm not sure how to approach
that.  In our midsized collection there are close to 25,000 links that
get checked when I run the code against all of our data.  As it is, the
links are not ordered: they are returned unordered by the SQL search,
and if any pattern exists, it is most likely because similar records
were added sequentially, which is likely to occur.  I suppose the best
option here would be to order the links by domain, then sort them so
that no domain in the list follows itself, at least as far as that is
possible.
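
Something along these lines might do for the ordering, once the URLs
have been harvested: bucket them by host, then deal them out round
robin, shuffling the hosts on each pass.  The @urls list below is a
placeholder for whatever the SQL harvest returns:

#!/usr/bin/perl
use strict;
use warnings;
use URI;
use List::Util qw(shuffle);

# Placeholder data; the real list would come from the SQL harvest.
my @urls = (
    'http://host-a.example/1', 'http://host-a.example/2',
    'http://host-b.example/1', 'http://host-c.example/1',
);

# Bucket the URLs by host.
my %by_host;
push @{ $by_host{ URI->new($_)->host } }, $_ for @urls;

# Take one URL from each host per pass, shuffling the host order each
# time, so the same host is never hit twice in a row (except at the
# very end, when only one host has URLs left).
my @interleaved;
while (%by_host) {
    for my $host (shuffle keys %by_host) {
        push @interleaved, shift @{ $by_host{$host} };
        delete $by_host{$host} unless @{ $by_host{$host} };
    }
}

print "$_\n" for @interleaved;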

> (This last point is where I've been burned in the 
> past.) Also, I assume (blithely!) that there are already 
> plenty of good link checkers out there -- you could even use 
> something as simple as curl or wget with the proper options.

I'm using Perl's HTTP::Request package to do the link checking.  It
would be fairly straightforward to separate the searching for the URLs
from the checking.  I could add a new method to LinkChecker.pm that
harvests the URLs, then modify the current checking code to loop over
that data.  I'm not sure that storing the list of URLs permanently in a
separate table is entirely worthwhile, though.  The data already exists
in the MARC records, so storing it again is duplication.  Storing it in
order to sort it is another matter.  Perhaps the best option would be
to store the URLs, sort them, check them, then delete the data.
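
As a rough sketch of that split (hypothetical names, not the actual
LinkChecker.pm interface), the checking pass could look something like
this, with harvest_urls() standing in for whatever pulls the URLs out
of the MARC data:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new(
    agent   => 'EvergreenLinkChecker/0.1',  # placeholder agent name
    timeout => 15,
);

# Stand-in for the harvest step; the real version would run the
# existing SQL against the MARC records (or read a temporary table of
# sorted URLs) and return the list to check.
sub harvest_urls {
    return ('http://example.com/ebook', 'http://example.org/journal');
}

# Check one URL: try a cheap HEAD first, and fall back to GET for
# servers that do not implement HEAD.
sub check_url {
    my ($url) = @_;
    my $res = $ua->request(HTTP::Request->new(HEAD => $url));
    $res = $ua->request(HTTP::Request->new(GET => $url))
        if $res->code == 405 || $res->code == 501;
    return $res;
}

for my $url (harvest_urls()) {
    my $res = check_url($url);
    printf "%s => %s\n", $url, $res->status_line;
}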

> Finally, this simply gives you a lot more flexibility while 
> keeping things simpler -- the UNIX philosophy, in a nutshell.

Simpler is a good philosophy.

> Sorry if this comes across as harsh; it's meant as 
> constructive criticism.

That wasn't harsh at all.  Thanks for your input!

Liam

