[OPEN-ILS-DEV] Re: Link Checker Staff Client patch

Paul Hoffman paul at flo.org
Fri Dec 2 15:26:10 EST 2011


On Thu, Dec 01, 2011 at 11:10:44AM -0500, Whalen, Liam wrote:
> I thought robots.txt was meant to be checked by web crawlers?

D'oh!  Yes, of course, robots.txt is irrelevant.  Sorry, I don't know 
what I was thinking.

> > and you really should randomize the order in which you check links 
> > so that you don't end up unknowingly burying a server in a flurry of 
> > requests.  
> 
> This is a good point of which I hadn't thought.  Doing this would add 
> a fair bit of size to the database because the URLs to be checked 
> would need to be stored in the database.  [...]

> > (This last point is where I've been burned in the past.) Also, I 
> > assume (blithely!) that there are already plenty of good link 
> > checkers out there -- you could even use something as simple as curl 
> > or wget with the proper options.
> 
> I'm using Perl's HTTP::Request package to do the link checking.  It
> would be fairly straightforward to separate the searching of the URLs 
> from the checking.  I could add a new method to the LinkChecker.pm 
> that would harvest the URLs, then modify the current checking code to 
> loop over that data.  I'm not sure if having the list of URLs stored 
> permanently in a separate table is entirely worthwhile though.  The 
> data already exists in the MARC records.  Storing it again is 
> duplication.  Storing it to sort it is another matter. Perhaps the 
> best option would be to store it, sort it, check the URLs, then delete 
> the data.

Could you ORDER BY RANDOM() (RAND() in MySQL) or some such?  That way you 
wouldn't need to store the URLs at all (unless I'm missing something).
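
To make the random-ordering idea concrete, here is a minimal sketch, not 
the LinkChecker.pm code.  It uses Python with an in-memory SQLite database 
as a stand-in for PostgreSQL (both spell the function RANDOM()).  The 
table name record_url, its columns, and the sample URLs are all 
hypothetical; the real query would pull the URLs out of the MARC records.

```python
import sqlite3

# Hypothetical sample data: (record_id, url) pairs. In practice these
# would be harvested from the MARC records, not typed in by hand.
SAMPLE = [
    (1, "http://a.example.org/one"),
    (2, "http://a.example.org/two"),
    (3, "http://b.example.org/one"),
    (4, "http://b.example.org/two"),
    (5, "http://c.example.org/one"),
]

# In-memory SQLite stands in for the real database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record_url (record_id INTEGER, url TEXT)")
conn.executemany("INSERT INTO record_url VALUES (?, ?)", SAMPLE)


def urls_in_random_order(conn):
    """Return every (record_id, url) row, shuffled by the database."""
    cursor = conn.execute(
        "SELECT record_id, url FROM record_url ORDER BY RANDOM()"
    )
    return cursor.fetchall()


rows = urls_in_random_order(conn)
for record_id, url in rows:
    # A real checker would issue the HTTP request here.
    print(record_id, url)
```

Nothing extra is stored: the shuffle happens in the query itself.  Random 
order only makes a burst against one host less likely; it does not rule 
one out, so a short pause between requests to the same host would 
probably still be worth adding.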

> That wasn't harsh at all.  Thanks for your input!

Whew!  You're very welcome.

Paul.

-- 
Paul Hoffman <paul at flo.org>
Systems Librarian
Fenway Libraries Online
c/o Wentworth Institute of Technology
550 Huntington Ave.
Boston, MA 02115
(617) 445-2914
(617) 442-2384 (FLO main number)

