[OPEN-ILS-DEV] Re: Link Checker Staff Client patch
Paul Hoffman
paul at flo.org
Fri Dec 2 15:26:10 EST 2011
On Thu, Dec 01, 2011 at 11:10:44AM -0500, Whalen, Liam wrote:
> I thought robots.txt was meant to be checked by web crawlers?
D'oh! Yes, of course, robots.txt is irrelevant. Sorry, I don't know
what I was thinking.
> > and you really should randomize the order in which you check links
> > so that you don't end up unknowingly burying a server in a flurry of
> > requests.
>
> This is a good point of which I hadn't thought. Doing this would add
> a fair bit of size to the database because the URLs to be checked
> would need to be stored in the database. [...]
> > (This last point is where I've been burned in the past.) Also, I
> > assume (blithely!) that there are already plenty of good link
> > checkers out there -- you could even use something as simple as curl
> > or wget with the proper options.
>
> I'm using Perl's HTTP::Request package to do the link checking. It
> would be fairly straightforward to separate the searching of the URLs
> from the checking. I could add a new method to the LinkChecker.pm
> that would harvest the URLs, then modify the current checking code to
> loop over that data. I'm not sure if having the list of URLs stored
> permanently in a separate table is entirely worthwhile, though. The
> data already exists in the MARC records. Storing it again is
> duplication. Storing it to sort it is another matter. Perhaps the
> best option would be to store it, sort it, check the URLs, then delete
> the data.
Could you ORDER BY RANDOM() or some such? That way you wouldn't need to
store the URLs at all (unless I'm missing something).
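Something along these lines, maybe -- completely untested, and I'm only
guessing at the schema (biblio.record_entry, its marc and deleted
columns) and that the URLs live in 856$u; none of it is meant to
reflect what your LinkChecker.pm actually does:

    #!/usr/bin/perl
    # Rough sketch only: the table, column, and connection details below
    # are guesses, not anything taken from the actual patch.
    use strict;
    use warnings;
    use DBI;
    use MARC::Record;
    use MARC::File::XML ( BinaryEncoding => 'UTF-8' );
    use LWP::UserAgent;
    use HTTP::Request;

    my $dbh = DBI->connect('dbi:Pg:dbname=evergreen', 'evergreen', '',
                           { RaiseError => 1 });
    my $ua  = LWP::UserAgent->new( timeout => 15 );

    # Let the database hand the records back in random order, so
    # consecutive checks rarely land on the same remote host.
    my $sth = $dbh->prepare(q{
        SELECT id, marc
        FROM   biblio.record_entry
        WHERE  NOT deleted
        ORDER  BY RANDOM()
    });
    $sth->execute;

    while (my ($id, $xml) = $sth->fetchrow_array) {
        my $record = MARC::Record->new_from_xml($xml);
        for my $field ($record->field('856')) {
            for my $url ($field->subfield('u')) {
                # A HEAD request is enough to see whether the link resolves.
                my $resp = $ua->request(HTTP::Request->new(HEAD => $url));
                printf "%s\t%s\t%s\n", $id, $resp->code, $url;
            }
        }
    }

A HEAD request keeps the traffic light, and letting the database do the
shuffling means there's nothing to store, sort, or delete afterward.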
> That wasn't harsh at all. Thanks for your input!
Whew! You're very welcome.
Paul.
--
Paul Hoffman <paul at flo.org>
Systems Librarian
Fenway Libraries Online
c/o Wentworth Institute of Technology
550 Huntington Ave.
Boston, MA 02115
(617) 445-2914
(617) 442-2384 (FLO main number)