[Evergreen-general] Question about search engine bots & DB CPU spikes

Jeff Davis jeff.davis at bc.libraries.coop
Wed Dec 1 14:27:24 EST 2021


Our robots.txt file (https://catalogue.libraries.coop/robots.txt) 
throttles Googlebot and Bingbot to a 60-second crawl delay and disallows 
certain other crawlers entirely.  So even 10 seconds seems generous to me.
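
For anyone comparing notes, the shape of that file is roughly the 
following -- a sketch of the approach, not our exact contents (see the 
URL above for those); bot names and delays are illustrative:

```
# Slow down the big, well-behaved crawlers
User-agent: Googlebot
Crawl-delay: 60

User-agent: bingbot
Crawl-delay: 60

# Shut out a badly behaved crawler entirely
User-agent: UT-Dorkbot
Disallow: /
```

Note that Crawl-delay is a de facto extension, not part of the original 
robots exclusion standard, and not every crawler honours it.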

Of course, robots.txt will only be respected by well-behaved crawlers; 
there's nothing preventing a bot from ignoring it (in which case, as 
Jason says, your best bet may be to block the offending IP).
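
If you do go the blocking route at the web-server layer, something along 
these lines works in an Apache 2.4 vhost -- a sketch only, with 
placeholder addresses from the documentation ranges:

```
# Apache 2.4 (mod_authz_core/mod_authz_host): deny known-bad crawler IPs.
# The addresses below are examples; substitute the offenders from your logs.
<Location "/eg/opac">
    <RequireAll>
        Require all granted
        Require not ip 192.0.2.10
        Require not ip 198.51.100.0/24
    </RequireAll>
</Location>
```

Blocking at the firewall (iptables/nftables) also works and keeps the 
traffic off Apache entirely.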

Is the "LONG_STRING" in your examples a legitimate search -- i.e., no 
unusual characters or obvious SQL injection attempts?  Does it contain 
complex nesting of search terms?
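
Incidentally, if you're killing stuck queries by hand, you can do it 
from psql with pg_terminate_backend() rather than restarting services. 
A sketch against stock PostgreSQL -- the 2-minute cutoff mirrors what 
Jon describes below, and the LIKE filter (matching the bib-search 
comment prefix) is just one way to scope it:

```sql
-- Terminate active backends whose current query has run > 2 minutes.
-- Requires superuser (or pg_signal_backend membership); tune to taste.
SELECT pid,
       now() - query_start AS runtime,
       pg_terminate_backend(pid)
  FROM pg_stat_activity
 WHERE state = 'active'
   AND query_start < now() - interval '2 minutes'
   AND query LIKE '-- bib search:%';
```

That only relieves the symptom, of course, but it's gentler than 
bouncing the whole database.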

Jeff


On 2021-11-30 6:34 p.m., JonGeorg SageLibrary via Evergreen-general wrote:
> Question. We've been getting hammered by search engine bots, but 
> they seem to all query our system at the same time. Enough that it's 
> crashing the app servers. We have a robots.txt file in place. I've 
> increased the crawl delay from 3 to 10 seconds, and have 
> explicitly disallowed the specific bots, but I've seen no change from 
> the worst offenders - Bingbot and UT-Dorkbot. We had over 4k hits from 
> Dorkbot alone from 2pm-5pm today, and over 5k from Bingbot in the same 
> timeframe. All a couple hours after I made the changes to the robots 
> file and restarted apache services. Out of 100k entries in the 
> vhosts logs in that time frame that doesn't sound like a lot, and the 
> rest of the traffic looks normal. This issue has been happening 
> intermittently [last 3 are 11/30, 11/3, 7/20] for a while, and the only 
> thing that seems to work is to manually kill the services on the DB 
> servers and restart services on the application servers.
> 
> The symptom is an immediate spike in the Database CPU load. I start 
> killing all queries older than 2 minutes, but it still usually 
> overwhelms the system causing the app servers to stop serving requests. 
> The stuck queries are almost always ones along the lines of:
> 
> -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords 
> from_metarecord(*/BIB_RECORD#/*) core_limit(100000) 
> badge_orgs(1,138,151) estimation_strategy(inclusion) skip_check(0) 
> check_limit(1000) sort(1) filter_group_entry(1) 1 
> site(*/LIBRARY_BRANCH/*) depth(2)
> WITH w AS (
>   WITH */STRING/*_keyword_xq AS (SELECT
>     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' || 
>       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')), 
>       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) || 
>     to_tsquery('simple', COALESCE(NULLIF( '(' || 
>       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')), 
>       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS 
>       tsq,
>     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' || 
>       btrim(regexp_replace(split_date_range(search_normalize ...
> 
> (psql output truncated; the query had been running 00:02:17.319491 at 
> that point)
> 
> And the requests from Dorkbot look like they could be triggering these 
> queries, since they're hitting the basket function in the OPAC.
> 
> "GET 
> /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1 
> HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
> 
> I've anonymized the output just to be cautious. Reports are run off the 
> backup database server, so it can't be an auto-generated report, and it 
> doesn't happen often enough for that either. At this point I'm tempted 
> to block the IP addresses. What strategies are you all using to deal 
> with crawlers, and does anyone have an idea what is causing this?
> -Jon
> 
> _______________________________________________
> Evergreen-general mailing list
> Evergreen-general at list.evergreen-ils.org
> http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general
> 
