<div dir="ltr">The LONG_STRING sometimes contains a word, but it's usually just a string of numbers repeated, like this: $_78110$[$_78110$, $_78110$$_78110$), $_78110$]$_78110$, $_78110$$_78110$. The numbers change, which is why I suspect it's a SQL injection attempt.<br><br>I agree about blocking by IP. I didn't set the robots file crawl delay any higher, as I wanted to see what effect, if any, the initial change had during an attack.<div>-Jon</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Dec 1, 2021 at 11:27 AM Jeff Davis via Evergreen-general <<a href="mailto:evergreen-general@list.evergreen-ils.org">evergreen-general@list.evergreen-ils.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Our robots.txt file (<a href="https://catalogue.libraries.coop/robots.txt" rel="noreferrer" target="_blank">https://catalogue.libraries.coop/robots.txt</a>) <br>

throttles Googlebot and Bingbot to 60 seconds and disallows certain <br>

other crawlers entirely. So even 10 seconds seems generous to me.<br>
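<br>
As a rough illustration of how a compliant crawler would read rules like these, here is a minimal sketch using Python's standard urllib.robotparser. The rules in ROBOTS_TXT are hypothetical stand-ins written for this example (they are not copied from the file linked above), and Crawl-delay is a non-standard directive that some crawlers ignore.<br>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not the contents of any real site's file.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Bingbot
Crawl-delay: 60

User-agent: UT-Dorkbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Delay (in seconds) that a compliant Bingbot is asked to wait between requests.
bing_delay = parser.crawl_delay("Bingbot")

# Whether each agent may fetch the OPAC results page at all.
dorkbot_allowed = parser.can_fetch("UT-Dorkbot/1.0", "/eg/opac/results")
bing_allowed = parser.can_fetch("Bingbot", "/eg/opac/results")

print(bing_delay, dorkbot_allowed, bing_allowed)
```

This only models what a well-behaved client would conclude from the file; it does nothing to stop a client that never consults it.<br>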

<br>

Of course, robots.txt will only be respected by well-behaved crawlers; <br>

there's nothing preventing a bot from ignoring it (in which case, as <br>

Jason says, your best bet may be to block the offending IP).<br>
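<br>
To decide which addresses are worth blocking, one option is to tally requests per client IP and user agent from the web server's access log. The sketch below assumes Apache's "combined" log format with the client IP as the first field; the access-log line quoted further down in this thread is truncated and does not show that field, so the sample lines and addresses here are hypothetical placeholders.<br>

```python
import re
from collections import Counter

# Assumes Apache "combined" log format. Captures: client IP, status, user agent.
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"$'
)

def tally_hits(lines):
    """Count requests per (client IP, user agent), skipping unparseable lines."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            ip, _status, agent = match.groups()
            hits[(ip, agent)] += 1
    return hits

# Hypothetical sample lines; the addresses are documentation-range placeholders.
SAMPLE = [
    '203.0.113.7 - - [30/Nov/2021:14:05:01 -0800] "GET /eg/opac/results?query=1 HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"',
    '203.0.113.7 - - [30/Nov/2021:14:05:02 -0800] "GET /eg/opac/results?query=2 HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"',
    '198.51.100.23 - - [30/Nov/2021:14:05:03 -0800] "GET /eg/opac/home HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    'not a log line',
]

# Print the busiest (IP, agent) pairs first.
for (ip, agent), count in tally_hits(SAMPLE).most_common():
    print(count, ip, agent)
```

Keep in mind that the user agent string is supplied by the client and can be spoofed, so the IP column is the more reliable basis for a block.<br>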

<br>

Is the "LONG_STRING" in your examples a legitimate search -- i.e., no <br>

unusual characters or obvious SQL injection attempts? Does it contain <br>

complex nesting of search terms?<br>

<br>

Jeff<br>

<br>

<br>

On 2021-11-30 6:34 p.m., JonGeorg SageLibrary via Evergreen-general wrote:<br>

> Question. We've been getting hammered by search engine bots [?], but <br>

> they seem to all query our system at the same time. Enough that it's <br>

> crashing the app servers. We have a robots.txt file in place. I've <br>

> increased the crawl delay from 3 to 10 seconds, and have <br>

> explicitly disallowed the specific bots, but I've seen no change from <br>

> the worst offenders - Bingbot and UT-Dorkbot. We had over 4k hits from <br>

> Dorkbot alone from 2pm-5pm today, and over 5k from Bingbot in the same <br>

> timeframe, all a couple of hours after I made the changes to the robots <br>

> file and restarted Apache services. Out of 100k entries in the <br>

> vhosts files in that time frame, that doesn't sound like a lot, but the rest <br>

> of the traffic looks normal. This issue has been happening <br>

> intermittently [last 3 are 11/30, 11/3, 7/20] for a while, and the only <br>

> thing that seems to work is to manually kill the services on the DB <br>

> servers and restart services on the application servers.<br>

> <br>

> The symptom is an immediate spike in the Database CPU load. I start <br>

> killing all queries older than 2 minutes, but it still usually <br>

> overwhelms the system causing the app servers to stop serving requests. <br>

> The stuck queries are almost always ones along the lines of:<br>

> <br>

> -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords <br>

> from_metarecord(*/BIB_RECORD#/*) core_limit(100000) <br>

> badge_orgs(1,138,151) estimation_strategy(inclusion) skip_check(0) <br>

> check_limit(1000) sort(1) filter_group_entry(1) 1 <br>

> site(*/LIBRARY_BRANCH/*) depth(2) +<br>

> | | WITH w AS (<br>

> | | WITH */STRING/*_keyword_xq AS (SELECT +<br>

> | | (to_tsquery('english_nostop', <br>

> COALESCE(NULLIF( '(' || <br>

> btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')), <br>

> */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) || <br>

> to_tsquery('simple', COALESCE(NULLIF( '(' || <br>

> btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')), <br>

> */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS <br>

> tsq,+<br>

> | | (to_tsquery('english_nostop', <br>

> COALESCE(NULLIF( '(' || <br>

> btrim(regexp_replace(split_date_range(search_normalize<br>

> 00:02:17.319491 | */STRING/* |<br>

> <br>

> And the queries by DorkBot look like they could be starting the query <br>

> since it's using the basket function in the OPAC.<br>

> <br>

> "GET <br>

> /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1 <br>

> HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"<br>

> <br>

> I've anonymized the output just to be cautious. Reports are run off the <br>

> backup database server, so it cannot be an auto-generated report, and it <br>

> doesn't happen often enough for that either. At this point I'm tempted <br>

> to block the IP addresses. What strategies are you all using to deal <br>

> with crawlers, and does anyone have an idea what is causing this?<br>

> -Jon<br>

> <br>

> _______________________________________________<br>

> Evergreen-general mailing list<br>

> <a href="mailto:Evergreen-general@list.evergreen-ils.org" target="_blank">Evergreen-general@list.evergreen-ils.org</a><br>

> <a href="http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general" rel="noreferrer" target="_blank">http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general</a><br>

> <br>


</blockquote></div>