[Evergreen-general] Question about search engine bots & DB CPU spikes

JonGeorg SageLibrary jongeorg.sagelibrary at gmail.com
Tue Nov 30 21:34:30 EST 2021


Question. We've been getting hammered by search engine bots [?], and they
all seem to query our system at the same time, enough that it's crashing
the app servers. We have a robots.txt file in place. I've increased the
crawl delay from 3 to 10 seconds and have explicitly disallowed
the specific bots, but I've seen no change from the worst offenders,
Bingbot and UT-Dorkbot. We had over 4k hits from Dorkbot alone from 2pm-5pm
today, and over 5k from Bingbot in the same timeframe, all a couple of hours
after I changed the robots file and restarted the Apache services.
Out of 100k entries in the vhosts files in that time frame, that doesn't
sound like a lot, but the rest of the traffic looks normal. This issue has
been happening intermittently for a while [the last 3 were 11/30, 11/3, 7/20],
and the only thing that seems to work is to manually kill the services on
the DB servers and restart services on the application servers.
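
One way to confirm that a robots.txt file says what it is meant to say is to
run it through Python's standard urllib.robotparser. The sketch below is only
an illustration: the ROBOTS_TXT contents and the check_rules helper are
hypothetical and are not the file from this site. Keep in mind that
robots.txt is advisory, so a crawler that ignores it will not be stopped by
correct rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, NOT the actual file from this site.
ROBOTS_TXT = """\
User-agent: bingbot
Disallow: /

User-agent: UT-Dorkbot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /eg/opac/results
"""


def check_rules(robots_txt, agents, url):
    """Return {agent: allowed?} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["bingbot", "UT-Dorkbot/1.0", "ExampleBrowser"]
    for path in ("/eg/opac/home", "/eg/opac/results?query=1"):
        print(path, check_rules(ROBOTS_TXT, agents, path))
```

If the named bots come back as allowed for a path that should be blocked, the
rules are the problem; if they come back as disallowed and the hits continue,
the crawler is not honoring the file.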

The symptom is an immediate spike in the database CPU load. I start killing
all queries older than 2 minutes, but the load still usually overwhelms the
system, causing the app servers to stop serving requests. The stuck queries
are almost always along the lines of:

-- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
from_metarecord(*BIB_RECORD#*) core_limit(100000) badge_orgs(1,138,151)
estimation_strategy(inclusion) skip_check(0) check_limit(1000) sort(1)
filter_group_entry(1) 1 site(*LIBRARY_BRANCH*) depth(2)
WITH w AS (
WITH *STRING*_keyword_xq AS (SELECT
  (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
  btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
  *LONG_STRING*))),E'(?:\\s+|:)','&','g'),'&|')  || ')', '()'), '')) ||
  to_tsquery('simple', COALESCE(NULLIF( '(' ||
  btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
  *LONG_STRING*))),E'(?:\\s+|:)','&','g'),'&|')  || ')', '()'), ''))) AS tsq,
  (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
  btrim(regexp_replace(split_date_range(search_normalize
 00:02:17.319491 | *STRING* |
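
The "kill everything older than 2 minutes" step could be scripted roughly as
follows. This is a sketch under assumptions, not the procedure used here: it
assumes a DB-API cursor that uses %s placeholders (as psycopg2 does) and a
role allowed to cancel other sessions' queries; cancel_long_queries and
LONG_QUERY_SQL are made-up names. It relies on PostgreSQL's pg_stat_activity
view and pg_cancel_backend(), and defaults to a dry run that only lists the
candidates.

```python
# Active queries (other than this session's) running longer than N seconds.
LONG_QUERY_SQL = """
SELECT pid, now() - query_start AS runtime, left(query, 80) AS query_head
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND now() - query_start > %s * interval '1 second'
ORDER BY runtime DESC
"""


def cancel_long_queries(cursor, max_age_seconds=120, dry_run=True):
    """List, and optionally cancel, active queries older than max_age_seconds.

    `cursor` is assumed to be a DB-API cursor with %s placeholders.
    Returns the list of matching backend pids.
    """
    cursor.execute(LONG_QUERY_SQL, (max_age_seconds,))
    rows = cursor.fetchall()  # fetch everything before issuing more statements
    if not dry_run:
        for pid, _runtime, _query_head in rows:
            # pg_cancel_backend asks the backend to cancel its current query;
            # the connection itself stays open.
            cursor.execute("SELECT pg_cancel_backend(%s)", (pid,))
    return [pid for pid, _runtime, _query_head in rows]
```

Cancelling only treats the symptom: if the crawler keeps sending the same
search, new copies of the query will replace the cancelled ones.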

The requests from Dorkbot look like they could be what starts these queries,
since they use the basket function in the OPAC:

"GET /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=
*LONG_STRING*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1
HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
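
To see which crawlers account for the hits and the 500 responses in a given
window, the access log can be tallied by user agent. The sketch below assumes
each log line ends in the layout shown above (quoted request, status, size,
quoted referer, quoted user agent); the actual LogFormat on these vhosts is
not shown, and tally_agents is a hypothetical helper.

```python
import re
from collections import Counter

# Matches the tail of a log line laid out as:
#   "REQUEST" STATUS SIZE "REFERER" "USER-AGENT"
LOG_TAIL = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def tally_agents(lines):
    """Count total hits and 5xx responses per user agent.

    Lines that do not match the assumed layout are skipped.
    Returns (hits, errors) as two Counters keyed by user-agent string.
    """
    hits = Counter()
    errors = Counter()
    for line in lines:
        match = LOG_TAIL.search(line)
        if match is None:
            continue
        agent = match.group("agent")
        hits[agent] += 1
        if match.group("status").startswith("5"):
            errors[agent] += 1
    return hits, errors
```

Passing an open log file to tally_agents and printing hits.most_common(10)
gives a quick ranking of the busiest agents for that file.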

I've anonymized the output just to be cautious. Reports are run off the
backup database server, so it cannot be an auto-generated report, and it
doesn't happen often enough for that either. At this point I'm tempted to
block the IP addresses. What strategies are you all using to deal with
crawlers, and does anyone have an idea what is causing this?
-Jon