[Evergreen-general] Question about search engine bots & DB CPU spikes

JonGeorg SageLibrary jongeorg.sagelibrary at gmail.com
Wed Dec 1 00:53:07 EST 2021


Because we're behind a firewall, all the addresses display as 127.0.0.1. I
can talk to the people who administer the firewall, though, about blocking
IPs. Thanks
-Jon

On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via Evergreen-general <
evergreen-general at list.evergreen-ils.org> wrote:

> JonGeorg,
>
> Check your Apache logs for the source IP addresses. If you can't find
> them, I can share the correct configuration for Apache with Nginx so
> that you will get the addresses logged.
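
The Nginx-in-front-of-Apache setup Jason mentions is typically handled by
forwarding the client address in a header and trusting it with Apache's
mod_remoteip. A sketch, assuming Nginx proxies to Apache on localhost
(addresses and ports are placeholders for the local setup):

```nginx
# Nginx: forward the real client address to the Apache backend
location / {
    proxy_set_header X-Real-IP       $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_pass http://127.0.0.1:8080;
}
```

```apache
# Apache: mod_remoteip must be enabled (a2enmod remoteip)
RemoteIPHeader X-Real-IP
RemoteIPInternalProxy 127.0.0.1
```

With this in place, the %a and %h fields in the Apache access log report
the real client address instead of 127.0.0.1.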
>
> Once you know the IP address ranges, block them. If you have a firewall,
> I suggest you block them there. If not, you can block them in Nginx or
> in your load balancer configuration if you have one and it allows that.
>
> You may think you want your catalog to show up in search engines, but
> bad bots will lie about who they are. All you can do with misbehaving
> bots is to block them.
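
If blocking at the firewall isn't an option, Nginx deny rules are a simple
alternative. A sketch with placeholder ranges (substitute the networks
actually observed in the logs):

```nginx
# Placeholder ranges only; replace with the offending networks.
# deny/allow can live in the http, server, or location context.
deny 203.0.113.0/24;
deny 198.51.100.0/24;
allow all;
```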
>
> HtH,
> Jason
>
> On 11/30/21 9:34 PM, JonGeorg SageLibrary via Evergreen-general wrote:
> > Question. We've been getting hammered by search engine bots [?], and
> > they all seem to query our system at the same time, enough that it's
> > crashing the app servers. We have a robots.txt file in place. I've
> > increased the crawl delay from 3 to 10 seconds and explicitly
> > disallowed the specific bots, but I've seen no change from the worst
> > offenders, Bingbot and UT-Dorkbot. We had over 4k hits from Dorkbot
> > alone from 2pm-5pm today, and over 5k from Bingbot in the same
> > timeframe, all a couple of hours after I made the changes to the
> > robots file and restarted the Apache services. Out of 100k entries in
> > the vhosts logs in that time frame that doesn't sound like a lot, but
> > the rest of the traffic looks normal. This issue has been happening
> > intermittently [last 3 are 11/30, 11/3, 7/20] for a while, and the
> > only thing that seems to work is to manually kill the services on the
> > DB servers and restart services on the application servers.
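
A robots.txt along the lines described might look like the following.
Note that Crawl-delay is non-standard (Bingbot honors it; many crawlers do
not), and a bot that ignores Disallow will ignore all of this, which is why
blocking ends up being the only reliable fix. Paths and delays here are
illustrative:

```
# robots.txt sketch; adjust user agents, paths, and delays as needed
User-agent: bingbot
Crawl-delay: 10
Disallow: /eg/opac/results

User-agent: UT-Dorkbot
Disallow: /
```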
> >
> > The symptom is an immediate spike in database CPU load. I start
> > killing all queries older than 2 minutes, but the load still usually
> > overwhelms the system, causing the app servers to stop serving
> > requests. The stuck queries are almost always along the lines of:
> >
> > -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
> > from_metarecord(*/BIB_RECORD#/*) core_limit(100000)
> > badge_orgs(1,138,151) estimation_strategy(inclusion) skip_check(0)
> > check_limit(1000) sort(1) filter_group_entry(1) 1
> > site(*/LIBRARY_BRANCH/*) depth(2)
> > WITH w AS (
> >   WITH */STRING/*_keyword_xq AS (SELECT
> >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
> >       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')), */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||
> >      to_tsquery('simple', COALESCE(NULLIF( '(' ||
> >       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')), */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq,
> >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
> >       btrim(regexp_replace(split_date_range(search_normalize
> >
> >   00:02:17.319491 | */STRING/* |
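
Killing queries older than two minutes, as described above, can be
scripted on the PostgreSQL side. A sketch (pg_cancel_backend is the
gentler option that cancels only the running query; pg_terminate_backend
drops the whole backend connection):

```sql
-- Terminate any query that has been running for more than 2 minutes,
-- excluding this session itself.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '2 minutes'
  AND pid <> pg_backend_pid();
```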
> >
> > The DorkBot requests look like they could be what starts these
> > queries, since they use the basket function in the OPAC.
> >
> > "GET
> >
> /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1
>
> > HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
> >
> > I've anonymized the output just to be cautious. Reports are run off
> > the backup database server, so this cannot be an auto-generated
> > report, and it doesn't happen often enough for that anyway. At this
> > point I'm tempted to block the IP addresses. What strategies are you
> > all using to deal with crawlers, and does anyone have an idea what is
> > causing this?
> > -Jon
> >
> > _______________________________________________
> > Evergreen-general mailing list
> > Evergreen-general at list.evergreen-ils.org
> > http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general
> >