[Evergreen-general] Question about search engine bots & DB CPU spikes

Blake Henderson blake at mobiusconsortium.org
Fri Dec 3 11:10:42 EST 2021


JonGeorg,

This reminds me of a similar issue that we had. We resolved it with 
this change to NGINX. Here's the link:

https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/blake/LP1913610_nginx_request_limits

and the bug:
https://bugs.launchpad.net/evergreen/+bug/1913610

I'm not sure that it's the same issue, though, as you've shared a search 
SQL query and this solution addresses external requests to 
"/opac/extras/unapi". But you might be able to apply the same nginx 
rate limiting technique here if you can detect the URL they are using.
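
For example, something like this in the nginx "http" block. This is 
just a sketch - the zone name, size, and rate are placeholders to 
illustrate the technique, so tune them for your traffic:

# allow each client IP an average of 2 requests/second,
# tracked in a 10 MB shared-memory zone
limit_req_zone $binary_remote_addr zone=opac_limit:10m rate=2r/s;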

There is a tool called "apachetop" that I used to see the URLs being 
requested:

apt-get -y install apachetop && apachetop -f /var/log/apache2/other_vhosts_access.log

and another useful command, which counts hits per client IP ($2 is the 
remote address with the default other_vhosts_access.log format):

cat /var/log/apache2/other_vhosts_access.log | awk '{print $2}' | sort | uniq -c | sort -rn
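
If you want the most-requested URLs instead of client addresses, the 
same pipeline works on the request field (field 8 with the default 
vhost_combined LogFormat - check yours):

awk '{print $8}' /var/log/apache2/other_vhosts_access.log | sort | uniq -c | sort -rn | head -25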

You will want to ignore (not rate-limit) all the requests to the 
Evergreen gateway, as most of that traffic is the staff client and 
should (probably) not be limited.
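
So the idea would be to attach the limit only to the public catalog 
and leave the gateway alone. A rough sketch, assuming the example 
Evergreen nginx setup where nginx proxies everything to Apache on 
port 7443 (adjust the upstream and paths to your own config; 
/osrf-gateway-v1 is the usual gateway endpoint):

# rate-limit the public catalog, with a small burst allowance
location /eg/opac/ {
    limit_req zone=opac_limit burst=10 nodelay;
    proxy_pass https://localhost:7443;
}

# gateway traffic (mostly staff client) stays unlimited
location /osrf-gateway-v1 {
    proxy_pass https://localhost:7443;
}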

I'm just throwing some ideas out there for you. Good luck!

-Blake-
Conducting Magic
Can consume data in any format
MOBIUS

On 12/2/2021 9:07 PM, JonGeorg SageLibrary via Evergreen-general wrote:
> I tried that and still got the loopback address after restarting 
> services. Any other ideas? And the robots.txt file seems to be doing 
> nothing, which is not much of a surprise. I've reached out to the 
> people who host our network and have control of everything on the 
> other side of the firewall.
> -Jon
>
>
> On Wed, Dec 1, 2021 at 3:57 AM Jason Stephenson <jason at sigio.com> wrote:
>
>     JonGeorg,
>
>     If you're using nginx as a proxy, that may be the configuration of
>     Apache and nginx.
>
>     First, make sure that mod_remoteip is installed and enabled for
>     Apache 2.
>
>     Then, in eg_vhost.conf, find the 3 lines that begin with
>     "RemoteIPInternalProxy 127.0.0.1/24" and uncomment them.
>
>     Next, see what header Apache checks for the remote IP address. In my
>     example it is "RemoteIPHeader X-Forwarded-For".
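>
>     Put together, the relevant bits of eg_vhost.conf look roughly
>     like this on my system (your internal proxy range may differ):
>
>              RemoteIPHeader X-Forwarded-For
>              RemoteIPInternalProxy 127.0.0.1/24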
>
>     Next, make sure that the following two lines appear in BOTH
>     "location /" blocks in the nginx configuration:
>
>              proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
>              proxy_set_header X-Forwarded-Proto $scheme;
>
>     After reloading/restarting nginx and Apache, you should start seeing
>     remote IP addresses in the Apache logs.
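>
>     On Debian/Ubuntu the whole dance is something like this (a
>     sketch - adjust package and service names to your distro):
>
>              sudo a2enmod remoteip
>              sudo systemctl restart apache2
>              sudo nginx -t && sudo systemctl reload nginx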
>
>     Hope that helps!
>     Jason
>
>
>     On 12/1/21 12:53 AM, JonGeorg SageLibrary wrote:
>     > Because we're behind a firewall, all the addresses display as
>     > 127.0.0.1. I can talk to the people who administer the firewall,
>     > though, about blocking IPs. Thanks
>     > -Jon
>     >
>     > On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via
>     > Evergreen-general
>     > <evergreen-general at list.evergreen-ils.org> wrote:
>     >
>     >     JonGeorg,
>     >
>     >     Check your Apache logs for the source IP addresses. If you
>     >     can't find them, I can share the correct configuration for
>     >     Apache with Nginx so that you will get the addresses logged.
>     >
>     >     Once you know the IP address ranges, block them. If you have
>     >     a firewall, I suggest you block them there. If not, you can
>     >     block them in Nginx or in your load balancer configuration if
>     >     you have one and it allows that.
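>     >
>     >     In Nginx, blocking is just "deny" rules in the server block
>     >     (192.0.2.0/24 and 198.51.100.17 are placeholders - use the
>     >     ranges you actually find in your logs):
>     >
>     >              deny 192.0.2.0/24;
>     >              deny 198.51.100.17;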
>     >
>     >     You may think you want your catalog to show up in search
>     >     engines, but bad bots will lie about who they are. All you
>     >     can do with misbehaving bots is to block them.
>     >
>     >     HtH,
>     >     Jason
>     >
>     >     On 11/30/21 9:34 PM, JonGeorg SageLibrary via
>     >     Evergreen-general wrote:
>     >      > Question: we've been getting hammered by search engine
>     >      > bots [?], and they seem to all query our system at the same
>     >      > time - enough that it's crashing the app servers. We have a
>     >      > robots.txt file in place. I've increased the crawl delay
>     >      > from 3 to 10 seconds and have explicitly disallowed the
>     >      > specific bots, but I've seen no change from the worst
>     >      > offenders - Bingbot and UT-Dorkbot. We had over 4k hits from
>     >      > Dorkbot alone from 2pm-5pm today, and over 5k from Bingbot
>     >      > in the same timeframe, all a couple of hours after I made
>     >      > the changes to the robots file and restarted apache
>     >      > services. Out of 100k entries in the vhosts logs in that
>     >      > time frame that doesn't sound like a lot, but the rest of
>     >      > the traffic looks normal. This issue has been happening
>     >      > intermittently [last 3 are 11/30, 11/3, 7/20] for a while,
>     >      > and the only thing that seems to work is to manually kill
>     >      > the services on the DB servers and restart services on the
>     >      > application servers.
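>     >      >
>     >      > For reference, the relevant robots.txt entries look roughly
>     >      > like this (bearing in mind that Crawl-delay is only honored
>     >      > by some crawlers, and bad bots ignore robots.txt entirely):
>     >      >
>     >      > User-agent: bingbot
>     >      > Crawl-delay: 10
>     >      >
>     >      > User-agent: UT-Dorkbot
>     >      > Disallow: /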
>     >      >
>     >      > The symptom is an immediate spike in the database CPU
>     >      > load. I start killing all queries older than 2 minutes, but
>     >      > it still usually overwhelms the system, causing the app
>     >      > servers to stop serving requests. The stuck queries are
>     >      > almost always ones along the lines of:
>     >      >
>     >      > -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
>     >      > from_metarecord(BIB_RECORD#) core_limit(100000)
>     >      > badge_orgs(1,138,151) estimation_strategy(inclusion) skip_check(0)
>     >      > check_limit(1000) sort(1) filter_group_entry(1) 1
>     >      > site(LIBRARY_BRANCH) depth(2)
>     >      >
>     >      > WITH w AS (
>     >      >   WITH STRING_keyword_xq AS (SELECT
>     >      >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
>     >      >       btrim(regexp_replace(split_date_range(search_normalize(
>     >      >         replace(replace(uppercase(translate_isbn1013(E'1')),
>     >      >         LONG_STRING))), E'(?:\\s+|:)', '&', 'g'), '&|') || ')',
>     >      >       '()'), '')) ||
>     >      >      to_tsquery('simple', COALESCE(NULLIF( '(' ||
>     >      >       btrim(regexp_replace(split_date_range(search_normalize(
>     >      >         replace(replace(uppercase(translate_isbn1013(E'1')),
>     >      >         LONG_STRING))), E'(?:\\s+|:)', '&', 'g'), '&|') || ')',
>     >      >       '()'), ''))) AS tsq, ...
>     >      >
>     >      >   00:02:17.319491 | STRING |
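>     >      >
>     >      > (For reference, "killing all queries older than 2 minutes"
>     >      > is something along these lines on the DB server - the state
>     >      > and comment filters here are just illustrative:)
>     >      >
>     >      > SELECT pg_terminate_backend(pid)
>     >      >   FROM pg_stat_activity
>     >      >  WHERE state = 'active'
>     >      >    AND query LIKE '-- bib search%'
>     >      >    AND now() - query_start > interval '2 minutes';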
>     >      >
>     >      > And the queries from UT-Dorkbot look like they could be
>     >      > starting the query, since it's using the basket function in
>     >      > the OPAC.
>     >      >
>     >      > "GET
>     >      >
>     >
>      /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1
>     >
>     >      > HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
>     >      >
>     >      > I've anonymized the output just to be cautious. Reports are
>     >      > run off the backup database server, so it cannot be an
>     >      > auto-generated report, and it doesn't happen often enough
>     >      > for that either. At this point I'm tempted to block the IP
>     >      > addresses. What strategies are you all using to deal with
>     >      > crawlers, and does anyone have an idea what is causing this?
>     >      > -Jon