[Evergreen-general] Question about search engine bots & DB CPU spikes

Jason Stephenson jason at sigio.com
Wed Dec 1 06:57:40 EST 2021


JonGeorg,

If you're using nginx as a proxy, that may be down to the configuration of 
Apache and nginx.

First, make sure that mod_remoteip is installed and enabled for Apache 2.
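
On Debian-based systems, for example, enabling it is usually just the 
following (package names and commands differ on other distributions, and 
Apache gets restarted at the end of these steps anyway):

         # Debian/Ubuntu example; mod_remoteip ships with Apache 2.4
         sudo a2enmod remoteip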

Then, in eg_vhost.conf, find the three lines that begin with 
"RemoteIPInternalProxy" (the first is "RemoteIPInternalProxy 127.0.0.1/24") 
and uncomment them.

Next, see which header Apache checks for the remote IP address. In my 
configuration the directive is "RemoteIPHeader X-Forwarded-For".

Next, make sure that the following two lines appear in BOTH "location /" 
blocks in the nginx configuration:

         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
         proxy_set_header X-Forwarded-Proto $scheme;
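
For reference, here is a rough sketch of how one of those blocks might look; 
the proxy_pass target is only a placeholder for however your nginx already 
forwards traffic to Apache:

         location / {
             # placeholder backend; keep whatever proxy_pass you use today
             proxy_pass https://127.0.0.1:7443;
             proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
             proxy_set_header X-Forwarded-Proto $scheme;
         }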

After reloading/restarting nginx and Apache, you should start seeing 
remote IP addresses in the Apache logs.
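
A quick way to confirm it took effect is to watch the access log for 
addresses other than 127.0.0.1; the log path below is just an example and 
will vary with your distribution and Evergreen setup:

         # client IPs should no longer all show up as 127.0.0.1
         sudo tail -f /var/log/apache2/access.log | grep --line-buffered -v '^127\.0\.0\.1'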

Hope that helps!
Jason


On 12/1/21 12:53 AM, JonGeorg SageLibrary wrote:
> Because we're behind a firewall, all the addresses display as 127.0.0.1.
> I can talk to the people who administer the firewall about blocking IPs,
> though. Thanks
> -Jon
> 
> On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via Evergreen-general
> <evergreen-general at list.evergreen-ils.org> wrote:
> 
>     JonGeorg,
> 
>     Check your Apache logs for the source IP addresses. If you can't find
>     them, I can share the correct configuration for Apache with Nginx so
>     that you will get the addresses logged.
> 
>     Once you know the IP address ranges, block them. If you have a
>     firewall,
>     I suggest you block them there. If not, you can block them in Nginx or
>     in your load balancer configuration if you have one and it allows that.
> 
>     You may think you want your catalog to show up in search engines, but
>     bad bots will lie about who they are. All you can do with misbehaving
>     bots is to block them.
> 
>     HtH,
>     Jason
> 
>     On 11/30/21 9:34 PM, JonGeorg SageLibrary via Evergreen-general wrote:
>      > Question. We've been getting hammered by search engine bots [?], but
>      > they seem to all query our system at the same time, enough that it's
>      > crashing the app servers. We have a robots.txt file in place. I've
>      > increased the crawl delay from 3 to 10 seconds and have explicitly
>      > disallowed the specific bots, but I've seen no change from the worst
>      > offenders - Bingbot and UT-Dorkbot. We had over 4k hits from Dorkbot
>      > alone from 2pm-5pm today, and over 5k from Bingbot in the same
>      > timeframe, all a couple of hours after I made the changes to the
>      > robots file and restarted Apache services. Out of 100k entries in the
>      > vhost logs in that time frame that doesn't sound like a lot, but the
>      > rest of the traffic looks normal. This issue has been happening
>      > intermittently [the last 3 were 11/30, 11/3, 7/20] for a while, and
>      > the only thing that seems to work is to manually kill the services on
>      > the DB servers and restart services on the application servers.
>      >
>      > The symptom is an immediate spike in the database CPU load. I start
>      > killing all queries older than 2 minutes, but it still usually
>      > overwhelms the system, causing the app servers to stop serving
>      > requests. The stuck queries are almost always ones along the lines of:
>      >
>      > -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
>      > from_metarecord(BIB_RECORD#) core_limit(100000) badge_orgs(1,138,151)
>      > estimation_strategy(inclusion) skip_check(0) check_limit(1000) sort(1)
>      > filter_group_entry(1) 1 site(LIBRARY_BRANCH) depth(2)
>      > WITH w AS (
>      >   WITH STRING_keyword_xq AS (SELECT
>      >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
>      >     btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
>      >     LONG_STRING))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||
>      >     to_tsquery('simple', COALESCE(NULLIF( '(' ||
>      >     btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
>      >     LONG_STRING))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq,
>      >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
>      >     btrim(regexp_replace(split_date_range(search_normalize
>      >   00:02:17.319491 | STRING |
>      >
>      > And the requests from UT-Dorkbot look like they could be what starts
>      > those queries, since they use the basket function in the OPAC:
>      >
>      > "GET
>      > /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=LONG_STRING&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1
>      > HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
>      >
>      > I've anonymized the output just to be cautious. Reports are run off
>      > the backup database server, so it cannot be an auto-generated report,
>      > and it doesn't happen often enough for that either. At this point I'm
>      > tempted to block the IP addresses. What strategies are you all using
>      > to deal with crawlers, and does anyone have an idea what is causing
>      > this?
>      > -Jon
>      >

