[Evergreen-general] Question about search engine bots & DB CPU spikes
JonGeorg SageLibrary
jongeorg.sagelibrary at gmail.com
Thu Dec 2 22:07:30 EST 2021
I tried that and, after restarting services, still got the loopback
address. Any other ideas? And the robots.txt file seems to be doing nothing, which
is not much of a surprise. I've reached out to the people who host our
network and have control of everything on the other side of the firewall.
-Jon
On Wed, Dec 1, 2021 at 3:57 AM Jason Stephenson <jason at sigio.com> wrote:
> JonGeorg,
>
> If you're using nginx as a proxy, that may be the configuration of
> Apache and nginx.
>
> First, make sure that mod_remoteip is installed and enabled for Apache 2.
>
> Then, in eg_vhost.conf, find the 3 lines that begin with
> "RemoteIPInternalProxy 127.0.0.1/24" and uncomment them.
>
> Next, see what header Apache checks for the remote IP address. In my
> example it is "RemoteIPHeader X-Forwarded-For"
>
> Next, make sure that the following two lines appear in BOTH "location /"
> blocks in the nginx configuration:
>
> proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
> proxy_set_header X-Forwarded-Proto $scheme;
>
> After reloading/restarting nginx and Apache, you should start seeing
> remote IP addresses in the Apache logs.
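>
> As a concrete reference, the fragments on both sides might look like
> this (a sketch based on the steps above; the exact header name and
> proxy network should match your own setup):
>
>     # Apache side (eg_vhost.conf), with mod_remoteip enabled
>     RemoteIPHeader X-Forwarded-For
>     RemoteIPInternalProxy 127.0.0.1/24
>
>     # nginx side, inside both "location /" blocks
>     proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
>     proxy_set_header X-Forwarded-Proto $scheme;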
>
> Hope that helps!
> Jason
>
>
> On 12/1/21 12:53 AM, JonGeorg SageLibrary wrote:
> > Because we're behind a firewall, all the addresses display as 127.0.0.1.
> > I can talk to the people who administer the firewall though about
> > blocking IP's. Thanks
> > -Jon
> >
> > On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via Evergreen-general
> > <evergreen-general at list.evergreen-ils.org> wrote:
> >
> > JonGeorg,
> >
> > Check your Apache logs for the source IP addresses. If you can't
> > find them, I can share the correct configuration for Apache with
> > Nginx so that you will get the addresses logged.
> >
> > Once you know the IP address ranges, block them. If you have a
> > firewall, I suggest you block them there. If not, you can block
> > them in Nginx or in your load balancer configuration if you have
> > one and it allows that.
> >
> > You may think you want your catalog to show up in search engines, but
> > bad bots will lie about who they are. All you can do with misbehaving
> > bots is to block them.
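> >
> > For example, in nginx a block can be as simple as (illustrative
> > addresses, not your actual ranges):
> >
> >     # deny an abusive source network
> >     deny 203.0.113.0/24;
> >
> >     # or refuse requests by user agent string
> >     if ($http_user_agent ~* "UT-Dorkbot") {
> >         return 403;
> >     }
> >
> > Keep in mind that user agent strings are trivially spoofed, so the
> > IP-level block is the more reliable of the two.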
> >
> > HtH,
> > Jason
> >
> > On 11/30/21 9:34 PM, JonGeorg SageLibrary via Evergreen-general wrote:
> > > Question. We've been getting hammered by search engine bots, and
> > > they seem to all query our system at the same time. Enough that
> > > it's crashing the app servers. We have a robots.txt file in place.
> > > I've increased the Crawl-delay from 3 to 10 seconds and have
> > > explicitly disallowed the specific bots, but I've seen no change
> > > from the worst offenders, Bingbot and UT-Dorkbot. We had over 4k
> > > hits from Dorkbot alone from 2pm-5pm today, and over 5k from
> > > Bingbot in the same timeframe, all a couple of hours after I made
> > > the changes to the robots file and restarted Apache services. Out
> > > of 100k entries in the vhost log files in that time frame that
> > > doesn't sound like a lot, but the rest of the traffic looks
> > > normal. This issue has been happening intermittently [the last 3
> > > incidents were 11/30, 11/3, 7/20] for a while, and the only thing
> > > that seems to work is to manually kill the services on the DB
> > > servers and restart services on the application servers.
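> > >
> > > For context, the robots.txt in question is along these lines (a
> > > sketch, not our exact file; note that compliance is entirely
> > > voluntary, which fits what we're seeing):
> > >
> > >     User-agent: *
> > >     Crawl-delay: 10
> > >
> > >     User-agent: UT-Dorkbot
> > >     Disallow: /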
> > >
> > > The symptom is an immediate spike in the database CPU load. I
> > > start killing all queries older than 2 minutes, but it still
> > > usually overwhelms the system, causing the app servers to stop
> > > serving requests. The stuck queries are almost always ones along
> > > the lines of:
> > >
> > > -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
> > > from_metarecord(*/BIB_RECORD#/*) core_limit(100000)
> > > badge_orgs(1,138,151) estimation_strategy(inclusion) skip_check(0)
> > > check_limit(1000) sort(1) filter_group_entry(1) 1
> > > site(*/LIBRARY_BRANCH/*) depth(2)
> > > WITH w AS (
> > >   WITH */STRING/*_keyword_xq AS (SELECT
> > >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
> > >       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
> > >       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||
> > >     to_tsquery('simple', COALESCE(NULLIF( '(' ||
> > >       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
> > >       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq,
> > >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
> > >       btrim(regexp_replace(split_date_range(search_normalize ...
> > >
> > > [query truncated; reported runtime 00:02:17.319491]
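> > >
> > > The manual query-killing can be expressed as a single statement
> > > against pg_stat_activity (a sketch, assuming PostgreSQL 9.2+ and
> > > sufficient privileges; adjust the 2-minute threshold to taste):
> > >
> > >     SELECT pg_terminate_backend(pid)
> > >     FROM pg_stat_activity
> > >     WHERE state = 'active'
> > >       AND now() - query_start > interval '2 minutes'
> > >       AND pid <> pg_backend_pid();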
> > >
> > > And the queries from UT-Dorkbot look like they could be triggering
> > > these searches, since they use the basket function in the OPAC:
> > >
> > > "GET
> > >
> >
> /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1
> >
> > > HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
> > >
> > > I've anonymized the output just to be cautious. Reports are run
> > > off the backup database server, so this cannot be an
> > > auto-generated report, and it doesn't happen often enough for
> > > that either. At this point I'm tempted to block the IP addresses.
> > > What strategies are you all using to deal with crawlers, and does
> > > anyone have an idea what is causing this?
> > > -Jon
> > >
> > > _______________________________________________
> > > Evergreen-general mailing list
> > > Evergreen-general at list.evergreen-ils.org
> > > http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general
>