[Evergreen-general] Question about search engine bots & DB CPU spikes
JonGeorg SageLibrary
jongeorg.sagelibrary at gmail.com
Fri Dec 3 13:45:21 EST 2021
Yeah, I'm not seeing any /opac/extras/unapi requests in the Apache logs.
Is DorkBot used legitimately for querying the opac?
-Jon
On Fri, Dec 3, 2021 at 10:37 AM JonGeorg SageLibrary <
jongeorg.sagelibrary at gmail.com> wrote:
> Thank you!
> -Jon
>
> On Fri, Dec 3, 2021 at 8:10 AM Blake Henderson via Evergreen-general <
> evergreen-general at list.evergreen-ils.org> wrote:
>
>> JonGeorg,
>>
>> This reminds me of a similar issue that we had. We resolved it with this
>> change to nginx. Here's the link:
>>
>>
>> https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/blake/LP1913610_nginx_request_limits
>>
>> and the bug:
>> https://bugs.launchpad.net/evergreen/+bug/1913610
>>
>> I'm not sure that it's the same issue, though, as you've shared a search
>> SQL query and this solution addresses external requests to
>> "/opac/extras/unapi".
>> But you might be able to apply the same nginx rate limiting technique
>> here if you can detect the URL they are using.
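For reference, the rate-limiting technique Blake describes looks roughly like this in nginx. This is a sketch only: the zone name, rate, burst, and upstream address are illustrative assumptions, not the values from the LP1913610 branch linked above.

```nginx
# Sketch only: zone name, rate, burst, and upstream are assumptions.
limit_req_zone $binary_remote_addr zone=unapi_limit:10m rate=2r/s;

server {
    listen 443 ssl;

    location /opac/extras/unapi {
        limit_req zone=unapi_limit burst=10 nodelay;  # absorb short bursts, reject the rest
        limit_req_status 429;                         # tell well-behaved clients to slow down
        proxy_pass https://localhost:7443;            # assumed Apache backend
    }
}
```

The same pattern can be applied to any URL a bot hammers, once you have identified it in the logs.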
>>
>> There is a tool called "apachetop" which I used to see the URLs that
>> were being requested.
>>
>> apt-get -y install apachetop && apachetop -f
>> /var/log/apache2/other_vhosts_access.log
>>
>> and another useful command:
>>
>> cat /var/log/apache2/other_vhosts_access.log | awk '{print $2}' | sort |
>> uniq -c | sort -rn
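If apachetop isn't available, the same kind of tally can be sketched in a few lines of Python. This assumes the stock combined other_vhosts_access.log format; the sample lines below are invented for illustration.

```python
# Count requests per user agent from an Apache other_vhosts_access.log
# (default combined format). Sample lines are made up.
import re
from collections import Counter

LINE_RE = re.compile(
    r'^(?P<vhost>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def count_agents(lines):
    """Return a Counter of user-agent strings seen in the log lines."""
    agents = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            agents[m.group("agent")] += 1
    return agents

sample = [
    'example.org:443 203.0.113.9 - - [03/Dec/2021:13:45:21 -0500] '
    '"GET /eg/opac/results?query=1 HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"',
    'example.org:443 198.51.100.4 - - [03/Dec/2021:13:45:22 -0500] '
    '"GET /eg/opac/home HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    'example.org:443 203.0.113.9 - - [03/Dec/2021:13:45:23 -0500] '
    '"GET /eg/opac/results?query=2 HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"',
]

top = count_agents(sample).most_common()
print(top)  # [('UT-Dorkbot/1.0', 2), ('Mozilla/5.0', 1)]
```

To run it against a real log, replace `sample` with `open('/var/log/apache2/other_vhosts_access.log')`.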
>>
>> You have to ignore (not limit) all the requests to the Evergreen gateway
>> as most of that traffic is the staff client and should (probably) not be
>> limited.
>>
>> I'm just throwing some ideas out there for you. Good luck!
>>
>> -Blake-
>> Conducting Magic
>> Can consume data in any format
>> MOBIUS
>>
>> On 12/2/2021 9:07 PM, JonGeorg SageLibrary via Evergreen-general wrote:
>>
>> I tried that and still got the loopback address, after restarting
>> services. Any other ideas? And the robots.txt file seems to be doing
>> nothing, which is not much of a surprise. I've reached out to the people
>> who host our network and have control of everything on the other side of
>> the firewall.
>> -Jon
>>
>>
>> On Wed, Dec 1, 2021 at 3:57 AM Jason Stephenson <jason at sigio.com> wrote:
>>
>>> JonGeorg,
>>>
>>> If you're using nginx as a proxy, that may be due to the configuration
>>> of Apache and nginx.
>>>
>>> First, make sure that mod_remoteip is installed and enabled for Apache
>>> 2.
>>>
>>> Then, in eg_vhost.conf, find the 3 lines that begin with
>>> "RemoteIPInternalProxy 127.0.0.1/24" and uncomment them.
>>>
>>> Next, see what header Apache checks for the remote IP address. In my
>>> example it is "RemoteIPHeader X-Forwarded-For".
>>>
>>> Next, make sure that the following two lines appear in BOTH "location /"
>>> blocks in the nginx configuration:
>>>
>>> proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
>>> proxy_set_header X-Forwarded-Proto $scheme;
>>>
>>> After reloading/restarting nginx and Apache, you should start seeing
>>> remote IP addresses in the Apache logs.
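Putting Jason's two directives in context, one of the two `location /` blocks might look like the sketch below; the upstream address and the `Host` header line are assumptions, and the same two `X-Forwarded-*` lines go in the other block as well.

```nginx
location / {
    proxy_pass https://localhost:7443;                            # assumed Apache backend
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;  # pass the real client IP to Apache
    proxy_set_header X-Forwarded-Proto $scheme;
}
```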
>>>
>>> Hope that helps!
>>> Jason
>>>
>>>
>>> On 12/1/21 12:53 AM, JonGeorg SageLibrary wrote:
>>> > Because we're behind a firewall, all the addresses display as
>>> > 127.0.0.1. I can talk to the people who administer the firewall,
>>> > though, about blocking IPs. Thanks
>>> > -Jon
>>> >
>>> > On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via Evergreen-general
>>> > <evergreen-general at list.evergreen-ils.org> wrote:
>>> >
>>> > JonGeorg,
>>> >
>>> > Check your Apache logs for the source IP addresses. If you can't find
>>> > them, I can share the correct configuration for Apache with nginx so
>>> > that you will get the addresses logged.
>>> >
>>> > Once you know the IP address ranges, block them. If you have a
>>> > firewall, I suggest you block them there. If not, you can block them
>>> > in nginx or in your load balancer configuration, if you have one and
>>> > it allows that.
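Blocking in nginx, as suggested here, can be as simple as `deny` rules in the relevant location block. The ranges below are documentation-only placeholders (TEST-NET), not the actual bot addresses; substitute the ranges found in your own logs.

```nginx
# Placeholder ranges for illustration only.
location /eg/opac/ {
    deny 203.0.113.0/24;
    deny 198.51.100.0/24;
    allow all;
    proxy_pass https://localhost:7443;  # assumed Apache backend
}
```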
>>> >
>>> > You may think you want your catalog to show up in search engines, but
>>> > bad bots will lie about who they are. All you can do with misbehaving
>>> > bots is to block them.
>>> >
>>> > HtH,
>>> > Jason
>>> >
>>> > On 11/30/21 9:34 PM, JonGeorg SageLibrary via Evergreen-general wrote:
>>> > > Question. We've been getting hammered by search engine bots [?],
>>> > > but they seem to all query our system at the same time. Enough
>>> > > that it's crashing the app servers. We have a robots.txt file in
>>> > > place. I've increased the crawl delay from 3 to 10 seconds, and
>>> > > have explicitly disallowed the specific bots, but I've seen no
>>> > > change from the worst offenders - Bingbot and UT-Dorkbot. We had
>>> > > over 4k hits from Dorkbot alone from 2pm-5pm today, and over 5k
>>> > > from Bingbot in the same timeframe, all a couple hours after I
>>> > > made the changes to the robots file and restarted Apache services.
>>> > > Out of 100k entries in the vhosts logs in that time frame that
>>> > > doesn't sound like a lot, but the rest of the traffic looks
>>> > > normal. This issue has been happening intermittently [last 3 are
>>> > > 11/30, 11/3, 7/20] for a while, and the only thing that seems to
>>> > > work is to manually kill the services on the DB servers and
>>> > > restart services on the application servers.
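For reference, the robots.txt changes described above would look something like the sketch below. The 10-second delay and bot names come from the message; note that Crawl-delay is a non-standard extension, and, as this thread illustrates, misbehaving bots may ignore robots.txt entirely.

```text
# Crawl delay raised from 3 to 10 seconds
User-agent: bingbot
Crawl-delay: 10

# Explicitly disallow the worst offender
User-agent: UT-Dorkbot
Disallow: /

User-agent: *
Crawl-delay: 10
```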
>>> > >
>>> > > The symptom is an immediate spike in the database CPU load. I
>>> > > start killing all queries older than 2 minutes, but it still
>>> > > usually overwhelms the system, causing the app servers to stop
>>> > > serving requests. The stuck queries are almost always ones along
>>> > > the lines of:
>>> > >
>>> > > -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
>>> > > from_metarecord(*/BIB_RECORD#/*) core_limit(100000)
>>> > > badge_orgs(1,138,151) estimation_strategy(inclusion) skip_check(0)
>>> > > check_limit(1000) sort(1) filter_group_entry(1) 1
>>> > > site(*/LIBRARY_BRANCH/*) depth(2)
>>> > > WITH w AS (
>>> > > WITH */STRING/*_keyword_xq AS (SELECT
>>> > > (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
>>> > > btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
>>> > > */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||
>>> > > to_tsquery('simple', COALESCE(NULLIF( '(' ||
>>> > > btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
>>> > > */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq,
>>> > > (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
>>> > > btrim(regexp_replace(split_date_range(search_normalize
>>> > > 00:02:17.319491 | */STRING/* |
>>> > >
>>> > > And the queries by DorkBot look like they could be starting the
>>> > > query, since it's using the basket function in the OPAC.
>>> > >
>>> > > "GET /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1
>>> > > HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
>>> > >
>>> > > I've anonymized the output just to be cautious. Reports are run off
>>> > > the backup database server, so it cannot be an auto-generated
>>> > > report, and it doesn't happen often enough for that either. At this
>>> > > point I'm tempted to block the IP addresses. What strategies are
>>> > > you all using to deal with crawlers, and does anyone have an idea
>>> > > what is causing this?
>>> > > -Jon
>>> > >
>>> > > _______________________________________________
>>> > > Evergreen-general mailing list
>>> > > Evergreen-general at list.evergreen-ils.org
>>> > > http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general
>>>
>>
>>
>