[Evergreen-general] Question about search engine bots & DB CPU spikes

JonGeorg SageLibrary jongeorg.sagelibrary at gmail.com
Fri Dec 3 14:12:49 EST 2021


The DorkBot queries I'm referring to look like this:
[02/Dec/2021:12:08:13 -0800] "GET
/eg/opac/results?do_basket_action=Go&query=1&detail_record_view=1&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword%27%22&fg%3Amat_format=1&locg=176&sort=1
HTTP/1.0" 200 62417 "-" "UT-Dorkbot/1.0"

They vary after metabib, but all of them use the basket feature, and they
come from different library branch URLs.
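
For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch that tallies hits per user agent and counts how many of them are basket-action searches. It assumes the access log uses the usual combined layout ("request" status bytes "referer" "user-agent" at the end of each line, one request per line). The names parse_line, is_basket_search and summarize are made up for illustration and are not part of Evergreen.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Assumed layout of the end of each log line (combined format):
# "METHOD /path HTTP/x.y" status bytes "referer" "user-agent"
LOG_TAIL = re.compile(
    r'"(?P<method>[A-Z]+)\s+(?P<url>\S+)\s+HTTP/[\d.]+"\s+'
    r'(?P<status>\d{3})\s+\S+\s+"[^"]*"\s+"(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return (url, status, agent) for one log line, or None if it does not match."""
    m = LOG_TAIL.search(line)
    if m is None:
        return None
    return m.group("url"), int(m.group("status")), m.group("agent")


def is_basket_search(url):
    """True for OPAC results requests that carry the do_basket_action parameter."""
    parts = urlsplit(url)
    return (
        parts.path == "/eg/opac/results"
        and "do_basket_action" in parse_qs(parts.query)
    )


def summarize(lines):
    """Count all hits and basket-search hits per user agent."""
    total, basket = Counter(), Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that are not in the assumed format
        url, _status, agent = parsed
        total[agent] += 1
        if is_basket_search(url):
            basket[agent] += 1
    return total, basket
```

Passing an open file handle for the access log to summarize() returns two counters, and calling most_common() on either one shows which agents are responsible for most of the basket searches.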
-Jon

On Fri, Dec 3, 2021 at 10:45 AM JonGeorg SageLibrary <
jongeorg.sagelibrary at gmail.com> wrote:

> Yeah, I'm not seeing any /opac/extras/unapi requests in the Apache logs.
> Is DorkBot used legitimately for querying the opac?
> -Jon
>
> On Fri, Dec 3, 2021 at 10:37 AM JonGeorg SageLibrary <
> jongeorg.sagelibrary at gmail.com> wrote:
>
>> Thank you!
>> -Jon
>>
>> On Fri, Dec 3, 2021 at 8:10 AM Blake Henderson via Evergreen-general <
>> evergreen-general at list.evergreen-ils.org> wrote:
>>
>>> JonGeorg,
>>>
>>> This reminds me of a similar issue that we had. We resolved it with
>>> this change to NGINX. Here's the link:
>>>
>>>
>>> https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/blake/LP1913610_nginx_request_limits
>>>
>>> and the bug:
>>> https://bugs.launchpad.net/evergreen/+bug/1913610
>>>
>>> I'm not sure that it's the same issue though, as you've shared a search
>>> SQL query and this solution addresses external requests to
>>> "/opac/extras/unapi".
>>> But you might be able to apply the same nginx rate limiting technique
>>> here if you can detect the URL they are using.
>>>
>>> There is a tool called "apachetop" which I used in order to see the
>>> URLs that were being used.
>>>
>>> apt-get -y install apachetop && apachetop -f /var/log/apache2/other_vhosts_access.log
>>>
>>> and another useful command:
>>>
>>> cat /var/log/apache2/other_vhosts_access.log | awk '{print $2}' | sort | uniq -c | sort -rn
>>>
>>> You have to ignore (not limit) all the requests to the Evergreen gateway
>>> as most of that traffic is the staff client and should (probably) not be
>>> limited.
>>>
>>> I'm just throwing some ideas out there for you. Good luck!
>>>
>>> -Blake-
>>> Conducting Magic
>>> Can consume data in any format
>>> MOBIUS
>>>
>>> On 12/2/2021 9:07 PM, JonGeorg SageLibrary via Evergreen-general wrote:
>>>
>>> I tried that and still got the loopback address, after restarting
>>> services. Any other ideas? And the robots.txt file seems to be doing
>>> nothing, which is not much of a surprise. I've reached out to the people
>>> who host our network and have control of everything on the other side of
>>> the firewall.
>>> -Jon
>>>
>>>
>>> On Wed, Dec 1, 2021 at 3:57 AM Jason Stephenson <jason at sigio.com> wrote:
>>>
>>>> JonGeorg,
>>>>
>>>> If you're using nginx as a proxy, the cause may be the configuration of
>>>> Apache and nginx.
>>>>
>>>> First, make sure that mod_remoteip is installed and enabled for Apache
>>>> 2.
>>>>
>>>> Then, in eg_vhost.conf, find the 3 lines that begin with
>>>> "RemoteIPInternalProxy 127.0.0.1/24" and uncomment them.
>>>>
>>>> Next, see what header Apache checks for the remote IP address. In my
>>>> example it is "RemoteIPHeader X-Forwarded-For"
>>>>
>>>> Next, make sure that the following two lines appear in BOTH "location /"
>>>> blocks in the nginx configuration:
>>>>
>>>>          proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
>>>>          proxy_set_header X-Forwarded-Proto $scheme;
>>>>
>>>> After reloading/restarting nginx and Apache, you should start seeing
>>>> remote IP addresses in the Apache logs.
>>>>
>>>> Hope that helps!
>>>> Jason
>>>>
>>>>
>>>> On 12/1/21 12:53 AM, JonGeorg SageLibrary wrote:
>>>> > Because we're behind a firewall, all the addresses display as 127.0.0.1.
>>>> > I can talk to the people who administer the firewall though about
>>>> > blocking IPs. Thanks
>>>> > -Jon
>>>> >
>>>> > On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via Evergreen-general
>>>> > <evergreen-general at list.evergreen-ils.org> wrote:
>>>> >
>>>> >     JonGeorg,
>>>> >
>>>> >     Check your Apache logs for the source IP addresses. If you can't find
>>>> >     them, I can share the correct configuration for Apache with Nginx so
>>>> >     that you will get the addresses logged.
>>>> >
>>>> >     Once you know the IP address ranges, block them. If you have a
>>>> >     firewall, I suggest you block them there. If not, you can block them
>>>> >     in Nginx or in your load balancer configuration if you have one and
>>>> >     it allows that.
>>>> >
>>>> >     You may think you want your catalog to show up in search engines, but
>>>> >     bad bots will lie about who they are. All you can do with misbehaving
>>>> >     bots is to block them.
>>>> >
>>>> >     HtH,
>>>> >     Jason
>>>> >
>>>> >     On 11/30/21 9:34 PM, JonGeorg SageLibrary via Evergreen-general wrote:
>>>> >      > Question. We've been getting hammered by search engine bots, but
>>>> >      > they seem to all query our system at the same time. Enough that it's
>>>> >      > crashing the app servers. We have a robots.txt file in place. I've
>>>> >      > increased the crawl delay from 3 to 10 seconds, and have
>>>> >      > explicitly disallowed the specific bots, but I've seen no change from
>>>> >      > the worst offenders - Bingbot and UT-Dorkbot. We had over 4k hits from
>>>> >      > Dorkbot alone from 2pm-5pm today, and over 5k from Bingbot in the same
>>>> >      > timeframe. All a couple hours after I made the changes to the robots
>>>> >      > file and restarted apache services. Which out of 100k entries in the
>>>> >      > vhosts files in that time frame doesn't sound like a lot, but the rest
>>>> >      > of the traffic looks normal. This issue has been happening
>>>> >      > intermittently [last 3 are 11/30, 11/3, 7/20] for a while, and the only
>>>> >      > thing that seems to work is to manually kill the services on the DB
>>>> >      > servers and restart services on the application servers.
>>>> >      >
>>>> >      > The symptom is an immediate spike in the Database CPU load. I start
>>>> >      > killing all queries older than 2 minutes, but it still usually
>>>> >      > overwhelms the system causing the app servers to stop serving requests.
>>>> >      > The stuck queries are almost always ones along the lines of:
>>>> >      >
>>>> >      > -- bib search: #CD_documentLength #CD_meanHarmonic
>>>> #CD_uniqueWords
>>>> >      > from_metarecord(*/BIB_RECORD#/*) core_limit(100000)
>>>> >      > badge_orgs(1,138,151) estimation_strategy(inclusion)
>>>> skip_check(0)
>>>> >      > check_limit(1000) sort(1) filter_group_entry(1) 1
>>>> >      > site(*/LIBRARY_BRANCH/*) depth(2)
>>>> >      >                      +
>>>> >      >                   |       |         WITH w AS (
>>>> >      >                  |       | WITH */STRING/*_keyword_xq AS
>>>> (SELECT
>>>> >      >                                                  +
>>>> >      >                   |       |       (to_tsquery('english_nostop',
>>>> >      > COALESCE(NULLIF( '(' ||
>>>> >      >
>>>> >
>>>>  btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
>>>> >
>>>> >      > */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|')  || ')', '()'),
>>>> >     '')) ||
>>>> >      > to_tsquery('simple', COALESCE(NULLIF( '(' ||
>>>> >      >
>>>> >
>>>>  btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
>>>> >
>>>> >      > */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|')  || ')', '()'),
>>>> >     ''))) AS
>>>> >      > tsq,+
>>>> >      >                   |       |       (to_tsquery('english_nostop',
>>>> >      > COALESCE(NULLIF( '(' ||
>>>> >      > btrim(regexp_replace(split_date_range(search_normalize
>>>> >      >   00:02:17.319491 | */STRING/* |
>>>> >      >
>>>> >      > And the queries by DorkBot look like they could be starting the query
>>>> >      > since it's using the basket function in the OPAC.
>>>> >      >
>>>> >      > "GET
>>>> >      > /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1
>>>> >      > HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
>>>> >      >
>>>> >      > I've anonymized the output just to be cautious. Reports are run off the
>>>> >      > backup database server, so it cannot be an auto generated report, and it
>>>> >      > doesn't happen often enough for that either. At this point I'm tempted
>>>> >      > to block the IP addresses. What strategies are you all using to deal
>>>> >      > with crawlers, and does anyone have an idea what is causing this?
>>>> >      > -Jon
>>>> >      >
>>>> >      > _______________________________________________
>>>> >      > Evergreen-general mailing list
>>>> >      > Evergreen-general at list.evergreen-ils.org
>>>> >      > http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general
>>>> >      >
>>>> >
>>>>
>>>
>>