<div dir="ltr">The DorkBot queries I'm referring to look like this:<br>[02/Dec/2021:12:08:13 -0800] "GET /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=1&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword%27%22&fg%3Amat_format=1&locg=176&sort=1 HTTP/1.0" 200 62417 "-" "UT-Dorkbot/1.0"<br><br>they vary after metabib, but all are using the basket feature. They come from different library branch URLs. <div>-Jon</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Dec 3, 2021 at 10:45 AM JonGeorg SageLibrary <<a href="mailto:jongeorg.sagelibrary@gmail.com">jongeorg.sagelibrary@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Yeah, I'm not seeing any /opac/extras/unapi requests in the Apache logs. <br>Is DorkBot used legitimately for querying the opac? <div>-Jon</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Dec 3, 2021 at 10:37 AM JonGeorg SageLibrary <<a href="mailto:jongeorg.sagelibrary@gmail.com" target="_blank">jongeorg.sagelibrary@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Thank you!<div>-Jon</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Dec 3, 2021 at 8:10 AM Blake Henderson via Evergreen-general <<a href="mailto:evergreen-general@list.evergreen-ils.org" target="_blank">evergreen-general@list.evergreen-ils.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
    JonGeorg,<br>
    <br>
    This reminds me of a similar issue we had. We resolved it with
    this change to NGINX. Here's the link:<br>
    <br>
    <a href="https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/blake/LP1913610_nginx_request_limits" target="_blank">https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/blake/LP1913610_nginx_request_limits</a><br>
    <br>
    and the bug:<br>
    <a href="https://bugs.launchpad.net/evergreen/+bug/1913610" target="_blank">https://bugs.launchpad.net/evergreen/+bug/1913610</a><br>
    <br>
    I'm not sure that it's the same issue, though, as you've shared a
    search SQL query and this solution addresses external requests to
    "/opac/extras/unapi".<br>
    But you might be able to apply the same nginx rate-limiting
    technique here if you can detect the URL they are using.<br>
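For the basket URL pattern shown earlier in the thread, such a rate limit might look roughly like this. This is a sketch, not the linked working branch; the zone name, size, and rates are placeholders to adjust for your traffic:

```nginx
# http{} context: track clients by address, allow about 1 request/sec each
limit_req_zone $binary_remote_addr zone=opac_search:10m rate=1r/s;

# inside the server{} block that proxies the catalog
location /eg/opac/results {
    limit_req zone=opac_search burst=10 nodelay;
    # ...existing proxy_pass and headers for this location...
}
```

Requests beyond the burst get a 503 (configurable with limit_req_status) instead of reaching Apache and the database.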
    <br>
    There is a tool called "apachetop" which I used to see the
    URLs that were being requested.<br>
    <br>
    apt-get -y install apachetop && apachetop -f
    /var/log/apache2/other_vhosts_access.log<br>
    <br>
    and another useful command:<br>
    <br>
    awk '{print $2}' /var/log/apache2/other_vhosts_access.log |
    sort | uniq -c | sort -rn<br>
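Along the same lines, splitting each line on double quotes gives a per-user-agent breakdown, which makes bot traffic like UT-Dorkbot stand out quickly. A sketch assuming the default combined/vhost_combined log format; adjust the field number if your LogFormat differs:

```shell
# Count requests per user agent; with the combined log format the
# User-Agent string is field 6 when splitting on double quotes.
log=/var/log/apache2/other_vhosts_access.log
if [ -r "$log" ]; then
  awk -F'"' '{print $6}' "$log" | sort | uniq -c | sort -rn | head
fi
```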
    <br>
    You have to ignore (not limit) all the requests to the Evergreen
    gateway as most of that traffic is the staff client and should
    (probably) not be limited.<br>
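To keep that gateway traffic out of the per-IP counts, a pre-filter can drop it before counting. The gateway paths below are assumptions about a stock Evergreen/OpenSRF URL layout; substitute whatever endpoints your staff clients actually hit:

```shell
# Count per-client requests, ignoring OpenSRF gateway/translator
# traffic (paths are assumptions; check your own access log).
log=/var/log/apache2/other_vhosts_access.log
if [ -r "$log" ]; then
  grep -vE 'osrf-gateway-v1|osrf-http-translator' "$log" \
    | awk '{print $2}' | sort | uniq -c | sort -rn | head
fi
```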
    <br>
    I'm just throwing some ideas out there for you. Good luck!<br>
    <pre cols="72">-Blake-
Conducting Magic
Can consume data in any format
MOBIUS</pre>
    <div>On 12/2/2021 9:07 PM, JonGeorg
      SageLibrary via Evergreen-general wrote:<br>
    </div>
    <blockquote type="cite">
      
<div dir="ltr">I tried that and still got the loopback address
        after restarting services. Any other ideas? And the robots.txt
        file seems to be doing nothing, which is not much of a surprise.
        I've reached out to the people who host our network and have
        control of everything on the other side of the firewall.
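The measures described (a 10-second crawl delay plus per-bot disallows) amount to a robots.txt along these lines. This is a sketch; directive support varies by crawler (Bingbot has historically honored Crawl-delay, Googlebot does not), and a misbehaving bot like UT-Dorkbot is free to ignore the file entirely:

```
User-agent: UT-Dorkbot
Disallow: /

User-agent: bingbot
Crawl-delay: 10
Disallow: /eg/opac/results

User-agent: *
Crawl-delay: 10
```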
        <div>-Jon<br>
          <br>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Wed, Dec 1, 2021 at
              3:57 AM Jason Stephenson <<a href="mailto:jason@sigio.com" target="_blank">jason@sigio.com</a>>
              wrote:<br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">JonGeorg,<br>
              <br>
              If you're using nginx as a proxy, that may be caused by the
              configuration of <br>
              Apache and nginx.<br>
              <br>
              First, make sure that mod_remoteip is installed and
              enabled for Apache 2 (e.g. with "a2enmod remoteip").<br>
              <br>
              Then, in eg_vhost.conf, find the 3 lines that begin with <br>
              "RemoteIPInternalProxy <a href="http://127.0.0.1/24" rel="noreferrer" target="_blank">127.0.0.1/24</a>"
              and uncomment them.<br>
              <br>
              Next, see what header Apache checks for the remote IP
              address. In my <br>
              example it is "RemoteIPHeader X-Forwarded-For"<br>
              <br>
              Next, make sure that the following two lines appear in
              BOTH "location /" <br>
              blocks in the nginx configuration:<br>
              <br>
                       proxy_set_header X-Forwarded-For
              $proxy_add_x_forwarded_for;<br>
                       proxy_set_header X-Forwarded-Proto $scheme;<br>
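Taken together, the steps above amount to config like the following. A sketch only: the /24 mask mirrors the stock eg_vhost.conf comment lines, and file locations vary by deployment:

```nginx
# Apache side (eg_vhost.conf): trust the local proxy and take the
# client address from the header nginx sets.
#   RemoteIPHeader X-Forwarded-For
#   RemoteIPInternalProxy 127.0.0.1/24

# nginx side, repeated in BOTH "location /" blocks:
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
```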
              <br>
              After reloading/restarting nginx and Apache, you should
              start seeing <br>
              remote IP addresses in the Apache logs.<br>
              <br>
              Hope that helps!<br>
              Jason<br>
              <br>
              <br>
              On 12/1/21 12:53 AM, JonGeorg SageLibrary wrote:<br>
              > Because we're behind a firewall, all the addresses
              display as 127.0.0.1. <br>
              > I can talk to the people who administer the firewall
              though about <br>
              > blocking IP's. Thanks<br>
              > -Jon<br>
              > <br>
              > On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via
              Evergreen-general <br>
              > <<a href="mailto:evergreen-general@list.evergreen-ils.org" target="_blank">evergreen-general@list.evergreen-ils.org</a>
              <br>
              > <mailto:<a href="mailto:evergreen-general@list.evergreen-ils.org" target="_blank">evergreen-general@list.evergreen-ils.org</a>>>
              wrote:<br>
              > <br>
              >     JonGeorg,<br>
              > <br>
              >     Check your Apache logs for the source IP
              addresses. If you can't find<br>
              >     them, I can share the correct configuration for
              Apache with Nginx so<br>
              >     that you will get the addresses logged.<br>
              > <br>
              >     Once you know the IP address ranges, block them.
              If you have a<br>
              >     firewall,<br>
              >     I suggest you block them there. If not, you can
              block them in Nginx or<br>
              >     in your load balancer configuration if you have
              one and it allows that.<br>
              > <br>
              >     You may think you want your catalog to show up in
              search engines, but<br>
              >     bad bots will lie about who they are. All you can
              do with misbehaving<br>
              >     bots is to block them.<br>
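If firewall access is slow to arrange, blocking at the nginx layer is a workable stopgap. A sketch; 203.0.113.0/24 is a documentation placeholder, substitute the actual offending ranges:

```nginx
# Refuse a misbehaving range before it reaches Apache/Evergreen.
location /eg/opac/ {
    deny 203.0.113.0/24;
    allow all;
}
```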
              > <br>
              >     HtH,<br>
              >     Jason<br>
              > <br>
              >     On 11/30/21 9:34 PM, JonGeorg SageLibrary via
              Evergreen-general wrote:<br>
              >      > Question. We've been getting hammered by
              search engine bots [?], but<br>
              >      > they seem to all query our system at the
              same time. Enough that it's<br>
              >      > crashing the app servers. We have a
              robots.txt file in place. I've<br>
              >      > increased the crawling delay speed from 3
              to 10 seconds, and have<br>
              >      > explicitly disallowed the specific bots,
              but I've seen no change<br>
              >     from<br>
              >      > the worst offenders - Bingbot and
              UT-Dorkbot. We had over 4k hits<br>
              >     from<br>
              >      > Dorkbot alone from 2pm-5pm today, and over
              5k from Bingbot in the<br>
              >     same<br>
              >      > timeframe. All a couple hours after I made
              the changes to the robots<br>
              >      > file and restarted apache services.
              Which out of 100k entries in the<br>
              >      > vhosts files in that time frame doesn't
              sound like a lot, but the<br>
              >     rest<br>
              >      > of the traffic looks normal. This issue has
              been happening<br>
              >      > intermittently [last 3 are 11/30, 11/3,
              7/20] for a while, and<br>
              >     the only<br>
              >      > thing that seems to work is to manually
              kill the services on the DB<br>
              >      > servers and restart services on the
              application servers.<br>
              >      ><br>
              >      > The symptom is an immediate spike in the
              Database CPU load. I start<br>
              >      > killing all queries older than 2 minutes,
              but it still usually<br>
              >      > overwhelms the system causing the app
              servers to stop serving<br>
              >     requests.<br>
              >      > The stuck queries are almost always ones
              along the lines of:<br>
              >      ><br>
              >      > -- bib search: #CD_documentLength
              #CD_meanHarmonic #CD_uniqueWords<br>
              >      > from_metarecord(*/BIB_RECORD#/*)
              core_limit(100000)<br>
              >      > badge_orgs(1,138,151)
              estimation_strategy(inclusion) skip_check(0)<br>
              >      > check_limit(1000) sort(1)
              filter_group_entry(1) 1<br>
              >      > site(*/LIBRARY_BRANCH/*) depth(2)<br>
              >      > WITH w AS (<br>
              >      >   WITH */STRING/*_keyword_xq AS (SELECT<br>
              >      >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||<br>
              >      > btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),<br>
              >      > */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||<br>
              >      > to_tsquery('simple', COALESCE(NULLIF( '(' ||<br>
              >      > btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),<br>
              >      > */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq,<br>
              >      >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||<br>
              >      > btrim(regexp_replace(split_date_range(search_normalize<br>
              >      >   00:02:17.319491 | */STRING/* |<br>
              >      ><br>
              >      > And the queries by DorkBot look like they
              could be starting the<br>
              >     query<br>
              >      > since it's using the basket function in the
              OPAC.<br>
              >      ><br>
              >      > "GET<br>
              >      > /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1<br>
              >      > HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"<br>
              >      ><br>
              >      > I've anonymized the output just to be
              cautious. Reports are run<br>
              >     off the<br>
              >      > backup database server, so it cannot be an
              auto generated<br>
              >     report, and it<br>
              >      > doesn't happen often enough for that
              either. At this point I'm<br>
              >     tempted<br>
              >      > to block the IP addresses. What strategies
              are you all using to deal<br>
              >      > with crawlers, and does anyone have an idea
              what is causing this?<br>
              >      > -Jon<br>
              >      ><br>
              >      >
              _______________________________________________<br>
              >      > Evergreen-general mailing list<br>
              >      > <a href="mailto:Evergreen-general@list.evergreen-ils.org" target="_blank">Evergreen-general@list.evergreen-ils.org</a><br>
              >     <mailto:<a href="mailto:Evergreen-general@list.evergreen-ils.org" target="_blank">Evergreen-general@list.evergreen-ils.org</a>><br>
              >      ><br>
              >     <a href="http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general" rel="noreferrer" target="_blank">http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general</a><br>
              >     <<a href="http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general" rel="noreferrer" target="_blank">http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general</a>><br>
              >      ><br>
              > <br>
            </blockquote>
          </div>
        </div>
      </div>
      <br>
    </blockquote>
    <br>
  </div>
</blockquote></div>
</blockquote></div>
</blockquote></div>