<div dir="ltr">I tried that and still got the loopback address, after restarting services. Any other ideas? And the robots.txt file seems to be doing nothing, which is not much of a surprise. I've reached out to the people who host our network and have control of everything on the other side of the firewall.<div>-Jon<br><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Dec 1, 2021 at 3:57 AM Jason Stephenson <<a href="mailto:jason@sigio.com" target="_blank">jason@sigio.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">JonGeorg,<br>
<br>
If you're using nginx as a proxy, the problem is likely in the
combined configuration of Apache and nginx.

First, make sure that mod_remoteip is installed and enabled for Apache 2.
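
On Debian or Ubuntu, enabling the module is typically (a sketch;
other distributions lay this out differently):

  sudo a2enmod remoteip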

Then, in eg_vhost.conf, find the three lines that begin with
"RemoteIPInternalProxy 127.0.0.1/24" and uncomment them.

Next, see what header Apache checks for the remote IP address. In my
example it is "RemoteIPHeader X-Forwarded-For".
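
Uncommented, the relevant stanza looks roughly like this (127.0.0.1/24
is the stock example netmask; use the range your nginx proxy actually
connects from):

  # Trust the X-Forwarded-For header set by the local nginx proxy
  RemoteIPHeader X-Forwarded-For
  RemoteIPInternalProxy 127.0.0.1/24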

Next, make sure that the following two lines appear in BOTH
"location /" blocks in the nginx configuration:

  proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
  proxy_set_header X-Forwarded-Proto $scheme;
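
In context, each block would look something like this (the proxy_pass
target is a placeholder; keep whatever upstream your configuration
already uses):

  location / {
      proxy_pass https://localhost:7443;  # placeholder upstream
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header X-Forwarded-Proto $scheme;
  }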

After reloading/restarting nginx and Apache, you should start seeing
remote IP addresses in the Apache logs.
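
For example, on a systemd-based install (service names can vary):

  sudo systemctl restart apache2
  sudo systemctl reload nginx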

Hope that helps!
Jason

On 12/1/21 12:53 AM, JonGeorg SageLibrary wrote:
> Because we're behind a firewall, all the addresses display as
> 127.0.0.1. I can talk to the people who administer the firewall,
> though, about blocking IPs. Thanks
> -Jon
>
> On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via Evergreen-general
> <evergreen-general@list.evergreen-ils.org> wrote:
>
> JonGeorg,
>
> Check your Apache logs for the source IP addresses. If you can't find
> them, I can share the correct configuration for Apache with Nginx so
> that you will get the addresses logged.
>
> Once you know the IP address ranges, block them. If you have a
> firewall, I suggest you block them there. If not, you can block them
> in Nginx or in your load balancer configuration, if you have one and
> it allows that.
>
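> For example, a minimal sketch in nginx (these ranges are placeholders;
> substitute the ones you see in your logs):
>
>   # Drop traffic from misbehaving crawler ranges
>   deny 203.0.113.0/24;
>   deny 198.51.100.0/24;
>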
> You may think you want your catalog to show up in search engines, but
> bad bots will lie about who they are. All you can do with misbehaving
> bots is to block them.
>
> HtH,
> Jason
>
> On 11/30/21 9:34 PM, JonGeorg SageLibrary via Evergreen-general wrote:
> > Question. We've been getting hammered by search engine bots [?], but
> > they seem to all query our system at the same time. Enough that it's
> > crashing the app servers. We have a robots.txt file in place. I've
> > increased the crawl delay from 3 to 10 seconds, and have explicitly
> > disallowed the specific bots, but I've seen no change from the worst
> > offenders - Bingbot and UT-Dorkbot. We had over 4k hits from Dorkbot
> > alone from 2pm-5pm today, and over 5k from Bingbot in the same
> > timeframe, all a couple of hours after I made the changes to the
> > robots file and restarted Apache services. Out of 100k entries in
> > the vhost log files in that time frame that doesn't sound like a
> > lot, but the rest of the traffic looks normal. This issue has been
> > happening intermittently [last 3 are 11/30, 11/3, 7/20] for a while,
> > and the only thing that seems to work is to manually kill the
> > services on the DB servers and restart services on the application
> > servers.
> >
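> > For reference, the robots.txt rules are along these lines (a
> > sketch, not the exact file):
> >
> >   User-agent: bingbot
> >   Disallow: /
> >
> >   User-agent: UT-Dorkbot
> >   Disallow: /
> >
> >   User-agent: *
> >   Crawl-delay: 10
> >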
> > The symptom is an immediate spike in the database CPU load. I start
> > killing all queries older than 2 minutes, but it still usually
> > overwhelms the system, causing the app servers to stop serving
> > requests. The stuck queries are almost always ones along the lines
> > of:
> >
> >   -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
> >   from_metarecord(*/BIB_RECORD#/*) core_limit(100000)
> >   badge_orgs(1,138,151) estimation_strategy(inclusion) skip_check(0)
> >   check_limit(1000) sort(1) filter_group_entry(1) 1
> >   site(*/LIBRARY_BRANCH/*) depth(2)
> >   WITH w AS (
> >     WITH */STRING/*_keyword_xq AS (SELECT
> >       (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
> >       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
> >       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||
> >       to_tsquery('simple', COALESCE(NULLIF( '(' ||
> >       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
> >       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq,
> >       (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
> >       btrim(regexp_replace(split_date_range(search_normalize
> >       ...
> >   00:02:17.319491 | */STRING/* |
> >
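> > The "killing all queries older than 2 minutes" step above is a
> > statement along these lines (a sketch, assuming direct psql access
> > to the cluster; use with care):
> >
> >   SELECT pg_terminate_backend(pid)
> >   FROM pg_stat_activity
> >   WHERE state = 'active'
> >     AND pid <> pg_backend_pid()
> >     AND now() - query_start > interval '2 minutes';
> >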
> > And the queries by DorkBot look like they could be what starts the
> > query, since they use the basket function in the OPAC:
> >
> > "GET /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1 HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
> >
> > I've anonymized the output just to be cautious. Reports are run off
> > the backup database server, so it cannot be an auto-generated
> > report, and it doesn't happen often enough for that either. At this
> > point I'm tempted to block the IP addresses. What strategies are you
> > all using to deal with crawlers, and does anyone have an idea what
> > is causing this?
> > -Jon
> >
> _______________________________________________
> Evergreen-general mailing list
> Evergreen-general@list.evergreen-ils.org
> http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general