
Since both of those source IPs are from Alibaba (I spend a lot of time on whois.arin.net and the other regional registries), those two at least are fake. I've seen a lot of obviously fake user agent strings and referrer URLs (which I think is where the https://google.com/ in those entries comes from). I've also seen a lot of presumably hacked residential and business equipment used in botnets; those usually make only a single search or record retrieval request per IP, then another IP follows up with a different request (and never, ever, any js, css, or images), so there are limits to what geo blocking can do. I assume these are related to the "third party scrapers" that Anthropic (or whoever) alluded to a long time ago when they explained why they didn't respect robots.txt, and to the wild-west scraping that everyone with a GeForce and a dream is taking part in before the bubble bursts. All that to say that blocking them is fairly hard without going full Cloudflare (or similar).

One thing we've put together here is this LP: https://bugs.launchpad.net/evergreen/+bug/2113979 . It will usually just throw a 302 at a bot, and because bots aren't actual browsers they just sort of run out of steam, while human users may be redirected once in a session or, more likely, not at all. I complained a lot more about things in that ticket so I won't rehash all of it here, but you may be able to lower your resource use and spend more time serving real users by trying out that patch.

As for your cover 404's, as long as you're not blocking anything from internal ranges and aren't blocking outgoing connections that would keep your system from reaching a cover provider, those are probably just fine. One thing to note: I don't know who you use for cover images, but OpenLibrary has lowered their image request limits so much that we really should remove them as a provider. Unless you contact them directly there's a limit of 10 image retrievals in X time (I don't recall off hand; maybe 1 or more hours?), and because cover image retrievals are run through the server, one person loading a search results page will blow through the limit immediately.

Jason

--
Jason Boyer
Senior System Administrator
Equinox Open Library Initiative
JBoyer@equinoxOLI.org
+1 (877) Open-ILS (673-6457)
https://equinoxOLI.org/

On Mon, Jun 16, 2025 at 12:13 PM JonGeorg SageLibrary via Evergreen-general <evergreen-general@list.evergreen-ils.org> wrote:
One thing I am seeing a ton of is google.com referrer entries rather than GoogleBot:
our_domain:443 47.79.206.79 - - [16/Jun/2025:00:00:09 -0700] "GET /eg/opac/record/2620408?query=Fathers%20Juvenile%20fiction HTTP/1.0" 500 21258 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile Safari/537.36"

our_domain:443 47.79.206.22 - - [16/Jun/2025:00:00:08 -0700] "GET /eg/opac/record/2621426?query=Allingham%20William%201824%201889 HTTP/1.0" 500 21258 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Mobile Safari/537.36"
Do you think those are legitimate patron searches or more likely Google scraping in a different way?

-Jon
On Mon, Jun 16, 2025 at 8:44 AM JonGeorg SageLibrary < jongeorg.sagelibrary@gmail.com> wrote:
But that many? I just tried to reboot the app server and it froze on the advanced key value. I'm wondering if it's unrelated and, like you said, normal, and instead the Docker container managing the SSL cert is locked up or something similar. I've reached out to the people hosting the servers to see if they have any insight. Thank you!

-Jon
On Mon, Jun 16, 2025 at 8:41 AM Bill Erickson <berickxx@gmail.com> wrote:
Hi Jon,
Those would be the patron catalog performing added content lookups. Instead of directly reaching out to the vendor for the data, it leverages the existing web api via internal requests (in asynchronous batches) to collect the data. Those are expected.
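If you want to see one of those lookups by hand, something like the following run on the Evergreen server should reproduce the request you're seeing (a rough sketch only; the path and record ID are copied from the log line quoted below, and plain HTTP to localhost is an assumption based on the port-80 traffic you described):

# Reproduce one of the internal added content lookups by hand.
# Path and record ID are copied from the quoted access log line.
import urllib.error
import urllib.request

req = urllib.request.Request(
    "http://localhost/opac/extras/ac/anotes/html/r/2621889",
    method="HEAD",
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.status, dict(resp.headers))
except urllib.error.HTTPError as e:
    # A 404 here most likely just means there is no annotation
    # content available for that record.
    print(e.code, e.reason)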
-b
On Mon, Jun 16, 2025 at 11:20 AM JonGeorg SageLibrary via Evergreen-general <evergreen-general@list.evergreen-ils.org> wrote:
Greetings. We've been slammed by bot traffic and had to take countermeasures. We geoblocked international traffic at the host firewall level, and recently added an nginx bot blocker for bots hosted on servers in the US and Canada. I then scraped bot IPs out of the Apache logs and began adding the IPs that were still getting through. Yes, I've updated the robots.txt file; they're ignoring it.
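For what it's worth, the scraping was roughly along these lines (a simplified sketch, not exactly what we ran; the log path, internal whitelist, and request threshold here are placeholders):

# Rough sketch: pull high-volume client IPs out of the Apache access log
# and print nginx "deny" lines for anything not already whitelisted.
# The log path, whitelist, and threshold are placeholders.
from collections import Counter
import ipaddress

LOG = "/var/log/apache2/access.log"
WHITELIST = [ipaddress.ip_network("10.0.0.0/8"),
             ipaddress.ip_network("127.0.0.0/8")]
THRESHOLD = 500  # requests seen before an IP is worth a second look

counts = Counter()
with open(LOG) as f:
    for line in f:
        fields = line.split()
        if len(fields) < 2:
            continue
        # Our logs put the vhost ("our_domain:443") first and the client IP
        # second; plain combined-format logs put the IP first.
        ip = fields[1] if ":" in fields[0] else fields[0]
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            continue
        if not any(addr in net for net in WHITELIST):
            counts[ip] += 1

for ip, n in counts.most_common():
    if n >= THRESHOLD:
        print(f"deny {ip};  # {n} requests")

The output is just a ranked list to eyeball before anything actually goes into the blocker config.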
The issue is that after a day or two of reprieve, we started getting a ton of 404's from loopback addresses. I've reverted the blacklist config file to blank and restarted all services on all servers. We're still getting a ton of traffic that appears to be internally generated.
I don't see anything obvious in crontab. Since the traffic appears to be internally generated, the OPAC stays up longer than it normally would given the number of sessions on the load balancer.
Is there an Evergreen or Apache service that indexes the entire catalog? We have our external IP whitelisted. Do internal VLAN IP addresses need to be whitelisted?
Here's an example of the traffic I'm seeing. It's all on port 80, too; external traffic all comes in on 443.
our_domain:80 127.0.0.1 - - [16/Jun/2025:08:18:31 -0700] "HEAD /opac/extras/ac/anotes/html/r/2621889 HTTP/1.1" 404 159 "-" "-"
-Jon
_______________________________________________
Evergreen-general mailing list -- evergreen-general@list.evergreen-ils.org
To unsubscribe send an email to evergreen-general-leave@list.evergreen-ils.org