
Since both of those source IPs are from Alibaba (I spend a lot of time on whois.arin.net and the other regional registries), those two at least are fake. I've seen a lot of obviously fake user agent strings and referrer URLs (which I think is where the https://google.com/ in those entries comes from). I've also seen a lot of presumably hacked residential and business equipment used in botnets; those usually make only a single search or record retrieval request per IP, then another IP follows up with a different request (and never, ever, any js, css, or images), so there are limits to what geo blocking can do. I assume these are related to the "third party scrapers" that Anthropic (or whoever) alluded to a long time ago when they explained why they didn't respect robots.txt, and to the wild-west scraping that everyone with a GeForce and a dream is taking part in before the bubble bursts. All that to say that blocking them is fairly hard without going full Cloudflare (or similar).

One thing we've put together here is this LP: https://bugs.launchpad.net/evergreen/+bug/2113979 . It will usually just throw a 302 at a bot, and because bots aren't actual browsers they just sort of run out of steam, while human users may be redirected once in a session or, more likely, not at all. I complained a lot more about things in that ticket so I won't rehash all of it here, but you may be able to lower your resource use and spend more time serving real users by trying out that patch.

As for your cover 404's, as long as you're not blocking anything from internal ranges and aren't blocking outgoing connections that would keep your system from reaching a cover provider, those are probably just fine. One thing to note: I don't know who you use for cover images, but OpenLibrary has lowered their image request limits so much that we really should remove them as a provider. Unless you contact them directly there's a limit of 10 image retrievals in X time (I don't recall off hand; maybe 1 or more hours?), and because cover image retrievals are run through the server, one person loading a search results page will blow through the limit immediately.

Jason

--
Jason Boyer
Senior System Administrator
Equinox Open Library Initiative
JBoyer@equinoxOLI.org
+1 (877) Open-ILS (673-6457)
https://equinoxOLI.org/

On Mon, Jun 16, 2025 at 12:13 PM JonGeorg SageLibrary via Evergreen-general <evergreen-general@list.evergreen-ils.org> wrote:
One thing I am seeing a ton of is google.com referrer entries rather than GoogleBot:
our_domain:443 47.79.206.79 - - [16/Jun/2025:00:00:09 -0700] "GET /eg/opac/record/2620408?query=Fathers%20Juvenile%20fiction HTTP/1.0" 500 21258 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile Safari/537.36"

our_domain:443 47.79.206.22 - - [16/Jun/2025:00:00:08 -0700] "GET /eg/opac/record/2621426?query=Allingham%20William%201824%201889 HTTP/1.0" 500 21258 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Mobile Safari/537.36"
Do you think those are legitimate patron searches or more likely Google scraping in a different way?

-Jon
On Mon, Jun 16, 2025 at 8:44 AM JonGeorg SageLibrary < jongeorg.sagelibrary@gmail.com> wrote:
But that many? I just tried to reboot the app server and it froze on the advanced key value. I'm wondering if it's unrelated and, like you said, normal, and instead the Docker container managing the SSL cert is locked up or something similar. I've reached out to the people hosting the servers to see if they have any insight. Thank you!

-Jon
On Mon, Jun 16, 2025 at 8:41 AM Bill Erickson <berickxx@gmail.com> wrote:
Hi Jon,
Those would be the patron catalog performing added content lookups. Instead of directly reaching out to the vendor for the data, it leverages the existing web api via internal requests (in asynchronous batches) to collect the data. Those are expected.
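If you want to see one of those lookups by hand, something like the following run on the Evergreen server should reproduce the request you're seeing (a rough sketch only; the path and record ID are copied from the log line quoted below, and plain HTTP to localhost is an assumption based on the port-80 traffic you described):

# Reproduce one of the internal added content lookups by hand.
# Path and record ID are copied from the quoted access log line.
import urllib.error
import urllib.request

req = urllib.request.Request(
    "http://localhost/opac/extras/ac/anotes/html/r/2621889",
    method="HEAD",
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.status, dict(resp.headers))
except urllib.error.HTTPError as e:
    # A 404 here most likely just means there is no annotation
    # content available for that record.
    print(e.code, e.reason)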
-b
On Mon, Jun 16, 2025 at 11:20 AM JonGeorg SageLibrary via Evergreen-general <evergreen-general@list.evergreen-ils.org> wrote:
Greetings. We've been slammed by bot traffic and had to take countermeasures. We geoblocked international traffic at the host firewall level, and recently added an nginx bot blocker for bots hosted on servers in the US and Canada. I then scraped bot IPs out of the Apache logs and began adding the IPs that were still getting through. Yes, I've updated the robots.txt file; they're ignoring it.
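For what it's worth, the scraping was roughly along these lines (a simplified sketch, not exactly what we ran; the log path, internal whitelist, and request threshold here are placeholders):

# Rough sketch: pull high-volume client IPs out of the Apache access log
# and print nginx "deny" lines for anything not already whitelisted.
# The log path, whitelist, and threshold are placeholders.
from collections import Counter
import ipaddress

LOG = "/var/log/apache2/access.log"
WHITELIST = [ipaddress.ip_network("10.0.0.0/8"),
             ipaddress.ip_network("127.0.0.0/8")]
THRESHOLD = 500  # requests seen before an IP is worth a second look

counts = Counter()
with open(LOG) as f:
    for line in f:
        fields = line.split()
        if len(fields) < 2:
            continue
        # Our logs put the vhost ("our_domain:443") first and the client IP
        # second; plain combined-format logs put the IP first.
        ip = fields[1] if ":" in fields[0] else fields[0]
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            continue
        if not any(addr in net for net in WHITELIST):
            counts[ip] += 1

for ip, n in counts.most_common():
    if n >= THRESHOLD:
        print(f"deny {ip};  # {n} requests")

The output is just a ranked list to eyeball before anything actually goes into the blocker config.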
The issue is that after a day or two of reprieve, we started getting a ton of 404's from loopback addresses. I've reverted the blacklist config file to blank and restarted all services on all servers. We're still getting a ton of traffic that appears to be internally generated.
I don't see anything obvious in crontab. Since the traffic appears to be internally generated, the OPAC stays up longer than it normally would given the number of sessions on the load balancer.
Is there an Evergreen or Apache service that indexes the entire catalog? We have our external IP whitelisted. Do internal VLAN IP addresses need to be whitelisted?
Here's an example of the traffic I'm seeing. It's all on port 80, too; external traffic all comes in on 443.
our_domain:80 127.0.0.1 - - [16/Jun/2025:08:18:31 -0700] "HEAD /opac/extras/ac/anotes/html/r/2621889 HTTP/1.1" 404 159 "-" "-"
-Jon
_______________________________________________
Evergreen-general mailing list -- evergreen-general@list.evergreen-ils.org
To unsubscribe send an email to evergreen-general-leave@list.evergreen-ils.org