[Evergreen-dev] Problematic bot traffic
Blake Graham-Henderson
blake at mobiusconsortium.org
Thu Feb 13 16:46:42 EST 2025
All,
When the thread was started, I almost replied with the Ars Technica
article that Josh linked. But I decided not to put it out there until I
had set up a test system to see if I could get that code working. A
tarpit, I think, serves them right. And, of course, the whole issue is
destined to share the fate of spam and spam filters, forever and ever.
It was a serendipitously timed article. Its existence at this moment in
time signals to me that this isn't a "just us" problem. It's the entire
planet.
-Blake-
Conducting Magic
Will consume any data format
MOBIUS
On 2/13/2025 3:10 PM, Josh Stompro via Evergreen-dev wrote:
> Jeff, thanks for bringing this up on the list.
>
> We are seeing a lot of requests like
> "GET /eg/opac/mylist/delete?anchor=record_184821&record=184821" from
> never-before-seen IPs, and they make 1-12 requests and then stop.
>
> They usually seem to have a random, out-of-date Chrome version in the
> user agent string:
> Chrome/88.0.4324.192
> Chrome/86.0.4240.75
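
As an illustration of the pattern described above, here is a minimal
Python sketch that flags log lines combining a mylist/delete request
with an old Chrome version. It assumes combined-format access log lines.
The names MIN_CHROME_MAJOR and is_suspicious are hypothetical, and the
cutoff of 100 is an arbitrary assumption, not a value anyone in this
thread reported using.

```python
import re

# Assumption: anything below this Chrome major version counts as "out of date".
# The cutoff is hypothetical and should be tuned against real traffic.
MIN_CHROME_MAJOR = 100

# Assumption: log lines contain a quoted request such as "GET /path HTTP/1.1".
REQUEST_RE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+"')
CHROME_RE = re.compile(r"Chrome/(?P<major>\d+)\.")


def is_suspicious(log_line: str) -> bool:
    """Return True for a mylist/delete request sent with a stale Chrome UA."""
    request = REQUEST_RE.search(log_line)
    chrome = CHROME_RE.search(log_line)
    if not request or not chrome:
        return False
    stale = int(chrome.group("major")) < MIN_CHROME_MAJOR
    return stale and request.group("path").startswith("/eg/opac/mylist/delete")
```

A filter like this will also match real patrons on old browsers, so it
is probably better suited to counting and reviewing hits than to
blocking outright.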
>
> I've been trying to slow down the bots by collecting logs, picking out
> the obvious patterns, and blocking netblocks for non-US ranges.
> ipinfo.io offers a free country & ASN database download
> (https://ipinfo.io/products/free-ip-database) that I've been using to
> look up the ranges and countries. I would be happy to share a link to
> our current blocklist, which has 10K non-US ranges.
>
> I've also been reporting the non-US bot activity to
> https://www.abuseipdb.com/ just to bring some visibility to these bad
> bots. I noticed initially that many of the IPs that were hitting us
> didn't seem to be listed on any blocklists already, so I figured some
> reporting might help. I'm curious whether Evergreen sites are getting
> hit from the same IPs; if so, an Evergreen-specific blocklist would be
> useful. If you look up your bot IPs on abuseipdb.com, you can see if
> I've already reported any of them.
>
> I've also been making use of blocklists from
> https://iplists.firehol.org/, such as:
> https://iplists.firehol.org/files/cleantalk_30d.ipset
> https://iplists.firehol.org/files/botscout_7d.ipset
> https://iplists.firehol.org/files/firehol_abusers_1d.netset
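
For anyone who wants to check log IPs against lists like these offline,
here is a rough Python sketch. It assumes the list text has one IP or
CIDR range per line with "#" marking comments. The helper names
load_networks and is_blocked are hypothetical, and the sample addresses
are documentation ranges, not real list entries.

```python
import ipaddress


def load_networks(lines):
    """Parse blocklist text: one IP or CIDR per line, '#' starts a comment."""
    networks = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            # strict=False tolerates entries whose host bits are set.
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def is_blocked(ip, networks):
    """Return True if the address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


# Made-up sample data using documentation ranges.
sample = ["# example list", "198.51.100.0/24", "203.0.113.9", ""]
nets = load_networks(sample)
print(is_blocked("198.51.100.42", nets), is_blocked("203.0.113.10", nets))
```

The linear scan is fine for spot checks. Actual enforcement would still
happen in the proxy or firewall.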
>
> We are using HAProxy so I did some looking into the CrowdSec HAProxy
> Bouncer (https://docs.crowdsec.net/u/bouncers/haproxy/) but I'm not
> sure that would help since these IPs don't seem to be on blocklists.
> But I may just not quite understand how CrowdSec is supposed to work.
>
> HAProxy Enterprise has a ReCaptcha module that I think would allow us
> to send any non-US connections that haven't connected before through a
> captcha, but the price for HAProxy Enterprise is out of our budget.
> https://www.haproxy.com/blog/announcing-haproxy-enterprise-3-0#new-captcha-and-saml-modules
>
> There is also a fairly up-to-date project for adding captchas through
> HAProxy at https://github.com/ndbiaw/haproxy-protection. This looks
> promising as a transparent method: it requires new connections to
> perform a JavaScript proof-of-work calculation before allowing access.
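
To make the proof-of-work idea concrete, here is a generic
hashcash-style sketch in Python. It is not the scheme that
haproxy-protection implements (I have not verified its internals); the
challenge format, the DIFFICULTY_BITS value, and the function names are
all assumptions made for illustration. The server hands out a random
challenge, the client searches for a nonce whose SHA-256 digest starts
with enough zero bits, and the server verifies with a single hash.

```python
import hashlib
import secrets

# Assumption: 12 leading zero bits, roughly 4096 hashes on average to solve.
DIFFICULTY_BITS = 12


def make_challenge() -> str:
    """Server side: issue a random, unpredictable challenge string."""
    return secrets.token_hex(16)


def verify(challenge: str, nonce: int, bits: int = DIFFICULTY_BITS) -> bool:
    """Server side: one hash to check that the digest has `bits` leading zeros."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - bits) == 0


def solve(challenge: str, bits: int = DIFFICULTY_BITS) -> int:
    """Client side: brute-force the smallest nonce that passes verify()."""
    nonce = 0
    while not verify(challenge, nonce, bits):
        nonce += 1
    return nonce


challenge = make_challenge()
nonce = solve(challenge)
print(verify(challenge, nonce))
```

The asymmetry is the point: solving costs thousands of hashes per new
connection, while checking costs one.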
>
> We were taken out by ChatGPT bots back in December; their netblocks
> were a bit easier to block since they were not as spread out. I
> recently saw this article about how some people are fighting back
> against bots that ignore robots.txt:
> https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/
>
> Josh
>
> On Mon, Jan 27, 2025 at 6:33 PM Jeff Davis via Evergreen-dev
> <evergreen-dev at list.evergreen-ils.org> wrote:
>
> Hi folks,
>
> Our Evergreen environment has been experiencing a
> higher-than-usual volume of unwanted bot traffic in recent months.
> Much of this traffic looks like webcrawlers hitting
> Evergreen-specific URLs from an enormous number of different IP
> addresses. Judging from discussion in IRC last week, it sounds
> like other EG admins have been seeing the same thing. Does anyone
> have any recommendations for managing this traffic and mitigating
> its impact?
>
> Some solutions that have been suggested/implemented so far:
> - Geoblocking entire countries.
> - Using Cloudflare's proxy service. There's some trickiness in
> getting this to work with Evergreen.
> - Putting certain OPAC pages behind a captcha.
> - Deploying publicly-available blocklists of "bad bot"
> IPs/useragents/etc. (good but limited, and not EG-specific).
> - Teaching EG to identify and deal with bot traffic itself (but
> arguably this should happen before the traffic hits Evergreen).
>
> My organization is currently evaluating CrowdSec as another
> possible solution. Any opinions on any of these approaches?
> --
> Jeff Davis
> BC Libraries Cooperative
> _______________________________________________
> Evergreen-dev mailing list
> Evergreen-dev at list.evergreen-ils.org
> http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-dev