[Evergreen-general] Dealing with significant traffic increase caused by AI bots
Lolis, John
jlolis at whiteplainsny.gov
Fri Apr 19 14:21:59 EDT 2024
There's been quite a conversation on the CODE4LIB listserv about this
lately...
On Thu, 11 Apr 2024 at 10:43, Scott Prater
<0000007dd2c67ad2-dmarc-request at lists.clir.org> wrote to CODE4LIB:
We've also been seeing some traffic from inconsiderate AI bots.
One of my colleagues came across this site, which tracks and documents AI
bots:
https://darkvisitors.com/
-- Scott
--
Scott Prater
Digital Library Architect
UW Digital Collections Center
University of Wisconsin - Madison
________________________________________
From: Code for Libraries <CODE4LIB at LISTS.CLIR.ORG> on behalf of Lolis, John
<jlolis at WHITEPLAINSNY.GOV>
Sent: Wednesday, April 10, 2024 12:15 PM
To: CODE4LIB at LISTS.CLIR.ORG
Subject: Re: [CODE4LIB] blocking GPTBot?
This *sounds* as if it should help:
https://searchengineland.com/google-extended-crawler-432636
John Lolis
Coordinator of Computer Systems
100 Martine Avenue
White Plains, NY 10601
tel: 1.914.422.1497
fax: 1.914.422.1452
https://whiteplainslibrary.org/
*“I would rather have questions that can’t be answered than answers that
can’t be questioned.”*
— Richard Feynman
<https://click.fourhourmail.com/5qure95xkf7hvvo93wh2/7qh7h8h05vr4zrtz/aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvUmljaGFyZF9GZXlubWFu>,
theoretical physicist and recipient of the Nobel Prize in Physics in 1965
On Mon, 8 Apr 2024 at 16:31, Jason Casden <casden at gmail.com> wrote:
> Thanks for bringing this up, Eben. We've been having a horrible time with
> these bots, including those from previously fairly well-behaved sources
> like Google. They've caused issues ranging from slow response times and
> high system load all the way up to outages for some older systems. So far,
> our systems folks have been playing whack-a-mole with a combination of IP
> range blocks and increasingly detailed robots.txt statements. A group is
> being convened to investigate more comprehensive options so I will be
> watching this thread closely.
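
To make the two stopgaps Jason mentions a little more concrete, here is a minimal Python sketch using only the standard library. The robots.txt rules, the 203.0.113.0/24 range (a documentation-only block), and the helper names are illustrative assumptions, not anyone's production configuration. Keep in mind that robots.txt is advisory: it only helps with crawlers that choose to honor it, which is likely why IP range blocks end up in the mix.

```python
import ipaddress
from urllib.robotparser import RobotFileParser

# Illustrative rules only (an assumption, not a recommended policy):
# shut out two AI-related user agents, keep other crawlers out of /search.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /search
"""

# Hypothetical blocked range; 203.0.113.0/24 is reserved for documentation.
BLOCKED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]


def build_parser(text):
    """Parse robots.txt text so the rules can be checked before deploying."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def ip_is_blocked(ip):
    """Return True if the address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)


parser = build_parser(ROBOTS_TXT)
```

Calling `parser.can_fetch("GPTBot", "/collections/item/1")` shows whether a compliant crawler with that user agent would be allowed to fetch the path, and `ip_is_blocked(...)` does the same check for an address against the range list. Testing rules this way before editing the live robots.txt can take some of the guesswork out of the whack-a-mole.
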
>
> Jason
>
> On Mon, Apr 8, 2024 at 4:18 PM Eben English <eben.english at gmail.com>
> wrote:
>
> > Hi all,
> >
> > I'm wondering if other folks are seeing AI and/or ML-related crawlers like
> > GPTBot accessing your library's website, catalog, digital collections, or
> > other sites.
> >
> > If so, are you blocking or disallowing these crawlers? Has anyone come up
> > with any policies around this?
> >
> > We're debating whether to allow these types of bots to crawl our digital
> > collections, many of which contain large amounts of copyrighted or "no
> > derivatives"-licensed materials. On one hand, these materials are available
> > for public view, but on the other hand the type of use that GPTBot and the
> > like are after (integrating the content into their models) could be
> > characterized as creating a derivative work, which is expressly
> > discouraged.
> >
> > Thanks,
> >
> > Eben English (he/him/his)
> > Digital Repository Services Manager
> > Boston Public Library
> >
>
John Lolis
Coordinator of Computer Systems
100 Martine Avenue
White Plains, NY 10601
tel: 1.914.422.1497
fax: 1.914.422.1452
https://whiteplainslibrary.org/
On Fri, 19 Apr 2024 at 07:05, Jane Sandberg via Evergreen-general <
evergreen-general at list.evergreen-ils.org> wrote:
> Hi Linda,
>
> It's not for Evergreen, but my colleague recently blocked claudebot using
> fail2ban on our load balancer
> <https://github.com/pulibrary/princeton_ansible/commit/6f9009249a168442391d90e2b75028d40a8a9e91>.
> Essentially, fail2ban is configured to watch Nginx's access log, and if
> more than 10 claudebot requests appear within the past minute from a
> particular IP, it automatically blocks all requests from that IP for the
> next 24 hours. I would think that something similar could work for
> Apache's access log.
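
The rule Jane describes (more than 10 claudebot requests from one IP within a minute leads to a 24-hour block) can be sketched in a few lines of Python. This is not fail2ban's own code or configuration, only an illustration of the sliding-window logic; the class and constant names are hypothetical, and a real setup would feed it timestamps, addresses, and user agents parsed from the access log.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

MAX_REQUESTS = 10                 # ban once an IP exceeds this many bot hits
WINDOW = timedelta(minutes=1)     # ...within this sliding window
BAN_TIME = timedelta(hours=24)    # length of the resulting ban
BOT_MARKER = "claudebot"          # matched case-insensitively in the User-Agent


class BotRateBanner:
    """Hypothetical helper tracking per-IP bot hits and active bans."""

    def __init__(self):
        self.hits = defaultdict(deque)   # ip -> timestamps of recent bot hits
        self.banned_until = {}           # ip -> time at which the ban expires

    def is_banned(self, ip, now):
        expiry = self.banned_until.get(ip)
        return expiry is not None and now < expiry

    def record(self, ip, user_agent, now):
        """Record one request; return True if the IP is banned afterwards."""
        if self.is_banned(ip, now):
            return True
        if BOT_MARKER not in user_agent.lower():
            return False
        window = self.hits[ip]
        window.append(now)
        # Discard hits that have aged out of the one-minute window.
        while window and now - window[0] > WINDOW:
            window.popleft()
        if len(window) > MAX_REQUESTS:
            self.banned_until[ip] = now + BAN_TIME
            window.clear()
            return True
        return False
```

For example, feeding eleven ClaudeBot requests from the same address one second apart into `record()` leaves the first ten allowed and bans the address on the eleventh, while requests with other user agents are never counted.
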
>
> Good luck with the bots!
>
> -Jane
>
> On Fri, 19 Apr 2024 at 3:42 a.m., Linda Jansová via Evergreen-general
> (evergreen-general at list.evergreen-ils.org) wrote:
>
>> Dear all,
>>
>> Have any of you encountered extensive crawling by Bytespider and
>> Bytedance (see e.g.
>> https://wordpress.org/support/topic/psa-bytedance-and-bytespider-bots-recommend-blocking/),
>> Claudebot, or other AI bots?
>>
>> If so, do you have a secret recipe for keeping the crawlers from
>> accessing the site?
>>
>> Thank you very much for sharing your experience!
>>
>> Linda
>>
>> _______________________________________________
>> Evergreen-general mailing list
>> Evergreen-general at list.evergreen-ils.org
>> http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general
>>