[Evergreen-general] Dealing with significant traffic increase caused by AI bots
Linda Jansová
linda.jansova at gmail.com
Fri Apr 19 16:12:48 EDT 2024
Thank you for sharing the link to the Dark Visitors website - it looks
very useful, indeed!
Linda
On 4/19/24 20:21, Lolis, John via Evergreen-general wrote:
> There's been quite a conversation on the CODE4LIB listserv about this
> lately...
>
> From: Scott Prater <0000007dd2c67ad2-dmarc-request at lists.clir.org>
> Sent: Thu, 11 Apr, 10:43
> To: CODE4LIB
> We've also been seeing some traffic from inconsiderate AI bots.
>
> One of my colleagues came across this site, which tracks and documents
> AI bots:
>
> https://darkvisitors.com/
>
> -- Scott
>
> --
> Scott Prater
> Digital Library Architect
> UW Digital Collections Center
> University of Wisconsin - Madison
>
>
>
> ________________________________________
> From: Code for Libraries <CODE4LIB at LISTS.CLIR.ORG> on behalf of Lolis,
> John <jlolis at WHITEPLAINSNY.GOV>
> Sent: Wednesday, April 10, 2024 12:15 PM
> To: CODE4LIB at LISTS.CLIR.ORG
> Subject: Re: [CODE4LIB] blocking GPTBot?
>
> This *sounds* as if it should help:
> https://searchengineland.com/google-extended-crawler-432636
>
> John Lolis
> Coordinator of Computer Systems
>
> 100 Martine Avenue
> White Plains, NY 10601
> tel: 1.914.422.1497
> fax: 1.914.422.1452
>
> https://whiteplainslibrary.org/
>
> *“I would rather have questions that can’t be answered than answers that
> can’t be questioned.”*
> — Richard Feynman
> <https://click.fourhourmail.com/5qure95xkf7hvvo93wh2/7qh7h8h05vr4zrtz/aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvUmljaGFyZF9GZXlubWFu>,
> theoretical physicist and recipient of the Nobel Prize in Physics in 1965
>
>
> On Mon, 8 Apr 2024 at 16:31, Jason Casden <casden at gmail.com> wrote:
>
> > Thanks for bringing this up, Eben. We've been having a horrible time with
> > these bots, including those from previously fairly well-behaved sources
> > like Google. They've caused issues ranging from slow response times and
> > high system load all the way up to outages for some older systems. So far,
> > our systems folks have been playing whack-a-mole with a combination of IP
> > range blocks and increasingly detailed robots.txt statements. A group is
> > being convened to investigate more comprehensive options so I will be
> > watching this thread closely.
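
The robots.txt statements mentioned above can be checked before they are deployed. The following is only a sketch: the rule set is an assumption built from the crawler names that appear in this thread (GPTBot, ClaudeBot, Bytespider, Google-Extended), and catalog.example.org is a placeholder host. It uses Python's standard urllib.robotparser to confirm which user agents the rules would turn away. Keep in mind that robots.txt is advisory, so a crawler that ignores it still has to be blocked by other means.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: deny the AI crawlers named in this thread,
# allow everyone else. Adjust the list to your own policy.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Bytespider
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Placeholder URL; any path on the site behaves the same under these rules.
TEST_URL = "https://catalog.example.org/record/1"

parser = build_parser(ROBOTS_TXT)
for agent in ("GPTBot", "ClaudeBot", "Bytespider", "Googlebot"):
    verdict = "allowed" if parser.can_fetch(agent, TEST_URL) else "disallowed"
    print(f"{agent}: {verdict}")
```

The check is done on the short product token (for example "GPTBot"), which is how the standard-library parser matches agents.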
> >
> > Jason
> >
> > On Mon, Apr 8, 2024 at 4:18 PM Eben English <eben.english at gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I'm wondering if other folks are seeing AI and/or ML-related crawlers like
> > > GPTBot accessing your library's website, catalog, digital collections, or
> > > other sites.
> > >
> > > If so, are you blocking or disallowing these crawlers? Has anyone come up
> > > with any policies around this?
> > >
> > > We're debating whether to allow these types of bots to crawl our digital
> > > collections, many of which contain large amounts of copyrighted or "no
> > > derivatives"-licensed materials. On one hand, these materials are available
> > > for public view, but on the other hand the type of use that GPTBot and the
> > > like are after (integrating the content into their models) could be
> > > characterized as creating a derivative work, which is expressly
> > > discouraged.
> > >
> > > Thanks,
> > >
> > > Eben English (he/him/his)
> > > Digital Repository Services Manager
> > > Boston Public Library
> > >
> >
>
> John Lolis
> Coordinator of Computer Systems
>
>
> On Fri, 19 Apr 2024 at 07:05, Jane Sandberg via Evergreen-general
> <evergreen-general at list.evergreen-ils.org> wrote:
>
> Hi Linda,
>
> It's not for Evergreen, but my colleague recently blocked
> claudebot using fail2ban on our load balancer
> <https://github.com/pulibrary/princeton_ansible/commit/6f9009249a168442391d90e2b75028d40a8a9e91>.
> Essentially, fail2ban is configured to watch Nginx's access log,
> and if more than 10 claudebot requests appear within the past
> minute from a particular IP, it automatically blocks all requests
> from that IP for the next 24 hours. I would think that something
> similar could work for Apache's access log.
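
The rule described above (more than 10 claudebot requests from one IP within a minute leads to a 24-hour block) can be sketched in a few lines of Python. This is an illustration of the counting logic only, not the configuration from the linked commit and not how fail2ban is implemented; the class and constant names are made up for the example. In fail2ban itself the same idea is expressed roughly through the jail options maxretry, findtime and bantime.

```python
from collections import defaultdict, deque

# Thresholds taken from the description above; names are illustrative.
MAX_REQUESTS = 10            # more than this many matching requests...
WINDOW_SECONDS = 60          # ...within this many seconds triggers a ban
BAN_SECONDS = 24 * 60 * 60   # ban length: 24 hours


class BotBanTracker:
    """Sliding-window counter per IP for requests whose user agent matches."""

    def __init__(self, needle="claudebot"):
        self.needle = needle.lower()
        self.hits = defaultdict(deque)  # ip -> timestamps of matching requests
        self.banned_until = {}          # ip -> time at which the ban expires

    def is_banned(self, ip, now):
        return self.banned_until.get(ip, 0) > now

    def record(self, ip, user_agent, now):
        """Record one request; return True if the IP is banned afterwards."""
        if self.is_banned(ip, now):
            return True
        if self.needle not in user_agent.lower():
            return False
        window = self.hits[ip]
        window.append(now)
        # Forget requests that are older than the sliding window.
        while window and window[0] <= now - WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_REQUESTS:
            self.banned_until[ip] = now + BAN_SECONDS
            window.clear()
            return True
        return False


# Usage: feed it (ip, user agent, timestamp) tuples parsed from an access log.
tracker = BotBanTracker()
for second in range(11):
    banned = tracker.record("192.0.2.1", "ClaudeBot/1.0", second)
print("banned after 11 requests in 11 seconds:", banned)
```

Timestamps are plain seconds here so the logic is easy to follow; with a real Nginx or Apache access log they would come from parsing each log line.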
>
> Good luck with the bots!
>
> -Jane
>
> On Fri, 19 Apr 2024 at 3:42 a.m., Linda Jansová via
> Evergreen-general (evergreen-general at list.evergreen-ils.org) wrote:
>
> Dear all,
>
> Have any of you encountered extensive crawling by Bytespider and
> Bytedance (see, e.g.,
> https://wordpress.org/support/topic/psa-bytedance-and-bytespider-bots-recommend-blocking/),
> Claudebot, or other AI bots?
>
> If so, do you have any secret recipe for blocking the crawlers from
> accessing the site?
>
> Thank you very much for sharing your experience!
>
> Linda
>
> _______________________________________________
> Evergreen-general mailing list
> Evergreen-general at list.evergreen-ils.org
> http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general
>