<div dir="ltr"><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:#000000">There's been quite a conversation on the CODE4LIB listserv about this lately...<br></div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:#000000"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif;font-size:small;color:#000000">Scott Prater <<a href="mailto:0000007dd2c67ad2-dmarc-request@lists.clir.org">0000007dd2c67ad2-dmarc-request@lists.clir.org</a>><br> <br>Thu, 11 Apr, 10:43 (8 days ago)<br> <br>to CODE4LIB<br>We've also been seeing some traffic from inconsiderate AI bots.<br><br>One of my colleagues came across this site, which tracks and documents AI bots:<br><br><a href="https://darkvisitors.com/">https://darkvisitors.com/</a><br><br>-- Scott<br><br>--<br>Scott Prater<br>Digital Library Architect<br>UW Digital Collections Center<br>University of Wisconsin - Madison<br><br><br><br>________________________________________<br>From: Code for Libraries <<a href="mailto:CODE4LIB@LISTS.CLIR.ORG">CODE4LIB@LISTS.CLIR.ORG</a>> on behalf of Lolis, John <<a href="mailto:jlolis@WHITEPLAINSNY.GOV">jlolis@WHITEPLAINSNY.GOV</a>><br>Sent: Wednesday, April 10, 2024 12:15 PM<br>To: <a href="mailto:CODE4LIB@LISTS.CLIR.ORG">CODE4LIB@LISTS.CLIR.ORG</a><br>Subject: Re: [CODE4LIB] blocking GPTBot?<br><br>This *sounds* as if it should help:<br><a href="https://urldefense.com/v3/__https://searchengineland.com/google-extended-crawler-432636__;!!Mak6IKo!Pm6vbeyDLkzwxaEhcIBmaI0pK1d7U0GtguiIAgWmNzfNyOyR1m3n9iypyhqwZH3QxxIfNMIETf94S2_ioTtPPtfncyM$">https://urldefense.com/v3/__https://searchengineland.com/google-extended-crawler-432636__;!!Mak6IKo!Pm6vbeyDLkzwxaEhcIBmaI0pK1d7U0GtguiIAgWmNzfNyOyR1m3n9iypyhqwZH3QxxIfNMIETf94S2_ioTtPPtfncyM$</a><br><br>John Lolis<br>Coordinator of Computer Systems<br><br>100 Martine Avenue<br>White Plains, NY 10601<br>tel: 1.914.422.1497<br>fax: 1.914.422.1452<br><br><a 
href="https://urldefense.com/v3/__https://whiteplainslibrary.org/__;!!Mak6IKo!Pm6vbeyDLkzwxaEhcIBmaI0pK1d7U0GtguiIAgWmNzfNyOyR1m3n9iypyhqwZH3QxxIfNMIETf94S2_ioTtPwb7-RSk$">https://urldefense.com/v3/__https://whiteplainslibrary.org/__;!!Mak6IKo!Pm6vbeyDLkzwxaEhcIBmaI0pK1d7U0GtguiIAgWmNzfNyOyR1m3n9iypyhqwZH3QxxIfNMIETf94S2_ioTtPwb7-RSk$</a><br><br>*“I would rather have questions that can’t be answered than answers that<br>can’t be questioned.”*<br>— Richard Feynman<br><<a href="https://urldefense.com/v3/__https://click.fourhourmail.com/5qure95xkf7hvvo93wh2/7qh7h8h05vr4zrtz/aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvUmljaGFyZF9GZXlubWFu__;!!Mak6IKo!Pm6vbeyDLkzwxaEhcIBmaI0pK1d7U0GtguiIAgWmNzfNyOyR1m3n9iypyhqwZH3QxxIfNMIETf94S2_ioTtP3X91XJ0$">https://urldefense.com/v3/__https://click.fourhourmail.com/5qure95xkf7hvvo93wh2/7qh7h8h05vr4zrtz/aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvUmljaGFyZF9GZXlubWFu__;!!Mak6IKo!Pm6vbeyDLkzwxaEhcIBmaI0pK1d7U0GtguiIAgWmNzfNyOyR1m3n9iypyhqwZH3QxxIfNMIETf94S2_ioTtP3X91XJ0$</a> >,<br>theoretical physicist and recipient of the Nobel Prize in Physics in 1965<br><br><br>On Mon, 8 Apr 2024 at 16:31, Jason Casden <<a href="mailto:casden@gmail.com">casden@gmail.com</a>> wrote:<br><br>> Thanks for bringing this up, Eben. We've been having a horrible time with<br>> these bots, including those from previously fairly well-behaved sources<br>> like Google. They've caused issues ranging from slow response times and<br>> high system load all the way up to outages for some older systems. So far,<br>> our systems folks have been playing whack-a-mole with a combination of IP<br>> range blocks and increasingly detailed robots.txt statements. 
A group is<br>> being convened to investigate more comprehensive options so I will be<br>> watching this thread closely.<br>><br>> Jason<br>><br>> On Mon, Apr 8, 2024 at 4:18 PM Eben English <<a href="mailto:eben.english@gmail.com">eben.english@gmail.com</a>><br>> wrote:<br>><br>> > Hi all,<br>> ><br>> > I'm wondering if other folks are seeing AI and/or ML-related crawlers<br>> like<br>> > GPTBot accessing your library's website, catalog, digital collections, or<br>> > other sites.<br>> ><br>> > If so, are you blocking or disallowing these crawlers? Has anyone come up<br>> > with any policies around this?<br>> ><br>> > We're debating whether to allow these types of bots to crawl our digital<br>> > collections, many of which contain large amounts of copyrighted or "no<br>> > derivatives"-licensed materials. On one hand, these materials are<br>> available<br>> > for public view, but on the other hand the type of use that GPTBot and<br>> the<br>> > like are after (integrating the content into their models) could be<br>> > characterized as creating a derivative work, which is expressly<br>> > discouraged.<br>> ><br>> > Thanks,<br>> ><br>> > Eben English (he/him/his)<br>> > Digital Repository Services Manager<br>> > Boston Public Library<br>> ><br>></div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div><span style="font-family:"trebuchet ms",sans-serif"><br></span></div><div><span style="font-family:"trebuchet ms",sans-serif">John Lolis</span><br></div><div><font face="'trebuchet ms', sans-serif">Coordinator of Computer Systems</font></div></div></div></div></div></div></div></div></div><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div 
dir="ltr"><div dir="ltr"><div><font face="'trebuchet ms', sans-serif"><img src="https://ci3.googleusercontent.com/mail-sig/AIorK4yAa9EjDQ2k8l4yaH291EHSWgLbJdQy0f9fH1JVWo2xP7YdQ39AmkOBuxTJTUeD9AfPu5AgLXmd1tQijRk7sw9bgFk2NgaYP6AHCa-eCnMnQRY"><br></font></div><div><span style="font-family:"trebuchet ms",sans-serif">100 Martine Avenue</span><br></div><div><span style="font-family:"trebuchet ms",sans-serif">White Plains, NY 10601</span></div></div></div></div></div></div></div></div></div><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div><font face="'trebuchet ms', sans-serif"></font></div><div><font face="'trebuchet ms', sans-serif">tel: 1.914.422.1497</font></div><div><font face="'trebuchet ms', sans-serif">fax: 1.914.422.1452</font></div><div><font face="'trebuchet ms', sans-serif"><br></font></div><div><font face="'trebuchet ms', sans-serif"><a href="https://whiteplainslibrary.org/" target="_blank">https://whiteplainslibrary.org/</a></font></div><div><br></div><div><span style="color:rgb(45,45,47);font-family:adelle,sans-serif,Helvetica"><i>“I would rather have questions that can’t be answered than answers that can’t be questioned.”</i><br></span><font size="1"><span style="color:rgb(45,45,47);font-family:adelle,sans-serif,Helvetica">— </span><a href="https://click.fourhourmail.com/5qure95xkf7hvvo93wh2/7qh7h8h05vr4zrtz/aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvUmljaGFyZF9GZXlubWFu" rel="noopener noreferrer" style="color:rgb(8,117,193);font-family:adelle,sans-serif,Helvetica" target="_blank">Richard Feynman</a><span style="color:rgb(45,45,47);font-family:adelle,sans-serif,Helvetica">, theoretical physicist and recipient of the Nobel Prize in Physics in 1965</span></font><br></div><font size="2" face="Verdana, Arial, Helvetica"><span 
style="font-family:georgia,serif"></span></font></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 19 Apr 2024 at 07:05, Jane Sandberg via Evergreen-general <<a href="mailto:evergreen-general@list.evergreen-ils.org">evergreen-general@list.evergreen-ils.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Linda,<div><br></div><div>It's not for Evergreen, but my colleague <a href="https://github.com/pulibrary/princeton_ansible/commit/6f9009249a168442391d90e2b75028d40a8a9e91" target="_blank">recently blocked claudebot using fail2ban on our load balancer</a>. Essentially, fail2ban is configured to watch Nginx's access log, and if more than 10 claudebot requests appear within the past minute from a particular IP, it automatically blocks all requests from that IP for the next 24 hours. I would think that something similar could work for Apache's access log.</div><div><br></div><div>Good luck with the bots!</div><div><br></div><div> -Jane</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">El vie, 19 abr 2024 a la(s) 3:42 a.m., Linda Jansová via Evergreen-general (<a href="mailto:evergreen-general@list.evergreen-ils.org" target="_blank">evergreen-general@list.evergreen-ils.org</a>) escribió:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Dear all,<br>
<br>
Have any of you encountered extensive crawling by Bytespider and <br>
Bytedance (see, e.g., <br>
<a href="https://wordpress.org/support/topic/psa-bytedance-and-bytespider-bots-recommend-blocking/" rel="noreferrer" target="_blank">https://wordpress.org/support/topic/psa-bytedance-and-bytespider-bots-recommend-blocking/</a>), <br>
Claudebot or other AI bots?<br>
<br>
If so, do you have a secret recipe for blocking these crawlers from <br>
accessing the site?<br>
<br>
Thank you very much for sharing your experience!<br>
<br>
Linda<br>
<br>
_______________________________________________<br>
Evergreen-general mailing list<br>
<a href="mailto:Evergreen-general@list.evergreen-ils.org" target="_blank">Evergreen-general@list.evergreen-ils.org</a><br>
<a href="http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general" rel="noreferrer" target="_blank">http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general</a><br>
</blockquote></div>
</blockquote></div>
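<div><br></div>
<div>For anyone who wants to reason about the rule Jane describes above (more than 10 claudebot requests from one IP within a minute leads to a 24-hour block), here is a minimal Python sketch of that logic. It is not fail2ban's implementation, nor the configuration in the linked Princeton commit; the class name, constants, and user-agent check are all hypothetical and only illustrate the sliding-window idea.</div>

```python
from collections import defaultdict, deque

# Thresholds taken from the description in the thread; all names are hypothetical.
MAX_REQUESTS = 10            # more than this many matching requests...
WINDOW_SECONDS = 60          # ...within this many seconds triggers a ban
BAN_SECONDS = 24 * 60 * 60   # ban duration: 24 hours
BOT_MARKER = "claudebot"     # case-insensitive substring of the user agent


class BotRateBanner:
    """Tracks matching requests per IP and decides when an IP is banned."""

    def __init__(self):
        self._hits = defaultdict(deque)  # ip -> timestamps of recent matching requests
        self._banned_until = {}          # ip -> time at which the ban expires

    def is_banned(self, ip, now):
        return self._banned_until.get(ip, 0) > now

    def record(self, ip, user_agent, now):
        """Record one request; return True if the IP is banned afterwards."""
        if self.is_banned(ip, now):
            return True
        if BOT_MARKER not in user_agent.lower():
            return False  # only requests from the targeted bot are counted
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and hits[0] <= now - WINDOW_SECONDS:
            hits.popleft()
        if len(hits) > MAX_REQUESTS:
            self._banned_until[ip] = now + BAN_SECONDS
            hits.clear()
            return True
        return False


if __name__ == "__main__":
    banner = BotRateBanner()
    ua = "Mozilla/5.0 (compatible; ClaudeBot/1.0)"  # simplified, illustrative user agent
    for second in range(11):
        banned = banner.record("203.0.113.7", ua, second)
    print("banned after 11 requests in 11 seconds:", banned)
```

<div>In a real deployment the timestamps, IPs, and user agents would come from the web server's access log, and the block itself would be enforced by the firewall or load balancer rather than in application code.</div>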