Re: Bot scraping/DDoS against ILS and discovery layers

From: Filipe MS Bento (sBIDM/UA) <fsb_at_nyob>
Date: Wed, 26 Mar 2025 14:53:43 +0000
To: CODE4LIB_at_LISTS.CLIR.ORG

I was stopping by just to share the same item you mentioned, Tim, after reading Tod's mention of Cloudflare Turnstile:

	Trapping misbehaving bots in an AI Labyrinth
	https://blog.cloudflare.com/ai-labyrinth/

And the good news is that it "is available on an opt-in basis to all customers, including the Free plan".

Disclaimer: I'm a big fan of Cloudflare services (free plan) and use them across a lot of my own services.

Best,
- Filipe

-----Original Message-----
From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> On Behalf Of Tim Spalding
Sent: 26 March 2025 14:26
To: CODE4LIB_at_LISTS.CLIR.ORG
Subject: Re: [CODE4LIB] Bot scraping/DDoS against ILS and discovery layers

Not a library, but we run several library products and have several bookish websites with many millions of pages.

* We've seen an overall rise in scraping over the last two years. We and others attribute the rise to bots scraping for LLM development.
* We have anti-LLM stuff in our robots.txt, but it doesn't matter. The problem is the bad actors.
* We put ourselves behind Cloudflare several years ago after a multi-day DDoS attack (a real one, with actual extortion demands). The rise of AI scraping has meant we spend time tweaking our Cloudflare settings. CF is free, but we pay for a higher level of service.
* Much or most of the traffic is from China and Singapore, which is where a lot of cloud-computing resources are located. On several occasions we've literally shut down all traffic from China, but, alas, we have a big customer in Singapore.
* We reduced our attack surface. In our case this meant killing off our many translated language sites (LibraryThing.fr, LibraryThing.de,
dk.LibraryThing.com) in favor of having language-pickers on the main site.
* Cloudflare has specific anti-AI filters, as well as a new "maze" feature to lead bots on a merry chase forever.
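
To illustrate the robots.txt point above, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content, the crawler names chosen (GPTBot, CCBot) and the example.org URLs are assumptions for illustration only, not the rules of any site in this thread. It shows what a well-behaved crawler would conclude from such rules; nothing here can force a crawler to comply, which is why the bad actors remain the problem.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; not taken from any real site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text the way a compliant crawler would."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant crawler asks before fetching; a misbehaving one never does.
for agent in ("GPTBot", "CCBot", "SomeBrowser"):
    allowed = parser.can_fetch(agent, "https://example.org/work/123")
    print(f"{agent}: allowed={allowed}")
```

Because the check runs entirely on the crawler's side, these rules only filter out bots that choose to honor them; anything stronger has to happen at the server or CDN.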

Tim

On Wed, Mar 26, 2025 at 10:08 AM Tod Olson <tod_at_uchicago.edu> wrote:

> I can also say that we've seen a fair amount of this sort of 
> scraping/DDoS, it's been happening since late December. (We've also 
> had one or two incidents in that timeframe of harvesting coming from 
> single IPs, which of course are easier to deal with.) We're a FOLIO 
> shop running VuFind locally, and have also seen similar scraping/DDoS 
> against our image database.
>
> We have also implemented Cloudflare's Turnstile to good effect. We are 
> also exploring some Web Application Firewall options, in case things 
> evolve past the point of where Turnstile is effective.
>
> Best,
>
> -Tod
>
> Tod Olson <tod_at_uchicago.edu> (he/him)
> Director of Integrated Library Systems University of Chicago Library
>
> Local Host Committee, Open Repositories 2025 <https://or2025.openrepositories.org>
>
> On Mar 26, 2025, at 6:56 AM, Esmé Cowles <escowles_at_ticklefish.org> wrote:
>
> Eric-
>
> We have seen a lot of bot traffic in the last few weeks, and we are a 
> Clarivate (Alma) shop, though our discovery layer is Blacklight. 
> Something we've noticed as we've tried to block the bot traffic is
> that the spikes of bot activity that have been DoSing us for many
> months now are only part of the picture, and we actually have a very
> high baseline level of bot activity at all times. So much so that
> we're reconsidering our analytics picture because so much of our
> recent historical traffic is undetected bots (e.g., in one report
> China represented about 90% of our traffic).
>
> We've also heard of similar levels of problems from digital 
> collections and other kinds of sites (e.g., SourceHut 
> https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/). So my general 
> impression is that this isn't targeted at one technology stack or at
> libraries, but is hitting basically everybody with any content on the internet.
>
> The thing we've implemented recently, which is the first thing that's
> been really successful, is using Turnstile. Jonathan Rochkind wrote up
> this approach:
>
> https://bibwild.wordpress.com/2025/01/16/using-cloudflare-turnstile-to-protect-certain-pages-on-a-rails-app/
>
> And we adapted that to our setup using Traefik:
>
> https://github.com/pulibrary/princeton_ansible/tree/main/nomad/traefik-wall
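
For readers who have not used Turnstile, the server-side half of the check can be sketched roughly as below. This is not the code from either link above, only a minimal Python illustration under stated assumptions: the function names, the 5-second timeout, the fail-closed policy and the stubbed opener are all hypothetical choices. The endpoint and the secret/response/remoteip form fields follow Cloudflare's documented siteverify call, where the token is the value the browser widget hands back with the form.

```python
import io
import json
import urllib.parse
import urllib.request

SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"


def build_siteverify_request(secret, token, remote_ip=None):
    """Build the POST that asks Cloudflare whether a widget token is genuine."""
    fields = {"secret": secret, "response": token}
    if remote_ip:
        fields["remoteip"] = remote_ip  # optional field
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(SITEVERIFY_URL, data=data, method="POST")


def token_is_valid(secret, token, remote_ip=None, opener=urllib.request.urlopen):
    """Return True only if siteverify reports success (assumed policy: fail closed)."""
    if not token:
        return False
    request = build_siteverify_request(secret, token, remote_ip)
    try:
        with opener(request, timeout=5) as reply:
            payload = json.load(reply)
    except (OSError, ValueError):
        # Network errors and malformed JSON are treated as a failed check.
        return False
    return payload.get("success") is True


# Offline demonstration with a stubbed opener: no network call, no real secret.
def fake_opener(request, timeout=None):
    return io.BytesIO(b'{"success": true}')


print(token_is_valid("dummy-secret", "dummy-token", opener=fake_opener))
```

In a real deployment the call would sit in front of the expensive pages only (search results, facets), with the secret read from configuration rather than hard-coded.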
>
> There has been a fair amount of discussion of this on the Code4Lib and 
> Samvera Slack workspaces (in the #bots channel in each), so I'd 
> encourage anyone who's battling this to check those out.
>
> -Esmé
> --
> Esmé Cowles <escowles_at_princeton.edu>
> Asst. Director, Library Software Engineering Princeton University 
> Library
>
> On Mar 26, 2025, at 7:26 AM, Eric Blevins <000001d7eb585e16-dmarc-request_at_LISTS.CLIR.ORG> wrote:
>
> Good morning,
>
> First time posting to Code4Lib, but have been a watcher for several years.
> I'm curious from strictly a numbers standpoint how many libraries 
> might've been impacted recently (say the last couple of weeks or so) 
> by massive bot harvesting of data, basically resulting in a DDoS 
> attack, against your ILS, Discovery Layers, or other systems. I'm 
> actually also curious if non-Innovative/Clarivate product libraries 
> are seeing similar issues. We are an Innovative/Clarivate product
> shop, so we have some awareness that others with those products were
> impacted. Again, aside from curiosity about whether you're a
> non-Clarivate shop, I'm not looking for specifics, just wondering
> about the scope of the attacks against other institutions/orgs.
>
> Regards,
>
> Eric C. Blevins
> Sr. Manager of Library Technology
> RIT Libraries
> Rochester Institute of Technology
> Email: Eric.Blevins_at_rit.edu
>
>

--
Check out my library at https://www.librarything.com/profile/timspalding