Using lots of little tools to aggressively reject the bots

41 points by archargelod | 13 comments | 5/31/2025, 8:06:21 AM | lambdacreate.com ↗

Comments (13)

Proofread0592 · 1h ago
It is nice that the AI crawler bots honestly fill out the `User-Agent` header, but I'm shocked that they were the source of that much traffic. 99% of all websites do not change often enough to warrant this much crawling, let alone a dev blog.
grishka · 1h ago
They also respect robots.txt.

However, I've also seen reports that after getting blocked one way or another, they start crawling with browser user-agents from residential IPs. But it might also be someone else misrepresenting their crawlers as OpenAI/Amazon/Facebook/whatever to begin with.

rovr138 · 1h ago
We ended up writing rules similar to the ones in the article. Ours were based purely on frequency.

While we were rate limiting bots based on UA, we ended up also having to apply wider rules because traffic started spiking from other places.

I can't say whether it's the same traffic shifting, but there's definitely a large amount of automated traffic that doesn't identify itself properly.

Across all your web properties, look at historic traffic to calculate <hits per IP> in <time period>. Then look at the new data and see how it's shifting. You should be able to separate the real traffic from the automated traffic very quickly.
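
A rough sketch of that calculation, assuming nginx's default "combined" log format and hypothetical log paths (adjust the regex, window size, and paths for your setup):

```python
# Count hits per IP per hour from an nginx "combined" access log, so a
# historical baseline can be compared against current traffic.
import re
from collections import Counter
from datetime import datetime

# Combined format starts with: <ip> - <user> [10/May/2025:13:55:36 +0000] ...
LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\]')

def hits_per_ip_per_hour(path):
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            m = LINE.match(line)
            if not m:
                continue
            ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
            counts[(m["ip"], ts.strftime("%Y-%m-%d %H:00"))] += 1
    return counts

if __name__ == "__main__":
    baseline = hits_per_ip_per_hour("access.log.1")  # historic traffic
    current = hits_per_ip_per_hour("access.log")     # new data
    ceiling = max(baseline.values(), default=0)      # busiest historic IP-hour
    for (ip, hour), n in current.most_common(20):
        if n > ceiling:
            print(f"{ip}: {n} requests in {hour} (historic max was {ceiling})")
```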

vachina · 46m ago
I’ve turned off logging on my servers precisely because the logs were growing too quickly due to these bots. They’re that relentless: they fill every form and even hit APIs that are otherwise only reachable by clicking around the site. Anthropic, OpenAI, and Facebook are still scraping to this day.
reconnecting · 1h ago
Creator of tirreno [1] here.

While our platform is primarily designed for live, logged-in users, it also works well for bot detection and blocking.

We anonymize IP addresses by replacing the last octet with an asterisk, effectively grouping the same subnet under a single account. You can then use the built-in rule engine to automatically generate blacklists based on specific conditions, such as excessive 500 or 404 errors, brute-force login attempts, or traffic from data center IPs.
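
As a simplified illustration of the grouping described above (not tirreno's actual code, just the general idea):

```python
# Group IPv4 addresses by /24 subnet: the last octet is replaced with "*",
# so one "account" covers the whole subnet.
from collections import defaultdict

def anonymize(ip: str) -> str:
    """Replace the last octet of an IPv4 address with '*'."""
    octets = ip.split(".")
    return ".".join(octets[:3] + ["*"]) if len(octets) == 4 else ip

def group_by_subnet(ips):
    groups = defaultdict(list)
    for ip in ips:
        groups[anonymize(ip)].append(ip)
    return dict(groups)

print(group_by_subnet(["203.0.113.7", "203.0.113.99", "198.51.100.23"]))
# {'203.0.113.*': ['203.0.113.7', '203.0.113.99'], '198.51.100.*': ['198.51.100.23']}
```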

Finally, you can integrate the tirreno blacklist API into your application logic to redirect unwanted traffic to an error page.

Bonus: a dashboard [2] is available to help you monitor activity and fine-tune the blacklist to avoid blocking legitimate users.

[1] https://github.com/tirrenotechnologies/tirreno

[2] https://play.tirreno.com/login (admin/tirreno)

reconnecting · 45m ago
We also have work in progress to block bots based on publicly available IP ranges through the same dashboard. Any suggestions are welcome.
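
A sketch of what the check could look like on the consuming side; the prefixes below are RFC 5737 documentation ranges standing in for whatever ranges the crawler operators actually publish:

```python
# Check whether an address falls inside a set of published crawler CIDR ranges.
import ipaddress

# Placeholder ranges; substitute the ranges published by each crawler operator.
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_known_crawler(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PUBLISHED_RANGES)

print(is_known_crawler("198.51.100.42"))  # True
print(is_known_crawler("203.0.113.9"))    # False
```
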
loloquwowndueo · 2h ago
Nice - I like that most of the AI scraper bot blocking was done using Nginx configuration. Still, once fail2ban was added to the mix (meaning: an additional service and configuration), I wonder whether something like Anubis (https://anubis.techaro.lol/) would have been more automatic. I’ve seen Anubis verification pages pop up more frequently around the web!
rovr138 · 1h ago
FWIW, the reason I like their approach is that fail2ban is still lean, works off the same logs, and doesn't start from the premise that everyone's experience has to suffer because of a few bad actors.
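
As a rough sketch of that log-driven pattern (not the article's actual fail2ban setup; the log path, user-agent list, and threshold are assumptions):

```python
# Scan the access log for blocklisted user agents, count hits per IP, and
# emit nginx "deny" lines for offenders. fail2ban automates this loop and
# handles unbanning; this is only the core idea.
from collections import Counter

BLOCKED_UAS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Amazonbot")
THRESHOLD = 100  # requests per log file before an IP gets denied

ip_hits = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if any(ua in line for ua in BLOCKED_UAS):
            ip_hits[line.split(" ", 1)[0]] += 1  # combined format starts with the IP

with open("blocklist.conf", "w") as out:
    for ip, hits in ip_hits.items():
        if hits >= THRESHOLD:
            out.write(f"deny {ip};\n")  # include this file from your nginx server block
```
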
rovr138 · 1h ago
Great article and sleuthing to find the information.

I know you're processing them dynamically as they come in and break the rules. But if you wanted to supplement the list, it might be worth pulling the ones from https://github.com/ai-robots-txt/ai.robots.txt at some frequency.
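
A sketch of how that could be pulled in on a schedule, assuming the repo keeps a generated robots.txt at its root on the main branch (verify the layout before relying on it):

```python
# Fetch the community-maintained AI-crawler list and extract the user agents,
# e.g. from a cron job that regenerates local nginx or fail2ban rules.
import urllib.request

RAW_URL = "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt"

def fetch_blocked_agents():
    with urllib.request.urlopen(RAW_URL, timeout=10) as resp:
        text = resp.read().decode("utf-8")
    agents = []
    for line in text.splitlines():
        if line.lower().startswith("user-agent:"):
            agent = line.split(":", 1)[1].strip()
            if agent and agent != "*":
                agents.append(agent)
    return agents

if __name__ == "__main__":
    print("\n".join(fetch_blocked_agents()))
```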

sneak · 54m ago
You don’t have to fend off anything; you just have to fix your server to handle this modest amount of traffic.

Everyone else is visiting your site for entirely self-serving purposes, too.

I don’t understand why people are ok with Google scraping their site (when it is called indexing), fine with users scraping their site (when it is called RSS reading), but suddenly not ok with AI startups scraping their site.

If you publish data to the public, expect the public to access it. If you don’t want the public (this includes AI startups) to access it, don’t publish it.

Your website is not being misused when the data is being downloaded to train AI. That’s literally what public data is for.

red369 · 48m ago
Is it because people viewed it as Google scraping the site to make an index so that people could find the site, while the AI scraping is intended so people won’t need to visit the site at all?
owebmaster · 50s ago
The AI apps (namely ChatGPT and Claude) are evolving to display external data with widgets that may end up driving more traffic than Google has in a long time. It might be worth shifting focus there, since SEO killed Google.
ehutch79 · 33m ago
Also, Google is relatively considerate when crawling.