Using lots of little tools to aggressively reject the bots

41 points by archargelod | 13 comments | 5/31/2025, 8:06:21 AM | lambdacreate.com ↗

Comments (13)

Proofread0592 · 1h ago
It is nice that the AI crawler bots honestly fill out the `User-Agent` header, but I'm shocked that they were the source of that much traffic. 99% of all websites do not change often enough to warrant this much crawling, let alone a dev blog.
grishka · 1h ago
They also respect robots.txt.

However, I've also seen reports that after getting blocked one way or another, they start crawling with browser user-agents from residential IPs. But it might also be someone else misrepresenting their crawlers as OpenAI/Amazon/Facebook/whatever to begin with.

rovr138 · 1h ago
We ended up writing rules similar to the ones in the article. Ours were based purely on frequency.

While we were rate limiting bots based on UA, we ended up also having to apply wider rules because traffic started spiking from other places.

I can't say whether it's the same traffic shifting, but there's definitely a large amount of automated traffic that doesn't identify itself properly.

Across all your web properties, look at historic traffic to calculate <hits per IP> in <time period>. Then look at the new data and see how it's shifting. You should be able to separate the real traffic from the automated traffic very quickly.
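
A rough sketch of that calculation, assuming nginx's default "combined" log format and hypothetical log paths (adjust the regex, window size, and paths for your setup):

```python
# Count hits per IP per hour from an nginx "combined" access log, so a
# historical baseline can be compared against current traffic.
import re
from collections import Counter
from datetime import datetime

# Combined format starts with: <ip> - <user> [10/May/2025:13:55:36 +0000] ...
LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\]')

def hits_per_ip_per_hour(path):
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            m = LINE.match(line)
            if not m:
                continue
            ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
            counts[(m["ip"], ts.strftime("%Y-%m-%d %H:00"))] += 1
    return counts

if __name__ == "__main__":
    baseline = hits_per_ip_per_hour("access.log.1")  # historic traffic
    current = hits_per_ip_per_hour("access.log")     # new data
    ceiling = max(baseline.values(), default=0)      # busiest historic IP-hour
    for (ip, hour), n in current.most_common(20):
        if n > ceiling:
            print(f"{ip}: {n} requests in {hour} (historic max was {ceiling})")
```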

vachina · 46m ago
I’ve turned off logging on my servers precisely because the logs were growing too quickly due to these bots. They’re that relentless: they fill every form and even hit APIs that are otherwise only reachable by clicking around the site. Anthropic, OpenAI, and Facebook are still scraping to this day.
reconnecting · 1h ago
Creator of tirreno [1] here.

While our platform is primarily designed for live, logged-in users, it also works well for bot detection and blocking.

We anonymize IP addresses by replacing the last octet with an asterisk, effectively grouping the same subnet under a single account. You can then use the built-in rule engine to automatically generate blacklists based on specific conditions, such as excessive 500 or 404 errors, brute-force login attempts, or traffic from data center IPs.
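
As a simplified illustration of the grouping described above (not tirreno's actual code, just the general idea):

```python
# Group IPv4 addresses by /24 subnet: the last octet is replaced with "*",
# so one "account" covers the whole subnet.
from collections import defaultdict

def anonymize(ip: str) -> str:
    """Replace the last octet of an IPv4 address with '*'."""
    octets = ip.split(".")
    return ".".join(octets[:3] + ["*"]) if len(octets) == 4 else ip

def group_by_subnet(ips):
    groups = defaultdict(list)
    for ip in ips:
        groups[anonymize(ip)].append(ip)
    return dict(groups)

print(group_by_subnet(["203.0.113.7", "203.0.113.99", "198.51.100.23"]))
# {'203.0.113.*': ['203.0.113.7', '203.0.113.99'], '198.51.100.*': ['198.51.100.23']}
```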

Finally, you can integrate the tirreno blacklist API into your application logic to redirect unwanted traffic to an error page.

Bonus: a dashboard [2] is available to help you monitor activity and fine-tune the blacklist to avoid blocking legitimate users.

[1] https://github.com/tirrenotechnologies/tirreno

[2] https://play.tirreno.com/login (admin/tirreno)

reconnecting · 45m ago
We also have work in progress to block bots based on publicly available IP ranges through the same dashboard. Any suggestions are welcome.
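
A sketch of what the check could look like on the consuming side; the prefixes below are RFC 5737 documentation ranges standing in for whatever ranges the crawler operators actually publish:

```python
# Check whether an address falls inside a set of published crawler CIDR ranges.
import ipaddress

# Placeholder ranges; substitute the ranges published by each crawler operator.
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_known_crawler(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PUBLISHED_RANGES)

print(is_known_crawler("198.51.100.42"))  # True
print(is_known_crawler("203.0.113.9"))    # False
```
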
loloquwowndueo · 2h ago
Nice - I like that most of the AI scraper bot blocking was done using Nginx configuration. Still, once fail2ban was added to the mix (meaning: an additional service and configuration), I wonder whether something like Anubis (https://anubis.techaro.lol/) would have been more automatic. I’ve seen Anubis verification pages pop up more frequently around the web!
rovr138 · 1h ago
FWIW, the reason I like their approach is that fail2ban is still lean, works off the same logs, and doesn't start from the premise that everyone's experience has to suffer because of a few bad actors.
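
As a rough sketch of that log-driven pattern (not the article's actual fail2ban setup; the log path, user-agent list, and threshold are assumptions):

```python
# Scan the access log for blocklisted user agents, count hits per IP, and
# emit nginx "deny" lines for offenders. fail2ban automates this loop and
# handles unbanning; this is only the core idea.
from collections import Counter

BLOCKED_UAS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Amazonbot")
THRESHOLD = 100  # requests per log file before an IP gets denied

ip_hits = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if any(ua in line for ua in BLOCKED_UAS):
            ip_hits[line.split(" ", 1)[0]] += 1  # combined format starts with the IP

with open("blocklist.conf", "w") as out:
    for ip, hits in ip_hits.items():
        if hits >= THRESHOLD:
            out.write(f"deny {ip};\n")  # include this file from your nginx server block
```
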
rovr138 · 1h ago
Great article and sleuthing to find the information.

I know you're processing them dynamically as they come in and break the rules. But if you wanted to supplement the list, it might be worth pulling the ones from https://github.com/ai-robots-txt/ai.robots.txt at some frequency.
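
A sketch of how that could be pulled in on a schedule, assuming the repo keeps a generated robots.txt at its root on the main branch (verify the layout before relying on it):

```python
# Fetch the community-maintained AI-crawler list and extract the user agents,
# e.g. from a cron job that regenerates local nginx or fail2ban rules.
import urllib.request

RAW_URL = "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt"

def fetch_blocked_agents():
    with urllib.request.urlopen(RAW_URL, timeout=10) as resp:
        text = resp.read().decode("utf-8")
    agents = []
    for line in text.splitlines():
        if line.lower().startswith("user-agent:"):
            agent = line.split(":", 1)[1].strip()
            if agent and agent != "*":
                agents.append(agent)
    return agents

if __name__ == "__main__":
    print("\n".join(fetch_blocked_agents()))
```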

sneak · 54m ago
You don’t have to fend off anything; you just have to fix your server to handle this modest amount of traffic.

Everyone else is visiting your site for entirely self-serving purposes, too.

I don’t understand why people are ok with Google scraping their site (when it is called indexing), fine with users scraping their site (when it is called RSS reading), but suddenly not ok with AI startups scraping their site.

If you publish data to the public, expect the public to access it. If you don’t want the public (this includes AI startups) to access it, don’t publish it.

Your website is not being misused when the data is being downloaded to train AI. That’s literally what public data is for.

red369 · 48m ago
Is it because people viewed it as Google scraping the site to make an index so that people could find the site, while the AI scraping is intended so people won’t need to visit the site at all?
owebmaster · 50s ago
The AI apps (namely ChatGPT and Claude) are evolving to display external data with widgets that may end up driving more traffic than Google has in a long time. It might be worth shifting focus there, since SEO killed Google.
ehutch79 · 33m ago
Also, Google is relatively considerate when crawling.