A nice attempt, and another layer in the swiss cheese of technology it will take to ease the burden AI companies are putting on people trying to run websites.
I'd be cautious about relying on just the good will of Cloudflare.
It's unfortunate that we need honeypots and tarpits to trap AI scrapers just so that our hosting bills don't get hosed. It's taking a good chunk of value out of running a site on the Internet.
OutOfHere · 9h ago
Feel free to waste your expensive outgoing bandwidth running malware. It really is a genius idea from the cloud companies to pad their balance sheets.
Definitely don't rewrite your web server more efficiently in Rust instead. /s
Retric · 8h ago
Serving poisoned text can be so cheap it’s effectively free as long as you don’t give them a lot of links.
Mars008 · 2h ago
Yeah, and say goodbye to Google search. You didn't want to be there anyway, right?
techjamie · 6h ago
Many of these tarpits deliberately serve data at an excruciatingly low speed to ease the burden on server resources. It's cheaper than constantly serving those same crawlers your entire website at full speed.
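A tarpit along those lines is really just a handler that drips bytes with long pauses in between. A minimal sketch in Python (the path handling, chunk size, and delays are arbitrary illustrative choices, not any particular project's implementation):

  # Minimal tarpit sketch: drip filler markup a few bytes at a time so one
  # crawler connection stays tied up for minutes while costing us almost nothing.
  import time
  from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

  class TarpitHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          self.send_response(200)
          self.send_header("Content-Type", "text/html")
          self.end_headers()
          for _ in range(200):                     # roughly 7 minutes per connection
              try:
                  self.wfile.write(b"<p>lorem ipsum dolor sit amet</p>\n")
                  self.wfile.flush()
              except (BrokenPipeError, ConnectionResetError):
                  break                            # the crawler gave up
              time.sleep(2)

      def log_message(self, *args):                # keep our own logs quiet
          pass

  # ThreadingHTTPServer so a sleeping tarpit connection doesn't block real traffic.
  ThreadingHTTPServer(("0.0.0.0", 8080), TarpitHandler).serve_forever()

Each sleeping connection costs the server almost nothing while the crawler stays stuck on it.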
OutOfHere · 2h ago
If we are going for cheaper, how is it cheaper than an HTTP 429 error? It's not.
TekMol · 9h ago
Currently, what I do is: when an IP requests insane amounts of URLs on my server (especially when it's all broken URLs causing 404s), I look up the IP and then block the whole organization.
For example, today some bot from the range 14.224.0.0-14.255.255.255 went crazy and caused a storm of 404s. Dozens per second, for hours on end. So I blocked the range like this:
iptables -A INPUT -m iprange --src-range 14.224.0.0-14.255.255.255 -j DROP
That's probably not the best way and might block significant parts of whole countries. But at least it keeps my service alive for now.
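For anyone wanting to automate that, here's a rough sketch of the idea (it assumes nginx-style combined access logs and IPv4; the threshold and the /16 heuristic are made up, and a real version would look up the organization's actual range via whois before blocking):

  # Count 404s per client IP in the access log and DROP the noisiest ranges.
  # LOG path, THRESHOLD, and the crude /16 heuristic are illustrative only.
  import collections
  import subprocess

  LOG = "/var/log/nginx/access.log"
  THRESHOLD = 1000                      # 404s before we block

  counts = collections.Counter()
  with open(LOG) as f:
      for line in f:
          parts = line.split()
          # In the combined log format the status code is the 9th field.
          if len(parts) > 8 and parts[8] == "404":
              counts[parts[0]] += 1     # client IP is the first field

  for ip, n in counts.items():
      if n >= THRESHOLD:
          prefix = ".".join(ip.split(".")[:2]) + ".0.0/16"   # crude org-range guess
          subprocess.run(["iptables", "-A", "INPUT", "-s", prefix, "-j", "DROP"], check=False)
          print(f"blocked {prefix} ({n} 404s, e.g. from {ip})")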
What do others here do to protect their servers?
PaulDavisThe1st · 8h ago
At git.ardour.org, we block any attempt to retrieve a specific commit. Trying to do so triggers fail2ban putting the IP into blocked status for 24hrs. They also get a 404 response.
We wouldn't mind if bots simply cloned the repo every week or something. But instead they crawl through the entire reflog. Fucking stupid behavior, and one that has cost us an extra $50/month even with just the 404.
GGO · 8h ago
I like rate limiting. I know none of my users will need more than 10 qps, so I set that for all routes, and all bots get throttled. I can also set a much higher rate limit for authenticated users. I haven't had bots slamming me - they just get 429s.
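Something like that is only a handful of lines if you do it in the application. A minimal per-IP sliding-window sketch (the 10 rps figure and the bare http.server are just for illustration; in practice you'd more likely do this at the proxy, e.g. with nginx's limit_req, and exempt authenticated users):

  # Per-IP sliding-window limiter: ~10 requests/second, everything above gets a 429.
  import time
  import collections
  from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

  LIMIT = 10        # requests allowed...
  WINDOW = 1.0      # ...per this many seconds
  hits = collections.defaultdict(collections.deque)

  class Limited(BaseHTTPRequestHandler):
      def do_GET(self):
          ip = self.client_address[0]
          now = time.monotonic()
          q = hits[ip]
          while q and now - q[0] > WINDOW:
              q.popleft()                          # forget requests outside the window
          if len(q) >= LIMIT:
              self.send_response(429)
              self.send_header("Retry-After", "1")
              self.end_headers()
              return
          q.append(now)
          self.send_response(200)
          self.end_headers()
          self.wfile.write(b"ok\n")

  ThreadingHTTPServer(("0.0.0.0", 8080), Limited).serve_forever()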
azangru · 9h ago
> Cloudflare, along with a majority of the world's leading publishers and AI companies, is changing the default to block AI crawlers
How is this done, technically? User agent checking? IP range blocking?
This requires good faith on the part of the crawler? Then it's DOA; why even bother implementing it?
Also, what a piece of zero-trust shit the web is becoming thanks to a couple of shit heads who really need to extract monetary value out of everything. Even if this non-solution were to work, the prospect of putting every website behind Cloudsnare is not a good one anyway.
What the web needs right now, to be honest, is machetes. In ample quantity. Tell me who's running that crawler that is bothering you and I will put them to the sword. They won't even need to present a JWK in the header.
xg15 · 8h ago
Maybe I didn't understand the proposal completely yet, but wouldn't the crawler only have to cooperate (send the right headers, implement that auth framework, etc) if they want to pay?
The standard response to a crawler is a 402 Payment Required response, probably as a result of aggressive bot detection.
So essentially, it's turning a site's entire content into an API: either sign up for an API key or get blocked.
The question remains, though, how well they will be able to distinguish bot traffic from humans - and will they make an exception for search engines?
grg0 · 6h ago
That is not what I understood, and it sounds terrible. What if you're not a crawler but random Joe surfing the internet? Clearly Joe should see content without payment? So they need some way to tell the crawler and Joe apart, and presumably they require the crawler to set certain request headers. The headers aren't just to issue the payment; they're to identify the crawler in the first place?
The idea behind the headers is to allow bots to bypass automatic bot filtering, not to blockade all regular traffic. In other words:
- we block bots (the website owner can configure how aggressively we block)
- unless they say they're from an AI crawler we've vetted, as attested by the signature headers
- in which case we let them pay
- and then they get to access the content
(Disclosure: I wrote the web bot auth implementation Cloudflare uses for pay per crawl)
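For concreteness, that flow boils down to roughly the following sketch. The header names are an approximation of what the announcement describes and the signature check is stubbed out; the real scheme uses HTTP Message Signatures against keys registered with Cloudflare, so don't read this as the actual implementation:

  # Sketch of the pay-per-crawl decision for one request. VETTED_KEYS and PRICE
  # are placeholders; real verification happens via HTTP Message Signatures.
  VETTED_KEYS = {"example-crawler-key"}   # hypothetical registry of vetted crawlers
  PRICE = "0.01"                          # hypothetical per-request price

  def decide(headers, is_probably_bot):
      """Return (status, extra_headers) for one incoming request."""
      if not is_probably_bot:
          return 200, {}                              # ordinary human traffic passes through
      if headers.get("signature-agent") not in VETTED_KEYS:
          return 403, {}                              # unvetted bot: normal blocking applies
      if "crawler-max-price" in headers or "crawler-exact-price" in headers:
          return 200, {"crawler-charged": PRICE}      # payment intent present: bill and serve
      return 402, {"crawler-price": PRICE}            # vetted but no intent: quote the price

  # A vetted crawler that hasn't offered to pay gets a 402 with a price quote.
  print(decide({"signature-agent": "example-crawler-key"}, is_probably_bot=True))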
xg15 · 5h ago
The writeup doesn't talk much about actively misbehaving crawlers, but this bit implies to me that the headers are for the "happy path", i.e. crawlers wanting to pay:
> Each time an AI crawler requests content, they either present payment intent via request headers for successful access (HTTP response code 200), or receive a 402 Payment Required response with pricing.
I don't see how it would make sense otherwise, as the requirements for crawlers include applying for a registration with Cloudflare.
Who in their right mind would jump through registration hoops only to then be refused access to a site? This wouldn't even keep away the crawlers that are operating today.
I agree there has to be some way to distinguish crawlers from regular users, but the only way I can see how this could be done is with bot detection algorithms.
...which are imperfect and will likely flag some legitimate human users as bots. So yes, this will probably lead to web browsing becoming even more unpleasant.
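For what it's worth, the happy path from the crawler's side would then presumably look roughly like this (hypothetical header names and budget, matching the sketch further up; the actual signed auth headers are omitted, sketch only):

  # Hypothetical crawler: fetch a page, and if quoted an affordable 402 price,
  # retry with a payment-intent header.
  import urllib.request
  import urllib.error

  MAX_PRICE = 0.05   # what we're willing to pay per page

  def fetch(url):
      try:
          return urllib.request.urlopen(urllib.request.Request(url)).read()
      except urllib.error.HTTPError as e:
          if e.code != 402:
              raise
          quoted = float(e.headers.get("crawler-price", "inf"))
          if quoted > MAX_PRICE:
              return None                                        # too expensive, skip
          retry = urllib.request.Request(url, headers={"crawler-max-price": str(MAX_PRICE)})
          return urllib.request.urlopen(retry).read()            # signed auth headers omitted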
cryptonector · 8h ago
It's Cloudflare. That means they are good at DoS and DDoS protection, and AI crawlers are basically DoS agents. I think CF can start with an honor system that carries the implied threat of blocking crawlers from all CF-hosted content, and that is a pretty big hammer to hit abusers with.
So I'm cautiously optimistic. Well, I suppose pessimistic too: if this works, what it will mean is that all content ends up moving to big-player hosting like CF.
rorylaitila · 9h ago
It's unfortunate but I think the ship has sailed. Good on them for trying but I don't see it working.
I am advising all my clients away from informational content which is easily remixed by LLMs. And I'm not bothering anymore with targeting informational search queries on my own sites.
I'm doubling down on community and interaction: finding ways to engage smaller audiences with original content, rather than producing information for a global search audience.
mhuffman · 8h ago
So are they going to try to IP-gate them, or trust that AI companies that literally stole the info they used to build their base models will now respect robots.txt entries?
trhway · 8h ago
Everyone likes net neutrality when they benefit from it, yet they immediately jump at the opportunity to break net neutrality on their own services if it lets them increase profit through price discrimination (which may take the shape of extracting rent from some subset of consumers, as seems to be the case here).
mzs · 9h ago
Is there a cut that Cloudflare gets, or is that behind an NDA?
yladiz · 8h ago
Would this be preferable to something like Anubis?
jmole · 9h ago
> Imagine an AI engine like a block of swiss cheese. New, original content that fills one of the holes in the AI engine’s block of cheese is more valuable than repetitive, low-value content that unfortunately dominates much of the web today.
Great statement in theory - but in practice, the whole people-as-a-service industry for AI data generation is IMO more damaging to the knowledge ecosystem than open data. e.g. companies like pareto.ai
"Proprietary data for pennies on the dollar" is the late-stage capitalism equivalent of the postdoctoral research trap.
Cloudflare to introduce pay-per-crawl for AI bots
https://news.ycombinator.com/item?id=44432385
This ends up hurting individuals and small companies that are harmless and cannot afford to pay.