Everyone loves the dream of a free for all and open web.
But the reality is how can someone small protect their blog or content from AI training bots? What, they just blindly trust that whoever is knocking is an agent rather than a training bot, and is diligently respecting robots.txt? Get real...
Or fine, suppose they do respect robots.txt, but then buy the same data anyway, shielded through liability layers as "licensed data"?
Unless you're Reddit, X, Google, or Meta, with scarily well-funded legal teams, you have no power.
> Everyone loves the dream of a free for all and open web... But the reality is how can someone small protect their blog or content from AI training bots?
Aren't these statements entirely in conflict? You either have a free for all open web or you don't. Blocking AI training bots is not free and open for all.
tonetegeatinst · 8m ago
Onion sites have bots and scrapers.
They don't use Cloudflare AFAIK.
They normally use a puzzle that the website generates, or they use a proof-of-work-based captcha. I've found proof of work good enough out of these two, and it also means that the site owner can run it themselves instead of being reliant on Cloudflare and third parties.
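For the curious, the proof-of-work flavor fits in a few lines. This is a toy sketch, not any particular captcha's implementation; the difficulty, encoding, and function names are all mine:

```python
import hashlib
import os

DIFFICULTY = 12  # leading zero bits required; real deployments tune this up

def make_challenge() -> str:
    # Server hands the visitor a random challenge string.
    return os.urandom(16).hex()

def leading_zero_bits(digest: bytes) -> int:
    # Count how many bits of the digest are zero, from the front.
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def verify(challenge: str, nonce: int) -> bool:
    # Server side: a single hash checks the client's work.
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY

def solve(challenge: str) -> int:
    # Client side: brute-force a nonce. Cheap for one page view,
    # expensive at scraper volume.
    nonce = 0
    while not verify(challenge, nonce):
        nonce += 1
    return nonce
```

The asymmetry is the point: the server does one hash per check, while each client burns on the order of 2^DIFFICULTY hashes per token.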
Gud · 1h ago
By developing Free Software that combats this hostile software.
Corporations develop hostile AI agents,
Capable hackers develop anti-AI-agents.
This defeatist attitude of "we have no power" is exactly the problem.
TIPSIO · 1h ago
Yes, I obviously agree with you. I think you've slightly missed my comment's point: CF is making these tools and giving millions of people access to them.
supriyo-biswas · 1h ago
Well there's open source stuff like https://github.com/TecharoHQ/anubis; one doesn't need a top-down mandated solution coming from a corporation.
In general, Cloudflare has been pushing the DRMization of the web for quite some time, and while I understand why they want to do it, I wish they didn't always present themselves as taking the moral high ground.
Klonoar · 51m ago
Anubis doesn't necessarily stop the most well-funded actors.
If anything, we've seen a rise in complaints about it annoying average users.
supriyo-biswas · 42m ago
What Anubis was actually created in response to is a strange kind of DDoS attack that has been misattributed to LLMs: an attacker making partial GET requests that are aborted soon after the request headers are sent, mostly coming from residential proxies. (Yes, it doesn't help that the author of Anubis also isn't fully aware of the mechanics of the attack. In fact, there is no proper write-up of the attack's mechanism, which I hope to publish someday.)
Having said that, the solution is effective enough: a lightweight proxy component that issues proof-of-work tokens to such bogus requests works well, as various users on HN have pointed out.
victorbjorklund · 54m ago
So basically cloudflare but self-hosted (with all the pain that comes from that)?
Gud · 41m ago
What's so painful about self-hosting? I've been self-hosting since before I hit puberty. If 12-year-old me could run an httpd, anyone can.
And if you don't want to self-host, at least try to use services from organisations that aren't hostile to the open web.
victorbjorklund · 39m ago
I self-host lots of stuff. But yes, it is more painful to host a WAF that can handle billions of requests per minute. Even harder to do it for free, like Cloudflare does. And in the end, the result for the user is exactly the same whether you self-host a WAF or let someone else host it for you.
xg15 · 1h ago
That's a mantra, not a solution.
esseph · 1h ago
Sometimes it's a hardware problem, not a software problem.
nimih · 1h ago
For that matter, sometimes it's a social/political problem and not a technological problem.
Analemma_ · 1h ago
How does an agent help my website not get crushed by traffic load, and how is this proposal any different from the gatekeeping problem to the open web, except even less transparent and accountable because now access is gated by logic inside an impenetrable web of NN weights?
This seems like slogan-based planning with no actual thought put into it.
Gud · 1h ago
Whatever is working against the AI doesn’t have to be an AI agent.
hoppp · 47m ago
So proof of work checks everywhere?
banku_brougham · 1h ago
This is the attitude I like to see. As they say (actually I hate this phrase because of its past connotations, but still): "freedom isn't free".
gausswho · 1h ago
What we need is some legal teeth behind robots.txt. It won't stop everyone, but Big Corp would be a tasty target for lawsuits.
quectophoton · 1h ago
I don't know about this. It means I'd get sued for using a feed reader on Codeberg, or for mirroring repositories from there (e.g. with Forgejo), since both are automated actions not caused directly by a user interaction (i.e. bots, rather than user agents).
To be more specific: that assumes good faith from our fine congresspeople to craft this well... okay, for the sake of the hypothetical I'll continue...
What legal teeth I would advocate would be targeted to crawlers (a subset of bot) and not include your usage. It would mandate that Big Corp crawlers (for search indexing, AI data harvesting, etc.) be registered and identify themselves in their requests. This would allow serverside tools to efficiently reject them. Failure to comply would result in fines large enough to change behavior.
Now that I write that out, if such a thing were to come to pass, and it was well received, I do worry that congress would foam at the mouth to expand it to bots more generally, Microsoft-Uncertified-Devices, etc.
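To illustrate, the server-side check such a registration mandate would enable could be as simple as this. The registry, tokens, and categories here are entirely hypothetical; no such scheme exists today:

```python
# A hypothetical public registry mapping crawler tokens to declared purposes.
REGISTERED_CRAWLERS = {
    "ExampleSearchBot/2.1": "search indexing",
    "ExampleAIHarvester/0.9": "AI training data",
}

def classify(user_agent: str) -> str:
    """Classify a request as 'registered-crawler', 'unregistered-crawler',
    or 'presumed-human' based on its User-Agent header."""
    if any(token in user_agent for token in REGISTERED_CRAWLERS):
        return "registered-crawler"
    # Self-identified but unregistered crawlers would be the ones
    # exposed to fines under the proposed rules.
    if "bot" in user_agent.lower() or "crawler" in user_agent.lower():
        return "unregistered-crawler"
    return "presumed-human"
```

Cheap string matching like this only works because the mandate would make lying about the token the punishable offence, rather than an arms race.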
quectophoton · 25m ago
Yeah, my main worry here is how we define the unwanted traffic, and how that definition could be twisted by bigcorp lawyers.
If it's too loose and similar to "wanted traffic is how the authors intend the website to be accessed, unwanted traffic is anything else", that's an argument that can be used against adblocks, or in favor of very specific devices like you mention. Might even give slightly more teeth to currently-unenforceable TOS.
If it's too strict, it's probably easier to find loopholes and technicalities that just lets them say "technically it doesn't match the definition of unwanted traffic".
Even if it's something balanced, I bet bigcorp lawyers will find a way to twist the definitions in their favor and set a precedent that's convenient for them.
I know this is a mini-rant rather than a helpful comment that tries to come up with a solution, it's just that I'm pessimistic because it seems the internet becomes a bit worse day by day no matter what we try to do :c
blibble · 53m ago
> This means I'd get sued for using a feed reader on Codeberg
you think codeberg would sue you?
quectophoton · 42m ago
Probably not.
But it's the same thing with random software from a random nobody that has no license, or has a license that's not open-source: If I use those libraries or programs, do I think they would sue me? Probably not.
qwerty456127 · 25m ago
What we need is to stop fighting robots and start welcoming and helping them. I see zero reason to oppose robots visiting any website I would build. The only purpose I ever used disallow rules for was preventing search engines from indexing incomplete versions, or from going down paths that really make no sense for them. Now I think we should write separate instructions for different kinds of robots: a search engine indexer shouldn't open pages that have serious side effects (e.g. place an order) or display semi-realtime technical details, but an LLM agent may be on a legitimate mission involving exactly those.
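robots.txt already supports per-agent sections, so this split can be expressed today. A sketch, with illustrative agent names and paths:

```text
# Search indexers: content only, stay out of stateful endpoints
User-agent: Googlebot
Disallow: /order/
Disallow: /status/

# User-initiated LLM agents: placing an order may be the whole point
User-agent: ChatGPT-User
Allow: /

# Everyone else: content only
User-agent: *
Disallow: /order/
```

Of course, this only instructs robots; it does nothing to robots that ignore it.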
vkou · 12m ago
> What we need is to stop fighting robots and start welcoming and helping them. I see zero reason to oppose robots visiting any website I would build.
Well, I'm glad you speak for the entire Internet.
Pack it in folks, we've solved the problem. Tomorrow, I'll give us the solution to wealth inequality (just stop fighting efforts to redistribute wealth and political power away from billionaires hoarding it), and next week, we'll finally get to resolve the old question of software patents.
notatoad · 1h ago
It wouldn’t stop anyone. The bots you want to block already operate out of places where those laws wouldn’t be enforced.
qbane · 1h ago
Then that is a good reason to deny the requests from those IPs
literalAardvark · 32m ago
I've run a few hundred small domains for various online stores with an older backend that didn't scale very well for crawlers and at some point we started blocking by continent.
It's getting really, really ugly out there.
stronglikedan · 1h ago
It should have the same protections as an EULA, where the crawler is the end user, and crawlers should be required to read it and apply it.
edm0nd · 25m ago
No we don't
Galanwe · 42m ago
- Moral rules are never really effective
- Legal threats are never really effective
Effective solutions are:
- Technical
- Monetary
I like the idea of the web as a blockchain of content. If you want to pull some data, you have to pay for it with some kind of token. You either buy tokens to consume information, if you're the leecher type, or earn them back by making contributions.
It's more or less the same concept as torrents back in the day.
This should be applied to emails too. The regular person sends what, 20 emails per day max? Say it costs $0.01 per mail; anyone could pay that. But if you want to spam 1,000,000 every day, that becomes prohibitive.
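The arithmetic behind that claim, using the rates assumed above:

```python
PRICE_PER_MAIL = 0.01  # assumed cost in dollars per message

regular_user_daily = 20 * PRICE_PER_MAIL       # a regular person's day
spammer_daily = 1_000_000 * PRICE_PER_MAIL     # a spam operation's day

print(f"regular user: ${regular_user_daily:.2f}/day")
print(f"spammer:      ${spammer_daily:,.0f}/day")
```

About twenty cents a day for a normal sender, versus $10,000 a day for a million-message spammer.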
edm0nd · 22m ago
>This should be applied to emails too. The regular person sends what, 20 emails per day max? Say it costs $0.01 per mail; anyone could pay that.
This seems flawed.
Poor people living in 3rd world countries that make like $2.00/day wouldn't be able to afford this.
>But if you want to spam 1,000,000 every day, that becomes prohibitive.
Companies and people with money can easily pay this with no issues. If it costs $10,000 to send 1M emails that reach the inbox but you profit $50k, it's a non-issue.
clvx · 1h ago
You can lock it up with a user account and payment system. The fact that the site is up on the internet doesn't mean you can or cannot profit from it. It's up to you. What I would like is a way to notify my ISP and say: block this traffic to my site.
inetknght · 1h ago
> What I would like is a way to notify my ISP and say: block this traffic to my site.
I would love that, and make it automated.
A single message from your IP to your router: block this traffic. That router sends it upstream, and it also blocks it. Repeat ad nauseam until the path changes ASN or (if the originator is on the same ASN) it reaches the router nearest the originator, routing-table space notwithstanding. Maybe it expires after some auto-expiry: a day, a month, or however long your IP lease exists. Plus, of course, a way to query what blocks I've requested and a way to unblock.
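As a toy model of that propagation (the message format, ASN handling, and expiry semantics are all invented for illustration; no such router protocol exists):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Router:
    asn: int
    upstream: Optional["Router"] = None
    blocks: dict = field(default_factory=dict)  # source IP -> expiry time

    def request_block(self, source: str, source_asn: int, expires: float):
        # Install the block locally, then push it upstream until the
        # next hop would be inside the originator's own ASN.
        self.blocks[source] = expires
        if self.upstream is not None and self.upstream.asn != source_asn:
            self.upstream.request_block(source, source_asn, expires)

    def allows(self, source: str, now: float) -> bool:
        # Blocks auto-expire, as suggested above.
        expiry = self.blocks.get(source)
        return expiry is None or now >= expiry
```

A chain of routers then drops the traffic as close to the source as it can reach, and forgets the rule once it expires.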
No comments yet
eldenring · 1h ago
I recently found out my website has been blocking AI agents, when I had never asked for that. It seems to be on by default, but in an obscure way. Very frustrating. I think some of these companies (one in particular) are risking burning a lot of goodwill, although I think they have been on that path for a while now.
PaulRobinson · 1h ago
You might have this the wrong way around.
It's not the publishers who need to do the hard work, it's the multi-billion dollar investments into training these systems that need to do the hard work.
We are moving to a position whereby if you or I want to download something without compensating the publisher, that's jail time, but if it's Zuck, Bezos or Musk, they get a free pass.
That's the system that needs to change.
I should not have to defend my blog from these businesses. They should be figuring out how to pay me for the value my content adds to their business model. And if they don't want to do that, then they shouldn't get to operate that model, in the same way I don't get to build a whole set of technologies on papers published by Springer Nature without paying them.
This power imbalance is going to be temporary. These trillion-dollar market cap companies think if they just speed run it, they'll become too big, too essential, the law will bend to their fiefdom. But in the long term, it won't - history tells us that concentration of power into monarchies descends over time, and the results aren't pretty. I'm not sure I'll see the guillotine scaffolds going up in Silicon Valley or Seattle in my lifetime, but they'll go up one day unless these companies get a clue from history as to what they need to do.
No comments yet
FlyingSnake · 1h ago
It is a service available to Cloudflare customers and is opt-in. I fail to see how they're being gatekeepers when site owners have the option not to use it.
andy99 · 1h ago
> But the reality is how can someone small protect their blog or content from AI training bots?
A paywall.
In reality, what some want is to get all the benefits of having their content on the open internet while still controlling who gets to access it. That is the root cause here.
littlecranky67 · 1h ago
This. We need to get rid of the ad-supported free internet economy. If you want your content to be free, you release it and have no issues with AI. If you want to make money off your content, add a paywall.
We need micropayments going forward, Lightning (Bitcoin backend) could be the solution.
rustc · 1h ago
> If you want your content to be free, you release it and have no issues with AI. If you want to make money off your content, add a paywall.
What about licenses like CC-BY-NC (Creative Commons - Non Commercial)?
notatoad · 1h ago
Which is really all that cloudflare is building here that people are mad about. It’s a way to give bots access to paywalled content.
positiveblue · 1h ago
Where everyone needs a cloudflare account to be able to pay*
notatoad · 1h ago
“Everyone” in this context being bot operators who want to access websites who have decided to use cloudflare to block unauthenticated bot traffic.
Which is not everyone.
jMyles · 1h ago
> Everyone loves the dream of a free for all and open web.
> protect their blog or content from AI training bots
It strikes me that one needs to choose one of these as their visionary future.
Specifically: a free and open web is one where read access is unfettered to humans and AI training bots alike.
So much of the friction and malfunction of the web stems from efforts to exert control over the flow (and reuse) of information. But this is in conflict with the strengths of a free and open web, chief of which is the stone cold reality that bytes can trivially be copied and distributed permissionlessly for all time.
pessimizer · 14m ago
It's the new "ban cassette tapes to prevent people from listening to unauthorized music," but wrapped in an anti-corporate skin delivered by a massive, powerful corporation that could sell themselves to Microsoft tomorrow.
The AI crawlers are going to get smarter at crawling, and they'll have crawled and cached everything anyway; they'll just be reading your new stuff. They should literally just buy the Internet Archive jointly, and only read everything once a week or so. But people (to protect their precious ideas) will then just try to figure out how to block the IA.
One thing I wish people would stop doing is conflating their precious ideas with their bandwidth. Bandwidth is one very serious issue, because it's a denial-of-service attack, but it can be solved. Your precious ideas? Those have to be protected by a court. And I don't actually care if the copyright violation can go both ways; wealthy people seem to be free to steal from the poor at will, even rewarded for it, while "normal" (upper-middle-class) people can't even afford to challenge obviously fraudulent copyright claims, and the penalties are comically absurd and the direct result of corruption.
Maybe having pay-to-play justice systems that punish the accused before conviction with no compensation was a bad idea? Even if it helped you to feel safe from black people? Maybe copyright is dumb now that there aren't any printers anymore, just rent-seekers hiding bitfields?
deadbabe · 49m ago
I care more about the dream of a wide open free web than a small time blogger’s fears of their content being trained on by an AI that might only ever emit text inspired by their content a handful of times in their life.
m463 · 22m ago
nonsense.
I'm routinely denied access to websites now.
"enable javascript and unblock cookies to continue"
buyucu · 52m ago
Everyone loves a free for all and open web because it works really well.
Basic tools like Anubis and fail2ban are very effective at keeping most of this evil at bay.
mannanj · 34m ago
How about we discuss, design, and implement a system that charges them for their actions? We could put some dark patterns in our sites that specifically impose this cost: some sort of problem-solving step that harvests the energy of their scraping/LLM tools and directs it toward causes that profit our site, in exchange for revealing some content that serves their scraping mission too. It looks like these exist, to degrees.
sneak · 1h ago
> But the reality is how can someone small protect their blog or content from AI training bots?
First off, there's no harm from well-behaved bots. Badly behaved bots that cause problems for the server are easily detected (by the problems they cause), classified, and blocked or heavily throttled.
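A minimal sketch of that throttling, as a per-IP token bucket; the constants and structure are mine, not any particular server's:

```python
import time

class Throttle:
    """Per-client token bucket: each request spends a token; tokens
    refill at `rate` per second up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.state = {}  # client IP -> (tokens, last-seen timestamp)

    def allow(self, ip: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self.state.get(ip, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self.state[ip] = (tokens, now)
            return False  # over budget: throttle or block here
        self.state[ip] = (tokens - 1.0, now)
        return True
```

A well-behaved visitor never notices the limit; a scraper hammering the same endpoint drains its bucket within seconds.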
Of course, if you mean "protect" in the sense of "keep AI companies from getting a copy" (which you may have, given that you mentioned training) - you simply can't, unless you consider "don't put it on the web" a solution.
It's impossible to make something "public, but not like that". Either you publish or you don't.
If anything, it's a legal issue (copyright/fair use), not a technical one. Technical solutions won't work.
I'm not sure why people are so confused by this. The Mastodon/AP userbase put their public content on a publicly federated protocol then lost their shit and sent me death threats when I spidered and indexed it for network-wide search.
There are upsides and downsides to publishing things you create. One of the downsides is that it will be public and accessible to everyone.
avazhi · 38m ago
Nobody cares about robots.txt, nor should they.
If this is your primary argument against being scraped (viz that your robots.txt said not to) then you’re naive and you’re doing it wrong.
If the internet is open, then data on it is going to be scraped lol. You can’t have it both ways.
verdverm · 33m ago
It seems the Open Internet is idealistic.
If others respected robots.txt, we would not need solutions like what Cloudflare is presenting here. Since abuse is rampant, people are looking for mitigations and this CF offering is an interesting one to consider.
matt-p · 2h ago
I have zero issue with AI agents, if there's a real user behind them somewhere. I DO have a major issue with my sites being crawled extremely aggressively by offenders including Meta, Perplexity, and OpenAI. It's really annoying realising that we're tying up several CPU cores on AI crawling. Less than on real users and Google et al., but still.
chatmasta · 58m ago
I wonder how many CPU cycles are spent because of AI companies scraping content. This factor isn't usually considered when estimating “environmental impact of AI.” What’s the overhead of this on top of inference and training?
To be fair, an accurate measurement would need to consider how many of those CPU cycles would have been spent by the human user driving the bot. From that perspective, maybe the scrapers can "make up for it" by crawling efficiently, i.e. avoiding tracker scripts, images, etc. unless they're necessary to solve the query. They'll still burn CPU cycles, but at least fewer than a human user with a headful browser instance.
asats · 1h ago
I have some personal apps online, and I had to turn the Cloudflare AI bot protection on because one of them had 1.6TB of data accessed by bots in the last month, 1.3 million requests per day, just non-stop hammering with no limits.
Operyl · 2h ago
They're getting to the point of 200-300RPS for some of my smaller marketing sites, hallucinating URLs like crazy. It's fucking insane.
palmfacehn · 1h ago
You'd think they would have an interest in developing reasonable crawling infrastructure, like Google, Bing or Yandex. Instead they go all in on hosts with no metering. All of the search majors reduce their crawl rate as request times increase.
On one hand these companies announce themselves as sophisticated, futuristic and highly-valued, on the other hand we see rampant incompetence, to the point that webmasters everywhere are debating the best course of action.
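The backoff behavior the search majors use, slowing down as the server slows down, takes only a few lines to sketch; the thresholds below are illustrative, not any engine's actual policy:

```python
class PoliteCrawler:
    """Back off when the server struggles, recover slowly when healthy."""

    def __init__(self, base_delay: float = 1.0, max_delay: float = 60.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay  # seconds to wait between requests

    def observe(self, response_time: float):
        if response_time > 2.0:
            # Server is struggling: double the inter-request delay.
            self.delay = min(self.delay * 2.0, self.max_delay)
        else:
            # Server is healthy: ease back toward the base rate.
            self.delay = max(self.delay * 0.9, self.base_delay)
```

That the AI crawlers don't even do this much is what makes the incompetence charge stick.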
matt-p · 1h ago
Honestly, it's just the tragedy of the commons. Why put the effort in when you don't have to identify yourself? Just crawl, and if you get blocked, move the job to another server.
palmfacehn · 1h ago
At this point I'm blocking several ASNs. Most are cloud provider related, but there are also some repurposed consumer ASNs coming out of the PRC. Long term, this devalues the offerings of those cloud providers, as prospective customers will not be able to use them for crawling.
matt-p · 1h ago
I'm seeing around the same, as a fairly constant base load. Even more annoying when it's hitting auth middleware constantly, over and over again somehow expecting a different answer.
swed420 · 41m ago
> I DO have a major issue with my sites being crawled extremely aggressively by offenders including Meta, Perplexity and OpenAI
Gee, if only we had, like, one central archive of the internet. We could even call it the internet archive.
Then, all these AI companies could interface directly with that single entity on terms that are agreeable.
teitoklien · 15m ago
You think they care about that? They'd still crawl like this just in case, which is why they don't rate limit atm.
rikafurude21 · 1h ago
Cloudflare is trying to gatekeep which user-initiated agents are allowed to read website content, which is of course very different from scraping websites for training data. Meta, Perplexity, and OpenAI all have some kind of web-search functionality where they send requests based on user prompts. These are not requests that get saved to train the next LLM. Cloudflare intentionally blurs the line between both types of bots, and in that sense it is a bait-and-switch: they claim to "protect content creators" by being the man in the middle, collecting tolls from LLM providers to pay creators (and of course taking a cut for themselves). It's not something they do because it would be fair; there's financial motivation.
jsheard · 1h ago
> Cloudflare is trying to gatekeep which user-initiated agents are allowed to read website content, which is of course very different from scraping websites for training data.
That distinction requires you to take companies which benefit from amassing as much training data as possible at their word when they pinky swear that a particular request is totally not for training, promise.
rikafurude21 · 1h ago
If you look at the current LLM landscape, the frontier is not being pushed by labs throwing more data at their models; most improvements come from using more compute and improving training methods. In that sense I don't have to take their word for it; more data just hasn't been the problem for a long time.
jsheard · 1h ago
Just today Anthropic announced that they will begin using their users data for training by default - they still want fresh data so badly that they risked alienating their own paying customers to get some more. They're at the stage of pulling the copper out of the walls to feed their crippling data addiction.
impure · 1h ago
Well, if you have a better way to solve this that’s open I’m all ears. But what Cloudflare is doing is solving the real problem of AI bots. We’ve tried to solve this problem with IP blocking and user agents, but they do not work. And this is actually how other similar problems have been solved. Certificate authorities aren’t open and yet they work just fine. Attestation providers are also not open and they work just fine.
lucb1e · 14m ago
Certificate authorities don't block humans if they 'look' like a bot
ShakataGaNai · 1h ago
Agreed. It might not be THE BEST solution, but it is a solution that appears to work well.
Centralization bad, yada yada. But if Cloudflare can get most major AI players to participate, then convince the major CDNs to also participate... ipso facto columbo oreo... standard.
positiveblue · 23m ago
Yep, that's why I am writing this now :)
You can see it in web vs. mobile apps.
Many people may not see a problem with walled gardens, but the reality is that we have much less innovation in mobile than on the web, because anyone can spin up a web server, versus having to publish an app in Apple's App Store.
viktorcode · 1h ago
AI poisoning is a better protection. Cloudflare is capable of serving stashes of bad data to AI bots as protective barrier to their clients.
verdverm · 29m ago
You don't think that the AI companies will take efforts to detect and filter bad data for training? Do you suppose they are already doing this, knowing that data quality has an impact on model capabilities?
viktorcode · 21m ago
They will learn to pay for high quality data instead of blindly relying on internet contents.
esseph · 1h ago
AI poisoning is going to get a lot of people killed, because the AI won't stop being used.
viktorcode · 20m ago
By that logic, AI is already killing people. We can't presume that whatever can be found on the internet is reliable data, can we?
lucb1e · 13m ago
If science taught us anything, it's that no data is ever reliable. We are pretty sure about many things, and it's the best available info so we might as well use it, but in terms of "the internet can be wrong": any source can be wrong! I'd not even be surprised if the internet in aggregate (with the bot reading all of it) is right more often than the individual authors of pretty much anything.
TYPE_FASTER · 45m ago
I think the reality is, we need identity on both the client and server sides.
At some point soon, if not now, assume everything is generated by AI unless proven otherwise using a decentralized ID.
Likewise, on the server side, assume it’s a bot unless proven otherwise using a decentralized ID.
We can still have anonymity using decentralized IDs. An identity can be an anonymous identity, it’s not all (verified by some central official party) or nothing.
It comes down to different levels of trust.
Decoupling identity and trust is the next step.
lucb1e · 9m ago
It's called an IP address. Since some ISPs don't assign a fixed IP to a subscriber, a timestamp is nowadays necessary. The combination is traceable to a subscriber who is responsible for the line, either to work with law enforcement if subpoenaed or to not send abusive traffic via the line themselves
Why law enforcement doesn't do their job, resulting in people not bothering to report things anymore, is imo the real issue here. Third party identification services to replace a failing government branch is pretty ugly as a workaround, but perhaps less ugly than the commercial gatekeepers popping up today
verdverm · 26m ago
DID spec, also used in ATProto, is quite flexible. It would be nice to see it used in more places and processes
I use uncommon web browsers that don't leak a lot of information. To Cloudflare, I am indistinguishable from a bot.
Privacy cannot exist in an environment where the host gets to decide who accesses the web page. I'm okay with rate limiting or otherwise blocking activity that creates too much load, but trying to prevent automated access is impossible without preventing access from real people.
verdverm · 1h ago
The website owner has rights too. Are you arguing they cannot choose to implement such gatekeeping to keep their site operating in a financially viable manner?
SoftTalker · 1h ago
If you put your information freely on the web, you should have minimal expectations on who uses it and how. If you want to make money from it, put up a paywall.
If you want the best of both worlds, i.e. just post freely but make money from ads, or inserting hidden pixels to update some profile about me, well good luck. I'll choose whether I want to look at ads, or load tracking pixels, and my answer is no.
verdverm · 39m ago
I'm not talking about ads or pixels, I'm referring to bot operators creating so much traffic that the network bill makes the hosting financially impossible
> my answer is no.
Rights for me, but not for thee?
rustc · 1h ago
> If you put your information freely on the web, you should have minimal expectations on who uses it and how.
Does this only apply to "information" or should we treat all open source code as public domain?
pessimizer · 32m ago
All "open source" code was already pretty much public domain. All they'd have to do is put a page of OSI-approved licenses up on the site, right? An index of open source projects and their authors? Is this more than a week's work to comply?
Free Software is the only place where this is a real abridgement of rights and intention, and it's over. They've already been trained on all of it, and no judge will tell them to stop, and no congressman will tell them to stop.
ehnto · 41m ago
I also do the same and get caught up by bot blockers.
However, I do believe the host can do whatever they want with my request also.
This issue becomes more complex when you start talking about government sites, since ideally they have a much stronger mandate to serve everyone fairly.
avtar · 1h ago
> An allowlist run by ONE company?
An allowlist run by one company that site owners chose to engage with. But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts…
positiveblue · 1h ago
> An allowlist run by one company that site owners chose to engage with.
Exactly, no problem with that, just hinting that's not a protocol.
> But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts
I know the image; what I do not understand is the argument that using it is incompatible with "fairness" and "openness".
glenstein · 1h ago
It's a frying pan/fire choice that could create a de-facto standard we end up depending on, during a critical moment where the hot topic could have a protocol or standards based solution. Cloudflare is actively trying to make a blue ocean for themselves of a real issue affecting everyone.
>But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts…
"But you participate in society!"
sdsd · 2h ago
Maybe the title means something more like "The web should not have gatekeepers (Cloudflare)". They do seem to say as much toward the end:
>We need protocols, not gatekeepers.
But until we have working protocols, many webmasters literally do need a gatekeeper if they want to realistically keep their site safe and online.
I wish this weren't the case, but I believe the "protocol" era of the web was basically ended when proprietary web 2.0 platforms emerged that explicitly locked users in with non-open protocols. Facebook doesn't want you to use Messenger in an open client next to AIM, MSN, and IRC. And the bad guys won.
But like I said, I hope I'm wrong.
skybrian · 1h ago
This is sort of like how email is based on Internet standards but a large percentage of email users use Gmail. The Internet standards Cloudflare is promoting are open, but Cloudflare has a lot of power due to having so many customers.
(What are some good alternatives to Cloudflare?)
Another way the situation is similar: email delivery is often unreliable and hard to implement due to spam filters. A similar thing seems to be happening to the web.
nromiun · 1h ago
It is a big problem. There is no good alternative to Cloudflare as a free CDN. They put servers all over the world and they are giving them away for free. And making their money on premium serverless services.
Not to mention the big cloud providers are unhinged with their egress pricing.
ACCount37 · 1h ago
We have far too many gatekeepers as it is. Any attempt to add any more should be treated as an act of aggression.
Cloudflare seems very vocal about its desire to become yet another digital gatekeeper as of late, and so is Google. I want both reduced to rubble if they persist in it.
timshell · 1h ago
I think about this as a startup founder building a 'proof-of-human' layer on the Internet.
One of the hard parts in this space is what level of transparency should you have. We're advancing the thesis that behavioral biometrics offers robust continuous authentication that helps with bot/human and good/bad, but people are obviously skeptical to trust black-box models for accuracy and/or privacy reasons.
We've defaulted to a lot of transparency in terms of publishing research online (and hopefully in scientific journals), but we've seen the downside: competitors make fake claims about their own in-house behavioral tools hidden behind company walls, and investors constantly worry about an arms race.
As someone genuinely interested (and incentivized!) to build a great solution in this space, what are good protocols/examples to follow?
seanvelasco · 1h ago
as a Cloudflare customer, I am happy with their proposition. I personally do not want companies like Perplexity that fake their user-agent and ignore my robots.txt to trespass.
and isn't this why people sign up with Cloudflare in the first place? for bot protection? to me, this is just the same, but with agents.
i love the idea of an open internet, but this requires all parties to be honest. a company like Perplexity that fakes their user-agent to get around blocks disrespects that idea.
my attitude towards agents is positive. if a user used an LLM to access my websites and web apps, i'm all for it. but the LLM providers must disclose who they are - that they are OpenAI, Google, Meta, or the snake oil company Perplexity
chrisweekly · 37m ago
Your complaints about "faking their user-agent" remind me of this 15-year-old but still-relevant classic post about the history of the user-agent string:
TLDR the UA string has always been "faked", even in the scenarios you might think are most legitimate.
positiveblue · 1h ago
The point is: "should everyone just have an account with Cloudflare then?"
mmaunder · 1h ago
Brought to you by substack. ;-) Seriously though, great post and a great conversation starter.
positiveblue · 1h ago
I actually thought about this before publishing it hahaha
Good thing they are not the only place to post!
Havoc · 34m ago
Don't think the base game plan here is necessarily all that bad. It being concentrated in one for-profit entity, however, very much is.
tzury · 1h ago
at its foundation, the bots issue is in fact 3 main issues:
bots vs humans:
humans trying to buy tickets that bots have already snapped up
data scraping:
you index my data (real estate listings) not to route traffic to my site as people search for my product, as a search engine would, but to become my competitor.
spam (and scam):
digital pollution, or even worse, phishing for credit cards, gift cards, passwords, etc.
(obviously there are more, most which will fall into those categories, but those are the main ones)
now, in the human-assisted AI era, the first issue is no longer an issue, since it is obvious that each of us, the internet users, will soon have an agent built into our browser. so we will all have speedy automated select, click, and checkout at our disposal.
Prior to the LLM era, search engines and academic research sat on the right side of the internet-bot map, and scrapers on the wrong side. but now we have legitimate human users extending their interaction with an LLM agent, and on top of that, new AI companies, large and small, which hunger for data to train their models.
Cloudflare is simply trying to make sense of this, whilst keeping their bot protection relevant.
I do not appreciate the post content whatsoever, since it lacks consistency and maturity (a true understanding of how the internet works, rather than a naive one).
when you talk about "the internet", what exactly are you referring to?
a blog? a bank account management app? a retail website? social media?
those are all part of the internet, and each is a completely different type of operation.
EDIT:
I've written a few words about this back in January [1] and in fact suggested something similar:
Leading CDNs, CAPTCHA providers, and AI vendors—think
Cloudflare, Google reCAPTCHA, OpenAI, or Anthropic
could collaborate to develop something akin to a
“tokenized machine ID.”
The private tracker community has long figured this out. Put content behind invite-only user registration, and treeban users if they ever break the rules.
PaulRobinson · 1h ago
This doesn't scale to the general web, does it? I think invite-only might work to build communities, but you end up in the situation we're in today where people are buying/selling invites, and that's with treebans in place.
I do fear the actions of the current bot landscape are going to lead to almost everything going behind auth walls though, and perhaps even paid auth walls.
jmarbach · 2h ago
I recently ran a test on the page load reliability of Browserbase and I was shocked to see how unreliable it was for a standard set of websites - the top 100 websites in the US by traffic according to SimilarWeb. 29% of page load requests failed. Without an open standard for agent identification, it will always be a cat and mouse game to trap agents, and many agents will predictably fail simple tasks.
Here's to working together to develop a new protocol that works for agents and website owners alike.
viktorcode · 1h ago
I wish Cloudflare would roll out an AI poisoning attack as protection for their clients (serving a cache of bad data to AI bots) instead of this. It would work like a charm.
This is like saying companies don't need security gates and checkpoints. Unfortunately the world is filled with bad people, and you need security to keep them off your property.
bendigedig · 1h ago
If the broader economic system wasn't based on what is essentially theft, security wouldn't be as necessary as it is.
jjangkke · 2h ago
I would love to get off Cloudflare but there are no real good alternatives
bob1029 · 1h ago
Writing backends that can actually handle public traffic and using authentication for expensive resources are fantastic alternatives.
Also, cheaply rate limiting malicious web clients should be something that is trivial to accomplish with competent web tooling (i.e., on your own servers). If this seems out of scope or infeasible, you might be using the wrong tools for the job.
acdha · 6m ago
This sounds pretty unrealistic: the web is not better off if the only people who can host content are locking it behind authentication and/or have significant infrastructure budgets and the ability to create heavily tuned static stacks.
nromiun · 1h ago
Even if you write the best backend in the world where do you host them? AFAIK Cloudflare is the only free CDN.
esseph · 1h ago
You still have the network traffic issues, which are very substantial.
didibus · 2h ago
AWS is an alternative no?
nromiun · 1h ago
Bankruptcy as a surprise gift is not an alternative. Even those that use big cloud providers like AWS and GCP use CDNs like Cloudflare to protect themselves. And there is no free CDN like Cloudflare.
didibus · 1h ago
> And there is no free CDN like Cloudflare.
Their pricing page says:
No-nonsense Free Tier
As part of the AWS free Usage Tier you can get started with Amazon CloudFront for free.
Included in Always Free Tier
1 TB of data transfer out to the internet per month
10,000,000 HTTP or HTTPS Requests per month
2,000,000 CloudFront Function invocations per month
2,000,000 CloudFront KeyValueStore reads per month
10 Distribution Tenants
Free SSL certificates
No limitations, all features available
nromiun · 1h ago
1 TB per month of data is literally nothing. A kid could rent a VPS for an hour and drain all that. What do you do after that? AWS is not going to stop your bill from going up, is it?
I don't care about any of those fancy serverless services. I am just talking about the cheapest CDN.
rustc · 1h ago
> Included in Always Free Tier
> 1 TB of data
Someone can rent a 1Gbps server for cheap (under $50 on OVH) and pull 330TB in a month from your site. That's about $30k of egress on AWS if you don't do anything to stop it.
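A quick back-of-envelope check of those numbers (assuming the line is saturated around the clock and AWS egress at roughly $0.09/GB; actual tiered pricing varies):

```python
# Rough sketch of the egress math above; rates are illustrative.
GBPS = 1                                  # rented server bandwidth, Gbit/s
SECONDS_PER_MONTH = 30 * 24 * 3600

bytes_pulled = GBPS * 1e9 / 8 * SECONDS_PER_MONTH   # bits -> bytes
tb_pulled = bytes_pulled / 1e12
egress_cost = (bytes_pulled / 1e9) * 0.09           # ~$0.09 per GB out

print(f"~{tb_pulled:.0f} TB pulled, ~${egress_cost:,.0f} in egress")
# prints: ~324 TB pulled, ~$29,160 in egress
```

So the "330 TB" and "$30k" figures above are the right order of magnitude.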
miohtama · 2h ago
AWS needs a dedicated AWS engineer, while any technical person and some non-technical people have the skills to set up Cloudflare. Especially without surprise bills.
didibus · 1h ago
I always hear this, but honestly I'm not sure it's true.
It's hard to assess the validity of this versus Cloudflare having a really good marketing department.
I've used neither, so I can't say, but I've also never seen anyone truly explain why/why-not.
ryoshu · 1h ago
Why not use both and find out? Cloudflare is much less technical than AWS, but still a bit technical.
jbrisson · 32m ago
"In the 90s, Microsoft tried to “embrace and extend” the web, but failed. And that failure was a blessing."
Basically MS tried to kill the web with their Win95 release, the infamous Internet Explorer and their shitty IIS/Frontpage tandem.
I deeply hate them since that day.
positiveblue · 22m ago
many people don't remember/know history though
hoppp · 55m ago
Should we use a public blockchain for this? It's good for it: store public keys, verify signatures, etc. None of that token stuff though.
positiveblue · 51m ago
No
pluto_modadic · 1h ago
I think it shouldn't require registering /with/ Cloudflare. Cloudflare should just look up the referenced .well-known, double-check for impersonation, and keep score on how well-behaved each one is.
positiveblue · 1h ago
This is one of the main points :+1:
cyberlurker · 2h ago
I would love that vision to become reality but what Cloudflare is doing is unfortunately necessary atm.
TheCraiggers · 2h ago
Ok, I'll bite. Why is turning the Internet into a walled garden necessary now?
esseph · 1h ago
Commercial, criminal, and state interests have far more resources than you do, and their interests are in direct conflict with yours.
That would be fine, you could walk away and go home, but if you're going to drive on their digital highways, you're going to need "insurance" just protect you from everyone else.
Ongoing multi-nation WWIII-scale hacking and infiltration campaigns of infrastructure, AI bot crawling, search company and startup crawling, security researchers crawling, and maybe somebody doesn't like your blog and decides to rent a botnet for a week or so.
Bet your ISP shuts you off before then to protect themselves. (Happens all the time via BGP blackholing, DDoS scrubbing services, BGP FlowSpec, etc).
giantrobot · 1h ago
Multi-Tbps DDoS attacks, pervasive scanning of sites for exploits, comically expensive egress bandwidth on services like AWS, and ISPs disallowing hosting services on residential accounts.
doublerabbit · 1h ago
Forcing tighter security on the devices causing the multi-Tbps DDoS attacks would be a better option, no? Cheap unsecured IoT devices are a problem.
It's not just computers anymore. Web enabled CCTV, doorbell cameras are all culprits.
esseph · 1h ago
And home routers, printers, and end user devices themselves. Residential ISP networks can be infiltrated and remote CVE'd through browser calls at this point from a remote website. It's not even hard.
habinero · 1h ago
How would you secure someone else's devices?
madrox · 7m ago
The web doesn't need gatekeepers the way you don't need a bank account, driver's license, or a credit card. You can do without it, but it sure makes it harder to interact with modern society. The days of the mainstream internet being a libertarian frontier are more or less over. The capitalist internet is firmly in charge.
The real question is whether there is more business opportunity in supporting "unsigned" agents than signed ones. My hope is that the industry rejects this because there's more money to be made in catering to agents than blocking them. This move is mostly to create a moat for legacy business.
Also, if agents do become the de-facto way of browsing the internet, I'm not a fan of more ways of being tracked for ads and more ways for censorship groups to have leverage.
But the author is making a strawman argument over a "steelman" argument against signed agents. The strongest argument I can see is not that we don't need gatekeepers, but that regulation is anti-business.
johnnienaked · 1h ago
I can see a future where I don't use the internet at all.
jimmyl02 · 2h ago
I understand the concerns around a central gatekeeper but I'm confused as to why this specifically is viewed negatively. Don't website owners have to choose to enable cloudflare and to opt-in to this gate that the site owners control?
If this was cloudflare going into some centralized routing of the internet and saying everything must do X then that would be a lot more alarming but at the end of the day the internet is decentralized and site owners are the ones who are using this capability.
Additionally I don't think that I as an individual website owner would actually want / be capable of knowing which agents are good and bad and cloudflare doing this would be helpful to me as a site owner as long as they act in good faith. And the moment they stop acting in good faith I would be able to disable them. This is definitely a problem right now as unrestricted access to the bots means bad bots are taking up many cycles raising costs and taking away resources from real users
No comments yet
derefr · 21m ago
> Without that, I can simply hand the passport to another agent, and they can act as if they were me.
This isn't the problem Cloudflare are trying to solve here. AI scraping bots are a trigger for them to discuss this, but this is actually just one instance of a much larger problem — one that Cloudflare have been trying to solve for a while now, and which ~all other cloud providers have been ignoring.
My company runs a public data API. For QoS, we need to do things like blocking / rate-limiting traffic on a per-customer basis.
This is usually easy enough — people send an API key with their request, and we can block or rate-limit on those.
But some malicious (or misconfigured) systems, may sometimes just start blasting requests at our API without including an API key.
We usually just want to block these systems "at the edge" — there's no point to even letting those requests hit our infra. But to do that, without affecting any of our legitimate users, we need to have some key by which to recognize these systems, and differentiate them from legitimate traffic.
In the case where they're not sending an API key, that distinguishing key is normally the request's IP address / IP range / ASN.
The problematic exception, then, is Workers/Lambda-type systems (a.k.a. Function-as-a-Service [FaaS] providers) — where all workloads of all users of these systems come from the same pool of shared IP addresses.
---
And, to interrupt myself for a moment, in case the analogy isn't clear: centralized LLM-service web-browsing/tool-use backends, and centralized "agent" orchestrators, are both effectively just FaaS systems, in terms of how the web/MCP requests they originate, relate to their direct inbound customers and/or registered "agent" workloads.
Every problem of bucketing traditional FaaS outbound traffic, also applies to FaaSes where the "function" in question happens to be an LLM inference process.
"Agents" have made this concern more urgent/salient to increasingly-smaller parts of the ecosystem, who weren't previously considering themselves to be "data API providers." But you can actually forget about AI, and focus on just solving the problem for the more-general category of FaaS hosts — and any solution you come up with, will also be a solution applicable to the "agent formulation" of the problem.
---
Back to the problem itself:
The naive approach would be to block the entire FaaS's IP range the first time we see an attack coming from it. (And maybe some API providers can get away with that.)
But as long as we have at least one legitimate customer whose infrastructure has been designed around legitimate use of that FaaS to send requests to us, then we can't just block the entire Workers IP range.
(And sure, we could block these IP ranges by default, and then try to get such FaaS-using customers to send some additional distinguishing header in their requests to us, that would take priority over the FaaS-IP-range block... but getting a client engineer to implement an implementation-level change to their stack, by describing the needed change in a support ticket as a resolution to their problem, is often an extreme uphill battle. Better to find a way around needing to do it.)
So we really want/need some non-customer-controlled request metadata to match on, to block these bad FaaS workloads. Ideally, metadata that comes from the FaaS itself.
As it turns out, CF Workers itself already provides such a signal. Each outbound subrequest from a Worker gets forcibly annotated "on the way out" with a request header naming the Worker it came from. We can block on / rate-limit by this header. Works great!
But other FaaS providers do not provide anything similar. For example, it's currently impossible to determine which AWS Lambda customer is making requests to our API, unless that customer specifically deigns to attach some identifying info to their requests. (I actually reported this as a security bug to the Lambda team, over three years ago now.)
---
So, the point of an infrastructure-level-enforced public-visible workload-identity system, like what CF is proposing for their "signed agents", isn't just about being able to whitelist "good bots."
It's also about having some differentiable key that can cleanly bucket bot traffic, where any given bucket then contains purely legitimate or purely malicious/misbehaving bot traffic; so that if you set up rate-limiting, greylisting, or heuristic blocking by this distinguishing key, then the heuristic you use will ensure that your legitimate (bot) users never get punished, while your misbehaving/malicious (bot) users automatically trip the heuristic. Which means you never need to actually hunt through logs and manually blacklist specific malicious/misbehaving (bot) users.
If you look at this proposal as an extension/enhancement of what CF has already been doing for years with Workers subrequest originating-identity annotation, the additional thing that the "signed agents" would give the ecosystem on behalf of an adopting FaaS, is an assurance that random other bots not running on one of these FaaS platforms, can't masquerade as your bot (in order to take advantage of your preferential rate-limiting tier; or round-robin your and many others' identities to avoid such rate-limiting; or even to DoS-attack you by flooding requests that end up attributed to you.) Which is nice, certainly. It means that you don't have to first check that the traffic you're looking at originated from one of the trustworthy FaaS providers, before checking / trusting the workload-identity request header as a distinguishing key.
But in the end, that's a minor gain, compared to just having any standard at all — that other FaaSes would sign on to support — that would require them to emit a workload-identity header on outbound requests. The rest can be handled just by consuming+parsing the published IP-ranges JSON files from FaaS providers (something our API backend already does for CF in particular.)
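To make the bucketing idea concrete, here is a minimal sketch. The `cf-worker` header name and the prefix-string matching are simplifications (real code would parse the providers' published CIDR ranges, and you should confirm the exact header Cloudflare attaches to Worker subrequests); the point is only the key-selection order: API key, then FaaS workload identity, then IP.

```python
import time
from collections import defaultdict, deque

WORKER_HEADER = "cf-worker"   # assumed name of the Worker-identity header
RATE_LIMIT = 100              # requests allowed per window, per bucket
WINDOW_SECONDS = 60

def bucket_key(request_ip, headers, faas_prefixes):
    """Pick the key that best isolates one workload's traffic."""
    if "x-api-key" in headers:                        # best: our own customer ID
        return "key:" + headers["x-api-key"]
    if any(request_ip.startswith(p) for p in faas_prefixes):
        worker = headers.get(WORKER_HEADER)
        if worker:                                    # FaaS-provided workload ID
            return "worker:" + worker
        return "faas-anon:" + request_ip              # shared pool, no identity
    return "ip:" + request_ip                         # ordinary client

windows = defaultdict(deque)  # bucket key -> timestamps of recent requests

def allow(key, now=None):
    """Sliding-window rate limit on the chosen bucket key."""
    now = time.monotonic() if now is None else now
    q = windows[key]
    while q and q[0] <= now - WINDOW_SECONDS:
        q.popleft()                                   # drop expired entries
    if len(q) >= RATE_LIMIT:
        return False
    q.append(now)
    return True
```

Because each misbehaving FaaS workload lands in its own bucket, throttling it never punishes a legitimate customer who happens to share the same egress IPs.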
Wertulen · 1h ago
I suppose it’s time AI proselytization rediscovered the tragedy of the commons.
jmtame · 1h ago
I pretty much use Perplexity exclusively at this point, instead of Google. I'd rather just get my questions answered than navigate all of the ads and slowness that Google provides. I'm fine with paying a small monthly fee, but I don't want Cloudflare being the gatekeeper.
Perhaps a way to serve ads through the agents would be good enough. I'd prefer that to be some open protocol than controlled by a company.
verdverm · 20m ago
Perplexity has been one of the AI companies that created the problem that gave rise to this CF proposal. Why doesn't Perplexity invest more into being a responsible scraper?
I’m not necessarily coming to the defense of CF’s proposed solution, but it’s ridiculous and rather telling that the article mounts such a strong defense for agents around the notion they are simply completing user-directed tasks the user would otherwise do themselves, while avoiding the blatantly obvious issues of copyright, attribution, resource overusage, etc. presented by agents.
It’s somewhat ironic to let fly the “free and open internet” battle cry on behalf of an industry that is openly destroying it.
fortran77 · 27m ago
Cloudflare lost a lot of credibility by backing off its "neutral" stance and booting certain sites--some of which were admittedly horrible--from their service. Now it seems they want to be even more of a gatekeeper.
Wait till these robots get out in the real world and start overwhelming real world resources.
theideaofcoffee · 1h ago
Your ideas are intriguing to me and wish to subscribe to your newsletter.
Joking aside, I think the ideas and substance are great and sorely needed. However, I can only see the idea of a sort of token chain verification as running into the same UX problems that plagued (plagues?) PGP and more encryption-focused processes. The workflow is too opaque, requires too much specialized knowledge that is out of reach for most people. It would have to be wrapped up into something stupid simple like an iOS FaceID modal to have any hope of succeeding with the general public. I think that's the idea, that these agents would be working on behalf of their owners on their own devices, so it has to be absolutely seamless.
Otherwise, rock on.
hn_throw_250829 · 1h ago
Good. Accelerate.
imiric · 1h ago
> When I’m driving, I hand my phone to a friend and say, “Reply ‘on my way’ to my Mom.” They act on my behalf, through my identity, even though the software has no built-in concept of delegation. That is the world we are entering.
That is a very small part of the world we're entering.
The other vast majority of use cases will come from even more abusive bots than we have today, filling the internet with spam, disinformation, and garbage. The dead internet is no longer a theory, and the future we're building will make the internet for bots, by bots. Humans will retreat into niche corners of it, and those who wish to participate in the broader internet will either have to live with this, or abide by new government regulations that invade their privacy and undermine their security.
So, yes, confirming human identity is the only path forward if we want to make the internet usable by humans, but I do agree that the ideal solution will not come from a single company, or a single government, for that matter. It will be a bumpy ride until we figure this out.
IshKebab · 1h ago
> allowlist
Off topic but are people ever going to give up on this nonsense? It's so grating.
thanatos_dem · 1h ago
Allowlist is arguably fitting for a list of things which are allowed.
IshKebab · 18m ago
It's called a whitelist. A perfectly good word that isn't racist and one that normal people are quite happy to use. As far as I can tell the allow/blocklist craze hasn't made it out of the software world.
But the reality is how can someone small protect their blog or content from AI training bots? E.g.: They just blindly trust someone is sending Agent vs Training bots and super duper respecting robots.txt? Get real...
Or, fine what if they do respect robots.txt, but they buy the data that may or may not have been shielded through liability layers via "licensed data"?
Unless you're reddit, X, Google, or Meta with scary unlimited budget legal teams, you have no power.
Great video: https://www.youtube.com/shorts/M0QyOp7zqcY
Aren't these statements entirely in conflict? You either have a free for all open web or you don't. Blocking AI training bots is not free and open for all.
They don't use Cloudflare AFAIK.
They normally use a puzzle that the website generates, or a proof-of-work-based captcha. I've found proof of work good enough of the two, and it also means that the site owner can run it themselves instead of being reliant on Cloudflare and third parties.
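A proof-of-work challenge of this kind is small enough to sketch in full. This is a toy version (SHA-256, a fixed difficulty in leading zero bits), not any particular product's scheme:

```python
import hashlib
import os

DIFFICULTY = 12  # leading zero bits required; tune per deployment

def solve(challenge):
    """Client side: brute-force a nonce until the hash meets the target."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0:
            return nonce
        nonce += 1

def verify(challenge, nonce):
    """Server side: a single hash to check, no third party involved."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

challenge = os.urandom(16)   # issued per visitor, tied to a session
nonce = solve(challenge)
assert verify(challenge, nonce)
```

Each extra difficulty bit doubles the client's expected work while the server's check stays one hash, which is what makes bulk scraping expensive and a single human visit cheap.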
Corporations develop hostile AI agents,
Capable hackers develop anti-AI-agents.
Enough with this defeatist attitude of "we have no power".
In general Cloudflare has been pushing DRMization of the web for quite some time, and while I understand why they want to do it, I wish they didn't always show off as taking the moral high ground.
If anything we’ve seen the rise in complaints about it just annoying average users.
Having said that, the solution is effective enough, having a lightweight proxy component that issues proof of work tokens to such bogus requests works well enough, as various users on HN seem to point out.
And if you don't want to self-host, at least try to use services from organisations that aren't hostile to the open web.
This seems like slogan-based planning with no actual thought put into it.
[1]: https://codeberg.org/robots.txt#:~:text=Disallow:%20/.git/,....
What legal teeth I would advocate would be targeted to crawlers (a subset of bot) and not include your usage. It would mandate that Big Corp crawlers (for search indexing, AI data harvesting, etc.) be registered and identify themselves in their requests. This would allow serverside tools to efficiently reject them. Failure to comply would result in fines large enough to change behavior.
Now that I write that out, if such a thing were to come to pass, and it was well received, I do worry that congress would foam at the mouth to expand it to bots more generally, Microsoft-Uncertified-Devices, etc.
If it's too loose and similar to "wanted traffic is how the authors intend the website to be accessed, unwanted traffic is anything else", that's an argument that can be used against adblocks, or in favor of very specific devices like you mention. Might even give slightly more teeth to currently-unenforceable TOS.
If it's too strict, it's probably easier to find loopholes and technicalities that just lets them say "technically it doesn't match the definition of unwanted traffic".
Even if it's something balanced, I bet bigcorp lawyers will find a way to twist the definitions in their favor and set a precedent that's convenient for them.
I know this is a mini-rant rather than a helpful comment that tries to come up with a solution, it's just that I'm pessimistic because it seems the internet becomes a bit worse day by day no matter what we try to do :c
you think codeberg would sue you?
But it's the same thing with random software from a random nobody that has no license, or has a license that's not open-source: If I use those libraries or programs, do I think they would sue me? Probably not.
Well, I'm glad you speak for the entire Internet.
Pack it in folks, we've solved the problem. Tomorrow, I'll give us the solution to wealth inequality (just stop fighting efforts to redistribute wealth and political power away from billionaires hoarding it), and next week, we'll finally get to resolve the old question of software patents.
It's getting really, really ugly out there.
- Legal threats are never really effective
Effective solutions are:
- Technical
- Monetary
I like the idea of web as a blockchain of content. If you want to pull some data, you have to pay for it with some kind of token. You either buy that token to consume information if you're of the leecher type, or get some by doing contributions that gain back tokens.
It's more or less the same concept as torrents back in the day.
This should be applied to emails too. The regular person sends what, 20 emails per day max? Say it costs $0.01 per mail; anyone could pay that. But if you want to spam 1,000,000 every day, that becomes prohibitive.
This seems flawed.
Poor people living in 3rd world countries that make like $2.00/day wouldn't be able to afford this.
>But if you want to spam 1,000,000 everyday that becomes prohibitive.
Companies and people with money can easily pay this with no issues. If it costs $10,000 to send 1M emails that reach the inbox but you profit $50k, it's a non-issue.
I would love that, and make it automated.
A single message from your IP to your router: block this traffic. That router sends it upstream, and it also blocks it. Repeat ad nauseam until the source changes ASN or (if the originator is on the same ASN) it reaches the router nearest the originator, routing table space notwithstanding. Maybe the block auto-expires after a day or a month, or however long your IP lease exists. Plus, of course, a way to query what blocks I've requested and a way to unblock.
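As a toy model of that propagation (purely hypothetical — no such router protocol exists today), each hop records the block with an expiry and forwards the request one hop upstream:

```python
import time

class Router:
    def __init__(self, name, upstream=None):
        self.name = name
        self.upstream = upstream
        self.blocks = {}                       # source IP -> expiry timestamp

    def request_block(self, src, ttl=86400.0):
        self.blocks[src] = time.time() + ttl   # auto-expires after ttl
        if self.upstream is not None:          # propagate toward the source
            self.upstream.request_block(src, ttl)

    def is_blocked(self, src):
        expiry = self.blocks.get(src)
        return expiry is not None and expiry > time.time()

# home router -> ISP -> core: one request blocks at every hop
core = Router("core")
isp = Router("isp", upstream=core)
home = Router("home", upstream=isp)

home.request_block("203.0.113.9")
assert all(r.is_blocked("203.0.113.9") for r in (home, isp, core))
```

The hard parts the comment alludes to — authenticating the block request, trust between ASNs, and routing-table pressure — are exactly what this toy omits.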
No comments yet
It's not the publishers who need to do the hard work, it's the multi-billion dollar investments into training these systems that need to do the hard work.
We are moving to a position whereby if you or I want to download something without compensating the publisher, that's jail time, but if it's Zuck, Bezos or Musk, they get a free pass.
That's the system that needs to change.
I should not have to defend my blog from these businesses. They should be figuring out how to pay me for the value my content adds to their business model. And if they don't want to do that, then they shouldn't get to operate that model, in the same way I don't get to build a whole set of technologies on papers published by Springer Nature without paying them.
This power imbalance is going to be temporary. These trillion-dollar market cap companies think if they just speed-run it, they'll become too big, too essential, and the law will bend to their fiefdom. But in the long term, it won't - history tells us that concentrations of power like monarchies decay over time, and the results aren't pretty. I'm not sure I'll see the guillotine scaffolds going up in Silicon Valley or Seattle in my lifetime, but they'll go up one day unless these companies get a clue from history as to what they need to do.
No comments yet
A paywall.
In reality, what some want is to get all the benefits of having their content on the open internet while still controlling who gets to access it. That is the root cause here.
We need micropayments going forward, Lightning (Bitcoin backend) could be the solution.
What about licenses like CC-BY-NC (Creative Commons - Non Commercial)?
Which is not everyone.
> protect their blog or content from AI training bots
It strikes me that one needs to chose one of these as their visionary future.
Specifically: a free and open web is one where read access is unfettered to humans and AI training bots alike.
So much of the friction and malfunction of the web stems from efforts to exert control over the flow (and reuse) of information. But this is in conflict with the strengths of a free and open web, chief of which is the stone cold reality that bytes can trivially be copied and distributed permissionlessly for all time.
The AI crawlers are going to get smarter at crawling, and they'll have crawled and cached everything anyway; they'll just be reading your new stuff. They should literally just buy the Internet Archive jointly, and only read everything once a week or so. But people (to protect their precious ideas) will then just try to figure out how to block the IA.
One thing I wish people would stop doing is conflating their precious ideas and their bandwidth. The bandwidth is one very serious issue, because it's a denial of service attack. But it can be easily solved. Your precious ideas? Those have to be protected by a court. And I don't actually care if the copyright violation can go both ways; wealthy people seem to be free to steal from the poor at will, even rewarded, "normal" (upper-middle class) people can't even afford to challenge obviously fraudulent copyright claims, and the penalties are comically absurd and the direct result of corruption.
Maybe having pay-to-play justice systems that punish the accused before conviction with no compensation was a bad idea? Even if it helped you to feel safe from black people? Maybe copyright is dumb now that there aren't any printers anymore, just rent-seekers hiding bitfields?
I'm routinely denied access to websites now.
enable javascript and unblock cookies to continue
Basic tools like Anubis and fail2ban are very effective at keeping most of this evil at bay.
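For the curious, a proof-of-work gate of the kind Anubis implements reduces to a few lines: the server hands out a challenge, the client burns CPU finding a nonce whose hash clears a difficulty target, and the server verifies with a single cheap hash. This is a minimal sketch, not Anubis's actual protocol; the difficulty value is arbitrary:

```python
import hashlib
import itertools

DIFFICULTY = 4  # leading hex zeros required; arbitrary, not Anubis's setting

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce until the hash meets the target."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * DIFFICULTY):
            return nonce

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash checks the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)
```

The asymmetry is the point: the client does ~65,000 hashes on average at this difficulty, the server does one, and a scraper hammering thousands of pages pays the cost thousands of times.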
First off, there's no harm from well-behaved bots. Badly behaved bots that cause problems for the server are easily detected (by the problems they cause), classified, and blocked or heavily throttled.
Of course, if you mean "protect" in the sense of "keep AI companies from getting a copy" (which you may have, given that you mentioned training) - you simply can't, unless you consider "don't put it on the web" a solution.
It's impossible to make something "public, but not like that". Either you publish or you don't.
If anything, it's a legal issue (copyright/fair use), not a technical one. Technical solutions won't work.
I'm not sure why people are so confused by this. The Mastodon/AP userbase put their public content on a publicly federated protocol then lost their shit and sent me death threats when I spidered and indexed it for network-wide search.
There are upsides and downsides to publishing things you create. One of the downsides is that it will be public and accessible to everyone.
If this is your primary argument against being scraped (viz that your robots.txt said not to) then you’re naive and you’re doing it wrong.
If the internet is open, then data on it is going to be scraped lol. You can’t have it both ways.
If others respected robots.txt, we would not need solutions like what Cloudflare is presenting here. Since abuse is rampant, people are looking for mitigations and this CF offering is an interesting one to consider.
To be fair, an accurate measurement would need to consider how many of those CPU cycles would be spent by the human user who is driving the bot. From that perspective, maybe the scrapers can “make up for it” by crawling efficiently, i.e. avoiding tracker scripts, images, etc. unless necessary to solve the query. This way they’ll still burn CPU cycles, but at least it’ll be fewer cycles than a human user with a headful browser instance.
On one hand these companies announce themselves as sophisticated, futuristic and highly-valued, on the other hand we see rampant incompetence, to the point that webmasters everywhere are debating the best course of action.
Gee, if only we had, like, one central archive of the internet. We could even call it the internet archive.
Then, all these AI companies could interface directly with that single entity on terms that are agreeable.
That distinction requires you to take companies which benefit from amassing as much training data as possible at their word when they pinky swear that a particular request is totally not for training, promise.
Centralization bad yada yada. But if Cloudflare can get most major AI players to participate, then convince the major CDN's to also participate.... ipso facto columbo oreo....standard.
You can see it in the web vs mobile apps.
Many people may not see a problem with walled gardens, but the reality is that we have much less innovation in mobile than on the web, because anyone can spin up a web server, versus publishing an app in Apple's App Store.
At some point soon, if not now, assume everything is generated by AI unless proven otherwise using a decentralized ID.
Likewise, on the server side, assume it’s a bot unless proven otherwise using a decentralized ID.
We can still have anonymity using decentralized IDs. An identity can be an anonymous identity, it’s not all (verified by some central official party) or nothing.
It comes down to different levels of trust.
Decoupling identity and trust is the next step.
Why law enforcement doesn't do their job, resulting in people not bothering to report things anymore, is imo the real issue here. Third party identification services to replace a failing government branch is pretty ugly as a workaround, but perhaps less ugly than the commercial gatekeepers popping up today
https://www.w3.org/TR/did-1.1/
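To make the "anonymous but provable identity" pattern concrete: it boils down to signing each request with key material and verifying it on the server. The sketch below uses a shared HMAC secret purely to stay dependency-free; real DID-based auth uses asymmetric keys resolved from a DID document, so the verifier never holds the client's secret, and every name here is illustrative:

```python
import hmac
import hashlib

# Illustrative only: real DID auth resolves a public key from a DID
# document and verifies an asymmetric signature. HMAC with a shared
# secret stands in here just to show the per-request proof pattern.
SECRET = b"shared-demo-secret"  # hypothetical key material

def sign_request(body: bytes, secret: bytes = SECRET) -> str:
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_request(body: bytes, signature: str, secret: bytes = SECRET) -> bool:
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

The identity proven this way can be a pseudonym with no link to a legal person, which is what makes the "anonymity with different levels of trust" claim above plausible.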
Privacy cannot exist in an environment where the host gets to decide who accesses the web page. I'm okay with rate limiting or otherwise blocking activity that creates too much of a load, but trying to prevent automated access is impossible without preventing access from real people.
If you want the best of both worlds, i.e. just post freely but make money from ads, or inserting hidden pixels to update some profile about me, well good luck. I'll choose whether I want to look at ads, or load tracking pixels, and my answer is no.
> my answer is no.
Rights for me, but not for thee?
Does this only apply to "information" or should we treat all open source code as public domain?
Free Software is the only place where this is a real abridgement of rights and intention, and it's over. They've already been trained on all of it, and no judge will tell them to stop, and no congressman will tell them to stop.
However, I do believe the host can do whatever they want with my request also.
This issue becomes more complex when you start talking about government sites, since ideally they have a much stronger mandate to serve everyone fairly.
An allowlist run by one company that site owners choose to engage with. But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts…
Exactly, no problem with that, just hinting that's not a protocol.
> But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts
Wait, what?
I was referring to the following image:
https://substackcdn.com/image/fetch/$s_!zRK-!,w_1250,h_703,c...
>But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts…
"But you participate in society!"
>We need protocols, not gatekeepers.
But until we have working protocols, many webmasters literally do need a gatekeeper if they want to realistically keep their site safe and online.
I wish this weren't the case, but I believe the "protocol" era of the web was basically ended when proprietary web 2.0 platforms emerged that explicitly locked users in with non-open protocols. Facebook doesn't want you to use Messenger in an open client next to AIM, MSN, and IRC. And the bad guys won.
But like I said, I hope I'm wrong.
(What are some good alternatives to Cloudflare?)
Another way the situation is similar: email delivery is often unreliable and hard to implement due to spam filters. A similar thing seems to be happening to the web.
Not to mention the big cloud providers are unhinged with their egress pricing.
Cloudflare seems very vocal about its desire to become yet another digital gatekeeper as of late, and so is Google. I want both reduced to rubble if they persist in it.
One of the hard parts in this space is what level of transparency should you have. We're advancing the thesis that behavioral biometrics offers robust continuous authentication that helps with bot/human and good/bad, but people are obviously skeptical to trust black-box models for accuracy and/or privacy reasons.
We've defaulted to a lot of transparency in terms of publishing research online (and hopefully in scientific journals), but we've seen the downside: competitors making fake claims about their own in-house behavioral tools hidden behind company walls, in addition to investors constantly worried about an arms race.
As someone genuinely interested (and incentivized!) to build a great solution in this space, what are good protocols/examples to follow?
and isn't this why people sign up with Cloudflare in the first place? for bot protection? to me, this is just the same, but with agents.
i love the idea of an open internet, but this requires all parties to be honest. a company like Perplexity that fakes their user-agent to get around blocks disrespects that idea.
my attitude towards agents is positive. if a user used an LLM to access my websites and web apps, i'm all for it. but the LLM providers must disclose who they are - that they are OpenAI, Google, Meta, or the snake oil company Perplexity
https://webaim.org/blog/user-agent-string-history/
TLDR the UA string has always been "faked", even in the scenarios you might think are most legitimate.
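A quick stdlib sketch of why the UA string proves nothing: it is just a request header the client sets to whatever it likes (the URL and UA value below are placeholders):

```python
import urllib.request

# The User-Agent is only a header the client chooses; nothing stops any
# scraper from claiming to be a browser. (URL is a placeholder.)
req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (compatible; definitely-a-browser)"},
)
print(req.get_header("User-agent"))
# -> Mozilla/5.0 (compatible; definitely-a-browser)
```

Which is why any disclosure scheme that relies on the UA string alone is honor-system only, and why proposals like signed agents move the identity claim into something cryptographic.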
Good thing they are not the only place to post!
bots vs humans:
humans are trying to buy tickets that were bought out by bots
data scraping:
you index my data (real estate listings) not to route traffic to my site as people search for my product, the way a search engine would, but to become my competitor.
spam (and scam): digital pollution, or even worse, trying to input credit card, gift cards, passwords, etc.
(obviously there are more, most which will fall into those categories, but those are the main ones)
now, with human-assisted AI, the first issue is no longer an issue, since it is obvious that each of us, the internet users, will soon have an agent built into our browser. so we will all have speedy automated select, click, and checkout at our disposal.
Prior to the LLM era, there were search engines and academic research on the right side of the internet-bots map, and scrapers, and worse, on the wrong side. but now we have legitimate human users extending their interaction with an LLM agent, and on top of that, new AI companies, large and small, hungry for data to train their models.
Cloudflare is simply trying to make sense of this, while keeping their bot protection relevant.
I do not appreciate the post content whatsoever, since it lacks consistency and maturity (a true understanding of how the internet works, rather than a naive one).
when you talk about "the internet", what exactly are you referring to? a blog? a bank account management app? a retail website? social media?
those are all part of the internet, and each is a completely different type of operation.
EDIT:
I've written a few words about this back in January [1] and in fact suggested something similar:
https://blog.tarab.ai/p/bot-management-reimagined-in-the
I do fear the actions of the current bot landscape are going to lead to almost everything going behind auth walls though, and perhaps even paid auth walls.
https://anchorbrowser.io/blog/page-load-reliability-on-the-t...
Here's to working together to develop a new protocol that works for agents and website owners alike.
Also, cheaply rate limiting malicious web clients should be something that is trivial to accomplish with competent web tooling (i.e., on your own servers). If this seems out of scope or infeasible, you might be using the wrong tools for the job.
Their pricing page says:
No-nonsense Free Tier
As part of the AWS Free Usage Tier you can get started with Amazon CloudFront for free.
Included in Always Free Tier:
- 1 TB of data transfer out to the internet per month
- 10,000,000 HTTP or HTTPS requests per month
- 2,000,000 CloudFront Function invocations per month
- 2,000,000 CloudFront KeyValueStore reads per month
- 10 Distribution Tenants
- Free SSL certificates
- No limitations, all features available
I don't care about any of those fancy serverless services. I am just talking about the cheapest CDN.
> 1 TB of data
Someone can rent a 1Gbps server for cheap (under $50 on OVH) and pull 330TB in a month from your site. That's about $30k of egress on AWS if you don't do anything to stop it.
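The arithmetic checks out; a quick sketch, assuming the ballpark $0.09/GB first-tier AWS egress rate (which varies by region and tier):

```python
# Back-of-the-envelope: sustained 1 Gbps for 30 days, priced at an
# assumed $0.09/GB egress rate (first-tier AWS pricing; varies by region).
GBPS = 1
SECONDS = 30 * 24 * 3600
egress_gb = GBPS / 8 * SECONDS  # gigabits/s -> gigabytes total
cost = egress_gb * 0.09
print(f"{egress_gb / 1000:.0f} TB pulled, ~${cost:,.0f} in egress fees")
# -> 324 TB pulled, ~$29,160 in egress fees
```

So a sub-$50/month rented server can, in the worst case, generate roughly six hundred times its own cost in your egress bill.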
It's hard to assess the validity of this versus Cloudflare having a really good marketing department.
I've used neither, so I can't say, but I've also never seen anyone truly explain why/why-not.
Basically MS tried to kill the web with their Win95 release, the infamous Internet Explorer, and their shitty IIS/FrontPage tandem.
I deeply hate them since that day.
That would be fine, you could walk away and go home, but if you're going to drive on their digital highways, you're going to need "insurance" just protect you from everyone else.
Ongoing multi-nation WWIII-scale hacking and infiltration campaigns of infrastructure, AI bot crawling, search company and startup crawling, security researchers crawling, and maybe somebody doesn't like your blog and decides to rent a botnet for a week or so.
Bet your ISP shuts you off before then to protect themselves. (Happens all the time via BGP blackholing, DDoS scrubbing services, BGP FlowSpec, etc).
It's not just computers anymore. Web-enabled CCTV and doorbell cameras are all culprits.
The real question is whether there is more business opportunity in supporting "unsigned" agents than signed ones. My hope is that the industry rejects this because there's more money to be made in catering to agents than blocking them. This move is mostly to create a moat for legacy business.
Also, if agents do become the de-facto way of browsing the internet, I'm not a fan of more ways of being tracked for ads and more ways for censorship groups to have leverage.
But the author is making a strawman argument over a "steelman" argument against signed agents. The strongest argument I can see is not that we don't need gatekeepers, but that regulation is anti-business.
If this was cloudflare going into some centralized routing of the internet and saying everything must do X then that would be a lot more alarming but at the end of the day the internet is decentralized and site owners are the ones who are using this capability.
Additionally, I don't think that I, as an individual website owner, would actually want to / be capable of knowing which agents are good and bad, so Cloudflare doing this would be helpful to me as a site owner, as long as they act in good faith. And the moment they stop acting in good faith, I would be able to disable them. This is definitely a problem right now, as unrestricted bot access means bad bots are taking up many cycles, raising costs and taking resources away from real users.
This isn't the problem Cloudflare are trying to solve here. AI scraping bots are a trigger for them to discuss this, but this is actually just one instance of a much larger problem — one that Cloudflare have been trying to solve for a while now, and which ~all other cloud providers have been ignoring.
My company runs a public data API. For QoS, we need to do things like blocking / rate-limiting traffic on a per-customer basis.
This is usually easy enough — people send an API key with their request, and we can block or rate-limit on those.
But some malicious (or misconfigured) systems, may sometimes just start blasting requests at our API without including an API key.
We usually just want to block these systems "at the edge" — there's no point to even letting those requests hit our infra. But to do that, without affecting any of our legitimate users, we need to have some key by which to recognize these systems, and differentiate them from legitimate traffic.
In the case where they're not sending an API key, that distinguishing key is normally the request's IP address / IP range / ASN.
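The edge logic described above can be sketched as a per-key token bucket that prefers the API key and falls back to the source IP; every class name and limit here is illustrative, not this company's actual implementation:

```python
import time
from collections import defaultdict
from typing import Optional

class TokenBucket:
    """Per-key token bucket; each key refills at `rate` tokens/second."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = defaultdict(lambda: burst)
        self.last = defaultdict(time.monotonic)

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[key]
        self.last[key] = now
        self.tokens[key] = min(self.burst, self.tokens[key] + elapsed * self.rate)
        if self.tokens[key] >= 1:
            self.tokens[key] -= 1
            return True
        return False

def request_key(api_key: Optional[str], ip: str) -> str:
    # Bucket by API key when present; fall back to source IP otherwise.
    return f"key:{api_key}" if api_key else f"ip:{ip}"
```

The fallback key is exactly where the FaaS problem below bites: when thousands of unrelated workloads share one IP pool, `ip:` buckets lump good and bad actors together.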
The problematic exception, then, is Workers/Lambda-type systems (a.k.a. Function-as-a-Service [FaaS] providers) — where all workloads of all users of these systems come from the same pool of shared IP addresses.
---
And, to interrupt myself for a moment, in case the analogy isn't clear: centralized LLM-service web-browsing/tool-use backends, and centralized "agent" orchestrators, are both effectively just FaaS systems, in terms of how the web/MCP requests they originate, relate to their direct inbound customers and/or registered "agent" workloads.
Every problem of bucketing traditional FaaS outbound traffic, also applies to FaaSes where the "function" in question happens to be an LLM inference process.
"Agents" have made this concern more urgent/salient to increasingly-smaller parts of the ecosystem, who weren't previously considering themselves to be "data API providers." But you can actually forget about AI, and focus on just solving the problem for the more-general category of FaaS hosts — and any solution you come up with, will also be a solution applicable to the "agent formulation" of the problem.
---
Back to the problem itself:
The naive approach would be to block the entire FaaS's IP range the first time we see an attack coming from it. (And maybe some API providers can get away with that.)
But as long as we have at least one legitimate customer whose infrastructure has been designed around legitimate use of that FaaS to send requests to us, then we can't just block the entire Workers IP range.
(And sure, we could block these IP ranges by default, and then try to get such FaaS-using customers to send some additional distinguishing header in their requests to us, that would take priority over the FaaS-IP-range block... but getting a client engineer to implement an implementation-level change to their stack, by describing the needed change in a support ticket as a resolution to their problem, is often an extreme uphill battle. Better to find a way around needing to do it.)
So we really want/need some non-customer-controlled request metadata to match on, to block these bad FaaS workloads. Ideally, metadata that comes from the FaaS itself.
As it turns out, CF Workers itself already provides such a signal. Each outbound subrequest from a Worker gets forcibly annotated "on the way out" with a request header naming the Worker it came from. We can block on / rate-limit by this header. Works great!
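A consumer of that annotation might look like the sketch below. It assumes the header is `CF-Worker`, which as far as I know is what Cloudflare attaches to Worker subrequests; the blocklist contents and helper names are made up:

```python
from typing import Mapping, Optional

# Hypothetical blocklist of misbehaving worker identities.
BLOCKED_WORKERS = {"scraper.example.workers.dev"}

def workload_identity(headers: Mapping[str, str]) -> Optional[str]:
    """Return the originating-worker identity, if the request carries one.

    Assumes Cloudflare's `CF-Worker` subrequest annotation; other FaaS
    providers currently offer no equivalent signal.
    """
    return headers.get("CF-Worker")

def should_block(headers: Mapping[str, str]) -> bool:
    ident = workload_identity(headers)
    return ident is not None and ident in BLOCKED_WORKERS
```

The key property is that the header is attached by the platform, not the workload, so a misbehaving tenant can't simply omit or forge it the way they can with a User-Agent.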
But other FaaS providers do not provide anything similar. For example, it's currently impossible to determine which AWS Lambda customer is making requests to our API, unless that customer specifically deigns to attach some identifying info to their requests. (I actually reported this as a security bug to the Lambda team, over three years ago now.)
---
So, the point of an infrastructure-level-enforced public-visible workload-identity system, like what CF is proposing for their "signed agents", isn't just about being able to whitelist "good bots."
It's also about having some differentiable key that can cleanly bucket bot traffic, where any given bucket then contains purely legitimate or purely malicious/misbehaving bot traffic; so that if you set up rate-limiting, greylisting, or heuristic blocking by this distinguishing key, then the heuristic you use will ensure that your legitimate (bot) users never get punished, while your misbehaving/malicious (bot) users automatically trip the heuristic. Which means you never need to actually hunt through logs and manually blacklist specific malicious/misbehaving (bot) users.
If you look at this proposal as an extension/enhancement of what CF has already been doing for years with Workers subrequest originating-identity annotation, the additional thing that the "signed agents" would give the ecosystem on behalf of an adopting FaaS, is an assurance that random other bots not running on one of these FaaS platforms, can't masquerade as your bot (in order to take advantage of your preferential rate-limiting tier; or round-robin your and many others' identities to avoid such rate-limiting; or even to DoS-attack you by flooding requests that end up attributed to you.) Which is nice, certainly. It means that you don't have to first check that the traffic you're looking at originated from one of the trustworthy FaaS providers, before checking / trusting the workload-identity request header as a distinguishing key.
But in the end, that's a minor gain, compared to just having any standard at all — that other FaaSes would sign on to support — that would require them to emit a workload-identity header on outbound requests. The rest can be handled just by consuming+parsing the published IP-ranges JSON files from FaaS providers (something our API backend already does for CF in particular.)
Perhaps a way to serve ads through the agents would be good enough. I'd prefer that to be some open protocol than controlled by a company.
https://blog.cloudflare.com/perplexity-is-using-stealth-unde...
It’s somewhat ironic to let fly the “free and open internet” battle cry on behalf of an industry that is openly destroying it.
Web Bot Auth
https://news.ycombinator.com/item?id=45055452
and associated blog post:
The age of agents: cryptographically recognizing agent traffic
https://blog.cloudflare.com/signed-agents/
Joking aside, I think the ideas and substance are great and sorely needed. However, I can only see the idea of a sort of token chain verification as running into the same UX problems that plagued (plagues?) PGP and more encryption-focused processes. The workflow is too opaque, requires too much specialized knowledge that is out of reach for most people. It would have to be wrapped up into something stupid simple like an iOS FaceID modal to have any hope of succeeding with the general public. I think that's the idea, that these agents would be working on behalf of their owners on their own devices, so it has to be absolutely seamless.
Otherwise, rock on.
That is a very small part of the world we're entering.
The other vast majority of use cases will come from even more abusive bots than we have today, filling the internet with spam, disinformation, and garbage. The dead internet is no longer a theory, and the future we're building will make the internet for bots, by bots. Humans will retreat into niche corners of it, and those who wish to participate in the broader internet will either have to live with this, or abide by new government regulations that invade their privacy and undermine their security.
So, yes, confirming human identity is the only path forward if we want to make the internet usable by humans, but I do agree that the ideal solution will not come from a single company, or a single government, for that matter. It will be a bumpy ride until we figure this out.
Off topic but are people ever going to give up on this nonsense? It's so grating.