The web does not need gatekeepers: Cloudflare’s new “signed agents” pitch
252 positiveblue 237 8/29/2025, 4:35:24 PM positiveblue.substack.com ↗
How does this person think JWTs work?
The web doesn't need to know if you're a human, a bot, or a dog. It just needs to serve bytes to whoever asks, within reasonable resource constraints. That's it. That's the open web. You'll miss it when it's gone.
A basic Varnish setup should get you most of the way there, no agent signing required!
So no, this advice has been outdated for decades.
Also you're doing some sort of victim blaming where everyone on earth has to engineer their service to withstand DoS instead of outsourcing that to someone else. Abusers outsource their attacks to everyone else's machine (decentralization ftw!), but victims can't outsource their defense because centralization goes against your ideals.
At least lament the naive infrastructure of the internet or something, sheesh.
Delegation of authorization can be useful for things that require it (as in some of the examples given in the article), but public files should not require authorization or authentication to access them. Even if delegation of authorization is helpful for some uses, Cloudflare (or anyone else, other than whoever is delegating the authorization) does not need to be involved in them.
But the reality is how can someone small protect their blog or content from AI training bots? E.g.: They just blindly trust someone is sending Agent vs Training bots and super duper respecting robots.txt? Get real...
Or, fine what if they do respect robots.txt, but they buy the data that may or may not have been shielded through liability layers via "licensed data"?
Unless you're reddit, X, Google, or Meta with scary unlimited budget legal teams, you have no power.
Great video: https://www.youtube.com/shorts/M0QyOp7zqcY
Aren't these statements entirely in conflict? You either have a free for all open web or you don't. Blocking AI training bots is not free and open for all.
It is like saying "If your grocery store is open to the public, why is it not open to this herd of rhinoceroses?" Well, the reason is because rhinoceroses are simply not going to stroll up and down the aisles and head to the checkout line quietly with a box of cereal and a few bananas. They're going to knock over displays and maybe even shelves and they're going to damage goods and generally make the grocery store unusable for everyone else. You can say "Well, then your problem isn't rhinoceroses, it's entities that damage the store and impede others from using it" and I will say "Yes, and rhinoceroses are in that group, so they are banned".
It's certainly possible to imagine a world where AI bots use websites in more acceptable ways --- in fact, it's more or less the world we had prior to about 2022, where scrapers did exist but were generally manageable with widely available techniques. But that isn't the world that we live in today. It's also certainly true that many humans are using websites in evil ways (notably including the humans who are controlling many of these bots), and it's also very true that those humans should be held accountable for their actions. But that doesn't mean that blocking bots makes the internet somehow unfree.
This type of thinking that freedom means no restrictions makes sense only in a sort of logical dreamworld disconnected from practical reality. It's similar to the idea that "freedom" in the socioeconomic sphere means the unrestricted right to do whatever you please with resources you control. Well, no, that is just your freedom. But freedom globally construed requires everyone to have autonomy and be able to do things, not just those people with lots of resources.
I can't disagree with being against badly behaved scrapers. But this is neither a new problem nor an interesting one from the standpoint of making information freely available to everyone, even rhinoceroses, assuming they are well behaved. Blocking bad actors is not the same thing as blocking AI.
We can't have nice things because the powerful cannot be held accountable. The powerful are powerful due to their legal teams and money, and power is the ability to carve exceptions to rules.
I don’t feel a particular need to subsidize multi-billion, even trillion-dollar corporations with my content, bandwidth, and server costs, since their genius vibe-coded bots apparently don’t know how to use modified-GETs or caching, let alone parse and respect robots.txt.
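For anyone unfamiliar, the "modified-GETs" here are conditional requests: the client replays the `ETag` / `Last-Modified` validators it got last time, and the server can answer `304 Not Modified` with no body. A minimal sketch of the client-side bookkeeping, with no network calls. The `ValidatorCache` class and its method names are made up for illustration, and header names are assumed to arrive in their canonical capitalization:

```python
from dataclasses import dataclass, field


@dataclass
class ValidatorCache:
    """Remembers validators and bodies per URL (hypothetical helper)."""
    entries: dict = field(default_factory=dict)

    def request_headers(self, url: str) -> dict:
        """Headers to send so the server can reply 304 instead of the full body."""
        entry = self.entries.get(url)
        headers = {}
        if entry:
            if entry.get("etag"):
                headers["If-None-Match"] = entry["etag"]
            if entry.get("last_modified"):
                headers["If-Modified-Since"] = entry["last_modified"]
        return headers

    def store(self, url, status, response_headers, body):
        """Update the cache from a response and return the body to use."""
        if status == 304 and url in self.entries:
            return self.entries[url]["body"]  # unchanged: reuse the cached copy
        if status == 200:
            self.entries[url] = {
                "etag": response_headers.get("ETag"),
                "last_modified": response_headers.get("Last-Modified"),
                "body": body,
            }
        return body
```

A crawler that did this would re-download a given jpeg only when it actually changed, rather than every few minutes.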
Problem two is not anything new. Taking freely available content and distilling it into a product is something valuable and potentially worth paying for. People used to buy encyclopedias too. There are countless examples.
Bots aren't people.
You can want public water fountains without wanting a company attaching a hose to the base to siphon municipal water for corporate use, rendering them unusable for everyone else.
You can want free libraries without companies using their employees' library cards to systematically check out all the books at all times so they don't need to wait if they want to reference one.
Ultimately it is the users of AI (and I am one of them) that benefit from that service. I put out a lot of open code and I hope that people are able to make use of it however they can. If that's through AI, go ahead.
Yes it does, that's the entire point.
The flood of AI bots is so bad that (mainly older) servers are literally being overloaded and (newer servers) have their hosting costs spike so high that it's unaffordable to keep the website alive.
I've had to pull websites offline because badly designed & ban-evading AI scraper bots would run up the bandwidth into the TENS OF TERABYTES, EACH. Downloading the same jpegs every 2-3 minutes into perpetuity. Evidently all that vibe coding isn't doing much good at Anthropic and Perplexity.
Even with my very cheap transfer, that racks up $50-$100/mo in additional costs. If I wanted to use any kind of fanciful "app" hosting it'd be thousands.
This is such a bad faith argument.
We want a town center for the whole community to enjoy! What, you don't like those people shooting up drugs over there? But they're enjoying it too, this is what you wanted right? They're not harming you by doing their drugs. Everyone is enjoying it!
If the person using illegal drugs is in no way harming anyone but themselves and not being a nuisance, then yeah, I can get behind that. Put whatever you want in your body, just don't let it negatively impact anyone around you. Seems reasonable?
Seems to be a lot of conflating of badly coded (intentionally or not) scrapers and AI. That is a problem that predates AI's existence.
Freedom, the word, while it implies no boundaries, is always bound by ethics, mutual respect and the "do no harm" principle. The moment you trip any one of these wires, the mechanisms to counter it become active.
Then we cry "but, freedom?!". Freedom also contains the consequences of one's actions.
Freedom without consequences is tyranny of the powerful.
What licenses? Free and open web. Go crazy. What ethical considerations? Do I police how users use the information on my site? No. If they make a pipe bomb using a 6502 CPU using code taken from my website -- am I supposed to do something about that?
Talented people that want to scrape or bot things are going to find ways to make that look human. If that comes in the form of tricking a physical iPhone by automatically driving the screen physically, so be it; many such cases already!
The techniques you need for preventing DDoS don't need to really differentiate that much between bots and people unless you're being distinctly targeted; Fail2Ban-style IP bans are still quite effective, and basic WAF functionality does a lot.
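To make "Fail2Ban-style" concrete: the idea is just "too many hits from one IP inside a time window earns a temporary ban". The real Fail2Ban watches log files and edits firewall rules; the sketch below is an in-process stand-in, and the `BanList` name and the default thresholds are arbitrary assumptions:

```python
from collections import defaultdict, deque


class BanList:
    """Ban an IP for `ban_seconds` after more than `max_hits` requests within `window` seconds."""

    def __init__(self, max_hits=100, window=60.0, ban_seconds=600.0):
        self.max_hits = max_hits
        self.window = window
        self.ban_seconds = ban_seconds
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self.banned_until = {}           # ip -> time at which the ban expires

    def allow(self, ip: str, now: float) -> bool:
        """Record one request at time `now`; return False if it should be rejected."""
        if self.banned_until.get(ip, 0.0) > now:
            return False
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while recent and recent[0] <= now - self.window:
            recent.popleft()
        if len(recent) > self.max_hits:
            self.banned_until[ip] = now + self.ban_seconds
            recent.clear()
            return False
        return True
```

Note that nothing in it tries to tell bots from people; it only looks at request volume per address.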
Nothing is truly free unless you give equal respect to fellow hobbyists and megacorps using your labor for their profit.
Corporations develop hostile AI agents,
Capable hackers develop anti-AI-agents.
This defeatist attitude "we have no power".
In general Cloudflare has been pushing DRMization of the web for quite some time, and while I understand why they want to do it, I wish they didn't always show off as taking the moral high ground.
If anything we’ve seen the rise in complaints about it just annoying average users.
Having said that, the solution is effective enough, having a lightweight proxy component that issues proof of work tokens to such bogus requests works well enough, as various users on HN seem to point out.
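For readers who haven't seen one, a proof-of-work token boils down to a hashcash-style puzzle: the proxy hands out a random challenge, the client burns CPU finding a nonce, and the proxy checks it with a single hash. This is a generic sketch, not the scheme any particular tool uses; the function names and the `challenge:nonce` encoding are assumptions:

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: a random, single-use challenge string."""
    return secrets.token_hex(16)


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of `digest`."""
    as_int = int.from_bytes(digest, "big")
    return len(digest) * 8 - as_int.bit_length()


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce; expected cost is about 2**difficulty hashes."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce
```

The asymmetry is the point: verifying costs one hash, solving costs thousands or millions, which is negligible for one visitor and expensive for a crawler making millions of requests.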
um, no? Where did you get this strange bit of info.
The original reports say nothing of that sort: https://news.ycombinator.com/item?id=42790252 ; and even original motivation for Anubis was Amazon AI crawler https://news.ycombinator.com/item?id=42750420
(I've seen more posts with the analysis, including one which showed an AI crawler which would identify properly, but once it hits the ratelimit, would switch to fake user agent from proxies.. but I cannot find it now)
And if you don’t want to self host, at least try to use services from organisations that aren’t hostile to the open web
To put that in perspective, even if they're sending empty TCP packets, "several billion" pps is 200 to 1800 gigabits of traffic, depending on what you mean by that. Add a cookieless HTTP payload and you're at many terabits per second. The average self hoster is more likely to get struck by lightning than encounter and need protection from this (even without considering the, probably modest, consequences of being offline a few hours if it does happen)
Edit: off by a factor of 60, whoops. Thanks to u/Gud for pointing that out. I stand by the conclusion though: less likely to occur than getting struck by lightning (or maybe it's around equally likely now? But somewhere in that ballpark) and the consequences of being down for a few hours are generally not catastrophic anyway. You can always still put big brother in front if this event does happen to you and your ISP can't quickly drop the abusive traffic
That does make it a bit less ludicrous even if I think the conclusion of my response still applies
This seems like slogan-based planning with no actual thought put into it.
Here's an even greater video: https://www.youtube.com/watch?v=mAUpxN-EIgU&t=4m24s
[1]: https://codeberg.org/robots.txt#:~:text=Disallow:%20/.git/,....
Robots.txt is meant for crawlers, not user agents such as a feed reader or git client
What legal teeth I would advocate would be targeted to crawlers (a subset of bot) and not include your usage. It would mandate that Big Corp crawlers (for search indexing, AI data harvesting, etc.) be registered and identify themselves in their requests. This would allow serverside tools to efficiently reject them. Failure to comply would result in fines large enough to change behavior.
Now that I write that out, if such a thing were to come to pass, and it was well received, I do worry that congress would foam at the mouth to expand it to bots more generally, Microsoft-Uncertified-Devices, etc.
If it's too loose and similar to "wanted traffic is how the authors intend the website to be accessed, unwanted traffic is anything else", that's an argument that can be used against adblocks, or in favor of very specific devices like you mention. Might even give slightly more teeth to currently-unenforceable TOS.
If it's too strict, it's probably easier to find loopholes and technicalities that just let them say "technically it doesn't match the definition of unwanted traffic".
Even if it's something balanced, I bet bigcorp lawyers will find a way to twist the definitions in their favor and set a precedent that's convenient for them.
I know this is a mini-rant rather than a helpful comment that tries to come up with a solution, it's just that I'm pessimistic because it seems the internet becomes a bit worse day by day no matter what we try to do :c
you think codeberg would sue you?
But it's the same thing with random software from a random nobody that has no license, or has a license that's not open-source: If I use those libraries or programs, do I think they would sue me? Probably not.
It's getting really, really ugly out there.
> preventing search engines from indexing incomplete versions or going the paths which really make no sense for them to go.
What will you do when the bots ignore your instructions, and send a million requests a day to these URLs from half a million different IP addresses?
Misbehaving scrapers have been a problem for years, and not just from AI. I've written posts on how to properly handle scraping, the legal grey area it puts you in, and how to be a responsible scraper. If companies don't want to be responsible, the solution isn't to abdicate an open web. It's to make better law and enforce said law.
Well, I'm glad you speak for the entire Internet.
Pack it in folks, we've solved the problem. Tomorrow, I'll give us the solution to wealth inequality (just stop fighting efforts to redistribute wealth and political power away from billionaires hoarding it), and next week, we'll finally get to resolve the old question of software patents.
So
Which legal teeth?
Dumb bots that don't respect robots.txt or nofollow are the ones trying all combinations of the filters available in your search options and requesting all pages for each such combination.
The number of search pages can easily be exponential in the number of filters you offer.
Bots walking around in these traps do it because they are dumb. But even a small degenerate bot can send more requests than 1M MAUs.
At least that's my impression of the problem we're sometimes facing.
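A quick back-of-the-envelope for the "exponential in the number of filters" point. The shop below and its filter sizes are invented purely for illustration:

```python
from math import prod


def search_page_count(options_per_filter):
    """Distinct filter combinations, counting "filter not set" as one extra choice."""
    return prod(n + 1 for n in options_per_filter)


# Ten independent on/off filters already give 2**10 combinations.
ten_toggles = search_page_count([1] * 10)

# A hypothetical shop: 8 colours, 12 sizes, 20 brands, 5 price bands, 4 sort orders.
shop = search_page_count([8, 12, 20, 5, 4])
```

That is before pagination, so every one of those combinations can fan out into many more URLs.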
Signed agents seem like a horrific solution. And maybe just serving the traffic is better.
- Legal threats are never really effective
Effective solutions are:
- Technical
- Monetary
I like the idea of web as a blockchain of content. If you want to pull some data, you have to pay for it with some kind of token. You either buy that token to consume information if you're of the leecher type, or get some by doing contributions that gain back tokens.
It's more or less the same concept as torrents back in the day.
This should be applied to emails too. The regular person sends what, 20 emails per day max? Say it costs $0.01 per mail, anyone could pay that. But if you want to spam 1,000,000 every day that becomes prohibitive.
This seems flawed.
Poor people living in 3rd world countries that make like $2.00/day wouldn't be able to afford this.
>But if you want to spam 1,000,000 every day that becomes prohibitive.
Companies and people with $ can easily pay this with no issues. If it costs $10,000 to send 1M emails that reach the inbox but you profit $50k, it's a non-issue.
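Running the numbers from both comments side by side, using only the figures given in the thread ($0.01 per mail, 20 mails a day, 1M mails, $2.00/day income, $50k). Treating the $50k as what the spammer takes in before paying for the mails is an assumption:

```python
PRICE_CENTS = 1  # $0.01 per email, the figure proposed above


def daily_cost_dollars(emails_per_day: int) -> float:
    """Cost in dollars of sending this many emails at PRICE_CENTS each."""
    return emails_per_day * PRICE_CENTS / 100


regular_user = daily_cost_dollars(20)          # twenty emails a day
bulk_sender = daily_cost_dollars(1_000_000)    # one million emails a day

# Share of a $2.00/day income that 20 emails would consume.
share_of_income = regular_user / 2.00

# What is left for a sender who takes in $50k from the 1M-email run.
spammer_net = 50_000 - bulk_sender
```

So the same price is about a tenth of a day's income for the poorest sender and about a fifth of the take for the profitable spammer.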
I would love that, and make it automated.
A single message from your IP to your router: block this traffic. That router sends it upstream, and it also blocks it. Repeat ad nauseam until source changes ASN or (if the originator is on the same ASN) reaches the router from the originator, routing table space notwithstanding. Maybe it expires after some auto-expiry -- a day or month or however long your IP lease exists. Plus, of course, a way to query what blocks I've requested and a way to unblock.
It's not the publishers who need to do the hard work, it's the multi-billion dollar investments into training these systems that need to do the hard work.
We are moving to a position whereby if you or I want to download something without compensating the publisher, that's jail time, but if it's Zuck, Bezos or Musk, they get a free pass.
That's the system that needs to change.
I should not have to defend my blog from these businesses. They should be figuring out how to pay me for the value my content adds to their business model. And if they don't want to do that, then they shouldn't get to operate that model, in the same way I don't get to build a whole set of technologies on papers published by Springer Nature without paying them.
This power imbalance is going to be temporary. These trillion-dollar market cap companies think if they just speed run it, they'll become too big, too essential, the law will bend to their fiefdom. But in the long term, it won't - history tells us that concentration of power into monarchies descends over time, and the results aren't pretty. I'm not sure I'll see the guillotine scaffolds going up in Silicon Valley or Seattle in my lifetime, but they'll go up one day unless these companies get a clue from history as to what they need to do.
I'm old enough to remember when people asked the same questions of Hotbot, Lycos, Altavista, Ask Jeeves, and -- eventually -- Google.
Then, as now, it never felt like the right way to frame the question. If you want your content freely available, make it freely available... including to the bots. If you want your content restricted, make it restricted... including to the humans.
It's also not clear to me that AI materially changes the equation, since Google has for many years tried to cut out links to the small sites anyway in favor of instant answers.
(FWIW, the big companies typically do honor robots.txt. It's everyone else that does what they please.)
Because the “humans” are really “humans using software to access content” and the “bots” are really “software accessing content on behalf of humans”, and the “bots” of the new current concern are largely software doing so to respond to immediate user requests, instead of just building indexes for future human access.
They don't use Cloudflare AFAIK.
They normally use a puzzle that the website generates, or they use a proof-of-work-based captcha. I've found proof of work good enough out of these two, and it also means that the site owner can run it themselves instead of being reliant on Cloudflare and third parties.
A paywall.
In reality, what some want is to get all the benefits of having their content on the open internet while still controlling who gets to access it. That is the root cause here.
We need micropayments going forward, Lightning (Bitcoin backend) could be the solution.
What about licenses like CC-BY-NC (Creative Commons - Non Commercial)?
Which is not everyone.
> protect their blog or content from AI training bots
It strikes me that one needs to choose one of these as their visionary future.
Specifically: a free and open web is one where read access is unfettered to humans and AI training bots alike.
So much of the friction and malfunction of the web stems from efforts to exert control over the flow (and reuse) of information. But this is in conflict with the strengths of a free and open web, chief of which is the stone cold reality that bytes can trivially be copied and distributed permissionlessly for all time.
The AI crawlers are going to get smarter at crawling, and they'll have crawled and cached everything anyway; they'll just be reading your new stuff. They should literally just buy the Internet Archive jointly, and only read everything once a week or so. But people (to protect their precious ideas) will then just try to figure out how to block the IA.
One thing I wish people would stop doing is conflating their precious ideas and their bandwidth. The bandwidth is one very serious issue, because it's a denial of service attack. But it can be easily solved. Your precious ideas? Those have to be protected by a court. And I don't actually care iff the copyright violation can go both ways; wealthy people seem to be free to steal from the poor at will, even rewarded, "normal" (upper-middle class) people can't even afford to challenge obviously fraudulent copyright claims, and the penalties are comically absurd and the direct result of corruption.
Maybe having pay-to-play justice systems that punish the accused before conviction with no compensation was a bad idea? Even if it helped you to feel safe from black people? Maybe copyright is dumb now that there aren't any printers anymore, just rent-seekers hiding bitfields?
Why would you need to?
If your inability to assemble basic HTML forces you to adopt enormous, bloated frameworks that require two full cores of a cpu to render your post…
… or if you think your online missives are a step in the road to content creator riches …
… then I suppose I see the problem.
Otherwise there’s no problem.
There are going to be bad actors taking advantage of people who cannot fight back without regulations and gatekeepers; suggesting otherwise is about as reasonable as the ancap idea of government
"Okay, that means AI companies can train on your content."
"Well, actually, we need some protections..."
"So you want a closed web with access controls?"
"No no no, I support openness! Can't we just have, like, ethical openness? Where everyone respects boundaries but there's no enforcement mechanism? Why are you making this so black and white?"
Basic tools like Anubis and fail2ban are very effective at keeping most of this evil at bay.
First off, there's no harm from well-behaved bots. Badly behaved bots that cause problems for the server are easily detected (by the problems they cause), classified, and blocked or heavily throttled.
Of course, if you mean "protect" in the sense of "keep AI companies from getting a copy" (which you may have, given that you mentioned training) - you simply can't, unless you consider "don't put it on the web" a solution.
It's impossible to make something "public, but not like that". Either you publish or you don't.
If anything, it's a legal issue (copyright/fair use), not a technical one. Technical solutions won't work.
I'm not sure why people are so confused by this. The Mastodon/AP userbase put their public content on a publicly federated protocol then lost their shit and sent me death threats when I spidered and indexed it for network-wide search.
There are upsides and downsides to publishing things you create. One of the downsides is that it will be public and accessible to everyone.
I'm routinely denied access to websites now.
enable javascript and unblock cookies to continue
I would much rather have it open for all, including companies, than the coming dystopian landscape of paywall gates. I don’t care about respecting robots.txt or any other types of rules. If it’s on the internet it’s for all to consume. The moment you start carving out certain parties is the moment it becomes a slippery slope.
If this is your primary argument against being scraped (viz that your robots.txt said not to) then you’re naive and you’re doing it wrong.
If the internet is open, then data on it is going to be scraped lol. You can’t have it both ways.
If others respected robots.txt, we would not need solutions like what Cloudflare is presenting here. Since abuse is rampant, people are looking for mitigations and this CF offering is an interesting one to consider.
On one hand these companies announce themselves as sophisticated, futuristic and highly-valued, on the other hand we see rampant incompetence, to the point that webmasters everywhere are debating the best course of action.
To be fair, an accurate measurement would need to consider how many of those CPU cycles would be spent by the human user who is driving the bot. From that perspective, maybe the scrapers can “make up for it” by crawling efficiently, i.e. avoid loading tracker scripts, images, etc unless necessary to solve the query. This way they’ll still burn CPU cycles but at least it’ll be fewer cycles than a human user with a headful browser instance.
However, what is more important to me than AI agents, is that someone might want to download single files with curl, or use browsers such as Lynx, etc, and this should work.
Gee, if only we had, like, one central archive of the internet. We could even call it the internet archive.
Then, all these AI companies could interface directly with that single entity on terms that are agreeable.
That distinction requires you to take companies which benefit from amassing as much training data as possible at their word when they pinky swear that a particular request is totally not for training, promise.
Privacy cannot exist in an environment where the host gets to decide who accesses the web page. I'm okay with rate limiting or otherwise blocking activity that creates too much of a load, but trying to prevent automated access is impossible without preventing access from real people.
There are people behind those connection requests. I don't try to guess on my server who is a bot and who is not; I'll make mistakes and probably bias against people who use uncommon setups (those needing accessibility aids or using e.g. experimental software that improves some aspect like privacy or functionality)
Sure, I have rights as a website owner. I can take the whole thing offline; I can block every 5th request; I can allow each /16 block to make 1000 requests per day; I can accept requests only from clients that have a Firefox user agent string. So long as it's equally applied to everyone and it's not based on a prohibited category such as gender or religious conviction, I am free to decide on such cuts and I'd encourage everyone to apply a policy that they believe is fair
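As an illustration of a rule that is applied identically to everyone, here is roughly what "each /16 block gets 1000 requests per day" could look like. This is a sketch assuming IPv4 and an in-memory counter; `BlockQuota` is a made-up name, and a real deployment would need shared state and an IPv6 policy:

```python
import ipaddress
from collections import Counter


class BlockQuota:
    """Allow each IPv4 /16 block a fixed number of requests per day."""

    def __init__(self, per_day=1000, prefix_len=16):
        self.per_day = per_day
        self.prefix_len = prefix_len
        self.counts = Counter()
        self.day = None

    def allow(self, ip: str, day: int) -> bool:
        """`day` is any counter that changes once per day (e.g. days since the epoch)."""
        if day != self.day:          # new day: reset every block's count
            self.day = day
            self.counts.clear()
        # strict=False masks off the host bits, so 192.0.2.1 maps to 192.0.0.0/16.
        block = ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False)
        if self.counts[block] >= self.per_day:
            return False
        self.counts[block] += 1
        return True
```

The criteria are public and the same for every client, which is the contrast being drawn with opaque scoring.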
Cloudflare and its competitors, as far as I can tell, block arbitrary subgroups of people based on secret criteria. It does not appear to be applied fairly, such as allowing everyone to make the same number of requests per unit time. I'm probably bothered even more because I happen to be among the blocked subgroup regularly (but far from all the time, just little enough to feel the pain)
Some states are more stringent with their own disability regulations or state constitutions, but no state anywhere in the U.S. has a law that says every visitor to a website has to be treated equally.
Equal protection is indeed not the same as equal treatment. No, it really does say that everyone shall be treated equally so long as the circumstances are equal (gelijke behandeling in gelijke gevallen)
There's a whole spectrum of gatekeeping on communications with users, from static sites that broadcast their information to anyone, and stores that let you order without even making an account, to organizations that require you to install local software to even access data and perform transactions. The latter means 90%+ of your users will hate you for it, and half will walk away, but it's still very common, collectively costing businesses that do so billions of dollars a year. (https://www.forbes.com/sites/johnkoetsier/2021/02/15/91-of-u... to-install-apps-to-do-business-costing-brands-billions/)
When companies get big enough to have entire departments devoted to tasks, those departments will follow the fads that bring them the most prestige, at the cost of the rest of the company. Eventually the company will lose out to newer, more efficient businesses that forgo fads in favor of serving customers, and the cycle continues.
I'm just pointing out how a new fad is hurting businesses, but by no means wish to limit their ability to do so. They just won't be getting my business, nor business from a quickly growing cohort that desires anonymity, or even requires it to get around growing local censorship.
If you want the best of both worlds, i.e. just post freely but make money from ads, or inserting hidden pixels to update some profile about me, well good luck. I'll choose whether I want to look at ads, or load tracking pixels, and my answer is no.
> my answer is no.
Rights for me, but not for thee?
Does this only apply to "information" or should we treat all open source code as public domain?
Free Software is the only place where this is a real abridgement of rights and intention, and it's over. They've already been trained on all of it, and no judge will tell them to stop, and no congressman will tell them to stop.
However, I do believe the host can do whatever they want with my request also.
This issue becomes more complex when you start talking about government sites, since ideally they have a much stronger mandate to serve everyone fairly.
They were working on an idea that looked a bit like an RSS feed for an entire website, where you would run your own spider and then our search engine could hit an endpoint to get a delta instead of having to scan your entire site.
If they’d made the protocol open instead of proprietary, we maybe could have gotten spiders to play nicer since each spider after the first would be cheaper, and eventually maybe someone could build pub sub hooks into common web frameworks to potentially skip the scan entirely for read-mostly websites, generating delta data when your data changed.
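The proprietary protocol isn't described here, so the following is only a guess at the general shape of such a delta endpoint: the site keeps a numbered change log, and a spider sends the last sequence number it saw and gets back only what changed since. All names (`ChangeLog`, `Change`, `delta`) are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    seq: int        # monotonically increasing change number
    url: str
    deleted: bool = False


class ChangeLog:
    """Site-side log that a spider can poll instead of recrawling everything."""

    def __init__(self):
        self._changes = []

    def record(self, url: str, deleted: bool = False) -> int:
        """Append a change (called whenever the site's data changes)."""
        seq = len(self._changes) + 1
        self._changes.append(Change(seq, url, deleted))
        return seq

    def delta(self, since: int = 0):
        """Latest change per URL with seq > since, plus the cursor to send next time."""
        latest = {}
        for change in self._changes[since:]:
            latest[change.url] = change      # later entries overwrite earlier ones
        cursor = len(self._changes)
        return sorted(latest.values(), key=lambda c: c.seq), cursor
```

Calling `record()` from the write path is the "pub sub hook" idea: for a read-mostly site the delta is tiny and the scan disappears.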
But of course when the next round of funding came due nobody was buying.
I thought about this a lot on my last project, where spiders were our customers’ biggest users. One of those apps where customer interactions were intense but brief and the rank in Google mattered equally with all other concerns. Nobody had architected for the actual read/write workflow of the system of course, and that company sold to a competitor after I left. Who migrated all customers to their solution and EOLed ours for being too fat in a down economy.
Centralization bad yada yada. But if Cloudflare can get most major AI players to participate, then convince the major CDNs to also participate.... ipso facto columbo oreo....standard.
You can see it in the web vs mobile apps.
Many people may not see a problem with walled gardens, but the reality is that we have much less innovation in mobile than on the web, because anyone can spawn a web server, while publishing an app in the App Store (Apple) is gated.
You can use the admin certificate issued to you, to issue a certificate to the agent which will contain an extension limiting what it can be used for (and might also expire in a few hours, and also might be revoked later). This certificate can be used to issue an even more restricted certificate to sub-agents.
This is already possible (and would be better than the "fine-grained personal access tokens" that GitHub uses), but does not seem to be commonly implemented. It also improves security in other ways.
So, it can be done in such a way that Cloudflare does not need to issue authorization to you, or necessarily to be involved at all. Google does not need to be involved either.
However, that is only for things where you should normally require authorization anyway. Reading public data is not something that should require authorization. The problem there is excessive scraping (there seem to be too many LLM scrapers and others doing far too much of it) and excessive blocking (e.g. someone using a different web browser, or curl to download one file, or even someone using a common browser and configuration when something strange and unexpected happens, etc). The above is unrelated to that, so certificates and the like do not help, because they solve a different problem.
An allowlist run by one company that site owners chose to engage with. But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts…
The discourse around this is a little wild and I'm glad you said this. The allowlist is a Cloudflare feature and their customers are free to use it. The core functionality involving HTTP Message Signatures is decentralized and open, so anyone can adopt it and benefit.
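For context on what "HTTP Message Signatures" means mechanically: the signer picks some request components, serializes them into a "signature base", signs that, and ships the result in `Signature-Input` and `Signature` headers. The sketch below loosely follows the layout in RFC 9421 as I understand it and is not a conformant implementation. HMAC-SHA256 is swapped in only because it is in Python's standard library; a setup where the verifier must not be able to forge signatures needs an asymmetric algorithm. The key, key id and component values are placeholders:

```python
import base64
import hashlib
import hmac


def signature_params(components, created, keyid):
    """Serialize the covered components and parameters (the "@signature-params" value)."""
    covered = " ".join(f'"{name}"' for name in components)
    return f'({covered});created={created};keyid="{keyid}";alg="hmac-sha256"'


def signature_base(values, components, params):
    """One line per covered component, then the "@signature-params" line."""
    lines = [f'"{name}": {values[name]}' for name in components]
    lines.append(f'"@signature-params": {params}')
    return "\n".join(lines)


def sign(values, components, created, keyid, key, label="sig1"):
    """Return the Signature-Input and Signature header values for a request."""
    params = signature_params(components, created, keyid)
    base = signature_base(values, components, params)
    mac = hmac.new(key, base.encode(), hashlib.sha256).digest()
    return {
        "Signature-Input": f"{label}={params}",
        "Signature": f"{label}=:{base64.b64encode(mac).decode()}:",
    }


def verify_signature(values, headers, key):
    """Rebuild the base from the received Signature-Input and compare MACs."""
    params = headers["Signature-Input"].split("=", 1)[1]
    inner = params[1:params.index(")")]
    components = [c.strip('"') for c in inner.split()]
    base = signature_base(values, components, params)
    expected = hmac.new(key, base.encode(), hashlib.sha256).digest()
    received = base64.b64decode(headers["Signature"].split("=", 1)[1].strip(":"))
    return hmac.compare_digest(expected, received)
```

Nothing in this mechanism needs a central party; the allowlist of which keys to trust is a separate layer on top.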
Exactly, no problem with that, just hinting that's not a protocol.
> But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts
Wait, what?
I was referring to the following image:
https://substackcdn.com/image/fetch/$s_!zRK-!,w_1250,h_703,c...
also: “Cloudelare” ;-P
>But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts…
"But you participate in society!"
(What are some good alternatives to Cloudflare?)
Another way the situation is similar: email delivery is often unreliable and hard to implement due to spam filters. A similar thing seems to be happening to the web.
Not to mention the big cloud providers are unhinged with their egress pricing.
>We need protocols, not gatekeepers.
But until we have working protocols, many webmasters literally do need a gatekeeper if they want to realistically keep their site safe and online.
I wish this weren't the case, but I believe the "protocol" era of the web was basically ended when proprietary web 2.0 platforms emerged that explicitly locked users in with non-open protocols. Facebook doesn't want you to use Messenger in an open client next to AIM, MSN, and IRC. And the bad guys won.
But like I said, I hope I'm wrong.
At some point soon, if not now, assume everything is generated by AI unless proven otherwise using a decentralized ID.
Likewise, on the server side, assume it’s a bot unless proven otherwise using a decentralized ID.
We can still have anonymity using decentralized IDs. An identity can be an anonymous identity, it’s not all (verified by some central official party) or nothing.
It comes down to different levels of trust.
Decoupling identity and trust is the next step.
Why law enforcement doesn't do their job, resulting in people not bothering to report things anymore, is imo the real issue here. Third party identification services to replace a failing government branch is pretty ugly as a workaround, but perhaps less ugly than the commercial gatekeepers popping up today
https://www.w3.org/TR/did-1.1/
Just because you can, doesn't mean you should and I don't feel any one entity (private or public) should be an arbiter on these matters.
This is something that can, and should, be negotiated at the "last virtual mile".
Cloudflare seems very vocal about its desire to become yet another digital gatekeeper as of late, and so is Google. I want both reduced to rubble if they persist in it.
I do fear the actions of the current bot landscape are going to lead to almost everything going behind auth walls though, and perhaps even paid auth walls.
One of the practical problems I saw was bootstrapping: how do you convince any website owner to use it when very few people are on the system? Where should they find someone to get invites from?
As for tracking (auth walls), the website need not know who you are. They just see random tokens with signatures and can verify the signature. If there's abuse, they send evidence to the tree system, where it could be handled similarly to HN: lots of flags from different systems will make an automated system kick in, but otherwise a person looks at the issue and decides whether to issue a warning or timeout. (Of course, the abuse reporting mechanism can also be abused so, again similar to HN, if you abuse the abuse mechanism then you don't count towards future reports.)
Ideally, we'd not need this and let real judges do the job of convicting people of abuse and computer fraud, but until such time, I'd rather use the internet anonymously with whatever setup I like than face blocks regularly while doing nothing wrong
one potential solution: https://www.l402.org/
One of the hard parts in this space is what level of transparency should you have. We're advancing the thesis that behavioral biometrics offers robust continuous authentication that helps with bot/human and good/bad, but people are obviously skeptical to trust black-box models for accuracy and/or privacy reasons.
We've defaulted to a lot of transparency in terms of publishing research online (and hopefully in scientific journals), but we've seen the downside: competitors make fake claims about their own supposedly best in-house behavioral tools, which stay behind their company walls, and investors constantly worry about an arms race.
As someone genuinely interested (and incentivized!) to build a great solution in this space, what are good protocols/examples to follow?
and isn't this why people sign up with Cloudflare in the first place? for bot protection? to me, this is just the same, but with agents.
i love the idea of an open internet, but this requires all parties to be honest. a company like Perplexity that fakes their user-agent to get around blocks disrespects that idea.
my attitude towards agents is positive. if a user used an LLM to access my websites and web apps, i'm all for it. but the LLM providers must disclose who they are - that they are OpenAI, Google, Meta, or the snake oil company Perplexity
https://webaim.org/blog/user-agent-string-history/
TLDR the UA string has always been "faked", even in the scenarios you might think are most legitimate.
Good thing they are not the only place to post!
https://anchorbrowser.io/blog/page-load-reliability-on-the-t...
Here's to working together to develop a new protocol that works for agents and website owners alike.
bots vs humans:
humans are trying to buy tickets that were sold out to a bot
data scraping:
you index my data (real estate listings) not to route traffic to my site as people search for my product, as a search engine would, but to become my competitor.
spam (and scam): digital pollution, or even worse, trying to input credit card, gift cards, passwords, etc.
(obviously there are more, most which will fall into those categories, but those are the main ones)
now, in the human assisted AI, the first issue is no longer an issue, since it is obvious that each of us, the internet users, will soon have an agent built into our browser. so we will all have the speedy automated select, click and checkout at our disposal.
Prior to the LLM era, there were search engines and academic research on the right side of the internet bots map, and scrapers and worse on the wrong side. but now we have legitimate human users extending their interaction with an LLM agent, and on top of that, we have new AI companies, larger and smaller, which thirst for data in order to train their models.
Cloudflare is simply trying to make sense of this, whilst keeping their bot protection relevant.
I do not appreciate the post content whatsoever, since it lacks consistency and maturity (a true understanding of how the internet works, rather than a naive one).
when you talk about "the internet", what exactly are you referring to? a blog? a bank account management app? a retail website? social media?
those are all part of the internet and each is a completely different type of operation.
EDIT:
I've written a few words about this back in January [1] and in fact suggested something similar:
https://blog.tarab.ai/p/bot-management-reimagined-in-the
Also, cheaply rate limiting malicious web clients should be something that is trivial to accomplish with competent web tooling (i.e., on your own servers). If this seems out of scope or infeasible, you might be using the wrong tools for the job.
1) hard block without having done any requests yet. No clue why. Same browser (Burp's built-in Chromium), same clean state, same IP address, but one person got a captcha and the other one didn't. It would just say "reload the page to try again" forever. This person simply couldn't use the site at all; not sure if that would happen if you're on any other browser, but since it allowed the other Burp Suite browser, that doesn't seem to be the trigger for this perma-ban. (The workaround was to clone the cookie state from the other consultant, but normal users won't have that option.)
2) captcha. I got so many captchas, like every 4th request. It broke the website (async functionality) constantly. At some point I wanted to try a number of passwords for an admin username that we had found and, to my surprise, it allowed hundreds of requests without captcha. It blocks humans more than this automated bot...
3) "this website is under construction" would sometimes appear. Similar to situation#1, but it seemed to be for specific requests rather than specific persons. Inputting the value "1e9" was fine, "1e999" also fine, but "1e99" got blocked, but only on one specific page (entering it on a different page was fine). Weird stuff. If it doesn't like whatever text you wrote on a support form, I guess you're just out of luck. There's no captcha or anything you can do about it (since it's pretending the website isn't online at all). Not sure if this was AWS or the customer's own wonky mod_security variant
I dread to think if I were a customer of this place and I urgently needed them (it's not a regular webshop but something you might need in a pinch) and the only thing it ever gives me is "please reload the page to try again". Try what again?? Give me a human to talk to, any number to dial!
Their pricing page says:
No-nonsense Free Tier
As part of the AWS free Usage Tier you can get started with Amazon CloudFront for free.
Included in Always Free Tier
1 TB of data transfer out to the internet per month
10,000,000 HTTP or HTTPS Requests per month
2,000,000 CloudFront Function invocations per month
2,000,000 CloudFront KeyValueStore reads per month
10 Distribution Tenants
Free SSL certificates
No limitations, all features available
I don't care about any of those fancy serverless services. I am just talking about the cheapest CDN.
> 1 TB of data
Someone can rent a 1Gbps server for cheap (under $50 on OVH) and pull 330TB in a month from your site. That's about $30k of egress on AWS if you don't do anything to stop it.
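As a back-of-the-envelope check of those figures, here is a small sketch. The flat $0.09 per GB rate is an assumption used only for illustration (real cloud egress pricing is tiered and varies by service and region), and the month is taken as 30 days with decimal units.

```python
LINK_GBPS = 1.0              # saturated 1 Gbps link
SECONDS_PER_DAY = 86_400
DAYS = 30                    # assumed month length
ASSUMED_USD_PER_GB = 0.09    # assumption: flat illustrative egress rate


def monthly_transfer_tb(gbps: float, days: int = DAYS) -> float:
    """Decimal terabytes moved by a link running flat out for `days` days."""
    bytes_per_second = gbps * 1e9 / 8
    return bytes_per_second * SECONDS_PER_DAY * days / 1e12


def egress_cost_usd(tb: float, usd_per_gb: float = ASSUMED_USD_PER_GB) -> float:
    """Cost of `tb` decimal terabytes at a flat per-GB rate."""
    return tb * 1000 * usd_per_gb


transfer_tb = monthly_transfer_tb(LINK_GBPS)
cost_usd = egress_cost_usd(transfer_tb)
print(f"{transfer_tb:.0f} TB per month, about ${cost_usd:,.0f} at the assumed rate")
```

Under these assumptions the result is roughly 324 TB and a bit over $29,000, which is consistent with the "330TB" and "about $30k" quoted above.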
It's hard to assess the validity of this versus Cloudflare having a really good marketing department.
I've used neither, so I can't say, but I've also never seen anyone truly explain why/why-not.
That would be fine, you could walk away and go home, but if you're going to drive on their digital highways, you're going to need "insurance" just to protect you from everyone else.
Ongoing multi-nation WWIII-scale hacking and infiltration campaigns of infrastructure, AI bot crawling, search company and startup crawling, security researchers crawling, and maybe somebody doesn't like your blog and decides to rent a botnet for a week or so.
Bet your ISP shuts you off before then to protect themselves. (Happens all the time via BGP blackholing, DDoS scrubbing services, BGP FlowSpec, etc).
It's not just computers anymore. Web-enabled CCTV and doorbell cameras are all culprits.
Basically MS tried to kill the web with their Win95 release, the infamous Internet Explorer and their shitty IIS/Frontpage tandem.
I deeply hate them since that day.
If this was cloudflare going into some centralized routing of the internet and saying everything must do X then that would be a lot more alarming but at the end of the day the internet is decentralized and site owners are the ones who are using this capability.
Additionally, I don't think that I as an individual website owner would actually want to, or be capable of, knowing which agents are good and bad, and Cloudflare doing this would be helpful to me as a site owner as long as they act in good faith. And the moment they stop acting in good faith I would be able to disable them. This is definitely a problem right now, as unrestricted access for bots means bad bots are taking up many cycles, raising costs and taking away resources from real users.
It’s somewhat ironic to let fly the “free and open internet” battle cry on behalf of an industry that is openly destroying it.
Perhaps a way to serve ads through the agents would be good enough. I'd prefer that to be some open protocol than controlled by a company.
https://blog.cloudflare.com/perplexity-is-using-stealth-unde...
This isn't the problem Cloudflare are trying to solve here. AI scraping bots are a trigger for them to discuss this, but this is actually just one instance of a much larger problem — one that Cloudflare have been trying to solve for a while now, and which ~all other cloud providers have been ignoring.
My company runs a public data API. For QoS, we need to do things like blocking / rate-limiting traffic on a per-customer basis.
This is usually easy enough — people send an API key with their request, and we can block or rate-limit on those.
But some malicious (or misconfigured) systems may sometimes just start blasting requests at our API without including an API key.
We usually just want to block these systems "at the edge" — there's no point to even letting those requests hit our infra. But to do that, without affecting any of our legitimate users, we need to have some key by which to recognize these systems, and differentiate them from legitimate traffic.
In the case where they're not sending an API key, that distinguishing key is normally the request's IP address / IP range / ASN.
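The keying scheme described so far can be sketched as a token-bucket limiter that buckets by API key when one is present and falls back to the client IP otherwise. This is a minimal illustration, not the commenter's actual implementation: the class names, rates, and burst size are all hypothetical, and a real edge deployment would also consider IP ranges and ASNs.

```python
from typing import Dict, Optional


class TokenBucket:
    """Classic token bucket: `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed, then try to spend one token.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Limiter:
    """Rate-limits per API key, or per client IP when no key is sent."""

    def __init__(self, keyed_rate: float, anon_rate: float, burst: float) -> None:
        self.keyed_rate = keyed_rate
        self.anon_rate = anon_rate
        self.burst = burst
        self.buckets: Dict[str, TokenBucket] = {}

    @staticmethod
    def bucket_key(api_key: Optional[str], client_ip: str) -> str:
        return f"key:{api_key}" if api_key else f"ip:{client_ip}"

    def allow(self, api_key: Optional[str], client_ip: str, now: float) -> bool:
        key = self.bucket_key(api_key, client_ip)
        bucket = self.buckets.get(key)
        if bucket is None:
            rate = self.keyed_rate if api_key else self.anon_rate
            bucket = self.buckets[key] = TokenBucket(rate, self.burst, now)
        return bucket.allow(now)


# Three keyless requests from one address at the same instant, burst of 2.
limiter = Limiter(keyed_rate=10.0, anon_rate=1.0, burst=2.0)
anon_results = [limiter.allow(None, "198.51.100.7", now=0.0) for _ in range(3)]
# A customer with a key on the same address has a separate bucket.
keyed_ok = limiter.allow("customer-a", "198.51.100.7", now=0.0)
```

Because the keyed customer gets a separate bucket, throttling the keyless traffic does not affect them, even when both arrive from the same address.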
The problematic exception, then, is Workers/Lambda-type systems (a.k.a. Function-as-a-Service [FaaS] providers) — where all workloads of all users of these systems come from the same pool of shared IP addresses.
---
And, to interrupt myself for a moment, in case the analogy isn't clear: centralized LLM-service web-browsing/tool-use backends, and centralized "agent" orchestrators, are both effectively just FaaS systems, in terms of how the web/MCP requests they originate, relate to their direct inbound customers and/or registered "agent" workloads.
Every problem of bucketing traditional FaaS outbound traffic, also applies to FaaSes where the "function" in question happens to be an LLM inference process.
"Agents" have made this concern more urgent/salient to increasingly-smaller parts of the ecosystem, who weren't previously considering themselves to be "data API providers." But you can actually forget about AI, and focus on just solving the problem for the more-general category of FaaS hosts — and any solution you come up with, will also be a solution applicable to the "agent formulation" of the problem.
---
Back to the problem itself:
The naive approach would be to block the entire FaaS's IP range the first time we see an attack coming from it. (And maybe some API providers can get away with that.)
But as long as we have at least one legitimate customer whose infrastructure has been designed around legitimate use of that FaaS to send requests to us, then we can't just block that entire FaaS's IP range.
(And sure, we could block these IP ranges by default, and then try to get such FaaS-using customers to send some additional distinguishing header in their requests to us, that would take priority over the FaaS-IP-range block... but getting a client engineer to implement an implementation-level change to their stack, by describing the needed change in a support ticket as a resolution to their problem, is often an extreme uphill battle. Better to find a way around needing to do it.)
So we really want/need some non-customer-controlled request metadata to match on, to block these bad FaaS workloads. Ideally, metadata that comes from the FaaS itself.
As it turns out, CF Workers itself already provides such a signal. Each outbound subrequest from a Worker gets forcibly annotated "on the way out" with a request header naming the Worker it came from. We can block on / rate-limit by this header. Works great!
But other FaaS providers do not provide anything similar. For example, it's currently impossible to determine which AWS Lambda customer is making requests to our API, unless that customer specifically deigns to attach some identifying info to their requests. (I actually reported this as a security bug to the Lambda team, over three years ago now.)
---
So, the point of an infrastructure-level-enforced public-visible workload-identity system, like what CF is proposing for their "signed agents", isn't just about being able to whitelist "good bots."
It's also about having some differentiable key that can cleanly bucket bot traffic, where any given bucket then contains purely legitimate or purely malicious/misbehaving bot traffic; so that if you set up rate-limiting, greylisting, or heuristic blocking by this distinguishing key, then the heuristic you use will ensure that your legitimate (bot) users never get punished, while your misbehaving/malicious (bot) users automatically trip the heuristic. Which means you never need to actually hunt through logs and manually blacklist specific malicious/misbehaving (bot) users.
If you look at this proposal as an extension/enhancement of what CF has already been doing for years with Workers subrequest originating-identity annotation, the additional thing that the "signed agents" would give the ecosystem on behalf of an adopting FaaS, is an assurance that random other bots not running on one of these FaaS platforms, can't masquerade as your bot (in order to take advantage of your preferential rate-limiting tier; or round-robin your and many others' identities to avoid such rate-limiting; or even to DoS-attack you by flooding requests that end up attributed to you.) Which is nice, certainly. It means that you don't have to first check that the traffic you're looking at originated from one of the trustworthy FaaS providers, before checking / trusting the workload-identity request header as a distinguishing key.
But in the end, that's a minor gain, compared to just having any standard at all — that other FaaSes would sign on to support — that would require them to emit a workload-identity header on outbound requests. The rest can be handled just by consuming+parsing the published IP-ranges JSON files from FaaS providers (something our API backend already does for CF in particular.)
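A rough sketch of how a consuming API might pick that distinguishing key: prefer the API key, then a workload-identity header, but trust the header only when the source address falls inside a published FaaS range. The header name `x-workload-id` and the address range are placeholders invented for this example (the range is a reserved documentation block), not any provider's real values.

```python
import ipaddress
from typing import Mapping, Optional, Sequence, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Hypothetical header name; a real provider would define its own.
WORKLOAD_HEADER = "x-workload-id"

# Placeholder for ranges parsed from a provider's published IP-ranges file.
FAAS_RANGES: Sequence[Network] = [ipaddress.ip_network("203.0.113.0/24")]


def from_faas(client_ip: str, ranges: Sequence[Network] = FAAS_RANGES) -> bool:
    """True if the source address belongs to a known FaaS egress range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ranges)


def distinguishing_key(headers: Mapping[str, str], client_ip: str,
                       api_key: Optional[str] = None) -> str:
    """Choose the bucket used for rate limiting or blocking.

    `headers` is assumed to have lowercase keys.
    """
    if api_key:
        return f"key:{api_key}"
    workload = headers.get(WORKLOAD_HEADER)
    # Anyone can set a header, so only believe it for traffic that
    # demonstrably left the FaaS provider's own network.
    if workload and from_faas(client_ip):
        return f"workload:{workload}"
    return f"ip:{client_ip}"


faas_key = distinguishing_key({"x-workload-id": "tenant-42"}, "203.0.113.9")
spoofed_key = distinguishing_key({"x-workload-id": "tenant-42"}, "198.51.100.7")
```

With cryptographically signed identities, the IP-range check would no longer be needed to trust the header, which is the "minor gain" the paragraph above describes.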
I'm sorry, but the "agents" of "agentic AI" are completely different from what the World-Wide Web was originally built to support, which was user agents. User agents are used directly by users—aka browsers. API access came later, but even then it was often directed by user activity…and otherwise quite normally rate-limited or paywalled.
The idea that now every web server must comply with servicing an insane number of automated bots doing god-knows-what without users even understanding what's happening a lot of the time, or without the consent of content owners to have all their IP scraped into massive training datasets is, well, asinine.
That's not the web we built, that's not the web we signed up for; and yes, we will take drastic measures to block your ass.
The real question is whether there is more business opportunity in supporting "unsigned" agents than signed ones. My hope is that the industry rejects this because there's more money to be made in catering to agents than blocking them. This move is mostly to create a moat for legacy business.
Also, if agents do become the de-facto way of browsing the internet, I'm not a fan of more ways of being tracked for ads and more ways for censorship groups to have leverage.
But the author is making a strawman argument rather than a "steelman" argument against signed agents. The strongest argument I can see is not that we don't need gatekeepers, but that regulation is anti-business.
Web Bot Auth
https://news.ycombinator.com/item?id=45055452
and associated blog post:
The age of agents: cryptographically recognizing agent traffic
https://blog.cloudflare.com/signed-agents/
Joking aside, I think the ideas and substance are great and sorely needed. However, I can only see the idea of a sort of token chain verification as running into the same UX problems that plagued (plagues?) PGP and more encryption-focused processes. The workflow is too opaque, requires too much specialized knowledge that is out of reach for most people. It would have to be wrapped up into something stupid simple like an iOS FaceID modal to have any hope of succeeding with the general public. I think that's the idea, that these agents would be working on behalf of their owners on their own devices, so it has to be absolutely seamless.
Otherwise, rock on.
Off topic but are people ever going to give up on this nonsense? It's so grating.
/shrug. FoxReplace [1] is a simple way to remove compelled speech from the internet for the client. People do not even realize when they say "Fart", I see "Fart". No idea if Chrome has an addon like this.
[1] - https://addons.mozilla.org/en-US/firefox/addon/foxreplace/
The problem with changing whitelist to "allowlist" is that it implies that people who use whitelist are racists. You're not just virtue signaling (and confusing my spellchecker) but causing discord.
It would be perfectly fine if people switched to "allowlist" because they think it's a better term, but that's not the reason. They do it because they want to virtue signal or they're afraid of their peers (because they'll be called racists).
Using "allowlist" is actually bad because it gives agitators power and they keep changing more words to get more power.
The reasons that they usually actually have are not very good though, like you say, but nevertheless sometimes it can result in something better and sometimes not. But banning words is not the solution.
It's weird that people will claim that "politics" have no place in software while insisting that there is one and only one term "normal" people should use because the politics of the people who object to it are bad and wrong.
Whitelist means that anything explicitly listed (in the "whitelist" or "allow list") is allowed (or included, etc) and other stuff is disallowed (or excluded) by default (although in some cases, a program (or something else) might ask instead of forcibly blocking access). It is a compound word; you should not use a space or hyphen. (Using two words "white list" may be appropriate when you are referring to colours, e.g. the white list includes the list of whatever documents are to be copied on white paper, or "white list" might mean the list that is printed on white paper.)
Allow list (I do not like the compound word; I think they should be separated and it looks better that way) is the list of what is allowed. (So, normally, this would mean that other stuff is not allowed, so it is still whitelisting.)
In situations where colours would be involved and using words such as "whitelist" would be confusing, such words should be avoided, in order to avoid confusion.
That is a very small part of the world we're entering.
The other vast majority of use cases will come from even more abusive bots than we have today, filling the internet with spam, disinformation, and garbage. The dead internet is no longer a theory, and the future we're building will make the internet for bots, by bots. Humans will retreat into niche corners of it, and those who wish to participate in the broader internet will either have to live with this, or abide by new government regulations that invade their privacy and undermine their security.
So, yes, confirming human identity is the only path forward if we want to make the internet usable by humans, but I do agree that the ideal solution will not come from a single company, or a single government, for that matter. It will be a bumpy ride until we figure this out.