I actually don't understand how Anubis is supposed to "make sure you're not a bot". It seems to be more of a rate limiter than anything else. It self-describes:
> Anubis sits in the background and weighs the risk of incoming requests. If it asks a client to complete a challenge, no user interaction is required.
> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums. Anubis has a customizable difficulty for this proof-of-work challenge, but defaults to 5 leading zeroes.
When I go to Codeberg or any other site using it, I'm never asked to perform any kind of in-browser task. It just has my browser run some JavaScript to do that calculation, or uses a signed JWT to let me have that process cached.
Why shouldn't an automated agent be able to deal with that just as easily, by just feeding that JavaScript to its own interpreter?
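For illustration, a brute-force solver along these lines is only a few lines of Python (the challenge string, nonce scheme, and difficulty here are placeholders, not Anubis's exact wire protocol):

    import hashlib
    import itertools

    def solve_challenge(challenge: str, difficulty: int = 5) -> int:
        # Search for a nonce such that SHA-256(challenge + nonce) starts with
        # `difficulty` hex zeroes -- the same work the in-browser JS performs.
        target = "0" * difficulty
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
            if digest.startswith(target):
                return nonce

    # e.g. nonce = solve_challenge("placeholder-challenge-token"); a scraper
    # would submit the nonce once, then reuse the signed cookie it gets back.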
yabones · 50m ago
My understanding is that it just increases the "expense" of mass crawling just enough to put it out of reach. If it costs fractional pennies per page to scrape with just a Python or Go bot, it costs nickels and dimes to run a headless Chromium instance to do the same thing. The purpose is economic: make it too expensive to scrape the "open web". Whether it achieves that goal is another thing.
blibble · 28m ago
what do AI companies have more than everyone else? compute
anubis directly incentivises the adversary, at the expense of everyone else
it's what you would deploy if you want to exclude everyone else
(conspiracy theorists note that the author worked for an AI firm)
dathinab · 48m ago
it's indeed not a "bot/crawler protection"
it's a "I don't want my server to be _overrun_ by crawlers" protection which works by
- taking advantage that many crawlers are made very badly/cheaply
- increasing the cost of crawling
that's it: simple, but good enough to shake off the dumbest crawlers and to make it worth it for AI agents to e.g. cache site crawls so that they don't crawl your site a 1000 times a day but instead just once
homebrewer · 57m ago
I think the only requests it was able to block are plain http requests made over curl or Go's stdlib http client. I see enough of both in httpd logs. Now the cancer has adapted by using a fully featured headless web browser that can complete challenges just like any other client.
As other commenters say, it was completely predictable from the start.
No comments yet
joe_the_user · 59m ago
Near as I can guess, the idea is that the code is optimized for what browsers can do and gpus/servers/crawlers/etc can't do as easily (or relatively as easily; just taking up the whole server for a bit might be a big cost). Indeed, it seems like it was only a matter of time before something like that would be broken.
xena · 49m ago
I just found out about this when it came to the front page of Hacker News. I really wish I was given advance notice. I haven't been able to put as much energy into Anubis as I've wanted because I've been incredibly overwhelmed by life and need to be able to afford to make this my full time job. Support contracts are being roadblocked, and I just wish I had the time and energy to focus on this without having to worry about being the single income for the household.
electroly · 1h ago
Presumably they just finally decided they were willing to spend ($) the CPU time to pass the Anubis check. That was always my understanding of Anubis--of course a bot can pass it, it's just going to cost them a bunch of CPU time (and therefore money) to do it.
zelphirkalt · 54m ago
I think so too. Maybe the compute cost needs to be upped some more. I am OK with waiting a bit longer when I access the site.
logicprog · 1h ago
I'm not anti-the-tech-behind-AI, but this behavior is just awful, and makes the world worse for everyone. I wish AI companies would instead, I don't know, fund common crawl or something so that they can have a single organization and set of bots collecting all the training data they need and then share it, instead of having a bunch of different AI companies doing duplicated work and resulting in a swath of duplicated requests. Also, I don't understand why they have to make so many requests so often. Why wouldn't like one crawl of each site a day, at a reasonable rate, be enough? It's not like up to the minute info is actually important since LLM training cutoffs are always out of date anyway. I don't get it.
oortoo · 46m ago
The time to regulate tech was like 15 years ago, and we didn't. Why would any tech company expect to have to start following "rules" now?
logicprog · 22m ago
Yeah, personally I don't think we can regulate this problem away. Whatever regulations get made will either be technically impossible, nonsensical products of people who don't understand what they're regulating and produce worse side effects (@simonw extracted a great quote from a recent Doctorow post on this: https://simonwillison.net/2025/Aug/14/cory-doctorow/), or will just increase regulatory capture and corporate-state ties, or even further entrench corporate interests, because the big corps are the ones with the economic and lobbying power.
barbazoo · 1h ago
Greed. It's never enough money, never enough data, we must have everything all the time and instantly. It's also human nature it seems, looking at how we consume like there's no tomorrow.
logicprog · 1h ago
Which is why internalizing externalities is so important, but that's also extremely hard to do right (leads to a lot of "nerd harder" problems).
msgodel · 56m ago
It doesn't even make sense to crawl this way. It's just destructive for almost no benefit.
barbazoo · 47m ago
Maybe they assume there'll be only one winner and think, "what if this gives me an edge over the others". And money is no object. Imagine if they cared about "the web".
logicprog · 24m ago
That's what's annoying and confusing about it to me.
nektro · 48m ago
if those companies cared about acting in good faith, they wouldn't be in AI
thewebguyd · 46m ago
> fund common crawl or something so that they can have a single organization and set of bots collecting all the training data they need and then share it
That, or, they could just respect robots.txt and we could put enforcement penalties for not respecting the web service's request to not be crawled. Granted, we probably need a new standard but all these AI companies are just shitting all over the web, being disrespectful of site owners because who's going to stop them? We need laws.
logicprog · 25m ago
> That, or, they could just respect robots.txt
IMO, if digital information is posted publicly online, it's fair game to be crawled unless that crawl is unreasonably expensive or takes it down for others, because these are non rivalrous resources that are literally already public.
> we could put enforcement penalties for not respecting the web service's request to not be crawled... We need laws.
How would that be enforceable? A central government agency watching network traffic? A means of appealing to a bureaucracy like the FCC? Setting it up so you can sue companies that do it? All of those seem like bad options to me.
thewebguyd · 11m ago
> IMO, if digital information is posted publicly online, it's fair game to be crawled unless that crawl is unreasonably expensive or takes it down for others, because these are non rivalrous resources that are literally already public.
I disagree. Whether or not content should be available to be crawled is dependent on the content's license, and what the site owner specifies in robots.txt (or, in the case of user submitted content, whatever the site's ToS allows)
It should be wholly possible to publish a site intended for human consumption only.
> How would that be enforceable?
Making robots.txt or something else a legal standard instead of a voluntary one. Make it easy for site owners to report violations along with logs, legal action taken against the violators.
superkuh · 43m ago
This isn't AI. This is corporations doing things because they have a profit motive. The issue here is the non-human corporations and their complete lack of accountability even if someone brings legal charges against them. Their structure is designed to abstract away responsibility and they behave that way.
Same old problem. Corps are gonna corp.
logicprog · 28m ago
Yeah, that's why I said I'm not against AI as a technology, but against the behavior of the corporations currently building it. What confuses me (not really confuses, I understand it's just negligence and not giving a fuck; more a helpless frustration at not being able to get into the mindset) is that while there isn't a profit motive against doing this (obviously), there also isn't clearly a profit motive to do it: they're wasting their own resources on unnecessarily frequent data collection, and it would be cheaper to pool data collection efforts.
rpcope1 · 1h ago
I'm calling it now: this is the beginning of all of the remaining non-commercial properties on the web either going away or getting hidden inside some trusted overlay network. Unless the "AI" race slows down or changes, or some other act of god happens, the incentives are aligned such that I foresee wide swaths of the net getting flogged to death.
homebrewer · 50m ago
Also increasing balkanization of the internet. I now routinely run into sites that geoblock my whole country; this wasn't something I would see more than once or twice a year, and usually only with sites like Walmart that don't care about clients from outside the US.
Now it's 2-5 sites per day, including web forums and such.
bananalychee · 40m ago
If you live in Europe it probably has more to do with over-regulation than anything AI-related.
bananalychee · 44m ago
I self-host a few servers and have not seen significant traffic increases from crawlers, so I can't agree with that without seeing some evidence of this issue's scale and scope. As far as I know it mostly affects commercial content aggregators.
_ikke_ · 32m ago
It affects many open source projects as well; they just scrape everything repeatedly, with abandon.
First from known networks, then from residential IPs. First with dumb http clients, now with full blown headless chrome browsers.
weinzierl · 58m ago
I think the answer for the non-commercial web is to stop worrying.
I understand why certain business models have a problem with AI crawlers, but I fail to see why sites like Codeberg have an issue.
If the problem is cost for the traffic then this is nothing new and I thought we have learned how to handle that by now.
myaccountonhn · 39m ago
The issue is the insane amount of traffic from crawlers that DDOS websites.
For example: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
> [...] Now it’s LLMs. If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
The Linux kernel project has also been dealing with it AFAIK. Apparently it's not so easy to handle, because these AI scrapers pull a lot of tricks to anonymize themselves.
MYEUHD · 50m ago
About 3 hours ago the codeberg website was really slow.
Services like codeberg that are run on donations can be easily DOS'ed by AI crawlers
herval · 1h ago
Hasn’t that been the case for a while? I’d imagine the combined traffic to all sites on the web doesn’t match a single hour of the traffic to the top 5 social media sites. The web has been pretty much dead for a while now; many companies don’t even bother maintaining websites anymore.
superkuh · 42m ago
I could see it being the end of commercial and institutional web applications which cannot handle traffic. But actual websites which are html and files in folders served by webservers don't have problems with this.
v5v3 · 1h ago
Could it be a 'correct' continuation of Darwin's survival of the fittest?
Retr0id · 1h ago
Last time I checked, Anubis used SHA256 for PoW. This is very GPU/ASIC friendly, so there's a big disparity between the amount of compute available in a legit browser vs a datacentre-scale scraping operation.
A more memory-hard "mining" algorithm could help.
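As a sketch of what "memory-hard" means here, imagine replacing the SHA-256 loop with something like scrypt, where every hash attempt needs tens of MiB of RAM (this is an illustration of the idea, not Anubis's actual code; Argon2id would be another candidate):

    import hashlib
    import itertools

    def solve_memory_hard(challenge: str, difficulty: int = 2) -> int:
        # Same leading-zero search, but each attempt runs scrypt with
        # N=2**14, r=8 (~16 MiB per hash), which is far harder to batch
        # cheaply on GPUs/ASICs than plain SHA-256.
        target = "0" * difficulty
        for nonce in itertools.count():
            digest = hashlib.scrypt(
                f"{challenge}{nonce}".encode(),
                salt=b"demo-salt",  # placeholder salt
                n=2**14, r=8, p=1,
                maxmem=64 * 1024 * 1024,
            ).hex()
            if digest.startswith(target):
                return nonce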
jsnell · 58m ago
A different algorithm would not help.
Here's the basic problem: the fully loaded cost of a server CPU core is ~1 cent/hour. The most latency you can afford to inflict on real users is a couple of seconds. That means the cost of passing a challenge the way the users pass it, with a CPU running Javascript, is about 1/1000th of a cent. And then that single proof of work will let them scrape at a minimum hundreds, but more likely thousands, of pages.
So a millionth of a cent per page. How much engineering effort is worth spending on optimizing that? Basically none, certainly not enough to offload to GPUs or ASICs.
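A quick back-of-envelope check of that estimate (all inputs are the ballpark figures above, not measurements):

    core_cents_per_hour = 1.0     # fully loaded server CPU core
    challenge_seconds = 2.0       # latency budget tolerable for real users
    pages_per_challenge = 1000    # pages scraped per solved challenge

    cents_per_challenge = core_cents_per_hour * challenge_seconds / 3600
    cents_per_page = cents_per_challenge / pages_per_challenge

    print(f"{cents_per_challenge:.6f} cents per challenge")  # ~0.000556
    print(f"{cents_per_page:.2e} cents per page")            # ~5.56e-07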
Retr0id · 50m ago
No matter where the bar is there will always be scrapers willing to jump over it, but if you can raise the bar while holding the user-facing cost constant, that's a win.
jsnell · 46m ago
No, but what I'm saying is that these scrapers are already not using GPUs or ASICs. It just doesn't make any economical sense to do that in the first place. They are running the same Javascript code on the same commodity CPUs and the same Javascript engine as the real users. So switching to an ASIC-resistant algorithm will not raise the bar. It's just going to be another round of the security theater that proof of work was in the first place.
Retr0id · 43m ago
They might not be using GPUs but their servers definitely have finite RAM. Memory-hard PoW reduces the number of concurrent sessions you can maintain per fixed amount of RAM.
The more sites get protected by Anubis, the stronger the incentives are for scrapers to actually switch to GPUs etc. It wouldn't take all that much engineering work to hook the webcrypto apis up to a GPU impl (although it would still be fairly inefficient like that). If you're scraping a billion pages then the costs add up.
jsnell · 28m ago
The duration you'd need the memory for is a couple of seconds, during which time you're pegging a CPU core on the computation anyway. It is not needed for the entirety of the browsing session.
Now, could you construct a challenge that forces the client to keep a ton of data in memory, and then regularly prove it still has that data for the entire session? I don't think so. The problem is that for that kind of intermittent proof scenario there's no need to actually keep the data in low-latency memory. It can just be stored on disk and paged in when needed (not often). It's a very different access pattern from the cryptocurrency use case.
Havoc · 57m ago
Really feels like this needs some sort of unified possibly legal approach to get these fkers to behave.
The search era clearly proved it is possible to crawl respectfully; the AI crawlers have just decided not to. They need to be disincentivized from doing this
dathinab · 31m ago
the problem in many cases is that even if such a law is made it likely
- is hard to enforce
- lacks bite, i.e. breaking it makes you more money than any penalty costs
but in general yes, a site which indicates it doesn't want to be crawled by AI bots but still gets crawled should be handled similarly to someone who has been banned from a shop forcing themselves into the shop anyway
given how severely messed up some millennium-era cyber security laws are, I wonder if crawlers bypassing Anubis could be interpreted as "circumventing digital access controls/protections" or similar, especially given that it's done to make copies of copyrighted material ;=)
hollow-moe · 1h ago
Really looks like the last solution is a legal one, e.g. using the DMCA's digital-protection or access-control circumvention clause against them or smth.
hyghjiyhu · 1h ago
Crazy thought, but what if you made the work required to access the site equal the work required to host the site? Host the public part of the database on something like WebTorrent and render the website from the db locally. You want to run expensive queries? Suit yourself. Not easy, but maybe possible?
nine_k · 1h ago
Why not ask it to directly mine some bitcoin, or do some protein folding? Let's make proof-of-work challenges proof-of-useful-work challenges. The server could even directly serve status 402 with the challenge.
thanks for making everything that much shittier just so you can steal everyone's data and present it as your own, AI companies!
OutOfHere · 11m ago
They failed to properly block/throttle the IP subnet as per their admission, and are now blaming others for their failure.
jsnell · 1h ago
This was beyond predictable. The monetary cost of proof of work is several orders of magnitude too small to deter scraping (let alone higher yield abuse), and passing the challenges requires no technical finesse basically by construction.
zahlman · 1h ago
We need to revive 402 Payment Required, clearly. If we lived in a world where we could easily set up a small trusted online balance for microtransactions that's interoperable with everyone, and where giving others a literal penny for their thoughts could allow for running up a significant bill for abusers, I'd gladly play along.
logicprog · 1h ago
Me too. I wouldn't mind Project Xanadu style micro payments for blogs, and it'd both fix the AI scraper issue and the ads issue, and help people fund hosting costs sustainably. I think the issue is taxes and transaction fees would push the prices too high, and it'd price out people with very low income possibly. It'd also create really perverse incentives for even more tight copyright control, since your content appearing even in part on anyone else's website is then directly losing you money, so it'd destroy the public Commons even more, which would be bad. But maybe not, who knows.
myaccountonhn · 35m ago
Pay to visit would be great, and would force these AI companies to actually pay for their data.
sumtechguy · 53m ago
For someone doing spamming, that low level would work well, as their cost has to stay extremely low for it to pay off. For someone scraping data to feed an AI, not so much. The AI groups usually have some pretty heavy-hitting hardware sitting behind them. They could even break off some hardware that is due to be retired and have it munch away on it. To make it non-cost-effective, the calculations would need to be much bigger.
WD-42 · 1h ago
This is sad, but predictable. At the end of the day if I can follow a link to an Anubis protected site and view it on my phone, the crawlers will be able to as well.
I see a lot more private networks in our future, unfortunately.