I wouldn't be surprised if simply delaying the server response by about 3 seconds had the same effect on those scrapers that Anubis claims to have.
lxgr · 1h ago
> This isn’t perfect of course, we can debate the accessibility tradeoffs and weaknesses, but conceptually the idea makes some sense.
It was arguably never a great idea to begin with, and stopped making sense entirely with the advent of generative AI.
ksymph · 1h ago
Reading the original release post for Anubis [0], it seems like it operates mainly on the assumption that AI scrapers have limited support for JS, particularly modern features. At its core it's security through obscurity; I suspect that as usage of Anubis grows, more scrapers will deliberately implement the features needed to bypass it.
That doesn't necessarily mean it's useless, but it also isn't really meant to block scrapers in the way TFA expects it to.
[0] https://xeiaso.net/blog/2025/anubis/
> It's a reverse proxy that requires browsers and bots to solve a proof-of-work challenge before they can access your site, just like Hashcash.
It's meant to rate-limit access by demanding a client-side computation that is light enough for legitimate human users and responsible crawlers, but taxing enough to impose a real cost on indiscriminate crawlers that hammer the host's resources.
The post does mention that lighter crawlers lack the functionality needed to execute the JS, but that's not the main reason the approach is thought to be sensible. The challenge effectively says: you have to want the content badly enough to spend the kind of compute an individual typically has on hand before I'll do the work of serving it to you.
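For anyone who hasn't seen Hashcash-style schemes before, here is a minimal sketch of the idea in TypeScript; the challenge string and difficulty are made-up values for illustration, not Anubis's actual code or parameters:

    // Minimal sketch of a Hashcash-style SHA-256 proof of work (TypeScript/Node).
    import { createHash } from "node:crypto";

    const sha256hex = (s: string): string =>
      createHash("sha256").update(s).digest("hex");

    // Client: brute-force a nonce until the hash starts with `difficulty` zero hex digits.
    // Expected work grows as 16^difficulty hashes.
    function solve(challenge: string, difficulty: number): number {
      let nonce = 0;
      while (!sha256hex(challenge + nonce).startsWith("0".repeat(difficulty))) {
        nonce++;
      }
      return nonce;
    }

    // Server: verification is a single hash, so the asymmetry favors the defender.
    function verify(challenge: string, nonce: number, difficulty: number): boolean {
      return sha256hex(challenge + nonce).startsWith("0".repeat(difficulty));
    }

    const nonce = solve("example-challenge-token", 4); // ~65k hashes on average
    console.log(verify("example-challenge-token", nonce, 4)); // true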
fluoridation · 1h ago
Hmm... What if instead of using plain SHA-256 it was a dynamically tweaked hash function that forced the client to run it in JS?
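One hypothetical way to read "dynamically tweaked", purely as a sketch: the server sends random per-challenge parameters that have to be applied between hash rounds, so a stock native SHA-256 solver can't be pointed at it without also implementing the tweak. The parameters below are invented for illustration:

    // Hypothetical "tweaked" hash: per-challenge round count and XOR byte
    // chosen by the server, so a generic SHA-256 farm needs extra plumbing.
    import { createHash } from "node:crypto";

    interface Tweak { rounds: number; xorByte: number } // invented parameters

    function tweakedHash(input: string, t: Tweak): string {
      let buf: Buffer = Buffer.from(input);
      for (let i = 0; i < t.rounds; i++) {
        buf = createHash("sha256").update(buf).digest();
        buf[0] ^= t.xorByte; // apply the per-challenge tweak between rounds
      }
      return buf.toString("hex");
    }

    console.log(tweakedHash("nonce-123", { rounds: 3, xorByte: 0x5a }));

Of course, anything expressible this way can still be reimplemented natively, which is the point the next reply makes.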
VMG · 1h ago
crawlers can run JS, and can also invest in running the Proof-of-JS better than you can
tjhorner · 12m ago
Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.). It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project.
fluoridation · 59m ago
If we're presupposing an adversary with infinite money then there's no solution. One may as well just take the site offline. The point is to spend effort in such a way that the adversary has to spend much more effort, hopefully so much it's impractical.
jimmaswell · 1h ago
What exactly is so bad about AI crawlers compared to Google or Bing? Is there more volume or is it just "I don't like AI"?
I don't understand why people resort to this tool instead of simply blocking by UA string or IP address. Are there really so many people running these AI crawlers?
I blackholed some IP blocks of OpenAI, Mistral and another handful of companies and 100% of this crap traffic to my webserver disappeared.
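For reference, the blunt version of that approach looks something like this sketch in TypeScript (Node); the user-agent patterns are common crawler names and the address ranges are placeholder TEST-NET blocks, not anyone's real published ranges:

    // Drop requests whose user agent or source address matches a deny list.
    import { createServer } from "node:http";

    const blockedUA = [/GPTBot/i, /ClaudeBot/i, /Amazonbot/i]; // example patterns
    const blockedPrefixes = ["203.0.113.", "198.51.100."];     // placeholder ranges

    createServer((req, res) => {
      const ua = req.headers["user-agent"] ?? "";
      const ip = req.socket.remoteAddress ?? "";
      if (blockedUA.some((re) => re.test(ua)) ||
          blockedPrefixes.some((p) => ip.startsWith(p))) {
        res.statusCode = 403;
        res.end();
        return;
      }
      res.end("hello\n");
    }).listen(8080);

As the replies below note, this only works as long as the crawlers keep honest user agents and stay on their own address space.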
mnmalst · 1h ago
Because that solution simply doesn't work for everyone. People tried, and the crawlers started using proxies with residential IPs.
hooverd · 1h ago
Less savory crawlers use residential proxies and are indistinguishable from malware traffic.
WesolyKubeczek · 1h ago
You should read more. AI companies use residential proxies and mask their user agents with legitimate browser ones, so good luck blocking that.
rnhmjoj · 43m ago
Which companies are we talking about here? In my case the traffic was similar to what was reported here[1]: these are crawlers from Google, OpenAI, Amazon, etc. They are really idiotic in behaviour, but at least they report themselves correctly.
[1]: https://pod.geraspora.de/posts/17342163
> The CAPTCHA forces vistors to solve a problem designed to be very difficult for computers but trivial for humans.
> Anubis – confusingly – inverts this idea.
Not really, AI easily automates traditional captchas now. At least this one does not need extensions to bypass.
No comments yet
anotherhue · 1h ago
Surely the difficulty factor scales with the system load?
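A hypothetical sketch of what load-scaled difficulty could look like (an illustration only, not a claim about what Anubis actually does):

    // Bump the PoW difficulty when the 1-minute load average per core is high.
    import { cpus, loadavg } from "node:os";

    function difficultyForLoad(base: number): number {
      const perCore = loadavg()[0] / cpus().length;
      if (perCore > 2) return base + 2; // heavily loaded: demand more client work
      if (perCore > 1) return base + 1;
      return base;
    }

    console.log(difficultyForLoad(4));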
WesolyKubeczek · 1h ago
I disagree with the post author's premise that things like Anubis are easy to bypass if you craft your bot well enough and throw compute at it.
Thing is, the actual lived experience of webmasters is that the bots scraping the internet for LLMs are nothing like carefully crafted software. They are more like your neighborhood shit-for-brains meth junkies competing over who can pull off more robberies in a day, no matter the take.
These bots are extremely stupid. They are worse than script kiddies' exploit-scanning software. They keep hammering the same pages with no regard for how often, if ever, those pages change. If they were a tenth as well-behaved as most scraping companies' software, they wouldn't be a problem in the first place.
Since these bots are so dumb, anything that is going to slow them down or stop them in their tracks is a good thing. Short of drone strikes on data centers or accidents involving owners of those companies that provide networks of botware and residential proxies for LLM companies, it seems fairly effective, doesn’t it?
lousken · 1h ago
aren't you happy? at least you get to see a catgirl
jayrwren · 1h ago
Literally the top link when I search for his exact text "why are anime catgirls blocking my access to the Linux kernel?" is https://lock.cmpxchg8b.com/anubis.html
Maybe Travis needs more google-fu. Maybe that includes using DuckDuckGo?
ksymph · 1h ago
This is neither here nor there but the character isn't a cat. It's in the name, Anubis, who is an Egyptian deity typically depicted as a jackal or generic canine, and the gatekeeper of the afterlife who weighs the souls of the dead (hence the tagline). So more of a dog-girl, or jackal-girl if you want to be technical.
Philpax · 1h ago
The argument isn't that it's difficult for them to circumvent - it's not - but that it adds enough friction to force them to rethink how they're scraping at scale and/or self-throttle.
I personally don't care about the act of scraping itself, but the volume of scraping traffic has forced administrators' hands here. I suspect we'd be seeing far fewer deployments if the scrapers behaved themselves to begin with.
davidclark · 1h ago
The OP author shows that the cost to scrape an Anubis site is essentially zero since it is a fairly simple PoW algorithm that the scraper can easily solve. It adds basically no compute time or cost for a crawler run out of a data center. How does that force rethinking?
Philpax · 1h ago
The cookie will be invalidated if shared between IPs, and it's my understanding that most Anubis deployments are paired with per-IP rate limits, which should reduce overall volume by limiting how many independent requests can be made at any given time.
That being said, I agree with you that there are ways around this for a dedicated adversary, and that it's unlikely to be a long-term solution as-is. My hope is that the act of having to circumvent Anubis at scale will prompt some introspection (do you really need to be rescraping every website constantly?), but that's hopeful thinking.
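For concreteness, one way a pass can be tied to the solving IP is to sign the IP and an expiry into the token, so replaying it from another address fails verification. This is a sketch with an assumed token layout, not Anubis's actual cookie format:

    // Issue and check an IP-bound, expiring token (TypeScript/Node sketch).
    import { createHmac, timingSafeEqual } from "node:crypto";

    const SECRET = "server-side secret"; // placeholder

    function issueToken(ip: string, ttlSeconds: number): string {
      const expires = Math.floor(Date.now() / 1000) + ttlSeconds;
      const payload = `${ip}|${expires}`;
      const sig = createHmac("sha256", SECRET).update(payload).digest("hex");
      return `${payload}|${sig}`;
    }

    function checkToken(token: string, requestIp: string): boolean {
      const [ip, expires, sig] = token.split("|");
      if (ip !== requestIp || Number(expires) < Date.now() / 1000) return false;
      const expected = createHmac("sha256", SECRET).update(`${ip}|${expires}`).digest("hex");
      return sig.length === expected.length &&
             timingSafeEqual(Buffer.from(sig), Buffer.from(expected));
    }

    const token = issueToken("192.0.2.10", 3600);
    console.log(checkToken(token, "192.0.2.10"));   // true
    console.log(checkToken(token, "198.51.100.7")); // false: different IP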
hooverd · 1h ago
The problem with crawlers is that they're functionally indistinguishable from your average malware botnet in behavior. If you saw a bunch of traffic from residential IPs using the same token, that'd be a big tell.
PaulHoule · 1h ago
I think a lot of it is performative, a demonstration that somebody is a member of a tribe, particularly the part about the kemonomimi [1] (e.g. people who are kinda like furries but have better taste in art).
[1] https://safebooru.donmai.us/posts?tags=animal_ears
It's as simple as this: having a nice picture there makes the whole thing feel nicer and gives it a bit of personality, so you put in some picture or art you like. That's it.
Likewise, any site using it can change that picture, but since there isn't any fundamental problem with the picture, most don't care to change it.