I wouldn't be surprised if simply delaying the server response by about 3 seconds had the same effect on those scrapers that Anubis claims to have.
lxgr · 1h ago
> This isn’t perfect of course, we can debate the accessibility tradeoffs and weaknesses, but conceptually the idea makes some sense.
It was arguably never a great idea to begin with, and stopped making sense entirely with the advent of generative AI.
ksymph · 1h ago
Reading the original release post for Anubis [0], it seems like it operates mainly on the assumption that AI scrapers have limited support for JS, particularly modern features. At its core it's security through obscurity; I suspect that as usage of Anubis grows, more scrapers will deliberately implement the features needed to bypass it.
That doesn't necessarily mean it's useless, but it also isn't really meant to block scrapers in the way TFA expects it to.
[0] https://xeiaso.net/blog/2025/anubis/
> It's a reverse proxy that requires browsers and bots to solve a proof-of-work challenge before they can access your site, just like Hashcash.
It's meant to rate-limit access by demanding a client-side computation that is light enough for legitimate human users and responsible crawlers, but taxing enough to impose a real cost on indiscriminate crawlers that hammer the host's resources.
The post does mention that lighter crawlers lack the functionality needed to execute the JS, but that's not the main reason the approach is thought to be sensible. The challenge effectively says: you have to want the content badly enough to spend the kind of compute an individual typically has on hand before I'll do the work of serving it to you.
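For anyone who hasn't seen Hashcash-style schemes before, here is a minimal sketch of the idea in TypeScript; the challenge string and difficulty are made-up values for illustration, not Anubis's actual code or parameters:

    // Minimal sketch of a Hashcash-style SHA-256 proof of work (TypeScript/Node).
    import { createHash } from "node:crypto";

    const sha256hex = (s: string): string =>
      createHash("sha256").update(s).digest("hex");

    // Client: brute-force a nonce until the hash starts with `difficulty` zero hex digits.
    // Expected work grows as 16^difficulty hashes.
    function solve(challenge: string, difficulty: number): number {
      let nonce = 0;
      while (!sha256hex(challenge + nonce).startsWith("0".repeat(difficulty))) {
        nonce++;
      }
      return nonce;
    }

    // Server: verification is a single hash, so the asymmetry favors the defender.
    function verify(challenge: string, nonce: number, difficulty: number): boolean {
      return sha256hex(challenge + nonce).startsWith("0".repeat(difficulty));
    }

    const nonce = solve("example-challenge-token", 4); // ~65k hashes on average
    console.log(verify("example-challenge-token", nonce, 4)); // true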
fluoridation · 1h ago
Hmm... What if instead of using plain SHA-256 it was a dynamically tweaked hash function that forced the client to run it in JS?
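One hypothetical way to read "dynamically tweaked", purely as a sketch: the server sends random per-challenge parameters that have to be applied between hash rounds, so a stock native SHA-256 solver can't be pointed at it without also implementing the tweak. The parameters below are invented for illustration:

    // Hypothetical "tweaked" hash: per-challenge round count and XOR byte
    // chosen by the server, so a generic SHA-256 farm needs extra plumbing.
    import { createHash } from "node:crypto";

    interface Tweak { rounds: number; xorByte: number } // invented parameters

    function tweakedHash(input: string, t: Tweak): string {
      let buf: Buffer = Buffer.from(input);
      for (let i = 0; i < t.rounds; i++) {
        buf = createHash("sha256").update(buf).digest();
        buf[0] ^= t.xorByte; // apply the per-challenge tweak between rounds
      }
      return buf.toString("hex");
    }

    console.log(tweakedHash("nonce-123", { rounds: 3, xorByte: 0x5a }));

Of course, anything expressible this way can still be reimplemented natively, which is the point the next reply makes.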
VMG · 1h ago
crawlers can run JS, and can also invest in running the Proof-of-JS better than you can
tjhorner · 12m ago
Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.). It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project.
fluoridation · 59m ago
If we're presupposing an adversary with infinite money then there's no solution. One may as well just take the site offline. The point is to spend effort in such a way that the adversary has to spend much more effort, hopefully so much it's impractical.
jimmaswell · 1h ago
What exactly is so bad about AI crawlers compared to Google or Bing? Is there more volume or is it just "I don't like AI"?
I don't understand why people resort to this tool instead of simply blocking by UA string or IP address. Are there really so many people running these AI crawlers?
I blackholed some IP blocks of OpenAI, Mistral and another handful of companies and 100% of this crap traffic to my webserver disappeared.
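For reference, the blunt version of that approach looks something like this sketch in TypeScript (Node); the user-agent patterns are common crawler names and the address ranges are placeholder TEST-NET blocks, not anyone's real published ranges:

    // Drop requests whose user agent or source address matches a deny list.
    import { createServer } from "node:http";

    const blockedUA = [/GPTBot/i, /ClaudeBot/i, /Amazonbot/i]; // example patterns
    const blockedPrefixes = ["203.0.113.", "198.51.100."];     // placeholder ranges

    createServer((req, res) => {
      const ua = req.headers["user-agent"] ?? "";
      const ip = req.socket.remoteAddress ?? "";
      if (blockedUA.some((re) => re.test(ua)) ||
          blockedPrefixes.some((p) => ip.startsWith(p))) {
        res.statusCode = 403;
        res.end();
        return;
      }
      res.end("hello\n");
    }).listen(8080);

As the replies below note, this only works as long as the crawlers keep honest user agents and stay on their own address space.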
mnmalst · 1h ago
Because that solution simply doesn't work for everyone. People tried, and the crawlers started using proxies with residential IPs.
hooverd · 1h ago
Less savory crawlers use residential proxies and are indistinguishable from malware traffic.
WesolyKubeczek · 1h ago
You should read more. AI companies use residential proxies and mask their user agents with legitimate browser ones, so good luck blocking that.
rnhmjoj · 43m ago
Which companies are we talking about here? In my case the traffic was similar to what was reported here[1]: these are crawlers from Google, OpenAI, Amazon, etc. They are really idiotic in behaviour, but at least they report themselves correctly.
[1]: https://pod.geraspora.de/posts/17342163
> The CAPTCHA forces vistors to solve a problem designed to be very difficult for computers but trivial for humans.
> Anubis – confusingly – inverts this idea.
Not really, AI easily automates traditional captchas now. At least this one does not need extensions to bypass.
No comments yet
anotherhue · 1h ago
Surely the difficulty factor scales with the system load?
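A hypothetical sketch of what load-scaled difficulty could look like (an illustration only, not a claim about what Anubis actually does):

    // Bump the PoW difficulty when the 1-minute load average per core is high.
    import { cpus, loadavg } from "node:os";

    function difficultyForLoad(base: number): number {
      const perCore = loadavg()[0] / cpus().length;
      if (perCore > 2) return base + 2; // heavily loaded: demand more client work
      if (perCore > 1) return base + 1;
      return base;
    }

    console.log(difficultyForLoad(4));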
WesolyKubeczek · 1h ago
I disagree with the post author's premise that things like Anubis are easy to bypass if you craft your bot well enough and throw compute at it.
Thing is, the actual lived experience of webmasters is that the bots scraping the internet for LLMs are nothing like carefully crafted software. They are more like your neighborhood shit-for-brains meth junkies competing over who can pull off more robberies in a day, no matter the take.
These bots are extremely stupid. They are worse than script kiddies' exploit-scanning software. They keep hammering the same pages with no regard for how often, if ever, those pages change. If they were a tenth as well-behaved as most scraping companies' software, they wouldn't be a problem in the first place.
Since these bots are so dumb, anything that is going to slow them down or stop them in their tracks is a good thing. Short of drone strikes on data centers or accidents involving owners of those companies that provide networks of botware and residential proxies for LLM companies, it seems fairly effective, doesn’t it?
lousken · 1h ago
aren't you happy? at least you get to see a catgirl
jayrwren · 1h ago
Literally the top link when I search for his exact text "why are anime catgirls blocking my access to the Linux kernel?" is https://lock.cmpxchg8b.com/anubis.html
Maybe Travis needs more google-fu. Maybe that includes using DuckDuckGo?
ksymph · 1h ago
This is neither here nor there but the character isn't a cat. It's in the name, Anubis, who is an Egyptian deity typically depicted as a jackal or generic canine, and the gatekeeper of the afterlife who weighs the souls of the dead (hence the tagline). So more of a dog-girl, or jackal-girl if you want to be technical.
Philpax · 1h ago
The argument isn't that it's difficult for them to circumvent - it's not - but that it adds enough friction to force them to rethink how they're scraping at scale and/or self-throttle.
I personally don't care about the act of scraping itself, but the volume of scraping traffic has forced administrators' hands here. I suspect we'd be seeing far fewer deployments if the scrapers behaved themselves to begin with.
davidclark · 1h ago
The OP author shows that the cost to scrape an Anubis site is essentially zero since it is a fairly simple PoW algorithm that the scraper can easily solve. It adds basically no compute time or cost for a crawler run out of a data center. How does that force rethinking?
Philpax · 1h ago
The cookie will be invalidated if shared between IPs, and it's my understanding that most Anubis deployments are paired with per-IP rate limits, which should reduce overall volume by limiting how many independent requests can be made at any given time.
That being said, I agree with you that there are ways around this for a dedicated adversary, and that it's unlikely to be a long-term solution as-is. My hope is that the act of having to circumvent Anubis at scale will prompt some introspection (do you really need to be rescraping every website constantly?), but that's hopeful thinking.
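For concreteness, one way a pass can be tied to the solving IP is to sign the IP and an expiry into the token, so replaying it from another address fails verification. This is a sketch with an assumed token layout, not Anubis's actual cookie format:

    // Issue and check an IP-bound, expiring token (TypeScript/Node sketch).
    import { createHmac, timingSafeEqual } from "node:crypto";

    const SECRET = "server-side secret"; // placeholder

    function issueToken(ip: string, ttlSeconds: number): string {
      const expires = Math.floor(Date.now() / 1000) + ttlSeconds;
      const payload = `${ip}|${expires}`;
      const sig = createHmac("sha256", SECRET).update(payload).digest("hex");
      return `${payload}|${sig}`;
    }

    function checkToken(token: string, requestIp: string): boolean {
      const [ip, expires, sig] = token.split("|");
      if (ip !== requestIp || Number(expires) < Date.now() / 1000) return false;
      const expected = createHmac("sha256", SECRET).update(`${ip}|${expires}`).digest("hex");
      return sig.length === expected.length &&
             timingSafeEqual(Buffer.from(sig), Buffer.from(expected));
    }

    const token = issueToken("192.0.2.10", 3600);
    console.log(checkToken(token, "192.0.2.10"));   // true
    console.log(checkToken(token, "198.51.100.7")); // false: different IP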
hooverd · 1h ago
The problem with crawlers is that they're functionally indistinguishable from your average malware botnet in behavior. If you saw a bunch of traffic from residential IPs using the same token, that'd be a big tell.
PaulHoule · 1h ago
I think a lot of it is performative, a demonstration that somebody is a member of a tribe, particularly the part about the kemonomimi [1] (e.g. people who are kinda like furries but have better taste in art).
[1] https://safebooru.donmai.us/posts?tags=animal_ears
It's as simple as this: having a nice picture there makes the whole thing feel nicer and gives it a bit of personality, so you put in some picture or art you like. That's it.
Likewise, any site using it can change that picture, but since there isn't any fundamental problem with the picture, most don't care to change it.