A thought on JavaScript "proof of work" anti-scraper systems

55 points by zdw | 5/26/2025, 5:01:25 AM | utcc.utoronto.ca

Comments (49)

hardwaresofton · 3m ago
Fantastic work by Xe here -- not the first but this seems like the most traction I've seen on a PoW anti-scraper project (with an MIT license to boot!).

PoW anti-scraper tools are a good first step, but why not jump straight to the endgame? We're drawing closer to a point where information's value is fully realized -- people will stop sharing knowledge for free. It doesn't have to be that way, but it will be in a world where people are pressed for economic means: knowledge becomes an obvious thing to convert to capital and extract rent on.

The simple way this happens is just a login wall -- for every website. It doesn't have to be a paid login wall of course (at first), but it's a super simple way to both legally and practically protect from scrapers.

I think high-quality knowledge and source code (which is basically executable knowledge) being open in general is a miracle/luxury of functioning, economically balanced societies where people feel driven (for many possible reasons) to give back, or have the time to think about more than surviving.

Don't get me wrong -- the doomer angle is almost always wrong; on many important metrics humanity ends up better off every year than the one before -- but it's getting harder to see a world where we cartwheel through another technological transformation, one that this time could impact large percentages of the working population.

persnickety · 4m ago
> An LLM scraper is operating in a hostile environment [...] because you can't particularly tell a JavaScript proof of work system from JavaScript that does other things. [..] for people who would like to exploit your scraper's CPU to do some cryptocurrency mining, or [...] want to waste as much of your CPU as possible).

That's a valid reason why serving JS-based PoW systems scares LLM operators: there's a chance the code might actually be malicious.

That's not a valid reason to serve JS-based PoW systems to human users: the entire reason those proofs work against LLMs is the threat that the code is malicious.

In other words, PoW works against LLM scrapers not because of the PoW itself, but because the served scripts could contain malicious code. Why would you threaten your users with that?

And if you can apply the threat only to LLMs, then why not cut the PoW garbage and start with that instead?

I know, it's because it's not so easy. So instead of hanging the sword of Damocles of malware over everyone, why not standardize on some PoW algorithm that people can apply honestly, without the risks?
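For illustration, here is a minimal sketch of such a standardized challenge, assuming a hashcash-style scheme (SHA-256 with a leading-zero-bits target); the function names, nonce encoding, and 20-bit difficulty are placeholders, not any existing system's parameters:

```python
import hashlib
import secrets

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Grind nonces until sha256(challenge || nonce) starts with difficulty_bits zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Verification is a single hash, so the server stays cheap even under load."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

challenge = secrets.token_bytes(16)              # issued by the server per visitor
nonce = solve(challenge, difficulty_bits=20)     # ~2^20 hashes on average for the client
assert verify(challenge, nonce, difficulty_bits=20)
```

The asymmetry is the whole trick: the client burns roughly 2^difficulty hashes while the server spends one, and nothing about the script needs to be opaque or threatening.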

DaSHacka · 58m ago
Surprised there hasn't been a fork of Anubis that changes the artificial PoW into a simple Monero mining PoW yet.

Would be hilarious to trap scraper bots into endless labyrinths of LLM-generated mediawiki pages, getting them to mine hashes with each progressive article.

At least then we would be making money off these rude bots.

xnorswap · 37m ago
The bots could check whether they've hit the jackpot themselves, keep the valid hashes, and only return the worthless ones.

Then it's the bots who are making money from work they need to do for the captchas.

nssnsjsjsjs · 22m ago
1. The problem is that the bot would need to understand the program it is running to do that, which is akin to the halting problem.

2. There is no money in mining on the kind of hardware scrapers will run on. Power costs more than they'd earn.
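A rough back-of-envelope illustrates the point; every figure below is an assumed, order-of-magnitude placeholder (hashrates, coin price, and power cost all drift), not live data:

```python
# Illustrative assumptions only -- not current numbers.
cpu_hashrate = 10_000        # H/s on RandomX for a decent server CPU (assumed)
network_hashrate = 3e9       # H/s for the whole Monero network (assumed order of magnitude)
blocks_per_day = 720         # ~2-minute block time
block_reward_xmr = 0.6       # tail emission
xmr_price_usd = 150          # assumed
power_watts = 100            # CPU at full load (assumed)
usd_per_kwh = 0.15           # assumed

daily_revenue = (cpu_hashrate / network_hashrate) * blocks_per_day * block_reward_xmr * xmr_price_usd
daily_power_cost = power_watts / 1000 * 24 * usd_per_kwh
print(f"mining revenue/day ~ ${daily_revenue:.2f}, power cost/day ~ ${daily_power_cost:.2f}")
# Under these assumptions the power bill exceeds the revenue, and a scraper
# only runs the PoW for a fraction of a second per page anyway.
```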

forty · 35m ago
We need an oblivious cryptocurrency mining algorithm ^^

sznio · 1h ago
I'd really like this, since it wouldn't impact my scraping stuff.

I like to scrape websites and make alternative, personalized frontends for them. Captchas are really painful for me. Proof of work would be painful for a massive scraping operation, but I wouldn't have an issue with spending some CPU time to get the latest posts from a site which doesn't have an RSS feed or an API.

ChocolateGod · 1h ago
I'm glad that, after spending all this time trying to increase power efficiency, people have come up with JavaScript that serves no purpose other than to increase power draw.

I feel sorry for people with budget phones who now have to battle these PoW systems. I think LLM scrapers will win this one, with everyone else suffering a worse browsing experience.

jeroenhd · 14m ago
This is rather unfortunate, but the way Anubis works, you will only get the PoW test once.

Scrapers, on the other hand, keep throwing out their session cookies (because if they didn't, you could easily limit their access using cookies). That means they need to run the PoW workload on every page load.

If you're visiting loads of different websites, that does suck, but most people won't be affected all that much in practice.
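For illustration, a minimal sketch of that flow, assuming a generic Anubis-like design (the cookie format, lifetimes, and difficulty are invented here, not Anubis's actual implementation): the PoW is only demanded when no valid cookie is presented, so a browser pays once per session while a cookie-discarding scraper pays on every request.

```python
import hashlib
import hmac
import secrets
import time

SERVER_KEY = secrets.token_bytes(32)   # kept server-side only
DIFFICULTY_BITS = 16                   # tune for load / IP reputation

def pow_ok(challenge: str, nonce: int) -> bool:
    # Hashcash-style check: the top DIFFICULTY_BITS bits of the hash must be zero.
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

def mint_cookie(ttl: int = 7 * 24 * 3600) -> str:
    # Signed "already paid" token; no server-side session storage needed.
    expiry = str(int(time.time()) + ttl)
    sig = hmac.new(SERVER_KEY, expiry.encode(), hashlib.sha256).hexdigest()
    return f"{expiry}:{sig}"

def cookie_ok(cookie: str) -> bool:
    try:
        expiry, sig = cookie.split(":")
    except ValueError:
        return False
    expected = hmac.new(SERVER_KEY, expiry.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and int(expiry) > time.time()

def handle_request(cookie: str | None, challenge: str | None = None, nonce: int | None = None) -> str:
    # Browsers keep the cookie and solve the puzzle once per week;
    # scrapers that throw cookies away get bounced back to the challenge every time.
    if cookie and cookie_ok(cookie):
        return "200 OK: here is the page"
    if challenge and nonce is not None and pow_ok(challenge, nonce):
        return f"200 OK, Set-Cookie: pow_pass={mint_cookie()}"
    return f"403: solve this first: {secrets.token_hex(8)}"
```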

There are alternatives, of course. Several attempts at standardising remote attestation have been made; Apple added remote attestation to Safari years ago. Basically, Apple/Google/Cloudflare give each user a limited set of "tokens" after verifying that they're a real person on real hardware (using TPMs and whatnot), and you exchange those tokens for website visits. Every user gets plenty of usable tokens, but bots quickly run out and get denied access. For this approach to work, that means locking out Linux users, people with secure boot disabled, and things like outdated or rooted phones, but in return you don't get PoW walls or Cloudflare CAPTCHAs.

In the end, LLM scrapers are why we can't have nice things. The web will only get worse now that these bots are on the loose.

dannyw · 1h ago
This is a poor take. All the major LLM scrapers already run and execute JavaScript; Googlebot has been doing it for probably a decade.

Simple limits on runtime stop crypto mining from being too big of a problem.

jeroenhd · 20m ago
And by making bots hit that limit, scrapers don't get access to the protected pages, so the system works.

Bots can either risk being turned into crypto miners, or risk not grabbing free data to train AIs on.

maeln · 47m ago
But isn't that the whole point of the article? Big scrapers can hardly tell whether the JS that eats their runtime is a crypto miner or an anti-scraping system, so they will have to give up "useful" scraping as well, which means PoW might just work.
rob_c · 27m ago
No, the point is there are really advanced PoW challenges out there to prove you're not a bot (those websites that take >3s to fingerprint you are doing this!).

The idea is to abuse the abusers: if you suspect it's a bot, change the PoW from a GPU/machine/die fingerprint computation to something like a few ticks of Monero, or whatever the crypto of choice is this week.

Sounds useless, but don't forget: 0.5s of that across their farm of 1e4 scraping nodes and you're into something.

The catch is not getting caught out by impacting the 0.1% of Tor-running, anti-ad "users" out there who will try to decompile your code when their personal Chrome build fails to work. I say "users" because they will be visiting a non-free site while espousing their perceived right to be there, no different from a bot to someone paying the bills.

TZubiri · 1h ago
"Googlebot has been doing it for probably a decade."

This is why Google developed a browser: it turns out scraping the web requires you to pretty much develop a V8 engine, so why not publish it as a browser.

motoxpro · 59m ago
This is so obvious when you say it, but what an awesome insight.
rkangel · 11m ago
It's not quite that simple. Having that skillset and knowledge in-house already probably made it feasible, but that's not why they did it. They created Chrome because it was in their best interest for rich web applications to run well.
nssnsjsjsjs · 20m ago
Except it doesn't make sense. Why not just use Firefox? Or improve Firefox's JS engine?

I reckon they made the browser to control the browser market.

baq · 10m ago
their browser is their scraper. what you see is what the scraper sees is what the ads look like.
rob_c · 32m ago
You don't work anywhere near the industry then; people have been grumbling about this for the whole 10 years now.
mschuster91 · 44m ago
... and the fact that even with a browser, content gated behind Macromedia Flash or ActiveX applets was / is not indexable is why Google pushed so hard to expand HTML5 capabilities.
bob1029 · 24m ago
I think this is not a battle that can be won in this way.

Scraping content for an LLM is not a hyper time sensitive thing. You don't need to scrape every page every day. Sam Altman does not need a synchronous replica of the internet to achieve his goals.

CGamesPlay · 20m ago
That is one view of the problem, but the one people are fixing with proof-of-work systems is the (unintentional) DDoS that LLM scrapers are inflicting on these sites. Just reducing the amount of traffic to manageable levels lets me get back to whatever my site is supposed to be doing. I personally don't care if Sam Altman has a copy of my git server's rendition of the blame of every commit in my open source repo, because he could have just cloned my git repo and gotten the same result.
bob1029 · 12m ago
I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.

heinrich5991 · 4m ago
> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

Yes, there are sites being DDoSed by scrapers for LLMs.

> If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.

This isn't about one request per week or per month. There were reports from many sites that they're being hit by scrapers that request from many different IP addresses, one request each.

2000UltraDeluxe · 3m ago
25k+ hits/minute here. And that's just counting the scrapers that don't identify themselves as browsers.

Not sure why you believe massive repeated scraping isn't a problem. It's not like there is just one single actor out there, and ignoring robots.txt seems to be the norm nowadays.

fl0id · 13m ago
For the model it's not. But I think many of these bots are also from tool usage or "research" or whatever they call it these days. And for that it does matter.
forty · 38m ago
Can we have a proof-of-work algorithm that computes something actually useful? Like finding large prime numbers, or something else that already has distributed computation programs. That way all this wasted power is at least not completely lost.
vanschelven · 31m ago
TBH most of the talk of "aggressive scraping" has been in the 100K pages/day range (which is ~1 page/s, i.e. "negligible"). In my mind cloud providers' ridiculous egress rates are more to blame here.
jeroenhd · 9m ago
I've caught Huawei and Tencent IPs scraping the same image over and over again, with different query parameters. Sure, the image was only 260KiB and I don't use Amazon or GCP or Azure so it didn't cost me anything, but it still spammed my logs and caused a constant drain on my servers' resources.

The bots keep coming back too, ignoring HTTP status codes, permanent redirects, and whatever else I can think of to tell them to fuck off. Robots.txt obviously doesn't help. Filtering traffic from data centers didn't help either, because soon after I did that, residential IPs started doing the same thing. I don't know if this is a Chinese ISP abusing their IP ranges or if China just has a massive botnet problem, but either way the traditional ways of getting rid of these bots haven't helped.

In the end, I'm now blocking all of China and Singapore. That stops the endless flow of bullshit requests for now, though I see some familiar user agents appearing in other east Asian countries as well.

keepamovin · 1h ago
Interesting - we dealt with this issue in CloudTabs, a SaaS for the BrowserBox remote browser. The way we handle it is simply to monitor resource usage with a Python script, issue a warning to the user when their tab or their processes overall are running hot, and then, when the rules are triggered, kill the offending processes (those using too much CPU or RAM).

Chrome has the nice property that you can kill a render process for a tab and often it just takes that tab down, leaving everything else running fine. This, plus the warning, keeps user impact minimal while ensuring resources for all.

In the past we experimented with cgroups (both versions) and other mechanisms for limiting, but found dynamic monitoring to be the most reliable.
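A minimal sketch of that kind of monitor, assuming psutil and made-up thresholds (this is a generic illustration, not CloudTabs' actual script; a real version would filter down to the browser's renderer processes by name or cmdline):

```python
import time
import psutil

CPU_WARN, CPU_KILL = 70.0, 90.0   # percent; assumed thresholds
RSS_KILL = 2 * 1024 ** 3          # 2 GiB; assumed

def watch(interval: float = 5.0) -> None:
    while True:
        for proc in psutil.process_iter(["name", "cpu_percent", "memory_info"]):
            try:
                # cpu_percent is measured relative to the previous pass, so the
                # first iteration reports 0 for every process.
                cpu = proc.info["cpu_percent"] or 0.0
                mem = proc.info["memory_info"]
                rss = mem.rss if mem else 0
                if cpu > CPU_KILL or rss > RSS_KILL:
                    proc.kill()   # in Chrome, killing a renderer usually only drops its tab
                elif cpu > CPU_WARN:
                    print(f"warning: {proc.info['name']} (pid {proc.pid}) running hot: {cpu:.0f}% CPU")
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        time.sleep(interval)

if __name__ == "__main__":
    watch()
```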

nssnsjsjsjs · 24m ago
My violin is microscopic for this problem. It's actually given me ideas!
h1fra · 1h ago
Most scrapers are not able to monitor each website's performance; what will happen is that sites will respond more slowly, and that's it.
matt3210 · 58m ago
The issue isn’t the resource usage as much as the content they’re stealing for reproduction purposes.
bmacho · 46m ago
Another idea: if your content is ~5kb of text, then serve it to whoever asks for it. If you don't have the bandwidth, try to make it smaller, static, and put it on the edge, i.e. some other people's computers.
foul · 2m ago
The fun and terrible thing about the web is that the "rockstar in temporary distress" trope can be true and can happen when you least expect it, like, you know, when you receive an HN kiss of death.

You can surely expect static content to stay static and not run jpegoptim on an image on every hit (a dynamic CMS + a sudden visit from ByteDance = your server is DDoSed), but you can't expect any idiot/idiot hive on this planet to set up a multi-country edge-caching architecture for a small website just in case some blog post hits a few million visits every ten minutes. That can easily take down a server even for static content.

rob_c · 33m ago
Seen this already: back in the day there were sites which ran bitcoin hashing on their userbase, and there was uproar.

If someone dusted off the same tools and managed to get Altman to buy them a nice car from it, good on them :)

Animats · 1h ago
"This page is taking too long to load" on sites with anti-scraper technology. Do Not Want.
jchw · 1h ago
In general, if you're coming from a "clean" IP your difficulty won't be very high. In the future, if these systems coordinate with each other in some way (DHT?), it should be possible to drop the baseline difficulty even further.
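A sketch of that kind of adaptive difficulty, with invented score boundaries (the reputation source, whether local request history or some future shared DHT, is left abstract):

```python
def difficulty_bits(reputation: float, baseline: int = 12) -> int:
    """Map an IP reputation score in [0, 1] (1 = clean) to hashcash difficulty bits.

    The boundaries and baseline are illustrative assumptions, not from any real deployment.
    """
    if reputation > 0.9:
        return baseline          # clean IPs barely notice the challenge
    if reputation > 0.5:
        return baseline + 4      # ~16x more work
    return baseline + 8          # ~256x more work for known-abusive ranges

# A shared reputation store would feed the score; the PoW check itself stays unchanged.
print(difficulty_bits(0.95), difficulty_bits(0.3))
```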
Unmixed0039 · 1h ago
Yeah, but is it the victim's fault that they defend themselves? Either they spend their money on scaling up while they get scraped, or they don't and I have to wait a little bit.
RamRodification · 1h ago
No, but it also doesn't matter whose fault it is. Having that happen will probably reduce your traffic. It might be worth the trade-off, and it might not. But if people go "Do Not Want!" it is worth paying attention to.
alpaca128 · 58m ago
The one mentioned in the article was made because LLM scrapers DDOSed a person’s server. If it’s unusable for everyone, making everyone wait a bit to solve the issue is a clear improvement.

> it also doesn’t matter whose fault it is

People will always blame someone and have the uncanny talent to blame the wrong party. With cookie popups they blame the EU instead of the websites literally selling them out, here they’ll probably turn around and blame the website operators for defending themselves. For corporations running not just the scrapers but also the social networks it’s easy to algorithmically control the narrative.

Disposal8433 · 58m ago
> will probably reduce your traffic

That's the point. Some web sites are DDOSed by scrapers and they obviously don't want such traffic.

TZubiri · 1h ago
" and LLM scrapers may well have lots of CPU time available through means such as compromised machines."

It's not clear whether the author means LLM scrapers in the sense of scrapers that gather training data for foundation models, LLM scrapers that browse the web to provide up-to-date answers, or vibe coders and agents that use browsers at the behest of the programmer or the user.

But in none of those myriad cases can I imagine compromised machines being relevant. If we are talking about compromised machines, it's irrelevant whether an LLM is involved and how; it's a distributed attack completely unrelated to LLMs.

OutOfHere · 1h ago
See also http://www.hashcash.org/, a famous proof-of-work algorithm. The bigger benefit of proof-of-work is not that it's anti-LLM; it's that it's statelessly anti-DoS.

I have been developing a public service, and I intend to use a simple implementation of proof-of-work in it, made to work with a single call without needing back-and-forth information from the server for each request.
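One way to get that single-call property is to let the client derive the challenge itself from the current time and the request target, hashcash-style, so the server never issues or stores challenges; a sketch with an invented stamp layout and difficulty (a real deployment would also keep a short-lived cache of seen stamps to block replays inside the freshness window):

```python
import hashlib
import time

DIFFICULTY_BITS = 18      # assumed
MAX_AGE_SECONDS = 300     # assumed freshness window

def make_stamp(resource: str) -> str:
    """Client side: derive the challenge from the clock and the resource, then grind a nonce."""
    ts = int(time.time())
    nonce = 0
    while True:
        stamp = f"{ts}:{resource}:{nonce}"
        digest = hashlib.sha256(stamp.encode()).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0:
            return stamp
        nonce += 1

def accept_stamp(stamp: str, resource: str) -> bool:
    """Server side: one hash plus a freshness check; no prior round trip needed."""
    try:
        ts, res, _nonce = stamp.split(":", 2)
    except ValueError:
        return False
    if res != resource or abs(time.time() - int(ts)) > MAX_AGE_SECONDS:
        return False
    digest = hashlib.sha256(stamp.encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

stamp = make_stamp("/api/search")
assert accept_stamp(stamp, "/api/search")
```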

wordofx · 1h ago
lol LLMs don’t just randomly execute JavaScript on a web page when scraped.

Edit: lol no wonder hacker news is generally anti AI. They think ai will just randomly execute JavaScript it sees on a web page. It’s amazing how incredibly dumb half of HN is.

seszett · 24m ago
This is about the scraping tools that are used to train LLMs, not the scrapers used by live LLMs.
darkhelmet · 24m ago
You're missing the point. The approach is to block anything that doesn't execute the JavaScript. The JavaScript generates a key that allows your queries to work. No JavaScript => no key => no content for you to scrape.

It's a nuclear option. Nobody wins. The site operator can adjust the difficulty as needed to keep the abusive scrapers at bay.