I think this is not a battle that can be won in this way.
Scraping content for an LLM is not a hyper time sensitive thing. You don't need to scrape every page every day. Sam Altman does not need a synchronous replica of the internet to achieve his goals.
CGamesPlay · 2m ago
That is one view of the problem, but the one people are fixing with proof-of-work systems is the (unintentional) DDoS that LLM scrapers are inflicting on these sites. Just reducing the amount of traffic to manageable levels lets me get back to the work of doing whatever my site is supposed to be doing. I personally don't care if Sam Altman has a copy of my git server's rendition of the blame of every commit in my open source repo, because he could have just cloned my git repo and gotten the same result.
sznio · 1h ago
I'd really like this, since it wouldn't impact my scraping stuff.
I like to scrape websites and make alternative, personalized frontends for them. Captchas are really painful for me. Proof of work would be painful for a massive scraping operation, but I wouldn't have an issue with spending some CPU time to get the latest posts from a site which doesn't have an RSS feed or an API.
DaSHacka · 39m ago
Surprised there hasn't been a fork of Anubis that changes the artificial PoW into a simple Monero mining PoW yet.
Would be hilarious to trap scraper bots in endless labyrinths of LLM-generated mediawiki pages, getting them to mine hashes with each successive article.
At least then we would be making money off these rude bots.
xnorswap · 18m ago
The bots could check whether they've hit the jackpot, keep the valid hashes for themselves, and only submit the ones that turn out to be worthless.
Then it's the bots who are making money from work they need to do for the captchas.
nssnsjsjsjs · 4m ago
1. The problem is that the bot needs to understand the program it is running to do that, which is akin to the halting problem.
2. There is no money in mining on the kind of hardware scrapers run on. Power costs more than they'd earn.
forty · 17m ago
We need an oblivious crypto currency mining algorithm ^^
dannyw · 1h ago
This is a poor take. All the major LLM scrapers already run and execute JavaScript; Googlebot has been doing it for probably a decade.
Simple limits on runtime stop crypto mining from being too big of a problem.
jeroenhd · 2m ago
And by making bots hit that limit, scrapers don't get access to the protected pages, so the system works.
Bots can either risk being turned into crypto miners, or risk not grabbing free data to train AIs on.
TZubiri · 1h ago
"Googlebot has been doing it for probably a decade."
This is why Google developed a browser: it turns out scraping the web requires you to pretty much build a V8 engine, so why not publish it as a browser?
motoxpro · 41m ago
This is so obvious when you say it, but what an awesome insight.
nssnsjsjsjs · 2m ago
Except it doesn't make sense. Why not just use Firefox, or improve the JS engine of Firefox?
I reckon they made the browser to control the browser market.
rob_c · 13m ago
You don't work anywhere near the ads industry then; people have been grumbling about this for the whole 10 years now.
mschuster91 · 25m ago
... and the fact that even with a browser, content gated behind Macromedia Flash or ActiveX applets was / is not indexable is why Google pushed so hard to expand HTML5 capabilities.
maeln · 28m ago
But isn't that the whole point of the article? Big scrapers can hardly tell whether the JS eating their runtime is a crypto miner or an anti-scraping system, so they will have to give up "useful" scraping, which means PoW might just work.
rob_c · 8m ago
No, the point is that there are really advanced PoW challenges out there to prove you're not a bot (the websites that take >3s to fingerprint you are doing this!).
The idea is to abuse the abusers: if you suspect it's a bot, change the PoW from a GPU/machine/die fingerprint computation to something like a few ticks of Monero or whatever the crypto of choice is this week.
Sounds useless, but 0.5s of that across their farm of 1e4 scraping nodes and you're onto something.
The catch is not getting caught out by impacting the 0.1% of Tor-running, ad-blocking "users" out there who will try to decompile your code when their personal Chrome build fails to work. I say "users" because they will be visiting a non-free site while espousing their perceived right to be there; to someone paying the bills they are no different from a bot.
nssnsjsjsjs · 6m ago
My violin is microscopic for this problem. It's actually given me ideas!
ChocolateGod · 1h ago
I'm glad that, after spending all this time trying to increase power efficiency, people have come up with JavaScript that serves no purpose other than to increase power draw.
I feel sorry for people with budget phones who now have to battle with these PoW systems. I think LLM scrapers will win this one, with everyone else suffering a worse browsing experience.
vanschelven · 13m ago
TBH most of the talk of "aggressive scraping" has been in the 100K pages/day range (which is ~1 page/s, i.e. negligible). In my mind, cloud providers' ridiculous egress rates are more to blame here.
forty · 20m ago
Can we have a proof-of-work algorithm that computes something actually useful? Like finding large prime numbers, or something else that already has distributed computation projects. That way all this wasted power is at least not completely lost.
keepamovin · 1h ago
Interesting - we dealt with this issue in CloudTabs, a SaaS for the BrowserBox remote browser. The way we handle it is simply to monitor resource usage with a Python script, issue a warning to the user when their tab or all processes are running hot, and then, when the rules are triggered, kill the offending processes (those that use too much CPU or RAM).
Chrome has the nice property that you can kill a render process for a tab and often it just takes that tab down, leaving everything else running fine. This, plus the warning, provides minimal user impact while ensuring resources for all.
In the past we experimented with cgroups (both versions) and other mechanisms for limiting, but found dynamic monitoring to be the most reliable.
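A minimal sketch of that kind of monitor, assuming psutil and made-up thresholds (not the actual CloudTabs script):

    import time
    import psutil

    CPU_LIMIT = 80.0          # percent, hypothetical threshold
    RSS_LIMIT = 1 << 30       # 1 GiB resident memory, hypothetical threshold
    STRIKES_BEFORE_KILL = 3   # warn a few times before killing

    strikes = {}

    def warn_user(pid):
        # stand-in for however the warning actually reaches the user's session
        print(f"warning: renderer {pid} is running hot")

    def check_renderers():
        for proc in psutil.process_iter(["name", "cpu_percent", "memory_info"]):
            # hypothetical match; real code would inspect cmdline for --type=renderer
            if proc.info["name"] != "chrome":
                continue
            cpu = proc.info["cpu_percent"] or 0.0
            mem = proc.info["memory_info"]
            rss = mem.rss if mem else 0
            if cpu > CPU_LIMIT or rss > RSS_LIMIT:
                strikes[proc.pid] = strikes.get(proc.pid, 0) + 1
                if strikes[proc.pid] >= STRIKES_BEFORE_KILL:
                    try:
                        proc.kill()   # usually takes down just that tab's renderer
                    except psutil.Error:
                        pass
                    strikes.pop(proc.pid, None)
                else:
                    warn_user(proc.pid)
            else:
                strikes.pop(proc.pid, None)

    if __name__ == "__main__":
        while True:
            check_renderers()
            time.sleep(5)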
h1fra · 43m ago
Most scrapers are not able to monitor each website's performance; what will happen is that the site will just be slower to respond, and that's it.
matt3210 · 40m ago
The issue isn’t the resource usage as much as the content they’re stealing for reproduction purposes.
bmacho · 27m ago
Another idea: if your content is ~5kb of text, then serve it to whoever asks for it. If you don't have the bandwidth, try to make it smaller and static, and put it on the edge, i.e. some other people's computers.
rob_c · 15m ago
Seen this already: back in the day there were sites which ran Bitcoin hashing on their user base, and there was uproar.
If someone dusted off the same tools and managed to get Altman to buy them a nice car from it, good on them :)
TZubiri · 1h ago
" and LLM scrapers may well have lots of CPU time available through means such as compromised machines."
It's not clear whether the author means LLM scrapers in the sense of scrapers that gather training data for foundation models, LLM scrapers that browse the web to provide up-to-date answers, or vibe coders and agents that use browsers at the behest of the programmer or the user.
But in none of those cases can I imagine compromised machines being relevant. If we are talking about compromised machines, it's irrelevant whether an LLM is involved and how; it's a distributed attack completely unrelated to LLMs.
Animats · 1h ago
"This page is taking too long to load" on sites with anti-scraper technology. Do Not Want.
jchw · 1h ago
In general, if you're coming from a "clean" IP, your difficulty won't be very high. In the future, if these systems coordinate with each other in some way (a DHT?), it should be possible to drop the baseline difficulty even further.
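A rough sketch of what per-IP difficulty scaling could look like, with hypothetical thresholds and a plain in-memory counter standing in for whatever shared reputation store (DHT or otherwise) such systems might use:

    import time
    from collections import defaultdict

    BASE_DIFFICULTY = 12    # leading zero bits asked of a "clean" client (hypothetical)
    MAX_DIFFICULTY = 22     # cap so real users behind busy shared IPs aren't locked out
    WINDOW_SECONDS = 600    # how far back to count requests

    recent_requests = defaultdict(list)

    def difficulty_for(ip: str) -> int:
        now = time.time()
        hits = [t for t in recent_requests[ip] if now - t < WINDOW_SECONDS]
        hits.append(now)
        recent_requests[ip] = hits
        # a quiet IP pays the cheap baseline; heavy hitters ramp up toward the cap
        extra = min(len(hits) // 50, MAX_DIFFICULTY - BASE_DIFFICULTY)
        return BASE_DIFFICULTY + extra

Each extra bit doubles the expected hashing work, so ramping from 12 to 22 bits makes a heavy scraper pay roughly a thousand times more per page than a first-time visitor.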
Unmixed0039 · 1h ago
Yeah, but is it the victim's fault that they defend themselves? Either they spend their money on scaling up while they get scraped, or they don't and I have to wait a little bit.
RamRodification · 1h ago
No, but it also doesn't matter whose fault it is. Having that happen will probably reduce your traffic. It might be worth the trade-off, and it might not. But if people go "Do Not Want!" it is worth paying attention to.
alpaca128 · 39m ago
The one mentioned in the article was made because LLM scrapers DDOSed a person’s server. If it’s unusable for everyone, making everyone wait a bit to solve the issue is a clear improvement.
> it also doesn’t matter whose fault it is
People will always blame someone and have the uncanny talent to blame the wrong party. With cookie popups they blame the EU instead of the websites literally selling them out, here they’ll probably turn around and blame the website operators for defending themselves. For corporations running not just the scrapers but also the social networks it’s easy to algorithmically control the narrative.
Disposal8433 · 39m ago
> will probably reduce your traffic
That's the point. Some web sites are DDOSed by scrapers and they obviously don't want such traffic.
OutOfHere · 1h ago
See also http://www.hashcash.org/ which is a famous proof-of-work algorithm. The bigger benefit of proof-of-work is not that it's anti-LLM; it is that it's statelessly anti-DoS.
I have been developing a public service, and I intend to use a simple implementation of proof-of-work in it, made to work with a single call without needing back-and-forth information from the server for each request.
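One way such a single-call, stateless scheme might look, hashcash-style: the client stamps each request with a nonce bound to the resource and a coarse timestamp, and the server verifies it without keeping any per-client state (a sketch with made-up parameters, not the commenter's actual design):

    import hashlib
    import time

    DIFFICULTY_BITS = 18  # hypothetical; tune to a fraction of a second of client CPU

    def leading_zero_bits(digest: bytes) -> int:
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
            else:
                bits += 8 - byte.bit_length()
                break
        return bits

    def mint(resource: str) -> str:
        """Client side: grind nonces until the stamp hashes under the target."""
        minute = int(time.time()) // 60
        nonce = 0
        while True:
            stamp = f"{resource}:{minute}:{nonce}"
            if leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= DIFFICULTY_BITS:
                return stamp
            nonce += 1

    def verify(stamp: str, resource: str) -> bool:
        """Server side: a purely stateless check, no session round-trip needed."""
        try:
            res, minute_s, _nonce = stamp.rsplit(":", 2)
            minute = int(minute_s)
        except ValueError:
            return False
        fresh = abs(int(time.time()) // 60 - minute) <= 1   # tolerate a little clock skew
        bound = res == resource
        costly = leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= DIFFICULTY_BITS
        return fresh and bound and costly

Minting is the expensive side; verifying is a single hash. The price of staying fully stateless is that a stamp can be replayed within its minute-long window, so a server that cares would keep a short-lived set of seen stamps.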
wordofx · 55m ago
lol LLMs don’t just randomly execute JavaScript on a web page when scraped.
Edit: lol no wonder hacker news is generally anti AI. They think ai will just randomly execute JavaScript it sees on a web page. It’s amazing how incredibly dumb half of HN is.
seszett · 5m ago
This is about the scraping tools that are used to train LLMs, not the scrapers used by live LLMs.
darkhelmet · 5m ago
You're missing the point. The approach is to block anything that doesn't execute the javascript. The javascript generates a key that allows your queries to work. No javascript => no key => no content for you to scrape.