I'd really like this, since it wouldn't impact my scraping stuff.
I like to scrape websites and make alternative, personalized frontends for them. Captchas are really painful for me. Proof of work would be painful for a massive scraping operation, but I wouldn't have an issue with spending some CPU time to get the latest posts from a site which doesn't have an RSS feed or an API.
DaSHacka · 1h ago
Surprised there hasn't been a fork of Anubis that changes the artificial PoW into a simple Monero mining PoW yet.
Would be hilarious to trap scraper bots in endless labyrinths of LLM-generated MediaWiki pages, getting them to mine hashes with each successive article.
At least then we would be making money off these rude bots.
xnorswap · 1h ago
The bots could check whether they've hit the jackpot, keep the valid hashes for themselves, and only return the worthless ones.
Then it's the bots who are making money from work they need to do for the captchas.
nssnsjsjsjs · 1h ago
1. The problem is the bot needs to understand the program it is running to do that. Akin to the halting problem.
2. There is no money in mining on the kind of hardware scrapers will run on. Power costs more than they'd earn.
forty · 1h ago
We need an oblivious cryptocurrency mining algorithm ^^
g-b-r · 39m ago
It would be a much bigger incentive to add these walls, with little care for the innocent users impacted.
Although admittedly millions of sites already ruined themselves with Cloudflare without that incentive.
hardwaresofton · 48m ago
Fantastic work by Xe here -- not the first of its kind, but this seems to have the most traction I've seen for a PoW anti-scraper project (with an MIT license to boot!).
PoW anti-scraper tools are a good first step, but why don't we just jump straight to the endgame? We're drawing closer to a point where information's value is actually fully realized -- people will stop sharing knowledge for free. It doesn't have to be that way, but in a world where people are pressed for economic means, knowledge becomes an obvious thing to convert to capital and extract rent on.
The simple way this happens is just a login wall -- for every website. It doesn't have to be a paid login wall of course (at first), but it's a super simple way to both legally and practically protect from scrapers.
I think high quality knowledge and source code (which is basically executable knowledge) being open in general is a miracle/luxury of functioning, economically balanced societies where people feel driven (for many possible reasons) to give back, or have time to think of more than surviving.
Don't get me wrong -- the doomer angle is almost always wrong -- every year humanity is almost always better off than we were the previous year on many important metrics, but it's getting harder to see a world where we cartwheel through another technological transformation that this time could possibly impact large percentages of the working population.
pjc50 · 17m ago
> people will stop sharing knowledge for free. It doesn't have to be that way
Yeah. People over-estimate the flashy threats from AI, but to me the more significant threat is killing the open exchange of knowledge and more generally the open, trusting society by flooding it with agents which are happy to press "defect" on the prisoner's dilemma.
> being open in general is a miracle/luxury of functioning, economically balanced societies where people feel driven (for many possible reasons) to give back, or have time to think of more than surviving
"High trust society". Something that took the West a very long time to construct through social practices, was hugely beneficial for economic growth, but is vulnerable to defectors. Think of it like a rainforest: a resource which can be burned down to increase quarterly profit.
g-b-r · 41m ago
Yes, let's turn the whole web into Facebook, what a bright future
Schiendelman · 38m ago
I think we only have two choices here:
1) every webpage requires Facebook login, and then Facebook offers free hosting for the content.
2) every webpage requires some other method of login, but not locked into a single system.
I read the GP comment as suggesting we push on the second option there rather than passively waiting for the first option.
g-b-r · 23m ago
I hope you see that you shatter anonymity and the open web with that, single system or not
ChocolateGod · 2h ago
I'm glad that, after spending all this time trying to increase power efficiency, people have come up with JavaScript that serves no purpose other than to increase power draw.
I feel sorry for people with budget phones who now have to battle with these PoW systems. I think LLM scrapers will win this one, with everyone else suffering a worse browsing experience.
jeroenhd · 59m ago
This is rather unfortunate, but the way Anubis works, you will only get the PoW test once.
Scrapers, on the other hand, keep throwing out their session cookies (because you could easily limit their access by using cookies if they didn't). They will need to run the PoW workload every page load.
If you're visiting loads of different websites, that does suck, but most people won't be affected all that much in practice.
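For reference, a minimal sketch of this kind of challenge/verify/cookie flow on the server side, in Python; the difficulty, field layout, and cookie format here are illustrative, not Anubis's actual ones:

    import hashlib, hmac, os

    SECRET = os.urandom(32)   # server-side key used to sign the session cookie
    DIFFICULTY = 16           # leading zero bits required in the hash

    def issue_challenge() -> str:
        # random challenge embedded in the interstitial page
        return os.urandom(16).hex()

    def verify(challenge: str, nonce: str) -> bool:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

    def session_cookie(challenge: str, nonce: str) -> str:
        # signed cookie so later requests skip the PoW; a scraper that
        # drops cookies has to redo the work on every page load
        mac = hmac.new(SECRET, f"{challenge}:{nonce}".encode(), hashlib.sha256)
        return f"{challenge}:{nonce}:{mac.hexdigest()}"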
There are alternatives, of course. Several attempts at standardising remote attestation have been made. Apple added remote attestation to Safari years ago. Basically, Apple/Google/Cloudflare give each user a limited set of "tokens" after verifying that they're a real person on real hardware (using TPMs and whatnot), and you exchange those tokens for website visits. Every user gets a load of usable tokens, but bots quickly run out and get denied access. For this approach to work, that means locking out Linux users, people with secure boot disabled, and things like outdated or rooted phones, but in return you don't get PoW walls or Cloudflare CAPTCHAs.
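Very roughly, and leaving out the blind signatures that real deployments (Privacy Pass, Private Access Tokens) use to keep visits unlinkable, the token side could be sketched like this; the batch size and key handling are assumptions:

    import os
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Issuer (Apple/Google/Cloudflare): after attesting the device,
    # hand the user a small batch of signed tokens.
    issuer_key = Ed25519PrivateKey.generate()
    batch = [(t, issuer_key.sign(t)) for t in (os.urandom(16) for _ in range(10))]

    # Website: accept a token if the issuer's signature checks out
    # and the token hasn't been spent already.
    issuer_pub = issuer_key.public_key()
    spent = set()

    def redeem(token: bytes, sig: bytes) -> bool:
        if token in spent:
            return False
        try:
            issuer_pub.verify(sig, token)
        except InvalidSignature:
            return False
        spent.add(token)
        return True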
In the end, LLM scrapers are why we can't have nice things. The web will only get worse now that these bots are on the loose.
shiomiru · 23m ago
> Scrapers, on the other hand, keep throwing out their session cookies (because you could easily limit their access by using cookies if they didn't). They will need to run the PoW workload every page load.
Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire. Technical details aside, if a site becomes a worthy target, a scraping operation running on billions of dollars will easily bypass any restrictions thrown at it, be that cookies, PoW, JS, wasm, etc. Being able to access multiple sites by bypassing a single method is just a bonus.
Ultimately, I don't believe this is an issue that can be solved by technical means; any such attempt will solely result in continuous UX degradation for humans in the long term. (Well, it is already happening.) But of course, expecting any sort of regulation on the manna of the 2020s is just as naive... if anything, this just fits the ideology that the WWW is obsolete, and that replacing it with synthetic garbage should be humanity's highest priority.
ChocolateGod · 32m ago
> Scrapers, on the other hand, keep throwing out their session cookies
This isn't very difficult to change.
> but the way Anubis works, you will only get the PoW test once.
Not if it's on multiple sites; I see the weeb girl picture (why?) so often that it's burned into my brain at this point.
persnickety · 49m ago
> An LLM scraper is operating in a hostile environment [...] because you can't particularly tell a JavaScript proof of work system from JavaScript that does other things. [..] for people who would like to exploit your scraper's CPU to do some cryptocurrency mining, or [...] want to waste as much of your CPU as possible).
That's a valid reason why JS-based PoW systems scare LLM operators: there's a chance the code might actually be malicious.
That's not a valid reason to serve JS-based PoW systems to human users: the entire reason those proofs work against LLMs is the threat that the code is malicious.
In other words, PoW works against LLM scrapers not because of PoW, but because they could contain malicious code. Why would you threaten your users with that?
And if you can apply the threat only to LLMs, then why don't you cut the PoW garbage and start with that instead?
I know, it's because it's not so easy. So instead of wielding malware as a sword of Damocles, why not standardize on some PoW algorithm that people can honestly apply without the risks?
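For example, if the challenge were standardized as "find a nonce such that SHA-256(challenge + nonce) has N leading zero bits" (an assumed format, not an existing spec), a client could solve it natively without executing any site-supplied JS:

    import hashlib
    from itertools import count

    def solve(challenge: str, difficulty_bits: int) -> int:
        # brute force a nonce; expected work is about 2**difficulty_bits hashes
        for nonce in count():
            digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0:
                return nonce

A scraper could of course run the same loop; the point is only that nobody would have to execute opaque code to get through.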
captainmuon · 28m ago
I don't know, sandbox escape from a browser is a big deal, a million-dollar-bounty kind of deal. I feel safe putting an automated browser in a container or a VM and letting it run with a timeout.
And if a site pulls something like that on me, then I just don't take their data. Joke is on them: soon, if something is not visible to AI it will not 'exist', just like being delisted from Google today.
pjc50 · 29m ago
I don't think this is "malicious" so much as it is "expensive" (in CPU cycles), which is already a problem for ad-heavy sites.
berkes · 28m ago
> Why would you threaten your users with that?
Your users - we, browsing the web - are already threatened with this. Adding a PoW changes nothing here.
My browser already has several layers of protection in place. My browser even allows me to improve this protection with addons (uBlock etc.), and my OS adds even more protection on top. This is enough to allow legitimate PoW but block malicious code.
dannyw · 2h ago
This is a poor take. All the major LLM scrapers already run and execute JavaScript, Googlebot has been doing it for probably a decade.
Simple limits on runtime stop crypto mining from being too big of a problem.
jeroenhd · 1h ago
And by making bots hit that limit, scrapers don't get access to the protected pages, so the system works.
Bots can either risk being turned into crypto miners, or risk not grabbing free data to train AIs on.
maeln · 1h ago
But isn't that the whole point of the article? Big scrapers can hardly tell whether the JS that eats their runtime is a crypto miner or an anti-scraping system, and so they will have to give up "useful" scraping, so PoW might just work.
rob_c · 1h ago
No, the point is there are really advanced PoW challenges out there to prove you're not a bot (those websites that take >3s to fingerprint you are doing this!).
The idea is to abuse the abusers: if you suspect it's a bot, change the PoW from a GPU/machine/die fingerprint computation to something like a few ticks of Monero or whatever the crypto of choice is this week.
Sounds useless, but multiply 0.5s of that across their farm of 1e4 scraping nodes and you're onto something.
The catch is not getting caught out by impacting the 0.1% of Tor-running, anti-ad "users" out there who will try to decompile your code when their personal Chrome build fails to work. I say "users" because they will be visiting a non-free site espousing their perceived right to be there, no different to a bot for someone paying the bills.
TZubiri · 2h ago
"Googlebot has been doing it for probably a decade."
This is why Google developed a browser: it turns out scraping the web pretty much requires building a V8 engine, so why not publish it as a browser?
motoxpro · 1h ago
This is so obvious when you say it, but what an awesome insight.
nssnsjsjsjs · 1h ago
Except it doesn't make sense. Why not just use Firefox, or improve Firefox's JS engine?
I reckon they made the browser to control the browser market.
baq · 55m ago
Their browser is their scraper. What you see is what the scraper sees is what the ads look like.
rkangel · 56m ago
It's not quite that simple. I think that having that skillset and knowledge in house already probably led to it being feasible, but that's not why they did it. They created Chrome because it was in their best interests for rich web applications to run well.
rob_c · 1h ago
You don't work anywhere near the ads industry then; people have been grumbling about this for the whole 10 years now.
mschuster91 · 1h ago
... and the fact that even with a browser, content gated behind Macromedia Flash or ActiveX applets was / is not indexable is why Google pushed so hard to expand HTML5 capabilities.
bob1029 · 1h ago
I think this is not a battle that can be won in this way.
Scraping content for an LLM is not a hyper time sensitive thing. You don't need to scrape every page every day. Sam Altman does not need a synchronous replica of the internet to achieve his goals.
CGamesPlay · 1h ago
That is one view of the problem, but the one people are fixing with proof of work systems is the (unintentional) DDoS that LLM scrapers are operating against these sites. Just reducing the amount of traffic to manageable levels lets me get back to the work of doing whatever my site is supposed to be doing. I personally don't care if Sam Altman has a copy of my git server's rendition of the blame of every commit in my open source repo, because he could have just cloned my git repo and gotten the same result.
bob1029 · 57m ago
I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?
If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.
heinrich5991 · 49m ago
> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?
Yes, there are sites being DDoSed by scrapers for LLMs.
> If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.
This isn't about one request per week or per month. There were reports from many sites that they're being hit by scrapers that request from many different IP addresses, one request each.
2000UltraDeluxe · 48m ago
25k+ hits/minute here. And that's just the scrapers that don't simply identify themselves as browsers.
Not sure why you believe massive repeated scraping isn't a problem. It's not like there is just one single actor out there, and ignoring robots.txt seems to be the norm nowadays.
fl0id · 58m ago
For the model it's not. But I think many of these bots are also from tool usage or "research" or whatever they call it these days. And for that it does matter.
forty · 1h ago
Can we have a proof-of-work algorithm that computes something actually useful? Like finding large prime numbers, or something else that already has distributed computation programs. That way all this wasted power is at least not completely lost.
g-b-r · 26m ago
Unfortunately, useful problems usually require much more computation to find a useful result and require distributing the search, so the server can't reliably verify that you performed the work (most searches will not find anything, and you can just pretend not to have found anything without doing any work).
If a service had enough concurrent clients to reliably hit useful results quickly, you could verify that most did the work by checking if a hit was found, and either let everyone in or block everyone according to that; but then you're relying on the large majority being honest for your service to work at all, and some dishonest clients would still slip through.
keepamovin · 2h ago
Interesting - we dealt with this issue in CloudTabs, a SaaS for the BrowserBox remote browser. The way we handle it is simply to monitor resource usage with a Python script, issue a warning to the user when their tab or all processes are running hot, and then, when the rules are triggered, kill the offending processes (those using too much CPU or RAM).
Chrome has the nice property that you can kill a render process for a tab and often it just takes that tab down, leaving everything else running fine. This plus warning provides minimal user impact while ensuring resources for all.
In the past we experimented with cgroups (both versions) and other mechanisms for limiting resources, but found dynamic monitoring to be the most reliable.
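A rough sketch of that kind of watchdog, assuming psutil and Chrome-style renderer processes; the thresholds and process matching are illustrative, not CloudTabs' actual script:

    import time
    import psutil

    CPU_LIMIT = 80.0        # percent, per renderer process
    RSS_LIMIT = 2 * 2**30   # 2 GiB resident memory

    def sweep() -> None:
        for proc in psutil.process_iter(["memory_info"]):
            try:
                if "--type=renderer" not in " ".join(proc.cmdline()):
                    continue
                cpu = proc.cpu_percent(interval=0.5)
                rss = proc.info["memory_info"].rss
                if cpu > CPU_LIMIT or rss > RSS_LIMIT:
                    proc.kill()   # Chrome usually just drops that one tab
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue

    while True:
        sweep()
        time.sleep(5)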
captainmuon · 33m ago
As somebody who does some scraping / crawling for legitimate uses, I'm really unhappy with this development. I understand people have valid cases why they don't want their content scraped. Maybe they want to sell it - I can understand that, although I don't like it. Maybe they are opposed to it for fundamental reasons. I for one would like my content to be spread maximally. I want my arguments to be incorporated into AIs, so I can reach more people. But of course that is just me when I'd write certain content, others have different goals.
It gets annoying when you have the right to scrape something - either because the owner of the data gave you the OK or because it is openly licensed. But then the webmaster can't be bothered to relax the rate limiter for you, and nobody can give you a nice API. Now people are putting their Open Educational Resources, their open source software, even their freaking essays about openness that they want the world to read behind Anubis. It makes me shake my head.
I understand perfectly it is annoying when badly written bots hammer your site. But maybe then HTTP and those bots are the problem. Maybe we should make it easier for site owners to push their content somewhere where we can scrape it easier?
berkes · 22m ago
Sounds like something IPFS could be a nice solution for.
vanschelven · 1h ago
TBH most of the talk of "aggressive scraping" has been in the 100K pages/day range (which is ~1 page/s, i.e. "negligible"). In my mind, cloud providers' ridiculous egress rates are more to blame here.
jeroenhd · 54m ago
I've caught Huawei and Tencent IPs scraping the same image over and over again, with different query parameters. Sure, the image was only 260KiB and I don't use Amazon or GCP or Azure so it didn't cost me anything, but it still spammed my logs and caused a constant drain on my servers' resources.
The bots keep coming back too, ignoring HTTP status codes, permanent redirects, and whatever else I can think of to tell them to fuck off. Robots.txt obviously doesn't help. Filtering traffic from data centers didn't help either, because soon after I did that, residential IPs started doing the same thing. I don't know if this is a Chinese ISP abusing their IP ranges or if China just has a massive botnet problem, but either way the traditional ways to get rid of these bots haven't helped.
In the end, I'm now blocking all of China and Singapore. That stops the endless flow of bullshit requests for now, though I see some familiar user agents appearing in other east Asian countries as well.
h1fra · 1h ago
Most scrapers are not able to monitor each website's performance; what will happen is that sites will just respond more slowly, and that's it.
nssnsjsjsjs · 1h ago
My violin is microscopic for this problem. It's actually given me ideas!
matt3210 · 1h ago
The issue isn’t the resource usage as much as the content they’re stealing for reproduction purposes.
berkes · 15m ago
That's not true for the vast amount of Creative Commons, open-source, and other permissively licensed content.
(Aside: those licenses and that model of distribution were advocated by much of the same demographic (the information-wants-to-be-free folks, JSTOR protestors, GPL zealots) that now opposes LLMs using that content.)
rob_c · 1h ago
Seen this already: back in the day there were sites which ran Bitcoin hashing on their user base, and there was uproar.
If someone dusted off the same tools and managed to get Altman to buy them a nice car from it, good on them :)
TZubiri · 2h ago
" and LLM scrapers may well have lots of CPU time available through means such as compromised machines."
It's not clear whether the author means LLM scrapers in the sense of scrapers that gather training data for foundation models, scrapers that browse the web live to provide up-to-date answers, or vibe coders and agents that use browsers at the behest of the programmer or the user.
But in none of those cases can I imagine compromised machines being relevant. If we are talking about compromised machines, it's irrelevant whether an LLM is involved and how; it's a distributed attack completely unrelated to LLMs.
Animats · 2h ago
"This page is taking too long to load" on sites with anti-scraper technology. Do Not Want.
jchw · 2h ago
In general if you're coming from a "clean" IP your difficulty won't be very high. In the future if these systems coordinate with each other in some way (DHT?) then it should make it possible to drop the baseline difficulty even further.
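A sketch of what reputation-scaled difficulty could look like; the scores and thresholds here are made up for illustration, not how any existing system does it:

    BASE_BITS = 12   # cheap for a first-time visitor on a clean IP
    MAX_BITS = 22    # expensive for an abusive subnet

    def difficulty_for(requests_last_hour: int, on_shared_blocklist: bool) -> int:
        bits = BASE_BITS
        if requests_last_hour > 100:
            bits += 4
        if on_shared_blocklist:    # e.g. reputation gossiped over a DHT
            bits += 6
        return min(bits, MAX_BITS)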
berkes · 12m ago
That's a perfect tool for monopolists to widen their moat even more.
In line with how email is technically still federated and distributed, but practically oligopolized by a handful of big-tech companies, through "fighting spam".
Unmixed0039 · 2h ago
Yeah, but is it the victims' fault that they defend themselves? Either they spend their money on scaling up while they get scraped, or they don't and I have to wait a little bit.
RamRodification · 2h ago
No, but it also doesn't matter whose fault it is. Having that happen will probably reduce your traffic. It might be worth the trade-off, and it might not. But if people go "Do Not Want!" it is worth paying attention to.
alpaca128 · 1h ago
The one mentioned in the article was made because LLM scrapers DDOSed a person’s server. If it’s unusable for everyone, making everyone wait a bit to solve the issue is a clear improvement.
> it also doesn’t matter whose fault it is
People will always blame someone and have the uncanny talent to blame the wrong party. With cookie popups they blame the EU instead of the websites literally selling them out, here they’ll probably turn around and blame the website operators for defending themselves. For corporations running not just the scrapers but also the social networks it’s easy to algorithmically control the narrative.
ReptileMan · 2m ago
>With cookie popups they blame the EU instead of the websites literally selling them out
The EU created the flawed legislation, the EU carved out the loopholes, the EU's implementation and enforcement sucked, and sites took the most obvious and predictable road. So who is to blame for the worse experience?
RamRodification · 33m ago
> If it’s unusable for everyone, making everyone wait a bit to solve the issue is a clear improvement.
I agree.
> People will always blame someone
I agree. That's my point.
Disposal8433 · 1h ago
> will probably reduce your traffic
That's the point. Some web sites are DDOSed by scrapers and they obviously don't want such traffic.
RamRodification · 35m ago
I meant 'real' traffic. Maybe that wasn't obvious.
OutOfHere · 2h ago
See also http://www.hashcash.org/, which is a famous proof-of-work algorithm. The bigger benefit of proof-of-work is not that it's anti-LLM; it is that it's statelessly anti-DoS.
I have been developing a public service, and I intend to use a simple implementation of proof-of-work in it, made to work with a single call without needing back-and-forth information from the server for each request.
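The implementation isn't shown here, but a hashcash-flavoured sketch of such a single-call, stateless check might look like this (the stamp layout and difficulty are assumptions):

    import hashlib
    import time

    BITS = 18       # difficulty: roughly 2**18 hashes of work per request
    MAX_AGE = 300   # seconds a stamp stays fresh

    def make_stamp(resource: str, bits: int = BITS) -> str:
        # client side: pick a timestamp and brute-force a counter
        ts, counter = int(time.time()), 0
        while True:
            stamp = f"{resource}:{ts}:{counter}"
            digest = hashlib.sha256(stamp.encode()).digest()
            if int.from_bytes(digest, "big") >> (256 - bits) == 0:
                return stamp
            counter += 1

    def accept(stamp: str, resource: str, bits: int = BITS) -> bool:
        # server side: no stored challenge, just check freshness and the work
        try:
            res, ts, _counter = stamp.rsplit(":", 2)
            age = time.time() - int(ts)
        except ValueError:
            return False
        if res != resource or not 0 <= age <= MAX_AGE:
            return False
        digest = hashlib.sha256(stamp.encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - bits) == 0

A real deployment would also keep a short-lived set of seen stamps to stop replays inside the window, which is the one bit of state this sketch leaves out.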
wordofx · 1h ago
lol LLMs don’t just randomly execute JavaScript on a web page when scraped.
Edit: lol no wonder hacker news is generally anti AI. They think ai will just randomly execute JavaScript it sees on a web page. It’s amazing how incredibly dumb half of HN is.
pjc50 · 33m ago
Even before anti-scraping, lots of SPA sites won't give you any content if you don't run Javascript.
seszett · 1h ago
This is about the scraping tools that are used to train LLMs, not the scrapers used by live LLMs.
darkhelmet · 1h ago
You're missing the point. The approach is block anything that doesn't execute the javascript. The javascript generates a key that allows your queries to work. No javascript => no key => no content for you to scrape.
It's a nuclear option. Nobody wins. The site operator can adjust the difficulty as needed to keep the abusive scrapers at bay.
bmacho · 1h ago
Another idea: if your content is ~5kb of text, then serve it to whoever asks for it. If you don't have the bandwidth, try to make it smaller and static, and put it on the edge, i.e. other people's computers.
foul · 47m ago
The fun and terrible thing about the web is that the "rockstar in temporary distress" trope can be true and can happen when you least expect it, like, you know, when you receive an HN kiss of death.
You can surely expect that static content will stay static and won't run jpegoptim on an image on every hit (a dynamic CMS + a sudden visit from ByteDance = your server is DDoSed), but you can't expect any idiot/idiot hive on this planet to set up a multi-country edge-caching architecture for a small website just in case some blog post gets a few million visits every ten minutes. That can easily take down a server, even for static content.
I concur that Anubis is a poor solution, and yet here we are: even the UN is using it to fend off requests.