A thought on JavaScript "proof of work" anti-scraper systems

153 points by zdw | 184 comments | 5/26/2025, 5:01:25 AM | utcc.utoronto.ca

Comments (184)

myself248 · 9h ago
If the proof-of-work system is actually a crypto miner, such that visitors end up paying the site for the content they host, have we finally converged on a working implementation of the micropayments-for-websites concepts of decades ago?
diggan · 8h ago
> If the proof-of-work system is actually a crypto miner, such that visitors end up paying the site for the content they host

Unsure how that would work. If the proof you generate could be used for blockchain operations, so that the website operator could be paid by using that proof as generated by the website visitor, why shouldn't the visitor keep that proof to themselves and use it instead? Then they'd get the full amount, and the website operator gets nothing. So then there is no point for it, and the visitor might as well just run a miner locally :)

lurkshark · 8h ago
This system actually existed for a while: it was called Coinhive. Each visitor would be treated like a node in a mining pool, with “credit” for the resources going to the site owner. Somewhat predictably, it became primarily used by hackers who would inject the code into high-profile sites or distribute it through advertising networks.

https://krebsonsecurity.com/2018/03/who-and-what-is-coinhive...

viraptor · 8h ago
Have a look at how mining pools are implemented. The client only gets to change some part of the block and does the hashing from there. You can't go back from that to change the original data - you wouldn't get paid. Otherwise you could easily scam the mining pool and always keep the winning numbers to yourself while getting paid for the partials too.
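
(To make that concrete, here is a toy sketch in Python - not a real Bitcoin/Monero block format, and the field names and difficulty are invented - of why a pool client can't redirect the reward: the payout address is part of the data being hashed, so a winning nonce found for the pool's address is worthless for any other address.)

```python
import hashlib

DIFFICULTY_BITS = 16  # toy difficulty so the search finishes quickly

def block_hash(prev_hash: str, payout_address: str, nonce: int) -> int:
    # The payout address is part of the hashed data, much like the coinbase
    # output in the block template a pool hands out to its clients.
    data = f"{prev_hash}|{payout_address}|{nonce}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def mine(prev_hash: str, payout_address: str) -> int:
    # What the pool client does: search nonces for the header the pool supplied.
    target = 1 << (256 - DIFFICULTY_BITS)
    nonce = 0
    while block_hash(prev_hash, payout_address, nonce) >= target:
        nonce += 1
    return nonce

prev_hash = "00000000000000000000deadbeef"  # made-up chain tip
pool_addr = "pool-payout-address"           # fixed by the pool, not by the client
target = 1 << (256 - DIFFICULTY_BITS)

nonce = mine(prev_hash, pool_addr)
assert block_hash(prev_hash, pool_addr, nonce) < target          # valid share for the pool
print(block_hash(prev_hash, "my-own-address", nonce) < target)   # almost certainly False
```

Swapping in your own address changes the preimage, so the nonce you found no longer meets the target; you would have to redo the whole search.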
odo1242 · 8h ago
The company Coinhive used to do this before they shut down. Basically, in order to enter a website, you had to provide it with a certain number of Monero hashes (usually around 1,024), which the website would send to Coinhive's mining pool before letting the user through.

It kinda worked, except for the fact that hackers would try to “cryptojack” random websites by hacking them and inserting Coinhive’s miner into their pages. This caused everyone to block Coinhive’s servers. (Also you wouldn’t get very much money out of it - even the cryptojackers who managed to get tens of millions of page views out of hacked websites reported they only made ~$40 from the operation)

kbenson · 7h ago
If attackers only made ~$40 for a good amount of work, it seems like the scheme would have resolved itself if it had been left to run to its conclusion, rather than people blocking Coinhive in (what sounds like, from your description) a knee-jerk reaction.

Then again, I'm sure there's quite a bit of tweaking that could be done to make clients submit far more hashes, but that would make it much more noticeable.

hoppp · 1h ago
That $40 could now be in the thousands if they didn't spend it; XMR was much cheaper back then.
Retr0id · 8h ago
If the user mined it themselves and then paid the site owner before accessing the site, they'd have to pay a transaction fee and wait for a high-latency transaction to commit. The transaction fee could dwarf the actual payment value.

Mining on behalf of the site owner negates the need for a transaction entirely.

viraptor · 8h ago
(unnecessary)
Retr0id · 8h ago
I know this.
viraptor · 8h ago
Responded to wrong comment, sorry
SilasX · 7h ago
From my understanding: to pose the problem for miners, you hash the block you're planning to submit (which includes all the proposed transactions). Miners only get the hash. To claim the reward, you need the preimage (i.e. block to be submitted), which miners don't have.

In theory, you could watch the transactions being broadcast, and guess (+confirm) the corresponding block, but that would require you to see all the transactions the pool owner did, and put them in the same order (the possibilities of which scale exponentially with the number of transactions). There may be some other randomness you can insert into a block too -- someone else might know this.

Edit: oops, I forgot: the block also contains the address that the fees should be sent to. So even if you "stole" your solution and broadcast it with the block, the fee is still going to the pool owner. That's a bigger deal.

moralestapia · 4h ago
Yeah, but then they wouldn't get your content? Duh.
bdcravens · 6h ago
There were some Javascript-based embedded miners in the early days of Bitcoin

https://web.archive.org/web/20110603143708/http://www.bitcoi...

odo1242 · 8h ago
Not really, because it takes a LOT of hashes to actually get any crypto out of the system. Yes, you’re technically taking the user’s power and getting paid crypto, but unless you’re delaying the user for a long time, you’re only really being paid about a ten thousandth of a cent for a page visit.

Also virus scanners and corporate networks would hate you, because hackers are probably trying to embed whatever library you’re using into other unsuspecting sites.

jfengel · 1h ago
What does one actually get per page impression from Google Ads? I gather that it's more than a ten thousandth of a cent, but perhaps not all that much more.
msgodel · 6h ago
It would be nice if this could get standardized HTTP headers so bots could still use sites but effectively pay for use. That seems like the best of all possible worlds to me; the whole point of HTML is that robots can read it, otherwise we'd just be emailing each other PDFs.
DrillShopper · 2h ago
They should have to set the evil bit
overfeed · 6h ago
> bots could still use sites but they effectively pay for use. That seems like the best of all possible worlds to me

This would make the entire internet a maze of AI-slop content primarily made for other bots to consume. Humans may have to resort to emailing handwritten PDFs to avoid the thoroughly enshittified web.

DocTomoe · 8h ago
There was that concept used by a German image board around 2018 - which quickly got decried as 'sneaky, malware-like, potentially criminal' by Krebs (of KrebsOnSecurity). Of course, the article by KrebsOnSecurity was hyperbole and painted a good idea for site revenue as evil[1]. It also decided to doxx the administrator of said image board.

This caused major stress for the board's founders, a change in leadership on the imageboard due to burnout, "Krebs ist Scheiße" (Krebs / cancer is sh*t) becoming a meme-like statement in German internet culture, and annual fundraisers for anti-cancer organizations in an attempt to 'Fight Krebs', which regularly land in the 100-250k range.

Lessons learned: good ideas for paying for your content need to pass the outrage-culture test. And Krebs is ... not an honest news source.

[1] https://krebsonsecurity.com/2018/03/who-and-what-is-coinhive...

rcxdude · 49m ago
It's not actually a good idea, though. It's basically just banditry: the cost to the users is much more than the value to the beneficiary, and there's not much they can do about it. (To be fair, the super-invasive tracking ad systems that now exist have the same problem, but it's not obvious that they're worse.)
shkkmo · 1h ago
The doxxing is questionable, but much less questionable than your presentation of events.

Coinhive earned 35% of everything mined on any site, not just the image board. They had no means of stopping malicious installations from stealing from users. This gave hackers a financial incentive to compromise as many sites as possible, and Coinhive's incentives were aligned with this. The choice of Monero as the base blockchain made it pretty clear what the intentions were.

> Lessons learned: Good ideas in paying for your content needs to pass the outrage culture test. And Krebs is ... not a honest news source.

Don't create tools clearly intended to facilitate criminal activity, make money off of it, and expect everyone to be OK with it.

kmeisthax · 6h ago
The problem with micropayments was fourfold:

1. Banner ads made more money. This stopped being true a while ago; it's why newspapers all have annoying soft paywalls now.

2. People didn't have payment rails set up for e-commerce back then. Largely fixed now, at least for adults in the US.

3. Transactions have fixed processing costs that make anything <$1 too cheap to transact. Fixed with batching (e.g. buy $5 of credit and spend it over time).

4. Having to approve each micropurchase imposes a fixed mental transaction cost that outweighs the actual cost of the individual item. Difficult to solve ethically.

With the exception of, arguably[0], Patreon, all of these hurdles proved fatal to microtransactions as a means to sell web content. Games are an exception, but they solved the problem of mental transaction costs by drowning it in intensely unethical dark patterns protected by shittons of DRM[1]. You basically have to make someone press the spend button without thinking.

The way these proof-of-work systems are currently implemented, you're effectively taking away the buy button and just charging someone the moment they hit the page. This is ethically dubious, at least as ethically dubious as 'data caps[2]' in terms of how much affordance you give the user to manage their spending: none.

Furthermore, if we use a proof-of-work system that's shared with an actual cryptocurrency, so as to actually get payment from these hashes, then we have a new problem: ASICs. Cryptocurrencies have to be secured by a globally agreed-upon hash function, and changing that global consensus to a new hash function is very difficult. And those hashes have economic value. So it makes lots of sense to go build custom hardware just to crack hashes faster and claim more of the inflation schedule and on-chain fees.

If ASICs exist for a given hash function, then proof-of-work fails at both:

- Being an antispam system, since spammers will have better hardware than legitimate users[3]

- Being a billing system, since legitimate users won't be able to mine enough crypto to pay any economically viable amount of money

If you don't insist on using proof-of-work as billing, and only as antispam, then you can invent whatever tortured mess of a hash function is incompatible with commonly available mining ASICs. And since they don't have to be globally agreed-upon, everyone can use a different, incompatible hash function.

"Don't roll your own crypto" is usually good security advice, but in this case, we're not doing security, we're doing DRM. The same fundamental constants of computing that make stopping you from copying a movie off Netflix a fool's errand also make stopping scrapers theoretically impossible. The only reason why DRM works is because of the gap between theory and practice: technically unsophisticated actors can be stopped by theoretically dubious usages of cryptography. And boy howdy are LLM scrapers unsophisticated. But using the tried-and-true solutions means they don't have to be: they can just grab off-the-shelf solutions for cracking hashes and break whatever you use.

[0] At least until Apple cracked Patreon's kneecaps and made them drop support for any billing mode Apple's shitty commerce system couldn't handle.

[1] At the very least, you can't sell microtransaction items in games without criminalizing cheat devices that had previously been perfectly legal for offline use. Half the shit you sell in a cash shop is just what used to be a GameShark code.

[2] To be clear, the units in which Internet connections are sold should be kbps, not GB/mo. Every connection already has a bandwidth limit, so what ISPs are doing when they sell you a plan with a data cap is a bait and switch. Two caps means the lower cap is actually a link utilization cap, hidden behind a math problem.

[3] A similar problem has arisen in e-mail, where spammy domains have perfect DKIM/SPF, while good senders tend to not care about e-mail bureaucracy and thus look worse to antispam systems.

jaredwiener · 6h ago
Point 4 is often overlooked and I think the biggest issue.

Once there is ANY value exchanged, the user immediately wonders if it is worth it -- and if the payment/token/whatever is sent prior to the pageload, they have no way of knowing.

wahern · 2h ago
> Once there is ANY value exchanged

There's always value exchanged--"If you're not paying for the product, you are the product".[1] For ads we've established the fiction that everybody knowingly understands and accepts this quid pro quo. For proof of work we'd settle on a similar fiction, though perhaps browsers could add a little graphic showing CPU consumption.

[1] This is true even for personal blogs, albeit the monetary element is much more remote.

bee_rider · 5h ago
This is most true of books and other types of media (well, you can flip through a book at the store, but it isn’t a perfect thing…).

I dunno. Brands and other quality signals (imperfect as they tend to be, they still aren’t completely useless) could develop.

kpw94 · 4h ago
Books have a back cover for that reason: so you can read it before buying.

Long-form articles could have a back-cover summary too, or an enticing intro... and some Substack paid articles do that already: they let you read an intro and cut you off before getting into the interesting details.

But for short newspaper articles it becomes harder to do, depending on the topic. If the summary has to give away 90% of the information to not be too vague, you may then feel robbed paying for it once you realize the remaining 10% wasn't that useful.

jaredwiener · 3h ago
Not to mention, the reporting that went into the headline or blurb is what is expensive. You got the value by reading it for free.

https://blog.forth.news/a-business-model-for-21st-century-ne...

benregenspan · 6h ago
At a media company, our web performance monitoring tool started flagging long-running clientside XHR requests, which I couldn't reproduce in a real browser. It turned out that an analytics vendor was injecting a script which checked if it looked like the client was a bot. If so, they would then essentially use the client as a worker to perform their own third-party API requests (for data like social share counts). So there's definitely some prior art for this kind of thing.
apitman · 3h ago
This is really interesting. One naive thought that immediately came to mind is that bots might be capable of making cross-site requests. The logical conclusion of this entire arms race is that bots will eventually have no choice but to run actual browsers. Not sure that fact will appreciably reduce their scraping abilities though.
benregenspan · 1h ago
> The logical conclusion of this entire arms race is that bots will eventually have no choice but to run actual browsers

I think this is almost already the case now. Services like Cloudflare do a pretty good job of classifying primitive bots and if site operators want to block all (or at least vast majority), they can. The only reliable way through is a real browser. (Which does put a floor on resource needs for scraping)

dragonwriter · 1h ago
> The logical conclusion of this entire arms race is that bots will eventually have no choice but to run actual browsers.

I thought bots using (headless) browsers was an existing workaround for a number of existing issues with simpler bots, so this doesn't seem to be a big change.

ChocolateGod · 15h ago
I'm glad that, after spending all this time trying to increase power efficiency, people have come up with JavaScript that serves no purpose other than to increase power draw.

I feel sorry for people with budget phones who now have to battle with these PoW systems. I think LLM scrapers will win this one, with everyone else suffering a worse browsing experience.

jeroenhd · 14h ago
This is rather unfortunate, but the way Anubis works, you will only get the PoW test once.

Scrapers, on the other hand, keep throwing out their session cookies (because you could easily limit their access by using cookies if they didn't). They will need to run the PoW workload every page load.

If you're visiting loads of different websites, that does suck, but most people won't be affected all that much in practice.
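
(For reference, the challenge/response these tools pose boils down to something like the following sketch - not Anubis's actual code, and the difficulty value is made up: the server hands out a random challenge, the client burns CPU finding a nonce, and the server verifies the result with a single hash before issuing the session cookie.)

```python
import hashlib
import itertools
import os

DIFFICULTY = 4  # leading zero hex digits required; an illustrative value

def issue_challenge() -> str:
    # Server side: a random challenge embedded in the interstitial page.
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    # Client side: brute-force a nonce; this is the "work" the visitor pays.
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * DIFFICULTY):
            return nonce

def verify(challenge: str, nonce: int) -> bool:
    # Server side: a single hash to check before setting the session cookie.
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

challenge = issue_challenge()
nonce = solve(challenge)      # ~16**4 = 65,536 hashes on average
assert verify(challenge, nonce)
```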

There are alternatives, of course. Several attempts at standardising remote attestation have been made; Apple added remote attestation to Safari years ago. Basically, Apple/Google/Cloudflare give each user a limited set of "tokens" after verifying that they're a real person on real hardware (using TPMs and whatnot), and you exchange those tokens for website visits. Every user gets a load of usable tokens, but bots quickly run out and get denied access. For this approach to work, that means locking out Linux users, people with secure boot disabled, and things like outdated or rooted phones, but in return you don't get PoW walls or Cloudflare CAPTCHAs.

In the end, LLM scrapers are why we can't have nice things. The web will only get worse now that these bots are on the loose.

shiomiru · 13h ago
> Scrapers, on the other hand, keep throwing out their session cookies (because you could easily limit their access by using cookies if they didn't). They will need to run the PoW workload every page load.

Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire. Technical details aside, if a site becomes a worthy target, a scraping operation running on billions of dollars will easily bypass any restrictions thrown at it, be that cookies, PoW, JS, wasm, etc. Being able to access multiple sites by bypassing a single method is just a bonus.

Ultimately, I don't believe this is an issue that can be solved by technical means; any such attempt will solely result in continuous UX degradation for humans in the long term. (Well, it is already happening.) But of course, expecting any sort of regulation on the manna of the 2020s is just as naive... if anything, this just fits the ideology that the WWW is obsolete, and that replacing it with synthetic garbage should be humanity's highest priority.

ndiddy · 5h ago
> Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire. Technical details aside, if a site becomes a worthy target, a scraping operation running on billions of dollars will easily bypass any restrictions thrown at it, be that cookies, PoW, JS, wasm, etc. Being able to access multiple sites by bypassing a single method is just a bonus.

The reason why Anubis was created was that the author's public Gitea instance was using a ton of compute because poorly written LLM scraper bots were scraping its web interface, making the server generate a ton of diffs, blames, etc. If the AI companies work around proof-of-work blocks by not constantly scraping the same pages over and over, or by detecting that a given site is a Git host and cloning the repo instead of scraping the web interface, I think that means proof-of-work has won. It provides an incentive for the AI companies to scrape more efficiently by raising their cost to load a given page.

cesarb · 3h ago
> Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire.

AFAIK, Anubis does not work alone, it works together with traditional per-IP-address rate limiting; its cookies are bound to the requesting IP address. If the scraper uses a new IP address for each request, it cannot reuse the cookies; if it uses the same IP address to be able to reuse the cookies, it will be restricted by the rate limiting.
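
(A sketch of how that binding can work in general - not necessarily Anubis's implementation; the secret and field layout are invented: the server signs the client IP and an expiry into the cookie, so replaying it from another address fails verification.)

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # hypothetical key, never sent to the client

def make_token(client_ip: str, ttl: int = 7 * 24 * 3600) -> str:
    # Issued after a successful PoW solve: binds the pass to an IP and an expiry.
    expires = int(time.time()) + ttl
    payload = f"{client_ip}|{expires}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{sig}"

def check_token(token: str, client_ip: str) -> bool:
    # On each request: recompute the MAC and reject mismatched IPs or expired tokens.
    try:
        ip, expires, sig = token.rsplit("|", 2)
    except ValueError:
        return False
    payload = f"{ip}|{expires}"
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return (hmac.compare_digest(sig, expected)
            and ip == client_ip
            and int(expires) > time.time())

token = make_token("203.0.113.7")
print(check_token(token, "203.0.113.7"))   # True
print(check_token(token, "198.51.100.9"))  # False: cookie replayed from another IP
```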

kokanee · 7h ago
Attestation is a compelling technical idea, but a terrible economic idea. It essentially creates an Internet that is only viewable via Google and Apple consumer products. Scamming and scraping would become more expensive, but wouldn't stop.

It pains me to say this, but I think that differentiating humans from bots on the web is a lost cause. Proof of work is just another way to burn more coal on every web request, and the LLM oligarchs will happily burn more coal if it reduces competition from upstart LLMs.

Sam Altman's goal is to turn the Internet into an unmitigated LLM training network, and to get humans to stop using traditional browsing altogether, interacting solely via the LLM device Jony Ive is making for him.

Based on the current trajectory, I think he might get his way, if only because the web is so enshittified that we eventually won't have another way to reach mainstream media other than via LLMs.

jerf · 4h ago
"It pains me to say this, but I think that differentiating humans from bots on the web is a lost cause."

Ah, but this isn't doing that. All this is doing is raising friction. Taking web pages from 0.00000001 cents to load to 0.001 at scale is a huge shift for people who just want to slurp up the world, yet for most human users, the cost is lost in the noise.
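
(A back-of-envelope version of that argument, with illustrative numbers rather than the ones above: a second of PoW per page is noise for a human reading a few dozen pages a day, but real money for an operation fetching a hundred million.)

```python
# Illustrative numbers only: PoW cost per page and cloud CPU pricing are guesses.
pow_seconds_per_page = 1.0   # CPU-seconds burned per challenge
cpu_cost_per_hour = 0.04     # rough $/vCPU-hour for commodity cloud compute

cost_per_page = pow_seconds_per_page / 3600 * cpu_cost_per_hour

human_pages_per_day = 50
scraper_pages_per_day = 100_000_000

print(f"human:   ${human_pages_per_day * cost_per_page:.6f} per day")     # fractions of a cent
print(f"scraper: ${scraper_pages_per_day * cost_per_page:,.2f} per day")  # ~$1,100 per day
```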

All this really does is bring the costs into some sort of alignment. Right now it is too cheap to access web pages that may be expensive to generate. Maybe the page has a lot of nontrivial calculations to run. Maybe the server is just overwhelmed by the sheer size of the scraping swarm and the resulting asymmetry of a huge corporation on one side and a $5/month server on the other. A proof-of-work system doesn't change the server's costs much but now if you want to scrape the entire site you're going to have to pay. You may not have to pay the site owner, but you will have to pay.

If you want to prevent bots from accessing a page that it really wants to access, that's another problem. But, that really is a different problem. The problem this solves is people using small amounts of resources to wholesale scrape entire sites that take a lot of resources to provide, and if implemented at scale, would pretty much solve that problem.

It's not a perfect solution, but no such thing is on the table anyhow. "Raising friction" doesn't mean that bots can't get past it. But it will mean they're going to have to be much more selective about what they do. Even the biggest server farms need to think twice about suddenly dedicating hundreds of times more resources to just doing proof-of-work.

It's an interesting economic problem... the web's relationship to search engines has been fraying slowly but surely for decades now. Widespread deployment of this sort of technology is potentially a doom scenario for them, as well as AI. Is AI the harbinger of the scrapers extracting so much from the web that the web finally finds it economically efficient to strike back and try to normalize the relationship?

ChocolateGod · 2h ago
> Taking web pages from 0.00000001 cents to load to 0.001 at scale is a huge shift for people who just want to slurp up the world, yet for most human users, the cost is lost in the noise.

If you're going to needlessly waste my CPU cycles, please at least do some mining and donate it to charity.

xena · 2h ago
Anubis author here. Tell me what I'm missing to implement protein folding without having to download gigabytes of scientific data to random people's browsers and I'll implement it today.
dijksterhuis · 1h ago
Perhaps something along the lines of folding@home? https://foldingathome.org https://github.com/FoldingAtHome/fah-client-bastet

seems like it would be possible to split the compute up.

FAQ: https://foldingathome.org/faq/running-foldinghome/

What if I turn off my computer? Does the client save its work (i.e. checkpoint)?

> Periodically, the core writes data to your hard disk so that if you stop the client, it can resume processing that WU from some point other than the very beginning. With the Tinker core, this happens at the end of every frame. With the Gromacs core, these checkpoints can happen almost anywhere and they are not tied to the data recorded in the results. Initially, this was set to every 1% of a WU (like 100 frames in Tinker) and then a timed checkpoint was added every 15 minutes, so that on a slow machine, you never lose more that 15 minutes work.

> Starting in the 4.x version of the client, you can set the 15 minute default to another value (3-30 minutes).

caveat: I have no idea how much data "1 frame" is.

ChocolateGod · 5h ago
People are using LLMs because search results (due to SEO overload, Google's bad algorithm, etc.) are terrible. Anubis makes these already-bad search results even worse by trying to block indexing, meaning people will want to use LLMs even more.

So the existence of Anubis will mean even more incentive for scraping.

ChocolateGod · 13h ago
> Scrapers, on the other hand, keep throwing out their session cookies

This isn't very difficult to change.

> but the way Anubis works, you will only get the PoW test once.

Not if it's on multiple sites. I see the weeb girl picture (why?) so much it's embedded into my brain at this point.

viraptor · 8h ago
> (why?)

So you can pay the developers for the professional version where you can easily change the image. It's a great way of funding the work.

alpaca128 · 8h ago
> I see the weab girl picture (why?)

As far as I know the creator of Anubis didn't anticipate such a widespread use and the anime girl image is the default. Some sites have personalized it, like sourcehut.

account42 · 8h ago
> This is rather unfortunate, but the way Anubis works, you will only get the PoW test once.

Actually I will get it zero times because I refuse to enable javashit for sites that shouldn't need it and move on to something run by someone competent.

RHSeeger · 7h ago
> sites that shouldn't need it

There's lots of ways to define "shouldn't" in this case

- Shouldn't need it, but include it to track you

- Shouldn't need it, but include it to enhance the page

- Shouldn't need it, but include it to keep their costs down (for example, by loading parts of the page dynamically / per person and caching the rest of the page)

- Shouldn't need it, but include it because it help stop the bots that are costing them more than the site could reasonably expected to make

I get it, JS can be used in a bad way, and you don't like it. But the pillar of righteousness that you seem to envision yourself standing on is not as profound as you seem to think it is.

odo1242 · 8h ago
Well, everything’s a tradeoff. I know a lot of small websites that had to shut down because LLM scraping was increasing their CPU and bandwidth load to the point where it was untenable to host the site.
eric__cartman · 4h ago
My phone is a piece of junk from 8 years ago and I haven't noticed any degradation in browsing experience. A website takes like two extra seconds to load, not a big deal.
jgalt212 · 6h ago
> I feel sorry for people with budget phones who now have to battle with these PoW systems and think LLM scrapers will win this one, with everyone else suffering a worse browsing experience.

I dunno. How much work do you really need in PoW systems to make the scrapers go after easier targets? My guess is not so much that you impair a human's UX. And if you do, then you have not fine-tuned your PoW algo, or you have very determined adversaries / scrapers.

ChocolateGod · 6h ago
Any PoW that doesn't impact end users is not going to impact LLM scrapers.
sznio · 15h ago
I'd really like this, since it wouldn't impact my scraping stuff.

I like to scrape websites and make alternative, personalized frontends for them. Captchas are really painful for me. Proof of work would be painful for a massive scraping operation, but I wouldn't have an issue with spending some CPU time to get the latest posts from a site which doesn't have an RSS feed or an API.

diggan · 8h ago
Yeah, to me PoW makes a lot of sense in this way too. Captchas are hard for (some) people to solve, and very annoying to fill out, but easy for vision-enabled LLMs to solve (or you can even use 3rd-party services where you pay per solve, available for every major captcha service). PoW, on the other hand, is hard to deal with in a distributed/spammy design, but very easy for any user to just sit and wait a second or two. And all personal scraping tooling just keeps working, only slightly slower.

Sounds like an OK solution to a shitty problem that has a bunch of other shitty solutions.

DaSHacka · 15h ago
Surprised there hasn't been a fork of Anubis that changes the artificial PoW into a simple Monero mining PoW yet.

Would be hilarious to trap scraper bots into endless labyrinths of LLM-generated mediawiki pages, getting them to mine hashes with each progressive article.

At least then we would be making money off these rude bots.

albrewer · 7h ago
There was a company a while back that did almost exactly this, called Coinhive.
g-b-r · 14h ago
It would be a much bigger incentive to deploy these systems with little care for the innocent users impacted.

Although admittedly millions of sites already ruined themselves with cloudflare without that incentive

xnorswap · 14h ago
The bots could check if they've hit the jackpot themselves, keep the valid hashes for themselves, and only return them when they're worthless.

Then it's the bots who are making money from work they need to do for the captchas.

gus_massa · 8h ago
IIRC the mined block has an instruction like (fake quote):

> Please add the reward and fees to: 187e6128f96thep00laddr3s9827a4c629b8723d07809

And if you make a fake block that changes the address, then the fake block is not a good one.

This avoids the same problem of people stealing from pools, and also of bad actors listening for newly mined blocks, pretending they found them, and sending a fake one.

nssnsjsjsjs · 14h ago
1. The problem is the bot needs to understand the program it is running to do that. Akin to the halting problem.

2. There is no money in mining on the kind of hardware scrapers will run on. Power costs more than they'd earn.

immibis · 11h ago
Realistically, the bot owner could notice you're running MoneroAnubis and then would specifically check for MoneroAnubis, for example with a file hash, or a comment saying /* MoneroAnubis 1.0 copyright blah blah GPL license blah */. The bot wouldn't be expected to somehow determine this by itself automatically.

Also, the ideal Monero miner is a power-efficient CPU (so probably in-order). There are no Monero ASICs by design.

nssnsjsjsjs · 10h ago
I doubt you could do this efficiently enough that a mining-optimised rig could be kept busy with web-scraped honeypots to be worth the time of setting it up, versus just scraping (skipping PoW-protected sites) and running a dedicated crypto mining operation as two separate things.
hypeatei · 3h ago
> Then it's the bots who are making money from work they need to do for the captchas.

Wouldn't it be easier to mine crypto themselves at that point? Seems like a very roundabout way to go about mining crypto.

forty · 14h ago
We need an oblivious crypto currency mining algorithm ^^
kmeisthax · 6h ago
This is a good idea for honeypotting scrapers, though as per [0] I hope nobody actually tries to use it on a real website anyone would want to use.

[0] https://news.ycombinator.com/item?id=44117591

avastel · 7h ago
Reposting a similar point I made recently about CAPTCHA and scalpers, but it’s even more relevant for scrapers.

PoW can help against basic scrapers or DDoS, but it won’t stop anyone serious. Last week I looked into a Binance CAPTCHA solver that didn’t use a browser at all, just a plain HTTP client. https://blog.castle.io/what-a-binance-captcha-solver-tells-u...

The attacker had fully reverse engineered the signal collection and solved-state flow, including obfuscated parts. They could forge all the expected telemetry.

This kind of setup is pretty standard in bot-heavy environments like ticketing or sneaker drops. Scrapers often do the same to cut costs. CAPTCHA and PoW mostly become signal collection protocols, if those signals aren’t tightly coupled to the actual runtime, they get spoofed.

And regarding PoW: if you try to make it slow enough to hurt bots, you also hurt users on low-end devices. Someone even ported PerimeterX’s PoW to CUDA to accelerate solving: https://github.com/re-jevi/PerimiterXCudaSolver/blob/main/po...

persnickety · 14h ago
> An LLM scraper is operating in a hostile environment [...] because you can't particularly tell a JavaScript proof of work system from JavaScript that does other things. [..] for people who would like to exploit your scraper's CPU to do some cryptocurrency mining, or [...] want to waste as much of your CPU as possible).

That's a valid reason why JS-based PoW systems scare LLM operators: there's a chance the code might actually be malicious.

That's not a valid reason to serve JS-based PoW systems to human users: the entire reason those proofs work against LLMs is the threat that the code is malicious.

In other words, PoW works against LLM scrapers not because of PoW, but because they could contain malicious code. Why would you threaten your users with that?

And if you can apply the threat only to LLMs, then why don't you cut the PoW garbage and start with that instead?

I know, it's because it's not so easy. So instead of wielding the Damocles sword of malware, why not standardize on some PoW algorithm that people can honestly apply without the risks?

pjc50 · 13h ago
I don't think this is "malicious" so much as it is "expensive" (in CPU cycles), which is already a problem for ad-heavy sites.
captainmuon · 13h ago
I don't know; a sandbox escape from a browser is a big deal, a million-dollar-bounty kind of deal. I feel safe putting an automated browser in a container or a VM and letting it run with a timeout.

And if a site pulls something like that on me, then I just don't take their data. The joke is on them: soon, if something is not visible to AI, it will not 'exist', just like being delisted from Google today.

berkes · 13h ago
> Why would you threaten your users with that?

Your users - we, browsing the web - are already threatened with this. Adding a PoW changes nothing here.

My browser already has several layers of protection in place. My browser even allows me to improve this protection with addons (ublock etc) and my OSes add even more protection to this. This is enough to allow PoW-thats-legit but block malicious code.

account42 · 8h ago
Not safety-conscious users who disable javascript.
kbenson · 2h ago
With regard to proof of work systems that provide revenue:

1) Making LLM (and other) scrapers pay for the resources they use seems perfectly fine to me. Also, as someone who manages some level of scraping (on the order of low tens of millions of requests a month), I'm fine with this. For a wide range of scraping, the problem is not the resource cost, but the other side not wanting to deal with setting up APIs, or putting so many hurdles on access that it's easier to just bypass them.

2) This seems like it might be an opportunity for Cloudflare. Let customers opt in to requiring a proof of work when visitors already trip the Cloudflare vetting page that runs additional checks to see if you're a bad actor, and apply any revenue as a service credit towards their monthly fee (or, if on a free plan, as credit to be used for trying out additional for-pay features). There might be a perverse incentive to toggle on more stringent checking from Cloudflare, but ultimately, since it's all being paid for, that's the site owner's choice in how they want to manage their site.

dannyw · 15h ago
This is a poor take. All the major LLM scrapers already run and execute JavaScript, Googlebot has been doing it for probably a decade.

Simple limits on runtime stop crypto mining from being too big of a problem.

jeroenhd · 14h ago
And by making bots hit that limit, scrapers don't get access to the protected pages, so the system works.

Bots can either risk being turned into crypto miners, or risk not grabbing free data to train AIs on.

account42 · 8h ago
Real users also have a limit where they will close the tab.
nitwit005 · 2h ago
> Simple limits on runtime stop crypto mining from being too big of a problem.

If they put in a limit, you've won. You just make your site be above that limit, and the problem is gone.

TZubiri · 15h ago
"Googlebot has been doing it for probably a decade."

This is why Google developed a browser: it turns out that scraping the web requires you to pretty much develop a V8 engine, so why not publish it as a browser.

motoxpro · 15h ago
This is so obvious when you say it, but what an awesome insight.
nssnsjsjsjs · 14h ago
Except it doesn't make sense. Why not just use Firefox. Or improve the JS engine of Firefox.

I reckon they made the browser to control the browser market.

zinekeller · 8h ago
> Why not just use Firefox.

The reason Servo existed (when it was still in Mozilla's care) was because of how deeply spaghettified Gecko's code (sans IonMonkey) was, with the plan of replacing Gecko's components with Servo's.

Firefox's automation systems are now miles better, but that's literally the combination of years of work to modularize Gecko, the partial replacement of Gecko's parts with Servo's (like Stylo: https://hacks.mozilla.org/2017/08/inside-a-super-fast-css-en...), and actively building the APIs despite the still-spaghettified mess.

chrisco255 · 5h ago
V8 was dramatically better than Firefox's engine at the time. AFAIK, it was the first JS engine to take the approach of compiling repetitive JS to native machine code.

If it's true that V8 was used internally for Google's scraper before they even thought about Chrome, then it makes obvious sense why not. The other factor is the bureaucracy and difficulty of getting an open source project to refactor their entire code base around your own personal engine. Google had the money and resources to pay the best in the business to work on Chrome.

baq · 14h ago
their browser is their scraper. what you see is what the scraper sees is what the ads look like.
TZubiri · 4h ago
"Why develop in-house software for the core application of the biggest company in the world at the time, worth more than 100B$. Why not just repurpose rinky dink open source browser as some kind of parser, bank our 100B$ business on some volunteers and a 501c3 NFP, that will play out well in a shareholder meeting and in trials when they ask us how we safeguard our software."
rkangel · 14h ago
It's not quite that simple. I think that having that skillset and knowledge in house already probably led to it being feasible, but that's not why they did it. They created Chrome because it was in their best interests for rich web applications to run well.
rob_c · 14h ago
You don't work anywhere near the ads industry then; people have been grumbling about this for the whole 10 years now.
mschuster91 · 14h ago
... and the fact that even with a browser, content gated behind Macromedia Flash or ActiveX applets was / is not indexable is why Google pushed so hard to expand HTML5 capabilities.
chrisco255 · 5h ago
Was it really a success though in that regard? HTML5 was great and all, but it never did replace Flash. Websites mainly just became more static. I suspect the lack of mobile integration had more to do with Flash dying than HTML5 getting better. It's a shame in some sense, because Flash was a lot of fun.
maeln · 14h ago
But that is the whole point of the article? Big scrapers can hardly tell whether the JS eating their runtime is a crypto miner or an anti-scraping system, so they will have to give up "useful" scraping, which means PoW might just work.
rob_c · 14h ago
No, the point is there are really advanced PoW challenges out there to prove you're not a bot (those websites that take >3s to fingerprint you are doing this!).

The idea is to abuse the abusers: if you suspect it's a bot, change the PoW from a GPU/machine/die fingerprint computation to something like a few ticks of Monero or whatever the crypto of choice is this week.

Sounds useless, but don't forget: 0.5s of that across their farm of 1e4 scraping nodes and you're onto something.

The catch is not getting caught out by impacting the 0.1% of Tor-running, anti-ad "users" out there who will try to decompile your code when their personal Chrome build fails to work. I say "users" because they will be visiting a non-free site while espousing their perceived right to be there, no different from a bot to someone paying the bills.

dxuh · 11h ago
I always thought that JavaScript cryptomining is a better alternative to ads for monetizing websites (as long as people don't depend on those websites and website owners don't take it too far). I'd much rather give you a second of my CPU instead of space in my brain. Why is this so frowned upon? And in the same way I thought Anubis should just mine crypto instead of wasting power.
captainbland · 8h ago
I'd imagine it's pretty much impossible to make a crypto system which doesn't introduce unreasonable latency/battery drain on low-end mobile devices which is also sufficiently difficult for scrapers running on bleeding edge hardware.

If you decide that low end devices are a worthy sacrifice then you're creating e-waste. Not to mention the energy burden.

thedanbob · 11h ago
> Why is this so frowned upon?

Maybe because while ad tech these days is no less shady than crypto mining, the concept of ads is something people understand. Most people don't really understand crypto so it gets lumped in with "hackers" and "viruses".

Alternatively, for those who do understand ad tech and crypto, crypto mining still subjectively feels (to me at least) more like you're being stolen from than ads. Same with Anubis, wasting power on PoW "feels" more acceptable to me than mining crypto. One of those quirks of the human psyche I guess.

matheusmoreira · 6h ago
Running proof of work on user machines without their consent is theft of their computing and energy resources. Any site doing so for any purpose whatsoever is serving malware and should be treated as such.

Advertising is theft of attention which is extremely limited in supply. I'd even say it's mind rape. They forcibly insert their brands and trademarks into our minds without our consent. They deliberately ignore and circumvent any and all attempts to resist. It's all "justified" though, business interests excuse everything.

ge96 · 7h ago
I think some sites that stream content (illegally) do this
bob1029 · 14h ago
I think this is not a battle that can be won in this way.

Scraping content for an LLM is not a hyper time sensitive thing. You don't need to scrape every page every day. Sam Altman does not need a synchronous replica of the internet to achieve his goals.

CGamesPlay · 14h ago
That is one view of the problem, but the one people are fixing with proof of work systems is the (unintentional) DDoS that LLM scrapers are operating against these sites. Just reducing the amount of traffic to manageable levels lets me get back to the work of doing whatever my site is supposed to be doing. I personally don't care if Sam Altman has a copy of my git server's rendition of the blame of every commit in my open source repo, because he could have just cloned my git repo and gotten the same result.
bob1029 · 14h ago
I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.

xena · 11h ago
I've set up a few honeypot servers. Right now OpenAI alone accounts for 4 hours of compute for one of the honeypots in a span of 24 hours. It's not hypothetical.
2000UltraDeluxe · 14h ago
25k+ hits/minute here. And that's just the scrapers that don't simply identify themselves as browsers.

Not sure why you believe massive repeated scraping isn't a problem. It's not like there is just one single actor out there, and ignoring robots.txt seems to be the norm nowadays.

spiffyk · 13h ago
> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

It is very real and the reason why Anubis has been created in the first place. It is not plain hostility towards LLMs, it is *first and foremost* a DDoS protection against their scrapers.

https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...

https://social.kernel.org/notice/AsgziNL6zgmdbta3lY

https://xeiaso.net/notes/2025/amazon-crawler/

heinrich5991 · 14h ago
> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

Yes, there are sites being DDoSed by scrapers for LLMs.

> If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.

This isn't about one request per week or per month. There were reports from many sites that they're being hit by scrapers that request from many different IP addresses, one request each.

lelanthran · 10h ago
> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

There are already tens of thousands of scrapers right now trying to get even more training data.

It will only get worse. We all want more training data. I want more training data. You want more training data.

We all want the most up to date data there is. So, yeah, it will only get worse as time goes on.

fl0id · 14h ago
For the model, it's not. But I think many of these bots are also from tool usage or 'research' or whatever they call it these days. And for that, it does matter.
berkes · 13h ago
I doubt any "anti-scraper" system will actually work.

But if one is found, it will pave the way for a very dangerous counter-attack: Browser vendors with need for data (i.e. Google) simply using the vast fleet of installed browsers to do this scraping for them. Chrome, Safari, Edge, sending the pages you visit to their data-centers.

reginald78 · 6h ago
This feels like it was already half happening anyway, so it isn't too big of a leap.

I also think this is the endgame of things like Recall in windows. Steal the training data right off your PC, no need to wait for the sucker to upload it to the web first.

lionkor · 13h ago
This is why we need https://ladybird.org/
forty · 14h ago
Can we have a proof-of-work algorithm that computes something actually useful? Like finding large prime numbers, or something else that already has distributed-computation programs. That way all this wasted power is at least not completely lost.
xena · 12h ago
Anubis author here. I looked into protein folding. The problem is that protein folding requires scientific data, which can easily get into the gigabyte range. That is more data than I want to serve to clients. Unless there's a way to solve the big data problem, a lot of "compute for good" schemes are frankly unworkable.
g-b-r · 13h ago
Unfortunately, useful things usually require much more computation to find a useful result, so they require distributing the search, and then you can't reliably verify that a client performed the work (most searches will not find anything, and a client can just pretend not to have found anything without doing any work).

If a service had enough concurrent clients to reliably hit useful results quickly, you could verify that most did the work by checking if a hit was found, and either let everyone in or block everyone according to that; but then you're relying on the large majority being honest for your service to work at all, and some dishonest clients would still slip through.

hardwaresofton · 14h ago
Fantastic work by Xe here -- not the first but this seems like the most traction I've seen on a PoW anti-scraper project (with an MIT license to boot!).

PoW anti-scraper tools are a good first step, but why don't we just jump straight to the endgame? We're drawing closer to a point where information's value is actually fully realized -- people will stop sharing knowledge for free. It doesn't have to be that way, but it does in a world where people are pressed for economic means: knowledge becomes an obvious thing to convert to capital and attempt to extract rent on.

The simple way this happens is just a login wall -- for every website. It doesn't have to be a paid login wall of course (at first), but it's a super simple way to both legally and practically protect from scrapers.

I think high quality knowledge, source code (which is basically executable knowledge), being open in general is a miracle/luxury of functioning, economically balanced societies where people feel driven (for many possible reasons) to give back, or have time to think of more than surviving.

Don't get me wrong -- the doomer angle is almost always wrong -- every year humanity is almost always better off than we were the previous year on many important metrics, but it's getting harder to see a world where we cartwheel through another technological transformation that this time could possibly impact large percentages of the working population.

pjc50 · 13h ago
> people will stop sharing knowledge for free. It doesn't have to be that way

Yeah. People over-estimate the flashy threats from AI, but to me the more significant threat is killing the open exchange of knowledge and more generally the open, trusting society by flooding it with agents which are happy to press "defect" on the prisoner's dilemma.

> being open in general is a miracle/luxury of functioning, economically balanced societies where people feel driven (for many possible reasons) to give back, or have time to think of more than surviving

"High trust society". Something that took the West a very long time to construct through social practices, was hugely beneficial for economic growth, but is vulnerable to defectors. Think of it like a rainforest: a resource which can be burned down to increase quarterly profit.

hardwaresofton · 12h ago
> Yeah. People over-estimate the flashy threats from AI, but to me the more significant threat is killing the open exchange of knowledge and more generally the open, trusting society by flooding it with agents which are happy to press "defect" on the prisoner's dilemma.

I don't think societies are open/trusting by default -- it takes work and a lot of anti-intuitive thinking, sustained over long periods of time.

> "High trust society". Something that took the West a very long time to construct through social practices, was hugely beneficial for economic growth, but is vulnerable to defectors. Think of it like a rainforest: a resource which can be burned down to increase quarterly profit.

I think the trust is downstream of the safety (and importantly "economic safety", if we can call it that). Everyone trusts more when they're not feeling threatened. People "defect" from cultures that don't work for them -- people leave the culture they like and go to another one usually because of some manifestation of danger.

lr4444lr · 8h ago
This is inimical to the purpose of the Internet.

Maybe the dream of knowledge being free and open was always doomed to fail; if knowledge has value, and people are encouraged to spend more of their time and energy to create it rather than other kinds of work, they will have to be compensated in order to do it increasingly well.

It's kinda sad though, if you grew up in a world where you could actually discover stuff organically and through search.

immibis · 7h ago
I don't think it's inevitable doom, but a realignment of incentives will probably be needed. Perhaps in the form of payment. In several EU countries it's illegal to have any internet connection without linking it to your ID card or passport in some central database, so that could also be a thing - people are generally reluctant to get arrested.
xena · 5h ago
Thanks! I'm gonna try and bootstrap this into a company. My product goal for the immediate future is unbranded Anubis (already implemented) with a longer term goal of being a Canadian-run Cloudflare competitor.
g-b-r · 14h ago
Yes, let's turn the whole web into Facebook, what a bright future
Schiendelman · 14h ago
I think we only have two choices here: 1) every webpage requires Facebook login, and then Facebook offers free hosting for the content. 2) every webpage requires some other method of login, but not locked into a single system.

I read the GP comment as suggesting we push on the second option there rather than passively waiting for the first option.

hardwaresofton · 12h ago
You were right on the second one! Facebook wasn't even a thought in my mind per se (they're not unique in that every social network wants to build a walled garden).

My focus was more on the areas outside the large walled gardens -- they might become a bunch of smaller... fenced backyards, to put it nicely.

g-b-r · 13h ago
I hope you see that you shatter anonymity and the open web with that, single system or not
hardwaresofton · 12h ago
anonymity and the open web are different things, and neither of them were promised/guaranteed to anyone on the internet.

For people that value anonymity, they'll create their own spaces. People that value openness will continue to be open.

What we're about to find out is what happens when the tide goes out and people show you what they really believe/want -- anything other than that is a form of social control, whether via browbeating or other means.

g-b-r · 2h ago
> anonymity and the open web are different things, and neither of them were promised/guaranteed to anyone on the internet.
>
> For people that value anonymity, they'll create their own spaces. People that value openness will continue to be open

Hardly anything of what's the internet today was promised, but who are you to decide what the internet has to become now, and that people with different ideas need to confine themselves in their own ghettos?

Everyone values privacy; it's only out of social pressure that most give up so much of it.

> What we're about to find out is what happens when the tide goes out and people show you what they really believe/want -- anything other than that is a form of social control, whether via browbeating or other means

No idea of what you're talking about there

Schiendelman · 12h ago
Of course I do. But that's already gone.
g-b-r · 2h ago
You must be on a different internet than mine
immibis · 7h ago
IP addresses are not anonymous. Have you tried to make your IP address anonymous, e.g., with Tor or one of those NordVPN-like companies? (not picking on Nord, though they deserve to be picked on - they're just the most advertised.)

You'll find CAPTCHAs almost everywhere, outright 403s or dropped connections in a lot of places. Even Google won't serve you sometimes.

The reason you're not seeing that situation right now is that your IP address is identifiable.

reginald78 · 6h ago
I see captchas all the time on my home internet connection without a VPN these days. That era seems to be ending, probably because AI scraping is now using residential IP blocks.
immibis · 5h ago
There's been talk on NANOG about whole residential ISPs getting marked as VPNs now. Turns out selling excessive security to businesses is easy, I guess. Like CrowdStrike.
g-b-r · 2h ago
IP addresses can be anonymous, and I do get CAPTCHAs almost everywhere they're used, without using Tor.

What cannot possibly be anonymous is a login with a verified identity.

account42 · 8h ago
Fantastic work? More like contributing to the enshittification of the web.
Analemma_ · 8h ago
I mean, LLM scrapers set fire to the commons, and when you do that, now you have a flaming hole in the ground where the commons used to be. It's not the fault of website operators who have to act in self-defense lest their site get DDoSed out of existence.
jameslk · 3h ago
What if we move some of the website backend workload to the bots, effectively using them as decentralized cloud infrastructure? web3, we are so back
keepamovin · 15h ago
Interesting - we dealt with this issue in CloudTabs, a SaaS for the BrowserBox remote browser. The way we handle it is simply to monitor resource usage with a Python script, issue a warning to the user when their tab or all processes are running hot, and then, once the rules are triggered, kill the offending processes (those that use too much CPU or RAM).

Chrome has the nice property that you can kill a render process for a tab and often it just takes that tab down, leaving everything else running fine. This plus warning provides minimal user impact while ensuring resources for all.

In the past we experimented with cgroups (both versions) and other mechanisms for limiting resources, but found dynamic monitoring to be the most reliable.
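
(A rough sketch of that kind of monitor, assuming psutil; the thresholds and the renderer-detection heuristic here are made up, and CloudTabs' actual script presumably differs - this is only meant to show the shape of the approach.)

```python
import time

import psutil  # assumed dependency for process inspection

CPU_LIMIT = 90.0           # sustained percent of one core (made-up threshold)
RSS_LIMIT = 2 * 1024 ** 3  # 2 GiB resident memory (made-up threshold)

def sweep() -> None:
    for proc in psutil.process_iter(["name", "cmdline", "memory_info"]):
        try:
            cmdline = " ".join(proc.info["cmdline"] or [])
            if "--type=renderer" not in cmdline:
                continue  # only police per-tab renderer processes
            cpu = proc.cpu_percent(interval=0.5)  # sample this process briefly
            mem = proc.info["memory_info"]
            rss = mem.rss if mem else 0
            if cpu > CPU_LIMIT or rss > RSS_LIMIT:
                # Warning step omitted; killing a renderer takes down just that tab.
                proc.kill()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue

if __name__ == "__main__":
    while True:
        sweep()
        time.sleep(10)
```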

fennecfoxy · 8h ago
Hmmmm this seems like something that will be bad for the environment.
reginald78 · 7h ago
AI scraping, bloated JavaScript pages that are mostly text, ads, existing captchas, etc. also all waste energy for no gain to the end user, and we mostly just accept it or maybe run an ad blocker. I see people complain that PoW solutions make things hard for low-power devices and are a waste of energy, which is true. But that is also true of the status quo, which is also an annoying waste of human time and often a privacy nightmare.
matt3210 · 15h ago
The issue isn’t the resource usage as much as the content they’re stealing for reproduction purposes.
reginald78 · 6h ago
That is one part, but they are so voracious and aggressive that they are starting to crush hosts of that content and cause things to become less open. In a way, they are not only 'stealing' it for themselves but also erasing it for humans.
berkes · 13h ago
That's not true for the vast amount of Creative Commons, open-source, and other permissively licensed content.

(Aside: these licenses and this open distribution were advocated by much of the same demographic (the information-wants-to-be-free folks, JSTOR protestors, GPL zealots) that now opposes LLMs using that content.)

jsheard · 10h ago
> GPL-zealots

I'm sure GPL zealots would be happier about this situation if LLM vendors abided by the spirit of the license by releasing their models under GPL after ingesting GPL data, but we all know that isn't happening.

captainmuon · 13h ago
As somebody who does some scraping / crawling for legitimate uses, I'm really unhappy with this development. I understand people have valid reasons why they don't want their content scraped. Maybe they want to sell it - I can understand that, although I don't like it. Maybe they are opposed to it for fundamental reasons. I, for one, would like my content to be spread maximally. I want my arguments to be incorporated into AIs, so I can reach more people. But of course that's just me and the content I write; others have different goals.

It gets annoying when you have the right to scrape something - either because the owner of the data gave you the OK or because it is openly licensed - but the webmaster can't be bothered to relax the rate limiter for you, and nobody can give you a nice API. Now people are putting their Open Educational Resources, their open-source software, even their freaking essays about openness that they want the world to read, behind Anubis. It makes me shake my head.

I understand perfectly that it is annoying when badly written bots hammer your site. But maybe then HTTP and those bots are the problem. Maybe we should make it easier for site owners to push their content somewhere we can scrape it more easily?

Analemma_ · 8h ago
If you scrape at a reasonable rate and don't clear session cookies, your scraper can solve the Anubis PoW the same as a user, and you're fine. Anubis is aimed at distributed scrapers that make requests at absurd rates.
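
To illustrate, a minimal sketch of the scraper side under those assumptions: keep one requests.Session so whatever cookie the site issues after the challenge persists, and solve a generic SHA-256 leading-zero-bits puzzle. This is the general shape of such schemes, not Anubis's exact wire protocol.

```python
# Generic PoW solver for a well-behaved scraper; the challenge format is assumed, not Anubis-specific.
import hashlib
import requests

def solve_pow(challenge: str, difficulty_bits: int) -> int:
    """Find a nonce so that sha256(challenge + nonce) has at least `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while int.from_bytes(hashlib.sha256(f"{challenge}{nonce}".encode()).digest(), "big") >= target:
        nonce += 1
    return nonce

session = requests.Session()  # reuse this for every request so the issued cookie is kept
# If a fetch returns a challenge page, extract the challenge string and difficulty from it,
# call solve_pow(), submit the nonce once, then keep crawling slowly with the same session.
```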
yladiz · 13h ago
> I understand people have valid cases why they don't want their content scraped. Maybe they want to sell it - I can understand that, although I don't like it.

To be frank: it's not your content, it's theirs, and whether you like it or not, they can decide what to do with it; you're not entitled to it. Yes, there are some cases where you personally have permission to scrape, or the license explicitly permits it, but that isn't the norm.

The bigger issue isn't that people don't want their content to be read; it's that in most cases they want it to be read and consumed by a human, and they want their server resources (network bandwidth, CPU, etc.) to be used in a manageable way. If these bots were written to be respectful, maybe we wouldn't be in this situation. They poisoned the well, and respectful bots suffer because of their actions.

berkes · 13h ago
Sounds like something IPFS could be a nice solution for.
vanschelven · 14h ago
TBH most of the talk of "aggressive scraping" has been in the 100K pages/day range (which is ~1 page/s, i.e. negligible). In my mind, cloud providers' ridiculous egress rates are more to blame here.
jeroenhd · 14h ago
I've caught Huawei and Tencent IPs scraping the same image over and over again, with different query parameters. Sure, the image was only 260KiB and I don't use Amazon or GCP or Azure so it didn't cost me anything, but it still spammed my logs and caused a constant drain on my servers' resources.

The bots keep coming back too, ignoring HTTP status codes, permanent redirects, and whatever else I can think of to tell them to fuck off. Robots.txt obviously doesn't help. Filtering traffic from data centers didn't help either, because soon after I did that, residential IPs started doing the same thing. I don't know if this is a Chinese ISP abusing their IP ranges or if China just has a massive botnet problem, but either way the traditional ways to get rid of these bots haven't helped.

In the end, I'm now blocking all of China and Singapore. That stops the endless flow of bullshit requests for now, though I see some familiar user agents appearing in other east Asian countries as well.

account42 · 7h ago
So make sure the image is only available at one canonical URL with proper caching headers? No, obviously the only solution is to install crapware that worsens the experience for regular users.
account42 · 7h ago
Agreed. Website operators should have a hard look at why their unoptimized crap can't manage such low request rates before contributing to the enshittification of the web by deploying crapware like anubis or buttflare.
immibis · 7h ago
I've been blocking a few scrapers from my gitea service - not because it's overloaded, more just to see what happens. They're not getting good data from <repo>/commit/<every sha256 in the repo>/<every file path in the repo> anyway. If they actually wanted the data they could run "git clone".

I just checked, since someone was talking about scraping in IRC earlier. Facebook is sending me about 3 requests per second. I blocked their user-agent. Someone with a Googlebot user-agent is doing the same stupid scraping pattern, and I'm not blocking it. Someone else is sending a request every 5 seconds with

One thing that's interesting on the current web is that sites are expected to make themselves scrapeable. It's supposed to be my job to organize the site in such a way that scrapers don't try to scrape every combination of commit and file path.

bmacho · 14h ago
Another idea: if your content is ~5 kB of text, then serve it to whoever asks for it. If you don't have the bandwidth, try to make it smaller, make it static, and put it on the edge - some other people's computers.
account42 · 7h ago
Exactly, if you have the bandwidth to serve your proof-of-work scripts and check the results, then you also have the bandwidth to simply serve properly optimized content.
foul · 14h ago
The fun and terrible thing about the web is that the "rockstar in temporary distress" trope can be true and can happen when you least expect it, like, you know, when you receive an HN kiss of death.

You can surely expect that static content will stay static and won't run jpegoptim on an image on every hit (a dynamic CMS + a sudden visit from ByteDance = your server is DDoSed), but you can't expect any idiot/idiot hive on this planet to set up a multi-country edge-caching architecture for a small website just in case some blog post hits a few million visits every ten minutes. That can easily take down a server even for static content.

I concur that Anubis is a poor solution, and yet here we are: even the UN is using it to fend off requests.

account42 · 7h ago
Popularity-based traffic spikes tend to be very temporary and should not be something smaller sites concern themselves with.
foul · 6h ago
My example was quite vapid, but you shouldn't concentrate on that use case. Small doesn't always mean "negligible info", while scrapers are always stealing CPU time.
h1fra · 15h ago
Most scrapers are not able to monitor each website's performance; what will happen is that sites will respond more slowly, and that's it.
nssnsjsjsjs · 14h ago
My violin is microscopic for this problem. It's actually given me ideas!
akomtu · 4h ago
That's essentially DRM: watch, but do not copy. Except that in this case it's the corporations doing the piracy.
EGreg · 8h ago
> because you can't particularly tell a JavaScript proof of work system from JavaScript that does other things. Letting your scraper run JavaScript means that it can

LLMs can, LOL

One of the powerful use cases is that they can catch pretty much EVERY attempt at obfuscation now. Building a marketplace and want to let the participants chat but not take the deal off-site? LLM in the loop. Protecting children? LLM in the loop.

Animats · 16h ago
"This page is taking too long to load" on sites with anti-scraper technology. Do Not Want.
jchw · 15h ago
In general if you're coming from a "clean" IP your difficulty won't be very high. In the future, if these systems coordinate with each other in some way (a DHT?), it should be possible to drop the baseline difficulty even further.
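
As a toy illustration of that idea (the reputation score and bit ranges are entirely hypothetical):

```python
# Hypothetical difficulty scaling: clean IPs get a trivial puzzle, suspicious ones a hard one.
def difficulty_for(ip_reputation: float, base_bits: int = 4, max_bits: int = 20) -> int:
    """ip_reputation in [0, 1]: 1.0 = clean, 0.0 = known abusive."""
    ip_reputation = min(1.0, max(0.0, ip_reputation))
    return base_bits + round((max_bits - base_bits) * (1.0 - ip_reputation))
```

Where the reputation would come from (local abuse counters, shared blocklists, a future DHT) is the open question.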
berkes · 13h ago
That's a perfect tool for monopolists to widen their moat even more.

In line with how email is technically still federated and distributed, but practically oligopolized by a handful of big-tech companies, through "fighting spam".

account42 · 7h ago
> In general if you're coming from a "clean" IP your difficulty won't be very high.

Unless you are using an operating system or browser that isn't the monopoly choice.

Fuck off with this idea that some clients are better than others.

jchw · 4h ago
That has absolutely fuck all to do with IP reputation, you're mixing up unrelated concepts.
rob_c · 14h ago
Seen this already: back in the day there were sites which ran Bitcoin hashing on their user base, and there was uproar.

If someone dusted off the same tools and managed to get Altman to buy them a nice car with it, good on them :)

TZubiri · 15h ago
" and LLM scrapers may well have lots of CPU time available through means such as compromised machines."

It's not clear whether the author means LLM scrapers in the sense of scrapers that gather training data for foundation models, LLM scrapers that browse the web to provide up-to-date answers, or vibe coders and agents that use browsers at the behest of the programmer or the user.

But in none of those cases can I imagine compromised machines being relevant. If we are talking about compromised machines, it's irrelevant whether an LLM is involved and how; it's a distributed attack completely unrelated to LLMs.

immibis · 7h ago
You can buy access to proxies in residential networks - with a credit card, on the open internet - and they may or may not be someone's botnet (probably not, but you don't know that). I'm not aware of anyone selling the ability to run code on the device, though. It's just an HTTP or SOCKS5-level proxy.
OutOfHere · 15h ago
See also http://www.hashcash.org/, which is a famous proof-of-work algorithm. The bigger benefit of proof-of-work is not that it's anti-LLM; it is that it's statelessly anti-DoS.

I have been developing a public service, and I intend to use a simple implementation of proof-of-work in it, made to work with a single call without needing back-and-forth information from the server for each request.
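
For illustration only (not the actual service's design), a minimal sketch of what such a single-call, stateless scheme could look like: the challenge is derived from the request itself plus a timestamp, so the server stores nothing per client and only checks freshness; a real deployment would likely also want a short replay cache.

```python
# Hypothetical stateless PoW: one header, no server-side challenge storage.
import hashlib
import time

DIFFICULTY_BITS = 18
MAX_AGE_SECONDS = 300

def _meets_target(challenge: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY_BITS))

def client_stamp(method: str, path: str) -> str:
    """Client side: compute the stamp locally and send it as a single request header."""
    ts = int(time.time())
    challenge = f"{method}:{path}:{ts}"
    nonce = 0
    while not _meets_target(challenge, nonce):
        nonce += 1
    return f"{ts}:{nonce}"

def server_verify(method: str, path: str, stamp: str) -> bool:
    """Server side: recompute the challenge from the request and check the work, no stored state."""
    try:
        ts_str, nonce_str = stamp.split(":")
        ts, nonce = int(ts_str), int(nonce_str)
    except ValueError:
        return False
    if abs(time.time() - ts) > MAX_AGE_SECONDS:
        return False
    return _meets_target(f"{method}:{path}:{ts}", nonce)
```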

berkes · 13h ago
I've done that as well. The PoC worked, but the statelessness did prove a hurdle.

It enforces a pattern in which a client must do the PoW on every request.

Other difficulties uncovered in our PoC were:

Not all clients are equal: this punishes an old mobile phone or Raspberry Pi much more than a client running on a beefy server with GPUs, or clients running on compromised hardware. I.e., real users are likely punished the most, while illegitimate users are often punished the least.

Not all endpoints are equal: we experimented with higher difficulties for e.g. POST/PUT/PATCH/DELETE over GET, and with different difficulties for different endpoints, attempting to match how expensive a call would be for us (a sketch of such a mapping follows below). That requires back-and-forth to exchange difficulties.

It discourages proper HATEOAS or REST, where a client browses through the API by following links, and encourages calls that "just include as much as possible in one query", diminishing our ability to cache, to be flexible, and to leverage good HTTP practices.
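
For illustration, the kind of per-method / per-endpoint difficulty table described above might look like this (routes and bit counts invented); as noted, the server still has to communicate the required difficulty to the client somehow.

```python
# Invented per-method / per-endpoint PoW difficulty table.
DEFAULT_BITS = 8

ROUTE_DIFFICULTY = {
    # (method, path prefix): required leading-zero bits
    ("GET", "/api/"): 8,
    ("POST", "/api/"): 14,
    ("PUT", "/api/"): 14,
    ("PATCH", "/api/"): 14,
    ("DELETE", "/api/"): 16,
    ("POST", "/api/search"): 18,  # deliberately expensive endpoint
}

def required_bits(method: str, path: str) -> int:
    """Pick the hardest difficulty whose (method, prefix) rule matches the request."""
    bits = DEFAULT_BITS
    for (m, prefix), b in ROUTE_DIFFICULTY.items():
        if m == method and path.startswith(prefix):
            bits = max(bits, b)
    return bits
```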

wordofx · 15h ago
lol LLMs don’t just randomly execute JavaScript on a web page when scraped.

Edit: lol no wonder Hacker News is generally anti-AI. They think AI will just randomly execute JavaScript it sees on a web page. It's amazing how incredibly dumb half of HN is.

voidUpdate · 8h ago
Judging by the number of websites that construct themselves from scratch when you open them, rather than just, idk, having some HTML and CSS, you won't get much content if you don't execute JS
pjc50 · 13h ago
Even before anti-scraping, lots of SPA sites won't give you any content if you don't run JavaScript.
seszett · 14h ago
This is about the scraping tools that are used to train LLMs, not the scrapers used by live LLMs.
sltkr · 4h ago
Even the ones used live: why couldn't/wouldn't they run some type of headless browser with JavaScript enabled, rather than just making an HTTP request and looking at the HTML code?
seszett · 1h ago
That's the point: the scraping that happens live can run through a full browser; performance is not an important issue there.

The scraping that's taking place to train the models is on a completely different scale and (probably) cannot really bear the added cost of PoW on each page scraped. That's the scraping these tools target.

darkhelmet · 14h ago
You're missing the point. The approach is to block anything that doesn't execute the JavaScript. The JavaScript generates a key that allows your queries to work. No JavaScript => no key => no content for you to scrape.

It's a nuclear option. Nobody wins. The site operator can adjust the difficulty as needed to keep abusive scrapers at bay.
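
A minimal sketch of that server-side gate, assuming a Flask-style app and an HMAC-signed cookie issued after the JS challenge succeeds; the cookie name, token format, and challenge page are hypothetical rather than any particular tool's implementation.

```python
# Hypothetical gate: no valid token cookie => serve the JS challenge instead of content.
import hashlib
import hmac
import time

from flask import Flask, request

app = Flask(__name__)
SECRET = b"rotate-me-regularly"
TOKEN_TTL = 7 * 24 * 3600  # seconds a solved challenge stays valid

def issue_token(client_ip: str) -> str:
    """Called once the challenge is solved; the value goes into the 'pow_token' cookie."""
    issued = int(time.time())
    sig = hmac.new(SECRET, f"{issued}:{client_ip}".encode(), hashlib.sha256).hexdigest()
    return f"{issued}.{sig}"

def token_valid(token: str, client_ip: str) -> bool:
    try:
        issued_str, sig = token.split(".", 1)
        issued = int(issued_str)
    except ValueError:
        return False
    if time.time() - issued > TOKEN_TTL:
        return False
    expected = hmac.new(SECRET, f"{issued}:{client_ip}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

@app.before_request
def gate():
    token = request.cookies.get("pow_token", "")
    if not token_valid(token, request.remote_addr):
        # Serve the JS challenge page; once solved, issue_token() sets the cookie
        # and subsequent requests pass straight through to the real content.
        return app.send_static_file("challenge.html"), 401
```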

wordofx · 13h ago
lol this thread has me wondering if anyone commenting has ever worked on web scraping before. Doesn’t seem like it.