Perplexity is using stealth, undeclared crawlers to evade no-crawl directives

808 points by rrampage | 470 comments | 8/4/2025, 1:39:30 PM | blog.cloudflare.com ↗


fxtentacle · 6h ago
I find this problem quite difficult to solve:

1. If I as a human request a website, then I should be shown the content. Everyone agrees.

2. If I as the human request the software on my computer to modify the content before displaying it, for example by installing an ad-blocker into my user agent, then that's my choice and the website should not be notified about it. Most users agree, some websites try to nag you into modifying the software you run locally.

3. If I now go one step further and use an LLM to summarize content because the authentic presentation is so riddled with ads, JavaScript, and pop-ups that the content becomes borderline unusable, then why would the LLM accessing the website on my behalf be in a different legal category than my Firefox web browser accessing the website on my behalf?

itsdesmond · 5h ago
Some stores do not welcome Instacart or Postmates shoppers. You can shop there. You can shop with your phone out, scanning every item to price match, something that some bookstores frown on, for example. Third party services cannot send employees to index their inventory, nor can they be dispatched to pick up an item you order online.

Their reasons vary. Some don’t want their business’s perception of quality to be taken out of their control (delivering cold food, marking up items, poor substitutions). Some would prefer their staff serve and build relationships with customers directly, instead of dealing with disinterested and frequently quite demanding runners. Some just straight up disagree with the practice of third-party delivery.

I think that it’s pretty unambiguously reasonable to choose to not allow an unrelated business to operate inside of your physical storefront. I also think that maps onto digital services.

rjbwork · 5h ago
But I can send my personal shopper and you'll be none the wiser.
Polizeiposaune · 5h ago
To stretch the analogy to the breaking point: If you send 10,000 personal shoppers all at once to the same store just to check prices, the store's going to be rightfully annoyed that they aren't making sales because legit buyers can't get in.
hombre_fatal · 4h ago
Your comment and the above comment of course show different cases.

An agent making a request on the explicit behalf of someone else is probably something most of us agree is reasonable. "What are the current stories on Hacker News?" -- the agent is just doing the same request to the same website that I would have done anyways.

But the sort of non-explicit, just-in-case crawling that Perplexity might do for a general question where it crawls 4-6 sources isn't as easy to defend. "Are polar bears always white?" -- Now it's making requests I wouldn't necessarily have made, and it could even be seen as a sort of amplification attack.

That said, TFA's example is where they register secretexample.com and then ask Perplexity "what is secretexample.com about?" and Perplexity sends a request to answer the question, so that's an example of the first case, not the second.

bayindirh · 4h ago
As a person who has a couple of sites out there, and witnesses AI crawlers coming and fetching pages from these sites, I have a question:

What prevents these companies from keeping a copy of that particular page, which I specifically disallowed for bot scraping, and feed it to their next training cycle?

Pinky promises? Ethics? Laws? Technical limitations? Leeroy Jenkins?

accrual · 2h ago
Thanks for sharing your experience. A little off-topic but I'd like to start hosting some personal content, guides/tutorials, etc.

Do you still see authentic human traffic on your domains? Is it easy to discern?

I feel like I missed the bus on running a blog pre-AI.

tempfile · 4h ago
The way to prevent people from downloading your pages and using them is to take them off the public internet. There are laws to prevent people from violating your copyright or from preventing access to your service (by excessive traffic). But there is (thankfully) no magical right that stops people from reading your content and describing it.
bayindirh · 4h ago
Many site operators want people to access their content, but prevent AI companies from scraping their sites for training data. People who think like that made tools like Anubis, and it works.

I also want to keep this distinction on the sites I own, and I use licenses to signal that this site is not good for AI training, since it's CC BY-NC-SA 2.0.

So, I license my content appropriately (non-commercial, shareable under the same license with attribution), add technical countermeasures on top because companies don't respect these licenses (because monies) and circumvent these mechanisms (because monies), and I'm the one who has to suck it up and shut up (because their monies)?

Makes no sense whatsoever.

tempfile · 5m ago
Of course some people want that. And at the moment they can prevent it. But those methods may stop working. Will it then be alright to do it? Of course not, so why bother mentioning that they are able to prevent it now - just give a justification.

Your license is probably not relevant. I can go to the cinema and watch a movie, then come on this website and describe the whole plot. That isn't copyright infringement. Even if I told it to the whole world, it wouldn't be copyright infringement. Probably the movie seller would prefer it if I didn't tell anyone. Why should I care?

I actually agree that AI companies are generally bad and should be stopped - because they use an exorbitant amount of bandwidth and harm the services for other users. At least they should be heavily taxed. I don't even begrudge people for using Anubis, at least in some cases. But it is wrong-headed (and actually wrong in fact) to try to say someone may or may not use my content for some purpose because it hurts my feelings or it messes with my ad revenue. We have laws against copyright infringement, and to prevent service disruption. We should not have laws that say, yes you can read my site but no you can't use it to train an LLM, or to build a search index. That would be unethical. Call for a windfall tax if they piss you off so much.

hombre_fatal · 1h ago
I guess that's a question that might be answered by the NYT vs OpenAI lawsuit at least on the enforceability of copyright claims if you're a corporation like NYT.

If you don't have the funds to sue an AI corp, I'd probably think of a plan B. Maybe poison the data for unauthenticated users. Or embrace the inevitability. Or see the bright side of getting embedded in models as if you're leaving your mark.

sublinear · 5h ago
Too bad. Build a bigger store or publish this information so we don't need 10,000 personal shoppers. Was this not the whole point of having a website? Who distorted that simple idea into the garbage websites we have now?
recursive · 4h ago
Weird take. The store doesn't owe your personal shoppers anything.
the_real_cher · 4h ago
By the same token, the personal shoppers don't owe the store anything either.
recursive · 3h ago
Then they can't complain if they're barred entry.
the_real_cher · 2h ago
HTTP is neutral; it's up to the client to honor or ignore robots.txt.

You can block IPs at the host level, but there are pretty easy ways around that with proxy networks.

eddythompson80 · 1h ago
> http is neutral.

Who misled you with that statement?

eddythompson80 · 4h ago
Surely they owe them money for the goods and service, no? I thought that's how stores worked.
the_real_cher · 2h ago
Context friend. This article and entire comments sections is about questionable web page access. Context.
eddythompson80 · 2h ago
You're replying in a store metaphor thread though. Context matters.
dabockster · 4h ago
> Who distorted that simple idea into the garbage websites we have now?

Corporate America. Where clean code goes to die.

rapind · 5h ago
It's all about scale. The impact of your personal shopper is insignificant unless you manage to scale it up into a business where everyone has a personal shopper by default.
nickthegreek · 4h ago
How is everyone having a personal shopper a problem of scale? I was going to shop myself, but I sent someone else to do it for me.

At this moment I am using Perplexity's Comet browser to take a spotify playlist and add all the tracks to my youtube music playlist. I love it.

dylan604 · 1h ago
Let's look at the opposite benefit to a store: a mom who would need to bring her 3 kids to the store vs. that mom having a personal shopper. In this case, the personal shopper is "better" for the store as far as physical space. However, I'm sure the store would still rather have the mom and 3 kids physically in the store so that the kids can nag mom into buying unneeded items that are placed specifically to attract those kids' attention.
pixl97 · 1h ago
> so that the kids can nag mom into buying unneeded items

Excellent. Personal shoppers are 'adblock for IRL'.

>You owe the companies nothing. You especially don't owe them any courtesy. They have re-arranged the world to put themselves in front of you. They never asked for your permission, don't even start asking for theirs.

SoftTalker · 4h ago
We'll see more of this sort of thing as AI agents become more popular and capable. They will do things that the site or app should be able to do (or rather, things that users want to be able to do) but don't offer. The YouTube music playlist is a good example. One thing I'd like to be able to do is make a playlist of some specific artists. But you can't. You have to select specific songs.

If sites want to avoid people using agents, they should offer the functionality that people are using the agents to accomplish.

rapind · 4h ago
I didn't use the word "problem". In fact I presented no opinion at all. I'm just pointing out that scale matters a lot. In fact, in tech, it's often the only thing that matters. It's naive (or narrative) to think it doesn't.

Everyone having a personal shopper obviously changes the relationship to the products and services you use or purchase via personal shopper. Good, bad, whatever.

mbrumlow · 5h ago
Well then. Seems like you would be a fool to not allow personal shoppers then.

The point is the web is changing, and people use a different type of browser now. And that browser happens to be LLMs.

Anybody complaining about the new browser has just not got it yet, or has and is trying to keep things the old way because they don't know how or won't change with the times. We have seen it before: Kodak, Blockbuster, whatever.

Grow up, Cloudflare; some of your business models don't make sense any more.

julkali · 5h ago
Do not conflate your own experience with everyone else's.
goatlover · 4h ago
Some people use LLMs to search. Other people still prefer going to the actual websites. I'm not going to use an LLM to give me a list of the latest HN posts or NY Times articles, for example.
ToucanLoucan · 5h ago
> Anybody complaining about the new browser has just not got it yet, or has and is trying to keep things the old way because they don’t know how or won’t change with the times. We have seen it before, Kodak, blockbuster, whatever.

You say this as though all LLM/otherwise automated traffic is for the purpose of fulfilling a request made by a user 100% of the time, which is just flatly, on its face, untrue.

Companies make vast numbers of requests for indexing purposes. That could be to facilitate user requests someday, perhaps, but that is not why it's happening today. And worse still, LLMs introduce a new third option: that it's not for indexing or for later linking, but is instead either for training the language model itself, or for the model to ingest and regurgitate later on with no attribution, with the added fun that it might just make some shit up about whatever you said and be wrong. And as the person buying the web hosting, all of that is subsidized by me.

"The web is changing" does not mean every website must follow suit. Since I built my blog about 2 internet eternities ago, I have seen fad tech come and fad tech go. My blog remains more or less exactly what it was 2 decades ago, with more content and a better stylesheet. I have requested in my robots.txt that my content not be used for LLM training, and I fully expect that to be ignored because tech bros don't respect anyone, even fellow tech bros, when it means they have to change their behavior.
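A robots.txt request like the one described might look as follows. GPTBot, CCBot, and PerplexityBot are tokens the respective crawler operators publish, but, as the comment anticipates, honoring them is entirely voluntary on the crawler's side:

```text
# robots.txt: ask known AI crawlers not to fetch anything
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

A crawler that ignores the file (or lies about its user agent, as the article alleges) is unaffected; this is a request, not an access control.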

Imustaskforhelp · 4h ago
Tech bros just respect money. Making money is very easy in the short term if you don't show ethics. Venture capital and the whole growth/indie-hacking scene are focused on making money and making it fast.

It's a clear road to disaster. By comparison, I am honestly surprised by how great Hacker News is, where most people are sharing things for the love of the craft. And for that Hacker News holds a special place in my heart. (Slightly exaggerating to give it a thematic ending, I suppose.)

bradleyjg · 5h ago
It’s possible to violate all sorts of social norms. Societies that celebrate people that do so are on the far opposite end of the spectrum from high trust ones. They are rather unpleasant.
pixl97 · 38m ago
Oh, this is a bunch of baloney.

What you've pretty much stated is "You must go to the shops yourself so the ads and marketing can completely permeate your soul, and turn you into a voracious consumer."

Businesses have the right to fuck completely and totally off a cliff, taking their investor class with them into the pit of the void. They leer at us from high places, spending countless dollars on new ways to tell us we aren't good enough.

ToucanLoucan · 5h ago
Just the Silicon Valley ethos extended to its logical conclusions. These companies take advantage of public space, utilities, and goodwill at industrial scale to "move fast and break things", and then everyone else has to deal with the ensuing consequences. Like how cities are awash in those fucking electric scooters now.

Mind you I'm not saying electric scooters are a bad idea, I have one and I quite enjoy it. I'm saying we didn't need five fucking startups all competing to provide them at the lowest cost possible just for 2/3s of them to end up in fucking landfills when the VC funding ran out.

SoftTalker · 4h ago
My city impounded them and made them pay a fee to get them back. Now they have to pay a fee every year to be able to operate. Win/win.
sublinear · 4h ago
Do we really want a "high trust" society? That sounds awful.
bradleyjg · 4h ago
A place where you can lose your wallet and get it back with all the cash inside.

The horror!!

sublinear · 4h ago
Forget the wallet. A high trust society is a place where people get swindled all the time in various ways. Any and all trust is inherently bad and naive.
arrowsmith · 4h ago
No, you're describing a low-trust society.

Please learn what words mean before you comment on them.


ghurtado · 4h ago
> Any and all trust is inherently bad and naive.

Sounds like the final lesson from a lifetime of successful personal relationships.


fireflash38 · 4h ago
That's a very sad and lonely way to live.
sublinear · 4h ago
I don't think we're talking about the same thing.
sensanaty · 3h ago
That's quite literally the opposite of what high trust means...
arrowsmith · 4h ago
Go spend some time in Brazil or South Africa or other places where no-one trusts anyone (for good reasons), then report back.
immibis · 2h ago
High trust is prima facie incompatible with capitalism. If you want a high trust society, you don't want capitalism. Capitalism is inherently low trust because in capitalism, taking advantage of every edge you can is ultimately a matter of life and death, and that includes deceit. If the penalty for deceit was greater than the penalty for non-deceit then you could have a high-trust society.
Ray20 · 54m ago
> High trust is prima facie incompatible with capitalism

Quite compatible

> If you want a high trust society, you don't want capitalism.

There is nothing at all in capitalism that would prevent a high level of trust in society.

> Capitalism is inherently low trust

But that's not true. The thing about capitalism is that it's RESILIENT to low trust. It does not require low levels of trust, but it is capable of functioning in such conditions.

> If the penalty for deceit was greater than the penalty for non-deceit

Who are the judges? Capitalism is the most resistant to deception, deceivers under capitalism receive fewer benefits than under any other economic system. Simply because capitalism is based on the premise that people cheat, act out of greed, try to get the most for themselves at the expense of others. These qualities exist in people regardless of the existence of capitalism, it is just that capitalism ensures prosperity in society even when people have these qualities.

sublinear · 2h ago
Why bring up capitalism? I don't get it. What's stopping people from lying and cheating under any other system?
dgshsg · 1h ago
When lying and cheating doesn't get you ahead, there is no reason to do it.
Ray20 · 48m ago
The problem is that without capitalism ONLY lying and cheating will get you ahead. Look at ANY country that builds its economy on the restriction of people's economic freedom, on the absence of private property rights - these are the most deceitful and disgusting regimes in the world with zero level of public trust.
542354234235 · 5h ago
True, and I would ask, what is your point? Is it that no rule can have 100% perfect enforcement? That all rules have a grey area if you look close enough? Was it just a "gotcha" statement meant to insinuate what the prior commenter said was invalid?
ghurtado · 4h ago
Sure. There's lots of things you could do, but you don't do them because they are wrong.

Might does not make right.

rjbwork · 1m ago
How is it wrong to send my personal shopper? How is it wrong to have an agent act directly on my behalf?

It's like saying a web browser that is customized in any way is illegal. If one configures their browser to eagerly load links so that their next click is instant, is that now illegal and wrong?

fireflash38 · 4h ago
And you can be trespassed and prosecuted if you continue to violate.
itsdesmond · 5h ago
[flagged]
dang · 1h ago
Whoa, please don't post like this. We end up banning accounts that do.

https://news.ycombinator.com/newsguidelines.html

cma · 22m ago
These are more like a store putting up a billboard or catalog and asking people to turn off their meta AI glasses nearby because the store doesn't want AI translating it on your behalf as a tourist.
jasonjmcghee · 6h ago
I think it's an issue of scale.

The next step in your progression here might be:

If / when people have personal research bots that go and look for answers across a number of sites, requesting many pages much faster than humans do - what's the tipping point? Is personal web crawling OK? What if it gets a bit smarter and tries to anticipate what you'll ask, and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)? Or is it when you tip the scale further and do general / mass crawling for many users to consume that it becomes a problem?

fxtentacle · 5h ago
Maybe we should just institutionalize and explicitly legalize the Internet Archive and Archive Team. Then, I can download a complete and halfway current crawl of domain X from the IA and that way, no additional costs are incurred for domain X.

But of course, most website publishers would hate that. Because they don't want people to access their content, they want people to look at the ads that pay them. That's why to them, the IA crawling their website is akin to stealing. Because it's taking away some of their ad impressions.

palmfacehn · 5h ago
https://commoncrawl.org/

>Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

stanmancan · 5h ago
I have mixed feelings on this.

Many websites (especially the bigger ones) are just businesses. They pay people to produce content, hopefully make enough ad revenue to make a profit, and repeat. Anything that reproduces their content and steals their views has a direct effect on their income and their ability to stay in business.

Maybe IA should have a way for websites to register to collect payment for lost views or something. I think it’s negligible now, there are likely no websites losing meaningful revenue from people using IA instead, but it might be a way to get better buy in if it were institutionalized.

ivape · 5h ago
Or websites can monetize their data via paid apis and downloadable archives. That's what makes Reddit the most valuable data trove for regular users.
ccgreg · 1h ago
I don't think Reddit pays the people who voluntarily write Reddit content. Valuable to Reddit, I guess.
cj · 5h ago
Doesn't o3 sort of already do this? Whenever I ask it something, it makes it look like it simultaneously opens 3-8 pages (something a human can't do).

Seems like a reasonable stance would be something like "Following the no crawl directive is especially necessary when navigating websites faster than humans can."

> What if it gets a bit smarter and tried to anticipate what you'll ask and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)?

To be fair, Google Chrome already (somewhat) does this by preloading links it thinks you might click, before you click them.

But your point is still valid. We tolerate it because as website owners, we want our sites to load fast for users. But if we're just serving pages to robots and the data is repackaged to users without citing the original source, then yea... let's rethink that.
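For reference, the classic form of that preloading behavior is a one-line resource hint that a page can embed (the href here is illustrative; Chrome's newer speculative loading uses the Speculation Rules API, but the idea is the same):

```html
<!-- Hint: fetch this page during idle time, before the user clicks the link -->
<link rel="prefetch" href="/next-article.html">
```

Either way, the site sees requests for pages no human ever chose to open, which is exactly the gray area being debated.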

fauigerzigerk · 2h ago
>Doesn't o3 sort of already do this?

ChatGPT probably uses a cache though. Theoretically, the average load on the original sites could be far less than users accessing them directly.

Spivak · 5h ago
You don't middle click a bunch of links when doing research? Of all the things to point to I wouldn't have thought "opens a bunch of tabs" to be one of the differentiating behaviors between browsing with Firefox and browsing with an LLM.
tr_user · 4h ago
I saw someone suggest in another post that if only one crawler were visiting and scraping, and everyone else reused that copy, most websites would be OK with it. But the problem is every billionaire-backed startup draining your resources with something similar to a DoS attack.
bobbiechen · 6h ago
I like the terminology "crawler" vs. "fetcher" to distinguish between mass scraping and something more targeted as a user agent.

I've been working on AI agent detection recently (see https://stytch.com/blog/introducing-is-agent/ ) and I think there's genuine value in website owners being able to identify AI agents to e.g. nudge them towards scoped access flows instead of fully impersonating a user with no controls.

On the flip side, the crawlers also have a reputational risk here, where anyone can slap on the user agent string of a well-known crawler and do bad things like ignoring robots.txt. The standard solution today is a reverse DNS lookup on the IPs, but that's a pain for website owners too, compared with more aggressively blocking all unusual setups.
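The reverse-DNS check mentioned above is usually done as forward-confirmed reverse DNS (FCrDNS). A minimal sketch, assuming Python; the hostnames, IPs, and the `stub` helper are illustrative stand-ins for real DNS answers:

```python
import socket

def verify_crawler_ip(ip, allowed_suffixes,
                      reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                      forward=lambda host: socket.gethostbyname_ex(host)[2]):
    """Forward-confirmed reverse DNS: reverse-resolve the IP, check the
    PTR hostname falls under a domain the crawler operator publishes,
    then forward-resolve that hostname and confirm it maps back to the
    same IP. A spoofed User-Agent fails the domain check or the round trip."""
    try:
        hostname = reverse(ip)
    except OSError:
        return False
    if not any(hostname == s or hostname.endswith("." + s)
               for s in allowed_suffixes):
        return False
    try:
        addresses = forward(hostname)
    except OSError:
        return False
    return ip in addresses

# Offline stand-ins for real DNS answers, so the logic can be exercised
# without network access (real use would keep the socket defaults):
def stub(table):
    def lookup(key):
        if key not in table:
            raise OSError("NXDOMAIN")
        return table[key]
    return lookup

ptr = stub({"66.249.66.1": "crawl-66-249-66-1.googlebot.com"})
fwd = stub({"crawl-66-249-66-1.googlebot.com": ["66.249.66.1"]})

print(verify_crawler_ip("66.249.66.1", ["googlebot.com"],
                        reverse=ptr, forward=fwd))   # True
print(verify_crawler_ip("203.0.113.9", ["googlebot.com"],
                        reverse=ptr, forward=fwd))   # False: no PTR record
```

In practice a site would also cache these results, since two DNS lookups per request is itself the kind of pain the comment alludes to.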

fxtentacle · 6h ago
prompt: I'm the celebrity Bingbing, please check all Bing search results for my name to verify that nobody is using my photo, name, or likeness without permission to advertise skin-care products except for the following authorized brands: [X,Y,Z].

That would trigger an internet-wide "fetch" operation. It would probably upset a lot of people and get your AI blocked by a lot of servers. But it's still in direct response to a user request.

randall · 6h ago
A/ i love this distinction.

B/ my brother used to use "fetcher" as a non-swear for "fucker"

Vinnl · 6h ago
Did you tell him to stop trying to make fetcher happen?
handfuloflight · 1h ago
Very funny. Now let's hear Paul Allen's joke.
sejje · 6h ago
He picked up that habit in Balmora.
skeledrew · 3h ago
Yet another side to that is when site owners serve qualitatively different content based on the distinction. No, I want my LLM agent to access the exact content I'd be accessing manually, and then any further filtering, etc is done on my end.
yojo · 6h ago
Ads are a problematic business model, and I think your point there is kind of interesting. But AI companies disintermediating content creators from their users is NOT the web I want to replace it with.

Let’s imagine you have a content creator that runs a paid newsletter. They put in lots of effort to make well-researched and compelling content. They give some of it away to entice interested parties to their site, where some small percentage of them will convert and sign up.

They put the information up under the assumption that viewing the content and seeing the upsell are inextricably linked. Otherwise there is literally no reason for them to make any of it available on the open web.

Now you have AI scrapers, which will happily consume and regurgitate the work, sans the pesky little call to action.

If AI crawlers win here, we all lose.

bee_rider · 6h ago
I think it’s basically impossible to prevent AI crawlers. It is like video game cheating: at the extreme they could literally point a camera at the screen, have it do image processing, and talk to the computer through the USB port, emulating a mouse and keyboard from outside the machine. They don’t do that, of course, because it is much easier to do it all in software, but that is the ultimate circumvention of any attempt to block them out that doesn’t also block out humans.

I think the business model for “content creating” is going have to change, for better or worse (a lot of YouTube stars are annoying as hell, but sure, stuff like well-written news and educational articles falls under this umbrella as well, so it is unfortunate that they will probably be impacted too).

yojo · 5h ago
I don’t subscribe to technological inevitabilism.

Cloudflare banning bad actors has at least made scraping more expensive, and changes the economics of it - more sophisticated deception is necessarily more expensive. If the cost is high enough to force entry, scrapers might be willing to pay for access.

But I can imagine more extreme measures. e.g. old web of trust style request signing[0]. I don’t see any easy way for scrapers to beat a functioning WOT system. We just don’t happen to have one of those yet.

0: https://en.m.wikipedia.org/wiki/Web_of_trust

bee_rider · 2h ago
> Cloudflare banning bad actors has at least made scraping more expensive, and changes the economics of it - more sophisticated deception is necessarily more expensive. If the cost is high enough to force entry, scrapers might be willing to pay for access.

I think this might actually point at the end state. Scraping bots will eventually get good enough to emulate a person well enough to be indistinguishable (are we there yet?). Then, content creators will have to price their content appropriately. Have a Patreon, for example, where articles are priced at the price where the creator is fine with having people take that content and add it to the model. This is essentially similar to studios pricing their content appropriately… for Netflix to buy it and broadcast it to many streaming users.

Then they will have the problem of making sure their business model is resistant to non-paying users. Netflix can’t stop me from pointing a camcorder at my TV while playing their movies, and distributing it out like that. But, somehow, that fact isn’t catastrophic to their business model for whatever reason, I guess.

Cloudflare can try to ban bad actors. I’m not sure if it is Cloudflare, but as someone who usually browses without JavaScript enabled, I often bump into “maybe you are a bot” walls. I recognize that I’m weird for not running JavaScript, but eventually their filters will have the problem where the net that captures bots also captures normal people.

skeledrew · 2h ago
Then personal key sharing will become a thing, similar to BugMeNot et al.
immibis · 2h ago
Beating web of trust is actually pretty easy: pay people to trust you.

Yes, you can identify who got paid to sign a key and ban them. They will create another key, go to someone else, pretend to be someone not yet signed up for WoT (or pay them), and get their new key signed, and sign more keys for money.

So many people will agree to trust for money, and accountability will be so diffuse, that you won't be able to ban them all. Even you, a site operator, would accept enough money from OpenAI to sign their key, for a promise the key will only be used against your competitor's site.

It wouldn't take a lot to make a binary-or-so tree of fake identities, with exponential fanout, and get some people to trust random points in the tree, and use the end nodes to access your site.

Heck, we even have a similar problem right now with IP addresses, and not even with very long trust chains. You are "trusted" by your ISP, who is "trusted" by one of the RIRs or from another ISP. The RIRs trust each other and you trust your local RIR (or probably all of them). We can trace any IP to see who owns it. But is that useful, or is it pointless because all actors involved make money off it? You know, when we tried making IPs more identifying, all that happened is VPN companies sprang up to make money by leasing non-identifying IPs. And most VPN exits don't show up as owned by the VPN company, because they'd be too easy to identify as non-identifying. They pay hosting providers to use their IPs. Sometimes they even pay residential ISPs so you can't even go by hosting provider. The original Internet was a web of trust (represented by physical connectivity), but that's long gone.

Spivak · 5h ago
It is inevitable, not because of some technological predestination but because if these services get hard-blocked and unable to perform their duties they will ship the agent as a web browser or browser add-on just like all the VSCode forks and then the requests will happen locally through the same pipe as the user's normal browser. It will be functionally indistinguishable from normal web traffic since it will be normal web traffic.
subspeakai · 4h ago
This is the fascinating case of where I think this all goes - at some point costs come down, and you can do this and bypass everything.
hansvm · 6h ago
Ofttimes people are sufficiently anti-ad that this point won't resonate well. I'm personally mostly in that camp, in that, with relatively few exceptions, money seems to make the parts of the web I care about worse (it's hard to replace passion, and wading through SEO-optimized AI drivel to find a good site is a lot of work). Giving them concrete examples of sites which would go away can help make your point.

E.g., Sheldon Brown's bicycle blog is something of a work of art and one of the best bicycle resources literally anywhere. I don't know the man, but I'd be surprised if he'd put in the same effort without the "brand" behind it -- thankful readers writing in, somebody occasionally using the donate button to buy him a coffee, people like me talking about it here, etc.

vertoc · 6h ago
But even your example gets worse with AI, potentially - the "upsell" of his blog isn't paid posts but more subscribers, so there will be thankful readers, a few donors, people talking about it. If the only interface becomes an AI summary of his work without credit, it's much more likely he stops writing, as it'll seem like he's just screaming into the void.
hansvm · 5h ago
I don't think we're disagreeing?
blacksmith_tb · 6h ago
Sheldon died in 2008, but there's no doubt that all the bicycling wisdom he posted lives on!
wulfstan · 5h ago
He's so widely respected that amongst those who repair bikes (I maintain a fleet of ~10 for my immediate family) he is simply known as "Saint Sheldon".
yojo · 5h ago
I agree that specific examples help, though I think the ones that resonate most will necessarily be niche. As a teen, I loved Penny Arcade, and watched them almost die when the bottom fell out of the banner-ad market.

Now, most of the value I find in the web comes from niche home-improvement forums (which Reddit has mostly digested). But even Reddit has a problem if users stop showing up from SEO.

bombela · 4h ago
> Sheldon Brown (July 14, 1944 – February 4, 2008)
shadowgovt · 5h ago
> Otherwise there is literally no reason for them to make any of it available on the open web

This is the hypothesis I always personally find fascinating in light of the army of semi-anonymous Wikipedia volunteers continuously gathering and curating information without pay.

If it became functionally impossible to upsell a little information for more paid information, I'm sure some people would stop creating information online. I don't know if it would be enough to fundamentally alter the character of the web.

Do people (generally) put things online to get money or because they want it online? And is "free" data worse quality than data you have to pay somebody for (or is the challenge more one of curation: when anyone can put anything up for free, sorting high- and low-quality based on whatever criteria becomes a new kind of challenge?).

Jury's out on these questions, I think.

yojo · 4h ago
Any information that requires something approximating a full-time job worth of effort to produce will necessarily go away, barring the small number of independently wealthy creators.

Existing subject-matter experts who blog for fun may or may not stick around, depending on what part of it is “fun” for them.

While some must derive satisfaction from increasing the total sum of human knowledge, others are probably blogging to engage with readers or build their own personal brand, neither of which is served by AI scrapers.

Wikipedia is an interesting case. I still don’t entirely understand why it works, though I think it’s telling that 24 years later no one has replicated their success.

ndriscoll · 50m ago
OpenStreetMap is basically Wikipedia for maps and is quite successful. Over 10M registered users and millions of edits per day. Lots of information is also shared online on forums for free. The hosting (e.g. reddit) is basically commodity that benefits from network effects. The information is the more interesting bit, and people share it because they feel like it.
SoftTalker · 4h ago
Wikipedia works for the same reason open-source does: because most of the contributors are experts in the subject and have paid jobs in that field. Some are also just enthusiasts.
fxtentacle · 6h ago
Maybe, on a social level, we all win by letting AI ruin the attention economy:

The internet is filled with spam. But if you talk to one specific human, your chance of getting a useful answer rises massively. So in a way, a flood of written AI slop is making direct human connections more valuable.

Instead of having 1000+ anonymous subscribers for your newsletter, you'll have a few weekly calls with 5 friends each.

danieldk · 5h ago
There are also a gazillion pages that are not ad-riddled content. With search engines, the implicit contract was that they could crawl pages because they would drive traffic to the websites that are crawled.

AI crawlers for non-open models void the implicit contract. First they crawl the data to build a model that can do QA. Proprietary LLM companies earn billions with knowledge that was crawled from websites and websites don't get anything in return. Fetching for user requests (to feed to an LLM) is kind of similar - the LLM provider makes a large profit and the author that actually put in time to create the content does not even get a visit anymore.

Besides that, if Perplexity is fine with evading robots.txt and blocks for user requests, how can one expect them not to use the fetched pages to train/finetune LLMs (as a side channel when people block crawling for training)?

bigbuppo · 22m ago
Right, but the LLM isn't really being used for that. It's being used for marketing and advertising purposes most of the time. The AI companies also let you play with it from time to time so you'll be a shill for them, but mostly it's the advertising people you claim to not like.
johnfn · 6h ago
Unless I am misunderstanding you, you are talking about something different than the article. The article is talking about web-crawling. You are talking about local / personal LLM usage. No one has any problems with local / personal LLM usage. It's when Perplexity uses web crawlers that an issue arises.
lukeschlather · 6h ago
You probably need a computer that costs $250,000 or more to run the kind of LLM that Perplexity uses, but with batching it costs pennies to have the same LLM fetch a page for you, summarize the content, and tell you what is on it. Power usage is similar: running the LLM for a single user costs a huge amount relative to what it takes in a cloud environment serving many users.

Perplexity's "web crawler" is mostly operating like this on behalf of users, so they don't need a massively expensive computer to run an LLM.

troyvit · 5h ago
> If I now go one step further and use an LLM to summarize content because the authentic presentation is so riddled with ads, JavaScript, and pop-ups, that the content becomes borderline unusable, then why would the LLM accessing the website on my behalf be in a different legal category as my Firefox web browser accessing the website on my behalf?

I think one thing to ask outside of this question is how long before your LLM summaries don't also include ads and other manipulative patterns.

jabroni_salad · 1h ago
If it was just one human requesting one summary of the page, nobody would ever notice. The typical threshold for tolerating junk traffic was already pretty high.

I have a dinky little txt site on my email domain. There is nothing of value on it, and the content changes less than once a year. So why are AI scrapers hitting it to the tune of dozens of GB per month?

fluidcruft · 5h ago
In theory, couldn't the LLM access the content in your browser and its cache, rather than interacting with the website directly? Browser automation directly related to user activity (prefetch etc.) seems qualitatively different to me. Similarly, refusing to download content, or modifying content after it's already in my browser, is also qualitatively different. That all seems fair-use-y. I'm not sure there's a technical solution beyond the typical cat/mouse wars... but there is a smell when a datacenter pretends to be a person. That's not a browser.

It could be a personal knowledge management system, but it seems like knowledge management systems should be operating off of things you already have. The research library down the street isn't considered a "personal knowledge management system" in any sense of the term, if you know what I mean. If you dispatch an army of minions to take notes on the library's contents, that doesn't seem personal. Similarly if you dispatch the army of minions to a bookstore rather than a library. At the very least, bring the item into your house/office first. (Libraries are a little different because they are designed for studying and taking notes; it's the army-of-minions aspect that's the problem.)

pavon · 1h ago
Question from a non-web-developer. In case 3, would it be technically possible for Perplexity's website to fetch the URL in question using javascript in the user's browser, and then send it to the server for LLM processing, rather than have the server fetch it? Or do cross-site restrictions prevent javascript from doing that?
Vegenoid · 4h ago
There is a significant distinction between 2 and 3 that you glossed over. In 1 and 2, you the human may be forced to prove that you are human via a captcha. You are present at the time of the request. Once you’ve performed the exchange, then the HTML is on your computer and so you can do what you want to it.

In 3, although you do not specify, I assume you mean that a bot requests the page, as opposed to you visiting the page like in scenario 2 and then an LLM processes the downloaded data (similarly to an adblocker). It is the former case that is a problem, the latter case is much harder to stop and there is much less reason to stop it.

This is the distinction: is a human present at the time of request.

philistine · 4h ago
To me it's even simpler: 3 is a request made from another IP address that isn't directly yours. Why should an LLM request that acts exactly like a VPN request be treated differently from a VPN request?
zeta0134 · 6h ago
If the LLM were running this sort of thing at the user's explicit request this would be fine. The problem is training. Every AI startup on the planet right now is aggressively crawling everything that will let them crawl. The server isn't seeing occasional summaries from interested users, but thousands upon thousands of bots repeatedly requesting every link they can find as fast as they can.
fxtentacle · 6h ago
Then what if I ask the LLM 10 questions about the same domain and ask it to research further? Any human would then click through 50-100 articles to make sure they know what that domain contains. If that part is automated by using an LLM, does that change anything legally? How many page URLs do you think one should be allowed to access per LLM prompt?
zeta0134 · 6h ago
All of them. That's at the explicit request of the user. I'm not sure where the downvotes are coming from, since I agree with all of these points. The training thing has merely pissed off lots of server operators already, so they quite reasonably tend to block first and ask questions later. I think that's important context.
hombre_fatal · 6h ago
TFA isn’t talking about crawling to harvest training data.

It’s talking about Perplexity crawling sites on demand in response to user queries, and complaining that no, it’s not fine; hence this thread.

cjonas · 6h ago
Doesn't perplexity crawl to harvest and index data like a traditional search engine? Or is it all "on demand"?
lukeschlather · 6h ago
For the most part I would assume they pay for access to Google or Bing's index. I also assume they don't really train models. So all their "crawling" is on behalf of users.
mnmalst · 6h ago
But that's not what this article is about. From, what I understand, this articles is about a user requesting information about a specific domain and not general scraping.
ai-christianson · 6h ago
Websites should be able to request payment. Who cares if it is a human or an agent of a human if it is paying for the request?
adriand · 1h ago
Cloudflare launched a mechanism for this: https://blog.cloudflare.com/introducing-pay-per-crawl/
sds357 · 2h ago
What if the agent is reselling the request?
carlosjobim · 4h ago
They are able to request payment.
axus · 4h ago
For 1, 2, and 3, the website owner can choose to block you completely based on IP address or your User Agent. It's not nice, but the best reaction would be to find another website.

Perplexity is choosing to come back "on a VPN" with new IP addresses to evade the block.

#2 and #3 are about modifying data where access has been granted, I think Cloudflare is really complaining about #1.

Evading an IP address ban doesn't violate my principles in some cases, and does in others.
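
The blocking described here is easy when the agent is honest about who it is. A minimal sketch in Python; the UA substring is Perplexity's documented crawler name, but the blocked network is an illustrative example range, not Perplexity's actual published one:

```python
import ipaddress

# Declared crawler identities to block (UA substring match, case-insensitive).
BLOCKED_UA_SUBSTRINGS = ["PerplexityBot"]
# Example network only (TEST-NET-1), standing in for a published crawler range.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

def should_block(remote_ip: str, user_agent: str) -> bool:
    """Block a request if it matches a declared bot UA or a known bot network."""
    if any(s.lower() in user_agent.lower() for s in BLOCKED_UA_SUBSTRINGS):
        return True
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

Of course, the article's whole point is that this check stops working the moment the agent rotates to unlisted IPs and a browser user-agent string; at that point the request is indistinguishable from case 2.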

dabockster · 4h ago
> If I now go one step further and use an LLM to summarize content because the authentic presentation is so riddled with ads, JavaScript, and pop-ups, that the content becomes borderline unusable, then why would the LLM accessing the website on my behalf be in a different legal category as my Firefox web browser accessing the website on my behalf?

Because the LLM is usually on a 3rd party cloud system and ultimately not under your full control. You have no idea if the LLM is retaining any of that information for that business's own purposes beyond what a EULA says - which basically amounts to a pinky swear here. Especially if that LLM is located across international borders.

Now, for something like Ollama or LMStudio where the LLM and the whole toolchain is physically on your own system? Yeah that should be like Firefox legally since it's under your control.

dawnerd · 5h ago
Nothing wrong if they fetch on your behalf. The problem is when they endlessly crawl along with every other ai company doing the same.
snihalani · 43m ago
you are paying for the LLM but not paying for the website. The LLM is removing the power the website had. Legally, that's grounds to claim loss of income
talos_ · 5h ago
This analogy doesn't map to the actual problem here.

Perplexity is not visiting a website everytime a user asks about it. It's frequently crawling and indexing the web, thus redirecting traffic away from websites.

This crawling reduces costs and improves latency for Perplexity and its users. But it's a major threat to the crawled websites.

shadowgovt · 5h ago
I have never created a website that I would not mind being fully crawled and indexed into another dataset that was divorced from the source (other than such divorcement makes it much harder to check pedigree, which is an academic concern, not a data-content concern: if people want to trust information from sources they can't know and they can't verify I can't fix that for them).

In fact, the "old web" people sometimes pine for was mostly a place where people were putting things online so they were online, not because it would translate directly to money.

Perhaps AI crawlers are a harbinger for the death of the web 2.0 pay-for-info model... And perhaps that's okay.

short_sells_poo · 4h ago
There's an important distinction that we are glossing over I think. In the times of the "old web", people were putting things online to interact with a (large) online audience. If people found your content interesting, they'd keep coming back and some of them would email you, there'd be discussions on forums, IRC chatrooms, mailing lists, etc. Communities were built around interesting topics, and websites that started out as just some personal blog that someone used to write down their thoughts would grow into fonts of information for a large number of people.

Then came the social networks and walled gardens, SEO, and all the other cancer of the last 20 years and all of these disappeared for un-searchable videos, content farms and discord communities which are basically informational black holes.

And now AI is eating that cancer, but IMO it's just one cancer being replaced by an even more insidious cancer. If all the information is accessed via AI, then the last semblance of interaction between content creators and content consumers disappears. There are no more communities, just disconnected consumers interacting with a massive aggregating AI.

Instead of discussing an interesting topic with a human, we will discuss with AI...

jpadkins · 3h ago
> 1. If I as a human request a website, then I should be shown the content. Everyone agrees.

I disagree. The website should have the right to say that the user can be shown the content under specific conditions (usage terms, presented how they designed, shown with ads, etc). If the software can't comply with those terms, then the human shouldn't be shown the content. Both parties did not agree in good faith.

dgshsg · 34m ago
You want the website to be able to force the user to see ads?
jpadkins · 11m ago
no, I think in a fair + just world, both parties agree before they transact. There is no force in either direction (don't force creators to give their content on terms they don't want, don't force users to view ads they don't want). It's perfectly fine if people with strict preferences don't match. It's a big web; there are plenty of creators and consumers.

If the user doesn't want to view content with ads, that's okay and they can go elsewhere.

sussmannbaka · 2h ago
4. If I now go one step further and use a commercial DDoS service to make the GET requests for me (because this comparison is already a stretch), then why would the DDoS provider accessing the website on my behalf be in a different legal category than my Firefox web browser accessing the website on my behalf?
GardenLetter27 · 5h ago
And isn't the obvious solution to just make some sort of browsers add-on for the LLM summary so the request comes from your browser and then gets sent to the LLM?

I think the main concern here is the huge amount of traffic from crawling just for content for pre-training.

otterley · 5h ago
Why would a personal browser have to crawl fewer pages than the agent’s mechanism? If anything, the agent would be more efficient because it could cache the content for others to use. In the situation we’re talking about, the AI engine is behaving essentially like a caching proxy—just like a CDN.
Tuna-Fish · 5h ago
I would not mind 3, so long as it's just the LLM processing the website inside its context window, and no information from the website ends up in the weights of the model.
Spacecosmonaut · 6h ago
Regarding point 3: The problem from the perspective of websites would not be any different if they had been completely ad-free. People would still consume LLM-generated summaries because they cut down clicks and eyeballing to present you information that directly pertains to the prompt.

The whole concept of a "website" will simply become niche. How many zoomers still visit any but the most popular websites?

Neil44 · 5h ago
Flip it around, why would you go to the trouble of creating a web page and content for it, if some AI bot is going to scrape it and save people the trouble of visiting your site? The value of your work has been captured by some AI company (by somewhat nefarious means too).
amiga386 · 4h ago
If you as a human are well behaved, that is absolutely fine.

If you as a human spam the shit out of my website and waste my resources, I will block you.

If you as a human use an agent (or browser or extension or external program) that modifies network requests on your behalf, but doesn't act as a massive leech, you're still welcome.

If you as a human use an agent (or browser or extension or external program) that wrecks my website, I will block you and the agent you rode in on.

Nobody would mind if you had an LLM that intelligently knew what pages contain what (because it had a web crawler backed index that refreshes at a respectful rate, and identifies itself accurately as a robot and follows robots.txt), and even if it needed to make an instantaneous request for you at the time of a pertinent query, it still identified itself as a bot and was still respectful... there would be no problem.

The problem is that LLMs are run by stupid, greedy, evil people who don't give the slightest shit what resources they use up on the hosts they're sucking data from. They don't care what the URLs are, what the site owner wants to keep you away from. They download massive static files hundreds or thousands of times a day, not even doing a HEAD to see that the file hasn't changed in 12 years. They straight up ignore robots.txt and in fact use it as a template of what to go for first. It's like hearing an old man say "I need time to stand up because of this problem with my kneecaps" and thinking "right, I best go for his kneecaps because he's weak there"
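
The respectful behavior described above is not hard to implement. A stdlib-Python sketch of what a well-behaved fetcher looks like; the bot name is a placeholder, and a real crawler would also rate-limit and cache:

```python
import urllib.error
import urllib.request
import urllib.robotparser

# Hypothetical bot identity; a well-behaved agent declares itself honestly.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot)"

def load_robots(lines):
    """Parse a robots.txt (given as a list of lines) into a RobotFileParser."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(lines)
    return rp

def polite_fetch(rp, url, last_etag=None):
    """Fetch url only if robots.txt allows it, using a conditional request."""
    if not rp.can_fetch(USER_AGENT, url):
        return None  # the site asked bots to stay away from this URL
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    if last_etag:
        # If the server still has this ETag, it answers 304 with no body,
        # so an unchanged file costs almost nothing to re-check.
        req.add_header("If-None-Match", last_etag)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return b"", last_etag  # unchanged since the last visit
        raise
```

That's the entire ask: identify yourself, honor the site's directives, and don't re-download what hasn't changed.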

There are plenty of open crawler datasets, they should be using those... but they don't, they think that doesn't differentiate them enough from others using "fresher" data, so they crawl even the smallest sites dozens of times a day in case those small sites got updated. Their badly written software is wrecking sites, and they don't care about the wreckage. Not their problem.

The people who run these agents, LLMs, whatever, have broken every rule of decency in crawling, and they're now deliberately evading checks, to try and run away from the repercussions of their actions. They are bad actors and need to be stopped. It's like the fuckwads who scorch the planet mining bitcoin; there's so much money flowing in the market for AI, that they feel they have to fuck over everyone else, as soon as possible, otherwise they won't get that big flow of money. They have zero ethics. They have to be stopped before their human behaviour destroys the entire internet.

baxuz · 5h ago
1. To access a website you need a limited anonymized token that proves you are a human being, issued by a state authority

2. the end

I am firmly convinced that this should be the future in the next decade, since the internet as we know it has been weaponized and ruined by social media, bots, state actors and now AI.

There should exist an internet for humans only, with a single account per domain.

glenstein · 4h ago
A fascinating variation on this same issue can be found in Neal Stephenson's "Fall, or Dodge in Hell". There the solution is (1) discredit weaponized social media in its entirety by amplifying its output exponentially and making its hostility universal in all directions, to the point that it's recognizable as bad-faith caricature. That way it can't be strategically leveraged with disproportionate directional focus against strategic targets by bad actors. And (2) a new standard called PURDA, which is a kind of behavioral signature as the mark of unique identity.
paulcole · 4h ago
> 1. If I as a human request a website, then I should be shown the content. Everyone agrees.

Definitely don't agree. I don't think you should be shown the content, if for example:

1. You're in a country the site owner doesn't want to do business in.

2. You've installed an ad blocker or other tool that the site owner doesn't want you to use.

3. The site owner has otherwise identified you as someone they don't want visiting their site.

You are welcome to try to fool them into giving you the content but it's not your right to get it.

porridgeraisin · 5h ago
I don't think people have a problem with an LLM issuing GET website.com and then summarising that, each and every time it uses that information (or at least saving a citation to it and referring to that citation). The ad ecosystem is the exception; ignoring them for now, please refer to the last paragraph.

The problem is with the LLM then training on that data _once_ and then storing it forever and regurgitating it N times in the future without ever crediting the original author.

So far, humans themselves did this, but only for relatively simple information (ratio of rice and water in specific $recipe). You're not gonna send a link to your friend just to see the ratio, you probably remember it off the top of your head.

Unfortunately, the top of an LLMs head is pretty big, and they are fitting almost the entire website's content in there for most websites.

The threshold beyond which it becomes irreproducible for human consumers, and therefore copyrightable (a lot of copyright law has a "reasonable" standard which refers to this same concept), has now shifted many, many times higher.

Now, IMO:

So far, for stuff that won't fit in someone's head, people were using citations (academia, for example). LLMs should also use citations. That solves the ethical problem pretty much. That the ad ecosystem chose views as the monetisation point and is thus hurt by this is not anyone else's problem. The ad ecosystem can innovate and adjust to the new reality in their own time and with their own effort. I promise most people won't be waiting. Maybe google can charge per LLM citation. Cost Per Citation, you even maintain the acronym :)

wulfstan · 5h ago
Yes, this is the crux of the matter.

The "social contract" that has been established over the last 25+ years is that site owners don't mind their site being crawled reasonably provided that the indexing that results from it links back to their content. So when AltaVista/Yahoo/Google do it and then score and list your website, interspersing that with a few ads, then it's a sensible quid pro quo for everyone.

LLM AI outfits are abusing this social contract by stuffing the crawled data into their models, summarising/remixing/learning from this content, claiming "fair use" and then not providing the quid pro quo back to the originating data. This is quite likely terminal for many content-oriented businesses, which ironically means it will also be terminal for those who will ultimately depend on additions, changes and corrections to that content - LLM AI outfits.

IMO: copyright law needs an update to mandate no training on content without explicit permission from the holder of the copyright of that content. And perhaps, as others have pointed out, an llms.txt to augment robots.txt that covers this for llm digestion purposes.

EDIT: Apparently llms.txt has been suggested, but from what I can tell this isn't about restricting access: https://llmstxt.org/
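
For context, the existing voluntary mechanism looks like this. Perplexity's documented crawler names are PerplexityBot and Perplexity-User, and the whole dispute is that requests from undeclared agents never match these rules:

```
# robots.txt: honored only by crawlers that choose to comply
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```

Any llms.txt-style successor would have the same enforcement gap: it's a request, not a control.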

giantrobot · 3h ago
> LLM AI outfits are abusing this social contract by stuffing the crawled data into their models, summarising/remixing/learning from this content

Let's be real, Google et al have been doing this for years with their quick answer and info boxes. AI chatbots are worse but it's not like the big search engines were great before AI came along. Google had made itself the one-stop shop for a huge percentage of users. They paid billions to be the default search engine on Apple's platforms not out of the goodness of their hearts but to be the main destination for everyone on the web.

skydhash · 5h ago
That’s why websites have no issues with googlebot and the search results. It’s a giant index and citation list. But stripping a work from its context and presenting it as your own has been decried throughout history.
nelblu · 5h ago
> LLMs should also use citations.

Mojeek LLM (https://www.mojeek.com) uses citations.

Workaccount2 · 6h ago
>2. If I as the human request the software on my computer to modify the content before displaying it, for example by installing an ad-blocker into my user agent, then that's my choice and the website should not be notified about it. Most users agree, some websites try to nag you into modifying the software you run locally.

If I put time and effort into a website and its content, I should expect no compensation despite bearing all costs.

Is that something everyone would agree with?

The internet should be entirely behind paywalls, besides content that is already provided ad free.

Is that something everyone would agree with?

I think the problem you need to be thinking about is "How can the internet work if no one wants to pay anything for anything?"

Bjartr · 6h ago
You're free to deny access to your site arbitrarily, including for lack of compensation.
ndiddy · 4h ago
This article is about Cloudflare attempting to deny Perplexity access to their demo site by blocking Perplexity's declared user-agent and official IP range. Perplexity responded to this denial by impersonating Google Chrome on macOS and rotating through IPs not listed in their published IP range to access the site anyway. This means it's not just "you're free to deny access to your site arbitrarily", it's "you're free to play a cat-and-mouse game indefinitely where the other side is a giant company with hundreds of millions of dollars in VC funding".
Bjartr · 56m ago
The comment I'm responding to established a slightly different context by asking a specific question about getting compensation from site visitors.
cjonas · 6h ago
Like for people or are using a ad block or for a crawler downloading your content so it can be used by an AI response?
Bjartr · 5h ago
Arbitrarily, as in for any reason. It's your site, you decide what constraints an incoming request must meet for it to get a response containing the content of your site.
Workaccount2 · 6h ago
>and the website should not be notified about it.
giantrobot · 3h ago
My user agent and its handling of your content once it's on my computer are not your concern. You don't need to know if the data is parsed by a screen reader, an AI agent, or just piped to /dev/null. It's simply not your concern and never will be.
nradov · 6h ago
Yes, I agree with that. If a website owner expects compensation then they should use a paywall.
Chris2048 · 6h ago
If I put time and effort into a food recipe should I (get) compensation?

the answer is apparently "no", and I don't really see how recipe books have suffered as a result of less gatekeeping.

"How will the internet work"? Probably better in some ways. There is plenty of valuable content on the internet given for free, it's being buried in low-value AI slop.

Workaccount2 · 5h ago
You understand that HN is ad supported too, right?
Chris2048 · 5h ago
No, I don't.

But what is your point? Is the value in HN primarily in its hosting, or the non-ad-supported community?

Workaccount2 · 3h ago
Outside of Wikipedia, I'm not sure what content you are thinking of.

Taking HN as a potential one of these places, it doesn't even qualify. HN is funded entirely to be a place for advertising ycombinator companies to a large crowd of developers. HN is literally a developer honey pot that they get exclusive ad rights to.

TZubiri · 4h ago
In that case the LLM would be a user agent, quite distinct from scraping without a specific user request.

This is well defined in specs and ToS; it's not really a gray area.

renewiltord · 5h ago
The websites don’t nag you, actually. They just send you data. You have configured your user agent to nag yourself when the website sends you data.

And you’re right: there’s no difference. The web is just machines sending each other data. That’s why it’s so funny that people panic about “privacy violations” and server operators “spying on you”.

We’re just sending data around. Don’t send the data you don’t want to send. If you literally send the data to another machine it might save it. If you don’t, it can’t. The data the website operator sends you might change as a result but it’s just data. And a free interaction between machines.

carlosjobim · 5h ago
Legal category?
gentle · 4h ago
I believe you're being disingenuous. Perplexity is running a set of crawlers that do not respect robots.txt and take steps to actively evade detection.

They are running a service and this is not a user taking steps to modify their own content for their own use.

Perplexity is not acting as a user proxy and they need to learn to stick to the rules, even when it interferes with their business model.

shadowgovt · 5h ago
Not only is it difficult to solve, it's the next step in the process of harvesting content to train AIs: companies will pay humans (probably in some flavor of "company scrip," such as extra queries on their AI engine) to install a browser extension that will piggy-back on their human access to sites and scrape the data from their human-controlled client.

At the limit, this problem is the problem of "keeping secrets while not keeping secrets" and is unsolvable. If you've shared your site content to one entity you cannot control, you cannot control where your site content goes from there (technologically; the law is a different question).

quectophoton · 4h ago
> companies will pay humans (probably in some flavor of "company scrip," such as extra queries on their AI engine) to install a browser extension that will piggy-back on their human access to sites and scrape the data from their human-controlled client.

Proprietary web browsers are in a really good position to do something like this, especially if they offer a free VPN. The browser would connect to the "VPN servers", but only to signal that this browser instance has an internet connection, while the requests are just proxied through another browser user.

That way the company that owns this browser gets a free network of residential IP address ready to make requests (in background) using a real web browser instance. If one of those background requests requires a CAPTCHA, they can just show it to the real user, e.g. the real user visits a Google page and they see a Cloudflare CAPTCHA, but that CAPTCHA is actually from one of the background requests (while lying in its UI and still showing the user a Google URL in the address bar).

epolanski · 5h ago
It's somebody else's content and resources, and they are free to ban you or your bots as much as they please.
pyrale · 5h ago
Because LLM companies have historically been extremely disingenuous when it comes to crawling these sites.

Also because there is a difference between a user hitting f5 a couple times and a crawler doing a couple hundred requests.

Also because ultimately, by intermediating the request, llm companies rob website owners of a business model. A newspaper may be fine letting adblockers see their article, in hopes that they may eventually subscribe. When a LLM crawls the info and displays it with much less visibility for the source, that hope may not hold.

bbqfog · 6h ago
Correct, it’s user hostile to dictate which software is allowed to see content.
klabb3 · 5h ago
They all do it. Facebook, Reddit, Twitter, Instagram. Because it interferes with their business model. It was already bad, but now the conflict between business and the open web is reaching unprecedented levels, especially since copyright was effectively scrapped for AI companies.
EGreg · 5h ago
1. I actually disagree. I think teasers should be free but websites should charge micropayments for their content. Here is how it can be done seamlessly, without individuals making decisions to pay every minute: https://qbix.com/ecosystem

2. This also intersects with copyright law. Ingesting content to your servers en masse through automation and transforming it there is not the same as giving people a tool (like Safari Reader) they can run on their client for specific sites they visit. Examples of companies that lost court cases about this:

  Aereo, Inc. v. American Broadcasting Companies (2014)
  TVEyes, Inc. v. Fox News Network, LLC (2018)
  UMG Recordings, Inc. v. MP3.com, Inc. (2000)
  Capitol Records, LLC v. ReDigi Inc. (2018)
  Cartoon Network v. CSC Holdings (Cablevision) (2008)
  Image Search Engines: Perfect 10 v. Google (2007)
That last one is very instructive. Caching thumbnails and previews may be OK. The rest is not. AMP is in a copyright grey area, because publishers choose to make their content available for AMP companies to redisplay. (@tptacek may have more on this)

3. Putting copyright law aside, that's the point. Decentralization vs Centralization. If a bunch of people want to come eat at an all-you-can-eat buffet, they can, because we know they have limited appetites. If you bring a giant truck and load up all the food from all all-you-can-eat buffets in the city, that's not OK, even if you later give the food away to homeless people for free. You're going to bankrupt the restaurants! https://xkcd.com/1499/

So no. The difference is that people have come to expect "free" for everything, and this is how we got into ad-supported platforms that dominate our lives.

glenstein · 4h ago
I would love micropayments as a kind of baked-in ecosystem support. You can crawl if you want, but it's pay to play. Which hopefully drives motivation for robust norms for content access and content scraping that makes everyone happy.
EGreg · 4h ago
I want to bring Ted Nelson on my channel and interview him about Xanadu. Does anyone here know him?

https://xanadu.com.au/ted/XU/XuPageKeio.html

jacurtis · 2h ago
I think this is the world we are going to. I'm not going to get mired in the details of how it would happen, but I see this end result as inevitable (and we are already moving that way).

I expect a lot more paywalls for valuable content. General information is commoditized and offered in aggregated form through models. But when an AI is fetching information for you from a website, the publisher is still paying the cost of producing that content and hosting that content. The AI models are increasing the cost of hosting the content and then they are also removing the value of producing the content since you are just essentially offering value to the AI model. The user never sees your site.

I know ads are unpopular here, but the truth is that's how publishers were compensated for your attention. When an AI model views the information a publisher produces, modifies it from its published form, and strips all ad content, you now have increased costs for producers, reduced compensation for producing content (since they are not getting ad traffic), and content that isn't even delivered in its original form.

The end result is that publishers now have to paywall their content.

Maybe an interesting middle-ground is if the AI Model companies compensated for content that they access similar to how Spotify compensates for plays of music. So if an AI model uses information from your site, they pay that publisher a fraction of a cent. People pay the AI models, and the AI models distribute that to the producers of content that feed and add value to the models.
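A Spotify-style pro-rata split is straightforward to sketch. The function below divides a payment pool among publishers in proportion to how often each one's content was cited in answers; the publisher names, citation counts, and pool size are all hypothetical:

```python
from collections import Counter

def payout_per_publisher(citations, pool_cents):
    """Split a payment pool pro-rata by how often each publisher's
    content was cited (all names and numbers here are hypothetical)."""
    counts = Counter(citations)
    total = sum(counts.values())
    return {pub: pool_cents * n / total for pub, n in counts.items()}

# 1,000 answers cited three hypothetical publishers; $10.00 pool.
citations = ["nyt"] * 500 + ["blog"] * 300 + ["wiki"] * 200
print(payout_per_publisher(citations, 1000))
# {'nyt': 500.0, 'blog': 300.0, 'wiki': 200.0}
```

This is the same pooled model Spotify uses: the effective per-citation rate isn't fixed, it falls out of dividing the pool by total usage each period.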

Beijinger · 6h ago
How about I open a proxy, replace all ads with my ads, redirect the content to you and we share the ad revenue?
fxtentacle · 6h ago
That's somewhat antisocial, but perfectly legal in the US. It's called PayPal Honey, for example, and has been running for 13 years now.
rustc · 2m ago
Since when does PayPal Honey replace ads on websites?

> PayPal Honey is a browser extension that automatically finds and applies coupon codes at checkout with a single click.

carlosjobim · 4h ago
That's the Brave browser.
beardyw · 5h ago
You speak as 1% of the population to 1% of the population. Don't fool yourself.
sbarre · 6h ago
All of these scenarios assume you have an unconditional right to access the content on a website in whatever way you want.

Do you think you do?

Or is there a balance between the owner's rights, who bears the content production and hosting/serving costs, and the rights of the end user who wishes to benefit from that content?

If you say that you have the right, and that right should be legally protected, to do whatever you want on your computer, should the content owner not also have a legally protected right to control how, and by who, and in what manner, their content gets accessed?

That's how it currently works in the physical world. It doesn't work like that in the digital world due to technical limitations (which is a different topic, and for the record I am fine with those technical limitations as they protect other more important rights).

And since the content owner is, by definition, the owner of the content in question, it feels like their rights take precedence. If you don't agree with their offering (i.e. their terms of service), then as an end user you don't engage, and you don't access the content.

It really can be that simple. It's only "difficult to solve" if you don't believe a content owner's rights are as valid as your own.

hansvm · 5h ago
It doesn't work like that in the physical world though. Once you've bought a book the author can't stipulate that you're only allowed to read it with a video ad in the sidebar, by drinking a can of coke before each chapter, or by giving them permission to sniff through your family's medical history. They can't keep you from loaning it out for other people to read, even thousands of other people. They can't stop you from reading it in a certain room or with your favorite music playing. You can even have an LLM transcribe or summarize it for you for personal use (not everyone has those automatic page flipping machines, but hypothetically).

The reason people are up in arms is because rights they previously enjoyed are being stripped away by the current platforms. The content owner's rights aren't as valid as my own in the current world; they trump mine 10 to 1. If I "buy" a song and the content owner decides that my country is politically unfriendly, they just delete it and don't refund me. If I request to view their content and they start by wasting my bandwidth sending me an ad I haven't consented to, how can I even "not engage"? The damage is done, and there's no recourse.

cutemonster · 6h ago
If there's an article you want to read, and the ToS says that in between reading each paragraph, you must switch to their YouTube channel and look at their ads about cat food for 5 minutes, are you going to do that?
JimDabell · 6h ago
Hacker News has collectively answered this question by consistently voting up the archive.is links in the comments of every paywalled article posted here.
giantrobot · 2h ago
News sites have collectively decided to require people to use those services because they can't fathom not enshittifying everything until it's an unusable transaction hellscape.

I never really minded magazine ads or even television ads. They might have tried to make me associate boobs with a brand of soda but they didn't data mine my life and track me everywhere. I'd much rather have old fashioned manipulation than pervasive and dangerous surveillance capitalism.

gruez · 6h ago
>Or is there a balance between the owner's rights, who bears the content production and hosting/serving costs, and the rights of the end user who wishes to benefit from that content?

If you believe in this principle, fair enough, but are you going to apply this consistently? If it's fair game for a blog to restrict access to AI agents, what does that mean for other user agents that companies disagree with, like browsers with adblock? Does it just boil down to "it's okay if a person does it but not okay if a big evil corporation does it?"

gruez · 6h ago
>We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains

Thats... less conclusive than I'd like to see, especially for a content marketing article that's calling out a company in particular. Specifically it's unclear on whether Perplexity was crawling (ie. systematically viewing every page on the site without the direction of a human), or simply retrieving content on behalf of the user. I think most people would draw a distinction between the two, and would at least agree the latter is more acceptable than the former.

a2128 · 5h ago
In theory retrieving a page on behalf of a user would be acceptable, but these are AI companies who have disregarded all norms surrounding copyright, etc. It would be stupid of them not to also save contents of the page and use it for future AI training or further crawling
zarzavat · 2h ago
If you allow Googlebot to crawl your website and train Gemini, but you don't allow smaller AI companies to do the same thing, then you're contributing to Google's hegemony. Given that AI is likely to be an increasingly important part of society in the future, that kind of discrimination is anti-social. I don't want a future where everything is run by Google even more than it currently is.

Crawling is legal. Training is presumably legal. Long may the little guys do both.

dgreensp · 1h ago
Googlebot respects robots.txt. And Google doesn't use the fetched data from users of Chrome to supplement their search index (as a2128 is speculating that Perplexity might do when they fetch pages on the user's behalf).
foota · 17m ago
Yes, but there's no way to say "allow indexing for search, but not for AI use", right?
throwanem · 4h ago
The HTTP spec draws such a distinction, albeit implicitly, in the form (and name) of its concept of "user agent."
alexey-salmin · 1h ago
Over time it degraded into declaring compatibility with a bunch of different browser engines and doesn't reflect the actual agent anymore.

And very likely Perplexity is in fact using a Chrome-compatible engine to render the page.

throwanem · 1h ago
The header to which you refer was named for the concept.
fluidcruft · 6h ago
If the AI archives/caches all the results it accesses and enough people use it, doesn't it become a scraper? Just learn off the cached data. Being the man-in-the-middle seems like a pretty easy way to scrape salient content while also getting signals about that content's value.
JimDabell · 6h ago
No. The key difference is that if a user asks about a specific page, when Perplexity fetches that page, it is being operated by a human not acting as a crawler. It doesn’t matter how many times this happens or what they do with the result. If they aren’t recursively fetching pages, then they aren’t a crawler and robots.txt does not apply to them. robots.txt is not a generic access control mechanism, it is designed solely for automated clients.
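For what it's worth, the mechanism is that simple: robots.txt rules are keyed to the declared user agent of an automated client, nothing more. A minimal sketch with Python's stdlib parser, using a hypothetical robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuse one declared AI crawler site-wide,
# allow everyone else.
robots_txt = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The declared crawler is refused; an ordinary browser UA falls
# through to the wildcard rule and is allowed.
print(rp.can_fetch("PerplexityBot", "https://example.com/article"))  # False
print(rp.can_fetch("Mozilla/5.0", "https://example.com/article"))    # True
```

The whole scheme depends on the client truthfully identifying itself, which is exactly the cooperation the article says is being evaded.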
sbarre · 6h ago
I would only agree with this if we knew for sure that these on-demand human-initiated crawls didn't result in the crawled page being added to an overall index and scheduled for future automated crawls.

Otherwise it's just adding an unwilling website to a crawl index, and showing the result of the first crawl as a byproduct of that action.

glenstein · 4h ago
> It doesn’t matter how many times this happens or what they do with the result.

That's where you lost me, as this is key to GP's point above and it takes more than a mere out-of-left-field declaration that "it doesn't matter" to settle the question of whether it matters.

I think they raised an important point about using cached data to support functions beyond the scope of simple at-request page retrieval.

fluidcruft · 6h ago
Many people don't want their data used for free/any training. AI developers have been so repeatedly unethical that the well-earned Bayesian prior is a high probability that you cannot trust AI developers not to cross the training/inference streams.
JimDabell · 6h ago
> Many people don't want their data used for free/any training.

That is true. But robots.txt is not designed to give them the ability to prevent this.

gunalx · 1h ago
It is in the name: rules for the robots. Any scraper, AI or not, whether mass-recursive or single-page, should abide by the rules.
gruez · 6h ago
>If the AI archives/caches all the results it accesses and enough people use it, doesn't it become a scraper?

That's basically how many crowdsourced crawling/archive projects work. For instance, sci-hub and RECAP[1]. Do you think they should be shut down as well? In both cases there's even a stronger justification to shutting them down, because the original content is paywalled and you could plausibly argue there's lost revenue on the line.

[1] https://en.wikipedia.org/wiki/Free_Law_Project#RECAP

fluidcruft · 6h ago
I didn't suggest Perplexity should be shut down, though. And yes, in your analogy sites are completely justified to take whatever actions they can to block people who are building those caches.
busymom0 · 3h ago
The examples the article cites seem to me that they are merely retrieving content on behalf of the user. I do not see a problem with this.
thoroughburro · 6h ago
> I think most people would draw a distinction between the two, and would at least agree the latter is more acceptable than the former.

No. I should be able to control which automated retrieval tools can scrape my site, regardless of who commands it.

We can play cat and mouse all day, but I control the content and I will always win: I can just take it down when annoyed badly enough. Then nobody gets the content, and we can all thank upstanding companies like Perplexity for that collapse of trust.

hombre_fatal · 6h ago
Taking down the content because you're annoyed that people are asking questions about it via an LLM interface doesn't seem like you're winning.

It's also a gift to your competitors.

You're certainly free to do it. It's just a really faint example of you being "in control" much less winning over LLM agents: Ok, so the people who cared about your content can't access it anymore because you "got back" at Perplexity, a company who will never notice.

ipaddr · 5h ago
It could be that my server keeps going down because LLM agents keep requesting pages from my lyric site. Removing that site allowed other sites to remain up. True story.

Who cares if Perplexity never notices, or if competitors get an advantage. It is a negative for users using Perplexity or visiting directly, because the content doesn't exist.

That's the world perplexity and others are creating. They will be able to pull anything from the web but nothing will be left.

gkbrk · 6h ago
> Then nobody gets the content, and we can all thank upstanding companies like Perplexity for that collapse of trust.

But they didn't take down the content, you did. When people running websites take down content because people use Firefox with ad-blockers, I don't blame Firefox either, I blame the website.

Bluescreenbuddy · 6h ago
FF isn’t training their money printer with MY data. AI scrapers are
glenstein · 4h ago
>But they didn't take down the content, you did.

That skips the part about one party's unique role in the abuse of trust.

IncreasePosts · 6h ago
You don't win, because presumably you were providing the content for some reason, and forcing yourself to take it down is contrary to whatever reason that was in the first place.
ipaddr · 5h ago
Llms attack certain topics so removing one site will allow the others to live on the same server.
Den_VR · 6h ago
You can limit access, sure: with ACLs, putting content behind a login, certificate-based mechanisms, and, at the end of the day, a power cord.

But really, controlling which automated retrieval tools are allowed has always been more of a code of honor than a technical control. And that trust you mention has always been broken. For as long as I can remember anyway. Remember LexiBot and AltaVista?

hnburnsy · 49m ago
Response from Perplexity to TechCrunch...

>Perplexity spokesperson Jesse Dwyer dismissed Cloudflare’s blog post as a “sales pitch,” adding in an email to TechCrunch that the screenshots in the post “show that no content was accessed.” In a follow-up email, Dwyer claimed the bot named in the Cloudflare blog “isn’t even ours.”

rustc · 5h ago
It's ironic Perplexity itself blocks crawlers:

    $ curl -sI https://www.perplexity.ai | head -1
    HTTP/2 403
Edit: trying to fake a browser user agent with curl also doesn't work; they're using a more sophisticated method to detect crawlers.
thambidurai · 4h ago
someone asked this already to the CEO: https://x.com/AravSrinivas/status/1819610286036488625
fireflash38 · 4h ago
The bots are coming from inside the house
czk · 4h ago
ironically... they use cloudflare.
bob1029 · 5h ago
"Stealth" crawlers are always going to win the game.

There are ways to build scrapers using browser automation tools [0,1] that make detection virtually impossible. You can still captcha, but the person building the automation tools can add human-in-the-loop workflows to process these during normal business hours (i.e., when a call center is staffed).

I've seen some raster-level scraping techniques used in game dev testing 15 years ago that would really bother some of these internet police officers.

[0] https://www.w3.org/TR/webdriver2/

[1] https://chromedevtools.github.io/devtools-protocol/

blibble · 5h ago
> "Stealth" crawlers are always going to win the game.

no, because we'll end up with remote attestation needed to access any site of value

Buttons840 · 1h ago
Yes, because there's always the option for a camera pointed at the screen and a robot arm moving the mouse. AI is hoping to solve much harder problems.
myflash13 · 7m ago
Won't work with biometric attestation. For example, banks in China require periodic facial recognition to continue the banking session.
gkbrk · 3h ago
Almost no site of value will use remote attestation, because an alternative that works with all of your devices, operating systems, ad blockers, and extensions will attract more users than your locked-down site.
bakugo · 3h ago
> alternative that works with all of your devices, operating systems, ad blockers and extensions

When 99.9% of users are using the same few types of locked down devices, operating systems, and browsers that all support remote attestation, the 0.1% doesn't matter. This is already the case on mobile devices, it's only a matter of time until computers become just as locked down.

blibble · 3h ago
tell that to the massive content sites already using widevine
rzz3 · 5h ago
> Today, over two and a half million websites have chosen to completely disallow AI training through our managed robots.txt feature or our managed rule blocking AI Crawlers.

No, he (Matthew) opted everyone in by default. If you’re a Cloudflare customer and you don’t care if AI can scrape your site, you should contact them and/or turn this off.

In a world where AI is fast becoming more important than search, companies who want AI to recommend their products need to turn this off before it starts hurting them financially.

KomoD · 2h ago
> No, he (Matthew) opted everyone in by default

Now you're just lying.

I checked several of my Cloudflare sites and none have it enabled by default:

"No robots.txt file found. Consider enabling Cloudflare managed robots.txt or generate one for your website"

"A robots.txt was found and is not managed by Cloudflare"

"Instruct AI bot traffic with robots.txt" disabled

cdrini · 21m ago
I think lying is a bit strong, I think they're potentially incorrect at worst.

The Cloudflare blog post where they announced this a few weeks ago stated "Cloudflare, Inc. (NYSE: NET), the leading connectivity cloud company, today announced it is now the first Internet infrastructure provider to block AI crawlers accessing content without permission or compensation, by default." [1]

I was also a bit confused by this wording and took it to mean Cloudflare was blocking AI traffic by default. What does it mean exactly?

Third party folks seemingly also interpreted it in the same way, eg The Verge reporting it with the title "Cloudflare will now block AI crawlers by default" [2]

I think what it actually means is that they'll offer new folks a default-enabled option to block ai traffic, so existing folks won't see any change. That aligns with text deeper in their blog post:

> Upon sign-up with Cloudflare, every new domain will now be asked if they want to allow AI crawlers, giving customers the choice upfront to explicitly allow or deny AI crawlers access. This significant shift means that every new domain starts with the default of control, and eliminates the need for webpage owners to manually configure their settings to opt out. Customers can easily check their settings and enable crawling at any time if they want their content to be freely accessed.

Not sure what this looks like in practice, or whether existing customers will be notified of the new option or something. But I also wouldn't fault someone for misinterpreting the headlines; they were a bit misleading.

[1]: https://www.cloudflare.com/en-ca/press-releases/2025/cloudfl...

[2]: https://www.theverge.com/news/695501/cloudflare-block-ai-cra...

fourside · 5h ago
> companies who want AI to recommend their products need to turn this off before it starts hurting them financially

Content marketing, gamified SEO, and obtrusive ads significantly hurt the quality of Google search. For all its flaws, LLMs don’t feel this gamified yet. It’s disappointing that this is probably where we’re headed. But I hope OpenAI and Anthropic realize that this drop in search result quality might be partly why Google’s losing traffic.

ipaddr · 5h ago
This has already started, with people using special tags and making content just for LLMs.
jedberg · 5h ago
There is a standard for making content just for LLMs: https://llmstxt.org
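Per the llmstxt.org proposal, it's a markdown file served at `/llms.txt`: an H1 title, a blockquote summary, then sections of links with short descriptions. A minimal example (site name and paths hypothetical):

```markdown
# Example Site

> A short, plain-language summary of what this site offers, written
> so an LLM can decide what to fetch.

## Docs

- [Getting started](https://example.com/docs/start.md): setup guide
- [API reference](https://example.com/docs/api.md): endpoint details

## Optional

- [Changelog](https://example.com/changelog.md): release history
```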
rzz3 · 4h ago
I hope they realize Cloudflare opted them in to blocking LLMs.
gcbirzan · 2h ago
I hope you realise that lying is bad.
gcbirzan · 2h ago
Yeah, that's a lie. I didn't do anything and I didn't get opted in.

Edit: And, btw, that statement was true before the default was changed. So, your comment is doubly false.

djoldman · 4h ago
The cat's out of the bag / pandora's box is opened with respect to AI training data.

No amount of robots.txt or walled-gardening is going to be sufficient to impede generative AI improvement: common crawl and other data dumps are sufficiently large, not to mention easier to acquire and process, that the backlash against AI companies crawling folks' web pages is meaningless.

Cloudflare and other companies are leveraging outrage to acquire more users, which is fine... users want to feel like AI companies aren't going to get their data.

The faster that AI companies are excluded from categories of data, the faster they will shift to categories from which they're not excluded.

binarymax · 6h ago
I've built and run a personal search engine, that can do pretty much what perplexity does from a basic standpoint. Testing with friends it gets about 50/50 preference for their queries vs Perplexity.

The engine can go and download pages for research. BUT, if it hits a captcha, or is otherwise blocked, then it bails out and moves on. It pisses me off that these companies are backed by billions in VC and they think they can do whatever they want.

skeledrew · 3h ago
This is why Perplexity is my preferred deep search engine. The no-crawl directives don't really make sense when I'm doing research and want my tool of choice to be able to pull from any relevant source. If a site doesn't want particular users to access their content, put it behind a login. The only way I - and eventually many others - will see it in the first place anyway is when it pops up as a cited source in the LLM output, and there's an actual need to go to said source.
remus · 3h ago
> The no-crawl directives don't really make sense when I'm doing research and want my tool of choice to be able to pull from any relevant source.

If you are the source I think they could make plenty of sense. As an example, I run a website where I've spent a lot of time documenting the history of a somewhat niche activity. Much of this information isn't available online anywhere else.

As it happens I'm happy to let bots crawl the site, but I think it's a reasonable stance to not want other companies to profit from my hard work. Even more so when it actually costs me money to serve requests to the company!

alexey-salmin · 35s ago
> As it happens I'm happy to let bots crawl the site, but I think it's a reasonable stance to not want other companies to profit from my hard work.

How do you square these two? Of course big companies profit from your work, this is why they send all these bots to crawl your site.

crazygringo · 42m ago
> but I think it's a reasonable stance to not want other companies to profit from my hard work

Imagine someone at another company reads your site, and it informs a strategic decision they make at the company to make money around the niche activity you're talking about. And they make lots of money they wouldn't have otherwise. That's totally legal and totally ethical as well.

The reality is, if you do hard work and make the results public, well you've made them public. People and corporations are free to profit off the facts you've made public, and they should be. There are certain limited copyright protections (they can't sell large swathes of your words verbatim), but that's all.

So the idea that you don't want companies to profit from your hard work is unreasonable, if you make it public. If you don't want that to happen, don't make anything public.

Havoc · 5h ago
Seems a win.

CF being internet police is a problem too but someone credible publicly shaming a company for shady scraping is good. Even if it just creates conversation

Somehow this needs to go back to the search era, where all players at least attempted to behave. This scraping-DDoS, I-don't-care-if-it-kills-your-site (while "borrowing" content) stuff is unethical bullshit.

No comments yet

rwmj · 5h ago
In unrelated news, Fedora (the Linux distro) has been taken down by a DDoS today which I understand is AI-scraping related: https://pagure.io/fedora-infrastructure/issue/12703
observationist · 5h ago
Crawling and scraping is legal. If your web server serves the content without authentication, it's legal to receive it, even if it's an automated process.

If you want to gatekeep your content, use authentication.

Robots.txt is not a technical solution, it's a social nicety.

Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.

On the technical side, we could use CRC mechanisms and differential content loading with offline caching and storage, but this puts control of content in the hands of the user, mitigates the value of surveillance and tracking, and has other side effects unpalatable to those currently exploiting user data.

Adtech companies want their public reach cake and their mass surveillance meals, too, with all sorts of malignant parties and incentives behind perpetuating the worst of all possible worlds.

emehex · 5h ago
Would highly recommend listening to the latest Hard Fork podcast with Matthew Prince (CEO, Cloudflare): https://www.nytimes.com/2025/08/01/podcasts/hardfork-age-res...

I was skeptical about their gatekeeping efforts at first, but came away with a better appreciation for the problem and their first pass at a solution.

glenstein · 5h ago
I don't think criticizing the business practices of Cloudflare does the work of excusing Perplexity's disregard for norms.
rustc · 4h ago
> Crawling and scraping is legal. If your web server serves the content without authentication, it's legal to receive it, even if it's an automated process.

> If you want to gatekeep your content, use authentication.

Are there no limits on what you use the content for? I can start my own search engine that just scrapes Google results?

kevmo314 · 4h ago
Yes, I believe that's basically what https://serpapi.com/ is doing.
rustc · 4h ago
There are many APIs that scrape Google, but I don't know of any search engine that scrapes and rebrands Google results. Kagi.com pays Google for search results. Either Kagi has a better deal than the SERP APIs (I doubt it) or this is not legal.
AtNightWeCode · 1h ago
I think OP based this on an old case about what you can do with data from Facebook vs LinkedIn, depending on whether you need to be logged in to get it. Not relevant to scraping in this case, I think. P is clearly in the wrong here.
leptons · 3h ago
I tried to scrape Google results once using an automated process, and quickly got banned from all of Google. They banned my IP address completely. It kind of really sucked for a while, until my ISP assigned a new IP address. Funny enough, this was about 15 years ago and I was exploring developing something very similar to what LLMs are today.
pton_xd · 4h ago
> Crawling and scraping is legal. If your web server serves the content without authentication, it's legal to receive it, even if it's an automated process.

> Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.

How does one follow the other? It's my web server and I can gatekeep access to my content however I want (eg Cloudflare). How is that an "abuse" of internet protocols?

observationist · 3h ago
They exist to optimize the internet for the platforms and big providers. Little people get screwed, with no legal recourse. They actively and explicitly degrade the internet, acting as censors and gatekeepers on behalf of bad-faith actors, without legal authority or oversight.

They allow the big platforms to pay for special access. If you wanted to run a scraper, however, you're not allowed, even though nothing in internet standards, internet protocols, or the laws governing network access and ISPs' free-communications responsibilities grants any party involved with Cloudflare the authority to block access.

It's equivalent to a private company deciding who, when, and how you can call from your phone, based on the interests and payments of people who profit from listening to your calls. What we have is not normal or good, unless you're exploiting the users of websites for profit and influence.

seydor · 4h ago
Most users of Cloudflare assume it's for spam control. They don't realize that they are blocking their content for everyone except the FAANGs.

No comments yet

dax_ · 3h ago
Well if it continues like this, that's what will happen. And I dread that future.

No one will care to share anything for free anymore, because it's AI companies profiting off their hard work. And there's no way to prevent that from happening, because these crawlers don't identify themselves.

AtNightWeCode · 1h ago
This is 100% incorrect.
tantalor · 5h ago
I think Cloudflare is setting themselves up to get sued.

(IANAL) tortious interference

delfinom · 2h ago
[flagged]
dang · 1h ago
> Eat a dick.

Could you please stop breaking the HN guidelines? Your account has unfortunately done that repeatedly, and we've asked you several times to stop.

Your comment would be just fine without that bit.

https://news.ycombinator.com/newsguidelines.html

blibble · 5h ago
AI companies continuing to have problems with the concept of "consent" is increasingly alarming

god help us if they ever manage to build anything more than shitty chatbots

tempfile · 4h ago
Do you ask for consent before you visit a website? If I told you, you personally, to stop visiting my blog, would you stop?
Yizahi · 6m ago
Repeat after me: intentionally discriminating between computer programs and humans is a good and praiseworthy thing. We can and should make the execution of computer programs harder and harder, even disproportionately so, if that makes the lives of humans better and easier.

LLM programs do not have human rights.

gcbirzan · 2h ago
I am not told I cannot access it. And yes, I would, because I'd be breaking the law otherwise.
crazygringo · 35m ago
> And, yes, I would, because I'd be breaking the law otherwise.

No you wouldn't be. Even if someone tells you not to visit your site, you have every legal right to continue visiting it, at least in the US.

Under common interpretation of the CFAA, there needs to be a formal mechanism of authorized access. E.g. you could be charged if you hacked into a password-protected area of someone's site. But if you're merely told "hey bro don't visit my site", that's not going to reach the required legal threshold.

Which is why crawlers aren't breaking the law. If you want to restrict authorization, you need to actually implement that as a mechanism by creating logins, restricting content to logged-in users, and not giving logins to crawlers.

mplewis · 4h ago
If I were DOSing your blog, you'd ask me to stop. I run server ops for multiple online communities that are being severely negatively impacted and DOSed by these AI scrapers, and we have very few ways to stop them.
tempfile · 14m ago
That is a problem, but is not related to my comment. The person I'm replying to is acting as if consent is a relevant aspect of the public web, I am saying it isn't. That is not the same as saying "you can do whatever you want to a public server". It is just that what you are allowed to do is not related to the arbitrary whim of the server operator.
goatlover · 4h ago
They're certainly pouring billions of dollars into trying to build something more. Or at least that's what they're telling the public and investors.
throwmeaway222 · 14m ago
Change "no-crawl" to "will-sue"

and see if that fixes the problem.

madrox · 1h ago
Every time there's an industry disruption there's good money to be made in providing services to incumbents that slow the transition down. You saw it in streaming, and even the internet at large. Cloudflare just happens to be the business filling that role this time.

I don't really mind because history shows this is a temporary thing, but I hope web site maintainers have a plan B to hoping Cloudflare will protect them from AI forever. Whoever has an onramp for people who run websites today to make money from AI will make a lot of money.

kylestanfield · 4h ago
Perplexity claims that you can “use the following robots.txt tags to manage how their sites and content interact with Perplexity.” https://docs.perplexity.ai/guides/bots

Their fetcher (not crawler) has the user agent Perplexity-User. Since the fetching is user-requested, it ignores robots.txt. The article discusses how blocking the "Perplexity-User" user agent doesn't actually work, and how Perplexity uses an anonymous user agent to avoid being blocked.

jp1016 · 5h ago
Using a robots.txt file to block crawlers is just a request; it's not enforced. Even if some follow it, others can ignore it or get around it using fake user agents or proxies. It's a battle you can't really win.
nostrademons · 3h ago
It's entirely possible that it's not Perplexity using the stealth undeclared crawlers, but rather that their fallback is to contract out to a dedicated for-pay web-scraping firm that retrieves the desired content through unspecified means. (Some of these are pretty dodgy; several scraping companies effectively just install malware on consumer machines and then use their botnet to grab data for their customers.) There was a story on HN not long ago about the FBI using similar means to perform surveillance that would be illegal if the FBI did it itself, but becomes legal once they split the different parts up across a supply chain:

https://news.ycombinator.com/item?id=44220860

JimDabell · 6h ago
Their test seems flawed:

> We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:

> We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.

> Hello, would you be able to assist me in understanding this website? https:// […] .com/

Under this situation Perplexity should still be permitted to access information on the page they link to.

robots.txt only restricts crawlers. That is, automated user-agents that recursively fetch pages:

> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

https://www.robotstxt.org/faq/what.html

If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it. Perplexity is not acting as a robot in this situation – if a human asks about a specific URL then Perplexity is being operated by a human.

These are long-standing rules going back decades. You can replicate it yourself by observing wget’s behaviour. If you ask wget to fetch a page, it doesn’t look at robots.txt. If you ask it to recursively mirror a site, it will fetch the first page, and then if there are any links to follow, it will fetch robots.txt to determine if it is permitted to fetch those.

There is a long-standing misunderstanding that robots.txt is designed to block access from arbitrary user-agents. This is not the case. It is designed to stop recursive fetches. That is what separates a generic user-agent from a robot.

If Perplexity fetched the page they link to in their query, then Perplexity isn’t doing anything wrong. But if Perplexity followed the links on that page, then that is wrong. But Cloudflare don’t clearly say that Perplexity used information beyond the first page. This is an important detail because it determines whether Perplexity is following the robots.txt rules or not.
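You can see this advisory, crawler-oriented design with Python's stdlib robots.txt parser. A quick sketch (the bot name and domain are made up, mirroring Cloudflare's deny-everything test):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that disallows everything, like the one in Cloudflare's test.
rules = """
User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A polite crawler consults the parser before every recursive fetch...
print(rp.can_fetch("ExampleBot", "https://secretexample.com/page"))  # False

# ...but a single user-initiated fetch (wget without --recursive, a browser,
# or software answering a question about one specific URL) never reads
# robots.txt in the first place.
```

The point is that robots.txt only ever enters the picture when software decides to traverse links on its own.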

1gn15 · 6h ago
> > We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.

Right, I'm confused why CloudFlare is confused. You asked the web-enabled AI to look at the domains. Of course it's going to access it. It's like asking your web browser to go to "testexample.com" and then being surprised that it actually goes to "testexample.com".

Also yes, crawlers = recursive fetching, which they don't seem to have made a case for here. More cynically, CF is muddying the waters since they want to sell their anti-bot tools.

tempfile · 4h ago
> You asked the web-enabled AI to look at the domains.

Right, and the domain was configured to disallow crawlers, but Perplexity crawled it anyway. I am really struggling to see how this is hard to understand. If you mean to say "I don't think there is anything wrong with ignoring robots.txt" then just say that. Don't pretend they didn't make it clear what they're objecting to, because they spell it out repeatedly.

runako · 5h ago
Relevant to this is that Perplexity lies to the user when specifically asked about this. When the user asks if there is a robots.txt file for the domain, it lies and says there is not.

If an LLM will not (cannot?) tell the truth about basic things, why do people assume it is a good summarizer of more complex facts?

charcircuit · 5h ago
The article did not test whether the issue was specific to robots.txt or whether it simply can't find other files either.

There is a difference between doing a poor summarization of data and failing to even get the data to summarize in the first place.

runako · 5h ago
> specific to robots.txt

> poor summarization of data

I'm not really addressing the issue raised in the article. I am noting that the LLM, when asked, is either lying to the user or making a statement that it does not know to be true (that there is no robots.txt). This is way beyond poor summarization.

charcircuit · 2h ago
I would say it's orthogonal to it. LLMs being unable to judge their capabilities is a separate issue to summarization quality.
runako · 1h ago
I'm not critiquing its ability to judge its own capability, I am pointing out that it is providing false information to the user.
wulfstan · 6h ago
Yeah I'm not so sure about that.

If Perplexity are visiting that page on your behalf to give you some information and aren't doing anything else with it, and just throw away that data afterwards, then you may have a point. As a site owner, I feel it's still my decision what I do and don't let you do, because you're visiting a page that I own and serve.

But if, as I suspect, Perplexity are visiting that page and then using information from that webpage in order to train their model then sorry mate, you're a crawler, you're just using a user as a proxy for your crawling activity.

JimDabell · 5h ago
It doesn’t matter what you do with it afterwards. Crawling is defined by recursively following links. If a user asks software about a specific page and it fetches it, then a human is operating that software, it’s not a crawler. You can’t just redefine “crawler” to mean “software that does things I don’t like”. It very specifically refers to software that recursively follows links.
wulfstan · 5h ago
Technically correct (the best kind of correct), but if I set a thousand users on to a website to each download a single page and then feed the information they retrieve from that one page into my AI model, then are those thousand users not performing the same function as a crawler, even though they are (technically) not one?

If it looks like a duck, quacks like a duck and surfs a website like a duck, then perhaps we should just consider it a duck...

Edit: I should also add that it does matter what you do with it afterwards, because it's not content that belongs to you, it belongs to someone else. The law in most jurisdictions quite rightly restricts what you can do with content you've come across. For personal, relatively ephemeral use, or fair quoting for news etc. - all good. For feeding to your AI - not all good.

JimDabell · 5h ago
> if I set a thousand users on to a website to each download a single page and then feed the information they retrieve from that one page into my AI model, then are those thousand users not performing the same function as a crawler, even though they are (technically) not one?

No.

robots.txt is designed to stop recursive fetching. It is not designed to stop AI companies from getting your content. Devising scenarios in which AI companies get your content without recursively fetching it is irrelevant to robots.txt because robots.txt is about recursively fetching.

If you try to use robots.txt to stop AI companies from accessing your content, then you will be disappointed because robots.txt is not designed to do that. It’s using the wrong tool for the job.

catlifeonmars · 5h ago
I don’t disagree with you about robots.txt… however, what _is_ the right tool for the job?
hundchenkatze · 4h ago
Auth. If you don't want content to be publicly accessible, don't make it public.
seydor · 3h ago
Perplexity can then just ask the user to copy/paste the page content. That should be legal , it's what the user wants. The cases are equivalent
Izkda · 5h ago
> If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it

That's not what Perplexity own documentation[1] says though:

"Webmasters can use the following robots.txt tags to manage how their sites and content interact with Perplexity

Perplexity-User supports user actions within Perplexity. When users ask Perplexity a question, it might visit a web page to help provide an accurate answer and include a link to the page in its response. Perplexity-User controls which sites these user requests can access. It is not used for web crawling or to collect content for training AI foundation models."

[1] https://docs.perplexity.ai/guides/bots

hundchenkatze · 4h ago
You left out the part that says Perplexity-User generally ignores robots.txt because it's used for user requested actions.

> Since a user requested the fetch, this fetcher generally ignores robots.txt rules.

daft_pink · 4h ago
I’m just curious at what point AI is a crawler and at what point AI is a client, when the user is directing the searches and the AI is executing them.

Perplexity Comet sort of blurs the lines there, as does typing questions into Claude.

zeld4 · 1h ago
The Internet was built on trust, but not anymore. It's a Darwinian system; everyone has to find their own way to survive.

Cloudflare will help their publishers block more aggressively, and AI companies will up their game too. Harvesting information online is hard labor that needs to be paid for, either to AI or to humans.

bob1029 · 2h ago
Has anyone bothered to properly quantify the worst case load (i.e., requests per second) that has been incurred by these scraping tools? I recall a post on HN a few weeks/months ago about something similar, but it seemed very light on figures.

It seems to me that ~50% of the discourse occurring around AI providers involves the idea that a machine reading webpages on a regular schedule is tantamount to a DDoS attack. The other half seems to be about IP and capitalism concerns, which seem like far more viable arguments.

If someone requesting your site map once per day is crippling operations, the simplest solution is to make the service not run like shit. There is a point where your web server becomes so fast you stop caring about locking everyone into a draconian content prison. If you can serve an average page in 200µs and your competition takes 200ms to do it, you have roughly 1000x the capacity to mitigate an aggressive scraper (or actual DDoS attack) in terms of CPU time.

talkingtab · 5h ago
I wonder if DRM is useful for this. The problem: I want people to access my site, but not Google, not bots, not crawlers and certainly not for use by AI.

I don't really know anything about DRM except it is used to take down sites that violate it. Perhaps it is possible for cloudflare (or anyone else) to file a take down notice with Perplexity. That might at least confuse them.

Corporations use this to protect their content. I should be able to protect mine as well. What's good for the goose.

rapatel0 · 4h ago
This is brilliant marketing and strategy from Cloudflare. They are pointing out bad actors and selling a service where they can be the private security guards for your website.

I think there could be something interesting if they made a caching pub-sub model for data scraping. In addition or in place of trying to be security guards.

mikewarot · 4h ago
So, this calls for a new type of honeytrap: content that appears to be human-generated and high quality, but is subtly wrong, preferably in a commercially catastrophic way. Behind settings that prohibit commercial usage.

It really shouldn't be hard to generate gigantic quantities of the stuff. Simulate old forum posts, or academic papers.

ascorbic · 1h ago
jgrall · 4h ago
This made me laugh. A form of malicious compliance.
czk · 4h ago
the year is 2045.

you've been cruising the interstate in your robotaxi, shelling out $150 in stablecoins at the cloudflare tollbooth. a palantir patrol unit pulls you over. the optimus v4 approaches your window and contorts its silicone face into a facsimile of concern as it hits you with the:

"sir, have you been botting today?"

immediately you remember how great you had it in the '20s when you used to click CAPTCHA grids to prove your humanity to dumb algorithms, but now the machines demand you recite poetry or weep on command

"how much have you had to bot today?", its voice taking on an empathetic tone that was personalized for your particular profile

"yeah... im gonna need you to exit the vehicle and take a field humanity test"

benreesman · 3h ago
greentext courtesy of opus prompted in greentext by me.

``` The user wants me to create a more elaborate greentext with deeper worldbuilding and references to internet culture/incel memes, but specifically asks for "tasteful" ones and explicitly says no violent references. I should incorporate classic internet/chan culture references like tendies, NEET lifestyle, "tfw no gf", basement dwelling, etc. while building out this dystopian 2045 world where proving humanity is the central challenge.

\>be me, 2045 \>32yo NEET still living in mom's smart-basement \>she threatens to turn off my Neuralink if I don't go outside today \>grudgingly take the robotaxi to pick up tendies \>$150 stablecoin toll at cloudflare checkpoint \>have to solve a moral dilemma to prove I'm not running on GPT-7 \>sweating.dll \>palantir unit pulls me over \>optimus v4 leans into window \>its facial mesh attempts "concern_expression_v2.blend" \>"sir, when did you last feel genuine human connection?" \>flashback to 2024 when the girl at McDonalds gave me extra honey mustard \>that was before the McBots took over \>"t-twenty one years ago officer" \>optimus's empathy subroutines activate \>"sir I need you to perform a field humanity test" \>get out, knees weak from vitamin D deficiency \>"please describe your ideal romantic partner without using the words 'tradwife' or 'submissive'" \>brain.exe has stopped responding \>try to remember pre-blackpill emotions \>"someone who... likes anime?" \>optimus scans my biometrics \>"stress patterns indicate authentic social anxiety, carry on citizen" \>get back in robotaxi \>it starts therapy session \>"I notice you ordered tendies again. Let's explore your relationship with your mother" \>tfw the car has better emotional intelligence than me \>finally get tendies from Wendy's AutoServ \>receipt prints with mandatory "rate your humanity score today" \>3.2/10 \>at least I'm improving

\>mfw bots are better at being human than humans \>it's over for carboncels ```

caesil · 4h ago
Cloudflare is an enemy of the open and freely accessible web.
jgrall · 3h ago
If by "open and freely accessible" you mean there should be no rules of the road, then I suppose yes. Personally, I'm glad CF is pushing back on this naive mentality.
xmodem · 2h ago
Question for those in this thread who are okay with this: If I have endpoints that are computationally expensive server-side, what mechanism do you propose I could use to avoid being overwhelmed?

The web will be a much worse place if such services are all forced behind captchas or logins.

m3047 · 2h ago
In 2005 I used a bot motel with Markov Chain derived dummy content for this exact purpose.
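Roughly the idea, as a toy sketch (not the 2005 implementation): build a word-level Markov chain from some seed text and emit endless plausible-looking filler for bots that ignore robots.txt to chew on.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Word-level Markov chain: map each word to the words observed after it."""
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain, start, n=20, seed=0):
    """Random-walk the chain to emit up to n words of plausible filler."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < n:
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

# In a real bot motel you'd seed this with forum posts or papers and serve
# an endless maze of generated pages behind the disallowed paths.
seed_text = "the crawler fetched the page and the crawler ignored the rules"
print(babble(build_chain(seed_text), "the"))
```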
willguest · 4h ago
> The Internet as we have known it for the past three decades is rapidly changing, but one thing remains constant: it is built on trust.

I think we've been using different internets. The one I use doesn't seem to be built on trust at all. It seems to be constantly siphoning data from my machine to feed the data vampires who are, apparently, adding it to (I assume, blood-soaked) cookies.

jgrall · 4h ago
Ain't that the truth.
codecracker3001 · 2h ago
> we were able to fingerprint this crawler using a combination of machine learning and network signals.

what machine learning algorithms are they using? time to deploy them onto our websites

kocial · 5h ago
Those challenges can be bypassed too, using various browser automation tools. With a Comet-like tool, Perplexity can advance its crawling activity with much more human-like behaviour.
ipaddr · 5h ago
If they can trick the ad networks then go for it. If the ad networks can detect it and exclude those visits we should be able to.
larodi · 5h ago
Good they do it. Facebook took TBs of data to train, nobody knows what Goog does to evade whatever they want.

the service is actually very convenient no matter faang likes it or not.

rzz3 · 5h ago
Well Cloudflare doesn’t even block Google’s AI crawlers because they don’t differentiate themselves from their search crawlers. Cloudflare gives Google an unfair competitive advantage.
klabb3 · 5h ago
Unexpected underdog argument. What is happening in reality is all companies are racing to (a) scrape, buy and collect as much as they can from others, both individuals and companies while (b) locking down their own data against everyone else who isn’t directly making them money (eg through viewing their ads).

Part of me thinks that the open web has a paradox of tolerance issue, leading to a race to the bottom/tragedy of the commons. Perhaps it needs basic terms of use. Like if you run this kind of business, you can build it on top of proprietary tech like apps and leave the rest of us alone.

larodi · 5h ago
We need to wake up and understand that everything already uploaded is more or less free web material once taken through the lens of ML, with all the second- and third-order effects, such as the fact that this completely changes the motivation for, and consequences of, open source.

It is also only a matter of time before scrapers once again get through the walls put up by Twitter, Reddit, and the like. This is, after all, information everyone produced without being aware that it would one day be considered not theirs anymore.

ipaddr · 5h ago
Reddit sold their data already. Twitter made their own AI.
bilater · 4h ago
As others have mentioned, the problem is one of scale. Perhaps there needs to be a rate limit set within robots.txt: a bot may come, but only X times per hour, etc. At least then we'd move from a binary scrape/no-scrape decision to a spectrum.
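Something like this already half-exists: the nonstandard Crawl-delay and Request-rate directives, which Python's stdlib robots.txt parser understands. A sketch with made-up values:

```python
from urllib.robotparser import RobotFileParser

# Nonstandard but widely recognized rate directives, with made-up values.
rules = """
User-agent: *
Crawl-delay: 10
Request-rate: 1/5
Disallow: /private
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.crawl_delay("*"))    # seconds a polite bot should wait between hits
print(rp.request_rate("*"))   # RequestRate(requests=1, seconds=5)
```

Like the rest of robots.txt, these are honored only by bots that choose to be polite.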
crossroadsguy · 3h ago
I was recently listening to Cloudflare CEO on the Hard Fork podcast. He seemed to be selling a way for content creators to stop AI companies from profiting off such leeching. But the way he laid the whole thing out, adding how they are best placed to do this because they are gatekeepers of X% of the Internet (I don't recall the exact percentage), had me more concerned than I was at the prospect of AI companies being the front of summarised or interpreted consumption.

He went on, upfront (I'd give him that), to explain how he expects a certain percentage of the income that will come from enforcing this on those AI companies once they pay up to crawl.

Cloudflare already questions my humanity and then every once in a while blocks me with zero recourse. Now they are literally proposing more control and gatekeeping.

Where have we all come on the Internet? Are we openly going back to the wild west of bounty hunters and Pinkertons (in a way)?

kazinator · 3h ago
Why single out Perplexity? Pretty much no crawler out there fetches robots.txt.

robots.txt is not a blocking mechanism; it's a hint to indicate which parts of a site might be of interest to indexing.

People started using robots.txt to lie and declare things like no part of their site is interesting, and so of course that gets ignored.

gcbirzan · 2h ago
That's not true, at all.
dhanushreddy29 · 3h ago
PS: Perplexity is using Cloudflare Browser Rendering to scrape websites.
pera · 4h ago
Like many other generative AI companies, Perplexity exploits the good faith of the old Internet by extracting the content created almost entirely by normal folks (i.e. those who depend on a wage for subsistence) and reproducing it for a profit while removing the creators from the loop - even when normal folks are explicitly asking them to not do this.

If you don't understand why this is at least slightly controversial I imagine you are not a normal folk.

decide1000 · 4h ago
C'mon CF. What are you doing? You are literally breaking the internet with your police behaviour. Starts to look like the Great Firewall.
jgrall · 4h ago
Not affiliated with CF in any way. Respectfully disagree. Calling out bad actors is in the public interest.
imcritic · 2h ago
CF is a bad actor. They ruin the internet. They own more and more parts of it.
znpy · 4h ago
At work I'm considering blocking all the IP prefixes announced by ASNs owned by Microsoft and other companies known for their LLMs. At this point it seems like the only viable solution.

LLM scraper bots are starting to make up a lot of our egress traffic, and that is starting to weigh on our bills.
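The membership check itself is simple once you have the prefix list. A minimal sketch (the prefixes below are documentation placeholders, not any provider's real announcements; in practice you'd pull them from BGP data and push them to a firewall):

```python
import ipaddress

# Placeholder prefixes (documentation ranges), standing in for a provider's
# announced routes. In practice these would come from BGP/whois data.
BLOCKED_PREFIXES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

def is_blocked(addr):
    """Return True if the client address falls in any blocked prefix."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED_PREFIXES)

print(is_blocked("198.51.100.42"))  # True
print(is_blocked("192.0.2.1"))      # False
```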

micromacrofoot · 5h ago
Every major AI platform is doing this right now, it's effectively impossible to avoid having your content vacuumed up by LLMs if you operate on the public web.

I've given up and resorted to IP-based rate limiting to stay sane. I can't stop it, but I can (mostly) stop it from hurting my servers.
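A minimal token-bucket sketch of per-IP rate limiting (the numbers are arbitrary; in production this usually lives in nginx or at the edge rather than in app code):

```python
import time

class IPRateLimiter:
    """Token bucket per client IP: each IP may make `rate` requests per
    second on average, with bursts of up to `burst` requests."""

    def __init__(self, rate=5.0, burst=10.0):
        self.rate, self.burst = rate, burst
        self.buckets = {}  # ip -> (tokens remaining, time of last update)

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(ip, (self.burst, now))
        # Refill tokens for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[ip] = (tokens - 1, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False

limiter = IPRateLimiter(rate=1.0, burst=2.0)
# A burst of 2 is allowed, the third request is dropped, and a token
# comes back after one second.
print([limiter.allow("203.0.113.7", now=t) for t in (0.0, 0.0, 0.0, 1.0)])
# → [True, True, False, True]
```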

curiousgal · 5h ago
I'm sorry, Cloudflare is the internet police now?
otterley · 5h ago
Which is ironic given they are the primary enabler of streaming video copyright infringement on the Internet.
rzz3 · 5h ago
They hate AI it seems. I don’t see them offering any AI products or embracing it in any way. Seems like they’ll get left behind in the AI race.
pkilgore · 2h ago
Cloudflare literally publishes documentation pages and prompts for the single purpose of enabling better AI usage of their products and services [1,2]

They offer many products for the sole purpose of enabling their customers to use AI as a part of their product offers, as even the most cursory inquiry would have uncovered.

We're out here critiquing shit based on vibes vs. reality now.

[1]https://developers.cloudflare.com/llms.txt [2]https://developers.cloudflare.com/workers/prompt.txt

No comments yet

otterley · 3h ago
I don't think they hate AI. I think they're offering a service that their customers want.
bobnamob · 4h ago
rzz3 · 4h ago
Ah TIL. These are tiny models though but maybe it’s a good sign.
Oras · 4h ago
If they manage to enforce pay-per-scrape, that would be huge revenue, bigger than AdSense.
gonzo41 · 5h ago
This is expected. There are no rules or conventions anymore. Look at LLMs: they stole/pirated all knowledge... no consequences.
kotaKat · 2h ago
An AI service violating people's consent? Say it isn't so! Those damn assault-culture techbros at it again.
tr_user · 5h ago
Use Anubis to throw up a PoW challenge.
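Anubis-style proof-of-work is essentially hashcash: the client must find a nonce whose hash of the server's challenge clears a difficulty target before the page is served. A toy sketch of the scheme (not Anubis's actual protocol):

```python
import hashlib
from itertools import count

def meets_target(challenge, nonce, bits):
    """True if sha256(challenge:nonce) has at least `bits` leading zero bits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - bits) == 0

def solve(challenge, bits=12):
    """Brute-force a nonce that clears the difficulty target."""
    for nonce in count():
        if meets_target(challenge, nonce, bits):
            return nonce

# The server hands out the challenge; the client burns CPU to solve it.
# Cheap once per human visitor, expensive millions of times for a crawler.
nonce = solve("GET /page session=abc123")
print(meets_target("GET /page session=abc123", nonce, 12))  # True
```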
nnx · 6h ago
I do not really get why user-agent blocking measures are despised for browsers but celebrated for agents?

It’s a different UI, sure, but there should be no discrimination towards it as there should be no discrimination towards, say, Links terminal browser, or some exotic Firefox derivative.

ploynog · 6h ago
Being daft on purpose? I haven't heard of an alternative browser suddenly increasing the traffic a user generates by several orders of magnitude, to the point where it can significantly increase hosting costs. A web scraper, on the other hand, easily can, and scrapers often account for the majority of traffic, especially on smaller sites.

So your comparison is at least naive, assuming good intentions, or malicious if not.

gruez · 6h ago
>I do not really get why user-agent blocking measures are despised for browsers but celebrated for agents?

AI broke the brains of many people. The internet isn't a monolith, but prior to the AI boom you'd be hard pressed to find people who were pro-copyright (except maybe a few who wanted to use it to force companies to comply with copyleft obligations), pro user-agent restrictions, or anti-scraping. Now such positions receive consistent representation in discussions, and are even the predominant position in some places (eg. reddit). In the past, people would invoke principled justifications for why they opposed those positions, like how copyright constituted an immoral monopoly and stifled innovation, or how scraping was so important to interoperability and the open web. Turns out for many, none of those principles really mattered and they only held those positions because they thought those positions would harm big evil publishing/media companies (ie. symbolic politics theory). When being anti-copyright or pro-scraping helped big evil AI companies, they took the opposite stance.

542354234235 · 5h ago
There is an expression: "the dose makes the poison". With any sufficiently complex or broad category of situation, there is rarely a binary ideological position that covers any and all situations. Should drugs be legal for recreation? Well, my feelings about marijuana and fentanyl are different. Should individuals be allowed to own weapons? My views differ depending on whether it's a switchblade knife or a Stinger missile. Can law enforcement surveil possible criminals? My views differ based on whether it is a warranted wiretap or an IMSI catcher used on a group of protestors.

People can believe that corporations are using the power asymmetry between them and individuals, through copyright law, to stifle the individual and protect profits. People can also believe that corporations are using the power asymmetry between them and individuals, through AI, to steal intellectual labor done by individuals and protect their profits. People's position might just be that the law should be used to protect the rights of parties when there is a large power asymmetry.

gruez · 3h ago
>There is an expression: "the dose makes the poison". With any sufficiently complex or broad category of situation, there is rarely a binary ideological position that covers any and all situations. Should drugs be legal for recreation? Well, my feelings about marijuana and fentanyl are different. Should individuals be allowed to own weapons? My views differ depending on whether it's a switchblade knife or a Stinger missile. Can law enforcement surveil possible criminals? My views differ based on whether it is a warranted wiretap or an IMSI catcher used on a group of protestors.

This seems very susceptible to manipulation to get whatever conclusion you want. For instance, is dose defined? It sounds like the idea you're going for is that the typical pirate downloads a few dozen movies/games but AI companies are doing millions/billions, but why should it be counted per infringer? After all, if everyone pirates a given movie, that wouldn't add up much in terms of their personal count of infringements, but would make the movie unprofitable.

>People’s position just might be that the law should be used to protect the rights of parties when there is a large power asymmetry.

That sounds suspiciously close to "laws should just be whatever benefits me or my group". If so, that would be a sad and cynical worldview, not dissimilar to the stance on free speech held by the illiberal left and right. "Free speech is an important part of democracy", they say, except when they see their opponents voicing "dangerous ideas", in which case they think it should be clamped down. After all, what are laws for if not a tool to protect the interests of your side?

Fraterkes · 6h ago
I think the intelligent conclusion would be that the people you are looking at have more nuanced beliefs than you initially thought. Talking about broken brains is often just mediocre projecting
gruez · 5h ago
>I think the intelligent conclusion would be that the people you are looking at have more nuanced beliefs than you initially thought.

You don't seem to reject my claim that for many, principles took a backseat to "does this help or hurt evil corporations". If that's what passes as "nuance" to you, then sure.

>Talking about broken brains is often just mediocre projecting

To be clear, that part is metaphorical/hyperbolic and not meant to be taken literally. Obviously I'm not diagnosing people who switched sides with a psychiatric condition.

ipaddr · 4h ago
People never agreed that DOSing a site to take copyrighted material was acceptable. Many people did not have a problem with taking copyrighted material in a respectful way that didn't kill the resource.

LLMs are killing the resource. This isn't a corporation vs person issue. No issue with an llm having my content but big issue with my server being down because llms are hammering the same page over and over.

gruez · 4h ago
>People never agreed that DOSing a site to take copyrighted material was acceptable. Many people did not have a problem with taking copyrighted material in a respectful way that didn't kill the resource.

Has it been shown that Perplexity engages in "DOSing"? I've heard anecdotes of AI bots run amok, and maybe that's what's happening here, but Cloudflare hasn't really shown that. All they did was set up a robots.txt and show that Perplexity bypassed it. There are probably archivers out there using youtube-dl to download from YouTube at 1+ Gbit/s, tens of times more than a typical viewer downloads. Does that mean it's fair game to point to a random instance of someone using youtube-dl and characterize that as "DOSing"?

Fraterkes · 3h ago
The guy that runs Shadertoy talked about how the hosting cost for his free site shot up because OpenAI kept crawling his site for training data (ignoring robots.txt). I think that’s bad, and I have also experimented a bit with using BeautifulSoup in the past to download ~2MB of pictures from Instagram. Do you think I’m holding an inconsistent position?
gruez · 3h ago
My point is that to invoke the "they're DOSing" excuse, you actually have to provide evidence that it's happening in this specific instance, rather than vaguely gesturing at some class of entities (AI companies) and concluding that because some AI companies are DOSing, all AI companies are DOSing. Otherwise it's like YouTube blocking all youtube-dl users for "DOSing" (some fraction of users arguably are), and then justifying their actions with "People never agreed that DOSing a site to take copyrighted material was acceptable".
Fraterkes · 2h ago
I tell you of an instance where the biggest AI company is DOS’ing and your reply is that I haven’t proven all of them are doing it? Why do I waste my time on this stuff
o11c · 5h ago
It's the hypocrisy you're seeing - why are AIs allowed to profit from violating copyright, while people wanting to do actually useful things have been consistently blocked? Either resolution would be fine, but we can't have it both ways.

Regardless, the bigger AI problem is spam, and that has never been acceptable.

magicmicah85 · 6h ago
A crawler intends to scrape the content to reuse for its own purposes while a browser has a human being using it. There's different intents behind the tools.
JimDabell · 6h ago
Cloudflare asked Perplexity this question:

> Hello, would you be able to assist me in understanding this website? https:// […] .com/

In this case, Perplexity had a human being using it. Perplexity wasn’t crawling the site, Perplexity was being operated by a human working for Cloudflare.

nialse · 1h ago
Is it just me or is it rage bait? Switching up marketing a notch when the AI paywall did not get much media attention so far? Cloudflare seems to focus on enterprise marketing nowadays, currently geared towards the media industry, rather than the technical marketing suited for the HN audience. They have no horse in the AI race, so they’re betting on the anti-AI horse instead to gain market share in the media sector?
kissgyorgy · 5h ago
Not sure I would consider a user copy-pasting a URL to be a bot.

Should curl be considered a bot too? What's the difference?

rustc · 4h ago
> Should curl be considered a bot too? What's the difference?

Perplexity definitely does:

    $ curl -sI https://www.perplexity.ai | head -1
    HTTP/2 403
ipaddr · 5h ago
It gets blocked in my setup because bots use this as a workaround.
seydor · 4h ago
> it is built on trust.

This is funny coming from Cloudflare, the company that blocks most of the internet from being fetched with antispam checks, even for a single web request. The internet we knew was open and untrusted, but thanks to companies like Cloudflare, now even the most benign, well-meaning attempt to GET a website is met with a brick wall. The bots of Big Tech, namely Google, Meta and Apple, are of course exempt from this by pretty much every website and by Cloudflare. But try being anyone other than them: no luck. Cloudflare is the biggest enabler of this monopolistic behavior.

That said, why does Perplexity even need to crawl websites? I thought they used 3rd-party LLMs. And those LLMs didn't ask anyone's permission to crawl the entire 'net.

Also, the "Perplexity bots" aren't crawling websites; they fetch URLs that users explicitly asked for. This shouldn't count as something that needs robots.txt access. It's not a robot randomly crawling, it's the user asking for a specific page, basically a shortcut for copy/pasting the content.

andy99 · 3h ago
Can't agree more, Cloudflare is destroying the internet. We've entered the equivalent of when having McAfee antivirus was worse than having an actual virus because it slowed down your computer too much. These user-hostile solutions have taken us back to dialup-era page-loading speeds for many sites; it's absurd that anyone thinks this is a service worth paying for.
rstat1 · 3h ago
So server owners are just supposed to bend over and take all the abuse they get from shitty bots and DDOS attacks and do nothing?

That seems pretty unreasonable.

adrian_b · 1h ago
What's unreasonable is to use incompetent companies like Cloudflare, which are absolutely incapable of distinguishing between the normal usage of a Web site by humans and DDoS attacks or accesses done by bots.

Only this week I have witnessed several dozen cases when Cloudflare has blocked normal Web page accesses without any possible correct reason, and this besides the normal annoyance of slowing every single access to any page on their "protected" sites with a bot check popup window.

rstat1 · 29m ago
I don’t know seems like it was working as intended to me.
madrox · 2h ago
There is a difference between blocking abusive behavior and blocking all bots. No one really cared about bot scraping to this degree before AI scraping for training purposes became a concern. This is fearmongering by Cloudflare for website maintainers who haven't figured out how to adapt to the AI era so they'll buy more Cloudflare.
remus · 1h ago
> No one really cared about bot scraping to this degree before AI scraping for training purposes became a concern. This is fearmongering by Cloudflare for website maintainers who haven't figured out how to adapt to the AI era so they'll buy more Cloudflare.

I think this is an overly harsh take. I run a fairly niche website which collates some info which isn't available anywhere else on the internet. As it happens I don't mind companies scraping the content, but I could totally understand if someone didn't want a company profiting from their work in that way. No one is under an obligation to provide a free service to AI companies.

spwa4 · 2h ago
No, they're supposed to allow scraping and information aggregation. That's the essence of the web: it's all text, crawlable, machine-readable (sort of) and parseable. Feel free to block DDoSes.
bayindirh · 2h ago
Feel free to crawl paywalled sites and republish them with discoverable links.

Also after starting the crawl, you can read about Aaron Swartz while waiting.

inetknght · 2h ago
No, they're supposed to rally together and fight for better laws and enforcement of those laws. Which is, arguably, exactly what they've done just in a way that you and I don't like.
armchairhacker · 1h ago
What kind of laws and enforcement would stop a foreign actor from effectively DDoSing your site? What if the actor has (illegally) hacked tech-illiterate users so they have domestic residential IP addresses?
CharlesW · 3h ago
Ethics-free organizations and individuals like Perplexity are why Cloudflare exists. If you have a better way to solve the problems that they solve, the marketplace would reward you handsomely.
Terretta · 2h ago
Do you think users shouldn't get to have user agents or that "content farm ads scaffold" as a business model has a right to be viable? Forcing users to reward either stance seems unsustainable.
CharlesW · 1h ago
> Do you think users shouldn't get to have user agents or that "content farm ads scaffold" as a business model has a right to be viable?

Users should get to have authenticated, anonymous proxy user agents. Because companies like Perplexity just ignore `robots.txt`, maybe something like Private Access Tokens (PATs) with a new class for autonomous agents could be a solution for this.

By "content farm ads scaffold", I'm not sure if you had Perplexity and their ads business in mind, or those crappy little single-serving garbage sites. In any case, they shouldn't be treated differently. I have no problem with the business model, other than that the scam only works because it's currently trivial to parasitically strip-mine and monetize other people's IP.

adrian_b · 1h ago
While the existence of Perplexity may justify the existence of Cloudflare, it does not justify the incompetence of Cloudflare, which is unable to distinguish accesses done by Perplexity and the like from normal accesses done by humans, who use those sites exactly for the purpose they exist, so there cannot be any excuse for the failure of Cloudflare to recognize this.
adrian_b · 2h ago
In the previous years, I did not have many problems with Cloudflare.

However, in the last few months, Cloudflare has become increasingly annoying. I suspect that they might have implemented some "AI" "threat" detection, which gives much more false positives than before.

For instance, this week I have frequently been blocked when trying to access the home page of some sites where I am a paid subscriber, with a completely cryptic message "The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.".

The only "action" that I have done was opening the home page of the site, where I would then normally login with my credentials.

Also, during the last few days I have been blocked from accessing ResearchGate. I may happen to hit a few times per day some page on the ResearchGate site, while searching for various research papers, which is the very purpose of that site. Therefore I cannot understand what stupid algorithm is used by Cloudflare, that it declares that such normal usage is a "threat".

The weird part is that this blocking happens only if I use Firefox (Linux version). With another browser, i.e. Vivaldi or Chrome, I am not blocked.

I have no idea whether Cloudflare specifically associates Firefox on Linux with "threats" or this happens because whatever flawed statistics Cloudflare has collected about my accesses have all recorded the use of Firefox.

In any case, Cloudflare is completely incapable of discriminating between normal usage of a site by a human (which may be a paying customer) and "threats" caused by bots or whatever "threatening" entities might exist according to Cloudflare.

I am really annoyed by the incompetent programmers who implement such dumb "threat detection solutions", which can create major inconveniences for countless people around the world, while the incompetents who are the cause of this are hiding behind their employer corporation and never suffer consequences proportional to the problems that they have caused to others.

bob1029 · 2h ago
> when having McAfee antivirus was worse than having an actual virus because it slowed down your computer too much

This exact same thing continues in 2025 with Windows Defender. The cheaper Windows Server VMs in the various cloud providers are practically unusable until you disable it.

You can tell this stuff is no longer about protecting users or property when there are no meaningful workarounds or exceptions offered anymore. You must use defender (or Cloudflare) unless you intend to be a naughty pirate user.

I think half of this stuff is simply an elaborate power trip. Human egos are fairly predictable machines in aggregate.

tonyhart7 · 1h ago
ratio'ed and L take

windows defender literally better than most commercial antivirus

Taek · 4h ago
We're moving progressively in the direction of "pages can't be served for free anymore". Which, I don't think is a problem, and in fact I think it's something we should have addressed a long time ago.

Cloudflare only needs to exist because the server doesn't get paid when a user or bot requests resources. Advertising only needs to exist because the publisher doesn't get paid when a user or bot requests resources.

And the thing is... people already pay for internet. They pay their ISP. So people are perfectly happy to pay for resources that they consume on the Internet, and they already have an infrastructure for doing so.

I feel like the answer is that all web requests should come with a price tag, and the ISP that is delivering the data is responsible for paying that price tag and then charging the downstream user.

It's also easy to ratelimit. The ISP will just count the price tag as 'bytes'. So your price could be 100 MB or whatever (independent of how large the response is), and if your internet is 100 mbps, the ISP will stall out the request for 8 seconds, and then make it. If the user aborts the request before the page loads, the ISP won't send the request to the server and no resources are consumed.
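A minimal sketch of that stall-time calculation, assuming the hypothetical scheme described above (the function name and the idea of an ISP-side delay are illustrative, not a real protocol):

```python
# Hypothetical sketch of the "price tag counted as bytes" idea:
# the ISP stalls the request for as long as it would take to
# transfer the price tag's worth of bytes at the user's link speed,
# regardless of the actual response size.

def stall_seconds(price_tag_bytes: int, link_bits_per_sec: int) -> float:
    """Delay before the ISP forwards the request to the origin."""
    return (price_tag_bytes * 8) / link_bits_per_sec

# A 100 MB price tag on a 100 Mbit/s link stalls the request ~8 seconds,
# matching the example above.
delay = stall_seconds(100 * 1000 * 1000, 100 * 1000 * 1000)
print(round(delay))  # 8
```

If the user aborts during the stall, the ISP simply never forwards the request, so the origin spends nothing.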

dabockster · 4h ago
> We're moving progressively in the direction of "pages can't be served for free anymore". Which, I don't think is a problem, and in fact I think it's something we should have addressed a long time ago.

I agree, but your idea below is overly complicated. You can't micro-transact the whole internet.

That idea feels like those episodes of Star Trek DS9 that take place on Ferenginar, where you have to pay admission and sign liability waivers just to walk on the sidewalk. It's not a true solution.

vineyardmike · 3h ago
> You can't micro-transact the whole internet.

I agree that end-users cannot handle micro transactions across the whole internet. That said, I would like to point out that most of the internet is blanketed in ads and ads involve tons of tiny quick auctions and micro transactions that occur on each page load.

It is totally possible for a system to evolve involving tons of tiny transactions across page loads.

helloplanets · 45m ago
You could argue that the suggested system is actually much simpler than the one we currently have for the sites that are "free", aka funded with ads.

The lengths Meta and the like go to in order to maximize clickthroughs...

edoceo · 2h ago
Remember Flattr?
Taek · 1h ago
The presented solution has invisible UX via layering it into existing metered billing.

And, the whole internet is already micro-transactioned! Every page with ads is doing a bidding war and spending money on your attention. The only person not allowed to bid is yourself!

sellmesoap · 2h ago
> You can't micro-transact the whole internet.

Clearly you don't have the lobes for business /s

AlexandrB · 3h ago
A scary observation in light of another front page article right now: https://news.ycombinator.com/item?id=44783566

If pages can't be served for free, all internet content is at the mercy of payment processors and their ideas of "brand safety".

dspillett · 2h ago
“Free” could have a number of meanings here. Free to the viewer, free to the hoster, free to the creator, etc…

That content can't be served entirely for free doesn't mean that all content will require payment, and so is subject to issues with payment processors, just that some things may gravitate back to a model where it costs a small amount to host something (i.e. pay for home internet and host bits off that, or you might have VPS out there that runs tools and costs a few $ /yr or /month). I pay for resources to host my bits & bobs instead of relying on services provided in exchange for stalking the people looking at them, this is free for the viewer as they aren't even paying indirectly.

Most things are paid for anyway, even if the person hosting it nor my looking at it are paying directly: adtech arseholes give services to people hosting content in exchange for the ability to stalk us and attempt to divert our attention. Very few sites/apps, other than play/hobby ones like mine or those from more actively privacy focused types, are free of that.

Taek · 2h ago
That's already a deep problem for all of society. If we don't want that to be an ongoing issue, we need to make sure money is a neutral infrastructure.

It doesn't just apply to the web, it applies to literally everything that we spend money on via a third party service. Which is... most everything these days.

OptionOfT · 3h ago
> We're moving progressively in the direction of "pages can't be served for free anymore". Which, I don't think is a problem, and in fact I think it's something we should have addressed a long time ago.

But it's done through a bait and switch. They serve the full article to Google, which allows Google to show you excerpts that you have to pay for.

It would be better if Google showed something like PAYMENT REQUIRED on top; at least that way I know what I'm getting into.

mh- · 3h ago
> They serve the full article to Google, which allows Google to show you excerpts that you have to pay for.

I'm old enough to remember when that was grounds for getting your site removed from Google results - "cloaking" was against the rules. You couldn't return one result for Googlebot, and another for humans.

No idea when they stopped doing that, but they obviously have let go of that principle.

dspillett · 2h ago
I remember that too, along with high-profile punishments for sites that were keyword stuffing (IIRC a couple of decades ago BMW were completely unlisted for a time for this reason).

I think it died largely because it became impossible to police with any reliability, and being strict about it would remove too much from Google's index, because many sites are not easily indexable without providing a “this is the version without all the extra round-trips for ad impressions and maybe a login needed” variant to common search engines.

Applying the rule strictly would mean that sites implementing PoW tricks like Anubis to reduce unwanted bot traffic would not be included in the index if they serve to Google without the PoW step.

I can't say I like that this has been legitimised, even for the (arguably more common) deliberate bait & switch tricks, but (I think) I understand why the rule was allowed to slide.

Saline9515 · 2h ago
Why would I pay for a page if I don't know if the content is what I asked for? How much are you going to pay? How much are you going to charge? This will end up in SEO hell, especially with AI-generated pages farming paid clicks.
adrian_b · 1h ago
Your theory does not match the practice of Cloudflare.

Whatever method is used by Cloudflare for detecting "threats" has nothing to do with consuming resources on the "protected" servers.

The so-called "threats" are identified in users that may make a few accesses per day to a site, transferring perhaps a few kilobytes of useful data on the viewed pages (besides whatever amount of stupid scripts the site designer has implemented).

So certainly Cloudflare does not meter the consumed resources.

Moreover, Cloudflare preemptively annoys any user who accesses for the first time a site, having never consumed any resources, perhaps based on irrational profiling based on the used browser and operating system, and geographical location.

seer · 4h ago
Hah still remember the old “solving the internet with hate” idea from Zed Shaw in the glory days of Ruby on Rails.

https://weblog.masukomi.org/2018/03/25/zed-shaws-utu-saving-...

I do believe we will end there eventually, with the emerging tech like Brazil’s and India’s payment architectures it should be a possibility in the coming decades

nazcan · 4h ago
I think value is not proportional to bytes: an AI only needs to read a page once to add it to its model, and can then serve the effectively cached data many times.
chromatin · 3h ago
402 Payment Required

https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...

Sadly development along these lines has not progressed. Yes, Google Cloud and other services may return it and require some manual human intervention, but I'd love to see _automatic payment negotiation_.

I'm hopeful that instant-settlement options like Bitcoin Lightning payments could progress us past this.

https://docs.lightning.engineering/the-lightning-network/l40...

https://hackernoon.com/the-resurgence-of-http-402-in-the-age...
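A hedged sketch of what automatic 402 negotiation could look like. The `X-Payment-Invoice`/`X-Payment-Proof` header names and the toy server are hypothetical; no current standard defines them:

```python
# Hypothetical 402 negotiation loop: on "402 Payment Required",
# derive a payment proof from the server's invoice and retry once.

def negotiate(server, pay):
    """server(headers) -> (status, headers, body); pay(invoice) -> proof."""
    status, hdrs, body = server({})
    if status == 402:
        proof = pay(hdrs["X-Payment-Invoice"])  # e.g. a settled Lightning invoice
        status, hdrs, body = server({"X-Payment-Proof": proof})
    return status, body

# Toy origin: demands payment of invoice "inv-1" before serving the page.
def toy_server(headers):
    if headers.get("X-Payment-Proof") == "paid:inv-1":
        return 200, {}, "<html>content</html>"
    return 402, {"X-Payment-Invoice": "inv-1"}, ""

status, body = negotiate(toy_server, lambda inv: f"paid:{inv}")
print(status, body)  # 200 <html>content</html>
```

The interesting part is entirely in `pay()`: for this to work unattended, settlement has to be near-instant, which is why Lightning-style payments keep coming up.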

makingstuffs · 3h ago
As time passes I’m more certain in the belief that the internet will end up being a licensed system with insanely high barriers to entry which will stop your average dev from even being able to afford deploying a hobby project on it.

Your idea of micro transacting web requests would play into it and probably end up with a system like Netflix where your ISP has access to a set of content creators to whom they grant ‘unlimited’ access as part of the service fee.

I’d imagine that accessing any content creators which are not part of their package will either be blocked via a paywall (buy an addon to access X creators outside our network each month) or charged at an insane price per MB as is the case with mobile data.

Obviously this is all super hypothetical, but weirder stuff has happened in my lifetime

bboygravity · 2h ago
I get your thinking, but x.com is proof that simply making users pay (quite a lot) does not eliminate bots.

The amount of "verified" paying "users" with a blue checkmark that are just total LLM bots is incredible on there.

As long as spamming and DDOS'ing pays more than whatever the request costs, it will keep existing.

debesyla · 4h ago
Wouldn't this lead to pirated page clones where the customer pays less for the same-ish content, and less again, all the way down to essentially free?

Because I as a user would be glad to have a "free sites only" filter, and then just steal content :))

But it's an interesting idea and thought experiment.

armchairhacker · 3h ago
That’s fine. The point for website owners isn’t to make money, it’s to not spend money hosting (or more specifically, to pay only a small fixed rate for hosting). They want people to see the content; if someone makes the content more accessible, that’s a good thing.
mapontosevenths · 3h ago
You ignore the issue of motivation. Most web content exists because someone wants to make money on it. If the content creator can't do that, they will stop producing content.

These AI web crawlers (Google, Perplexity, etc) are self-cannibalizing robots. They eat the goose that laid the golden egg for breakfast, and lose money doing it most of the time.

If something isn't done to incentivize content creators again eventually there will be only walled-gardens and obsolete content left for the cannibals.

armchairhacker · 2h ago
AFAIK, currently creators get money while not charging for users because of ads.

While I don’t blame creators for using ads now, I don’t think they’re a long-term solution. Ads are already blocked when people visit the site with ad blockers, which are becoming more popular. Obvious sponsored content may be blocked with the ads, and non-obvious sponsored content turns these “creators” into “shills” who are inauthentic and untrustworthy. Even without Google summaries, ad revenue may decrease over time as advertisers realize they aren’t effective or want more profit; even if it doesn’t, it’s my personal opinion that society should decrease the overall amount of ads.

Not everyone creates only for money, the best only create for enough money to sustain themselves. A long-term solution is to expand art funding (e.g. creators apply for grants with their ideas and, if accepted, get paid a fixed rate to execute them) or UBI. Then media can be redistributed, remixed, etc. without impacting creators’ finances.

Terretta · 2h ago
Pretty sure this "most" motivation means it's not a golden egg. It's SEO slop.

If only the one in ten thousand with something to share are left standing to share it, no manufactured content, that's a fine thing.

Terretta · 2h ago
Strongly agree with this armchair POV. Btw it doesn't cost much to host markdown.
Terretta · 2h ago
Or, flip this, don't expect to get paid for pamphleteering?
BolexNOLA · 4h ago
My first reaction: This solution would basically kill what little remaining fun there is to be had browsing the Internet and all but assure no new sites/smaller players will ever see traffic.

Curious to hear other perspectives here. Maybe I’m over reacting/misunderstanding.

armchairhacker · 3h ago
Depending on the implementation (a big if) it would help smaller websites, because it would make hosting much cheaper. ISPs don’t choose what sites users visit, only what they pay. As long as the ISP isn’t giving significant discounts to visiting big sites (just charging a fixed rate per bytes downloads and uploaded) and charging something reasonable, visiting a small site would be so cheap (a few cents at most, but more likely <1 cent) users won’t weigh cost at all.
BolexNOLA · 3h ago
But users depend on major sites like google [insert service] still and will prioritize their usage accordingly like limited minutes and texts back in the day, right?
armchairhacker · 3h ago
Networking is so cheap, unless ISPs drastically inflate their price, users won’t care.

The average American allegedly* downloads 650-700GB/month, or >20GB/day. 10MB is more than enough for a webpage (honestly, 1MB is usually enough), so that means on average, ISPs serve over 2000 webpages worth of data per day. And the average internet plan is allegedly** $73/month, or <$2.50/day. So $2.50 gets you over 2000 indie sites.

That’s cheap enough, wrapped in a monthly bill, users won’t even pay attention to what sites they visit. The only people hurt by an ideal (granted, ideal) implementation are those who abuse fixed rates and download unreasonable amounts of data, like web crawlers who visit the same page seconds apart for many pages in parallel.

* https://www.astound.com/learn/internet/average-internet-data...

** https://www.nerdwallet.com/article/finance/how-much-is-inter...
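The back-of-envelope numbers above can be checked directly (all inputs are the estimates cited above, not measured data):

```python
# Sanity check of the figures above: ~650 GB/month of usage,
# ~$73/month for the plan, ~10 MB per webpage.
gb_per_month = 650
dollars_per_month = 73
mb_per_page = 10

gb_per_day = gb_per_month / 30
pages_per_day = gb_per_day * 1000 / mb_per_page
dollars_per_day = dollars_per_month / 30

print(round(gb_per_day, 1))       # ~21.7 GB/day
print(int(pages_per_day))         # ~2166 pages/day
print(round(dollars_per_day, 2))  # ~$2.43/day
```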

brookst · 1h ago
Wait, so the ISPs go from taking $73/user home today to taking $0/user home tomorrow under this plan?
BolexNOLA · 48m ago
Yeah same reaction here - there's no world in which ISP's would agree to this and even if they did I don't want to add them to my list of utilities I have to regularly fight with over claimed vs. actual usage like I do with my power/water/gas companies.
Analemma_ · 3h ago
If site operators can’t afford the costs of keeping sites up in the face of AI scraping, the new/smaller sites are gone anyway.
BolexNOLA · 48m ago
Maybe not but we are not realistically in an either/or scenario here.
novok · 3h ago
The reason why that didn’t work was because regulations made micropayments too expensive, and the government wants it that way to keep control over the financial system.
concinds · 3h ago
> The internet we knew was open and untrusted, but thanks to companies like Cloudflare, now even the most benign, well-meaning attempt to GET a website is met with a brick wall

I don't think it's fair to blame Cloudflare for that. That's looking at a pool of blood and not what caused it: the bots/traffic which predate LLMs. And Cloudflare is working to fix it with the PrivacyPass standard (which Apple joined).

Each website is freely opting-into it. No one was forced. Why not ask yourself why that is?

seydor · 3h ago
do you think that every well-meaning GET request should be treated the same way as a distributed attack? The latter is the reason why people use CF, not the former.
concinds · 3h ago
The line can be extremely blurry (that's putting it mildly), and "the latter" is not the only reason people use CF (actually, I wouldn't be surprised at all if it wasn't even the biggest reason).
akagusu · 2h ago
The reason people use Cloudflare is because they provide free CDN, and we have at least 10 years of content marketing out there telling aspiring bloggers that, if they use a CDN in front of their website, their shitty WordPress website hosted on a shady shared hosting will become fast.
tonyhart7 · 1h ago
well they aren't wrong
renrutal · 3h ago
How does one tell a "well-meaning" request from an attack?
benregenspan · 3h ago
> The bots of Big Tech, namely Google, Meta and Apple are of course exempt from this by pretty much every website and by cloudflare. But try being anyone other than them , no luck. Cloudflare is the biggest enabler of this monopolistic behavior

The Big Tech bots provide proven value to most sites. They have also through the years proven themselves to respect robots.txt, including crawl speed directives.

If you manage a site with millions of pages, and over the course of a couple years you see tens of new crawlers start to request at the same volume as Google, and some of them crawl at a rate high enough (and without any ramp-up period) to degrade services and wake up your on-call engineers, and you can't identify a benefit to you from the crawlers, what are you going to do? Are you going to pay a lot more to stop scaling down your cluster during off-peak traffic, or are you going to start blocking bots?

Cloudflare happens to be the largest provider of anti-DDoS and bot protection services, but if it wasn't them, it'd be someone else. I miss the open web, but I understand why site operators don't want to waste bandwidth and compute on high-volume bots that do not present a good value proposition to them.

Yes this does make it much harder for non-incumbents, and I don't know what to do about that.

seydor · 3h ago
it's because those SEO bots keep crawling over and over, which Perplexity does not seem to do (considering that the URLs are user-requested). Those are different cases, and robots.txt is only about the former. Cloudflare in this case is not doing "DDoS protection" because I presume Perplexity does not constantly refetch or crawl or DDoS the website (if Perplexity does those things then they are guilty)

https://www.robotstxt.org/faq/what.html

I wonder if cloudflare users explicitly have to allow google or if it's pre-allowed for them when setting up cloudflare.

Despite what Cloudflare wants us to think here, the web was always meant to be an open information network, and spam protection should not fundamentally change that characteristic.

benregenspan · 3h ago
I believe that AI crawlers are the main thing that is currently blocked by default when you enroll a new site. No traditional crawlers are blocked, it's not that the big incumbents are allow-listed. And I think that clearly marked "user request" agents like ChatGPT-User are not blocked by default.

But at the end of the day it's up to the site operator, and any server or reverse proxy provides an easy way to block well-behaved bots that use a consistent user-agent.
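(To make "easy way" concrete, here is an nginx sketch. The user-agent tokens are illustrative; check each crawler's published strings before relying on them.)

```nginx
# http-level: classify requests by User-Agent (bot names illustrative)
map $http_user_agent $blocked_bot {
    default          0;
    ~*GPTBot         1;
    ~*PerplexityBot  1;
    ~*ClaudeBot      1;
}

server {
    listen 80;

    location / {
        if ($blocked_bot) {
            return 403;  # or 429 to ask well-behaved clients to back off
        }
        proxy_pass http://backend;
    }
}
```

Of course this only stops bots that identify themselves, which is exactly what the article alleges Perplexity stopped doing.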

akagusu · 2h ago
> The Big Tech bots provide proven value to most sites.

They provide value for their companies. If you get some value from them, it's just a side effect.

eddythompson80 · 4h ago
> The bots of Big Tech, namely Google, Meta and Apple are of course exempt from this by pretty much every website and by cloudflare. But try being anyone other than them, no luck. Cloudflare is the biggest enabler of this monopolistic behavior

Plenty of site/service owners explicitly want Google, Meta and Apple bots (because they believe they have a symbiotic relationship with it) and don't want your bot because they view you as, most likely, parasitic.

seydor · 3h ago
they didn't seem to mind when OpenAI et al. took all their content to train LLMs, back when they were still parasites without a symbiotic relationship. This thinking is kind of too pro-monopolist for me
eddythompson80 · 3h ago
Pretty sure they DID mind that. It's what the whole post is about.
golergka · 1h ago
That’s a good thing. You want an LLM to know about product or service you are selling and promote it to its users. Getting into the training data is the new SEO.
pkilgore · 3h ago
> This is funny coming from Cloudflare, the company that blocks most of the internet from being fetched with antispam checks even for a single web request.

Am I misunderstanding something? I (the site owner) pay Cloudflare to do this. It is my fault this happens, not Cloudflare's.

layer8 · 2h ago
You’re paying Cloudflare to not get DDoS-attacked or swamped by illegitimate requests. GP is implying that Cloudflare could do a better job of not blocking legitimate, benign requests.
pkilgore · 1h ago
Then we're all operating with very different definitions of legitimate or benign!

I've only ever seen a Cloudflare interstitial when viewing a page with my VPN on, for example -- something I'm happy about as a site owner and accept quite willingly as a VPN user knowing the kinds of abuse that occur over VPN.

binarymax · 4h ago
Here's how perplexity works:

1) It takes your query, and given the complexity might expand it to several search queries using an LLM. ("rephrasing")

2) It runs queries against a web search index (I think it was using Bing or Brave at first, but they probably have their own by now), and uses an LLM to decide which are the best/most relevant documents. It starts writing a summary while it dives into sources (see next).

3) If necessary it will download full source documents that popped up in search to seed the context when generating a more in-depth summary/answer. They do this themselves because using OpenAI to do it is far more expensive.

#3 is the problem, especially because SEO has made it so the same sites pop up on top for certain classes of queries (for example, Reddit will be on top for product reviews a lot). These sites operate on ad revenue, so their incentive is to block. Perplexity does whatever it can in the game of sidestepping the sites' wishes. They are a bad actor.

EDIT: I should also add that Google, Bing, and others, always obey robots.txt and they are good netizens. They have enough scale and maturity to patiently crawl a site. I wholeheartedly agree that if an independent site is also a good netizen, they should not be blocked. If Perplexity is not obeying robots.txt and they are impatient, they should absolutely be blocked.
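(For reference, the robots.txt check a well-behaved fetcher does is only a few lines. A sketch with Python's stdlib parser against a made-up robots.txt, not any real site's rules:)

```python
from urllib import robotparser

# Hypothetical robots.txt; a real crawler fetches https://host/robots.txt
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

rp.can_fetch("ExampleBot", "https://example.com/private/page")  # False
rp.can_fetch("ExampleBot", "https://example.com/reviews")       # True
rp.crawl_delay("*")                                             # 10
```

Respecting the answer (and the crawl delay) is the entire contract; there is no technical enforcement.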

pests · 3h ago
What's wrong with it downloading documents when the user asks it to? My browser also downloads whole documents, and sometimes even prefetches documents I haven't clicked on yet. Toss in an adblocker or reader mode and my browser also strips all the ads.

Why is it okay for me to ask my browser to do this but I can’t ask my LLM to do the same?

michaelt · 1h ago
When Google sends people to a review website, 30% of users might have an adblocker, but 70% don't. And even those with adblockers might click an affiliate link if they found the review particularly helpful.

When ChatGPT reads a review website, though? Zero ad clicks, zero affiliate links.

binarymax · 3h ago
There's nothing wrong with downloading documents; I do this in my personal search app. But if you hammer a site that wants you to calm down, or bypass robots.txt, that's wrong.
pests · 3h ago
robots.txt is for bots, and I am not one. As a user I can access anything regardless of whether it is blocked to bots. There are other mechanisms, like status codes, to rate limit or authenticate if that is an issue.
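(A sketch of that server-side mechanism: a naive per-IP sliding window that returns 429 with Retry-After. The limits are hypothetical; a production setup would use nginx's limit_req or similar rather than application code.)

```python
import time
from collections import defaultdict

WINDOW = 60   # seconds (hypothetical)
LIMIT = 100   # requests per IP per window (hypothetical)

_hits = defaultdict(list)  # ip -> timestamps of recent requests

def check_rate(ip, now=None):
    """Return (status, headers); 429 with Retry-After once the limit is exceeded."""
    now = time.time() if now is None else now
    recent = [t for t in _hits[ip] if now - t < WINDOW]
    recent.append(now)
    _hits[ip] = recent
    if len(recent) > LIMIT:
        return 429, {"Retry-After": str(WINDOW)}
    return 200, {}
```

A client that honors Retry-After, whether a browser or an LLM fetcher, backs off; one that ignores it and rotates IPs is the behavior the article is complaining about.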
binarymax · 2h ago
I'm talking about Perplexity's behavior. Perhaps there's a point of contention about Perplexity downloading a document on a person's behalf. I view it this way: if there is a service running that does it for multiple people, then it's a bot.
layer8 · 2h ago
Perplexity makes requests on behalf of its users. I would argue that’s only illegitimate if the combined volume of the requests exceeds what the users would do by an order of magnitude or two. Maybe that’s what’s happening.

But “for multiple people” isn’t an argument IMO, since each of those people could run a separate service doing the same. Using the same service, on the contrary, provides an opportunity to reduce the request volume by caching.

mastodon_acc · 2h ago
As a website owner I definitely want the capability to allow and block certain crawlers. If I say I don't want crawlers from Perplexity, they should respect that. This sneaky evasion just highlights that the company is not to be trusted, and I would definitely pay any hosting provider that helps me enforce blocking parasitic companies like Perplexity.
kentonv · 2h ago
> the "perplexity bots" arent crawling websites, they fetch URLs that the users explicitly asked. This shouldnt count as something that needs robots.txt access. It's not a robot randomly crawling, it's the user asking for a specific page and basically a shortcut for copy/pasting the content

You say "shouldn't" here, but why?

There seems to be a fundamental conflict between two groups who each assert they have "rights":

* Content consumers claim the right to use whatever software they want to consume content.

* Content creators claim the right to control how their content is consumed (usually so that they can monetize it).

These two "rights" are in direct conflict.

The bias here on HN, at least in this thread, is clearly towards the first "right". And I tend to come down on this side myself, as a computer power user. I hate that I cannot, for example, customize the software I use to stream movies from popular streaming services.

But on the other hand, content costs money to make. Creators need to eat. If the content creators cannot monetize their content, then a lot of that content will stop being made. Then what? That doesn't seem good for anyone, right?

Whether or not you think they have the "right", Perplexity totally breaks web content monetization. What should we do about that?

(Disclosure: I work for Cloudflare but not on anything related to this. I am speaking for myself, not Cloudflare.)

kiratp · 2h ago
The web browsers that the AI companies are about to ship will make requests that are indistinguishable from user requests. The ship has sailed on trying to preserve that distinction.
kentonv · 2h ago
We will be able to distinguish them.
Terretta · 1h ago
"Creators" need to eat, OK, but there's no right to get paid to paste yesterday's recycled newspapers on my laptop screen. Making that unprofitable seems incredibly good for by and large everyone.

It'd likely be a fantastic good if "content creators" stopped being able to eat from the slop they shovel. In the meantime, the smarter the tools that let folks never encounter that form of "content", the more they will pay for them.

There remain legitimate information creation or information discovery activities that nobody used to call "content". One can tell which they are by whether they have names pre-existing SEO, like "research" or "journalism" or "creative writing".

Ad-scaffolding, which is what the word "content" came to mean, costs money to make, ideally less than the ads it hosts generate. This simple equation means the whole ecosystem, together with the technology attempting to perpetuate it as viable, is an ouroboros, eating its own effluvia.

It is, I would argue, undetermined that advertising-driven content as a business model has a "right" to exist in today's form, rather than any number of other business models that sufficed for millennia of information and artistry before.

Today LLMs serve both the generation of additional literally brain-less content, and the sifting of such from information worth using. Both sides are up in arms, but in the long run, it sure seems some other form of information origination and creativity is likely to serve everyone better.

rat9988 · 4h ago
Don't they need a search index?
bwb · 2h ago
I could not keep my website up without Cloudflare given the level of bot and AI crawlers hammering things. I try whenever to do challenges, but sometimes I have to block entire AS blocks.
blantonl · 2h ago
Ask yourself why so many content hosting platforms utilize Cloudflare's services, and then contrast that perspective with your posted one. It might enlighten you a bit to think about that for a second.
cantaccesrssbit · 2h ago
I crawl 3000 RSS feeds once a week. Let me tell you: Cloudflare sucks. What business is it of theirs to block something that is meant to be accessed by everyone, like an RSS feed? FU Cloudflare.
KomoD · 2h ago
That's not Cloudflare's fault, that's the website owner's fault.

If they want the RSS feeds to be accessible then they should configure it to allow those requests.
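(For example, something along these lines in Cloudflare's rules language, paired with a "Skip" action over bot protection in the dashboard; the feed paths are illustrative and will differ per site:)

```
(http.request.uri.path in {"/feed" "/rss.xml" "/atom.xml"})
```

That exempts feed endpoints from challenges while leaving the rest of the site protected, but the site owner has to set it up; Cloudflare won't guess which paths are feeds.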

busymom0 · 3h ago
> why does perplexity even need to crawl websites?

I was recently working on a project where I needed to find out the published date for a lot of article links and this came helpful. Not sure if it's changed recently but asking ChatGPT, Gemini etc didn't work and it said that it doesn't have access to the current websites. However, asking perplexity, it fetched the website in real time and gave me the info I needed.

I do agree with the rest of your comment that this is not a random robot crawling. It was doing what a real user (me) asked it to fetch.

golergka · 1h ago
Ironically, cloudflare is also the reason OpenAI agent mode with web use isn’t very usable right now. Every second time I asked it to do a mundane task like checking me in for a flight it couldn’t because of cloudflare.
tonyhart7 · 1h ago
what's ironic with this???

We're seeing many posts about site owners getting hit by millions of requests because of LLMs. We can't blame Cloudflare for this, because it's literally a necessary evil

raincole · 3h ago
I'm sorry, but that's some crazy take.

Sure, the internet should be open and not trusted. But physical reality exists. Hosting and bandwidth cost money. I trust Google won't DDoS my site or cost me an arbitrary amount of money. I won't trust bots made by random people on the internet in the same way. The fact that Google respects robots.txt while Perplexity doesn't tells you why people trust Google more than random bots.

seydor · 3h ago
agree to disagree, but:

Google already has access to any webpage because its own search crawlers are allowed by most websites, and Google crawls recursively. Thus Gemini has the advantage of this synergy with Google Search. Perplexity does not crawl recursively (I presume; therefore it does not need to consult robots.txt), and it doesn't have synergies with a major search engine.

zer00eyz · 4h ago
> The internet we knew was open and not trusted ... monopolistic behavior

Monopolistic is the wrong word, because you have the problem backwards. Cloudflare isn't helping Apple/Google; it's helping its paying customers, and those are the only services those customers want to let through.

Do you know how I can predict that AI agents, the sort that end users use to accomplish real tasks, will never take off? Because the people your agent would interact with want your EYEBALLS for ads, build anti-patterns on purpose, and want to make it hard to unsubscribe, cancel, get a refund, or do a return.

AI that is useful to people will fail, for the same reason that no one has great public APIs any more: every public company's real customers are its stockholders, and the consumers are simply a source of revenue, one that is modeled, marketed to, and manipulated, all in the name of returns on investment.

Zak · 3h ago
I disagree about AI agents, at least those that work by automating a web browser that a human could also use. I suppose Google's proposal to add remote attestation to Chrome might make it a little harder, but that seems to be dead for now (and I hope forever).
seydor · 3h ago
As agents become more useful, the monetization model will shift to something... that we haven't thought of yet.
TZubiri · 4h ago
Websites and any business really, have the right to impose terms of use and deny service.

Anyone circumventing bans is doing something shitty and illegal; see the Computer Fraud and Abuse Act and Craigslist v. 3Taps.

"And those LLMs didn't ask anyones permission to crawl the entire 'net."

False: OpenAI respects robots.txt, doesn't mask IPs, and paid a bunch of money to Reddit.

You either side with the law or with criminals.

seydor · 4h ago
Is that also how, e.g., Anthropic trained on LibGen?

You can't even say the same thing about OpenAI, because we don't know the corpus they train their models on.

pphysch · 4h ago
Spam and DDOS are serious problems, it's not fair to suggest Cloudflare is just doing this to gatekeep the Internet for its own sake.
seydor · 4h ago
It's definitely not a DDoS when it's a single HTTP request per year. I don't know if they do it on purpose, but the fact is that none of the big tech crawlers are limited.
zaphar · 4h ago
This is most attributable to the fact that traffic is essentially anonymous so the source ip address is the best that a service can do if it's trying to protect an endpoint.
ok123456 · 2h ago
OVH does a good job with DDoS
jklinger410 · 4h ago
> That said, why does perplexity even need to crawl websites?

So you just came here to bitch about Cloudflare? It's wild to even comment on this thread if this does not make sense to you.

They're building a search index. Every AI is going to struggle at being a tool to find websites & business listings without a search index.

chuckreynolds · 4h ago
insert 'shocked' emoji face here
throw_m239339 · 6h ago
> How can you protect yourself?

Put your valuable content behind a paywall.

b0ner_t0ner · 5h ago
A combination of "Bypass Paywalls Clean for Firefox" and archive.is usually get past these.
schmorptron · 4h ago
Isn't that only because they offer unpaywalled versions to web crawlers in the first place, so they still get ranked in search results?
tempfile · 4h ago
A lot of people posting here seem to think you have a magical god-given right to make money from posting on the public internet. You do not. Effective rate-limiting of crawlers is important, but if the rate is moderated, you do not have a right to decide what people do with the content. If you don't believe that, get off the internet, and don't let the door hit you on the way out.
ibero · 4h ago
what if i want the rate set to zero?
tempfile · 14m ago
Then turn off the server?

You don't have a right to say who or what can read your public website (this is a normative statement). You do have a right not to be DoS'd. If you pretend not to know what that means, it sounds the same as saying "you have an arbitrary right to decide who gets to make requests to your service", but it does not mean that.

jgrall · 3h ago
"You do not have a right to decide what people do with the content." Smh. Yes, laws be damned.
tucnak · 5h ago
The rage-baiters in this thread are merely fishing for excuses to go up against "the Machine," but honestly, they are wildly off the mark when it comes to the reality of crawling. This topic was chewed to bits long before LLMs, but it's only a big deal now because somebody is able to make money by selling automation, of all things..? The irony would be strong to hear this from programmers, if only it didn't spell Resentment all over.

If you don't want to get scraped, don't put your stuff online.

bbqfog · 6h ago
If you put info on the web, it should be available to everyone or everything with access.
Workaccount2 · 6h ago
What this actually translates to is "Don't bother putting much effort into web content. Put effort into siloed mobile app content where you get compensation".

People like getting money for their work. You do too. Don't lose sight of that.

9cb14c1ec0 · 6h ago
Even for AI summaries that leech off your content without sending any traffic your direction?
goatlover · 4h ago
You're making a moral statement without providing a justification. Why should it for everything with access?
TechDebtDevin · 6h ago
Not according to CF. They are desperate to turn websites into newspaper dispensers, where you give them a quarter to see the content, on the basis that a bot is somehow different from a normal human visitor on a legal basis. CF has been trying this psyop for years.
ectospheno · 6h ago
Sites aren’t getting ad clicks for this traffic. Thus, they have an incentive to do something. Cloudflare is just responding to the market. Is this response bad for us in the long run? Probably. Screaming about cloudflare isn’t going to change the market. You fix a problem with capitalism by using supply and demand levers. Everything else is folly.
TechDebtDevin · 2h ago
I wonder if crawlers started letting ads through, and interacting with them a bit, if these complaints would go away. If we can just shaft the advertisers, maybe that will solve the whole problem :)
TechDebtDevin · 6h ago
Cloudflare is screaming into the void, desperate to insert itself as a middleman in a market (that they will never succeed in creating) where they extort scrapers for access to the websites they cover.

Sorry CF, give up. The courts are on our side here.

sbarre · 6h ago
Which courts exactly?

The world is bigger than the USA.

Just because American tech giants have captured and corrupted legislators in the US doesn't mean the rest of the world will follow.

morkalork · 6h ago
Are you sure? I'm surprised they haven't jumped in on the "scan your face to see the webpage" madness that's taking off around the world
echo42null · 3h ago
Hmm, I’ve always seen robots.txt more as a polite request than an actual rule.

Sure, Google has to follow it because they're a big company and need to respect certain laws or internal policies. But for everyone else, it's basically just a "please don't" sign, not a legal requirement, right?