Blocking LLMs from your website cuts you off from next-generation search

47 johnjwang 52 8/6/2025, 5:38:09 PM johnjianwang.medium.com ↗

Comments (52)

jerf · 2h ago
This post gets the reason people are cutting off LLMs exactly backwards and consequently completely fails to address the core issue. The whole reason people are blocking LLMs is precisely that they believe LLMs kill the flow of readers to their content. The LLMs present your ideas and content, maybe with super-tiny attribution that nobody notices or uses [1], maybe with no attribution at all, and you get nothing. People are blocking LLMs with the precise intent of preserving the flow to their content, be it commercial, reputational, or whatever.

[1]: https://www.pewresearch.org/short-reads/2025/07/22/google-us...

andrewmutz · 19m ago
Why would removing your content from LLM training data cause people to go and seek it out directly from you?

Would removing your website from google search results cause people to go directly to your website?

piker · 38m ago
Fair, if your content is your product, but I’m more than happy for every LLM on the planet to summarize my page and hype the virtues of my product to its user.
Disposal8433 · 33m ago
Why do tech bros assume that every site is selling a product? There are blogs, personal web sites, communities, and open-source projects out there.
piker · 28m ago
If there's no product, and it's free, why would one care about it appearing in the output of an LLM? If it's so secret that it shouldn't, then perhaps it should be behind some auth anyway.
Wilder7977 · 16m ago
Because writing is, in many senses, exposing yourself, and at the very least you want recognition for it (even if only in the form of a visit to the website, and maybe the interactions that follow)? Maybe you want at least the prestige that comes with writing good content that took a lot of time to create. Maybe you want to participate in a community with your stuff. Maybe a million other reasons.

I know that Medium, Substack, and the other "publication" platforms (like LinkedIn) are trying to commodify even the act of writing into purely a form of marketing (either for a product or for your personal brand), but not everyone has given up just yet.

nerdjon · 2h ago
That is basically several paragraphs just to say "well, you should adapt to the new world instead of pushing against bad practices". There is barely any actual "why" here.

We just had the article about how AI search is leading to fewer clicks, so where is that supposed "pipeline"?

It also completely ignores that you may not want your information misconstrued (basically lied about) to the user, with a helpful link telling them where the source is that they may never click. Worse, if they realize the information they were given is wrong, they may assume it's because your site was wrong and trust you less, all without ever clicking that link.

shortstuffsushi · 12m ago
I'm surprised I don't see any comments here to this effect yet: isn't this just AMP 2.0? Website authors don't want their content scraped and rehosted by a 3rd party, even when that 3rd party claims it's for their own benefit. We had a whole kerfuffle about this nearly a decade ago. The arguments for both sides don't appear to have changed.
riffraff · 2h ago
> LLMs are the next generation’s search layer. They’re already generating massive amounts of pipeline for the companies and websites that have gotten good at getting their content displayed in LLMs

[citation needed]

cpursley · 2h ago
Just check your analytics dashboards and see where hits are now starting to come from. Saw on LinkedIn the other day that a company in the space I serve had a new customer find them via ChatGPT.
eric-burel · 1h ago
The first sentence of the article is literally wrong: it conflates the LLM itself with the search component of RAG (retrieval-augmented generation, where a web search is combined with an LLM). Blocking search bots cuts you off from next-generation search because it cuts you off from search, period. Blocking LLMs, so far, simply keeps you out of the training dataset, which is not the same thing. Please stop upvoting such bad content; it really makes Hacker News a terrible place for staying informed about LLMs.
bellBivDinesh · 2h ago
Incredibly simplistic. I’m having a hard time believing a real person wrote this, read it over and decided they had made anything resembling a point.

How about the fact that Google (ideally) sends users to you rather than sharing your work unattributed?

endemic · 14m ago
Heck, Google mostly shows "AI Summaries" and ads -- you'd be lucky to get traffic from 'em now!
ryandrake · 2h ago
Like everything else on the web, LLMs are going to eventually be ruined by marketing teams trying to get them to say "Pepsi" instead of "Coke."
tartoran · 1h ago
Long live local LLMs!
jdiff · 44m ago
Open models still have to get their data from somewhere; the only thing they're any more immune to is direct corruption. But marketers have shown time and time again that if there's any algorithmic crack in the wall, they will find it.
SideburnsOfDoom · 54m ago
I don't know what you mean ... by including "eventually" in that sentence.
ayaros · 20m ago
Screw this. I didn't put effort into writing many paragraphs of content for my own websites just so it could be summarized by an LLM. I wrote it because I wanted other human beings to read it.

This is just yet another person running an AI company telling me why I should provide free data and labor to the LLMs that power their company. These AI companies are acting as middlemen between the end-user and the content creator; it's the latest iteration of an age-old business model that works out great for the middlemen. Meanwhile, the people on either side are taken advantage of.

If the "next-generation" of search is accessed mostly through an LLM, then there's no incentive to participate in it unless you're directly selling a product or service... and then you have to hope and pray the LLM doesn't lie and misrepresent you. Otherwise, if you're making a website to share information or show off your own work, there's zero incentive to participate.

If AI companies want to pay me cold hard cash every time they query my site, then we can negotiate.

mflaherty22 · 2h ago
Very reductionist - so much so that I'm not even sure you understand why websites block LLMs.
JSR_FDED · 2h ago
Nonsensical article. Even if your goal is to create something on the web “for others” (as the article asserts), when 99.9% of your costs go to serving LLM crawlers, it puts that very objective at risk.
merelysounds · 1h ago
> most LLMs have an agentic web-search component that will actively generate links

I guess that’s the problem - search being only a component.

Is the possible search traffic worth having your content become part of an LLM’s training set and possibly used elsewhere?

I guess the answer depends on the content and the website’s business model.

ashwinsundar · 2h ago

    But how many of you wouldn’t hook up your website to Google?
Me. https://ashwinsundar.com/robots.txt

Your computer doesn't have the right to scrape what I say or do anything with it.
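(For context, a robots.txt that turns away AI crawlers is just a list of disallow rules. The sketch below is purely illustrative, using the publicly documented crawler tokens for OpenAI, Anthropic, and Common Crawl; it is not a claim about what that particular file contains.)

    # Illustrative robots.txt sketch: turn away common AI/LLM crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /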

    I know one of the primary reasons that I do anything online is to provide an outlet for someone else to see it. If I didn’t want someone else to see it, I’d write it down on my notebook, not on the public web.
Sounds like the same spiel from the anti-privacy advocates who think we should all expose everything we're doing because "you should have nothing to hide".

https://archive.is/WjbcU

This article was written for Wired in 2013 by Moxie Marlinspike, who later went on to develop the Signal protocol.

I don't want my thoughts or ideas spread across the web promiscuously. The things I say publicly are curated and full of context. That's why I have my own website, and don't post elsewhere.

I'm not playing the same game you are, which appears to be to post liberally and have loose thoughts to maximize "reach".

Mars008 · 33m ago
Instead of fighting, you can serve advertising text to LLM bots, and sell that. This 'knowledge' will be embedded into the next models.
politelemon · 2h ago
I never managed to get far on this post due to the obnoxious pop-ups. Perhaps blocking humans from reading your posts is ok.
iamwil · 52m ago
I agree with the sentiment. I remember Gwern in an interview remarking something to the effect that if you make your writing and thoughts invisible to LLMs, then your thoughts are going to be invisible to the future, as LLMs are here to stay.
Disposal8433 · 30m ago
LLMs are not a replacement for the HTTP protocol, and people who want to see my site know the address.
ramoz · 1h ago
It doesn't if the agent sits alongside users on their desktops.

LLMs are being blocked by standard bot detection - and the use cases are very much the same. People want smarter bots for the same shitty use cases.

pryelluw · 2h ago
“Providing high quality content that LLMs will actually cite is the new game in town.”

That is not my job, nor is it my goal. These companies are taking my work, repurposing it, and selling it, on the assumption that because they can access it, they can sell it.

Maybe the OP should leave their front door open so people can come in and use their couch. The new game in town is letting other people use your couch.

The mental gymnastics in this post qualify for the Special Olympics.

righthand · 2h ago
It’s not dumb, because Googlebot follows the robots.txt rules. That is the sincere crux of it all. No one is going to casually open their site up to LLMs that blatantly scrape it and then use that information to displace them.

Not blocking aggressive, bad-actor scrapers is dumb. Letting bad-actor scrapers through because a bunch of rich people want to make that the norm is dumb.

LLMs are not directing traffic to the sites, and that traffic is the tradeoff site owners accept with Googlebot. Even if Perplexity or Claude provides a source, the LLM user is most likely not asking for or clicking it 99% of the time.
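(A quick way to see that asymmetry for any given site is to check its robots.txt against a few crawler tokens. A minimal sketch using Python's standard-library robotparser; the URL and agent tokens are just examples, not a statement about any particular site's policy:)

    # Check which crawler user agents a site's robots.txt allows.
    # The example URL and agent tokens are illustrative only.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    for agent in ("Googlebot", "GPTBot", "ClaudeBot", "CCBot"):
        allowed = rp.can_fetch(agent, "https://example.com/")
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")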

skwee357 · 1h ago
I'm somewhat torn on this one.

As an amateur blogger, I would not like LLMs to "steal" my content and show users just the pieces they are looking for, while leaving me with zero visitors. The reason I write is to convey a particular message, and that meaning gets lost, or worse, communicated wrongly, when it passes through an LLM.

As an online business owner, I do see both ChatGPT and Perplexity as referrers to my business: potential customers ask an LLM a question or for a service recommendation, and the LLM directs them to my service. I would not like to lose this channel of organic customer acquisition.

---

On a completely different note, Medium should die as a platform, together with Substack. The number of intrusive popups, "install our app" bars, and paywalls is just insane. Bloggers, especially technically savvy ones, should be able to host their own blogs.

Wilder7977 · 59m ago
Once again I see someone mistaking an LLM regurgitating your content (correctly, wrongly, misleadingly, who knows) for people "accessing" your content. If the LLM sits between me and my reader and acts as a filter (because the information is rehashed, and maybe sometimes doesn't reference me), then is my goal basically to provide information to tools so that other companies can make money? I don't write for money, but if you also remove the basic reader-writer human interaction, I might as well really just write in my notes.
acrispino · 54m ago
Why was the link title changed?
altairprime · 24m ago
Good.
andreagrandi · 1h ago
Another one not having a clue about what “consent” means. Next?
stego-tech · 45m ago
Good, because this “next generation search” doesn’t cite sources, invents falsehoods, steals content, and doesn’t direct traffic to the site in question, which was the whole point of search engines in the first place.

The fact that LLM companies keep getting dinged for ignoring every barrier we throw up to stop their scraping, short of something like Anubis, shows what their real goal is: theft, monopolization, and reality authoring.

cactusplant7374 · 37m ago
If your website is included in next-gen search and the user asks for a source, then your website will be cited as the source.
roywiggins · 36m ago
Nobody clicks through, to a first approximation.

https://www.pewresearch.org/short-reads/2025/07/22/google-us...

BoorishBears · 37m ago
It cites sources, and while it generates less traffic, that traffic converts significantly better.
blindriver · 50m ago
This article is stupid.

You write content so that you get paid, usually through ads and clicks. If people aren't seeing your content because an LLM has consumed it, is regurgitating it, and is taking your ad clicks, then there's no benefit for you, only for the LLM. You're doing the work of Sam Altman, helping him attain his multibillionaire status, and you get nothing in return.

jkingsman · 2h ago
> how many of you wouldn’t hook up your website to Google?

If there were a paid-only search engine with dubious ethics practices that was overwhelming my site with traffic in order to resell search trained off of (among other things) my personally generated content, I would absolutely block it.

LLMs are not search engines, and I'm not gaining any followers or customers in any meaningful way because an LLM indexes my site.

> it also cuts you off from the fastest-growing distribution channel on the web.

I haven't seen the needle tip at all in my acquisition channels from LLMs. Unless you're a household name or very large, LLMs aren't going to shill for your business.

> most LLMs have an agentic web-search component that will actively generate links

Totally. Which is why I don't care if the LLMs index it. Let web content search be good, and lead LLMs to good content; product placement in LLM weights ain't what I'm gonna optimize for, or even permit, if it comes at a cost to me and my infra.

caseyohara · 1h ago
> LLMs are not search engines, and I'm not gaining any followers or customers in any meaningful way because an LLM indexes my site.

Counterpoint: my wife owns an accounting firm and publishes a lot of highly valuable informational content on the firm's blog. Stuff like sales tax policies and rates in certain states, accounting/payroll best practices articles, etc. I guess you could call it "content marketing".

Lately they have been getting highly qualified leads coming from LLMs that cite her website's content when answering questions like "What is the sales tax nexus policy in California?". Users presumably follow the citation, then engage with the website, and eventually become very warm leads.

So LLMs are obviously not search engines in the conventional sense, but it doesn't mean they are not useful at generating valuable traffic to your marketing website.

vb-8448 · 2h ago
> LLMs are not search engines, and I'm not gaining any followers or customers in any meaningful way because an LLM indexes my site.

^^^^

This

For the moment, and for the foreseeable future, you are just giving your content away for free (and paying the hosting bill).

jkingsman · 1h ago
And the freeness cuts both ways — if I could, I'd happily open my content to Mistral and all the other totally-free/open-source-releasing LLM companies' scrapers. But I can't; it all goes into big corpora or gets scraped directly by the commercial actors with the funds to scrape the whole kit and caboodle.
kolinko · 46m ago
> LLMs are not search engines, and I'm not gaining any followers or customers in any meaningful way because an LLM indexes my site.

Friends of mine run a service company, and they already see a significant number of customers reach out because they found them using ChatGPT (et al), not Google. By significant I mean ~20% or so.

Also, for e-commerce, Deep Research from OpenAI is way better at product recommendations than Google. That's my go-to place to find most stuff nowadays (e.g. I purchased dancing shoes, pants, air cleaners, an air conditioner, supplements, and a ton of other things using DR's recommendations - no search engine comes even close).

sackfield · 56m ago
FWIW, most of the inbound traffic to the startup websites I know well enough to ask about is coming from ChatGPT.
lambdadelirium · 2h ago
Stupid bait post
watwut · 1h ago
The whole LLM business is about training on content other people created, redirecting the traffic that would have gone to you, and ultimately earning money on it. The whole reason LLMs are being pushed everywhere is to get free training data, too.
calyth2018 · 2h ago
Even if we take the argument at face value: we're supposed to let LLM companies train their models for free, on the backs of real people's work, just so there's a chance they actually improve enough to replace humans, all for a temporary boost in the search discovery of our content.

Not to mention LLMs still spew a lot of badly wrong results (no, I will not anthropomorphize the models; they're not ready yet).

This is one heck of a poisoned chalice. Mr. Wang, are you willing to drink this cup of crane wine?

skywhopper · 1h ago
This article is based on the false assumption that use of a site by an LLM directs any user traffic to it whatsoever.

Why would anyone choose to anonymously and freely provide content to LLMs? Actually, the only use case for that is deliberately seeding misinformation, which is likely already happening and will soon make up the majority of the content accessible to LLMs, regardless of what blocking measures legitimate content providers choose to use.

frozenseven · 2h ago
Why was this flagged? A difference in opinion is no excuse for censorship.
Disposal8433 · 22m ago
His disdain for content creators (in the very first paragraph) is not an opinion. I'm showing my disdain with a flag.
debugnik · 1h ago
When I asked dang why very low quality submissions are allowed despite resulting in low quality discussion, he told me, and I quote:

> You can flag submissions that you think don't belong on Hacker News.

Well, this is a very low quality submission in my eyes. A tiny read with an unsubstantiated, purely contrarian take that completely misses the point of the debate. Just to be clear, I think anyone is free to post anything on their blog; that's what blogs are for. But I don't think posts like these contribute to HN having a good atmosphere for discussion; if I were to write something like this, I'd be OK with it being unsuitable for HN.

BTW I hadn't flagged this before reading your comment. I've done so after reading the submission though.