Copying my comment from a previous discussion of ignoring robots.txt, below. I actually don’t care if someone ignores my robots.txt, as long as their crawler is well run. But the smug attitude is annoying when so many crawlers are not.
————
We have a faceted search that creates billions of unique URLs by combinations of the facets. As such, we block all crawlers from it in robots.txt, which saves us AND them from a bunch of pointless indexing load.
But a stealth bot has been crawling all these URLs for weeks. Thus wasting a shitload of our resources AND a shitload of their resources too.
Whoever it is, they thought they were being so clever by ignoring our robots.txt. Instead they have been wasting money for weeks. Our block was there for a reason.
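(If it helps to picture it, the block itself is nothing exotic, just disallow rules over the search paths. The paths below are made up, but it's roughly this shape.)

    # Faceted search produces an effectively unbounded URL space,
    # so we ask every crawler to stay out of it. (made-up paths)
    User-agent: *
    Disallow: /search
    Disallow: /products/filter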
nonethewiser · 21m ago
Wasting your money too right?
I guess another angle on this is putting trust in people to comply with ROBOTS.txt. There is no guarantee so we should probably design with the assumption that our sites will be crawled however people want.
Also, I'm curious about your use case.
>We have a faceted search that creates billions of unique URLs by combinations of the facets.
Are we talking about a search that has filters like (to use ecommerce as an example) brand, price range, color, etc., and then all these combinations make up a URL (hence billions)? How does a crawler discover these? Are they just designed to detect all these filters and try all combinations? That doesn't really jibe with my understanding of crawlers, but otherwise IDK how it would be generating billions of unique URLs. I guess maybe they could also be included in sitemaps, but I doubt that.
paulddraper · 26m ago
Yes, this has been the traditional reason for robots.txt -- protects the bot as much as it does the site.
bonaldi · 1h ago
Not sure the emotive language is warranted. Message appears to be “if you use robots.txt AND archive sites honor it AND you are dumb enough to delete your data without a backup THEN you won’t have a way to recover and you’ll be sorry”.
It also presumes that dealing with automated traffic is a solved problem, which with the volumes of LLM scraping going on, is simply not true for more hobbyist setups.
QuercusMax · 4m ago
I just plain don't understand what they mean by "suicide note" in this case, and it doesn't seem to be explained in the text.
A better analogy would be "Robots.txt is a note saying your backdoor might be unlocked".
bigbuppo · 34m ago
Or major web properties for that matter.
paulddraper · 32m ago
> volumes of LLM scraping
FWIW I have not seen a reputable report quantifying how much web scraping has actually grown in the past 3 years.
(Wikipedia being a notable exception...but I would guess Wikipedia to see a far larger increase than anything else.)
esseph · 6m ago
It's hard to quantify because of attribution, but it absolutely is happening at very high volume. I actually got an alert from our monitoring tools this morning when I woke up that some of our external sites were being scraped. Happens multiple times a day.
A lot of it is coming through compromised residential endpoint botnets.
tracerbulletx · 1h ago
This is a screed that does not address a single point of the actual philosophical issue.
The issue is a debate over what the expectations are for content posted on the public internet. One viewpoint is that it should be totally machine-operable and programmatic, that if you want it to be private you should gate it behind authentication, and that the semantic web is an important concept whose violation is a breach of protocol. The other argument is that it's your content, no one has a right to it, and you should be able to license its use any way you want. There is a trade-off between the implications of the two.
rafram · 1h ago
I think this is kind of misguided - it ignores the main reason sites use robots.txt, which is to exclude irrelevant/old/non-human-readable pages that nevertheless need to remain online from being indexed by search engines - but it's an interesting look at Archive Team's rationale.
xp84 · 57m ago
Yes, and I'd add to that dynamically generated URLs of infinite variability, which give automated traffic two separate but equally important reasons to stay away:
1. You (bot) are wasting your bandwidth, CPU, storage on a literally unbounded set of pages
2. This may or may not cause resource problems for the owner of the site (e.g. Suppose they use Algolia to power search and you search for 10,000,000 different search terms... and Algolia charges them by volume of searches.)
The author of this angry rant seems specifically ticked off at some perceived 'bad actor' who is using robots.txt as an attempt to "block people from getting at stuff", but it's super misguided in that it ignores an entire purpose of robots.txt that is not even necessarily adversarial to the "robot."
This whole thing could have been a single sentence: "Robots.txt has a few competing vague interpretations and is voluntary; not all bots obey it, so if you're fully relying on it to prevent a site from being archived, that won't work."
paulddraper · 30m ago
Correct.
That has been one of the biggest uses -- improve SEO by preventing web crawlers from getting lost/confused in a maze of irrelevant content.
hosh · 1h ago
I absolutely will use a robots.txt on my personal sites, which will include a tarpit.
This has nothing to do with keeping my webserver from crashing, and has more to do with crawlers using content to train AI.
Anything I actually want to keep as a legacy, I’ll store with permanent.org
btilly · 9m ago
Whatever we think of archive.org's position, modern AI companies have clearly taken the same basic position. And are willing to devote a lot more resources to vacuuming up the internet than crawlers did back in 2011.
See https://news.ycombinator.com/item?id=43476337 for a random example of a discussion about this.
My personal position is that robots.txt is useless when faced with companies who have no sense of shame about abusing the resources of others. And if it is useless, there isn't much of a point in having it. Just make sure that nothing public facing is going to be too expensive for your server. But that's like saying that the solution to thieves is to not carry money around. Yes, it is a reasonable precaution. But it doesn't leave me feeling any better about the thieves.
knome · 1h ago
given this is from a group determined to copy and archive your data with or without your permission, their opinions on the usefulness of ROBOTS.TXT seem kind of irrelevant. of course they aren't going to respect it. they see themselves as 'rogue digital archivists', and being edgy and legally rather grey is part of their self-image. they're going to back it up, regardless of who says they can't.
for the rest of the net, ROBOTS.TXT is still often used for limiting the blast radius of search engines and bot crawl-delays and other "we know you're going to download this, please respect these provisions" type situations, as a sort of gentlemen's agreement. the site operator won't blackhole your net-ranges if you abide their terms. that's a reasonably useful thing to have.
SCdF · 1h ago
This wiki page was created in 2011, in case you're wondering how long they've held this position
procaryote · 1h ago
Not having things archived because you explicitly opted out of crawling is a feature, not a bug
Otherwise you can whitelist a specific crawler in robots.txt
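A sketch of that, assuming the archive crawler identifies itself as ia_archiver (an empty Disallow means everything is allowed for that agent):

    # Let one named crawler in, ask everyone else to stay out.
    User-agent: ia_archiver
    Disallow:

    User-agent: *
    Disallow: /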
rzzzt · 48m ago
(I understand it is a different entity) archive.org at one point started to honor the robots.txt settings of a website's current owner, hiding archived copies you could previously browse. I don't know whether they still do this.
jawns · 1h ago
Is a person not allowed to put up a "no trespassing" sign on their land unless they have a reason that makes sense to would-be trespassers?
I know that ignoring a robots.txt file doesn't carry the same legal consequences as trespassing on physical land, but it's still going against the expressed wishes of the site owner.
Sure, you can argue that the site owner should restrict access using other gates, just as you might argue a land owner should put up a fence.
But isn't this a weird version of Chesterton's Fence, where a person decides that they will trespass beyond the fenced area because they can see no reason why the area should be fenced?
rglover · 56m ago
I see old stuff like this and it starts to become clear why the web is in tatters today. It may not be respected, but unless you have a really silly config (I'm hard-pressed to even guess what you could do short of a weird redirect loop), it won't be doing any harm.
> What this situation does, in fact, is cause many more problems than it solves - catastrophic failures on a website are ensured total destruction with the addition of ROBOTS.TXT.
Of course an archival pedant [1] will tell you it's a bad idea (because it makes their archival process less effective)—but this is one of those "maybe you should think for yourself and not just implement what some rando says on the internet" moments.
If you're using version control, running backups, and not treating your production env like a home computer (i.e., you're aware of the ephemeral nature of a disk on a VPS), you're fine.
[1] Archivists are great (and should be supported), but when you turn it into a crusade, you get foolish, generalized takes like this wiki.
bigstrat2003 · 27m ago
I really lost a lot of respect for the team when I read this page. No matter how good their intentions are, by deliberately ignoring robots.txt they are behaving just as badly as the various AI companies (and other similar entities) that scrape data against the wishes of the site owner. They are, in other words, directly contributing to the destruction of the commons by abusing trust and ensuring that everyone has to treat each other as a potential bad actor. Dick move, Archive Team.
akk0 · 19m ago
Mind you're reading a 14 year old page. I honestly don't see any value in this being posted on HN.
ROBOTS.TXT is a suicide note - https://news.ycombinator.com/item?id=13376870 - Jan 2017 (30 comments)
Robots.txt is a suicide note - https://news.ycombinator.com/item?id=2531219 - May 2011 (91 comments)
robots.txt is the digital equivalent of "one piece per person" on an unwatched Halloween bowl.
The people who wouldn't take more don't need the sign; the people who want to take more do it anyway.
If you don't want crawling, there are other ways to prevent / slow down crawling than asking nicely.
blipvert · 40m ago
Alternatively, it’s the equivalent of having a sign saying “Caution, Tarpit” and having a tarpit.
You’re welcome to ride if you obey the rules of carriage.
Don’t make me tap the sign.
kazinator · 1h ago
If you don't obey someone's robots.txt, your bot will end up in their honeypot: be prepared for zip bombs, generated infinite recursions, and whatnot. You'd better have good counter-countermeasures.
robots.txt is helping you identify which parts of the website the author believes are of interest for search indexing or AI training or whatever.
fetching robots.txt and behaving in a conforming manner can open doors for you. If I spot a bot like that in my logs, I might whitelist them, and feed them a different robots.txt.
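being conforming is cheap, too. A rough sketch in Python using the stdlib parser (user agent and URLs are made up):

    import urllib.robotparser

    UA = "ExampleBot/1.0"  # hypothetical user agent

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the file

    # check each URL before requesting it, and honor any crawl-delay
    if rp.can_fetch(UA, "https://example.com/search?facet=red"):
        delay = rp.crawl_delay(UA) or 1.0  # polite default if none is given
        print(f"allowed, waiting {delay}s between requests")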
paulddraper · 28m ago
tbf most bots do that nowadays.
xg15 · 1h ago
Set up a tarpit, put it in the robots.txt as a global exclusion, watch hilarity ensue for all the crawlers that ignore the exclusion.
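A minimal sketch of what that exclusion might look like (the path is made up):

    User-agent: *
    # compliant crawlers never see this; everyone else wanders in
    Disallow: /tarpit/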
gmuslera · 33m ago
robots.txt assumes well-meaning players that respect the site's intentions: major crawlers that try to index or mirror sites without overwhelming them, and that access only what is supposed to be freely accessible. Using a visible user agent, crawling from a clearly defined IP block, and scanning in a predictable way all go in the same direction of cooperating with the site owner, getting visibility without affecting functionality (much).
But that doesn't mean there aren't bad players that ignore robots.txt, send random user-agent strings, or connect from IPs all over the world to avoid being blocked.
LLMs have changed the landscape a bit, mostly because far more players want to grab everything, or have automated tools that fetch your information for specific requests. But that doesn't rule out that well-behaved players still exist.
spaceport · 43m ago
Renaming my robots.txt to reeeebots.txt and writing a line-by-line justification of why XYZ shouldn't be archived is now on my todo list. Along with adding a tarpit.
rolph · 1h ago
the archiveteam statements in the article are sure to win special attention, i think this could be footgunning, and .IF archiveteam .THEN script.exe pleasantries.
giancarlostoro · 1h ago
It's mostly for search engines to figure out how to crawl your website. Use it sparingly.
Sanzig · 1h ago
Ugh. Yeah, this misses the point: not everyone wants their content archived. Of course, there are no feasible technical means to prevent this from happening, so robots.txt is a friendly way of saying "hey, don't save this stuff." Just because there's no technical reason you can't archive doesn't mean that you shouldn't respect someone's wishes.
It's a bit like going to a clothing optional beach with a big camera and taking a bunch of photos. Is what you're doing legal? In most countries, yes. Are you an asshole for doing it? Also yes.
layer8 · 1h ago
(2011)
rafram · 1h ago
Thanks, added to title.
soiltype · 1h ago
I have more complaints about this shitty article than it is worth. At least it's clearly a human screed, not LLM generated.
Just say you won't honor it and move on.
_Algernon_ · 1h ago
I mean the main reason is that robots.txt is pointless these days.
When it was introduced, the web was largely a collaborative project within the academic realm. An honor system worked for the most part.
These days the web is adversarial through and through. A robots.txt file seems like an anachronistic, almost quaint museum piece, reminding us of what once was, while we plunge headfirst into tech feudalism.
RajT88 · 1h ago
In fact the problem of the "never ending September" has evolved into "the never ending barrage of Septemberbots and AI vacuum bots".
The horrors of the 1990s internet are quaint by comparison to the society-level problems we now have.
rolph · 1h ago
it's not a request anymore, it's often a warning not to go any farther, lest ye be zipbombed or tarpitted into wasting bandwidth and time.