It's so weird how fragile digital history is. When things first became digital I remember sentiments of "things can now be maintained perfectly forever" but today it feels like in 30 years we'll have a better record of 1820 than 2020.
duxup · 3h ago
Everyone wants to close down their corner of the internet because they think AI is going to make them a ton of money. We're getting the first part but I'm not sure we're seeing the latter ... anywhere as far as platforms go.
necovek · 3h ago
Well, Reddit is getting a ton of money out of licensing deals for using their data to train AI.
Whether you classify that as "AI-related" or not, I don't know.
phil21 · 2h ago
It's funny/interesting/terrifying to me that developers went from the near-religious mantra of "Garbage In, Garbage Out" when I was learning computers - to now training our supposedly super intelligent AIs off of reddit posts or even worse.
Basically laundering outright wrong information into something the next generation is now going to believe as scientific truths.
I often wonder how many people/organizations are seeding places like Reddit with malinformation/beliefs for it to become canonical truth in the AI age once it's too late to tell the difference for most people?
Lord knows I've made trolling-level posts that are only marginally accurate back in the day that are now part of the AI corpus of knowledge. Mix those in with some of my well-researched stuff and you couldn't even really filter it based on "this account is a shitposter" to weight it lower. Nevermind plenty of earnest posts made that were outright wrong simply due to... being wrong in the moment and later learning better.
anon7000 · 20m ago
Reddit sucks, but it’s also one of the biggest goldmines of human-curated information out there. Alternatives include blogspam, which is worse than useless these days, and forums with limited scope. Figuring out how to sift through dirt to find the nuggets of gold is important for any AI, whether they train on Reddit or not.
carlhjerpe · 49m ago
I think filtering on upvotes/downvotes, comments/views, sub, user and whatever metrics they have on the content can help AI companies train on somewhat reasonable things. Blend it with Wikipedia, scientific papers, reliable newspapers and you're golden?
Metadata is what makes gold out of poo, I assume model developers can "train negatively" too if metadata suggests they should.
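A minimal sketch of what that kind of metadata weighting could look like. Everything here is an assumption for illustration: the `Post` fields, the `min_views` threshold, and the damping formula are hypothetical and not anything Reddit or a model developer is known to use. A negative weight stands in for the "train negatively" idea.

```python
from dataclasses import dataclass
import math


@dataclass
class Post:
    # Hypothetical metadata fields; real exports may look different.
    text: str
    upvotes: int
    downvotes: int
    views: int


def sample_weight(post: Post, min_views: int = 50) -> float:
    """Return a training weight in (-1, 1); negative means 'train against'."""
    if post.views < min_views:
        return 0.0  # too little exposure to trust the votes either way
    total = post.upvotes + post.downvotes
    if total == 0:
        return 0.0  # nobody voted, so there is no signal
    # Net approval in [-1, 1].
    approval = (post.upvotes - post.downvotes) / total
    # Damp the score when only a handful of votes were cast.
    confidence = 1.0 - 1.0 / math.sqrt(1 + total)
    return approval * confidence
```

Vote counts can be gamed, so a score like this would only be one input next to sub, user history and the other signals mentioned above.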
phantompeace · 9m ago
So now the people who can buy upvotes get to write history?
BeFlatXIII · 2h ago
Every company that pays is a chump. They ought to get better at scraping and hacking IoT devices as residential proxies.
jajuuka · 1h ago
It's not entirely self-enriching. AI crawlers hit servers hard and everyone has their own crawler. So it's partially covering a business expense. Especially with Reddit being a goldmine of content for training data.
Internet Archive has been terrible at capturing full pages on Reddit for a while. So it's not a real loss. Unfortunately right now these AI companies have full freedom to do whatever they want: taking paid content, artistic works, and your own posts on social media. So Reddit trying to charge them is a good idea as it's some form of quid pro quo put on AI scraping companies.
JohnFen · 1h ago
> So Reddit trying to charge them is a good idea as it's some form of quid pro quo put on AI scraping companies.
Except that what Reddit is really doing is selling content they didn't produce and don't own. I don't think they're walking some kind of high road here like they would be if they were actually fighting against the scraping.
jajuuka · 50m ago
I didn't mean to imply they were. As I said, it's not ENTIRELY for self-enriching reasons. As in self-enrichment is a part of the reasons for this effort to combat AI scrapers.
That being said I can still take some satisfaction in seeing AI scrapers get jammed up considering how they face zero consequences right now.
See: https://www.reddit.com/r/internetarchive/comments/1gpn54q/is...
> They are not specifically targeting Wayback Machine. Anything other than residential IPs is blocked, to my knowledge, such as the IPs of cloud services like Hetzner, GCP, AWS... The list goes on. (from my comment there)
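For what it's worth, blocking "anything other than residential" usually comes down to a range lookup. A rough sketch, assuming a hand-maintained list of datacenter CIDR blocks: the ranges below are RFC 5737 documentation placeholders, not real Hetzner/GCP/AWS ranges, and `is_datacenter_ip` is a made-up helper name. How Reddit actually does it is not public as far as I know.

```python
import ipaddress

# Placeholder ranges taken from the RFC 5737 documentation blocks.
# A real blocklist would have to be assembled from provider-published
# or third-party range data, which is an assumption outside this sketch.
DATACENTER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_datacenter_ip(addr: str) -> bool:
    """Return True if addr falls inside any listed datacenter range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in DATACENTER_RANGES)
```

A check like this cannot tell an archiver from a scraper, which fits the point that the Wayback Machine is not being singled out.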
What they're really afraid of is that people will read content using LLM inference and make all the ads and nags and "download the app for a crap experience" go away -- and never click on ads accidentally for an occasional ka-ching.
Yeah, the front end for de-enshittification looks a lot like that other archive site, https://archive.today/
In the summer of 2020 I was driving to Buffalo a lot with my son and getting cheap hotel deals thanks to the pandemic and thinking about missile defense systems and I was sick and tired of the awful shape of the web and dreaming up a system that would "archive" 100% of web pages before I read them. I spent two weeks on a spike prototype and concluded that an "archiver" can never really know if a modern web page is done loading, so at best it uses heuristics to make the page load completely and waits a long time -- which makes following a link even slower than waiting for all the ads and trackers to load. I finally got Fiber-to-the-Node at home, so downloading all the trash of the annoyances economy became more tolerable. A lot of the ideas I had at the time made it into my RSS reader a few years later.
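To make the "done loading" problem concrete, here is a small sketch of a network-idle heuristic of the sort described. It is not the prototype from the comment; `settle_time`, `idle_window` and `max_wait` are hypothetical names, and the input is assumed to be a log of (start, end) request times in seconds since navigation.

```python
from typing import Iterable, Tuple


def settle_time(requests: Iterable[Tuple[float, float]],
                idle_window: float = 2.0,
                max_wait: float = 30.0) -> float:
    """Guess when a page is 'done': the first point where no request has
    been in flight for idle_window seconds, capped at max_wait."""
    quiet_since = 0.0  # latest time at which all seen requests had finished
    for start, end in sorted(requests):
        if start - quiet_since >= idle_window:
            break  # the network was quiet long enough before this request
        quiet_since = max(quiet_since, end)
    return min(quiet_since + idle_window, max_wait)


# A late request at 6.0s (say, an ad refresh) arrives after the page was
# already declared settled, so the archiver would miss it.
log = [(0.0, 0.5), (0.25, 1.0), (1.5, 1.75), (6.0, 6.25)]
print(settle_time(log))
```

Raising `idle_window` catches more late requests, at the price of waiting that much longer on every single page, which is the trade-off the comment runs into.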
Nathan2055 · 1m ago
> What they're really afraid of is that people will read content using LLM inference and make all the ads and nags and "download the app for a crap experience" go away -- and never click on ads accidentally for an occasional ka-ching.
See, I don't think this is right either. Back during the original API protests, several people (including me!) pointed out that if the concern was really that third-party apps weren't contributing back to Reddit (which was a fair point: Apollo never showed ads of any kind, neither Reddit's nor their own) then a good solution would be to make using third-party apps require paying for Reddit Premium. Then they wouldn't have to audit all of the apps to ensure they were displaying ads correctly and would be able to collect revenue outside of the inherent limitations of advertising.
Theoretically, this should have been a straight win for Reddit, especially given the incredibly low income that they've apparently been getting from ads anyway (I can't find the report now so the numbers might not be exact, but I remember it being reported that Reddit was pulling in something like ~$0.60 per user per month versus Twitter's slightly better ~$8 per user per month and Meta's frankly mindblowing ~$50 per user per month) but it was immediately dismissed out of hand in favor of their way more complicated proposal that app developers audit their own usage and then pay Reddit back.
My initial thoughts were either that the Reddit API was so broken that they couldn't figure out how to properly implement the rate limits or payment gating needed for the other strategy (even now the API still doesn't have proper rate limits, they just commence legal action against anyone they find abusing it rather than figure out how to lock them out; the best they can really do is the sort of basic IP bans they're using here), or the Reddit higher-ups were so frustrated that Apollo had worked out a profitable business model before them that they just wanted to deploy a strategy targeted specifically at punishing them.
But it quickly became clear later that Reddit genuinely wasn't even thinking about third-party apps. They saw dollar signs from the AI boom, and realized that Reddit was one of the largest and most accessible corpuses of generally-high-quality text on a wide variety of topics, and AI companies were going to need that. Google showing an intense dependency on Reddit during the blackout didn't hurt either (yes, at this point I genuinely believe the blackout actually hurt more than it helped by giving Reddit further leverage to use on Google, hence why they were one of the first to sign a crawler deal afterwards).
So they decided to use any method they could think of to lock down access to the platform while keeping enough people around that the Reddit platform was still mostly decent enough to be usable for AI training and pivoted much of their business to selling data. All of this while claiming, as they're still doing today with the Internet Archive move, that this is somehow a "privacy measure" meant to ensure deleted comments aren't being archived anywhere.
The same thing basically happened with Stack Exchange, except they had much less leverage over their community because the entire site was previously CC licensed and they didn't have any real authority to override that beyond making data access really annoying.
The good news is that it really does seem like "ingest everything" big model AI is the least likely to survive at this point. Between ChatGPT scaling things down massively to save on costs with the GPT-5 update and the Chinese models somehow making do with less data and slower chips by just using better engineering techniques, I highly doubt these economics around AI are going to last. The bad news is that, between stuff like this and the GitHub restructuring today, I don't think Big Tech has any plans on how they're going to continue functioning in an economy that isn't entirely based on AI hype. And that's really concerning.
freedomben · 2h ago
I had (and still have to some extent) the same dream, though I'm ok with the archiving happening after-the-fact. ArchiveBox has worked reasonably well for me
ChrisArchitect · 2h ago
What is the source? Where did Reddit say this? No blog post or release anywhere
mikestew · 1h ago
Well, there is TFA that quotes a Reddit spokesperson. What do you want, stone tablets?