These companies are sitting on a never-ending stream of human-created data. What do you think happens to your conversations and other interactions with AI? Quality might be a bit sus, though.
bionhoward · 3h ago
You can deactivate ClosedAI model training in Settings > Data Controls > Improve the model for everyone
In Gemini you can turn off Gemini Apps Activity (warning: deletes your chat log, you need to copy paste everything into notes)
Highly recommended.
AstroBen · 3h ago
I'd imagine it's really low quality data. Most or all of my conversations with an LLM are questions or telling it to do something, with varying levels of specificity
I'm not sure what they'd get from training on that
ted537 · 3h ago
I don't think it would be too hard to scrape useful data out of my LLM convos.
If human response is "That's BS", "fuck off", or something similar, mark as bad assistant message.
If human response is "huh" or "cool", mark as good assistant message.
If on ChatGPT, watch how much scrolling the user does. If there's a lot, it's somewhat likely that the LLM output something useful.
That strategy would have holes, of course, but as long as it's better than guessing, something like that would be a useful heuristic.
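A rough sketch of that kind of labeler (the marker phrases and scroll threshold are made up purely for illustration):

    def label_assistant_turn(next_user_message: str, scroll_px: int = 0) -> str:
        # Toy heuristic: turn the user's next message and scrolling behaviour
        # into a weak quality label for the previous assistant turn.
        msg = next_user_message.lower()
        bad_markers = ("that's bs", "fuck off", "wrong", "doesn't work")
        good_markers = ("huh", "cool", "thanks", "nice")
        if any(m in msg for m in bad_markers):
            return "bad"
        if any(msg.startswith(m) for m in good_markers):
            return "good"
        # heavy scrolling suggests the user actually read a long answer
        if scroll_px > 2000:
            return "probably_good"
        return "unknown"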
londons_explore · 7m ago
This.
Even very weak human signals can be immensely valuable over large enough datasets.
insin · 3h ago
I sometimes wonder if they're vulnerable to a coordinated effort of deliberately upvoting shit assistant turns and praising them in the next user turn - how much does that actually contribute to future training, if at all?
Last week, while porting some vanilla code, I had a very basic React question about useState that every model of every stripe I've tried has been confidently and completely wrong about, up to stating that the code absolutely will not work, even when I take a turn to assert that I ran it and it does. So there's plenty of shit in there already.
phillipcarter · 3h ago
Most of the human-created data is also very low quality. And it's limited in other ways: a lot of so-called high-quality data online is typically just the finished answer to a question, with no serialization of the thought process that led to that answer.
jacobgkau · 3h ago
I think he was referring not to finished content, but to the prompts humans put in when using chatbots. The prompts show some of the thought process, but they don't really show the answer (as that's output by the chatbot, not the human prompting it).
PessimalDecimal · 3h ago
How will they tell if data is human-created or not?
adamgordonbell · 2h ago
> Will future artificial intelligence systems perform increasingly poorly due to AI-generated material in their training data?
No. Synthetic data is being used to improve LLMs
wrsh07 · 2h ago
This whole line of thought is sort of funny. Yes, you can train a model on synthetic data in such a way that it experiences model collapse.
That doesn't mean there aren't ways to train a model on synthetic data without seeing model collapse.
RainyDayTmrw · 1h ago
How does that work? It defies intuition. It distills existing data. How is that better than the initial data?
kolinko · 43m ago
Not when it comes to math/programming/reasoning. You can generate an infinite stream of new problem-and-solution examples that are based on existing knowledge, of course, but they build on top of it rather than just distilling it.
A simple example would be chess AI. The core knowledge is the rules of the game. We have human-generated examples of play, but we don't really need them - we can (and we did) synthesize data to train the AI.
A similar pattern can be used for all math/physics/programming/reasoning.
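A toy version of that loop in Python - purely illustrative, but it shows how the ground truth comes from the rules rather than from scraped human text:

    import random

    def synth_arithmetic_pair():
        # Generate a (question, answer) pair from the rules of arithmetic alone.
        a, b = random.randint(0, 999), random.randint(0, 999)
        op = random.choice(["+", "-", "*"])
        answer = {"+": a + b, "-": a - b, "*": a * b}[op]
        return f"What is {a} {op} {b}?", str(answer)

    # Unlimited fresh training pairs, each with a verifiable answer.
    pairs = [synth_arithmetic_pair() for _ in range(100_000)]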
Jensson · 34m ago
> A similar pattern can be used for all math/physics/programming/reasoning.
No it can't. The pattern worked for chess because it's an invented problem with simple outcome checks; we can't do the same for natural problems where we don't have easily judged outcomes.
So you can do it for arithmetic and similar domains where you can generate tons of questions and answers, but you can't use this for fuzzier things like physics, chemistry, or choosing math theorems. In the end we don't really know what a good math theorem looks like: it has to be useful, but how do you judge that? Not just any truthy mathematical statement is seen as a theorem; most statements don't lead anywhere.
Once we have a universal automated judge that can judge any kind of human research output then sure your statement is true, and we can train research AI that way. But we don't have that, or science would look very different than it does today. And I'd argue that such a judge would need to be AGI on its own, so it's circular.
meowkit · 14m ago
> Once we have a universal automated judge that can judge any kind of human research output then sure your statement is true,
If you've noticed, most LLM interfaces have a "thumbs up" or "thumbs down" button. The prompt may provide novel data; the generated text is synthetic. You don't need an automated judge - the user is providing sufficient feedback.
The same goes for the other disciplines.
_heimdall · 2h ago
Do we know the results yet?
I know they're training with synthetic data; I didn't realize it has been done at scale for long enough to really know whether it improved things (assuming the metrics it's improving are well defined).
jdietrich · 2h ago
Deepseek V3 and R1 are both substantially trained on synthetic data. The results speak for themselves.
ninetyninenine · 2m ago
I mean, imagine linear least squares on a 2D graph.
I have a best fit line. Then I take random data on that line to train a new line.
I pretty much get the same line.
From an intuitive perspective... it doesn't get worse. At worst it stays the same.
Now imagine something a bit more complex. I have a best fit curve that's very close to a line.
I use random data from that curve to train a new best fit line.
I get something different now. Not necessarily worse.
I mean, literally take all your ideas about ML and imagine them on the 2D plane doing curve fitting. Retraining new lines from generated data doesn't necessarily make things worse.
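A quick numerical sanity check of that intuition, with made-up numbers:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-5, 5, 200)
    y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 200)   # "real" noisy data

    m1, b1 = np.polyfit(x, y, 1)                  # first best-fit line

    x2 = rng.uniform(-5, 5, 200)
    y2 = m1 * x2 + b1                             # synthetic points sampled from that line
    m2, b2 = np.polyfit(x2, y2, 1)                # refit on the synthetic points

    print(m1, b1)   # roughly 2.0 and 1.0
    print(m2, b2)   # essentially the same line - no degradation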
pphysch · 2h ago
Synthetic data ought to be viewed as an extension of the training process rather than as genuinely new phenomena. It can definitely help smooth things out and reinforce wanted behavior, but it's still derivative of the real data.
behnamoh · 4h ago
I've heard that OpenAI and many AI labs put watermarks [0] in their LLM outputs to detect AI-generated content and filter it out.
[0] Like statistics of words, etc.
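For reference, the published "green list" schemes (e.g. Kirchenbauer et al.'s watermark paper) work roughly along those lines. A toy detector sketch - not any lab's actual implementation, with a made-up vocab size and hash choice:

    import hashlib

    def green_fraction(token_ids, vocab_size=50_000):
        # Toy detector: hash each previous token to pseudorandomly split the
        # vocabulary into a "green" and a "red" half. A watermarking sampler
        # biases generation toward green tokens, so watermarked text shows a
        # green fraction well above the ~0.5 expected by chance.
        green = 0
        for prev, cur in zip(token_ids, token_ids[1:]):
            seed = int(hashlib.sha256(str(prev).encode()).hexdigest(), 16)
            if (cur + seed) % vocab_size < vocab_size // 2:
                green += 1
        return green / max(len(token_ids) - 1, 1)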
jsheard · 3h ago
Maybe they do use watermarks, and the vendors which only offer hosted models can just log everything they've ever generated, but there are enough players all working on this stuff independently of each other that filtering out their own noise would only get them so far.
I noticed that a big chunk of the default Llama 4 system prompt is devoted to suppressing various GPT-isms, which to me implies they weren't able to keep their newer training set from being contaminated by competing models.
> You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…” etc. Avoid using these.
Rodeoclash · 2h ago
Yeah, it's known as the em dash!
jbaber · 2h ago
Y'know, for decades I've been writing double dashes and having them converted into em dashes about 50% of the time on whatever platform I'm using. It's bizarre that this is suddenly supposed to be a shibboleth.
AaronAPU · 2h ago
Have you ever considered you might be an LLM?
_heimdall · 2h ago
I could have sworn they all gave up on watermarking 12 or 18 months ago when they realized it wasn't possible to do reliably.
dustingetz · 3h ago
do they also watermark the code?
jimbob45 · 2h ago
Wouldn’t be hard to do. Just alternate tabs and spaces and no one would ever know or care to check.
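A toy version of that whitespace trick, purely illustrative:

    def embed_bits_in_indentation(code_lines, bits):
        # Toy whitespace watermark: re-indent each line with either a tab
        # (bit 1) or four spaces (bit 0), leaving the code itself unchanged.
        # A real scheme would preserve nesting depth; either way, any
        # auto-formatter or consistent editor setting normalizes the
        # whitespace and wipes the watermark out.
        out = []
        for line, bit in zip(code_lines, bits):
            body = line.lstrip(" \t")
            out.append(("\t" if bit else "    ") + body)
        return out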
djeastm · 2h ago
Most coders would have code cleaning tools in their IDEs that would take care of that automatically.
jimbob45 · 1h ago
What about invisible Unicode characters?
umbra07 · 29m ago
Too obvious. Someone would have found that already.
sampullman · 2h ago
Hopefully that's converted to one or the other when saved in an editor, or caught in CI.
IAmGraydon · 3h ago
Interesting. That could certainly come in handy if it’s something they can’t avoid. We, too, might be able to better detect and filter their output.
declan_roberts · 2h ago
The reality is that for the most part, any corpus created after 2022 is going to be seriously polluted.
alganet · 2h ago
I'd say 2007 or so.
There wasn't any known active AI back then, but statistics on popular ideas and internet content were already a thing, and speech pollution based on those assessments had already started to spread fast, written out manually.
Sure, a lot of good content came out since then. But the amount of garbage... it's immense and very difficult to sort out automatically.
The major issue is that this garbage then _became_ the norm. Only people who lived back then can remember what it was like. To new folk, it just looks like a generational shift. However, it is quite obvious that some aspects of this shift were... unnatural (in the sense of not being spontaneous cultural manifestations).
lazystar · 1h ago
And I'm sure someone from the 90's would say the same about '97.
https://en.m.wikipedia.org/wiki/Eternal_September
Pay attention.
I mentioned explicitly that I see what happened as distinct from a natural generational shift.
There are many phenomena around that era to support what I am saying. Like, for example, the first massive political campaign to leverage the internet as its primary vehicle.
anonygler · 2h ago
This reminds me of the Monsanto case, where they sued a farmer (and won) for using patented seeds that the farmer obtained from a local grain elevator which happened to contain some of Monsanto's seeds.
Should it eventually happen for LLM outputs, I hope we name it Slop Wars.
deadbabe · 2h ago
A good way to harvest new training material is to eavesdrop on real human conversations from non-polluted sources (such as microphones listening to people talk in public places, or texts), transcribe them, and feed them to LLMs.
blooddragon · 3h ago
Time for GANs to make a resurgence?
carlosjobim · 2h ago
Shadow libraries
jacobsenscott · 3h ago
Today we have humans being trained on LLM garbage - kids using it to do their homework, programmers using it to "learn" how to code, med students cheating their way through med school, etc. So the content humans are producing and will produce is really just LLM statistical word jumbles - i.e. human-generated content will soon be as useless as LLM-generated content.
nradov · 2h ago
I'm not too worried about med students. You can't really use an LLM to cheat on the boards or make it through residency.
No comments yet
throwup238 · 2h ago
It’d be deeply ironic if the great filter for the human race turned out to be chatbots.
nine_k · 2h ago
Hello, secret sources of untainted but modern knowledge, written by human experts, and closely guarded by these experts.
leoapagano · 2h ago
I can't lie, I miss when the only GPT I had to worry about was the GUID Partition Table.
userbinator · 2h ago
At least the MBR acronym still remains.
(Most of my disks are still MBR as they're not big enough to be worth the hassle of using GPT.)
layer8 · 2h ago
Someone should encode a chat program in it.
stainablesteel · 2h ago
I can't believe this article wasn't written 2 years ago; this is just the basics, man.
mondrian · 3h ago
The "core reasoning" part of AI may be increasingly important to improve, and its "database of factual knowledge" aspects may be less and less important, maybe increasingly a hindrance. So more focused and specialized training may take over toward increasing reasoning precision, and not this never-ending stream of new data.
So maybe we'll get better reasoning and therefore better generated data/content in the wild, without this negative feedback loop everyone is worried about.
meroes · 2h ago
What core reasoning?
SpicyLemonZest · 2h ago
For example, I just asked o3 "how many Is are in antidisestablishmentarianism", and this was its train-of-thought output:
------
The user wants to know how many “I”s are in the word “antidisestablishmentarianism.” They probably mean the letter "I" in lowercase, though the question was a little broader with "Is."
I’ll treat it case-insensitively and count the total occurrences of the letter "i" (both uppercase and lowercase). I can count them manually now, or use a tool for faster analysis.
Analyzed
word = "antidisestablishmentarianism"
word_lower = word.lower()
word_lower.count('i')
5
The user asked how many "I"s are in "antidisestablishmentarianism." I’m taking it to mean whether it’s counting "i" or "I", and I’ll consider them as the same for simplicity. So, I’ll count the 5 occurrences of "i" in the word. I manually broke it down and verified there are indeed five "i"s, so the final count should be 5. Pretty straightforward, now that it's all clear!
Done
There are 5 “i”s in “antidisestablishmentarianism.”
------
You can come up with complicated definitions of "core reasoning" where this doesn't count, and I understand there's great demand for such definitions, but this clearly satisfies common-sense understanding of the term.
Jensson · 41m ago
> You can come up with complicated definitions of "core reasoning" where this doesn't count
Did we read the same response? It did write a lot of reasons, but it didn't do any reasoning at all; it just suddenly wrote "5" here:
> So, I’ll count the 5 occurrences of "i" in the word.
There was no reasoning at all to arrive at 5, so no, your example just proves how good these models are at faking reasoning.
selfhoster · 1h ago
Then I guess Ubuntu has had reasoning for several decades:
sudp
Command 'sudp' not found, did you mean:
command 'sudo' from deb sudo (1.9.9-1ubuntu2.4)
command 'sudo' from deb sudo-ldap (1.9.9-1ubuntu2.4)
command 'sup' from deb sup (20100519-3)
command 'sfdp' from deb graphviz (2.42.2-6)
Try: sudo apt install <deb name>
meroes · 1h ago
I might just be on the opposite side of the aisle, but to me chain-of-thought is better understood as simply more context.
Of course there is ambiguity, though: more context would be hard to distinguish from core reasoning, and vice versa.
I think LLMs/AI mean we can substitute reasoning with vast accumulations and relations between contexts.
Remember, RLHF gives the models some, and perhaps most, of these chains of thought when there isn't sufficient text to scrape for each family of problems. When I see that chain of thought, the first thing I think of is my peers who had to write, rewrite, nudge, and correct these chains of thought, not core reasoning.
The CoT has that same over-explained step-by-step style so many RLHF'ers will be accustomed to, and much of it was authored/originated by them. And given the infinite holes it feels like plugging, I don't call that RL reasoning.
gerdesj · 3h ago
You seem to be arguing that bollocks couched in flowery language is a stretch goal of AI.
Are you sure?
_heimdall · 2h ago
That really depends on expectations.
If AI is meant to sound nearly identical to a human, you don't need more training data.
If it's meant to act as a natural-language encyclopedia, we'll never get there with LLMs, which amount to natural language processing on top of a massively compressed dataset.
gerdesj · 2h ago
I've read the Google page "Getting AI to write good SQL: Text-to-SQL techniques explained" and it looks like all the usual bollocks.
... and sigh ... it looks like I'm twiddling with a noddy!
lacker · 2h ago
Unfortunately, I don't really know if I can trust academics to analyze the development of large language models. No academic team has built an LLM. So... do people working at Stanford or Oxford really have good insight into how LLMs are developed?
If people at OpenAI, Anthropic, or Google said this, that would be interesting. But I don't think it makes sense any more to treat academic computer scientists as relevant experts here.
_heimdall · 2h ago
My understanding is that those building them don't really know how they work. Research into interpretability has fallen way behind as funding went towards features and scale.
Any understanding of how they work is largely theoretical; that seems like a reasonable place for academics to lean in and join the conversation.
jsheard · 2h ago
It doesn't really make sense to trust what OpenAI and friends say about this either, when admitting to any kind of scaling limits would go against the narrative propping up their multi-hundred-billion-dollar valuations. I guess we're just flying blind for now.
pphysch · 2h ago
Why would Big AI kill their golden goose like that?