> Distillation isn’t an uncommon practice, but OpenAI’s terms of service prohibit customers from using the company’s model outputs to build competing AI.
I have the absolute tiniest of violins for this given OpenAI's behaviour vs everyone else's terms of service.
sovietmudkipz · 18h ago
“Copyright must evolve into the 21st century (…so that AI can legally steal everything produced by people)”
And also
“Don’t steal our AI!”
jsheard · 17h ago
The world is not prepared for the mental gymnastics that OpenAI/Google/etc will employ to defend their copyright if their big models ever get leaked.
bitpush · 16h ago
I see no evidence that Google is doing this. Any sources?
Zetaphor · 18h ago
I'm still unclear how they are able to claim this considering their raw thinking traces were never exposed to the end user, only summaries.
parineum · 20h ago
At this point, they're all using each other's output, because so much of the new content they are scraping for data is generated.
These models will converge and plateau because the datasets are only going to get worse as more of their content is incestuous.
jsheard · 18h ago
The default Llama 4 system prompt even instructs it to avoid using various ChatGPT-isms, presumably because they've already scraped so much GPT-generated material that it noticeably skews their model's output.
sovietmudkipz · 18h ago
I recall that AI trained on AI output over many cycles eventually becomes something akin to noise texture as the output degrades rapidly.
Won’t most AI produced content put out into the public be human curated, thus heavily mitigating this degradation effect? If we’re going to see a full length AI generated movie it seems like humans will be heavily involved, hand holding the output and throwing out the AI’s nonsense.
AstroBen · 16h ago
Some will be heavily curated, by those who care about quality. This is a lot slower to produce and requires some expertise to do right, so there will be far less of it.
The vast majority of content will be (is) the fastest and easiest to create - AI slop
wkat4242 · 20h ago
Yes indeed some studies were already done on this.
zackangelo · 17h ago
There might be a plateau coming but I’m not sure that will be the reason.
It seems counterintuitive but there is some research suggesting that using synthetic data might actually be productive.
jsheard · 17h ago
I think there's probably a distinction to be made between deliberate, careful use of synthetic data, as opposed to blindly scraping 1PB of LLM generated SEO spam and force-feeding it into a new model. Maybe the former is useful, but the latter... probably not.
ksymph · 16h ago
Interesting. The tonal change has definitely been noticeable. It also seems a bit more succinct and precise with its word choice, less flowery. That does seem to be in line with Gemini's behavior.
vb-8448 · 18h ago
I wonder if at this point it really matters who used whose data ...