Look, we just need to add some new 'planes' to Unicode - that mirror all communicatively-useful characters, but with extra state bits for...
guaranteed human output - anyone who emits text in these ranges that was AI generated, rather than artisanally human-composed, goes straight to jail.
for human eyes only - anyone who lets any AI train on, or even consider, any text in these ranges goes straight to jail. Fnord, "that doesn't look like anything to me".
admittedly AI generated - all AI output must use these ranges as disclosure, or – you guessed it - those pretending otherwise go straight to jail.
Of course, all the ranges generate visually-indistinguishable homoglyphs, so it's a strictly-software-mediated quasi-covert channel for fair disclosure.
When you cut & paste text from various sources, the provenance comes with it via the subtle character encoding differences.
I am only (1 - epsilon) joking.
io84 · 1d ago
Just like with food: there will be a market value in content that is entirely “organic” (or in some languages “biological”). I.e. written, drawn, composed, edited, and curated by humans.
Just like with food: defining the boundaries of what’s allowed will be a nightmare, it will be impossible to prove content is organic, certifying it will be based entirely on networks of trust, it will be utterly contaminated by the thing it professes to be clean of, and it may even be demonstrably worse while still commanding a higher price point.
godelski · 1d ago
The entire world operates on trust of some form. Often people are acting in good faith. But regulation matters too.
If you don't go after offenders then you create a lemon market. Most customers/people can't tell, so they operate on what they can. That doesn't mean they don't want the other things, it means they can't signal what they want. It's about available information; information asymmetry is what causes lemon markets.
It's also just a good thing to remember since we're in tech and most people aren't tech literate. That makes it hard to determine what "our customers" want.
eru · 1d ago
> If you don't go after offenders then you create a lemon market.
Btw, private markets are perfectly capable of handling 'markets for lemons'. There might be good excuses for introducing regulation, but markets for lemons ain't.
As a little thought exercise, you can take two minutes and come up with some ways businesses can 'fix' markets for lemons and make a profit in the meantime. How many can you find? How many can you find already implemented somewhere?
godelski · 12h ago
> As a little thought exercise, you can take two minutes and come up with some ways businesses can 'fix' markets for lemons and make a profit in the meantime. How many can you find? How many can you find already implemented somewhere?
This sounds exactly like what causes lemon markets in the first place. Subtle things matter and if you don't pay attention to them (or outright reject them) then that ends up with the lemon market situation.
Btw, lemon markets aren't actually good for anyone. They are suboptimal for businesses too. They still make money but they make less money than they would were it a market of peaches.
eru · 3h ago
Let me give you an example: reputation can solve the 'market for lemons'.
If you build a reputation for honest dealing and high quality, then people can trust that you don't sell them lemons (ie bad used cars in the original example). This reputation is valuable, so (most) companies will try to protect it.
And that's exactly what's happening with some used car dealers.
short_sells_poo · 1d ago
Well, throw us a bone! Can you cite robust examples where private markets deal with this gracefully? Because I can't.
An informational asymmetry that is beneficial to the businesses will heavily incentivise the businesses to maintain status quo. It's clear that they will actively fight against empowering the consumer.
The consumer has little to no power to force a change outside of regulation, since individually each consumer has asymptotically zero ability to influence the market. They want the goods, but they have no ability to make an informed decision. They can't go anywhere else. What mechanism would force this market to self correct?
eru · 3h ago
Businesses with a reputation for honest dealing and good quality attract repeat business.
Why are you so pessimistic that customers can't go anywhere else?
The classic market for lemons example is about used cars. People can just not buy used cars, eg by buying only new cars. But a dealer with a reputation for honesty can still sell used cars, even if the customer will only learn whether there's a lemon later.
Another solution is to use insurance, or third party inspectors.
bitmasher9 · 1d ago
I do wonder what would be an acceptable level of guarantee to trigger a “human written” bit.
I actually think a video of someone typing the content, along with the screen the content is appearing on, would be an acceptably high bar at this present moment. I don’t think it would be hard to fake, but I think it would very rarely be worth the cost of faking it.
I think this bar would be good for about 60 days, before someone trains a model that generates authentication videos for incredibly cheap and sells access to it.
kijin · 1d ago
Pen on paper, written without consulting any digital display. Just like exams used to be, before the pandemic.
Of course, the output will be no more valuable to the society at large than what a random student writes in their final exam.
io84 · 1d ago
Interesting... thinking this through: for text and ideas, the information size is often small enough to fit in human memory, and thus containment is already unsolvable! I can ask the LLM to compose the text of a pitch and then film myself writing it out. Nothing you can do will prove the provenance of those bits was not from the AI.
So I think the premium product becomes in-person interaction, where the buyer is present for the genesis of the content (e.g. in dialogue).
Image/video/music might have more scalable forms of organic "product". E.g. a high-trust chain of custody from recording device to screen.
short_sells_poo · 1d ago
Fully in agreement with you. There'll be ultimately two groups of consumers of "organic" content:
1. Those who just want to tick a checkbox will buy mass produced "organic" content. AI slop that had some woefully underpaid intern in a sweatshop add a bit of human touch.
2. People who don't care about virtue signalling but genuinely want good quality will use their network of trust to find and stick to specific creators. E.g. I'd go to the local farmer I trust and buy seasonal produce from them. I can have a friendly chat with them while shopping, they give me honest opinions on what to buy (e.g. this year was great for strawberries!). The stuff they sell on the farm does not have to go through the arcane processes and certifications to be labelled organic, but I've known the farmer for years, I know that they make an effort to minimize pesticide use, they treat their animals with care and respect and the stuff they sell on the farm is as fresh as it can be, and they don't get all their profits scalped by middlemen and huge grocery chains.
io84 · 1d ago
You're capturing nicely how the relationship with the farmer is an essential part of the "product" you buy when you buy high-end organic. I think that will continue to be true in culture/info markets.
thih9 · 1d ago
> emits text in these ranges that was AI generated
How would you define AI generated? Consider a homework assignment and the following scenarios:
1. Student writes everything themselves with pen & paper.
2. Student does some research with an online encyclopedia, proceeds to write with pen and paper. Unbeknownst to them, the online encyclopedia uses AI to answer their queries.
3. Student asks an AI to come up with the structure of the paper, its main points and the conclusion. Proceeds with pen and paper.
4. Student writes the paper themselves, runs the text through AI as a final step, to check for typos, grammar and some styling improvements.
5. Student asks the AI to write the paper for them.
The first one and the last one are obvious, but what about the others?
Edit, bonus:
6. Student writes multiple papers about different topics; later asks an AI to pick the best paper.
juancroldan · 1d ago
7. Student spent their entire high school and bachelor's degree learning from content that teachers generated using AI, and using AI to do their homework, hence becoming AI-contaminated
WithinReason · 1d ago
This is about the characters themselves, therefore:
1. Not AI
2. Not AI
3. Not AI
4. The characters directly generated by AI are AI characters
5. AI
6. Not AI
ljlolel · 1d ago
The student dictates a paper word for word exactly
The student is missing arms and so dictates a paper word for word exactly
WithinReason · 23h ago
Voice-to-text would probably not classify as an LLM
ljlolel · 20h ago
Ok now I dictate my prompt. Do you understand how these multimodal models work?
WithinReason · 19h ago
Then the characters of the LLM's response are flagged as generated. This is simple logic; I don't see what you don't get.
ljlolel · 7h ago
You don't get that dictating goes through the same LLM.
Applejinx · 23h ago
6 is extremely interesting, in that it's tantamount to asking a panel of innumerably many people to give an opinion on which paper is best for a general audience.
It's hard to imagine that NOT working unless it's implemented poorly.
dmsnell · 1d ago
Unicode has a range of Tag Characters, created for marking regions of text as coming from another language. These were deprecated for this purpose in favor of higher level marking (such as HTML tags), but the characters still exist.
They are special because they are invisible and sequences of them behave as a single character for cursor movement.
They mirror ASCII so you can encode arbitrary JSON or other data inside them. Quite suitable for marking LLM-generated spans, as long as you don’t mind annoying people with hidden data or deprecated usage.
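To make it concrete, here's a minimal Python sketch (only the code point ranges are from the standard; the JSON payload and function names are made up for illustration):
    # Hide an ASCII payload (e.g. a provenance note) inside invisible Tag
    # Characters: U+E0020..U+E007E mirror printable ASCII 0x20..0x7E, and
    # U+E007F is CANCEL TAG.
    TAG_OFFSET = 0xE0000
    CANCEL_TAG = "\U000E007F"

    def tag_encode(payload):
        hidden = "".join(chr(TAG_OFFSET + ord(c)) for c in payload
                         if 0x20 <= ord(c) <= 0x7E)
        return hidden + CANCEL_TAG

    def tag_decode(text):
        return "".join(chr(ord(c) - TAG_OFFSET) for c in text
                       if 0xE0020 <= ord(c) <= 0xE007E)

    marked = "A perfectly ordinary sentence." + tag_encode('{"generator":"llm"}')
    print(marked)              # usually renders identically to the visible sentence
    print(tag_decode(marked))  # {"generator":"llm"}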
Can't I get around this by starting my text selection one character after the start of some AI-generated text and ending it one character before the end, Ctrl-C, Ctrl-V?
ema · 1d ago
There are many ways to get around this since it is trivial to write code that strips those tags.
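For instance, assuming the Tag Characters scheme described above, a sketch like this drops the whole block in one pass:
    def strip_tags(text):
        # remove everything in the Tag Characters block, U+E0000..U+E007F
        return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)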
crubier · 1d ago
Twelve milliseconds after this law goes into effect, typing factories open in India, where human operators hand-recopy text from AI sources to perform "data laundering".
miki123211 · 1d ago
If somebody writes in a foreign language and asks Chat GPT to translate to English, is that AI generated content? What about if they write on paper and use an LLM to OCR? What if they give the AI a very detailed outline, constantly ask for rewrites and are ruthless in removing any facts they're not 100% sure of if they slip in? What if they only use AI to fix the grammar and rewrite bad English into a proper scientific tone?
My answer would be a clear "no" to all of these, even though the content ultimately ends up fully copy-pasted from an LLM in all those cases.
theamk · 1d ago
My answer is clear "yes" to most of those.
Yes, machine translations are AI-generated content - I read foreign-language news sites which sometimes have machine-translated articles, and the quality stands out, and not in a good way.
"Maybe" for "writing on paper and using LLM for OCR". It's like an automatic meeting transcript - if the speaker has perfect pronunciation, it works well. If they don't, the meeting notes still look coherent but have little relationship to what the speaker said and/or will miss critical parts. Sadly there is no way for the reader to know that from reading the transcript, so I'd recommend labeling it "AI edited" just in case.
Yes, even if "they give the AI a very detailed outline, constantly ask for rewrites, etc.." it's still AI generated. I am not sure how you can argue otherwise - it's not their words. Also, it's really easy to convince yourself that you are "ruthless in removing any facts you're not 100% sure of" while actually you are anything but.
"What if they only use AI to fix the grammar and rewrite bad English into a proper scientific tone?" - I'd label it "AI-edited" if the rewrites are minor or "AI-generated" if the rewrites are major. This one is especially insidious as people may not expect rewrites to change meaning, so they won't inspect them too much, so it will be easier for hallucinations to slip in.
fho · 1d ago
> they give the AI a very detailed outline […]
Honestly, I think that's a tough one.
(a) it "feels" like you are doing work. Without you the LLM would not even start.
(b) it is very close to how texts are generated without LLMs. Be it in academia, with the PI guiding the process of grad students, or in industry, with managers asking for documentation. In both cases the superior takes (some) credit for work that is in large part done by others.
theamk · 16h ago
Don't see anything "tough" here.
At least in academia, if a PI takes credit for a student's work and does not list them as a co-author, it's widely considered unethical. The rules there are simple - if someone contributed to the text, they get onto the author list.
If we had the same rule for blogs - "this post is authored by fho and ChatGPT" - then I'd be completely satisfied, as this would be sufficient AI disclosure.
As for industry, I think the rules vary a lot from place to place. In some places authorship does not even come up - the slide deck/document can contain copies from random internet sites, or some previous version of the doc, and a reference will only be present if there is a need (say, to lend authority).
a57721 · 1d ago
It really depends on the context, e.g. if you need texts for a database of word frequencies, then the answer is a clear "yes", and LLMs have already ruined everything [1]. The only exception from your list would be OCR where a human proofreads the output.
For the translation part, let me just point out the offensively bad translations that reddit (sites with an additional ?tl=foo) and YouTube automatic dubbing force upon users.
These are immediately, negatively obvious as AI content.
For the other questions the consensus of many publications/journals has been to treat grammar/spellcheck just like non-AI but require that other uses have to be declared. So for most of your questions the answer is a firm "yes".
zdc1 · 1d ago
If the purpose is to identify text that can be used as training data, in some ways it makes sense to me to mark anything and everything that isn't hand-typed as AI generated.
Like for your last example: to me, the concept "proper scientific tone" exists because humans hand-typed/wrote in a certain way. If we use AI edited/transformed text to act as a source for what "proper scientific tone" looks like, we still could end up with an echo chamber where AI biases for certain words and phrases feed into training data for the next round.
Being strict about how we mark text could mean a world where 99% of text is marked as AI-touched and less than 1% is marked as human-originated. That's still plenty of text to train on, though such a split could also arguably introduce its own (measurable) biases...
lazyasciiart · 1d ago
> we still could end up with an echo chamber where AI biases for certain words and phrases feed into training data for the next round.
That’s how it works with humans too. “That sounds professional because it sounds like the professionals”.
RodgerTheGreat · 1d ago
All four of your examples are situations where an LLM has potential to contaminate the structure or content of the text, so in all four cases it is clear-cut that the output poses the same essential hazards to training or consumption as something produced "whole cloth" from a minimal prompt; post-hoc human supervision will at best reduce the severity of these risks.
gojomo · 1d ago
OK, sure, there are gradations.
The new encoding can contain a FLOAT32 side channel on every character, to represent its proportional "AI-ness" – kinda like the 'alpha' transparency channel on pixels.
BugheadTorpeda6 · 1d ago
Yes yes yes yes
c-linkage · 1d ago
Stop ruining my simple and perfect ideas with nuance and complexity!
theamk · 1d ago
Nuance and complexity are a thing, but many of the GP's examples should be clearly AI labeled...
> What if they give the AI a very detailed outline, constantly ask for rewrites and are ruthless in removing any facts they're not 100% sure of if they slip in?
akoboldfrying · 1d ago
The whole point of those examples is to demonstrate that there is considerable diversity of opinion on how those cases "should" be classified -- which tells us that, at least in the near term, nothing useful can be expected from such a simplistic classification scheme.
slashdev · 1d ago
I’ll take the contrarian view. I don’t care if content is generated by a human or by an AI. I care about the quality of the content, and in many cases, the human does a better job currently.
I would like a search engine algorithm that penalizes low quality content. The ones we currently have do a piss poor job of that.
andsoitis · 1d ago
> I would like a search engine algorithm that penalizes low quality content. The ones we currently have do a piss poor job of that.
Without knowing the full dataset that got trimmed to the search result you see, how do you evaluate the effectiveness?
sethhochberg · 1d ago
You're asking a fair question, but I think you're approaching it from a POV that's a bit more of an engineering mindset than the one the person you're responding to is using.
A brilliant algorithm that filters out some huge amount of AI slop is still frustrating to the user if any highly ranked AI slop remains. You still click it, immediately notice what it is, and wonder why the algo couldn’t figure this out if you did so quickly
It’s like complaining to a waiter that there’s a fly in your soup, and the waiter can’t understand why you’re upset because there were many more flies in the soup before they brought it to the table and they managed to remove almost all of them
slashdev · 23h ago
It doesn’t matter how much it filters out, if the top results are still spam.
I barely use Google anymore. Mostly just when I know the website I want, but not the URL.
ianburrell · 19h ago
Maybe have the glyphs be zero width by default, but have a way to show them? I think begin-end markers would work better to mark a whole range. It would need support from the editor to manage the ranges and to change AI-generated text to "mixed" once it's edited.
What might make sense is source marking. If you copy and paste text, it becomes a citation. AI source is always cited.
I have been thinking that there should be provenance metadata in images. Maybe a list of hashes of source images. Real cameras would include the raw sensor data. Again, an AI image would be cited.
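Roughly the kind of record I'm imagining, embedded in the image metadata; every field name here is hypothetical:
    # Hypothetical provenance record attached to an image; all fields made up.
    provenance = {
        "creator": "ai",                      # or "camera", "human"
        "tool": "some-image-model",           # hypothetical generator name
        "sources": [                          # hashes of the source images
            "sha256:<hash-of-source-1>",
            "sha256:<hash-of-source-2>",
        ],
        "raw_sensor_hash": None,              # a real camera would fill this in
    }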
andrewflnr · 1d ago
It would be much less disruptive to require that any network traffic containing AI generated content must have the IP evil bit set.
sebzim4500 · 22h ago
You'd probably want to distinguish between content being readable by AI and being trainable by AI.
E.g. you might be fine with the search tool in chatgpt being able to read/link to your content but not be fine with your content being used to improve the base model.
achierius · 1d ago
Rather than new planes, some sort of combining-character or even just an invisible signifying-mark would achieve the same purpose with far less encoding space. Obviously this would still be a nightmare for everyone who has to process text regardless.
function_seven · 1d ago
Nope. Too easy to accidentally strip out. Each and every glyph must carry the taint.
We don’t want to send innocent people to jail! (Use UCS-18 for maximum benefit.)
qwertycrackers · 1d ago
Sounds like the plot of God Shaped Hole
brian-armstrong · 1d ago
Seems kind of excessive to send them to jail when the prisons are already pretty full. Might be more productive to do summary executions?
sneak · 1d ago
I have long thought that we should extend the plain text format to allow putting provenance metadata into substrings in the file.
This is that, but with a different implementation. Plain text is like two-conductor cable; it's so useful and cost effective, but the moment you add a single abstraction layer above it (a data pin) you can do so much more cool stuff.
crubier · 1d ago
That would be an evolution of HTML. Plain text is just plain text by definition, it can't include markup and annotations etc.
throwaway290 · 1d ago
> for human eyes only - anyone who lets any AI train on, or even consider, any text in these ranges goes straight to jail. Fnord, "that doesn't look like anything to me".
Won't work, because on day 0 someone will write a conversion library, and apparently if you are big enough and have enough lawyers you can just ignore the jail threat (all popular LLMs just scrape the internet and skip licensing any text or code. Show me one that doesn't.)
akoboldfrying · 1d ago
Each character should be, in effect, a signed git commit: in addition to a few bits for the Unicode code point itself, it should store a pointer back to the previous character's hash, plus a digital signature identifying the keyboard that typed it.
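Taking the joke literally for a moment, a minimal sketch; the HMAC key stands in for a real per-keyboard keypair and is obviously made up:
    import hashlib, hmac

    KEYBOARD_KEY = b"burned-into-the-firmware"  # hypothetical per-device secret

    def commit_chain(text):
        prev, chain = b"\x00" * 32, []
        for ch in text:
            digest = hashlib.sha256(prev + ch.encode("utf-8")).digest()
            sig = hmac.new(KEYBOARD_KEY, digest, hashlib.sha256).hexdigest()[:16]
            chain.append((ch, digest.hex()[:16], sig))  # char, "commit" id, "signature"
            prev = digest
        return chain

    for entry in commit_chain("hi"):
        print(entry)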
sReinwald · 23h ago
I understand that you're not completely serious about it, but you're proposing a very brittle technical solution for what is fundamentally a social and motivational issue.
The core flaw is that any such marker system is trivially easy to circumvent. Any user intending to pass AI content off as their own would simply run the text through a basic script to normalize the character set. This isn't a high-level hack; it's a few dozen lines of Python, trivially easy to write for anyone who can follow a few basic tutorials, or a 5-second task for ChatGPT or Claude.
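Something like this sketch is all it takes; it's generic Unicode normalization, not tied to any particular marking scheme:
    import unicodedata

    def launder(text):
        # NFKC folds compatibility homoglyphs onto their canonical forms
        text = unicodedata.normalize("NFKC", text)
        # drop invisible "format" characters (category Cf), which is where
        # tag characters, zero-width spaces, etc. live
        return "".join(c for c in text if unicodedata.category(c) != "Cf")

    print(launder("Totally human pro\u200bse\U000E0041"))  # -> Totally human prose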
Technical solutions to something like this exist in the analog world, of course, like the yellow dots on printers that encode date, time, and the printer's serial number. But, there is a fundamental difference: The user has no control over that enforcement mechanism. It's applied at a firmware/hardware layer that they can't access without significant modification. Encoding "human or AI" markers within the content itself means handing the enforcement mechanism directly to the people you're trying to constrain.
The real danger of such a system isn't even just that it's blatantly ineffective; it's that it creates a false sense of security. The absence of "AI-generated" markers would be incorrectly perceived as a guarantee for human origin. This is a far more dangerous state than even our current one, where a healthy level of skepticism is required for all content.
It reminds me of my own methods of circumventing plagiarism checkers back in school. I'm a native German speaker, and instead of copying from German sources for my homework, I would find an English source on the topic, translate it myself, and rewrite it. The core ideas were not my own, but because the text passed through an abstraction layer (my manual translation), it had no direct signature for the checkers to match. (And in case any of my teachers from back then read this: Obviously I didn't cheat in your class, promise.)
Stripping special Unicode characters is an even simpler version of the same principle. The people this system is meant to catch - those aiming to cheat, deceive, or manipulate - are precisely the ones who will bypass it effortlessly. Apart from the most lazy and hapless, of course. But we are already catching those constantly from being dumb enough to include their LLM prompts, or "Sure, I'll do that for you." when copying and pasting. But if you ask me, those people are not the ones we should be worried about.
//edit:
I'm sure there are way smarter people than me thinking about this problem, but I genuinely don't see any way to solve it with technology that isn't easily circumvented or extremely brittle.
The most promising would likely be something like imperceptible patterns in the content itself, somehow. Like hiding patterns in the length of words used, length of sentences, punctuation, starting letters for sentences, etc. But even if the big players in AI were to implement something like this immediately, it would be completely moot.
Local open-source models that can be run on consumer hardware already are more than capable enough to re-phrase input text without altering the meaning, and likely wouldn't contain these patterns. Manual editing breaks stylometric patterns trivially - swap synonyms, adjust sentence lengths, restructure paragraphs. You could even attack longer texts piecemeal by having different models rephrase different paragraphs (or sentences), breaking the overall pattern. And if all else fails, there's always my manual approach from high school.
foxglacier · 1d ago
But why? It's nice that somebody's collecting sources of pre-AI content that might be useful for curiosity or research or something. But other than that, why does it matter? AI text can still be perfectly good text. What's the psychological need behind this popular anti-AI Luddism?
jofzar · 1d ago
You’re absolutely right that AI-generated text can be good—sometimes even great. But the reason people care about preserving or identifying pre-AI content isn’t always about hating AI. It's more about context and trust.
Think of it like knowing the origin of food. Factory-produced food can be nutritious, but some people want organic or local because it reflects a different process, value system, or authenticity. Similarly, pre-AI content often carries a sense of human intention, struggle, or cultural imprint that people feel connected to in a different way.
It’s not necessarily a “psychological need” rooted in fear—it can be about preserving human context in a world where that’s becoming harder to spot. For researchers, historians, or even just curious readers, knowing that something was created without AI helps them understand what it reflects: a human moment, not a machine-generated pattern.
It’s not always about quality—it’s about provenance.
Edit: For those that can't tell this is obviously just copy and pasted from chatgpt response.
hbs18 · 1d ago
I feel like the em-dashes and "You're absolutely right" already kinda serve the purpose of special AI-only glyphs
BoxOfRain · 1d ago
I've found propensity to swear quite a useful observation when determining whether or not a user is an LLM. I suspect it'll remain useful for quite a while, the corporate LLM providers at least won't be training their models to sound like a sailor eight pints deep any time soon.
foxglacier · 1d ago
OK, so they can choose to read material from publishers that they trust to only produce human generated content. Similar to buying organic food. Pay a bit more for the feeling. No need for those idealists to drag everybody else into it.
multjoy · 1d ago
Why would you want to read something that isn’t written by a human?
K0balt · 1d ago
AI-generated content is inherently a regression to the mean and harms both training and human utility. There is no benefit in publishing anything that an AI can generate - just ask the question yourself. Maybe publish all AI content with <AI generated content> tags, but other than that it is a public nuisance much more often than a public good.
px1999 · 1d ago
Following this logic, why write anything at all? Shakespeare's sonnets are arrangements of existing words that were possible before he wrote them. Every mathematical proof, novel, piece of journalism is simply a configuration of symbols that existed in the space of all possible configurations. The fact that something could be generated doesn't negate its value when it is generated for a specific purpose, context, and audience.
pickledoyster · 1d ago
> William Shakespeare is credited with the invention or introduction of over 1,700 words that are still used in English today
He invented ‘undress’? Like he invented ‘undo’ or ‘unwell’? Come on, that’s silly.
dspillett · 1d ago
Invented might be a bit strong, but he is certainly the first written record of the word. Dress existed as a verb already, as did the generic reversing “un”, but before Shakespeare there is no evidence that they were used this way. Prior to that other words/phrases, which probably still exist in use today, were used instead. Perhaps “disrobe” though the OED lists the first reference to that as only a decade before Taming Of The Shrew (the first written use of undress) was published, so there are presumably other options that were in common use before both.
It is definitely valid to say he popularised the use of the word, which may have been being used informally in small pockets for some time before.
K0balt · 22h ago
Following that logic, we should publish all unique random orderings of words. I think there is a book about a library like that, but it is a great read and is not a regression to the mean of ideas.
Writing worth reading as a non-child surprises, challenges, teaches, and inspires. LLM writing tends towards the least surprising, worn-out tropes that challenge only the patience and attention of the reader. The eager learner, however, will tolerate that, so I suppose I'll give them teaching. They are great at children's stories, where the goal is to rehearse and introduce tropes and moral lessons with archetypes, effectively teaching the listener the language of story.
FWIW I am not particularly a critic of AI and am engaged in AI related projects. I am quite sure that the breakthrough with transformer architecture will lead to the third industrial revolution, for better or for worse.
But there are some things we shouldn’t be using LLMs for.
gojomo · 1d ago
This was an intuitively-appealing belief, even with some qualified experimental support, as of a few years ago.
However, since then, a bunch of capability breakthroughs from (well-curated) AI generations have definitively disproven it.
DennisP · 1d ago
AI generates useful stuff, but unless it took a lot of complicated prompting, it's still true that you could "just ask the question yourself."
This will change as contexts get longer and people start feeding large stacks of books and papers into their prompts.
Swizec · 1d ago
> you could "just ask the question yourself."
Just like googling, AIing is a skill. You have to know how to evaluate and judge AI responses. Even how to ask the right questions.
Especially asking the right questions is harder than people realize. You see this difference in human managers where some are able to get good results and others aren’t, even when given the same underlying team.
multjoy · 1d ago
If you don’t know the answers, how can you judge the machine output?
aydyn · 1d ago
A lot of inquiries are like hash functions. Hard to find, easy to verify.
multjoy · 1d ago
“Siri, show me an example of overconfidence”
gojomo · 1d ago
No, new more-capable and/or efficient models have been forged using bulk outputs of other models as training data.
These improved models do some valuable things better & cheaper than the models, or ensembles of models, that generated their training data. So you could not "just ask" the upstream models. The benefits emerge from further bulk training on well-selected synthetic data from the upstream models.
Yes, it's counterintuitive! That's why it's worth paying attention to, & describing accurately, rather than remaining stuck repeating obsolete folk misunderstandings.
DennisP · 22h ago
That's a process that's internal to companies doing training. It has nothing to do with publishing outputs on the internet.
K0balt · 1h ago
One example of useful output does not negate the flood of pollution. I’m not denying or downplaying the usefulness of AI. I am doubting the wisdom of blindly publishing -anything- without making at least a trivial attempt to ensure that it is useful and worth publishing. It is a form of pollution.
The problem is that it lowers the effort required to produce SEO spam and to “publish” to nearly zero, which creates a perverse incentive to shit on the sidewalk.
Consider the amount of AI-created, blatantly false blog posts about drug interactions, for example. Not advertising, just banal filler to drive site visits, with dangerously false information.
It’s not like shitting on the sidewalk was never a problem before, it’s just that shitting on the sidewalk as a service (SOTSAAS) maybe is something we should try to avoid.
wahern · 1d ago
> a bunch of capability breakthroughs from (well-curated) AI generations has definitively disproven it.
How much work is "well-curated" doing in that statement?
gojomo · 1d ago
Less than you might think! Some of the frontier-advancing training-on-model-outputs ('synthetic data') work just uses other models & automated-checkers to select suitable prompts and desirable subsets of generations.
I find it (very) vaguely like how a person can improve at a sport or an instrument without an expert guiding them through every step up, just by drilling certain behaviors in an adequately-proper way. Training on synthetic data somehow seems to extract a similar iterative improvement in certain directions, without requiring any more natural data. It's somehow succeeding in using more compute to refine yet more value from the original non-synthetic-training-data's entropy.
Marazan · 1d ago
"adequately-proper way" is doing an incredible amount of heavy lifting in that sentence.
gojomo · 16h ago
Yes, but: for humans, even without an expert-over-the-shoulder providing fresh feedback, drilling/practice works – with the right caveats.
And, counter to much intuition & forum folklore, it works for AI models, too – with analogous caveats.
nicbou · 1d ago
How will AI write about a world it never experiences? By training on the work of human beings.
gojomo · 1d ago
The training sets can already include direct data series about the world, where the "work of human beings" is just setting up the collection devices. So models can absolutely "experience the world".
But I'm not suggesting they'll advance much, in the near term, without any human-authored training data.
I'm just pointing out the cold hard fact that lots of recent breakthroughs came via training on synthetic data - text prompted by, generated by, & selected by other AI models.
That practice has now generated a bunch of notable wins in model capabilities – contra the upthread post's sweeping & confident wrongness alleging "AI-generated content is inherently a regression to the mean and harms both training and human utility".
nicbou · 1d ago
> models can absolutely "experience the world"
How does the banana bread taste at the café around the corner? What's the vibe like there? Is it a good place for people-watching?
What's the typical processing time for a family reunion visa in Berlin? What are the odds your case worker will speak English? Do they still accept English-language documents or do they require a certified translation?
Is the Uzbek-Tajik border crossing still closed? Do foreigners need to go all the way to the northern crossing? Is the Pamir highway doable on a bicycle? How does bribery typically work there? Are people nice?
The world is so much more than the data you have about it.
gojomo · 1d ago
Of course, training on synthetic data can't do everything! My main point is: it's been doing a bunch of surprisingly-beneficial things, contra the obsolete beliefs about model-output-worthlessness (or deleteriousness!) for further training to which I was initially responding.
But also: with regard to claims about what models "can't experience", such claims are pretty contingent on transient conditions, and expiring fast.
To your examples: despite their variety, most if not all could soon have useful answers collected by largely-automated processes.
People will comment publicly about the "vibe" & "people-watching" – or it'll be estimable from their shared photos. (Or even: personally-archived life-stream data.) People will describe the banana bread taste to each other, in ways that may also be shared with AI models.
Official info on policies, processing time, and staffing may already be public records with required availability; recent revisions & practical variances will often be a matter of public discussion.
To the extent all your examples are questions expressed in natural-language text, they will quite often be asked, and answered, in places where third parties – humans and AI models – can learn the answers.
Wearable devices, too, will keep shrinking the gap between things any human is able to see/hear (and maybe even feel/taste/smell) and that which will be logged digitally for wider consultation.
nicbou · 5h ago
So in the end, there is still a human doing the work
multjoy · 1d ago
You’ve used an LLM to write that, haven’t you.
gojomo · 16h ago
No - and you can compare the style & written tics for continuity with my 18y of posts here.
I used 'delving' in an HN comment more than a decade before LLMs became a thing!
> data series about the world, where the "work of human beings" is just setting up the collection devices. So models can absolutely "experience the world"
But not experience it the way humans do.
We don't experience a data series; we experience sensory input in a complicated, nuanced way, modified by prior experiences and emotions, etc. Remember that qualia are subjective, with a biological underpinning.
gojomo · 1d ago
Perhaps. But these models can already clearly write about the world, in useful ways, without such 'qualia' or 'biological underpinnings'.
andsoitis · 1d ago
Sure, and there are many such writings that can be useful. No denying. But the LLM cannot experience like humans do and so will forever be outside our circle. Whether it also remains outside our circle of empathy, or us outside of its, remains to be discovered.
K0balt · 1d ago
I didn't mean to imply that -no- AI-generated content is useful, only that the vast, vast majority is pollution. The problem is that it is so cheap to produce garbage content with AI that writing actual content is disincentivized, and doing web searches has become an exercise in sifting through AI-generated slop.
That at least will add extra work to filter usable training data, and costs users minutes a day wading through the refuse.
jbc1 · 1d ago
If I ask the question myself then there's no step where a human expert has vetted the content and put their name on it. That curation and vouching is of value.
Now your mind might have immediately gone "pffff, as if they're doing that", and I agree, but only to the extent that it largely wasn't happening prior to AI anyway. The vast majority of internet content was already low quality and rushed out by low-paid writers who lacked expertise in what they were writing about. AI doesn't change that.
flir · 1d ago
Completely agree. We are used to thinking of authorship as the critical step. We're going to have to adjust to thinking of publication as the critical step. In an ideal world, publication of a piece would be seen as vouching for that piece. Putting your reputation on the line.
I wonder if we'll see a resurgence in reputation systems (probably not).
tehjoker · 1d ago
This is basically already how publications work.
sneak · 1d ago
What about AI modified or copy edited content?
I write blog posts now by dictating into voice notes, transcribing it, and giving it to CGPT or Claude to work on the tone and rhythm.
theamk · 1d ago
So IMHO the right thing is to add an "AI rewritten" label to your blog.
hm.. I wonder where this kind of label should live? For a personal blog, putting it on every post seems redundant, as if the author uses it, they likely use it for all posts. And many blogs don't have a dedicated "about this blog" section.
I wonder if things will end up like organic food labeling or "made in .." labels. Some blogs might say "100% by human", some might say "Designed by human, made by AI" and some might just say nothing.
sneak · 1d ago
AI is just an inanimate tool.
Do I need to disclose that I used a keyboard to write it, too?
The stuff I edit with AI is 100% made by a human - me.
theamk · 16h ago
In the context of writing text, a keyboard and a text editor are inanimate tools because they cannot introduce text the user did not come up with.
Spellcheck and autocorrect can come up with new words, and so are often anthropomorphized; they're not 100% "inanimate tools" anymore.
AI can form its own sentences and come up with its own facts to a much greater degree, so I would not call it an "inanimate tool" at all (again, in the context of writing text). It is much closer to an editor-for-hire or copywriter-for-hire, and I think it should be treated the same as far as attribution goes.
hm.. looks like I am convincing myself into your point :)
After all, if another human edits/proofreads my posts before publish, I don't need to disclose that on my post... So why should AI's editing be different?
SamPatt · 1d ago
Nonsense. Have you used any of the deep research tools?
Don't fall for the utopia fallacy. Humans also publish junk.
krapht · 1d ago
Yes, and deep research was junk for the hard topics that I actually needed to sit down and research. Anything shallower I can usually reach by search engine use and scan; deep research saves me about 15-30 minutes for well-covered topics.
For the hard topics, the solution is still the same as pre-AI - search for popular survey papers, then start crawling through the citation network and keeping notes. The LLM output had no idea of what was actually impactful vs what was a junk paper in the niche topic I was interested in so I had no other alternative than quality time with Google Scholar.
We are a long way from deep research even approaching a well-written survey paper written by grad student sweat and tears.
triceratops · 1d ago
> deep research saves me about 15-30 minutes for well-covered topics.
Most people are capable of maybe 4 good hours a day of deep knowledge work. Saving 30 minutes is a lot.
SamPatt · 1d ago
Not everything is hard topics though.
I've found getting a personalized report for the basic stuff is incredibly useful. Maybe you're a world class researcher if it only saves you 15-30 minutes, I'm positive it has saved me many hours.
Grad students aren't an inexhaustible resource. Getting a report that's 80% as good in a few minutes for a few dollars is worth it for me.
cobbzilla · 1d ago
Steel-man angle: A desire for data provenance is a good thing with benefits that are independent of utopias/humans vs machines kinds of questions.
But, all provenance systems are gamed. I predict the most reliable methods will be cumbersome and not widespread, thus covering little actual content. The easily-gamed systems will be in widespread use, embedded in social media apps, etc.
Questions:
1. Does there exist a data provenance system that is both easy to use and reliable "enough" (for some sufficient definition of "enough")? Can we do bcrypt-style more-bits=more-security and trade time for security?
2. Is there enough of an incentive for the major tech companies to push adoption of such a system? How could this play out?
cryptonector · 1d ago
Yes, but GP's idea of segregating AI-generated content is worth considering.
If you're training an AI, do you want it to get trained on other AIs' output? That might be interesting actually, but I think you might then want to have both, an AI trained on everything, and another trained on everything except other AIs' output. So perhaps an HTML tag for indicating "this is AI-generated" might be a good idea.
RandomBK · 1d ago
My 2c is that it is worthwhile to train on AI generated content that has obtained some level of human approval or interest, as a form of extended RLHF loop.
cryptonector · 1d ago
Ok, but how do you denote that approval? What if you partially approve of that content? ("Overall this is correct, but this little nugget is hallucinated.")
bongodongobob · 1d ago
It apparently doesn't matter unless you somehow consider the entire Internet to be correct. They didn't only feed LLMs correct info. It all just got shoveled in and here we are.
cryptonector · 1d ago
Sure, humans also hallucinate.
thephyber · 1d ago
I can see the value of labeling, so that an AI can be trained on purely non-AI-generated content.
But I don't think that's a reasonable goal. Pragmatic example: there are almost no optional HTML tags or optional HTTP headers that are used anywhere close to 100% of the times they apply.
Also, I think the field is already muddy, even before the game starts. Spell checkers, Grammarly, and translation all had AI contributions and likely affect most of the human-generated text on the internet. The heuristic of "one drop of AI" is not useful. And any heuristic more complicated than "one drop" introduces too much subjective complexity for a Boolean data type.
cryptonector · 12h ago
Yes, it's impossible. We'd have to have started years ago. And then people wouldn't have the discipline to label content correctly or at all. It can't be done.
IncreasePosts · 1d ago
Shouldn't there be enough training content from the pre-ai era that the system itself can determine whether content is AI generated, or if it matters?
Infinity315 · 1d ago
Just ask any person who works in teaching or any of the numerous faulty AI detectors (they're all faulty).
Any current technology which can be used to accurately detect pre-AI content would necessarily imply that the same technology could be used to train an AI to generate content that skirts by the AI detector. Sure, there is going to be a lag time, but eventually we will run out of non-AI content.
cryptonector · 1d ago
No, that's the problem. Pre-AI era content a) is often not dated, so not identifiable as such, and b) also gets out of date. What was thought to be true 20 years ago might not be thought to be true today. Search for the "half-life of facts".
munificent · 1d ago
The observation that humans poop is not sufficient justification for spending millions of dollars building an automated firehose that pumps a torrent of shit onto the public square.
SamPatt · 1d ago
People are paying millions for access to the models. They are getting value from them or wouldn't be paying.
It's just not accurate to say they only produce shit. Their rapid adoption demonstrates otherwise.
munificent · 21h ago
I make no claim to the overall value of LLMs. I'm just pointing out that your analogy is a fallacy. The fact that group A does a small bad thing is not a justification for allowing group B to do a large bad thing. That is true regardless of whatever non-bad things group B does.
It may be the case that the non-bad things B does outweigh the bad things. That would be an argument in favor of B. But another group doing bad things has no bearing on the justification for B itself.
sensanaty · 1d ago
From my experience, the people spending "millions" are hoping to get those millions * 10 back because a buddy of theirs told them "this AI thing" is going to replace the most expensive part of companies, the staff costs - not because they think the product is any good. We're getting AI forced down our throats because VC is throwing cash in like there's no tomorrow, not because of whatever value might or might not be there.
tonyedgecombe · 1d ago
>Humans also publish junk
They also consume it.
protocolture · 1d ago
I like how the chosen terminology is perfectly picked to paint the concern as irrelevant.
"Since the end of atmospheric nuclear testing, background radiation has decreased to very near natural levels, making special low-background steel no longer necessary for most radiation-sensitive uses, as brand-new steel now has a low enough radioactive signature that it can generally be used."
I don't see that:
1. There will be a need for "uncontaminated" data. LLM data is probably slightly better than the natural background reddit comment. Falsehoods and all.
2. "Uncontaminated" data will be difficult to find. What with archive.org, gutenberg etc.
3. That LLM output is going to infest everything anyway.
fer · 19h ago
>2. "Uncontaminated" data will be difficult to find. What with archive.org, gutenberg etc.
>Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies.
I really do just bail out whenever anyone uses the word slop.
>As one example, Philip Shapira reports that ChatGPT (OpenAI's popular brand of generative language model circa 2024) is obsessed with the word "delve" in a way that people never have been, and caused its overall frequency to increase by an order of magnitude.
Should run the same analysis against the word slop.
jbs789 · 1d ago
Umm… we stopped nuclear testing, which is what allowed the background radiation to reduce.
protocolture · 13h ago
And cars replaced horses in London, rendering forecasts of London being buried under a mountain of horse manure irrelevant too.
Change really is the only constant. The short term predictive game is rigged against hard predictions.
Legend2440 · 1d ago
I'm not convinced this is going to be as big of a deal as people think.
Long-run you want AI to learn from actual experience (think repairing cars instead of reading car repair manuals), which both (1) gives you an unlimited supply of non-copyrighted training data and (2) handily sidesteps the issue of AI-contaminated training data.
AnotherGoodName · 1d ago
The hallucinations get quoted and then sourced as truth unfortunately.
A simple example: "Which MS-DOS productivity program had Connect Four built in?"
I have an MS-DOS emulator and know the answer. It's a little obscure, but it's amazing how I get a different answer from all the AIs every time. I never saw any of them give the correct answer. Try asking it the above. Then ask it if it's sure about that (it'll change its mind!).
Now remember that these types of answers may well end up quoted online and then learnt by AI, with that circularly referenced source as the source. We have no truth at that point.
And seriously try the above question. It's a great example of AI repeatedly stating an authoritative answer that's completely made up.
dwringer · 1d ago
When I asked, "Good afternoon! I'm trying to settle a bet with a friend (no money on the line, just a friendly "bet"!) Which MS DOS productivity program had a playable version of the game Connect Four built in as an easter egg?", it went into a very detailed explanation of how to get to the "Hall of Tortured Souls" easter egg in Excel 5.0, glossing over the fact that I said "MS DOS" and also conflating the easter eggs by telling me specifically that the "excelkfa" cheat code would open a secret door/bridge to the connect four game.
So, I retried with, "Good afternoon! I'm trying to settle a bet with a friend (no money on the line, just a friendly "bet"!) Which *MS DOS* [not Win95, i.e., Excel 5] productivity program had a playable version of the game Connect Four built in as an easter egg?". I got Lotus 1-2-3 once, Excel 4 twice, and Borland Quattro Pro three different times, all from that prompt.
The correct answer you point out in another subthread was never returned as a possibility, and the responses all definitely came across as confident. Definitely a fascinating example.
MostlyStable · 1d ago
Claude 4 Sonnet gave the (reasonable given the obscurity, but wrong) answer that there was no such easter egg:
>I'm not aware of any MS-DOS productivity program that had Connect Four as a built-in easter egg. While MS-DOS era software was famous for including various easter eggs (like the flight simulator in Excel 97, though that was Windows-era), I can't recall Connect Four specifically being hidden in any major DOS productivity applications.
>The most well-known DOS productivity suites were things like Lotus 1-2-3, WordPerfect, dBase, and later Microsoft Office for DOS, but I don't have reliable information about Connect Four being embedded in any of these.
>It's possible this is a case of misremembered details - perhaps your friend is thinking of a different game, a different era of software, or mixing up some details. Or there might be an obscure productivity program I'm not familiar with that did include this easter egg.
>Would you like me to search for more information about DOS-era software easter eggs to see if we can track down what your friend might be thinking of?
That seems like a pretty reasonable response given the details, and included the appropriate caveat that the model was not aware of any such easter egg, and didn't confidently state that there was none.
SlowTao · 1d ago
>It's possible this is a case of misremembered details - perhaps your friend is thinking of a different game, a different era of software, or mixing up some details. Or there might be an obscure productivity program I'm not familiar with that did include this easter egg.
I am not a fan of this kind of communication. It doesn't know, so it tries to deflect the shortcoming onto the user.
I'm not saying that isn't a valid concern, but it can be used as an easy out for its gaps in knowledge.
fn-mote · 1d ago
> I am not a fan of this kind of communication. It doesn't know, so it tries to deflect the shortcoming onto the user.
This is a very human-like response when asked a question that you think you know the answer to, but don't want to accuse the asker of having an incorrect premise. State what you think, then leave the door open to being wrong.
Whether or not you want this kind of communication from a machine, I'm less sure... but really, what's the issue?
The problem of the incorrect premise happens all of the time. Assuming the person asking the question is correct 100% of the time isn't wise.
richardwhiuk · 1d ago
Humans use the phrase "I don't know.".
AI never does.
MostlyStable · 1d ago
>I'm not aware of any MS-DOS productivity program...
>I don't know of any MS-DOS productivity programs...
I dunno, seems pretty similar to me.
And in a totally unrelated query today, I got the following response:
>That's a great question, but I don't have current information...
Sounds a lot like "I don't know".
bigiain · 1d ago
>> And in a totally unreltaed query today, I got the following response:
>That's a great question,
Found the LLM whose training corpus includes transcripts of every motivational speaker and TED talk Q&A ever...
MostlyStable · 1d ago
Yeah, I've been meaning to tweak my system prompt to try and avoid some of that kind of language, but haven't gotten around to it yet.
justsomehnguy · 1d ago
Because there is no "I don't know" in the training data. Can you imagine a forum where, in response to a question about some obscure easter egg, there are hundreds of "I don't know" replies?
recursive · 1d ago
You gave one explanation, but the problem remains.
nfriedly · 1d ago
Gemini 2.5 Flash gave me a similar answer, although it was a bit more confident in its incorrect answer:
> You're asking about an MS-DOS productivity program that had ConnectFour built-in.
I need to tell you that no mainstream or well-known MS-DOS productivity program (like a word processor, spreadsheet, database, or integrated suite) ever had the game ConnectFour built directly into it.
Aeolun · 1d ago
> didn't confidently state that there was none
And better. Didn’t confidently state something wrong.
ziml77 · 1d ago
Whenever I ask these AIs "Is the malloc function in the Microsoft UCRT just a wrapper around HeapAlloc?", I get answers that are always wrong.
They claim things like the function adds size tracking so free doesn't need to be called with a size or they say that HeapAlloc is used to grab a whole chunk of memory at once and then malloc does its own memory management on top of that.
That's easy to prove wrong by popping ucrtbase.dll into Binary Ninja. The only extra things it does beyond passing the requested size off to HeapAlloc are: handle setting errno, change any request for 0 bytes to requests for 1 byte, and perform retries for the case that it is being used from C++ and the program has installed a new-handler for out-of-memory situations.
Legend2440 · 1d ago
ChatGPT 4o waffles a little bit and suggests the Microsoft Entertainment pack (which is not productivity software or MS-DOS), but says at the end:
>If you're strictly talking about MS-DOS-only productivity software, there’s no widely known MS-DOS productivity app that officially had a built-in Connect Four game. Most MS-DOS apps were quite lean and focused, and games were generally separate.
I suspect this is the correct answer, because I can't find any MS-DOS Connect Four easter eggs by googling. I might be missing something obscure, but generally if I can't find it by Googling I wouldn't expect an LLM to know it.
AnotherGoodName · 1d ago
ChatGPT in particular will give an incorrect (but unique!) answer every time. At the risk of losing a great example of AI hallucination, it's AutoSketch.
Wow, that is quite obscure. Even with the name I can't find any references to it on Google. I'm not surprised that the LLMs don't know about it.
You can always make stuff up to trigger AI hallucinations, like 'which 1990s TV show had a talking hairbrush character?'. There's no difference between 'not in the training set' and 'not real'.
> There's no difference between 'not in the training set' and 'not real'.
I know what you meant but this is the whole point of this conversation. There is a huge difference between "no results found" and a confident "that never happened", and if new LLMs are trained on old ones saying the latter then they will be trained on bad data.
dowager_dan99 · 1d ago
>> You can always make stuff up to trigger AI hallucinations
Not being able to find an answer to a made up question would be OK, it's ALWAYS finding an answer with complete confidence that is a major problem.
robocat · 1d ago
I imagine asking for anything obscure where there's plenty of noise can cause hallucinations. What Google search provides the answer? If the answer isn't in the training data, what do you expect? Do you ask people obscure questions, and do you then feel better than them when they guess wrong?
I just tried:
What MS-DOS program contains an easter-egg of an Amiga game?
And got some lovely answers from ChatGPT and Gemini.
As an aside, I personally would associate "productivity program" with a productivity suite (like MS Works), so I would have trouble googling an answer (I started as a kid on an Apple ][ and have worked with computers ever since, so my ignorance is not age or skill related).
Nition · 1d ago
The good option would be for the LLM to say it doesn't know. It's the making up answers that's the problem.
spogbiper · 1d ago
interesting. gemini 2.5 pro considered that it might be "AutoCAD" but decided it was not:
"A specific user recollection of playing "Connect Four" within a version of AutoCAD for DOS was investigated. While this suggests the possibility of such a game existing within that specific computer-aided design (CAD) program, no widespread documentation or confirmation of this feature as a standard component of AutoCAD could be found. It is plausible that this was a result of a third-party add-on, a custom AutoLISP routine (a scripting language used in AutoCAD), or a misremembered detail."
Applejinx · 1d ago
I wouldn't worry about losing examples. These things are Mandela Effect personified. Anything that is generally unknown and somewhat counterintuitive will be Hallucination Central. It can't NOT be.
groby_b · 1d ago
In what world is that 'productivity software'?
Sure, it helps you do a job more productively, but that's roughly all non-entertainment software. And sure, it helps a user create documents, but, again, most non-entertainment software.
Even in the age of AI, GIGO holds.
squeaky-clean · 1d ago
"Productivity software" typically refers to any software used for work rather than entertainment. It doesn't mean software such as a todo list or organizer. Look up any laptop review and you'll find they segment benchmarks between gaming and "productivity". Just because you personally haven't heard of it doesn't mean it's not a widely used term.
> Productivity software (also called personal productivity software or office productivity software) is application software used for producing information (such as documents, presentations, worksheets, databases, charts, graphs, digital paintings, electronic music and digital video). Its names arose from it increasing productivity
AnotherGoodName · 1d ago
Debatable, but regardless, you could reformulate the question however you want and you still won't get anything other than hallucinations, since there are no references to this on the internet. You need to load up Autosketch 2.0 in a DOS emulator and see it for yourself.
Amusingly, I get an authoritative but incorrect "It's AutoCAD!" if I narrow the question down to a program commonly used by engineers that had Connect Four built in.
overfeed · 1d ago
> I might be missing something obscure, but generally if I can't find it by Googling I wouldn't expect an LLM to know it.
The Google index is already polluted by LLM output, albeit unevenly, depending on the subject. It's only going to spread to all subjects as content farms go down the long tail of profitability, eking out profits; Googling won't help because you'll almost always find a result that's wrong, as will LLMs that resort to searching.
Don't get me started on Google's AI answers, which assert wrong information, launder fanfic/reddit/forum posts, and elevate all sources to the same level.
dowager_dan99 · 1d ago
It gave me two answers (one was Borland Sidekick). I then asked "are you sure about that?"; it waffled and said actually it was neither of those, it was IBM Handshaker. I said "I don't think so, I think it's another productivity program", and it replied that on further review it's not IBM Handshaker and there are no productivity programs that include Connect Four. No wonder CTOs like this shit so much; it's the perfect bootlick.
relaxing · 1d ago
If I can find something by Googling I wouldn’t need an LLM to know it.
dowager_dan99 · 1d ago
Any current question to an LLM is just a textual interpretation of the search results though; they use the same source of truth (or lies, in many cases).
kbenson · 1d ago
So, like normal history just sped up exponentially to the point it's noticeable in not just our own lifetime (which it seemed to reach prior to AI), but maybe even within a couple years.
I'd be a lot more worried about that if I didn't think we were doing a pretty good job of obfuscating facts the last few years ourselves without AI. :/
spogbiper · 1d ago
Just tried this with Gemini 2.5 Flash and Pro several times; it keeps saying it doesn't know of any such thing and suggesting either that it was a software bundle where the game was included alongside the productivity application, or that I'm not remembering correctly.
Not great (assuming there actually is such software), but not as bad as making something up.
The ChatGPT search function will probably find this thread soon and answer correctly; the HN domain does well on SEO and shows up in search results quickly.
jonchurch_ · 1d ago
What is the correct answer?
AnotherGoodName · 1d ago
Autosketch for MS-DOS had Connect Four. It's under "Game" in the File menu.
This is an example of a random fact old enough that no one ever bothered talking about it on the internet. So it's not cited anywhere, but many of us can just plain remember it. When you ask ChatGPT (as of now, June 6th 2025) it gives a random answer every time.
Now that I've stated this on the internet in a public manner it will be corrected, but... there are a million such things I could give as an example: some question obscure enough that no one's answered it on the internet before, so AI doesn't know, but recent enough that many of us know the answer, so we can instantly see just how much AI hallucinates.
To give some context, I wanted to go back to it for nostalgia's sake but couldn't quite remember the name of the application. I asked various AIs what application I was trying to remember and they were all off the mark. In the end, only my own neurons finally lighting up got me the answer I was looking for.
$ strings disk1.img | grep 'game'
The object of the game is to get four
Start a new game and place your first
So if ChatGPT cares to analyze all files on the internet, it should know the correct answer...
(edit: formatting)
ericrallen · 1d ago
Interestingly, the Kagi Assistant managed to find this thread while researching the question, but every model I tested (without access to the higher quality Ultimate plan models) was unable to retrieve the correct answer.
It will be interesting to see if/when this information gets picked up by models.
WillAdams · 1d ago
Interestingly, Copilot in Windows 11 claims that it was Excel 95 (which actually had a Flight Simulator Easter Egg).
ofrzeta · 1d ago
Next time try asking which software has the classic quote by William of Ockham in the About menu.
warkdarrior · 1d ago
> random fact old enough no one ever bothered talking about it on the internet. So it's not cited anywhere but many of us can just plain remember it.
And since it is not written down on some website, this fact will disappear from the world once "many of us" die.
bongodongobob · 1d ago
Wait until you meet humans on the Internet. Not only do they make shit up, but they'll do it maliciously to trick you.
abeppu · 1d ago
> which both (1. gives you an unlimited supply of noncopyrighted training data and (2. handily sidesteps the issue of AI-contaminated training data.
I think these are both basically somewhere between wrong and misleading.
Needing to generate your own data through actual experience is very expensive, and can mean that data acquisition now comes with real operational risks. Waymo gets real world experience operating its cars, but the "limit" on how much data you can get per unit time depends on the size of the fleet, and requires that you first get to a level of competence where it's safe to operate in the real world.
If you want to repair cars, and you _don't_ start with some source of knowledge other than on-policy roll-outs, then you have to expect that you're going to learn by trashing a bunch of cars (and still pay humans to tell the robot that it failed) for some significant period.
There's a reason you want your mechanic to have access to manuals, and have gone through some explicit training, rather than just try stuff out and see what works, and those cost-based reasons are true whether the mechanic is human or AI.
Perhaps you're using an off-policy RL approach -- great! If your off-policy data is demonstrations from a prior generation model, that's still AI-contaminated training data.
So even if you're trying to learn by doing, there are still meaningful limits on the supply of training data (which may be way more expensive to produce than scraping the web), and likely still AI-contaminated (though perhaps with better info on the data's provenance?).
nradov · 1d ago
There is an enormous amount of actual car repair experience training data on YouTube but it's all copyrighted. Whether AI companies should have to license that content before using it for training is a matter of some dispute.
AnotherGoodName · 1d ago
>Whether AI companies should have to license that content before using it for training is a matter of some dispute.
We definitely do not have the right balance of this right now.
E.g. I'm working on a set of articles that give a different path to learning some key math knowledge (it just comes at the subject from a different point of view and is more intuitive). Historically such blog posts have helped my career.
It's not ready for release anyway, but I'm hesitant to release my work in this day and age, since AI can steal it and regurgitate it to the point where my articles appear unoriginal.
It's stifling. I'm of the opinion you shouldn't post art, educational material, code or anything that you wish to be credited for on the internet right now. Keep it to yourself or else AI will just regurgitate it to someone without giving you credit.
Legend2440 · 1d ago
The flip side is: knowledge is not (and should not be!) copyrightable. Anyone can read your articles and use the knowledge it contains, without paying or crediting you. They may even rewrite that knowledge in their own words and publish it in a textbook.
AI should be allowed to read repair manuals and use them to fix cars. It should not be allowed to produce copies of the repair manuals.
AnotherGoodName · 1d ago
Using the work of others with no credit given to them would at the very least be considered a dick move.
AI is committing absolute dick moves non-stop.
nradov · 1d ago
Some people claim that the entire trillion dollar Apple empire is based on using the work of Xerox PARC. Was that a dick move? Perhaps, but at this point it hardly matters.
seadan83 · 1d ago
An AI does not know what "fix" means, let alone control anything that would physically fix the car. So, for an AI to fix a car means giving instructions on how to do that, in other words, reproducing pertinent parts of the repair manual. One, is this a fair framing? Two, is this a distinction without a difference?
throw10920 · 1d ago
> The flip side is: knowledge is not (and should not be!) copyrightable.
Irrelevant. Books and media are not pure knowledge, and those are what is being discussed here, not knowledge.
> Anyone can read your articles and use the knowledge it contains, without paying or crediting you.
Completely irrelevant. AI are categorically different than humans. This is not a valid comparison to make.
This is also a dishonest comparison, because there's a difference between you voluntarily publishing an article for free on the internet (which doesn't even mean that you're giving consent to train on your content), and you offering a paid book online that you have to purchase.
> AI should be allowed to read repair manuals and use them to fix cars.
Yes, after the AI trainers have paid for the repair manuals at the rate that the publishers demand, in exactly the same way that you have to pay for those manuals before using them.
Of course, because AI can then leverage that knowledge at a scale orders of magnitude greater than a human, the cost should be orders of magnitude higher, too.
smikhanov · 1d ago
Prediction: there won’t be any AI systems repairing cars before there will be general intelligence-capable humanoid robots (Ex Machina-style).
There also won’t be any AI maids in five-star hotels until those robots appear.
This doesn’t make your statement invalid, it’s just that the gap between today and the moment you’re describing is so unimaginably vast that saying “don’t worry about AI slop contaminating your language word frequency databases, it’ll sort itself out eventually” is slightly off-mark.
sebtron · 1d ago
I don't understand the obsession with humanoid robots that many seem to have. Why would you make a car repairing machine human-shaped? Like, what would it use its legs for? Wouldn't it be better to design it tailored to its purpose?
TGower · 1d ago
Economies of scale. The humanoid form can interact with all of the existing infrastructure for jobs currently done by humans, so that's the obvious form factor for companies looking to churn out robots to sell by the millions.
thaumasiotes · 1d ago
Can, but an insectoid form factor and much smaller size could easily be better. It's not so common that being of human size is an advantage even where things are set up to allow for humans.
Consider how chimney sweeps used to be children.
tartoran · 1d ago
Not only that but if humanoid robots were available commercially (and viable) they could be used as housemaids or for.. companionship if not more. Of course, we're entering SciFi territory but it's long been a SciFi theme.
numpad0 · 1d ago
They want a child.
smikhanov · 1d ago
Legs? To jump into the workshop pit, among other things. Palms are needed to hold a wrench or a spanner, fingers are needed to unscrew nuts.
Cars are not built to accommodate whatever universal repair machine there could be, cars are built with an expectation that a mechanic with arms and legs will be repairing it, and will be for a while.
A non-humanoid robot in a human-designed world populated by humans looks and behaves like this, at best: https://youtu.be/Hxdqp3N_ymU
sheiyei · 1d ago
This is such a bad take that I have a hard time believing it's not just trolling.
Really, a robot which could literally have an impact wrench built into it would HOLD a SPANNER and use FINGERS to remove bolts?
Next I expect you'll say self-driving cars will necessarily require a humanoid sitting in the driver's seat to be feasible, and that delivery robots (broadly in use in various places around the world) have a tiny humanoid robot inside them to make them go.
smikhanov · 1d ago
> Really, a robot which could literally have an impact wrench built into it would HOLD a SPANNER and use FINGERS to remove bolts?
Sure, why not? A built-in impact wrench is built in forever, but a palm and fingers can hold a wrench, a spanner, a screwdriver, a welding torch, a drill, an angle grinder and a trillion other tools of every possible size and configuration that any workshop already has. You suggest building all those tools into a robot? The multifunctional device you imagine is now incredibly expensive and bulky, likely can't reach into the narrow gaps between a car's parts, still doesn't have as many degrees of freedom as a human hand, and is limited by the set of tools the manufacturer thought of, unlike the hand, which can grab any previously unexpected tool with ease.
Still want to repair the car with just the built-in wrench?
sheiyei · 1d ago
Ugh, still missed by a long shot. How about, instead of a convoluted set of dozens of tiny, weak joints, there's a connection that delivers power (electric, pneumatic, torque, you name it) to any toolhead you want, and the robot can swap toolheads like existing manufacturing robots do. A hand tool for picking things up may be reasonable in rare cases, but even that won't look like a human hand, if it's not made by a madman. But yeah, let's prioritize a bad compromise of a humanoid with 5,000 joints instead of basically an arm with 10 joints that achieves the same thing, because little robot cute and look like me.
smikhanov · 1d ago
Alright, let's run this thought experiment further then.
You suggest a connector to connect to a set of robot-compatible tools, fine. That set is again limited by what the robot manufacturer thought of in advance, so you're out of luck if you need to weld things, for example, but your robot doesn't come with a compatible welder. Attaching and detaching those tools now becomes a weak point: you either need a real human replacing the tools (ruining the autonomy), or you need to devise a procedure for your robot to switch tools somehow by detaching one from itself, putting it on a workbench for further use, and attaching a new one from a workbench.
The more universal and autonomous that switching procedure becomes, the more you're in the business of actually reinventing a human hand.
But let's assume that you've succeeded in that, against all odds. You now have a powerful robotic arm, connected to a base, that can work with a set of tools it can itself attach and detach. Now imagine for a second that this arm can't reach a certain point in the car it repairs and needs to move itself across the workshop.
Suddenly you're in the business of reinventing the legs.
SoftTalker · 1d ago
More and more, cars are not built with repair in mind. At least not as a top priority. There are many repairs that now require removal of substantial unrelated components or perhaps the entire engine because the failed thing is just impossible to reach in situ.
Nuts and bolts are used because they are good mechanical fasteners that take advantage of the enormous "squeezing" leverage a threaded fastener provides. Robots already assemble cars, and we still use nuts and bolts.
bluGill · 1d ago
Cars were always like that. Once in a while they worry about repairs but often they don't, and never have.
ToucanLoucan · 1d ago
It blows my mind that some folks are still out here thinking LLMs are the tech-tree towards AGI and independently thinking machines, when we can't even get copilot to stop suggesting libraries that don't exist for code we fully understand and created.
I'm sure AGI is possible. It's not coming from ChatGPT no matter how much Internet you feed to it.
Legend2440 · 1d ago
Well, we won't be feeding it internet - we'll be using RL to learn from interaction with the real world.
LLMs are just one very specific application of deep learning, doing next-word-prediction of internet text. It's not LLMs specifically that's exciting, it's deep learning as a whole.
bravesoul2 · 1d ago
Long-run you want AGI then? Once we get AGI, the spam will be good?
Currently, there is no reason to believe that "AI contamination" is a practical issue for AI training runs.
AIs trained on public scraped data that predates 2022 don't noticeably outperform those trained on scraped data from 2022 onwards. Hell, in some cases, newer scrapes perform slightly better, token for token, for unknown reasons.
numpad0 · 1d ago
Yeah, the thinking behind the "low background steel" concept is that AI training on synthetic data could lead to a "model collapse" that renders the AIs completely mad and useless. Either that didn't happen, or all the AI companies internally hold a working filter to sieve out AI data; I'd bet on the former. I still think there is a chance of model collapse happening to humans after too much exposure to AI-generated data, but that's just my anecdotal observations and gut feeling.
demosthanos · 1d ago
> AIs trained on public scraped data that predates 2022 don't noticeably outperform those trained on scraped data from 2022 onwards. Hell, in some cases, newer scrapes perform slightly better, token for token, for unknown reasons.
This is really bad reasoning for a few reasons:
1) We've gotten much better at training LLMs since 2022. The negative impacts of AI slop in the training data certainly don't outweigh the benefits of orders of magnitude more parameters and better training techniques, but that doesn't mean they have no negative impact.
2) "Outperform" is a very loose term and we still have no real good answer for measuring it meaningfully. We can all tell that Gemini 2.5 outperforms GPT-4o. What's trickier is distinguishing between Gemini 2.5 and Claude 4. The expected effect size of slop at this stage would be on that smaller scale of differences between same-gen models.
Given that we're looking for a small enough effect size that we know we're going to have a hard time proving anything with data, I think it's reasonable to operate from first principles in this case. First principles say very clearly that avoiding training on AI-generated content is a good idea.
ACCount36 · 1d ago
No, I mean "model" AIs, created explicitly for dataset testing purposes.
You take small AIs, of the same size and architecture, and with the same pretraining dataset size. Pretrain some solely on skims from "2019 only", "2020 only", "2021 only" scraped datasets. The others on skims from "2023 only", "2024 only". Then you run RLHF, and then test the resulting AIs on benchmarks.
The latter AIs tend to perform slightly better. It's a small but noticeable effect. There are plenty of hypotheses on why, none confirmed outright.
You're right that performance of frontier AIs keeps improving, which is a weak strike against the idea of AI contamination hurting AI training runs. Like-for-like testing is a strong strike.
HanayamaTriplet · 1d ago
I can understand that years before ChatGPT would not have any LLM-generated text, but how much does the year actually correlate with how much LLM text is in the dataset? Wouldn't special-purpose datasets with varying ratios of human and LLM text be better for testing effects of "AI contamination"?
rjsw · 1d ago
I don't think people have really gotten started on generating slop; I expect it to increase by a lot.
schmookeeg · 1d ago
I'm not as allergic to AI content as some (although I'm sure I'll get there) -- but I admire this analogy to low-background steel. Brilliant.
jgrahamc · 1d ago
I am not allergic to it either (and I created the site). The idea was to keep track of stuff that we know humans made.
ris · 1d ago
> I'm not as allergic to AI content as some
I suspect it's less about phobia, more about avoiding training AI on its own output.
This is actually something I'd been discussing with colleagues recently. Pre-AI content is only ever going to become more precious because it's one thing we can never make more of.
Ideally we'd have been cryptographically timestamping all data available in ~2015, but we are where we are now.
abound · 1d ago
One surprising thing to me is that using model outputs to train other/smaller models is standard fare and seems to work quite well.
So it seems to be less about not training AI on its own outputs and more about curating some overall quality bar for the content, AI-generated or otherwise
jgrahamc · 1d ago
Back in the early 2000s, when I was doing email filtering using naive Bayes in my POPFile email filter, one of the surprising results was that taking the output of the filter as correct and retraining on a message as if it had been labelled by a human worked well.
bhickey · 1d ago
Were you thresholding the naïve Bayes score or doing soft distillation?
jgrahamc · 22h ago
POPFile was doing something incredibly simple (if enabled). Imagine there are two classes of email, ham and spam (POPFile was actually built to do classification for arbitrary categories, but was often used as a spam filter). When a message was received and classified, its classification was assumed to be correct and the entire message was fed into the training as if the user had specifically told the program to train on it (something otherwise only done when messages were incorrectly classified).
In the two class case the two classes (ham and spam) were so distinct that this had the effect of causing parameters that were essentially uniquely associated with each class to become more and more important to that class. But also, it caused the filter to pick up new parameters that were specific to each class (e.g. as spammers changed their trickery to evade the filters they would learn the new tricks).
There was a threshold involved. I had a cut off score so that only when the classifier was fairly "certain" if the message was ham or spam would it re-train on the message.
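A minimal sketch of that confidence-gated retraining loop, with assumed function names and threshold (not POPFile's actual code):

    # Self-training sketch: treat the classifier's own confident predictions
    # as labels and retrain on them. Names and threshold are illustrative.
    THRESHOLD = 0.95

    def classify_and_maybe_train(message, classify, train):
        """classify(message) -> (label, confidence); train(message, label) updates the model."""
        label, confidence = classify(message)
        if confidence >= THRESHOLD:
            # Act as if the user had explicitly asked us to train on this message.
            train(message, label)
        return label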
glenstein · 1d ago
>more about avoiding training AI on its own output.
Exactly. The analogy I've been thinking of is applying some image-processing filter over and over again to the point that it overpowers the whole image and all you see is the noise generated by the filter. I used to do this sometimes with IrfanView and its sharpen and blur filters.
And I believe that I've seen TikTok videos showing AI constantly iterating over an image and then iterating over its output with the same instructions and seeming to converge on a style of like a 1920s black and white cartoon.
And I feel like there might be such a thing as a linguistic version of that. Even a conceptual version.
seadan83 · 1d ago
I'm worried about humans training on AI output. For example, a viral AI image was made of a rare fish. The image is completely fake, yet when you search for that fish, that image is what comes up, repeatedly. It's hard to tell it's fake; it looks real. Content fabrication at scale has a lot of second-order impacts.
smikhanov · 1d ago
It’s about keeping different corpuses of written material that was created by humans, for research purposes. You wouldn’t want to contaminate your human language word frequency databases with AI slop, the linguists of this world won’t like it.
This has been a common metaphor since the launch of ChatGPT.
glenstein · 1d ago
Nicely done! I think I've heard of this framing before, of considering content to be free from AI "contamination." I believe that idea has been out there in the ether.
But I think the suitability of low background steel as an analogy is something you can comfortably claim as a successful called shot.
echelon · 1d ago
I really think you're wrong.
The processes we use to annotate content and synthetic data will turn AI outputs into a gradient that makes future outputs better, not worse.
It might not be as obvious with LLM outputs, but it should be super obvious with image and video models. As we select the best visual outputs of systems, slight errors introduced and taste-based curation will steer the systems to better performance and more generality.
It's no different than genetics and biology adapting to every ecological niche if you think of the genome as a synthetic machine and physics as a stochastic gradient. We're speed running the same thing here.
If something looks like AI, and if LLMs are that great at identifying patterns, who's to say this won't itself become a signal LLMs start to pick up on and improve through?
Ferret7446 · 3h ago
This "problem" is self-contradicting.
If you can distinguish AI content, then you can just do that.
If you can't, what's the problem?
nialv7 · 1d ago
Does this analogy work? It's exceedingly hard to make new low-background steel, since those radioactive particles are everywhere. But it's not difficult to make AI-free content: just don't use AI to write it.
Ferret7446 · 3h ago
> It's exceedingly hard to make new low-background steels
It's not. It's just cheaper to salvage.
nwbt · 1d ago
It is, even if not impossible, entirely impracticable to prove any work is AI free. So no one but you can be sure.
lurk2 · 1d ago
Who is going to generate this AI-free content, for what reason, and with what money?
arjie · 1d ago
People do. I do, for instance. My blog is self-hosted, entirely human-written, and it is done for the sake of enjoyment. It doesn't cost much to host. An entirely static site generator would actually be free, but I don't mind paying the 55¢/kWh and the $60/month ISP fee to host it.
wahern · 1d ago
That only raises the question of how to verify what content is AI-free. Was this comment generated by a human? IIRC, one of the big AI startups (OpenAI?) used HN as a proving ground, a sort of Turing Test platform, for years.
vouaobrasil · 1d ago
I make all my YouTube videos and for that matter, everything I do AI free. I hate AI.
lurk2 · 1d ago
Once your video is out in the wild there’s as of yet no reliable way to discern whether it was AI-generated or not. All content posted to public forums will have this problem.
Training future models without experiencing signal collapse will thus require either 1) paying for novel content to be generated (they will never do this as they aren’t even licensing the content they are currently training on), 2) using something like mTurk to identify AI content in data sets prior to training (probably won’t scale), or 3) going after private sources of data via automated infiltration of private forums such as Discord servers, WhatsApp groups, and eventually private conversations.
vouaobrasil · 1d ago
There is the web of trust. If you really trust a person to say that their stuff isn't AI, then that's probably the most reliable way of knowing. For example, I have a few friends and I know their stuff isn't AI edited because they hate it too. Of course, there is no 100% certainty but it's as certain as knowing that they're your friend at least.
lurk2 · 1d ago
But the question is about whether or not AI can continue to be trained on these datasets. How are scrapers going to quantify trust?
E: Never mind, I didn’t read the OP. I had assumed it was to do with identifying sources of uncontaminated content for the purposes of training models.
absurdo · 1d ago
Clickbait title that’s all.
submeta · 1d ago
I have started to write „organic“ content again, as I am fed up with the ultra-polished, super-noisy texts from colleagues.
I realise that when I write (not so perfect) „organic“ content my colleagues enjoy it more. And as I am lazy, I get right to the point: no prelude, no „Summary“, just a few paragraphs of genuine ideas.
And I am sure this will be a trend again, until maybe LLMs are trained to generate this kind of non-perfect, less noisy text.
heavensteeth · 1d ago
> I would have written a shorter letter, but I did not have the time.
- Blaise Pascal
im also unfortunately immediately wary of pretty, punctuated prose now. when something is thrown together and features quips, slang, and informalities it feels a lot more human.
gorgoiler · 1d ago
This site is literally named for the Y combinator! Modulo some philosophical hand-waving, if there's one thing we ought to demand of our inference models it's the ability to find the fixed point of a function that takes content and outputs content, then consumes that same content!
I too am optimistic that recursive training on data that is a mixture of both original human content and content derived from original content, and content derived from content derived from original human content, …ad nauseam, will be able to extract the salient features and patterns of the underlying system.
vunderba · 1d ago
Was the choice to go with a very obviously AI generated image for the banner intentional? If I had to guess it almost looks like DALL-E version 2.
blululu · 1d ago
Gratuitous AI slop is really not a good look. tai;dr is becoming my default response to this kind of thing. I want to hear someone’s thoughts, not an llm’s compression artifacts.
juancroldan · 1d ago
Love that term and gonna adopt it! My default tai;dr response to colleagues is asking AI to write a response for me, and paste it back without reading
Ekaros · 1d ago
Wouldn't actually curated content still be better? That is, content where, say, a lot of blogspam and other content potentially generated by certain groups has been removed? I distinctly remember that a lot of content even before AI was very poor quality.
On the other hand, a lot of poor-quality content could still be factually valid enough, just not well edited or formatted.
tomgag · 1d ago
Interesting idea, I also mentioned the low-background analogy back in 2024:
Any user profile created pre-2022 is low-background steel. I now find myself checking the account creation date when it seems like a user is outputting low-quality content. Much to my dismay, I'm often wrong.
I do have to say that outside of Twitter I don't personally see it all that much. But the normies do seem to encounter it, and are 1) either fine with it? or 2) oblivious? And perhaps SOME non-human-origin noise is harmless.
(Plenty of humans are pure noise too, don't forget.)
Animats · 1d ago
Someone else pointed out the problem when I suggested, a few days ago, that it would be useful to have a LLM trained on public domain materials for which copyright has expired. The Great Books series, the out of copyright material in the Harvard libraries, that sort of thing.
That takes us back to the days when men were men, women were women, gays were criminals, trannies were crazy, and the sun never set on the British Empire.[1]
And this is why the Wayback Machine is potentially the most valuable data on the internet
yodon · 1d ago
Anyone who thinks their reading skills are a reliable detector of AI-generated content is either lying to themselves about the validity of their detector or missing the opportunity to print money by selling it.
I strongly suspect more people are in the first category than the second.
uludag · 1d ago
1) If someone had the reading skills to detect AI generated content wouldn't that technically be something very hard to monetize? It's not like said person could clone themselves or mass produce said skill.
Also, for a large number of AI generated images and text (especially low-effort), even basic reading/perception skills can detect AI content. I would agree though that people can't reliably discern high-effort AI generated works, especially if a human was involved to polish it up.
2) True—human "detectors" are mostly just gut feelings dressed up as certainty. And as AI improves, those feelings get less reliable. The real issue isn’t that people can detect AI, but that they’re overconfident when they think they can.
One of the above was generated by ChatGPT to reply to your comment. The other was written by me.
suddenlybananas · 1d ago
It's so obvious that I almost wonder if you made a parody of AI writing on purpose.
Crontab · 1d ago
Off topic:
When I see a JGC link on Hacker News I can't help but remember using PopFile on an old PowerMac - back when Bayesian spam filters were becoming popular. It seems so long ago but it feels like yesterday.
jgrahamc · 1d ago
Thanks for being a POPFile user back then! The site is still alive if you need some nostalgia in your life: https://getpopfile.org/docs/welcome
blt · 1d ago
Tangentially, does anyone know a good way to limit web searches to the "low-background" era that integrates with the address bar, OS right-click menus, etc.? I often add a pre-2022 filter to searches manually in reaction to LLM junk results, but I'd prefer to have it on every search by default.
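One low-tech option, assuming Google's "before:" date operator keeps working: bake the cutoff into a browser custom-search-engine URL so every address-bar query gets it automatically. A sketch of building such a URL:

    from urllib.parse import quote_plus

    # Build a search URL with Google's "before:" operator so results are
    # restricted to pages dated before the chosen cutoff.
    def low_background_search_url(query: str, cutoff: str = "2022-01-01") -> str:
        return "https://www.google.com/search?q=" + quote_plus(f"{query} before:{cutoff}")

    # Registering the same pattern (with %s in place of the query) as a custom
    # search engine in the browser applies it to every search by default.
    print(low_background_search_url("ms-dos productivity program easter egg"))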
onecommentman · 1d ago
Used paper books, especially poor-but-functional copies known as “reading copies” or “ex-library”, are going for a song on the used book market. Recommend starting your own physical book library, including basic reference texts, and supporting your local public and university libraries. Paper copies of articles in your areas of expertise and interest. Follow the ways of your ancestors.
I’ve had AIs outright lie about facts, and I’m glad to have had a physical library available to convince myself that I was correct, even if I couldn’t convince the AI of that in all cases.
jonjacky · 12h ago
This is the best comment -- with the best advice -- in this whole discussion.
mclau157 · 1d ago
is this not just www.archive.org ?
vouaobrasil · 1d ago
Like the idea but I'm not about to create a Tumblr account.
carlosjobim · 1d ago
The shadow libraries are the largest and highest quality source of human knowledge, larger than the Internet in scope and actual content.
It is also uncontaminated by AI.
klysm · 1d ago
Soon this will be contaminated as well unfortunately
carlosjobim · 1d ago
Why? There is no incentive for pirates to put themselves at legal risk for AI generated books which have no value.
And I also expect the torrents to continue to be separated by year and source.
Compare to video files. Nobody is pirating AI slop from YouTube even though it's been around for years.
ChrisArchitect · 1d ago
Love the concept (and the historical story is neat too).
Came up a month or so ago in a discussion about Wikipedia: Database Download (https://news.ycombinator.com/item?id=43811732). I missed that it was jgrahamc behind the site. Great stuff.
An informational asymmetry that is beneficial to the businesses will heavily incentivise the businesses to maintain status quo. It's clear that they will actively fight against empowering the consumer.
The consumer has little to no power to force a change outside of regulation, since individually each consumer has asymptotically zero ability to influence the market. They want the goods, but they have no ability to make an informed decision. They can't go anywhere else. What mechanism would force this market to self correct?
Why are you so pessimistic that customers can't go anywhere else?
The classic market for lemons example is about used cars. People can just not buy used cars, eg by buying only new cars. But a dealer with a reputation for honesty can still sell used cars, even if the customer will only learn whether there's a lemon later.
Another solution is to use insurance, or third party inspectors.
I actually think a video of someone typing the content, along with the screen the content is appearing on, would be an acceptably high bar at this present moment. I don’t think it would be hard to fake, but I think it would very rarely be worth the cost of faking it.
I think this bar would be good for about 60 days, before someone trains a model that generates authentication videos for incredibly cheap and sells access to it.
Of course, the output will be no more valuable to the society at large than what a random student writes in their final exam.
So I think the premium product becomes in-person interaction, where the buyer is present for the genesis of the content (e.g. in dialogue).
Image/video/music might have more scalable forms of organic "product". E.g. a high-trust chain of custody from recording device to screen.
1. Those who just want to tick a checkbox will buy mass produced "organic" content. AI slop that had some woefully underpaid intern in a sweatshop add a bit of human touch.
2. People who don't care about virtue signalling but genuinely want good quality will use their network of trust to find and stick to specific creators. E.g. I'd go to the local farmer I trust and buy seasonal produce from them. I can have a friendly chat with them while shopping, they give me honest opinions on what to buy (e.g. this year was great for strawberries!). The stuff they sell on the farm does not have to go through the arcane processes and certifications to be labelled organic, but I've known the farmer for years, I know that they make an effort to minimize pesticide use, they treat their animals with care and respect and the stuff they sell on the farm is as fresh as it can be, and they don't get all their profits scalped by middlemen and huge grocery chains.
How would you define AI generated? Consider a homework and the following scenarios:
1. Student writes everything themselves with pen & paper.
2. Student does some research with an online encyclopedia, proceeds to write with pen and paper. Unbeknownst to them, the online encyclopedia uses AI to answer their queries.
3. Student asks an AI to come up with the structure of the paper, its main points and the conclusion. Proceeds with pen and paper.
4. Student writes the paper themselves, runs the text through AI as a final step, to check for typos, grammar and some styling improvements.
5. Student asks the AI to write the paper for them.
The first one and the last one are obvious, but what about the others?
Edit, bonus:
6. Student writes multiple papers about different topics; later asks an AI to pick the best paper.
1. Not AI
2. Not AI
3. Not AI
4. The characters directly generated by AI are AI characters
5. AI
6. Not AI
The student is missing arms and so dictates a paper word for word exactly
It's hard to imagine that NOT working unless it's implemented poorly.
They are special because they are invisible and sequences of them behave as a single character for cursor movement.
They mirror ASCII so you can encode arbitrary JSON or other data inside them. Quite suitable for marking LLM-generated spans, as long as you don’t mind annoying people with hidden data or deprecated usage.
https://en.m.wikipedia.org/wiki/Tags_(Unicode_block)
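A rough sketch of that encoding idea, simplified to map printable ASCII straight onto the U+E0000 tag block (ignoring the block's language-tag and cancel-tag semantics):

    # Hide an ASCII payload in invisible Unicode "tag" characters (U+E0000 block).
    # Each tag code point mirrors an ASCII byte at offset 0xE0000.
    TAG_OFFSET = 0xE0000

    def encode_tags(payload: str) -> str:
        return "".join(chr(TAG_OFFSET + ord(c)) for c in payload if ord(c) < 0x80)

    def decode_tags(text: str) -> str:
        return "".join(
            chr(ord(c) - TAG_OFFSET)
            for c in text
            if TAG_OFFSET < ord(c) <= TAG_OFFSET + 0x7F
        )

    visible = "Looks like an ordinary sentence."
    marked = visible + encode_tags('{"generated_by":"llm"}')
    print(marked)               # renders the same as the visible text in most UIs
    print(decode_tags(marked))  # {"generated_by":"llm"}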
My answer would be a clear "no" to all of these, even though the content ultimately ends up fully copy-pasted from an LLM in all those cases.
Yes, machine translations are AI-generated content. I read foreign-language news sites which sometimes have machine-translated articles, and the quality stands out, not in a good way.
"Maybe" for "writing on paper and using LLM for OCR". It's like an automatic meeting transcript: if the speaker has perfect pronunciation, it works well. If they don't, the meeting notes still look coherent but have little relationship to what the speaker said and/or will miss critical parts. Sadly there is no way for the reader to know that from reading the transcript, so I'd recommend labeling it "AI edited" just in case.
Yes, even if "they give the AI a very detailed outline, constantly ask for rewrites, etc.", it's still AI generated. I am not sure how you can argue otherwise: it's not their words. Also, it's really easy to convince yourself that you are "ruthless in removing any facts they're not 100% sure of" while actually being anything but.
"What if they only use AI to fix the grammar and rewrite bad English into a proper scientific tone?" - I'd label it "AI-edited" if the rewrites are minor or "AI-generated" if the rewrites are major. This one is especially insidious as people may not expect rewrites to change meaning, so they won't inspect them too much, so it will be easier for hallucinations to slip in.
Honestly, I think that's a tough one.
(a) It "feels" like you are doing work; without you the LLM would not even start. (b) It is very close to how texts are generated without LLMs, be it in academia, with the PI guiding the process of grad students, or in industry, with managers asking for documentation. In both cases the superior takes (some) credit for work that is in large part done by others.
At least in academia, if a PI takes credit for a student's work and does not list them as a co-author, it's widely considered unethical. The rules there are simple: if someone contributed to the text, they get onto the author list.
If we had the same rule for blogs, "this post is authored by fho and ChatGPT", then I'd be completely satisfied, as this would be sufficient AI disclosure.
As for industry, I think the rules vary a lot from place to place. In some places authorship does not even come up: the slide deck/document can contain copies from random internet sites, or some previous version of the doc, and a reference will only be present if there is a need (say, to lend authority).
[1] https://github.com/rspeer/wordfreq/blob/master/SUNSET.md
These are immediately, negatively obvious as AI content.
For the other questions, the consensus of many publications/journals has been to treat grammar/spellcheck just like non-AI assistance, but to require that other uses be declared. So for most of your questions the answer is a firm "yes".
Like for your last example: to me, the concept "proper scientific tone" exists because humans hand-typed/wrote in a certain way. If we use AI edited/transformed text to act as a source for what "proper scientific tone" looks like, we still could end up with an echo chamber where AI biases for certain words and phrases feed into training data for the next round.
Being strict about how we mark text could mean a world where 99% of text is marked as AI-touched and less than 1% is marked as human-originated. That's still plenty of text to train on, though such a split could also arguably introduce its own (measurable) biases...
That’s how it works with humans too. “That sounds professional because it sounds like the professionals”.
The new encoding can contain a FLOAT32 side channel on every character, to represent its proportional "AI-ness" – kinda like the 'alpha' transparency channel on pixels.
> What if they give the AI a very detailed outline, constantly ask for rewrites and are ruthless in removing any facts they're not 100% sure of if they slip in?
I would like a search engine algorithm that penalizes low quality content. The ones we currently have do a piss poor job of that.
Without knowing the full dataset that got trimmed to the search result you see, how do you evaluate the effectiveness?
A brilliant algorithm that filters out some huge amount of AI slop is still frustrating to the user if any highly ranked AI slop remains. You still click it, immediately notice what it is, and wonder why the algo couldn’t figure this out if you did so quickly
It’s like complaining to a waiter that there’s a fly in your soup, and the waiter can’t understand why you’re upset because there were many more flies in the soup before they brought it to the table and they managed to remove almost all of them
I barely use Google anymore. Mostly just when I know the website I want, but not the URL.
What might make sense is source marking. If you copy and paste text, it becomes a citation. AI source is always cited.
I have been thinking that there should be provenance metadata in images. Maybe a list of hashes of source images. Real cameras would include the raw sensor data. Again, an AI image would be cited.
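A hypothetical sketch of what such a provenance record might look like; the sidecar-JSON shape and field names are invented for illustration, not an existing standard:

    import hashlib
    import json

    # Build a sidecar provenance record for an output image: what produced it
    # and SHA-256 hashes of every source image it was derived from.
    def provenance_record(output_path, source_paths, generator):
        def sha256_of(path):
            with open(path, "rb") as f:
                return hashlib.sha256(f.read()).hexdigest()

        return json.dumps({
            "output": output_path,
            "generator": generator,  # e.g. a camera model, or an AI model name
            "sources": [{"path": p, "sha256": sha256_of(p)} for p in source_paths],
        }, indent=2)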
E.g. you might be fine with the search tool in chatgpt being able to read/link to your content but not be fine with your content being used to improve the base model.
We don’t want to send innocent people to jail! (Use UCS-18 for maximum benefit.)
This is that, but a different implementation. Plain text is like two-conductor cable; it's so useful and cost-effective, but the moment you add a single abstraction layer above it (a data pin) you can do so much more cool stuff.
Won't work, because on day 0 someone will write a conversion library, and apparently if you are big enough and have enough lawyers you can just ignore the jail threat (all popular LLMs just scrape the internet and skip licensing any text or code; show me one that doesn't).
The core flaw is that any such marker system is trivially easy to circumvent. Any user intending to pass AI content as their own would simply run the text through a basic script to normalize the character set. This isn't a high-level hack; it's a few dozen lines in Python and trivially easy to write for anyone who can follow a few basic Python tutorials or a 5-second task for ChatGPT or Claude.
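For illustration, a sketch of roughly what such a normalization script could look like; it drops Unicode format characters (category Cf), which is where tag-style markers and zero-width characters live:

    import unicodedata

    # Normalize text and drop "format" characters (category Cf): tag characters,
    # zero-width spaces/joiners, and similar invisible code points.
    def strip_hidden_marks(text: str) -> str:
        cleaned = unicodedata.normalize("NFKC", text)
        return "".join(c for c in cleaned if unicodedata.category(c) != "Cf")

    print(strip_hidden_marks("clean\u200b text\U000E0041"))  # -> "clean text"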
Technical solutions to something like this exist in the analog world, of course, like the yellow dots on printers that encode date, time, and the printer's serial number. But, there is a fundamental difference: The user has no control over that enforcement mechanism. It's applied at a firmware/hardware layer that they can't access without significant modification. Encoding "human or AI" markers within the content itself means handing the enforcement mechanism directly to the people you're trying to constrain.
The real danger of such a system isn't even just that it's blatantly ineffective; it's that it creates a false sense of security. The absence of "AI-generated" markers would be incorrectly perceived as a guarantee for human origin. This is a far more dangerous state than even our current one, where a healthy level of skepticism is required for all content.
It reminds me of my own methods of circumventing plagiarism checkers back in school. I'm a native German speaker, and instead of copying from German sources for my homework, I would find an English source on the topic, translate it myself, and rewrite it. The core ideas were not my own, but because the text passed through an abstraction layer (my manual translation), it had no direct signature for the checkers to match. (And in case any of my teachers from back then read this: Obviously I didn't cheat in your class, promise.)
Stripping special Unicode characters is an even simpler version of the same principle. The people this system is meant to catch - those aiming to cheat, deceive, or manipulate - are precisely the ones who will bypass it effortlessly. Apart from the most lazy and hapless, of course. But we are already catching those constantly from being dumb enough to include their LLM prompts, or "Sure, I'll do that for you." when copying and pasting. But if you ask me, those people are not the ones we should be worried about.
//edit:
I'm sure there are way smarter people than me thinking about this problem, but I genuinely don't see any way to solve this problem with technology that isn't easily circumvented or extremely brittle.
The most promising would likely be something like imperceptible patterns in the content itself: hiding patterns in the lengths of words used, lengths of sentences, punctuation, starting letters of sentences, etc. But even if the big players in AI were to implement something like this immediately, it would be completely moot.
Local open-source models that can be run on consumer hardware already are more than capable enough to re-phrase input text without altering the meaning, and likely wouldn't contain these patterns. Manual editing breaks stylometric patterns trivially - swap synonyms, adjust sentence lengths, restructure paragraphs. You could even attack longer texts piecemeal by having different models rephrase different paragraphs (or sentences), breaking the overall pattern. And if all else fails, there's always my manual approach from high school.
Think of it like knowing the origin of food. Factory-produced food can be nutritious, but some people want organic or local because it reflects a different process, value system, or authenticity. Similarly, pre-AI content often carries a sense of human intention, struggle, or cultural imprint that people feel connected to in a different way.
It’s not necessarily a “psychological need” rooted in fear—it can be about preserving human context in a world where that’s becoming harder to spot. For researchers, historians, or even just curious readers, knowing that something was created without AI helps them understand what it reflects: a human moment, not a machine-generated pattern.
It’s not always about quality—it’s about provenance.
Edit: For those that can't tell, this is obviously just copied and pasted from a ChatGPT response.
https://www.shakespeare.org.uk/explore-shakespeare/shakesped...
It is definitely valid to say he popularised the use of the word, which may have been used informally in small pockets for some time before.
Writing worth reading as a non-child surprises, challenges, teaches, and inspires. LLM writing tends towards the least surprising, worn-out tropes that challenge only the patience and attention of the reader. The eager learner, however, will tolerate that, so I suppose I'll give them teaching. They are great at children's stories, where the goal is to rehearse and introduce tropes and moral lessons with archetypes, effectively teaching the listener the language of story.
FWIW I am not particularly a critic of AI and am engaged in AI related projects. I am quite sure that the breakthrough with transformer architecture will lead to the third industrial revolution, for better or for worse.
But there are some things we shouldn’t be using LLMs for.
However, since then, a bunch of capability breakthroughs from (well-curated) AI generations have definitively disproven it.
This will change as contexts get longer and people start feeding large stacks of books and papers into their prompts.
Just like googling, AIing is a skill. You have to know how to evaluate and judge AI responses. Even how to ask the right questions.
Especially asking the right questions is harder than people realize. You see this difference in human managers where some are able to get good results and others aren’t, even when given the same underlying team.
These improved models do some valuable things better & cheaper than the models, or ensembles of models, that generated their training data. So you could not "just ask" the upstream models. The benefits emerge from further bulk training on well-selected synthetic data from the upstream models.
Yes, it's counterintuitive! That's why it's worth paying attention to, & describing accurately, rather than remaining stuck repeating obsolete folk misunderstandings.
The problem is that it lowers the effort required to produce SEO spam and to “publish” to nearly zero, which creates a perverse incentive to shit on the sidewalk.
Consider the number of AI-created, blatantly false blog posts about drug interactions, for example. Not advertising, just banal filler to drive site visits, with dangerously false information.
It’s not like shitting on the sidewalk was never a problem before, it’s just that shitting on the sidewalk as a service (SOTSAAS) maybe is something we should try to avoid.
How much work is "well-curated" doing in that statement?
I find it (very) vaguely like how a person can improve at a sport or an instrument without an expert guiding them through every step, just by drilling certain behaviors in an adequately disciplined way. Training on synthetic data somehow seems to extract a similar iterative improvement in certain directions, without requiring any more natural data. It's somehow succeeding in using more compute to refine yet more value from the original non-synthetic training data's entropy.
And, counter to much intuition & forum folklore, it works for AI models, too – with analogous caveats.
But I'm not suggesting they'll advance much, in the near term, without any human-authored training data.
I'm just pointing out the cold hard fact that lots of recent breakthroughs came via training on synthetic data - text prompted by, generated by, & selected by other AI models.
That practice has now generated a bunch of notable wins in model capabilities – contra the upthread post's sweeping & confident wrongness alleging "Ai generated content is inherently a regression to the mean and harms both training and human utility".
How does the banana bread taste at the café around the corner? What's the vibe like there? Is it a good place for people-watching?
What's the typical processing time for a family reunion visa in Berlin? What are the odds your case worker will speak English? Do they still accept English-language documents or do they require a certified translation?
Is the Uzbek-Tajik border crossing still closed? Do foreigners need to go all the way to the northern crossing? Is the Pamir highway doable on a bicycle? How does bribery typically work there? Are people nice?
The world is so much more than the data you have about it.
But also: with regard to claims about what models "can't experience", such claims are pretty contingent on transient conditions, and are expiring fast.
To your examples: despite their variety, most if not all could soon have useful answers collected by largely automated processes.
People will comment publicly about the "vibe" & "people-watching" – or it'll be estimable from their shared photos. (Or even: personally-archived life-stream data.) People will describe the banana bread taste to each other, in ways that may also be shared with AI models.
Official info on policies, processing time, and staffing may already be public records with required availability; recent revisions & practical variances will often be a matter of public discussion.
To the extent all your examples are questions expressed in natural-language text, they will quite often be asked, and answered, in places where third parties – humans and AI models – can learn the answers.
Wearable devices, too, will keep shrinking the gap between things any human is able to see/hear (and maybe even feel/taste/smell) and that which will be logged digitally for wider consultation.
I used 'delving' in an HN comment more than a decade before LLMs became a thing!
https://news.ycombinator.com/item?id=1278663
But not experience it the way humans do.
We don’t experience a data series; we experience sensory input in a complicated, nuanced way, modified by prior experiences and emotions, etc. Remember that qualia are subjective, with a biological underpinning.
That will at least add extra work to filter usable training data, and cost users minutes a day wading through the refuse.
Now your mind might have immediately gone "pffff, as if they're doing that" and I agree, but only to the extent that it largely wasn't happening prior to AI anyway. The vast majority of internet content was already low quality and rushed out by low-paid writers who lacked expertise in what they were writing about. AI doesn't change that.
I wonder if we'll see a resurgence in reputation systems (probably not).
I write blog posts now by dictating into voice notes, transcribing it, and giving it to CGPT or Claude to work on the tone and rhythm.
Hm.. I wonder where this kind of label should live? For a personal blog, putting it on every post seems redundant, since if the author uses it, they likely use it for all posts. And many blogs don't have a dedicated "about this blog" section.
I wonder if things will end up like organic food labeling or "made in .." labels. Some blogs might say "100% by human", some might say "Designed by human, made by AI" and some might just say nothing.
Do I need to disclose that I used a keyboard to write it, too?
The stuff I edit with AI is 100% made by a human - me.
Spellcheck and autocorrect can come up with new words, and so are often anthropomorphized; they're not 100% "inanimate tool" anymore.
AI can form its own sentences and come up with its own facts to a much greater degree, so I would not call it an "inanimate tool" at all (again, in the context of writing text). It is much closer to an editor-for-hire or copywriter-for-hire, and I think it should be treated the same as far as attribution goes.
Hm.. looks like I am convincing myself into your point :) After all, if another human edits/proofreads my posts before publishing, I don't need to disclose that on my post... So why should AI's editing be different?
Don't fall for the utopia fallacy. Humans also publish junk.
For the hard topics, the solution is still the same as pre-AI - search for popular survey papers, then start crawling through the citation network and keeping notes. The LLM output had no idea of what was actually impactful vs what was a junk paper in the niche topic I was interested in so I had no other alternative than quality time with Google Scholar.
We are a long way from deep research even approaching a well-written survey paper written by grad student sweat and tears.
Most people are capable of maybe 4 good hours a day of deep knowledge work. Saving 30 minutes is a lot.
I've found getting a personalized report for the basic stuff is incredibly useful. Maybe you're a world class researcher if it only saves you 15-30 minutes, I'm positive it has saved me many hours.
Grad students aren't an inexhaustible resource. Getting a report that's 80% as good in a few minutes for a few dollars is worth it for me.
But, all provenance systems are gamed. I predict the most reliable methods will be cumbersome and not widespread, thus covering little actual content. The easily-gamed systems will be in widespread use, embedded in social media apps, etc.
Questions: 1. Does there exist a data provenance system that is both easy to use and reliable "enough" (for some sufficient definition of "enough")? Can we do bcrypt-style more-bits=more-security and trade time for security?
2. Is there enough of an incentive for the major tech companies to push adoption of such a system? How could this play out?
If you're training an AI, do you want it to get trained on other AIs' output? That might be interesting actually, but I think you might then want to have both, an AI trained on everything, and another trained on everything except other AIs' output. So perhaps an HTML tag for indicating "this is AI-generated" might be a good idea.
But I don’t think that’s a reasonable goal. Pragmatic example: there are almost no optional HTML tags or optional HTTP headers that are used anywhere close to 100% of the times they apply.
Also, I think the field is already muddy, even before the game starts. Spell checkers, grammar.ly, and translation all had AI contributions and likely affect most of the human-generated text on the internet. The heuristic of “one drop of AI” is not useful. And any heuristic more complicated than “one drop” introduces too much subjective complexity for a Boolean data type.
Any current technology which can be used to accurately detect pre-AI content would necessarily imply that that same technology could be used to train an AI to generate content that could skirt by the AI detector. Sure, there is going to be a lag time, but eventually we will run out of non-AI content.
It's just not accurate to say they only produce shit. Their rapid adoption demonstrates otherwise.
It may be the case that the non-bad things B does outweigh the bad things. That would be an argument in favor of B. That another group is doing bad things has no bearing on the justification for B itself.
They also consume it.
"Since the end of atmospheric nuclear testing, background radiation has decreased to very near natural levels, making special low-background steel no longer necessary for most radiation-sensitive uses, as brand-new steel now has a low enough radioactive signature that it can generally be used."
I don't see that:
1. There will be a need for "uncontaminated" data. LLM data is probably slightly better than the natural background reddit comment. Falsehoods and all.
2. "Uncontaminated" data will be difficult to find. What with archive.org, Gutenberg, etc.
3. That LLM output is going to infest everything anyway.
But recent uncontaminated data is hard to find. https://github.com/rspeer/wordfreq/blob/master/SUNSET.md
I really do just bail out whenever anyone uses the word slop.
>As one example, Philip Shapira reports that ChatGPT (OpenAI's popular brand of generative language model circa 2024) is obsessed with the word "delve" in a way that people never have been, and caused its overall frequency to increase by an order of magnitude.
Should run the same analysis against the word slop.
Change really is the only constant. The short term predictive game is rigged against hard predictions.
Long-run you want AI to learn from actual experience (think repairing cars instead of reading car repair manuals), which both (1) gives you an unlimited supply of non-copyrighted training data and (2) handily sidesteps the issue of AI-contaminated training data.
A simple example. "Which MS Dos productivity program had connect four built in?".
I have an MS-DOS emulator and know the answer. It's a little obscure, but it's amazing how I get a different answer from all the AIs every time. I never saw any of them give the correct answer. Try asking it the above. Then ask it if it's sure about that (it'll change its mind!).
Now remember that these types of answers may well end up quoted online and then learned by AI, with that circularly referenced source as the source. We have no truth at that point.
And seriously try the above question. It's a great example of AI repeatedly stating an authoritative answer that's completely made up.
So, I retried with, "Good afternoon! I'm trying to settle a bet with a friend (no money on the line, just a friendly "bet"!) Which *MS DOS* [not Win95, i.e., Excel 5] productivity program had a playable version of the game Connect Four built in as an easter egg?". I got Lotus 1-2-3 once, Excel 4 twice, and Borland Quattro Pro three different times, all from that prompt.
The correct answer you point out in another subthread was never returned as a possibility, and the responses all definitely came across as confident. Definitely a fascinating example.
>I'm not aware of any MS-DOS productivity program that had Connect Four as a built-in easter egg. While MS-DOS era software was famous for including various easter eggs (like the flight simulator in Excel 97, though that was Windows-era), I can't recall Connect Four specifically being hidden in any major DOS productivity applications.
>The most well-known DOS productivity suites were things like Lotus 1-2-3, WordPerfect, dBase, and later Microsoft Office for DOS, but I don't have reliable information about Connect Four being embedded in any of these.
>It's possible this is a case of misremembered details - perhaps your friend is thinking of a different game, a different era of software, or mixing up some details. Or there might be an obscure productivity program I'm not familiar with that did include this easter egg.
>Would you like me to search for more information about DOS-era software easter eggs to see if we can track down what your friend might be thinking of?
That seems like a pretty reasonable response given the details, and included the appropriate caveat that the model was not aware of any such easter egg, and didn't confidently state that there was none.
I am not a fan of this kind of communication. It doesn't know, so it tries to deflect the shortcoming onto the user.
I'm not saying that isn't a valid concern, but it can be used as an easy way out of its gaps in knowledge.
This is a very human-like response when asked a question that you think you know the answer to, but don't want to accuse the asker of having an incorrect premise. State what you think, then leave the door open to being wrong.
Whether or not you want this kind of communication from a machine, I'm less sure... but really, what's the issue?
The problem of the incorrect premise happens all of the time. Assuming the person asking the question is correct 100% of the time isn't wise.
AI never does.
>I don't know of any MS-DOS productivity programs...
I dunno, seems pretty similar to me.
And in a totally unrelated query today, I got the following response:
>That's a great question, but I don't have current information...
Sounds a lot like "I don't know".
>That's a great question,
Found the LLM whose training corpus includes transcripts of every motivational speaker and TED talk Q&A ever...
> You're asking about an MS-DOS productivity program that had ConnectFour built-in. I need to tell you that no mainstream or well-known MS-DOS productivity program (like a word processor, spreadsheet, database, or integrated suite) ever had the game ConnectFour built directly into it.
And better. Didn’t confidently state something wrong.
They claim things like the function adds size tracking so free doesn't need to be called with a size or they say that HeapAlloc is used to grab a whole chunk of memory at once and then malloc does its own memory management on top of that.
That's easy to prove wrong by popping ucrtbase.dll into Binary Ninja. The only extra things it does beyond passing the requested size off to HeapAlloc are: setting errno, changing any request for 0 bytes into a request for 1 byte, and performing retries for the case where it is being used from C++ and the program has installed a new-handler for out-of-memory situations.
>If you're strictly talking about MS-DOS-only productivity software, there’s no widely known MS-DOS productivity app that officially had a built-in Connect Four game. Most MS-DOS apps were quite lean and focused, and games were generally separate.
I suspect this is the correct answer, because I can't find any MS-DOS Connect Four easter eggs by googling. I might be missing something obscure, but generally if I can't find it by Googling I wouldn't expect an LLM to know it.
Not shown fully, but see https://www.youtube.com/watch?v=kBCrVwnV5DU&t=39s and note the game in the File menu.
You can always make stuff up to trigger AI hallucinations, like 'which 1990s TV show had a talking hairbrush character?'. There's no difference between 'not in the training set' and 'not real'.
Edit: Wait, no, there actually was a 1990s TV show with a talking hairbrush character: https://en.wikipedia.org/wiki/The_Toothbrush_Family
This is hard.
I know what you meant but this is the whole point of this conversation. There is a huge difference between "no results found" and a confident "that never happened", and if new LLMs are trained on old ones saying the latter then they will be trained on bad data.
Not being able to find an answer to a made up question would be OK, it's ALWAYS finding an answer with complete confidence that is a major problem.
I just tried:
And got some lovely answers from ChatGPT and Gemini. Aside: I personally would associate "productivity program" with a productivity suite (like MS Works), so I would have trouble googling an answer (I started as a kid on an Apple ][ and have worked with computers ever since, so my ignorance is not age- or skill-related).
"A specific user recollection of playing "Connect Four" within a version of AutoCAD for DOS was investigated. While this suggests the possibility of such a game existing within that specific computer-aided design (CAD) program, no widespread documentation or confirmation of this feature as a standard component of AutoCAD could be found. It is plausible that this was a result of a third-party add-on, a custom AutoLISP routine (a scripting language used in AutoCAD), or a misremembered detail."
Sure, it helps you do a job more productively, but that describes roughly all non-entertainment software. And sure, it helps a user create documents, but, again, so does most non-entertainment software.
Even in the age of AI, GIGO holds.
https://en.m.wikipedia.org/wiki/Productivity_software
> Productivity software (also called personal productivity software or office productivity software) is application software used for producing information (such as documents, presentations, worksheets, databases, charts, graphs, digital paintings, electronic music and digital video). Its names arose from it increasing productivity
Amusingly, I get an authoritative but incorrect "It's AutoCAD!" if I narrow down the question to a program commonly used by engineers that had Connect Four built in.
The Google index is already polluted by LLM output, albeit unevenly, depending on the subject. It's only going to spread to all subjects as content farms go down the long tail of profitability, eking out profits; Googling won't help because you'll almost always find a result that's wrong, as will LLMs that resort to searching.
Don't get me started on Google's AI answers that assert wrong information, launder fanfic/Reddit/forum posts, and elevate all sources to the same level.
I'd be a lot more worried about that if I didn't think we were doing a pretty good job of obfuscating facts the last few years ourselves without AI. :/
Not great (assuming such software actually exists), but not as bad as making something up.
Unfortunately that also includes citogenesis.
https://xkcd.com/978/
This is an example of a random fact old enough that no one ever bothered talking about it on the internet. So it's not cited anywhere, but many of us can just plain remember it. When you ask ChatGPT (as of now, on June 6th 2025) it gives a random answer every time.
Now that I've stated this on the internet in a public manner it will be corrected, but... there are a million such things I could give as an example: questions obscure enough that no one has posted an answer on the internet before, so the AI doesn't know, but recent enough that many of us know the answer, so we can instantly see just how much the AI hallucinates.
To give some context, I wanted to go back to it for nostalgia's sake but couldn't quite remember the name of the application. I asked various AIs which application I was trying to remember and they were all off the mark. In the end only my own neurons finally lighting up got me the answer I was looking for.
(edit: formatting)
Here’s an example with Gemini Flash 2.5 Preview: https://kagi.com/assistant/9f638099-73cb-4d58-872e-d7760b3ce...
It will be interesting to see if/when this information gets picked up by models.
And since it is not written down on some website, this fact will disappear from the world once "many of us" die.
I think these are both basically somewhere between wrong and misleading.
Needing to generate your own data through actual experience is very expensive, and can mean that data acquisition now comes with real operational risks. Waymo gets real world experience operating its cars, but the "limit" on how much data you can get per unit time depends on the size of the fleet, and requires that you first get to a level of competence where it's safe to operate in the real world.
If you want to repair cars, and you _don't_ start with some source of knowledge other than on-policy roll-outs, then you have to expect that you're going to learn by trashing a bunch of cars (and still pay humans to tell the robot that it failed) for some significant period.
There's a reason you want your mechanic to have access to manuals, and have gone through some explicit training, rather than just try stuff out and see what works, and those cost-based reasons are true whether the mechanic is human or AI.
Perhaps you're using an off-policy RL approach -- great! If your off-policy data is demonstrations from a prior generation model, that's still AI-contaminated training data.
So even if you're trying to learn by doing, there are still meaningful limits on the supply of training data (which may be way more expensive to produce than scraping the web), and likely still AI-contaminated (though perhaps with better info on the data's provenance?).
We definitely do not have the right balance of this right now.
E.g. I'm working on a set of articles that give a different path to learning some key math knowledge (it just comes at it from a different point of view and is more intuitive). Historically such blog posts have helped my career.
It's not ready for release anyway, but I'm hesitant to release my work in this day and age since AI can steal it and regurgitate it to the point where my articles appear unoriginal.
It's stifling. I'm of the opinion you shouldn't post art, educational material, code or anything that you wish to be credited for on the internet right now. Keep it to yourself or else AI will just regurgitate it to someone without giving you credit.
AI should be allowed to read repair manuals and use them to fix cars. It should not be allowed to produce copies of the repair manuals.
AI is committing absolute dick moves non-stop.
Irrelevant. Books and media are not pure knowledge, and those are what is being discussed here, not knowledge.
> Anyone can read your articles and use the knowledge it contains, without paying or crediting you.
Completely irrelevant. AI are categorically different than humans. This is not a valid comparison to make.
This is also a dishonest comparison, because there's a difference between you voluntarily publishing an article for free on the internet (which doesn't even mean that you're giving consent to train on your content), and you offering a paid book online that you have to purchase.
> AI should be allowed to read repair manuals and use them to fix cars.
Yes, after the AI trainers have paid for the repair manuals at the rate that the publishers demand, in exactly the same way that you have to pay for those manuals before using them.
Of course, because AI can then leverage that knowledge at a scale orders of magnitude greater than a human, the cost should be orders of magnitude higher, too.
There also won’t be any AI maids in five-star hotels until those robots appear.
This doesn’t make your statement invalid, it’s just that the gap between today and the moment you’re describing is so unimaginably vast that saying “don’t worry about AI slop contaminating your language word frequency databases, it’ll sort itself out eventually” is slightly off-mark.
Consider how chimney sweeps used to be children.
Cars are not built to accommodate whatever universal repair machine there could be, cars are built with an expectation that a mechanic with arms and legs will be repairing it, and will be for a while.
A non-humanoid robot in a human-designed world populated by humans looks and behaves like this, at best: https://youtu.be/Hxdqp3N_ymU
Really, a robot which could literally have an impact wrench built into it would HOLD a SPANNER and use FINGERS to remove bolts?
Next I'm expecting you say self-driving cars will necessarily require a humanoid sitting in the driver's seat to be feasible. And delivery robots (broadly in use in various places around the world) have a tiny humanoid robot inside them to make the go.
Still want to repair the car with just the built-in wrench?
You suggest a connector to connect to a set of robot-compatible tools, fine. That set is again limited by what the robot manufacturer thought of in advance, so you're out of luck if you need to weld things, for example, but your robot doesn't come with a compatible welder. Attaching and detaching those tools now becomes a weak point: you either need a real human replacing the tools (ruining the autonomy), or you need to devise a procedure for your robot to switch tools somehow by detaching one from itself, putting it on a workbench for further use, and attaching a new one from a workbench.
The more universal and autonomous that switching procedure becomes, the more you're in the business of actually reinventing a human hand.
But let's assume that you've succeeded in that, against all odds. You now have a powerful robotic arm, connected to a base, that can work with a set of tools it can itself attach and detach. Now imagine for a second that this arm can't reach a certain point in the car it repairs and needs to move itself across the workshop.
Suddenly you're in the business of reinventing the legs.
Nuts and bolts are used because they are good mechanical fasteners that take advantage of the enormous "squeezing" leverage a threaded fastener provides. Robots already assemble cars, and we still use nuts and bolts.
I'm sure AGI is possible. It's not coming from ChatGPT no matter how much Internet you feed to it.
LLMs are just one very specific application of deep learning, doing next-word-prediction of internet text. It's not LLMs specifically that's exciting, it's deep learning as a whole.
https://xkcd.com/810/
AIs trained on public scraped data that predates 2022 don't noticeably outperform those trained on scraped data from 2022 onwards. Hell, in some cases, newer scrapes perform slightly better, token for token, for unknown reasons.
This is really bad reasoning for a few reasons:
1) We've gotten much better at training LLMs since 2022. The negative impacts of AI slop in the training data certainly don't outweigh the benefits of orders of magnitude more parameters and better training techniques, but that doesn't mean they have no negative impact.
2) "Outperform" is a very loose term and we still have no real good answer for measuring it meaningfully. We can all tell that Gemini 2.5 outperforms GPT-4o. What's trickier is distinguishing between Gemini 2.5 and Claude 4. The expected effect size of slop at this stage would be on that smaller scale of differences between same-gen models.
Given that we're looking for a small enough effect size that we know we're going to have a hard time proving anything with data, I think it's reasonable to operate from first principles in this case. First principles say very clearly that avoiding training on AI-generated content is a good idea.
You take small AIs, of the same size and architecture, and with the same pretraining dataset size. Pretrain some solely on skims from "2019 only", "2020 only", "2021 only" scraped datasets. The others on skims from "2023 only", "2024 only". Then you run RLHF, and then test the resulting AIs on benchmarks.
The latter AIs tend to perform slightly better. It's a small but noticeable effect. Plenty of hypotheses as to why, none confirmed outright.
You're right that performance of frontier AIs keeps improving, which is a weak strike against the idea of AI contamination hurting AI training runs. Like-for-like testing is a strong strike.
I suspect it's less about phobia, more about avoiding training AI on its own output.
This is actually something I'd been discussing with colleagues recently. Pre-AI content is only ever going to become more precious because it's one thing we can never make more of.
Ideally we'd have been cryptographically timestamping all data available in ~2015, but we are where we are now.
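To be concrete, the missing piece would have been something as small as this (a minimal sketch of the idea, not an existing service): hash every document, fold the hashes into one Merkle root, and timestamp just the root.

    # Hash every document, build a Merkle root, and publish/timestamp only the root.
    # Anyone holding an original document could later prove it existed at that time.
    import hashlib

    def sha256(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(documents: list[bytes]) -> bytes:
        level = [sha256(d) for d in documents]
        while len(level) > 1:
            if len(level) % 2:                  # duplicate last node on odd levels
                level.append(level[-1])
            level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    corpus_2015 = [b"blog post one ...", b"forum thread two ...", b"scanned book three ..."]
    root = merkle_root(corpus_2015)
    print(root.hex())
    # Publishing `root` back then (newspaper ad, RFC 3161 timestamp authority,
    # a public ledger) is the part we can no longer do retroactively.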
So it seems to be less about not training AI on its own outputs and more about curating some overall quality bar for the content, AI-generated or otherwise
In the two-class case, the two classes (ham and spam) were so distinct that parameters essentially unique to each class became more and more important to that class. But also, it caused the filter to pick up new parameters that were specific to each class (e.g., as spammers changed their trickery to evade the filters, it would learn the new tricks).
There was a threshold involved. I had a cut off score so that only when the classifier was fairly "certain" if the message was ham or spam would it re-train on the message.
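For the curious, the loop looked roughly like this (a from-scratch toy reconstruction, not the original code): naive Bayes with add-one smoothing, re-training only on messages it is already confident about.

    import math
    from collections import Counter

    class SelfTrainingFilter:
        def __init__(self, threshold=0.9):
            self.counts = {"ham": Counter(), "spam": Counter()}
            self.totals = {"ham": 0, "spam": 0}
            self.threshold = threshold          # only re-train above this confidence

        def train(self, text, label):
            words = text.lower().split()
            self.counts[label].update(words)
            self.totals[label] += len(words)

        def p_spam(self, text):
            # Naive Bayes with add-one smoothing and equal priors
            vocab = len(set(self.counts["ham"]) | set(self.counts["spam"])) + 1
            scores = {}
            for label in ("ham", "spam"):
                scores[label] = sum(
                    math.log((self.counts[label][w] + 1) / (self.totals[label] + vocab))
                    for w in text.lower().split())
            m = max(scores.values())
            e = {k: math.exp(v - m) for k, v in scores.items()}
            return e["spam"] / (e["ham"] + e["spam"])

        def classify_and_maybe_retrain(self, text):
            p = self.p_spam(text)
            label = "spam" if p > 0.5 else "ham"
            confidence = max(p, 1 - p)
            if confidence >= self.threshold:
                self.train(text, label)         # the filter feeds on its own verdicts
            return label, confidence

    f = SelfTrainingFilter()
    f.train("cheap pills buy now", "spam")
    f.train("meeting notes attached see agenda", "ham")
    print(f.classify_and_maybe_retrain("buy cheap pills now now"))

The threshold keeps the obvious mistakes out, but every confident verdict still nudges the model further toward whatever it already believes.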
Exactly. The analogy I've been thinking of is applying some image-processing filter over and over again to the point that it overpowers the whole image and all you see is the noise generated by the filter. I used to do this sometimes with Irfanview and its sharpen and blur filters.
And I believe that I've seen TikTok videos showing AI constantly iterating over an image and then iterating over its output with the same instructions and seeming to converge on a style of like a 1920s black and white cartoon.
And I feel like there might be such a thing as a linguistic version of that. Even a conceptual version.
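You can reproduce the image version of this in a few lines (assuming Pillow is installed; "photo.jpg" is just a placeholder filename, not anything from the thread): apply the same filter over and over and watch its artifacts take over.

    from PIL import Image, ImageFilter

    img = Image.open("photo.jpg").convert("RGB")
    for _ in range(200):                 # each pass amplifies the previous pass's noise
        img = img.filter(ImageFilter.SHARPEN)
    img.save("collapsed.png")            # mostly filter artifacts, little of the original

The linguistic version would be the same loop with "rephrase this" in place of the sharpen filter.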
See (2 years ago): https://news.ycombinator.com/item?id=34085194
But I think the suitability of low background steel as an analogy is something you can comfortably claim as a successful called shot.
The processes we use to annotate content and synthetic data will turn AI outputs into a gradient that makes future outputs better, not worse.
It might not be as obvious with LLM outputs, but it should be super obvious with image and video models. As we select the best visual outputs of systems, slight errors introduced and taste-based curation will steer the systems to better performance and more generality.
It's no different than genetics and biology adapting to every ecological niche if you think of the genome as a synthetic machine and physics as a stochastic gradient. We're speed running the same thing here.
I voiced this same view previously here https://news.ycombinator.com/item?id=44012268
If something looks like AI, and if LLMs are that great at identifying patterns, who's to say this won't itself become a signal LLMs start to pick up on and improve through?
If you can distinguish AI content, then you can just do that.
If you can't, what's the problem?
It's not. It's just cheaper to salvage.
Training future models without experiencing signal collapse will thus require either 1) paying for novel content to be generated (they will never do this as they aren’t even licensing the content they are currently training on), 2) using something like mTurk to identify AI content in data sets prior to training (probably won’t scale), or 3) going after private sources of data via automated infiltration of private forums such as Discord servers, WhatsApp groups, and eventually private conversations.
E: Never mind, I didn’t read the OP. I had assumed it was to do with identifying sources of uncontaminated content for the purposes of training models.
I realise that when I write (not so perfect) „organic“ content my colleagues enjoy it more. And as I am lazy, I get right to the point. No prelude, no „Summary“, just a few paragraphs of genuine ideas.
And I am sure this will be a trend again. Until maybe LLMs are trained to generate these kind of non-perfect, less noisy texts.
- Blaise Pascal
I'm also, unfortunately, immediately wary of pretty, punctuated prose now. When something is thrown together and features quips, slang, and informalities, it feels a lot more human.
I too am optimistic that recursive training on data that is a mixture of both original human content and content derived from original content, and content derived from content derived from original human content, …ad nauseam, will be able to extract the salient features and patterns of the underlying system.
On the other hand, a lot of poor-quality content could still be factually valid enough, just not well edited or formatted.
https://gagliardoni.net/#ml_collapse_steel
https://infosec.exchange/@tomgag/111815723861443432
I do have to say that outside of Twitter I don't personally see it all that much. But the normies do seem to encounter it and are either 1) fine with it? or 2) oblivious? And perhaps SOME non-human-origin noise is harmless.
(Plenty of humans are pure noise, too, don't forget.)
That takes us back to the days when men were men, women were women, gays were criminals, trannies were crazy, and the sun never set on the British Empire.[1]
[1] https://www.smbc-comics.com/comic/copyright
I strongly suspect more people are in the first category than the second.
Also, for a large number of AI generated images and text (especially low-effort), even basic reading/perception skills can detect AI content. I would agree though that people can't reliably discern high-effort AI generated works, especially if a human was involved to polish it up.
2) True—human "detectors" are mostly just gut feelings dressed up as certainty. And as AI improves, those feelings get less reliable. The real issue isn’t that people can detect AI, but that they’re overconfident when they think they can.
One of the above was generated by ChatGPT to reply to your comment. The other was written by me.
When I see a JGC link on Hacker News I can't help but remember using PopFile on an old PowerMac - back when Bayesian spam filters were becoming popular. It seems so long ago but it feels like yesterday.
I’ve had AIs outright lie about facts, and I’m glad to have had a physical library available to convince myself that I was correct, even if I couldn’t convince the AI of that in all cases.
It is also uncontaminated by AI.
And I also expect the torrents to continue to be separated by year and source.
Compare to video files. Nobody is pirating AI slop from YouTube even though it's been around for years.
Came up a month or so ago on discussion about Wikipedia: Database Download (https://news.ycombinator.com/item?id=43811732). I missed that it was jgrahamc behind the site. Great stuff.