TFA should be compared with Tufts (2025), "A Practical Examination of AI-Generated Text Detectors for Large Language Models".[0] Tufts found that automated detection is very unreliable, while Russell found that human evaluators are quite reliable.
The explanation for the difference is that automated discrimination has relied mainly on structural factors, such as average sentence/paragraph length and the frequency of stock words/phrases and of certain parts of speech. Human evaluators look at content factors: repetition of ideas, less precise wording, generalizations rather than concrete examples, overall conceptual coherence, and factual errors.
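To make the contrast concrete, here is a rough sketch (plain Python; the stock-phrase list is made up purely for illustration) of the kind of surface statistics automated detectors lean on. Part-of-speech frequencies would need an actual tagger on top of this:

    import re
    from statistics import mean

    # Illustrative list of "stock" phrases; not taken from any real detector.
    STOCK_PHRASES = ["in conclusion", "it is important to note", "delve into", "furthermore"]

    def structural_features(text: str) -> dict:
        """Surface statistics of the sort automated detectors typically use."""
        paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text.lower())
        return {
            # average sentence length in words
            "avg_sentence_len": mean(len(re.findall(r"[A-Za-z']+", s)) for s in sentences) if sentences else 0.0,
            # average paragraph length in sentences
            "avg_paragraph_len": len(sentences) / len(paragraphs) if paragraphs else 0.0,
            # stock-phrase occurrences per word
            "stock_phrase_rate": sum(text.lower().count(p) for p in STOCK_PHRASES) / max(len(words), 1),
        }

    print(structural_features(
        "In conclusion, it is important to note that this works. Furthermore, it is short.\n\nSecond paragraph."
    ))

None of these features look at what the text actually says, which is exactly where human evaluators do their discriminating.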
For most of the issues that human evaluators catch, I can conceive of technical solutions, except for the problem of factual errors. Solving the problem of factuality requires a sufficient model of the world, which is possible only in very restricted domains. I'm afraid the end result of LLM development will be an extremely convincing purveyor of misinformation.
[0] https://arxiv.org/abs/2412.05139
There is some discussion of LLMs and models in another thread. (https://news.ycombinator.com/item?id=44625629)