Deep learning gets the glory, deep fact checking gets ignored

281 chmaynard 37 6/3/2025, 9:31:56 PM rachel.fast.ai ↗

Comments (37)

godelski · 3h ago

  > although later investigation suggests there may have been data leakage
I think this point is often forgotten. Everyone should assume data leakage until it is strongly evidenced otherwise. It is not on the reader/skeptic to prove that there is data leakage, it is the authors who have the burden of proof.

It is easy to have data leakage on small datasets. Datasets where you can look at everything. Data leakage is really easy to introduce and you often do it unknowingly. Subtle things easily spoil data.
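
A minimal sketch of one such subtle leak (Python with scikit-learn; the data here is pure noise, invented for illustration): selecting features before splitting lets the test-set labels inform the pipeline and makes noise look predictive.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2000))   # pure noise features
    y = rng.integers(0, 2, size=200)   # random labels: there is nothing to learn

    # Leaky: pick the 20 "most informative" features using *all* labels, then split.
    X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
    X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
    leaky = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

    # Clean: split first, select features using only the training fold.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    sel = SelectKBest(f_classif, k=20).fit(X_tr, y_tr)
    clean = LogisticRegression(max_iter=1000).fit(sel.transform(X_tr), y_tr).score(
        sel.transform(X_te), y_te)

    # The leaky score typically lands well above chance; the clean one hovers near 0.5.
    print(f"leaky: {leaky:.2f}  clean: {clean:.2f}")

The web-scale version of the same mistake is filtering or deduplicating with knowledge of the benchmark; the mechanics differ, but the spoilage is just as quiet.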

Now, we're talking about gigantic datasets where there's no chance anyone can manually look through it all. We know the filter methods are imperfect, so how do we come to believe that there is no leakage? You can say you filtered it, but you cannot say there's no leakage.

Beyond that, we are constantly finding spoilage in the datasets we do have access to. So there's frequent evidence that it is happening.

So why do we continue to assume there's no spoilage? Hype? Honestly, it just sounds like a lie we tell ourselves because we want to believe. But we can't fix these problems if we lie to ourselves about them.

SamuelAdams · 1h ago
Every system has problems. The better question is: what is the acceptable threshold?

For example, Medicare and Medicaid had a fraud rate of 7.66% [1]. Yes, that is a lot of billions, and there is room for improvement, but that doesn’t mean the entire system is failing: roughly 92% of cases are being covered as intended.

The same could be said with these models. If the spoilage rate is 10%, does that mean the whole system is bad? Or is it at a tolerable threshold?

[1]: https://www.cms.gov/newsroom/fact-sheets/fiscal-year-2024-im...

fastaguy88 · 1h ago
In the protein annotation world, which is largely driven by inferring common ancestry between a protein of unknown function and one of known function, common error thresholds range from FDR of 0.001 to 10^-6. Even a 1% error rate would be considered abysmal. This is in part because it is trivial to get 95% accuracy in prediction; the challenging problem is to get some large fraction of the non-trivial 5% correct.

"Acceptable" thresholds are problem specific. For AI to make a meaningful contribution to protein function prediction, it must do substantially better than current methods, not just better than some arbitrary threshold.

antithesizer · 2h ago
The supposed location of the burden of proof is really not the definitive guide to what you ought to believe that people online seem to think it is.
mathgeek · 1h ago
Can you elaborate? You've made a claim, but I really think there'd be value in spelling out what you actually mean.
NormLolBob · 1h ago
They mean “vet your sources and don’t blindly follow the internet hive-mind,” or something similar; the burden of proof is not the guide the internet thinks it is.

They tacked their actual point onto the end of a copy-paste of the OP comment's context and ended up writing something barely grammatically correct.

In doing so they prove why exactly not to listen to the internet. So they have that going for them.

tbrownaw · 1h ago
What is the relevance of this generic statement to the discussion at hand?
amelius · 3h ago
Before making AI do research, perhaps we should first let it __reproduce__ research. For example, give it a paper of some deep learning technique and make it produce an implementation of that paper. Until it can do that, I have no hope that it can produce novel ideas.
ojosilva · 3h ago
I thought you were going to say "give AI the first part of a paper (prompt) and let it finish it (completion)" as a way to validate that AI can produce science on par with research results. Until it can do that, I have no hope that it can produce novel ideas.
bee_rider · 1h ago
I guess it would also need the experimental data. It would, I guess, also need some ability to do little experiments and write off those ideas as not worth following up on…
slewis · 1h ago
OpenAI created a benchmark for this: https://openai.com/index/paperbench/
patagurbon · 2h ago
You would have to have a very complete audit trail for the LLM and ensure the paper shows up nowhere in the dataset.

We have rare but not unheard-of issues with academic fraud. LLMs fake data and lie at the drop of a hat.

TeMPOraL · 2h ago
> You would have to have a very complete audit trail for the LLM and ensure the paper shows up nowhere in the dataset.

We can do both known and novel reproductions. Like with both LLM training process and human learning, it's valuable to take it in two broad steps:

1) Internalize fully-worked examples, then learn to reproduce them from memory;

2) Train on solving problems for which you know the results but have to work out intermediate steps yourself (looking at the solution before solving the task)

And eventually:

3) Train on solving problems you don't know the answer to, have your solution evaluated by a teacher/judge (that knows the actual answers).

Even parroting existing papers is very valuable, especially early on, when the model is learning what papers and research look like.
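
As a structure-only sketch of that curriculum (Python; every class and method name below is hypothetical, not a real training API):

    from typing import Protocol, Sequence

    class Trainable(Protocol):
        def fit_imitation(self, worked_example: str) -> None: ...
        def solve(self, problem: str) -> str: ...
        def update(self, reward: float) -> None: ...

    class Judge(Protocol):
        def score(self, problem: str, solution: str) -> float: ...

    def curriculum(model: Trainable,
                   worked: Sequence[str],             # 1) fully-worked papers
                   known: Sequence[tuple[str, str]],  # 2) (problem, known result)
                   open_problems: Sequence[str],      # 3) no known answer
                   judge: Judge) -> Trainable:
        for paper in worked:                          # internalize, then reproduce
            model.fit_imitation(paper)
        for problem, result in known:                 # known result, hidden steps
            model.update(reward=float(model.solve(problem) == result))
        for problem in open_problems:                 # open problems scored by a judge
            model.update(reward=judge.score(problem, model.solve(problem)))
        return model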

tbrownaw · 1h ago
> For example, give it a paper of some deep learning technique and make it produce an implementation of that paper.

Or maybe give it a paper full of statistics about some experimental observations, and have it reproduce the raw data?

bee_rider · 1h ago
Like, have the AI do the experiment? That could be interesting. Although I guess it would be limited to experiments that could be done on a computer.
YossarianFrPrez · 3h ago
Seconded, as not only is this an interesting idea, it might also help solve the issue of checking for reproducibility. Yet even then human evaluators would need to go over the AI-reproduced research with a fine-toothed comb.

Practically speaking, I think there are roles for current LLMs in research. One is in the peer review process. LLMs can assist in evaluating the data-processing code used by scientists. Another is for brainstorming and the first pass at lit reviews.

thrance · 1h ago
Side note: I wonder why it's not normalized for more papers to come with a reference implementation. Wouldn't have to be efficient, or even be easily runnable. Could be a link to a repository with a few python scripts.
kenjackson · 3h ago
"And for most deep learning papers I read, domain experts have not gone through the results with a fine-tooth comb inspecting the quality of the output. How many other seemingly-impressive papers would not stand up to scrutiny?"

Is this really not the case? I've read some of the AI papers in my field, and I know many other domain experts have as well. That said I do think that CS/software based work is generally easier to check than biology (or it may just be because I know very little bio).

a_bonobo · 2h ago
Validation of biological labels easily takes years - in the OP's example it was a 'lucky' (huge!) coincidence that somebody already had spent years on one of the predicted proteins' labels. Nobody is going to stake 3-5 years of their career on validating some random model's predictions.
suddenlybananas · 2h ago
My impression with linguistics is that people do go over the papers that use these techniques carefully and come up with criticisms of them, but people don't take linguists seriously, so researchers from related disciplines ignore the criticisms.
croemer · 2h ago
Don't call "Nature Communications" "Nature". The prestige is totally different. Also, altmetrics aren't that relevant, except maybe if you want to measure public hype.
rustcleaner · 3h ago
What AI needs is a 'reality checker' subsystem. LLMs are like the phantasmal part of your psyche constantly jibbering phrases (ideas), but what keeps all our internal jibberjabber from turning into endless false statements is a "does my statement describe something falsifiable?" check and an "is there a detectable falsification?" check.
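
A minimal sketch of that generate-then-check loop (Python; the interfaces are hypothetical, not any real library):

    from dataclasses import dataclass
    from typing import Callable, Iterable, Optional

    @dataclass
    class Claim:
        text: str
        test: Optional[Callable[[], bool]]  # None means not falsifiable as stated

    def reality_check(claims: Iterable[Claim]) -> list[Claim]:
        kept = []
        for claim in claims:
            if claim.test is None:   # "does my statement describe something falsifiable?"
                continue
            if claim.test():         # "is there a detectable falsification?"
                continue
            kept.append(claim)
        return kept

    claims = [
        Claim("Water boils at 90 C at sea level", test=lambda: True),  # falsified: drop
        Claim("This protein hydrolyzes ATP", test=lambda: False),      # survives, pending a real assay
        Claim("Everything happens for a reason", test=None),           # unfalsifiable: drop
    ]
    print([c.text for c in reality_check(claims)])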

looks around the room at all the churchgoers

Well on second review, this isn't true for everybody...

airstrike · 3h ago
I couldn't agree more. On a random night a few months ago I found myself in that curious half-asleep-half-awake state, and this time I became aware of my brain's constant jibbering phrases. It was as if I could hear my thoughts before the filter pass through which they become actual cohesive sentences.

I could "see" hundreds of words/thoughts/meanings being generated in a diffuse way, all at the same time but also slowly evolving over time and then see my brain distill them into a sentence. It would happen repeatedly every second ridiculously fast yet also "slow enough" that I could see it happen.

It's just my personal half-asleep hallucination, so obviously take from it what you will (~nothing) but I can't shake the feeling we need a similar algorithm. If I ever pursue a doctorate degree, this is what I'll be trying.

TimTheTinker · 2h ago
Human "reality checker" systems are analogous to a discriminator in a generative adversarial network, but strongly informed by emotion.

Psychology tells us that regardless of how "emotional" we are, our sense of truth/falsehood goes first through an emotional circuit, which is informed by underlying beliefs.

If someone states something you strongly disagree with, your first internal response will be emotional; then your thoughts will pick it up from there.

rustcleaner · 46m ago
Popper's definition of Science™ is an algorithm for establishing and testing truth-statements against reality. An educated man would still be susceptible to the weakness you describe, but his education has ideally informed him about, and given him practice against, the weaknesses and folly of unchecked emotionalism. Man is emotionally driven, but we can hope for a rationally and logically informed drive-gating system.
8bitsrule · 2h ago
Fits my limited experience with LLMs (as a researcher). Very impressive apparent written language comprehension and written expression. But when it comes to getting to the -best possible answer- (particularly on unresolved questions), the nearly-instant responses (e.g. to questions that one might spend a half-day on without resolution) are seldom satisfactory. Complicated questions take time to explore, and IME an LLM's lack of resolution (because of its inability) is, so far, set aside in favor of confident-sounding (even if completely wrong) responses.
slt2021 · 3h ago
Fantastic article by Rachel Thomas!

This is basically another argument that deep learning works only as [generative] information retrieval - i.e. a stochastic parrot - because the training data is a very lossy representation of the underlying domain.

Because the data/labels of genes do not always represent the underlying domain (biology) perfectly, the output can be false/invalid/nonsensical.

In cases where it works very well, there is data leakage, because by design LLMs are information retrieval tools. From an information-theory standpoint, this is a fundamental "unknown unknown" for any model.

My takeaway is that it's not a fault of the algorithm; it's more a fault of the training dataset.

We humans operate fluidly in the domain of natural language, and even a kid can read and evaluate whether text makes sense or not - this explains the success of models trained on NLP.

But in domains where the training data represents the underlying domain lossily, the model will be imperfect.

aucisson_masque · 3h ago
It’s like fake news is taking hold in science now. Saying any stupid thing will attract many more views and « likes » than debunking it will.

Except that we can’t compare Twitter to Nature. Science is supposed to be immune to this kind of bullshit thanks to reputable journals and peer review, blocking a publication before it does any harm.

Was that a failure of Nature?

lamename · 3h ago
Have you seen the statistics about high impact journals having higher retraction/unverified rates on papers?

The root causes can be argued...but keep that in mind.

No single paper is proof. Bodies of work across many labs, independent verification, etc is the actual gold standard.

tbrownaw · 2h ago
> It’s like fake news is taking in science now.

I didn't think this was new? Like, it's been a few years since that replication crisis thing kicked off.

godelski · 3h ago
Yes. And let's not get started on that ML Quantum Wormhole bullshit...

We've taken this all too far. It is bad enough to lie to the masses in pop-sci articles. But we're straight up doing it in top-tier journals. Some are good-faith mistakes, but far more often it looks like due diligence was just never done. Both by researchers and reviewers.

I at least have to thank the journals. I've hated them for a long time and wanted to see their end: free up publishing, and end the bullshit novelty-chasing and the narrowing of research. I just never thought they'd be the ones to put the knife through their own heart.

But I'm still not happy about that tbh. The only result of this is that the public grows to distrust science more and more. In a time where we need that trust more than ever. We can't expect the public to differentiate nuanced takes about internal quibbling. And we sure as hell shouldn't be giving ammunition to the anti-science crowds, like junk science does...

toofy · 2h ago
this seems strange to me, shouldn’t we expect a high quality journal to retract often as we gather more information?

obviously this is hyperbole of two extremes, but i certainly trust a journal far more if it actively and loudly looks to correct mistakes over one that never corrects anything or buries its retractions.

a rather important piece of science is correcting mistakes by gathering and testing new information. we should absolutely be applauding when a journal loudly and proactively says “oh, it turns out we were wrong when we declared that burying a chestnut under the oak tree on the third thursday of a full moon would cure your brother's infected toenail.”

lamename · 3h ago
The Bullshit asymmetry principle comes to mind https://en.wikipedia.org/wiki/Brandolini%27s_law
softwaredoug · 2h ago
We also love deep cherry picking. Working hard to find that one awesome time some ML / AI thing worked beautifully and shouting its praises to the high heavens. Nevermind the dozens of other times we tried and failed...
TeMPOraL · 2h ago
Even more so, we also love deep stochastic parroting. Working hard to ignore direct experience, growing amount of reports, and to avoid reasoning from first principles, in order to confidently deny the already obvious utility of LLMs, and backing that position with some tired memes.
semiinfinitely · 3h ago
there is no truth- only power.
anthk · 2h ago
There's no power, just physics.