Superhuman performance of an LLM on the reasoning tasks of a physician

34 amichail 25 5/29/2025, 8:44:46 PM arxiv.org ↗

Comments (25)

Lazarus_Long · 6h ago
In general, my smoke test for this kind of thing is whether the company (or whoever) will gladly accept full liability for the AI's usage.

Cases like: - The AI replaces a salesperson, but the sales are not binding or final, in case the client gets a $0 bargain from the chatbot.

- It replaces drivers, but it disengages one second before hitting a tree so the human takes the blame.

- Support wants you to press cancel so the reports say "client canceled" and not "self-driving is doing laps around a patch of grass".

- AI is better than doctors at diagnosis, but in any case of misdiagnosis the blame is shifted to the doctor because "AI is just a tool".

- AI is better at coding than old meat devs, but when the unmaintainable security hole goes to production, the downtime and breaches cannot be blamed on the AI company producing the code; it was the old meat devs' fault.

AI companies want to have their cake and eat it too. Until I see them eating the liability, I know, and I know they know, that it's not ready for the things they say it is.

odyssey7 · 5h ago
Most doctors have insurance for covering their mistakes. We might expect an AI medical startup to pay analogous premiums when it’s paid analogous fees.
_alternator_ · 3h ago
The obvious next step is not that LLMs replace doctors; it's that LLMs become part of the 'standard of care', a component of the triage process. You go to the emergency room, and an LLM assessment becomes routine, if not required. This study shows that doing so would significantly increase accurate diagnoses from the start. Everyone wins.
treetalker · 5h ago
Exactly: skin in the game, and to underscore the point, make any debt non-dischargeable in bankruptcy.
OutOfHere · 4h ago
That's completely missing the point. The LLM scored substantially higher than the clinicians. Statistically, this means the clinicians will have many more misdiagnoses.

The point is that clinicians don't really get sued for misdiagnoses most of the time anyway. With AI, all one has to do is open up a new chat, tell the AI that its last diagnosis isn't really helping, and it will eagerly give an updated assessment. Compared to a clinician, the AI dramatically lowers the bar for iteratively working with it to address an issue.

As for drug prescriptions, they are to be processed through an interactions checker anyway.

inopinatus · 4h ago
If you tell an LLM that its last effort was bad, it won't give you a better outcome. It will get worse at whatever you asked for.

The reason is simple. They are trained as plausibility engines. It's more plausible that a bad diagnostician gives you a worse outcome than a good one, and you have literally just prompted it that it's bad at diagnosis.

Sure, you might get another text completion. Will it be correct, actionable, reliable, safe? Even a stopped clock. Good luck rolling those dice with your health.

In summary, do not iterate with prompts that assert declining competence.

OutOfHere · 3h ago
No, that's a gross frequentist assessment. In reality, the Bayesian assessment is contingent on the first response not helping, and is therefore more likely to be correct, not less. The second response is a conditional response that benefits from new information provided by the user. Accordingly, it's very possible that the LLM will suggest further diagnostic tests to sort out the situation. The same technique also works for code reviews, to stunning effect.
inopinatus · 3h ago
This recommendation isn't about prompts that include notes on "what didn't work". I'm talking about prompts that directly inform the model, "you are modelling an idiot".

The former is reasonable to include when iterating. The latter is a recipe for outcome degradation. The GP above gave the latter form. That activates attention in parts of the model that guide it towards confabulation and loss of faithfulness.

The model doesn't know what is true, only what is plausible to emit. The hypothesis that plausibility converges with scale towards truth and faithfulness remains very far from proven. Bear in mind that the training data includes large swaths of arbitrary text from the Internet, real life, and fiction, which includes plenty of examples of people being wrong, stupid, incompetent, repetitive, whimsical, phony, capricious, manipulative, disingenuous, argumentative, and mendacious. In the right context these are plausible human-like textual interactions, and the only things really holding the model back from completions in such directions are careful training and the system prompt. Worst case, perhaps the corpus included parliamentary proceedings from around the world. "Suppose you were an idiot. And suppose you were a member of Congress. But I repeat myself." - Mark Twain

ncgl · 4h ago
Jesus they're calling us meat devs?
pragmatic · 4h ago
Reminds me of the assassin droid in KOTOR 2 that called everyone meatbags.

We're getting there!

bdbenton5255 · 6h ago
Used as a pure dictionary of knowledge for gathering symptoms and performing diagnoses, it should be obvious that LLMs can do this more efficiently.

As for everything else, as pointed out, these programs are insufficient. As with programmers and other white-collar professions, it seems ideal to integrate these tools into the workplace rather than try to replace the human completely.

Businesspeople probably dream of huge profits from replacing their workforce with AI models, and the marketers and proprietors of AI are likely to overpromise what their products can do, as is the SV tradition: promise the moon in order to extract maximum funding.

inopinatus · 6h ago
Thoroughly proves that with cherry-picked examples and careful prompt engineering, you too can ask for more funding for your next paper.
gpt5 · 7h ago
There are two very interesting results here:

1. ChatGPT o1 significantly outperformed any combination of doctor + resources (median score of 86% vs. 34%-42% for doctors). Hence the superhuman result (at least compared against average physicians).

2. ChatGPT + Doctor performs worse than just ChatGPT alone.

This means the situation is getting similar to chess, where adding Magnus Carlsen as a helper to Stockfish (a strong open-source chess engine) could only make Stockfish worse.

inopinatus · 6h ago
The situation is more akin to a much earlier situation in chess, from 1997, in that Deep Blue could only beat Kasparov with a dedicated team of IBM engineers and GM consultants revising the code between matches, and it still needed a human to interact with the actual chessboard.

We remain a very long way from “ChatGPT will see you now”.

In the meantime, in the real world, I suspect the infamous "Dr Google" is being supplanted by "Dr LLM". It will be difficult to ethically study whether even this leads to generally better patient outcomes.

_________

edit: clarity

htrp · 5h ago
> I suspect the infamous "Dr Google" is being supplanted by "Dr LLM".

Absolutely.

thbb123 · 6h ago
Algorithm aversion and automation bias have been thoroughly studied over the past 70 years of human factors research for industrial safety. All in all, the thought processes of humans are not always compatible with the evidence on which automation works.

Check out Fitts's list and HABA-MABA for more results.

adt · 6h ago
spwa4 · 7h ago
I don't understand what the aim is here. LLMs have disadvantages compared to human doctors that make them a really, really bad option.

1) they can't take measurements themselves.

2) they don't adapt on the job. Illnesses do. In other words, if there is a contagious health emergency, an LLM would see the patients ... and ignore the emergency.

3) they are very bad at figuring out if a patient is lying to them (which is a required skill: combined with 2, people would figure out how to get the LLM to prescribe them morphine and ...)

4) they are generally socially problematic. A big part of being a doctor is gently convincing a patient their slightly painful toe does not in fact justify a diagnosis of bone cancer ... WITHOUT doing tests (that would be unethical, as there's zero chance of those tests yielding positive results)

5) they will not adapt to people. LLMs will not adapt, people will. This means patients will exploit LLMs to achieve a whole bunch of aims (like getting drugs, getting days off, getting free hospital stays, ...) and it doesn't matter how good LLMs are. An adaptive system vs a non-adaptive system ... it's a matter of time.

6) they are not themselves patients. This is a fundamental problem: it will be very hard for an LLM to collect new information about "the human condition" and new problems it may generate. There are many examples of this, from patients drinking radium solution (it lights up in the dark, so surely it must give extra energy, right? Even sexual energy, right?) to rivers or ponds that turn out to have serious diseases lurking around. Meaning a doctor needs to be able to make the decision to go after problems in society when society finds a new, catastrophically dumb way to hurt itself.

Now you might say "but they would still be good in the developing world, wouldn't they?". Yes, but as the tuberculosis vaccine efforts sadly showed: the developing world is developing partially because they invest nothing whatsoever in (poor) people's health. Nothing. Zero. Rien. Which means making health services cheaper (e.g. providing a cheap tuberculosis vaccine) ... has the problem that it does not increase the value of zero. They won't pay for healthcare ... and they won't pay for cheaper healthcare. And while Bill Gates and the US government do pay for a bit of this, they're not sustainable solutions. If, however, you train a local with basic medical skills, there's a lot they can do for free, which actually helps.

timschmidt · 7h ago
3 and 4 describe behaviors that can be highly problematic in doctors too. Patients who have real medical issues are often ignored, scolded, or otherwise denied treatment because of a doctor's perception.
derbOac · 6h ago
5 is also something that happens sometimes with physicians and healthcare generally anyway, and my guess is that it could be trained into LLMs.

1 is often (usually?) not done today by physicians per se anyway.

2 is kind of a strawman about LLMs.

6 is maybe the most challenging critique, but it is also kind of an empirical one, in the sense that if LLMs routinely outperform physicians in decision making (at least under certain circumstances), it will be hard to make the case that it matters.

I have my biases but in general I think at least in the US there needs to be a serious rethinking about how medical decisions can be made and how care can be provided.

I'm skeptical about this paper — the real test will be something like widespread preregistered replication across a wide variety of care settings — but that would happen anyway before it would be adopted. If it works it works and if it doesn't it won't.

My guess is under the best of circumstances it won't get rid of humans, it will just change what they're doing and maybe who is doing that.

howlin · 6h ago
A lot of these problems are already managed by nurses or clinic assistants. It's pretty rare to get much face-to-face time with an actual M.D. This is certainly true the more you look at poorer communities.
CityOfThrowaway · 6h ago
#2 and #3 are both just engineering problems at this point.

The foundation models don't adapt quickly, but you can definitely build systems that inject context to change behavior.

And if you build that system intentionally and correctly, then it's handled for all patients. With human doctors, each individual doctor has to be fed context and change their behavior based on the information, which is stochastic to say the least.
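To make the context-injection idea concrete, here is a minimal Python sketch of the pattern being described; the names (fetch_active_alerts, call_llm) and the alert feed are hypothetical placeholders, not any particular vendor's API:

```python
# Minimal sketch: behavior changes by injecting fresh context into every
# request, rather than by retraining the foundation model.

from datetime import date


def fetch_active_alerts() -> list[str]:
    """Hypothetical feed of current public-health alerts / formulary rules."""
    return [
        "Measles outbreak reported in the region; ask about rash and vaccination status.",
        "Local guidance: flag any request for opioid refills for pharmacist review.",
    ]


def build_messages(patient_summary: str) -> list[dict]:
    # The injected block is what makes the system "adapt" without retraining.
    context_block = "\n".join(f"- {a}" for a in fetch_active_alerts())
    system = (
        f"You are a clinical triage assistant. Today is {date.today()}.\n"
        f"Current advisories to take into account:\n{context_block}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": patient_summary},
    ]


def call_llm(messages: list[dict]) -> str:
    """Placeholder for whatever model endpoint is actually used."""
    return "(model response here)"


if __name__ == "__main__":
    print(call_llm(build_messages("14-year-old with fever and a spreading rash.")))
```

The point is that behavior changes on every request because the advisories are re-fetched each time, without touching the model weights.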

fnordpiglet · 5h ago
I feel like this ignores how LLMs work.

1) Of course not; they would be fed information. But as we build multimodal models that can achieve more and more world integration, there's no reason why not.

2) They're very adaptive; by their abductive nature they adapt extraordinarily well to new situations. Perhaps too well - hence the challenge with hallucinations.

3) This isn't necessarily true, as can be seen in how the alignment of modern SOTA models is becoming more and more difficult to evade. When prompted and aligned with training on drug-seeking behavior, why would you assume they're bad at detecting this?

4) Again, I don't see why this is true. A general-purpose LLM might be, but one that's been aligned properly should do fine.

5) Why do you think LLMs are not adaptive? They adapt through reinforcement and alignment. As a larger corpus of interactions becomes available, they adapt and align towards the training goals. There is extensive research and experience in alignment to date, and models are often continuously adapted. You don't need to retrain the entire base model; you can just retrain a LoRA or embeddings. You can even adapt to specific situations by dynamically pulling in a LoRA or embedding set for the situation (see the sketch after this list).

6) They have human-like responses to human situations because they're trained on a corpus of human language. For a highly specialized model you can ensure specific types of human experience and behavior are well represented and reinforced. You can align the behavior to be what you need.
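As a concrete illustration of point 5, here is a minimal sketch of dynamically swapping LoRA adapters over a frozen base model using the Hugging Face peft library; the model name and adapter paths are hypothetical, and exact API details can vary between peft versions:

```python
# Sketch: adapt behavior by swapping small LoRA adapters over a frozen base
# model, instead of retraining the whole model. Names/paths are hypothetical.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "some-org/base-clinical-llm"  # hypothetical base model
base_model = AutoModelForCausalLM.from_pretrained(BASE)
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Attach a first adapter (e.g., tuned on triage transcripts).
model = PeftModel.from_pretrained(base_model, "adapters/triage-v1", adapter_name="triage")

# Load a second adapter and switch to it for a different situation,
# e.g., stricter handling of controlled-substance requests.
model.load_adapter("adapters/controlled-substances-v2", adapter_name="strict_rx")
model.set_adapter("strict_rx")

prompt = "Patient requests an early refill of an opioid prescription."
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Only the small adapter weights change between situations; the base model itself is never retrained.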

All this said, I don't think anyone here is proposing to take humans entirely out of the loop. But there are many situations where ML models or even heuristics outperform human experts in their own field. There's no reason to believe LLMs, especially when augmented with diagnostic expert-system agents, couldn't generally outperform a doctor in diagnosis. This doesn't mean the human doctor is irrelevant, but that their skills are enhanced and patient outcomes improve with the help of such systems.

Regardless, I feel these criticisms of the approach reflect a naïveté about the way these models work and what they're capable of.

doug_durham · 6h ago
They aren't talking about replacing doctors. It's only about LLMs' ability to do diagnosis, which is one part of being a doctor.
inopinatus · 6h ago
They chose misleadingly hyperbolic language for their title and abstract. The "discussion" section is then similarly loose with meaning and full of overwrought claims.