From GPT-4 to GPT-5: Measuring progress through MedHELM [pdf]

70 points by fertrevino | 8/21/2025, 10:52:14 PM | fertrevino.com | 47 comments
I recently worked on running a thorough healthcare eval on GPT-5. The results show a (slight) regression in GPT-5 performance compared to GPT-4 era models.

I found this to be an interesting finding. Here are the detailed results: https://www.fertrevino.com/docs/gpt5_medhelm.pdf

Comments (47)

aresant · 2h ago
Feels like a mixed bag rather than a regression?

eg - GPT-5 beats GPT-4 on factual recall + reasoning (HeadQA, Medbullets, MedCalc).

But then slips on structured queries (EHRSQL), fairness (RaceBias), evidence QA (PubMedQA).

Hallucination resistance is better, but only modestly.

Latency seems uneven (maybe needs more testing?): faster on long tasks, slower on short ones.

TrainedMonkey · 1h ago
GPT-5 feels like cost engineering. The model is incrementally better, but they are optimizing for the least amount of compute. I am guessing investors love that.
narrator · 1h ago
I agree. I have found GPT-5 significantly worse on medical queries. It feels like it skips important details and is much worse than o3, IMHO. I have heard good things about GPT-5 Pro, but that's not cheap.

I wonder if part of the degraded performance comes from cases where it thinks you're heading into a dangerous area and gets more and more vague, like they demoed on launch day with the fireworks example. It gets very vague when talking about non-abusable prescription drugs, for example. I wonder if that sort of nerfing gradient is affecting medical queries.

After seeing some painfully bad results, I'm currently using Grok4 for medical queries with a lot of success.

yieldcrv · 1h ago
Yeah, look at their open-source models and how they pack so many parameters into so little VRAM.

It's impressive, but for now it's a regression in direct comparison to a plain high-parameter model.

woeirua · 2h ago
Definitely seems like GPT5 is a very incremental improvement. Not what you’d expect if AGI were imminent.
credit_guy · 1h ago
Here's my experience: for some coding tasks where GPT 4.1, Claude Sonnet 4, and Gemini 2.5 Pro were just spinning for hours and hours and getting nowhere, GPT 5 just did the job without a fuss. So I switched immediately to GPT 5 and never looked back. Or at least I never looked back until I found out that my company has Copilot limits for premium models and I blew through the limit. So now I keep my context small, use GPT 5 mini when possible, and when it's not working I move to the full GPT 5. Strangely, it feels like GPT 5 mini can corrupt the full GPT 5, so sometimes I need to go back to Sonnet 4 to get unstuck. To each their own, but I consider GPT 5 a fairly big move forward in the space of coding assistants.
benlc · 8m ago
Interestingly, I'm experiencing the opposite of you. I was mostly using Claude Sonnet 4 and GPT 4.1 through Copilot for a few months and was overall fairly satisfied. The first task I threw at GPT 5, it excelled in a fraction of the time Sonnet 4 normally takes, but after a few iterations it all went downhill. GPT 5 almost systematically does things I didn't ask it to do. After failing to solve an issue for almost an hour, I switched back to Claude, which fixed it on the first try. YMMV
czk · 57m ago
It's possible to use gpt-5-high on the Plus plan with codex-cli, and it's a whole different beast! I don't think there's any other way for Plus users to leverage gpt-5 with high reasoning.

codex -m gpt-5 -c model_reasoning_effort="high"
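(If you want that to stick across sessions, the same override can presumably go in codex's config file at ~/.codex/config.toml as model_reasoning_effort = "high"; I'd double-check the exact key name against the codex-cli README for your version.)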

xnx · 3h ago
Have you looked at comparing to Google's foundation models or specialty medical models like MedGemma (https://developers.google.com/health-ai-developer-foundation...)?
hypoxia · 3h ago
Did you try it with high reasoning effort?
ares623 · 2h ago
Sorry, not directed at you specifically. But every time I see questions like this I can’t help but rephrase in my head:

“Did you try running it over and over until you got the results you wanted?”

dcre · 2h ago
This is not a good analogy because reasoning models are not choosing the best from a set of attempts based on knowledge of the correct answer. It really is more like what it sounds like: “did you think about it longer until you ruled out various doubts and became more confident?” Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!
brendoelfrendo · 1h ago
Bad news: it doesn't seem to work as well as you might think: https://arxiv.org/pdf/2508.01191

As one might expect, because the AI isn't actually thinking, it's just spending more tokens on the problem. This sometimes leads to the desired outcome but the phenomenon is very brittle and disappears when the AI is pushed outside the bounds of its training.

To quote their discussion, "CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces."

hodgehog11 · 33m ago
I keep wondering whether people have actually examined how this work draws its conclusions before citing it.

This is science at its worst, where you start at an inflammatory conclusion and work backwards. There is nothing particularly novel presented here, especially not in the mathematics; obviously performance will degrade on out-of-distribution tasks (and will do so for humans under the same formulation), but the real question is how out-of-distribution a lot of tasks actually are if they can still be solved with CoT. Yes, if you restrict the dataset, then it will perform poorly. But humans already have a pretty large visual dataset to pull from, so what are we comparing to here? How do tiny language models trained on small amounts of data demonstrate fundamental limitations?

I'm eager to see more works showing the limitations of LLM reasoning, both at small and large scale, but this ain't it. Others have already supplied similar critiques, so let's please stop sharing this one around without the requisite grain of salt.

ipaddr · 15m ago
"This is science at its worst, where you start at an inflammatory conclusion and work backwards"

Science starts with a guess, and you run experiments to test it.

hodgehog11 · 2m ago
True, but the experiments are engineered to give the results they want. It's a mathematical certainty that performance will drop off here, but that is not an accurate assessment of what is going on at scale.
aprilthird2021 · 1h ago
> Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!

One thing I find hard to wrap my head around is that we are giving more and more trust to something we don't understand, with the (often unchecked) assumption that it just works. Basically, your refrain is used to justify all sorts of odd setups of AIs, agents, etc.

dcre · 1h ago
Trusting things to work based on practical experience and without formal verification is the norm rather than the exception. In formal contexts like software development people have the means to evaluate and use good judgment.

I am much more worried about the problem where LLMs are actively misleading low-info users into thinking they’re people, especially children and old people.

SequoiaHope · 2h ago
What you describe is a person selecting the best results, but if you can get better results one shot with that option enabled, it’s worth testing and reporting results.
ares623 · 2h ago
I get that. But then if that option doesn't help, what I've seen is that the next followup is inevitably "have you tried doing/prompting x instead of y"
Art9681 · 1h ago
It can be summarized as "Did you RTFM?" One shouldn't expect optimal results if the time and effort weren't invested in learning the tool, any tool. LLMs are no different. GPT-5 isn't one model, it's a family: gpt-5, gpt-5-mini, and gpt-5-nano, each with high|medium|low reasoning configurations. Anyone who is serious about measuring model capability would go for the best configuration, especially in medicine.

I skimmed through the paper and didn't see any mention of what parameters they used, other than that they use gpt-5 via the API.

What was the reasoning_effort? verbosity? temperature?

These things matter.
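
For concreteness, here's roughly what that configuration looks like if the eval went through the Responses API; the parameter shapes are my recollection of the GPT-5 launch docs, not anything the paper states, so treat them as assumptions:

    # Hypothetical sketch (not from the paper): setting reasoning_effort and
    # verbosity for gpt-5 via the OpenAI Responses API.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "high"},  # minimal | low | medium | high
        text={"verbosity": "low"},     # low | medium | high
        input="(a MedHELM-style prompt would go here)",
    )
    print(resp.output_text)

As far as I can tell, gpt-5 doesn't even accept a custom temperature through this path, which is all the more reason for the paper to state exactly what it sent.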

furyofantares · 1h ago
Something I've experienced with multiple new model releases is that plugging them into my app makes the app worse. Then I do a bunch of work on prompts and now the app is better than ever. And it's not like the prompts are just better and make the old model work better too; usually the new prompts make the old model worse, or there isn't any change.

So it makes sense to me that you should try until you get the results you want (or fail to do so). And it makes sense to ask people what they've tried. I haven't done the work yet to try this for gpt5 and am not that optimistic, but it is possible it will turn out this way again.

theshackleford · 1h ago
> I get that. But then if that option doesn't help, what I've seen is that the next followup is inevitably "have you tried doing/prompting x instead of y"

Maybe I'm misunderstanding, but it sounds like you're framing a completely normal process (try, fail, adjust) as if it's unreasonable?

In reality, when something doesn't work, it seems to me that the obvious next step is to adapt and try again. That doesn't seem like a radical approach; it's largely how problem solving works.

For example, when I was a kid trying to push-start my motorcycle, it wouldn't fire no matter what I did. Someone suggested a simple tweak: try a different gear. I did, and instantly the bike roared to life. What I was doing wasn't wrong, it just needed a slight adjustment to get the result I was after.

ares623 · 12m ago
I get trying and improving until you get it right. But I just can't bridge the gap in my head between

1. this is magic and will one-shot your questions, and 2. if it goes wrong, keep trying until it works.

Plus, knowing it's all probabilistic, how do you know, without knowing ahead of time already, that the result is correct? Is that not the classic halting problem?

chairmansteve · 1h ago
Or...

"Did you try a room full of chimpanzees with typewriters?"

username135 · 2h ago
I wonder what changed with the models that created the regression?
teaearlgraycold · 2h ago
Not sure, but with each release it feels like they're just pushing the dirt around and not actually cleaning.
causality0 · 40m ago
I've definitely seen some unexpected behavior from gpt5. For example, it will tell me my query is banned and then give me a full answer anyway.
kumarvvr · 1h ago
I have an issue with the words "understanding", "reasoning", etc when talking about LLMs.

Are they really understanding, or putting out a stream of probabilities?

munchler · 44m ago
Does it matter from a practical point of view? It's either true understanding or it's something else that's similar enough to share the same name.
axdsk · 14m ago
The polygraph is a good example.

The "lie detector" is used to misguide people, the polygraph is used to measure autonomic arousal.

I think these misnomers can cause real issues like thinking the LLM is "reasoning".

jmpeax · 52m ago
Do you yourself really understand, or are you just depolarizing neurons that have reached their threshold?
octomind · 27m ago
It can be simultaneously true that human understanding is just the firing of neurons and that the architecture and function of those neural structures are vastly different from what an LLM is doing internally, such that they are not really the same. I'd encourage you to read Apple's recent paper on thinking models; I think it's pretty clear that the way LLMs encode the world is drastically inferior to what the human brain does. I also believe that could be fixed with the right technical improvements, but it just isn't the case today.
dmead · 50m ago
He doesn't know the answer to that and neither do you.
lucisferre · 49m ago
What pseudo scientific nonsense.
hodgehog11 · 49m ago
What does understanding mean? Is there a sensible model for it? If not, we can only judge in the same way that we judge humans: by conducting examinations and determining whether the correct conclusions were reached.

Probabilities have nothing to do with it; by any appropriate definition, there exist statistical models that exhibit "understanding" and "reasoning".

dang · 40m ago
OK, we've removed all understanding from the title above.
sema4hacker · 55m ago
The latter. When "understand", "reason", "think", "feel", "believe", and any of a long list of similar words are in any title, it immediately makes me think the author already drank the kool aid.
manveerc · 44m ago
In the context of coding agents, they do simulate "reasoning" when you feed them the output and they are able to correct themselves.
qwertytyyuu · 43m ago
I agree with "feel" and "believe", but what words would you suggest instead of "understand" and "reason"?
sema4hacker · 20m ago
None. Don't anthropomorphize at all. Note that "understanding" has now been removed from the HN title but not the linked pdf.
woeirua · 3h ago
Interesting topic, but I'm not opening a PDF from some random website. Post a summary of the paper or the key findings here first.
42lux · 3h ago
It's hacker news. You can handle a PDF.
jeffbee · 3h ago
I approve of this level of paranoia, but I would just like to know why PDFs are dangerous (reasonable) but HTML is not (inconsistent).
HeatrayEnjoyer · 2h ago
PDFs can run almost anything and have an attack surface the size of Greece's coast.
zamadatix · 2h ago
That's not very different from web browsers, but security-conscious people usually just disable scripting functionality and such in their viewer (browser, PDF reader, RTF viewer, etc.) instead of focusing on the file extension it comes in.

I think pdf.js even defaults to not running scripts embedded in PDFs (would need to double-check), if you want to view it in the browser's sandbox. Of course there are still always text-rendering-based attacks and such, but again, there's nothing unique about that vs a webpage in a browser.