From GPT-4 to GPT-5: Measuring progress through MedHELM [pdf]
70 fertrevino 47 8/21/2025, 10:52:14 PM fertrevino.com ↗
I recently worked on running a thorough healthcare eval on GPT-5. The results show a (slight) regression in GPT-5 performance compared to GPT-4 era models.
I found this to be an interesting finding. Here are the detailed results: https://www.fertrevino.com/docs/gpt5_medhelm.pdf
eg - GPT-5 beats GPT-4 on factual recall + reasoning (HeadQA, Medbullets, MedCalc).
But then slips on structured queries (EHRSQL), fairness (RaceBias), evidence QA (PubMedQA).
Hallucination resistance better but only modestly.
Latency seems uneven (maybe it needs more testing?): faster on long tasks, slower on short ones.
I wonder if part of the degraded performance comes from cases where the model thinks you're heading into a dangerous area and gets more and more vague, like they demoed on launch day with the fireworks example. It gets very vague when talking about non-abusable prescription drugs, for example. I wonder if that sort of nerfing gradient is affecting medical queries.
After seeing some painfully bad results, I'm currently using Grok4 for medical queries with a lot of success.
It's impressive, but a regression for now in direct comparison to just a high-parameter model
codex -m gpt-5 model_reasoning_effort="high"
“Did you try running it over and over until you got the results you wanted?”
As one might expect, because the AI isn't actually thinking, it's just spending more tokens on the problem. This sometimes leads to the desired outcome but the phenomenon is very brittle and disappears when the AI is pushed outside the bounds of its training.
To quote their discussion, "CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces."
This is science at its worst, where you start at an inflammatory conclusion and work backwards. There is nothing particularly novel presented here, especially not in the mathematics; obviously performance will degrade on out-of-distribution tasks (and will do so for humans under the same formulation), but the real question is how out-of-distribution a lot of tasks actually are if they can still be solved with CoT. Yes, if you restrict the dataset, then it will perform poorly. But humans already have a pretty large visual dataset to pull from, so what are we comparing to here? How do tiny language models trained on small amounts of data demonstrate fundamental limitations?
I'm eager to see more works showing the limitations of LLM reasoning, both at small and large scale, but this ain't it. Others have already supplied similar critiques, so let's please stop sharing this one around without the grain of salt.
Science starts with a guess and you run experiments to test.
I honestly wish this paper actually showed what it claims, since it is a significant open problem to understand CoT reasoning relative to the underlying training set.
One thing that's hard to wrap my head around is that we are giving more and more trust to something we don't understand, with the (often unchecked) assumption that it just works. Basically your refrain is used to justify all sorts of odd setups of AIs, agents, etc.
I am much more worried about the problem where LLMs are actively misleading low-info users into thinking they’re people, especially children and old people.
I skimmed through the paper and I didn't see any mention of what parameters they used, other than that they used gpt-5 via the API.
What was the reasoning_effort? verbosity? temperature?
These things matter.
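A minimal sketch of the kind of run record that would answer those questions, assuming nothing about how the paper's harness actually works. The class name `EvalRunConfig` and its field names are hypothetical, and the values below are placeholders, not the settings used in the paper.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class EvalRunConfig:
    """Hypothetical record of the sampling settings behind one eval run."""
    model: str
    reasoning_effort: Optional[str] = None  # None means "not set, provider default"
    verbosity: Optional[str] = None
    temperature: Optional[float] = None
    n_samples: int = 1  # how many attempts were made per question

    def to_json(self) -> str:
        # Sorted keys keep the record stable so two runs can be diffed.
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values for illustration only.
config = EvalRunConfig(model="gpt-5", reasoning_effort="high")
print(config.to_json())
```

Publishing something like this next to each score table makes it possible to tell whether two evals are comparing the same setup.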
So it makes sense to me that you should try until you get the results you want (or fail to do so). And it makes sense to ask people what they've tried. I haven't done the work yet to try this for gpt5 and am not that optimistic, but it is possible it will turn out this way again.
Maybe I’m misunderstanding, but it sounds like you’re framing a completely normal process (try, fail, adjust) as if it’s unreasonable?
In reality, when something doesn’t work, it would seem to me that the obvious next step is to adapt and try again. This does not seem like a radical approach but instead seems to largely be how problem solving sort of works?
For example, when I was a kid trying to push-start my motorcycle, it wouldn’t fire no matter what I did. Someone suggested a simple tweak: try a different gear. I did, and instantly the bike roared to life. What I was doing wasn’t wrong, it just needed a slight adjustment to get the result I was after.
1. this is magic and will one-shot your questions
2. but if it goes wrong, keep trying until it works
Plus, knowing it's all probabilistic, how do you know, without knowing ahead of time already, that the result is correct? Is that not the classic halting problem?
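To put a rough number on the retry concern: if a single attempt is correct with probability p, and attempts are independent (an assumption, real retries are often correlated), then the chance that at least one of n attempts is correct is 1 - (1 - p)^n. A small sketch, with the hypothetical helper name `p_any_success` and an illustrative p of 0.3 that is not taken from any eval:

```python
def p_any_success(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries is correct."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts


# Illustrative single-attempt accuracy of 30%, not a measured value.
for n in (1, 5, 10):
    print(n, round(p_any_success(0.3, n), 3))
# prints:
# 1 0.3
# 5 0.832
# 10 0.972
```

This only says a correct answer probably shows up somewhere among the attempts. Picking it out still needs a checker, which is the problem if you don't already know the answer.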
"Did you try a room full of chimpanzees with typewriters?"
Are they really understanding, or putting out a stream of probabilities?
The "lie detector" is used to misguide people, the polygraph is used to measure autonomic arousal.
I think these misnomers can cause real issues like thinking the LLM is "reasoning".
Probabilities have nothing to do with it; by any appropriate definition, there exist statistical models that exhibit "understanding" and "reasoning".
I think pdf.js even defaults to not running scripts in PDFs (would need to double check), if you want to view it in the browser's sandbox. Of course there are still always text-rendering-based security attacks and such but, again, there's nothing unique to that vs a webpage in a browser.