From GPT-4 to GPT-5: Measuring Progress in Medical Language Understanding [pdf]

41 fertrevino 17 8/21/2025, 10:52:14 PM fertrevino.com ↗
I recently worked on running a thorough healthcare eval on GPT-5. The results show a (slight) regression in GPT-5 performance compared to GPT-4 era models.

I found this to be an interesting finding. Here are the detailed results: https://www.fertrevino.com/docs/gpt5_medhelm.pdf

Comments (17)

aresant · 48m ago
Feels like a mixed bag vs regression?

eg - GPT-5 beats GPT-4 on factual recall + reasoning (HeadQA, Medbullets, MedCalc).

But then slips on structured queries (EHRSQL), fairness (RaceBias), evidence QA (PubMedQA).

Hallucination resistance better but only modestly.

Latency seems uneven (maybe more testing?) faster on long tasks, slower on short ones.

woeirua · 19m ago
Definitely seems like GPT5 is a very incremental improvement. Not what you’d expect if AGI were imminent.
xnx · 1h ago
Have you looked at comparing to Google's foundation models or specialty medical models like MedGemma (https://developers.google.com/health-ai-developer-foundation...)?
hypoxia · 1h ago
Did you try it with high reasoning effort?
ares623 · 28m ago
Sorry, not directed at you specifically. But every time I see questions like this I can’t help but rephrase in my head:

“Did you try running it over and over until you got the results you wanted?”

SequoiaHope · 25m ago
What you describe is a person selecting the best results, but if you can get better results one shot with that option enabled, it’s worth testing and reporting results.
ares623 · 22m ago
I get that. But then if that option doesn't help, what I've seen is that the next followup is inevitably "have you tried doing/prompting x instead of y"
theshackleford · 2m ago
> I get that. But then if that option doesn't help, what I've seen is that the next followup is inevitably "have you tried doing/prompting x instead of y"

Maybe I’m misunderstanding, but it sounds like you’re framing a completely normal proces (try, fail, adjust) as if it’s unreasonable?

In reality, when something doesn’t work, it would seem to me that the obvious next step is to adapt and try again. This does not seem like a radical approach but instead seems to largely be how problem solving sort of works?

For example, when I was a kid trying to push start my motorcycle, it wouldn’t fire no matter what I did. Someone suggested a simple tweak, try a different gear. I did, and instantly the bike roared to life. What I was doing wasn’t wrong, it just needed a slight adjustment to get the result I was after.

dcre · 23m ago
This is not a good analogy because reasoning models are not choosing the best from a set of attempts based on knowledge of the correct answer. It really is more like what it sounds like: “did you think about it longer until you ruled out various doubts and became more confident?” Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!
aprilthird2021 · 3m ago
> Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!

One thing it's hard to wrap my head around is that we are giving more and more trust to something we don't understand with the assumption (often unchecked) that it just works. Basically your refrain is used to justify all sorts of odd setup of AIs, agents, etc.

username135 · 51m ago
I wonder what changed with the models that created regression?
teaearlgraycold · 36m ago
Not sure but with each release it feels like they’re just wiping the dirt around and not actually cleaning.
woeirua · 1h ago
Interesting topic, but I'm not opening a PDF from some random website. Post a summary of the paper or the key findings here first.
42lux · 1h ago
It's hacker news. You can handle a PDF.
jeffbee · 1h ago
I approve of this level of paranoia, but I would just like to know why PDFs are dangerous (reasonable) but HTML is not (inconsistent).
HeatrayEnjoyer · 1h ago
PDFs can run almost anything and have an attack surface the size of Greece's coast.
zamadatix · 1h ago
That's not very different than web browsers, but usually security concerned people just disable scripting functionality and such in their viewer (browser, pdf reader, rtf viewer, etc) instead of focusing on the file extension it comes in.

I think pdf.js even defaults to not running scripts in PDFs by default (would need to double check), if you want to view it in the browser's sandbox. Of course there's still always text rendering based security attacks and such but, again, there's nothing unique to that vs a webpage in a browser.