From GPT-4 to GPT-5: Measuring Progress in Medical Language Understanding [pdf]

I recently worked on running a thorough healthcare eval on GPT-5. The results show a (slight) regression in GPT-5 performance compared to GPT-4 era models.

I found this to be an interesting finding. Here are the detailed results: https://www.fertrevino.com/docs/gpt5_medhelm.pdf

Comments (17)

aresant · 48m ago

Feels like a mixed bag vs regression?

E.g., GPT-5 beats GPT-4 on factual recall + reasoning (HeadQA, Medbullets, MedCalc).

But then slips on structured queries (EHRSQL), fairness (RaceBias), evidence QA (PubMedQA).

Hallucination resistance better but only modestly.

Latency seems uneven (maybe more testing?): faster on long tasks, slower on short ones.

woeirua · 19m ago

Definitely seems like GPT-5 is a very incremental improvement. Not what you’d expect if AGI were imminent.

xnx · 1h ago

Have you looked at comparing to Google's foundation models or specialty medical models like MedGemma (https://developers.google.com/health-ai-developer-foundation...)?

hypoxia · 1h ago

Did you try it with high reasoning effort?

ares623 · 28m ago

Sorry, not directed at you specifically. But every time I see questions like this I can’t help but rephrase in my head:

“Did you try running it over and over until you got the results you wanted?”

SequoiaHope · 25m ago

What you describe is a person selecting the best results, but if you can get better results one shot with that option enabled, it’s worth testing and reporting results.

ares623 · 22m ago

I get that. But then if that option doesn't help, what I've seen is that the next followup is inevitably "have you tried doing/prompting x instead of y"

theshackleford · 2m ago

> I get that. But then if that option doesn't help, what I've seen is that the next followup is inevitably "have you tried doing/prompting x instead of y"

Maybe I’m misunderstanding, but it sounds like you’re framing a completely normal process (try, fail, adjust) as if it’s unreasonable?

In reality, when something doesn’t work, it would seem to me that the obvious next step is to adapt and try again. This does not seem like a radical approach but instead seems to largely be how problem solving sort of works?

For example, when I was a kid trying to push start my motorcycle, it wouldn’t fire no matter what I did. Someone suggested a simple tweak, try a different gear. I did, and instantly the bike roared to life. What I was doing wasn’t wrong, it just needed a slight adjustment to get the result I was after.

dcre · 23m ago

This is not a good analogy because reasoning models are not choosing the best from a set of attempts based on knowledge of the correct answer. It really is more like what it sounds like: “did you think about it longer until you ruled out various doubts and became more confident?” Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!

aprilthird2021 · 3m ago

> Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!

One thing that's hard to wrap my head around is that we are giving more and more trust to something we don't understand with the assumption (often unchecked) that it just works. Basically your refrain is used to justify all sorts of odd setups of AIs, agents, etc.

username135 · 51m ago

I wonder what changed with the models that caused the regression?

teaearlgraycold · 36m ago

Not sure but with each release it feels like they’re just wiping the dirt around and not actually cleaning.

woeirua · 1h ago

Interesting topic, but I'm not opening a PDF from some random website. Post a summary of the paper or the key findings here first.

42lux · 1h ago

It's Hacker News. You can handle a PDF.

jeffbee · 1h ago

I approve of this level of paranoia, but I would just like to know why PDFs are dangerous (reasonable) but HTML is not (inconsistent).

HeatrayEnjoyer · 1h ago

PDFs can run almost anything and have an attack surface the size of Greece's coast.

zamadatix · 1h ago

That's not very different from web browsers, but usually security-concerned people just disable scripting functionality and such in their viewer (browser, PDF reader, RTF viewer, etc.) instead of focusing on the file extension it comes in.

I think pdf.js even defaults to not running scripts in PDFs (would need to double check), if you want to view it in the browser's sandbox. Of course there are still always text-rendering-based security attacks and such but, again, there's nothing unique to that vs a webpage in a browser.
