GPT-5: Key characteristics, pricing and system card

233 Philpax 67 8/7/2025, 5:46:18 PM simonwillison.net ↗

Comments (67)

morleytj · 45m ago
It's cool and I'm glad it sounds like it's getting more reliable, but given the things people have been saying GPT-5 would be for the last two years, you'd expect it to be a world-shattering release rather than an incremental, stable improvement.

It does sort of give me the vibe that pure scaling maximalism really is dying off, though. If the focus is now on writing better routers, tooling, and combining specialized submodels on tasks, then it feels like there's a search for new ways to improve performance (and lower cost), suggesting the other established approaches weren't working. I could totally be wrong, but I feel like if just throwing more compute at the problem were working, OpenAI probably wouldn't be spending much time optimizing user routing over currently existing strategies to get marginal improvements on average user interactions.

I've been pretty negative on the thesis that we only need more data/compute to achieve AGI with current techniques, though, so perhaps I'm overly biased against it. If there's one thing that bothers me about the situation in general, it's that we really have no clue what the actual status of these models is, because of how closed off all the industry labs have become, plus the feeling that we can't expect anything other than marketing language from the presentations. I suppose that's inevitable with the massive investments, though. Maybe they've got some massive earthshattering model release coming next, who knows.

hnuser123456 · 35m ago
I agree, we have now proven that GPUs can ingest information and be trained to generate content for various tasks. But to put it to work, make it useful, requires far more thought about a specific problem and how to apply the tech. If you could just ask GPT to create a startup that'll be guaranteed to be worth $1B on a $1k investment within one year, someone else would've already done it. Elbow grease still required for the foreseeable future.

In the meantime, figuring out how to train them to make fewer of their most common mistakes is a worthwhile effort.

thorum · 15m ago
The quiet revolution is happening in tool use and multimodal capabilities. Moderate incremental improvements on general intelligence, but dramatic improvements on multi-step tool use and ability to interact with the world (vs 1 year ago), will eventually feed back into general intelligence.
BoiledCabbage · 35m ago
Performance is doubling roughly every 4-7 months. That trend is continuing. That's insane.

If your expectations were any higher than that, then it seems like you were caught up in hype. Doubling 2-3 times per year isn't leveling off by any means.

https://metr.github.io/autonomy-evals-guide/gpt-5-report/
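A quick sanity check of what "doubling every 4-7 months" implies, starting from GPT-5's reported ~2h15m (2.25h) 50% time horizon mentioned elsewhere in the thread. The doubling periods here are the comment's claim, not figures from METR:

```python
def projected_horizon_hours(start_hours, months, doubling_months):
    """Exponential extrapolation: the horizon doubles every `doubling_months`."""
    return start_hours * 2 ** (months / doubling_months)

# Two years out, the claimed trend spans a wide range:
fast = projected_horizon_hours(2.25, 24, 4)  # doubling every 4 months -> 144 hours
slow = projected_horizon_hours(2.25, 24, 7)  # doubling every 7 months -> ~24 hours
```

Even at the slow end of the claimed range, that's a roughly tenfold increase in task length in two years.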

morleytj · 13m ago
I wouldn't say model development and performance is "leveling off", and in fact I didn't write that. I'd say that tons more funding is going into the development of many models, so one would expect performance increases unless the paradigm was completely flawed at its core, a belief I wouldn't personally profess to. My point was more so the following: a couple of years ago it was easy to find people saying that all we needed was to add video data, or genetic data, or some other data modality, in the exact same format as the existing language data, and we'd see a fast takeoff scenario with no other algorithmic changes. Given that the top labs seem to be increasingly investigating alternate approaches beyond just adding more data sources, and have been for the last couple of years (which, I should clarify, is a good idea in my opinion), the probability that just adding more data or more compute takes us straight to AGI seems at the very least slightly lower, right?

Rather than my personal opinion, I was commenting on commonly viewed opinions of people I would believe to have been caught up in hype in the past. But I do feel that although that's a benchmark, it's not necessarily the end-all of benchmarks. I'll reserve my final opinions until I test personally, of course.

oblio · 20m ago
By "performance" I guess you mean "the length of task that can be done adequately"?

It is a benchmark but I'm not very convinced it's the be-all, end-all.

jstummbillig · 39m ago
Things have moved differently than we thought they would 2 years ago, but let's not forget what has happened in the meantime (4o, o1 + the thinking paradigm, o3).

So yeah, maybe we are getting more incremental improvements. But that to me seems like a good thing, because more good things earlier. I will take that over world-shattering any day – but if we were to consider everything that has happened since the first release of gpt-4, I would argue the total amount is actually very much world-shattering.

simonw · 21m ago
I for one am pretty glad about this. I like LLMs that augment human abilities - tools that help people get more done and be more ambitious.

The common concept for AGI seems to be much more about human replacement - the ability to complete "economically valuable tasks" better than humans can. I still don't understand what our human lives or economies would look like there.

What I personally wanted from GPT-5 is exactly what I got: models that do the same stuff that existing models do, but more reliably and "better".

GaggiX · 37m ago
Compared to GPT-4, it is on a completely different level given that it is a reasoning model, so in that regard it does deliver, and it's not just scaling. But for this I guess the revolution was o1, and GPT-5 is just a much more mature version of the technology.
hodgehog11 · 1h ago
The aggressive pricing here seems unusual for OpenAI. If they had a large moat, they wouldn't need to do this. Competition is fierce indeed.
FergusArgyll · 13m ago
They are winning by massive margins in the app, but losing (!) in the API to Anthropic

https://finance.yahoo.com/news/enterprise-llm-spend-reaches-...

ilaksh · 35m ago
It's like 5% better. I think they obviously had no choice but to be price-competitive with Gemini 2.5 Pro, especially to get Cursor to change its default.
impure · 36m ago
The 5 cents for Nano is interesting. Maybe it will force Google to start dropping their prices again which have been slowly creeping up recently.
0x00cl · 1h ago
Maybe they need/want the data.
impure · 37m ago
OpenAI and most AI companies do not train on data submitted to a paid API.
WhereIsTheTruth · 15m ago
They also do not train using copyrighted material /s
daveguy · 6m ago
Oh, they never even made that promise. They're trying to say it's fine to launder copyright material through a model.
dr_dshiv · 49m ago
And it’s a massive distillation of the mother model, so the costs of inference are likely low.
bdcdo · 1h ago
"GPT-5 in the API is simpler: it’s available as three models—regular, mini and nano—which can each be run at one of four reasoning levels: minimal (a new level not previously available for other OpenAI reasoning models), low, medium or high."

Is it actually simpler? For those of us currently using GPT-4.1, we're going from 3 options (4.1, 4.1 mini and 4.1 nano) to at least 8, even before counting regular GPT-5: gpt-5 mini minimal, gpt-5 mini low, gpt-5 mini medium, gpt-5 mini high, gpt-5 nano minimal, gpt-5 nano low, gpt-5 nano medium and gpt-5 nano high.

And, while choosing between all these options, we'll always have to wonder: should I try adjusting the prompt that I'm using, or simply change the gpt 5 version or its reasoning level?
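The option space described above, enumerated as a sketch. The exact API model identifiers here are assumptions for illustration; the article only names the tiers (regular/mini/nano) and reasoning levels:

```python
from itertools import product

MODELS = ["gpt-5", "gpt-5-mini", "gpt-5-nano"]       # identifiers assumed
EFFORTS = ["minimal", "low", "medium", "high"]

configs = list(product(MODELS, EFFORTS))              # 12 combinations
mini_nano_only = [c for c in configs if c[0] != "gpt-5"]  # the 8 counted above
```

Two orthogonal dials rather than a flat list of model names, which is the structural simplification mwigdahl and impossiblefork point to below.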

mwigdahl · 1h ago
If reasoning is on the table, then you already had to add o3-mini-high, o3-mini-medium, o3-mini-low, o4-mini-high, o4-mini-medium, and o4-mini-low to the 4.1 variants. The GPT-5 way seems simpler to me.
hirako2000 · 5m ago
Ultimately they are selling tokens, so try many times.
impossiblefork · 1h ago
Yes, I think so. It's n=1,2,3, m=0,1,2,3. There's structure, and you know which way each parameter moves things.
makeramen · 1h ago
But given the option, do you choose bigger models or more reasoning? Or medium of both?
paladin314159 · 54m ago
If you need world knowledge, then bigger models. If you need problem-solving, then more reasoning.

But the specific nuance of picking nano/mini/main and minimal/low/medium/high comes down to experimentation and what your cost/latency constraints are.
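A minimal sketch of the heuristic above: bigger model for world knowledge, more reasoning effort for problem-solving, downgraded under cost pressure. Model names and the downgrade rule are illustrative assumptions, not OpenAI guidance:

```python
def pick_config(needs_knowledge, needs_reasoning, cost_sensitive):
    # World knowledge lives in the bigger model's weights.
    if needs_knowledge:
        model = "gpt-5"
    else:
        model = "gpt-5-nano" if cost_sensitive else "gpt-5-mini"
    # Problem-solving benefits from more reasoning tokens.
    effort = "high" if needs_reasoning else "minimal"
    if cost_sensitive and effort == "high":
        effort = "medium"  # trade a little reasoning for cost/latency
    return model, effort
```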

impossiblefork · 59m ago
I would have to get experience with them. I mostly use Mistral, so I have only the choice of thinking or not thinking.
namibj · 1h ago
Depends on what you're doing.
addaon · 1h ago
> Depends on what you're doing.

Trying to get an accurate answer (best correlated with objective truth) on a topic I don't already know the answer to (or why would I ask?). This is, to me, the challenge with the "it depends, tune it" answers that always come up with these tools -- tuning requires already knowing the answer, which is exactly when the tool isn't useful to you.

wongarsu · 7m ago
If cost is no concern (as in infrequent one-off tasks) then you can always go with the biggest model with the most reasoning. Maybe compare it with the biggest model with no/less reasoning, since sometimes reasoning can hurt (just as with humans overthinking something).

If you have a task you do frequently, you need some kind of benchmark. Which might just be comparing how well the output of the smaller models holds up against the output of the bigger model, if you don't know the ground truth.
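A sketch of that benchmark idea: with no ground truth, score a smaller model by how often its answers match the biggest model's answers on the same prompts. Exact match is the crudest possible comparison; real evals would use softer scoring:

```python
def agreement_rate(small_answers, big_answers):
    """Fraction of prompts where the small model matches the big model."""
    assert len(small_answers) == len(big_answers)
    matches = sum(s == b for s, b in zip(small_answers, big_answers))
    return matches / len(big_answers)
```

If the rate is high enough for your task, the cheaper model is probably good enough; the big model serves as a ground-truth proxy, not as truth itself.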

empiko · 2h ago
Despite the fact that their models are used in hiring, business, education, etc., this multibillion-dollar company uses one benchmark with very artificial questions (BBQ) to evaluate how fair its model is. I am a little bit disappointed.
zaronymous1 · 1h ago
Can anyone explain to me why they've removed parameter controls for temperature and top-p in reasoning models, including gpt-5? It strikes me that it makes it harder to build with these to do small tasks requiring high-levels of consistency, and in the API, I really value the ability to set certain tasks to a low temp.
Der_Einzige · 47m ago
It's because all forms of sampler settings destroy safety/alignment. That's why top_p/top_k are still used instead of TFS, min_p, top-n sigma, etc., and why temperature is locked to an arbitrary 0-2 range.

Open source is years ahead of these guys on samplers. It's why their models being so good is that much more impressive.
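What the knobs in this subthread actually do, sketched in plain Python: temperature rescales logits before softmax, and top-p (nucleus) sampling keeps the smallest set of top tokens whose probability mass reaches p. This is a textbook sketch, not any particular provider's implementation:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_indices(probs, p=0.9):
    """Indices kept by nucleus sampling with threshold p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept
```

Alternative samplers like min_p or top-n sigma replace the cumulative-mass cutoff with different truncation rules over the same distribution.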

oblio · 17m ago
Temperature is the response variation control?
anyg · 1h ago
Good to know:

> Knowledge cut-off is September 30th 2024 for GPT-5 and May 30th 2024 for GPT-5 mini and nano
falcor84 · 1h ago
Oh wow, so essentially a full year of post-training and testing. Or was it ready and there was a sufficiently good business strategy decision to postpone the release?
bhouston · 51m ago
Weird to have such an early knowledge cutoff. Claude 4.1 has March 2025, 6 months more recent, with comparable results.
justusthane · 25m ago
> a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent

This is sort of interesting to me. It strikes me that so far we've had more or less direct access to the underlying model (apart from the system prompt and guardrails), but I wonder if going forward there's going to be more and more infrastructure between us and the model.

hirako2000 · 2m ago
Consider it low-level routing. Keep in mind it allows the non-active parts to stay out of memory. Mistral, afaik, came up with this concept quite a while back.
ilaksh · 39m ago
This is key info from the article for me:

> `"reasoning": {"summary": "auto"} }'` (the tail end of the example API call quoted in the article)

Here’s the response from that API call.

https://gist.github.com/simonw/1d1013ba059af76461153722005a0...

Without that option there's often a lengthy delay while the model burns through thinking tokens before you start getting back visible tokens for the final response.

diggan · 1h ago
> but for the moment here’s the pelican I got from GPT-5 running at its default “medium” reasoning effort:

Would have been interesting to see a comparison between low, medium and high reasoning_effort pelicans :)

When I've played around with GPT-OSS-120b recently, it seems the difference in the final answer is huge: "low" is essentially "no reasoning", and with "high" it can spend a seemingly endless amount of tokens. I'm guessing the difference with GPT-5 will be similar?

simonw · 1h ago
> Would have been interesting to see a comparison between low, medium and high reasoning_effort pelicans

Yeah, I'm working on that - expect dozens more pelicans in a later post.

ks2048 · 2h ago
So, "system card" now means what used to be a "paper", but without lots of the details?
simonw · 1h ago
AI labs tend to use "system cards" to describe their evaluation and safety research processes.

They used to be more about the training process itself, but that's increasingly secretive these days.

kaoD · 2h ago
Nope. System card is a sales thing. I think we generally call that "product sheet" in other markets.
pancakemouse · 1h ago
Practically the first thing I do after a new model release is try to upgrade `llm`. Thank you, @simonw !
efavdb · 54m ago
same, looks like he hasn't added 5.0 to the package yet but assume imminent.

https://llm.datasette.io/en/stable/openai-models.html

nickthegreek · 2h ago
These new naming conventions, while not perfect, are a lot clearer, and I am sure they will help my coworkers.
isoprophlex · 53m ago
Whoa this looks good. And cheap! How do you hack a proxy together so you can run Claude Code on gpt-5?!
dalberto · 48m ago
Consider: https://github.com/musistudio/claude-code-router

or even: https://github.com/sst/opencode

Not affiliated with either one of these, but they look promising.

Leary · 2h ago
METR of only 2 hours and 15 minutes. Fast takeoff less likely.
kqr · 1h ago
Seems like it's on the line that's scaring people like AI 2027, isn't it? https://aisafety.no/img/articles/length-of-tasks-log.png
FergusArgyll · 9m ago
It's above the exponential line and right around the super-exponential line.
qsort · 2h ago
Isn't that pretty much in line with what people were expecting? Is it surprising?
usaar333 · 1h ago
No, this is below expectations on both Manifold and lesswrong (https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_green...). Median was ~2.75 hours on both (which already represented a bearish slowdown).

Not massively off -- manifold yesterday implied odds this low were ~35%. 30% before Claude Opus 4.1 came out which updated expected agentic coding abilities downward.

qsort · 1h ago
Thanks for sharing, that was a good thread!
dingnuts · 2h ago
It's not surprising to AI critics but go back to 2022 and open r/singularity and then answer: what "people" were expecting? Which people?

SamA has been promising AGI next year for three years like Musk has been promising FSD next year for the last ten years.

IDK what "people" are expecting but with the amount of hype I'd have to guess they were expecting more than we've gotten so far.

The fact that "fast takeoff" is a term I recognize indicates that some people believed OpenAI when they said this technology (transformers) would lead to sci fi style AI and that is most certainly not happening

ToValueFunfetti · 1h ago
>SamA has been promising AGI next year for three years like Musk has been promising FSD next year for the last ten years.

Has he said anything about it since last September:

>It is possible that we will have superintelligence in a few thousand days (!); it may take longer, but I’m confident we’ll get there.

This is, at an absolute minimum, 2000 days = 5 years. And he says it may take longer.

Did he even say AGI next year any time before this? It looks like his predictions were all pointing at the late 2020s, and now he's thinking early 2030s. Which you could still make fun of, but it just doesn't match up with your characterization at all.

falcor84 · 1h ago
I would say that there are quite a lot of roles where you need to do a lot of planning to effectively manage an ~8 hour shift, but then there are good protocols for handing over to the next person. So once AIs get to that level (in 2027?), we'll be much closer to AIs taking on "economically valuable work".
umanwizard · 2h ago
What is METR?
tunesmith · 1h ago
The 2h 15m is the length of tasks the model can complete with 50% probability. So longer is better in that sense. Or at least, "more advanced" and potentially "more dangerous".
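A sketch of what "2h15m at 50%" means: METR estimates the task length at which the model's success rate crosses 50%. The real report fits a logistic curve; this toy version just linearly interpolates over made-up (hours, success_rate) points:

```python
def horizon_at_50(points):
    """points: (task_hours, success_rate) pairs, success decreasing with length."""
    for (h0, s0), (h1, s1) in zip(points, points[1:]):
        if s0 >= 0.5 >= s1:
            frac = (s0 - 0.5) / (s0 - s1)  # how far past h0 the 50% line falls
            return h0 + frac * (h1 - h0)
    return None  # 50% never crossed in the observed range

toy = [(0.5, 0.9), (1.0, 0.7), (2.0, 0.55), (4.0, 0.35)]
```

With this invented data the 50% horizon lands at 2.5 hours; a longer horizon means harder tasks succeed at least half the time.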
onehair · 1h ago
> Definitely recognizable as a pelican

right :-D

cco · 1h ago
Only a third cheaper than Sonnet 4? Incrementally better I suppose.

> and minimizing sycophancy

Now we're talking about a good feature! Actually one of my biggest annoyances with Cursor (that mostly uses Sonnet).

"You're absolutely right!"

I mean not really Cursor, but ok. I'll be super excited if we can get rid of these sycophancy tokens.

nosefurhairdo · 21m ago
In my early testing gpt5 is significantly less annoying in this regard. Gives a strong vibe of just doing what it's told without any fluff.
logicchains · 1h ago
>Only a third cheaper than Sonnet 4?

The price should be compared to Opus, not Sonnet.

cco · 9m ago
Wow, if so, 7x cheaper. Crazy if true.
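A back-of-the-envelope cost comparison. The per-million-token prices below are illustrative placeholders, not confirmed figures from the thread or the article; swap in current published pricing before drawing conclusions:

```python
PRICES = {  # (input, output) USD per 1M tokens -- assumed, for illustration
    "gpt-5": (1.25, 10.00),
    "claude-opus": (15.00, 75.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Cost of one request in USD under the placeholder price table."""
    pin, pout = PRICES[model]
    return (input_tokens * pin + output_tokens * pout) / 1_000_000
```

With these placeholder numbers the output-token price ratio is 7.5x, in the ballpark of the "7x cheaper" figure above.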