It's cool and I'm glad it sounds like it's getting more reliable, but given the types of things people have been saying GPT-5 would be for the last two years you'd expect GPT-5 to be a world-shattering release rather than incremental and stable improvement.
It does sort of give me the vibe that the pure scaling maximalism really is dying off though. If the approach now is writing better routers, tooling, and combining specialized submodels on tasks, then it feels like there's a search for new ways to improve performance (and lower cost), suggesting the other established approaches weren't working. I could totally be wrong, but I feel like if just throwing more compute at the problem were working, OpenAI probably wouldn't be spending much time optimizing user routing across currently existing strategies to get marginal improvements on average user interactions.
I've been pretty negative on the thesis of only needing more data/compute to achieve AGI with current techniques though, so perhaps I'm overly biased against it. If there's one thing that bothers me in general about the situation though, it's that it feels like we really have no clue what the actual status of these models is because of how closed off all the industry labs have become + the feeling of not being able to expect anything other than marketing language from the presentations. I suppose that's inevitable with the massive investments though. Maybe they've got some massive earthshattering model release coming out next, who knows.
godelski · 32m ago
> It does sort of give me the vibe that the pure scaling maximalism really is dying off though
I think the big question is if/when investors will start giving money to those who have been predicting this (with evidence) and trying other avenues.
Really though, why put all your eggs in one basket? That's what I've been confused about for a while. Why fund yet another LLMs-to-AGI startup? The space is saturated with big players and has been for years. Even if LLMs could get there, that doesn't mean something else won't get there faster and for less. It also seems you'd want a backup in order to avoid popping the bubble. Technology S-curves and all that still apply to AI.
Granted, I'm similarly biased, but so is everyone I know with a strong math and/or science background (I even mentioned it in my thesis more than a few times lol). "Scaling is all you need" just doesn't check out.
thorum · 2h ago
The quiet revolution is happening in tool use and multimodal capabilities. Moderate incremental improvements on general intelligence, but dramatic improvements on multi-step tool use and ability to interact with the world (vs 1 year ago), will eventually feed back into general intelligence.
darkhorse222 · 1h ago
Completely agree. General intelligence is a building block. By chaining things together you can achieve meta programming. The trick isn't to create one perfect block but to build a variety of blocks and make one of those blocks a block-builder.
coolKid721 · 1h ago
[flagged]
dang · 1h ago
Can you please make your substantive points thoughtfully? Thoughtful criticism is welcome, but snarky putdowns, one-liners, etc., degrade the discussion for everyone.
You've posted substantive comments in other threads, so this should be easy to fix.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
I agree, we have now proven that GPUs can ingest information and be trained to generate content for various tasks. But to put it to work, make it useful, requires far more thought about a specific problem and how to apply the tech. If you could just ask GPT to create a startup that'll be guaranteed to be worth $1B on a $1k investment within one year, someone else would've already done it. Elbow grease still required for the foreseeable future.
In the meantime, figuring out how to train them to make fewer of their most common mistakes is a worthwhile effort.
morleytj · 1h ago
Certainly, yes, plenty of elbow grease required in all things that matter.
The interesting point to me as well, though, is that if it could create a startup that was worth $1B, that startup wouldn't be worth $1B.
Why would anyone pay that much to invest in the startup if they could recreate the entire thing with the same tool that everyone would have access to?
BoiledCabbage · 2h ago
Performance is doubling roughly every 4-7 months. That trend is continuing. That's insane.
If your expectations were any higher than that, then it seems like you were caught up in hype. Doubling 2-3 times per year isn't leveling off by any means.
https://metr.github.io/autonomy-evals-guide/gpt-5-report/
I wouldn't say model development and performance is "leveling off", and in fact didn't write that. I'd say that tons more funding is going into the development of many models, so one would expect performance increases unless the paradigm was completely flawed at its core, a belief I wouldn't personally profess to. My point was more so the following: a couple of years ago it was easy to find people saying that all we needed was to add in video data, or genetic data, or some other data modality, in the exact same format as the existing language data the models were trained on, and we'd see a fast takeoff scenario with no other algorithmic changes. Given that the top labs seem to be increasingly investigating alternate approaches to setting up the models beyond just adding more data sources, and have been for the last couple of years (which, I should clarify, is a good idea in my opinion), the probability of those statements about just adding more data or more compute taking us straight to AGI being correct seems at the very least slightly lower, right?
Rather than my personal opinion, I was commenting on commonly viewed opinions of people I would believe to have been caught up in hype in the past. But I do feel that although that's a benchmark, it's not necessarily the end-all of benchmarks. I'll reserve my final opinions until I test personally, of course. I will say that increasing the context window probably translates pretty well to longer context task performance, but I'm not entirely convinced it directly translates to individual end-step improvement on every class of task.
oblio · 2h ago
By "performance" I guess you mean "the length of task that can be done adequately"?
It is a benchmark but I'm not very convinced it's the be-all, end-all.
nomel · 2m ago
> It is a benchmark but I'm not very convinced it's the be-all, end-all.
Who's suggesting it is?
brandall10 · 1h ago
To be fair, this is one of the pathways GPT-5 was speculated to take as far back as 6 or so months ago - simply being an incremental upgrade from a performance perspective, but a leap from a product simplification approach.
At this point it's pretty much a given that it's a game of inches moving forward.
AbstractH24 · 44m ago
> It's cool and I'm glad it sounds like it's getting more reliable, but given the types of things people have been saying GPT-5 would be for the last two years you'd expect GPT-5 to be a world-shattering release rather than incremental and stable improvement.
Are you trying to say the curve is flattening? That advances are coming slower and slower?
As long as it doesn't suggest a dot com level recession I'm good.
jstummbillig · 2h ago
Things have moved differently than what we thought would happen 2 years ago, but let's not forget what has happened in the meantime (4o, o1 + the thinking paradigm, o3).
So yeah, maybe we are getting more incremental improvements. But that to me seems like a good thing, because more good things earlier. I will take that over world-shattering any day – but if we were to consider everything that has happened since the first release of gpt-4, I would argue the total amount is actually very much world-shattering.
simonw · 2h ago
I for one am pretty glad about this. I like LLMs that augment human abilities - tools that help people get more done and be more ambitious.
The common concept for AGI seems to be much more about human replacement - the ability to complete "economically valuable tasks" better than humans can. I still don't understand what our human lives or economies would look like there.
What I personally wanted from GPT-5 is exactly what I got: models that do the same stuff that existing models do, but more reliably and "better".
morleytj · 2h ago
I'd agree on that.
That's pretty much the key component these approaches have been lacking: reliability and consistency on the tasks they already work well on to some extent.
I think there's a lot of visions of what our human lives would look like in that world that I can imagine, but your comment did make me think of one particularly interesting tautological scenario in that commonly defined version of AGI.
If artificial general intelligence is defined as completing "economically valuable tasks" better than humans can, it requires one to define "economically valuable." As it currently stands, something holds value in an economy relative to human beings wanting it. Houses get expensive because many people, each of whom have economic utility which they use to purchase things, want to have houses, of which there is a limited supply for a variety of reasons. If human beings are not the most effective producers of value in the system, they lose the capability to trade for things, which negates that existing definition of economic value. It doesn't matter how many people would pay $5 for your widget if people have no economic utility relative to AGI, meaning they cannot trade that utility for goods.
In general that sort of definition of AGI being held reveals a bit of a deeper belief, which is that there is some version of economic value detached from the humans consuming it. Some sort of nebulous concept of progress, rather than the acknowledgement that for all of human history, progress and value have both been relative to the people themselves getting some form of value or progress. I suppose it generally points to the idea of an economy without consumers, which is always a pretty bizarre thing to consider, but in that case, wouldn't it just be a definition saying that "AGI is achieved when it can do things that the people who control the AI system think are useful." Since in that case, the economy would eventually largely consist of the people controlling the most economically valuable agents.
I suppose that's the whole point of the various alignment studies, but I do find it kind of interesting to think about the fact that even the concept of something being "economically valuable", which sounds very rigorous and measurable to many people, is so nebulous as to be dependent on our preferences and wants as a society.
belter · 1h ago
> Maybe they've got some massive earthshattering model release coming out next, who knows.
Nothing in the current technology offers a path to AGI. These models are fixed after training completes.
echoangle · 46m ago
Why do you think that AGI necessitates modification of the model during use? Couldn’t all the insights the model gains be contained in the context given to it?
godelski · 20m ago
Because time marches on and with it things change.
You could maybe accomplish this if you could fit all new information into context or with cycles of compression but that is kinda a crazy ask. There's too much new information, even considering compression. It certainly wouldn't allow for exponential growth (I'd expect sub linear).
I think a lot of people greatly underestimate how much new information is created every day. It's hard to appreciate if you're not working on any research and seeing how incremental but constant improvement compounds. But try just looking at whatever company you work for. Do you know everything that people did that day? It takes more time to generate information than to process it, so that's on your side, but do you really think you could keep up? Maybe at a very high level, but in that case you're missing a lot of information.
Think about it this way: if that could be done, then LLMs wouldn't need training or tuning, because you could do everything through prompting.
echoangle · 16m ago
The specific instance doesn’t need to know everything happening in the world at once to be AGI though. You could feed the trained model different contexts based on the task (and even let the model tell you what kind of raw data it wants) and it could still hypothetically be smarter than a human.
I’m not saying this is a realistic or efficient method to create AGI, but I think the argument „Model is static once trained -> model can’t be AGI“ is fallacious.
Like I already said, the model can remember stuff as long as it’s in the context. LLMs can obviously remember stuff they were told or output themselves, even a few messages later.
belter · 1m ago
AGI needs to genuinely learn and build new knowledge from experience, not just generate creative outputs based on what it has already seen.
LLMs might look "creative," but they are just remixing patterns from their training data and what is in the prompt. They can't actually update themselves or remember new things after training, as there is no ongoing feedback loop.
This is why you can’t send an LLM to medical school and expect it to truly “graduate”. It cannot acquire or integrate new knowledge from real-world experience the way a human can.
Without a learning feedback loop, these models are unable to interact meaningfully with a changing reality or fulfill the expectation from an AGI: Contribute to new science and technology.
godelski · 15m ago
> the model can remember stuff as long as it’s in the context.
You would need an infinite context or compression.
Also you might be interested in this theorem: https://en.wikipedia.org/wiki/Data_processing_inequality
Only if AGI would require infinite knowledge, which it doesn’t.
GaggiX · 2h ago
Compared to GPT-4 it is on a completely different level, given that it is a reasoning model, so in that regard it does deliver, and it's not just scaling. But for this I guess the revolution was o1, and GPT-5 is just a much more mature version of the technology.
cchance · 1h ago
SAM is a HYPE CEO, he literally hypes his company nonstop. Then the announcements come and... they're... ok, so people aren't really upset, but they end up feeling lackluster compared to the hype... until the next cycle comes around...
If you want actual big moves, watch google, anthropic, qwen, deepseek.
Qwen and Deepseek teams honestly seem so much better at under promising and over delivering.
Can't wait to see what Gemini 3 looks like too.
techpression · 2h ago
"They claim impressive reductions in hallucinations. In my own usage I’ve not spotted a single hallucination yet, but that’s been true for me for Claude 4 and o3 recently as well—hallucination is so much less of a problem with this year’s models."
This has me so confused, Claude 4 (Sonnet and Opus) hallucinates daily for me, on both simple and hard things. And this is for small isolated questions at that.
godelski · 2m ago
There were also several hallucinations during the announcement. (I also see hallucinations every time I use Claude and GPT, which is several times a week. Paid and free tiers)
So not seeing them means either lying or incompetence. I always try to attribute to stupidity rather than malice (Hanlon's razor).
The big problem of LLMs is that they optimize human preference. This means they optimize for hidden errors.
Personally I'm really cautious about using tools that have stealthy failure modes. They just lead to many problems and lots of wasted hours debugging, even when failure rates are low. It just causes everything to slow down for me, as I'm double-checking everything and need to be much more meticulous when I know errors are hard to see. It's like having a line of Python indented with an inconsistent whitespace character: impossible to see. But what if you didn't have the interpreter telling you which line failed, or the ability to search for or highlight those characters? At least in that case you'd know there's an error. It's hard enough dealing with human-generated invisible errors, but this just seems to perpetuate the LGTM crowd.
bluetidepro · 1h ago
Agreed. All it takes is a simple reply of “you’re wrong.” to Claude/ChatGPT/etc. and it will start to crumble on itself and get into a loop that hallucinates over and over. It won’t fight back, even if it happened to be right to begin with. It has no backbone to be confident it is right.
diggan · 41m ago
> All it takes is a simple reply of “you’re wrong.” to Claude/ChatGPT/etc. and it will start to crumble on itself and get into a loop that hallucinates over and over.
Yeah, it seems to be a terrible approach to try to "correct" the context by adding clarifications or telling it what's wrong.
Instead, start from 0 with the same initial prompt you used, but improve it so the LLM gets it right in the first response. If it still gets it wrong, begin from 0 again. The context seems to be "poisoned" really quickly, if you're looking for accuracy in the responses. So better to begin from the beginning as soon as it veers off course.
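For what it's worth, a minimal sketch of that restart-from-zero loop, assuming the OpenAI Python SDK's Responses API; the model name, the task, and the crude success check are placeholders, not a recommendation of specifics:

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def ask_fresh(prompt: str) -> str:
        # A self-contained, single-turn request: no prior messages and no
        # previous_response_id, so nothing from an earlier (possibly poisoned)
        # exchange carries over.
        response = client.responses.create(model="gpt-5-mini", input=prompt)
        return response.output_text

    prompt = "Summarize the changelog below and cite a line number for every claim.\n..."
    answer = ask_fresh(prompt)
    if "line" not in answer.lower():  # crude check that the instruction was followed
        # Don't reply "you're wrong" in the same thread; sharpen the prompt and start over.
        prompt += "\nOmit any claim you cannot tie to a specific line number."
        answer = ask_fresh(prompt)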
cameldrv · 1h ago
Yeah, it may be that in previous training data the model was given a strong negative signal when the human trainer told it it was wrong. In more subjective domains this might lead to sycophancy. If the human is always right and the data is always right, but the data can be interpreted multiple ways (like, say, human psychology), the model just adjusts to the opinion of the human.
If the question is about harder facts which the human disagrees with, this may put it into an essentially self-contradictory state, where the locus of possibilities gets squished from each direction, and so the model is forced to respond with crazy outliers which agree with both the human and the data. The probability of an invented reference being true may be very low, but from the model's perspective, it may still be one of the highest-probability outputs among a set of bad choices.
What it sounds like they may have done is just have the humans tell it it's wrong when it isn't, and then award it credit for sticking to its guns.
ashdksnndck · 48m ago
I put in the ChatGPT system prompt to be not sycophantic, be honest, and tell me if I am wrong. When I try to correct it, it hallucinates more complicated epicycles to explain how it was right the first time.
laacz · 1h ago
I suppose that Simon, being all in on LLMs for quite a while now, has developed a good intuition/feeling for framing questions so that they produce fewer hallucinations.
simonw · 1h ago
Yeah, I think that's exactly right. I don't ask questions that are likely to produce hallucinations (like asking an LLM without search access for citations from papers on a topic), so I rarely see them.
madduci · 1h ago
I believe it depends on the inputs. For me, Claude 4 has consistently generated hallucinations; it was especially confident in generating invalid JSON, for instance Grafana dashboards full of syntactic errors.
Yeah, hallucinations are very context-dependent. I'm guessing OP is working in very well-documented domains.
simonw · 1h ago
What kind of hallucinations are you seeing?
OtherShrezzing · 1h ago
I rewrote a 4 page document from first to third person a couple of weeks back. I gave Claude Sonnet 4 the document after editing, so it was entirely written in the third person. I asked it to review & highlight places where it was still in the first person.
>Looking through the document, I can identify several instances where it's written in the first person:
And it went on to show a series of "they/them" statements. I asked it to clarify if "they" is "first person" and it responded
>No, "they" is not first person - it's third person. I made an error in my analysis. First person would be: I, we, me, us, our, my. Second person would be: you, your. Third person would be: he, she, it, they, them, their. Looking back at the document more carefully, it appears to be written entirely in third person.
Even the good models are still failing at real-world use cases which should be right in their wheelhouse.
simonw · 1h ago
That doesn't quite fit the definition I use for "hallucination" - it's clearly a dumb error, but the model didn't confidently state something that's not true (like naming the wrong team who won the Super Bowl).
OtherShrezzing · 39m ago
>"They claim impressive reductions in hallucinations. In my own usage I’ve not spotted a single hallucination yet, but that’s been true for me for Claude 4 and o3 recently as well—hallucination is so much less of a problem with this year’s models."
Could you give an estimate of how many "dumb errors" you've encountered, as opposed to hallucinations? I think many of your readers might read "hallucination" and assume you mean "hallucinations and dumb errors".
jmull · 29m ago
That's a good way to put it.
As a user, when the model tells me things that are flat out wrong, it doesn't really matter whether it would be categorized as a hallucination or a dumb error. From my perspective, those mean the same thing.
techpression · 46m ago
Since I mostly use it for code, made-up function names are the most common. And of course just broken code altogether, which might not count as a hallucination.
drumhead · 1h ago
"Are you GPT5" - No I'm 4o, 5 hasnt been released yet. "It was released today". Oh you're right, Im GPT5. You have reached the limit of the free usage of 4o
hodgehog11 · 3h ago
The aggressive pricing here seems unusual for OpenAI. If they had a large moat, they wouldn't need to do this. Competition is fierce indeed.
FergusArgyll · 2h ago
They are winning by massive margins in the app, but losing (!) in the API to Anthropic: https://finance.yahoo.com/news/enterprise-llm-spend-reaches-...
Perhaps they're feeling the effect of losing PRO clients (like me) lately.
Their PRO models were not (IMHO) worth 10X that of PLUS!
Not even close.
Especially when new competitors (eg. z.ai) are offering very compelling competition.
ilaksh · 2h ago
It's like 5% better. I think they obviously had no choice but to be price competitive with Gemini 2.5 Pro. Especially for Cursor to change their default.
impure · 2h ago
The 5 cents for Nano is interesting. Maybe it will force Google to start dropping their prices again which have been slowly creeping up recently.
0x00cl · 3h ago
Maybe they need/want the data.
impure · 2h ago
OpenAI and most AI companies do not train on data submitted to a paid API.
dortlick · 1h ago
Why don't they?
echoangle · 40m ago
They probably fear that people wouldn’t use the API otherwise, I guess. They could have different tiers though where you pay extra so your data isn’t used for training.
anhner · 36m ago
If you believe that, I have a bridge I can sell you...
WhereIsTheTruth · 2h ago
They also do not train using copyrighted material /s
simonw · 1h ago
That's different. They train on scrapes of the web. They don't train on data submitted to their API by their paying customers.
johnnyanmac · 1h ago
If they're bold enough to say they train on data they do not own, I am not optimistic when they say they don't train on data people willingly submit to them.
simonw · 1h ago
I don't understand your logic there.
They have confessed to doing a bad thing - training on copyrighted data without permission. Why does that indicate they would lie about a worse thing?
johnnyanmac · 1h ago
>Why does that indicate they would lie about a worse thing?
Because they know their audience. It's an audience that also doesn't care for copyright and would love for them to win their court cases. They are fine making such an argument to those kinds of people.
Meanwhile, when legal ran a very typical subpoena process on said data (data they chose to submit to an online server of their own volition), the same audience completely freaked out. Suddenly, they felt like their privacy was invaded.
It doesn't make any logical sense in my mind, but a lot of the discourse over this topic isn't based on logic.
daveguy · 2h ago
Oh, they never even made that promise. They're trying to say it's fine to launder copyright material through a model.
dr_dshiv · 2h ago
And it’s a massive distillation of the mother model, so the costs of inference are likely low.
bdcdo · 3h ago
"GPT-5 in the API is simpler: it’s available as three models—regular, mini and nano—which can each be run at one of four reasoning levels: minimal (a new level not previously available for other OpenAI reasoning models), low, medium or high."
Is it actually simpler? For those who are currently using GPT 4.1, we're going from 3 options (4.1, 4.1 mini and 4.1 nano) to at least 8, if we don't consider gpt 5 regular - we now will have to choose between gpt 5 mini minimal, gpt 5 mini low, gpt 5 mini medium, gpt 5 mini high, gpt 5 nano minimal, gpt 5 nano low, gpt 5 nano medium and gpt 5 nano high.
And, while choosing between all these options, we'll always have to wonder: should I try adjusting the prompt that I'm using, or simply change the gpt 5 version or its reasoning level?
mwigdahl · 3h ago
If reasoning is on the table, then you already had to add o3-mini-high, o3-mini-medium, o3-mini-low, o4-mini-high, o4-mini-medium, and o4-mini-low to the 4.1 variants. The GPT-5 way seems simpler to me.
impossiblefork · 3h ago
Yes, I think so. It's n=1,2,3 m=0,1,2,3. There's structure and you know that each parameter goes up and in which direction.
makeramen · 3h ago
But given the option, do you choose bigger models or more reasoning? Or medium of both?
paladin314159 · 3h ago
If you need world knowledge, then bigger models. If you need problem-solving, then more reasoning.
But the specific nuance of picking nano/mini/main and minimal/low/medium/high comes down to experimentation and what your cost/latency constraints are.
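A minimal sketch of that experimentation, assuming the OpenAI Python SDK's Responses API and the GPT-5 model names from the article; sweep the size/effort grid on your own task and compare quality, latency and cost:

    from openai import OpenAI

    client = OpenAI()

    SIZES = ["gpt-5", "gpt-5-mini", "gpt-5-nano"]
    EFFORTS = ["minimal", "low", "medium", "high"]

    def run(prompt: str, size: str, effort: str) -> str:
        # One call per (size, effort) combination; reasoning effort is a request
        # parameter here rather than a separate model name as with o3/o4-mini.
        response = client.responses.create(
            model=size,
            input=prompt,
            reasoning={"effort": effort},
        )
        return response.output_text

    # Sweep the grid on a task you actually care about and compare the answers
    # (and the latency/cost you observe) before settling on a combination.
    for size in SIZES:
        for effort in EFFORTS:
            print(size, effort, run("Extract the invoice total from: ...", size, effort)[:80])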
impossiblefork · 3h ago
I would have to get experience with them. I mostly use Mistral, so I have only the choice of thinking or not thinking.
gunalx · 1h ago
Mistral also has small, medium and large, with both small and medium having a thinking variant, plus devstral, codestral, etc.
Not really that much simpler.
impossiblefork · 1h ago
Ah, but I never route to these manually. I only use LLMs a little bit, mostly to try to see what they can't do.
namibj · 3h ago
Depends on what you're doing.
addaon · 3h ago
> Depends on what you're doing.
Trying to get an accurate answer (best correlated with objective truth) on a topic I don't already know the answer to (or why would I ask?). This is, to me, the challenge with the "it depends, tune it" answers that always come up in how to use these tools -- it requires the tools to not be useful for you (because there's already a solution) to be able to do the tuning.
wongarsu · 2h ago
If cost is no concern (as in infrequent one-off tasks) then you can always go with the biggest model with the most reasoning. Maybe compare it with the biggest model with no/less reasoning, since sometimes reasoning can hurt (just as with humans overthinking something).
If you have a task you do frequently you need some kind of benchmark. Which might just be comparing how good the output of the smaller models holds up to the output of the bigger model, if you don't know the ground truth
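One hedged sketch of such a benchmark, assuming the OpenAI Python SDK's Responses API: let the bigger model produce a reference answer and then grade the smaller model against it (model names and the grading prompt are illustrative, not a standard):

    from openai import OpenAI

    client = OpenAI()

    def ask(model: str, prompt: str) -> str:
        return client.responses.create(model=model, input=prompt).output_text

    def compare(task: str, small: str = "gpt-5-mini", big: str = "gpt-5") -> str:
        reference = ask(big, task)    # treat the big model's answer as the reference
        candidate = ask(small, task)  # the cheaper model you'd like to use day-to-day
        grading_prompt = (
            f"Task:\n{task}\n\nReference answer:\n{reference}\n\n"
            f"Candidate answer:\n{candidate}\n\n"
            "Rate the candidate 1-5 for agreement with the reference, then explain briefly."
        )
        return ask(big, grading_prompt)  # no ground truth, so the big model judges

    print(compare("Summarize the trade-offs between more reasoning and a bigger model."))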
vineyardmike · 2h ago
When I read “simpler” I interpreted that to mean they don’t use their Chat-optimized harness to guess which reasoning level and model to use. The subscription chat service (ChatGPT) and the chat-optimized model on their API seem to have a special harness that changes reasoning based on some heuristics, and will switch between the model sizes without user input.
With the API, you pick a model size and reasoning effort. Yes, more choices, but also a clear mental model and a simple choice that you control.
hirako2000 · 2h ago
Ultimately they are selling tokens, so try many times.
joshmlewis · 25m ago
It seems to be trained to use tools effectively to gather context. In this example against 4.1 and o3 it used 6 in the first turn in a pretty cool way (fetching different categories that could be relevant): https://promptslice.com/share/b-2ap_rfjeJgIQsG
Token use increases with that kind of tool calling, but the aggressive pricing should make that moot. You could probably get it to not be so tool-happy with prompting as well.
Can anyone explain to me why they've removed parameter controls for temperature and top-p in reasoning models, including gpt-5? It strikes me that it makes it harder to build with these to do small tasks requiring high-levels of consistency, and in the API, I really value the ability to set certain tasks to a low temp.
Der_Einzige · 2h ago
It's because all forms of sampler settings destroy safety/alignment. That's why top_p/top_k are still used and not tfs, min_p, top_n sigma, etc., and why temperature is locked to an arbitrary 0-2 range, etc.
Open source is years ahead of these guys on samplers. It's why their models being so good is that much more impressive.
oblio · 2h ago
Temperature is the response variation control?
empiko · 4h ago
Despite the fact that their models are used in hiring, business, education, etc., this multibillion-dollar company uses one benchmark with very artificial questions (BBQ) to evaluate how fair their model is. I am a little bit disappointed.
cainxinth · 55m ago
It’s fascinating and hilarious that pelican on a bicycle in SVG is still such a challenge.
anyg · 3h ago
Good to know -
> Knowledge cut-off is September 30th 2024 for GPT-5 and May 30th 2024 for GPT-5 mini and nano
falcor84 · 3h ago
Oh wow, so essentially a full year of post-training and testing. Or was it ready and there was a sufficiently good business strategy decision to postpone the release?
thorum · 2h ago
The Information’s report from earlier this month claimed that GPT-5 was only developed in the last 1-2 months, after some sort of breakthrough in training methodology.
> As recently as June, the technical problems meant none of OpenAI’s models under development seemed good enough to be labeled GPT-5, according to a person who has worked on it.
But it could be that this refers to post-training and the base model was developed earlier.
https://www.theinformation.com/articles/inside-openais-rocky...
https://archive.ph/d72B4
My understanding is that training data cut-offs and the dates at which the models were trained are independent things.
AI labs gather training data and then do a ton of work to process it, filter it etc.
Model training teams run different parameters and techniques against that processed training data.
It wouldn't surprise me to hear that OpenAI had collected data up to September 2024, dumped that data in a data warehouse of some sort, then spent months experimenting with ways to filter and process it and different training parameters to run against it.
bhouston · 2h ago
Weird to have such an early knowledge cutoff. Claude 4.1 has March 2025, six months more recent, with comparable results.
dortlick · 1h ago
Yeah I thought that was strange. Wouldn't it be important to have more recent data?
bn-l · 1h ago
Is that late enough for it to have heard of svelte 5?
diggan · 3h ago
> but for the moment here’s the pelican I got from GPT-5 running at its default “medium” reasoning effort:
Would have been interesting to see a comparison between low, medium and high reasoning_effort pelicans :)
When I've played around with GPT-OSS-120b recently, it seems the difference in the final answer is huge: "low" is essentially "no reasoning", while with "high" it can spend a seemingly endless amount of tokens. I'm guessing the difference with GPT-5 will be similar?
simonw · 3h ago
> Would have been interesting to see a comparison between low, medium and high reasoning_effort pelicans
Yeah, I'm working on that - expect dozens more pelicans in a later post.
ks2048 · 5h ago
So, "system card" now means what used to be a "paper", but without lots of the details?
simonw · 4h ago
AI labs tend to use "system cards" to describe their evaluation and safety research processes.
They used to be more about the training process itself, but that's increasingly secretive these days.
kaoD · 4h ago
Nope. System card is a sales thing. I think we generally call that "product sheet" in other markets.
pancakemouse · 3h ago
Practically the first thing I do after a new model release is try to upgrade `llm`. Thank you, @simonw !
https://llm.datasette.io/en/stable/openai-models.html
I'm curious what platform people are using to test GPT-5? I'm so deep into the claude code world that I'm actually unsure what the best option is outside of claude code...
simonw · 41m ago
I've been using codex CLI, OpenAI's Claude Code equivalent. I've also been hitting the API directly with "reasoning": {"summary": "auto"} set in the request body; here's the response from that API call: https://gist.github.com/simonw/1d1013ba059af76461153722005a0...
Without that option the API will often provide a lengthy delay while the model burns through thinking tokens until you start getting back visible tokens for the final response.
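A minimal sketch of that kind of request with the OpenAI Python SDK; the way the reasoning summary is read back out of response.output is my assumption about the Responses API shape, not a reproduction of the exact call above:

    from openai import OpenAI

    client = OpenAI()

    response = client.responses.create(
        model="gpt-5",
        input="Generate an SVG of a pelican riding a bicycle",
        reasoning={"summary": "auto"},  # the option referred to above
    )

    # Reasoning summaries come back as separate "reasoning" items in
    # response.output, alongside the final message; output_text is the answer.
    for item in response.output:
        if item.type == "reasoning":
            for part in item.summary:
                print("reasoning summary:", part.text)

    print(response.output_text)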
justusthane · 2h ago
> a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent
This is sort of interesting to me. It strikes me that so far we've had more or less direct access to the underlying model (apart from the system prompt and guardrails), but I wonder if going forward there's going to be more and more infrastructure between us and the model.
hirako2000 · 2h ago
Consider it low-level routing, keeping in mind it allows the non-active parts to not be in memory. Mistral afaik came up with this concept quite a while back.
nickthegreek · 4h ago
This new naming convention, while not perfect, is a lot clearer, and I am sure it will help my coworkers.
Leary · 4h ago
METR of only 2 hours and 15 minutes. Fast takeoff less likely.
Not massively off -- manifold yesterday implied odds this low were ~35%. 30% before Claude Opus 4.1 came out which updated expected agentic coding abilities downward.
qsort · 3h ago
Thanks for sharing, that was a good thread!
dingnuts · 4h ago
It's not surprising to AI critics but go back to 2022 and open r/singularity and then answer: what "people" were expecting? Which people?
SamA has been promising AGI next year for three years like Musk has been promising FSD next year for the last ten years.
IDK what "people" are expecting but with the amount of hype I'd have to guess they were expecting more than we've gotten so far.
The fact that "fast takeoff" is a term I recognize indicates that some people believed OpenAI when they said this technology (transformers) would lead to sci fi style AI and that is most certainly not happening
ToValueFunfetti · 3h ago
>SamA has been promising AGI next year for three years like Musk has been promising FSD next year for the last ten years.
Has he said anything about it since last September:
>It is possible that we will have superintelligence in a few thousand days (!); it may take longer, but I’m confident we’ll get there.
This is, at an absolute minimum, 2000 days = 5 years. And he says it may take longer.
Did he even say AGI next year any time before this? It looks like his predictions were all pointing at the late 2020s, and now he's thinking early 2030s. Which you could still make fun of, but it just doesn't match up with your characterization at all.
falcor84 · 3h ago
I would say that there are quite a lot of roles where you need to do a lot of planning to effectively manage an ~8 hour shift, but then there are good protocols for handing over to the next person. So once AIs get to that level (in 2027?), we'll be much closer to AIs taking on "economically valuable work".
The 2h 15m is the length of tasks the model can complete with 50% probability. So longer is better in that sense. Or at least, "more advanced" and potentially "more dangerous".
Only a third cheaper than Sonnet 4? Incrementally better I suppose.
> and minimizing sycophancy
Now we're talking about a good feature! Actually one of my biggest annoyances with Cursor (that mostly uses Sonnet).
"You're absolutely right!"
I mean not really Cursor, but ok. I'll be super excited if we can get rid of these sycophancy tokens.
nosefurhairdo · 2h ago
In my early testing gpt5 is significantly less annoying in this regard. Gives a strong vibe of just doing what it's told without any fluff.
logicchains · 3h ago
>Only a third cheaper than Sonnet 4?
The price should be compared to Opus, not Sonnet.
cco · 2h ago
Wow, if so, 7x cheaper. Crazy if true.
cchance · 1h ago
Its basically opus 4.1 ... but cheaper?
gwd · 1h ago
Cheaper is an understatement... it's less than 1/10 the price for input and nearly 1/8 for output. Part of me wonders if they're using their massive new investment to sell API access below cost and drive out the competitors. If they're really getting Opus 4.1 performance for half of Sonnet's compute cost, they've done really well.
diggan · 43m ago
I'm not sure I'd be surprised, I've been playing around with GPT-OSS last few days, and the architecture seems really fast for the accuracy/quality of responses, way better than most local weights I've tried for the last two years or so. And since they released that architecture publicly, I'd imagine they're sitting on something even better privately.
isoprophlex · 2h ago
Whoa this looks good. And cheap! How do you hack a proxy together so you can run Claude Code on gpt-5?!