Between Opus and GPT-5, it's not clear there's a substantial difference in software development expertise. The metric I can't seem to get past in my attempts to use these systems is context awareness over long-running tasks. Producing a very complex, context-exceeding objective is a daily (maybe hourly) occurrence for me. All I care about is how these systems manage context and stay on track over extended periods of time.
What eval is tracking that? It seems like it's potentially the most important metric for real-world software engineering, not one-shot vibe prayers.
abossy · 1h ago
At my company (Charlie Labs), we've had a tremendous amount of success with context awareness over long-running tasks with GPT-5 since getting access a few weeks ago. We ran an eval to solve 10 real GitHub issues so that we could measure this against Claude Code, and the differences were surprisingly large. You can see our write-up here: https://charlielabs.ai/research/gpt-5
Often, our tasks take 30-45 minutes, and it can handle massive context threads in Linear or GitHub without getting tripped up by things like changes in direction partway through the thread.
While 10 issues isn't super comprehensive, we found it directionally very impressive, and we'll likely build upon it to better understand performance going forward.
bartman · 28m ago
I am not (usually) photosensitive, but the animated static noise on your website causes noticeable flickering on the various screens I use and made it impossible for me to read your article.
For better accessibility and a safer experience[1] I would recommend not animating the background, or at least making it easy to toggle off.

[1] https://developer.mozilla.org/en-US/docs/Web/Accessibility/G...
Totally agree. At the moment I find that frontier LLMs are able to solve most of the problems I throw at them given enough context. Most of my time is spent working out what context they're missing when they fail. So the thing that would help me most is a much more focused ability to gather context.
For my use cases, this mostly means being able to really home in on the relevant code files, issues, discussions, and PRs. I'm hopeful that GPT-5 will be a step forward in this regard that isn't fully captured in the benchmark results. It's certainly promising that it can achieve similar results more cheaply than e.g. Opus.
swader999 · 3h ago
If GPT-5 truly has 400k context, that might be all it needs to meaningfully surpass Opus.
andrewmutz · 2h ago
Having a large context window is very different from being able to effectively use a lot of context.
To get great results, it's still very important to manage context well. Even if the model allows a very large context window, you can't just throw in the kitchen sink and expect good results.
dimal · 2h ago
Even with large contexts there are diminishing returns. Just having the ability to stuff more tokens into the context doesn't mean the model can use them effectively. As far as I can tell, they always reach a point at which more information makes things worse.
Byamarro · 2h ago
The real question is its tendency toward context rot rather than the size of its context :)
LLMs are supposedly able to load 3 bibles into their context, but they forget what they were about to do after loading 600 LoC of locales.
simonw · 3h ago
It's 272,000 input tokens and 128,000 output tokens.
6thbit · 8m ago
Oh, I had not grasped that the advertised “context window” size had to include both input and output.
But is it really 272k even if the output is, say, 10k? Because it does say “max output” in the docs, so I wonder.
zurfer · 52m ago
Woah that's really kind of hidden. But I think you can specify max output tokens. Need to test that!
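In case it helps anyone else test it, here's a minimal sketch with the openai Python SDK (assuming the Responses API and its max_output_tokens parameter; the prompt and the cap are just placeholders):

    # Minimal sketch: cap generated tokens while leaving the rest of the
    # window for input. Assumes OPENAI_API_KEY is set in the environment.
    from openai import OpenAI

    client = OpenAI()
    resp = client.responses.create(
        model="gpt-5",
        input="Summarize the trade-offs of a 272k-input / 128k-output split.",
        max_output_tokens=10_000,
    )
    print(resp.output_text)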
AS04 · 3h ago
400k context with 100% on the fiction livebench would make GPT-5 indisputably the best model IMHO. Don't think it will achieve that though, sadly.
tekacs · 2h ago
Coupled with the humongous price difference...
nadis · 2h ago
It's pretty vague, but the OP had this callout:
>"GPT‑5 is the strongest coding model we’ve ever released. It outperforms o3 across coding benchmarks and real-world use cases, and has been fine-tuned to shine in agentic coding products like Cursor, Windsurf, GitHub Copilot, and Codex CLI. GPT‑5 impressed our alpha testers, setting records on many of their private internal evals."
logicchains · 2h ago
> Between Opus and GPT-5, it's not clear there's a substantial difference in software development expertise.
If there's no substantial difference in software development expertise then GPT-5 absolutely blows Opus out of the water due to being almost 10x cheaper.
spiderice · 6m ago
Does OpenAI provide a $200/month option that lets me use as much GPT-5 as I want inside of Codex?
Because if not, I'd still go with Opus + Claude Code. I'd rather be able to tell my employer, "this will cost you $200/month" than "this might cost you less than $200/month, but we really don't know because it's based on usage"
Doesn't look like it. Unless they add fixed pricing, Claude IMO would still be better from a developer POV.
realusername · 3h ago
Personally I think I'll wait for another 10x improvement for coding, because with the way it's currently going, they clearly need it.
fsloth · 3h ago
From my experience, when used through an IDE such as Cursor, the current-gen Claude model enables impressive speedruns through commodity tasks. My context is a CAD application I’ve been writing as a hobby. I used to work in that field for a decade, so I have a pretty good feel for how long I’d expect tasks to take. I’m using mostly the same software stack as at that previous job and am definitely getting stuff done much faster on holiday at home than I did at work. Of course the codebase is also a lot smaller, there’s intrinsic motivation, etc., but still.
realusername · 2h ago
I've done pretty much the same as you (Cursor/Claude) for our large Rails/React codebase at work and the experience has been horrific so far; I reverted back to VS Code.
42lux · 3h ago
How often do you have to build the simple scaffolding though?
bdangubic · 3h ago
> context awareness over long-running tasks
Don’t have long-running tasks, LLMs or not. Break the problem down into small manageable chunks and then assemble it. Neither humans nor LLMs are good at long-running tasks.
bastawhiz · 3h ago
> neither humans nor llms are good at long-running tasks.
That's a wild comparison to make. I can easily work for an hour. Cursor can hardly work for a continuous pomodoro. "Long-running" is not a fixed size.
bdangubic · 1h ago
I just finished my workday, 8hrs with Claude Code. No single task took more than 20 minutes total. Cleared context after each task and asked it to summarize for itself the previous task before I cleared context. If I ran this as a continuous 8hr task it would have died after 35-ish minutes. Just know the limitations (like with any other tool) and you’ll be good :)
0x457 · 1h ago
I always find it wild that none of these tools use VCS: complete a logical unit of work, make a commit, drop the entire context related to that commit while keeping a reference to it, continue on to the next stage, rinse and repeat.
Claude always misunderstands how the API exported by my service works, and after every compaction it forgets it all over again and commits "oh, the API has changed since last time I used it, let me use different query parameters". My brother in Christ, nothing has changed, and you are the one who made this API.
echelon · 2h ago
Humans can error correct.
LLMs multiply errors over time.
beoberha · 3h ago
A series of small manageable chunks becomes a long running task :)
If LLMs are going to act as agents, they need to maintain context across these chunks.
vaenaes · 2h ago
You're holding it wrong
risho · 3h ago
Over the last week or so I have put probably close to 70 hours into playing around with Cursor and Claude Code and a few other tools (it's become my new obsession). I've been blown away by how good and reliable it is now. That said, the reality is that in my experience the only models that actually work in any sort of reliable way are Claude models. I don't care what any benchmark says, because the only thing that actually matters is actual use. I'm really hoping that this new GPT model actually works for this use case, because competition is great and the price is also great.
neuronexmachina · 2h ago
> That said the reality is in my experience the only models that actually work in any sort of reliable way are claude models.
Anecdotally, the tool updates in the latest Cursor (1.4) seem to have made tool usage in models like Gemini much more reliable. Previously it would struggle to make simple file edits, but now the edits work pretty much every time.
rcarr · 2h ago
I think some of this might come down to stack as well. I watched a t3.gg video[1] recently about Convex[2] and how the nature of it leads to the AI getting things right the first time more often. I've been playing around with it the last few days and I think I agree with him.
I think the dev workflow is going to fundamentally change, because to maximise productivity out of this you need multiple AIs working in parallel. So rather than jumping straight into coding, we're going to end up writing a bunch of tickets in a PM tool (Linear[3] looks like it's winning the race atm), working out (or using the AI to work out) which ones can be run in parallel without causing merge conflicts, then pulling multiple tickets into your IDE/terminal, cycling through the tabs and jumping in as needed.
Atm I'm still not really doing this but I know I need to make the switch and I'm thinking that Warp[4] might be best suited for this kind of workflow, with the occasional switch over to an IDE when you need to jump in and make some edits.
Oh also, to achieve this you need to use git worktrees[5,6,7].

[1]: https://www.youtube.com/watch?v=gZ4Tdwz1L7k
[2]: https://www.convex.dev/
[3]: https://linear.app/
[4]: https://www.warp.dev/
[5]: https://docs.anthropic.com/en/docs/claude-code/common-workfl...
[6]: https://git-scm.com/docs/git-worktree
[7]: https://www.tomups.com/posts/git-worktrees/
Sure sounds interesting but... Where on earth do you actually find the time to sit through a 1.5 hour yt video?!
burnished · 20m ago
1.5x and 2x speed help a lot, slow down or repeat segments as needed, don't be afraid to fast forward past irrelevant looking bits (just be eager to backtrack).
mafro · 11m ago
Ask an LLM to transcribe and give the overview and key points
rcarr · 1h ago
Jump in and start coding entire backend with stack not best suited for job and modern AI tools: most likely future hours lost.
Spend 1.5 hours now to learn from an experienced dev on a stack that is better suited for job: most likely future hours gained.
v5v3 · 1h ago
People find time for things that seem important to them.
throwaway_2898 · 3h ago
How much of the product were you able to build to say it was good/reliable? IME, 70 hours can get you to a PoC that "works", but does it do well once you start layering features beyond the initial set, like say a first draft of all the APIs?
petralithic · 1h ago
This has been my experience. The greenfield approach works up to a point, then it just breaks.
ralfd · 3h ago
Just replying to ask you next week what your assessment of GPT-5 is.
zarzavat · 2h ago
The magic is the prompting/tool use/finetuning.
I find that OpenAI's reasoning models write better code and are better at raw problem solving, but Claude Code is a much more useful product, even if the model itself is weaker.
Centigonal · 3h ago
Ditto here, except I'm using Roo, and it's Claude and Gemini 2.5 Pro that work for me.
pamelafox · 3h ago
I am testing out gpt-5-mini for a RAG scenario, and I'm impressed so far.
I used gpt-5-mini with reasoning_effort="minimal", and that model finally resisted a hallucination that every other model generated.
Screenshot in post here: https://bsky.app/profile/pamelafox.bsky.social/post/3lvtdyvb...
I'll run formal evaluations next.
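For anyone wanting to try the same kind of setup, a rough sketch (assuming the openai Python SDK and the reasoning_effort parameter as described above; the grounding prompt and helper are mine, not the actual code behind that screenshot):

    # Rough sketch of a grounded-answer RAG call with minimal reasoning effort.
    from openai import OpenAI

    client = OpenAI()

    def answer(question: str, sources: list[str]) -> str:
        # Instruct the model to answer only from the retrieved sources,
        # and to say "I don't know" rather than hallucinate.
        system = (
            "Answer ONLY using the sources below. "
            'If they do not contain the answer, reply "I don\'t know."\n\n'
            + "\n\n".join(sources)
        )
        completion = client.chat.completions.create(
            model="gpt-5-mini",
            reasoning_effort="minimal",
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": question},
            ],
        )
        return completion.choices[0].message.content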
GPT4: Collaborating with engineering, sales, marketing, finance, external partners, suppliers and customers to ensure …… etc
GPT5: I don't know.
Upon speaking these words, AI was enlightened.
ComputerGuru · 1h ago
That is genuinely nice to see. What are you using for the embeddings?
pamelafox · 1h ago
We use text-embedding-3-large, with both quantization and MRL reduction, plus oversampling on the search to compensate for the compression.
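For the curious, a toy sketch of what MRL reduction plus oversampled search can look like (the `dimensions` parameter requests a Matryoshka-truncated embedding; the dimension count, int8 scheme, and oversampling factor here are illustrative, not the actual production settings):

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts: list[str], dims: int = 256) -> np.ndarray:
        # MRL reduction: ask the API for truncated embeddings, then normalize.
        resp = client.embeddings.create(
            model="text-embedding-3-large", input=texts, dimensions=dims
        )
        v = np.array([d.embedding for d in resp.data], dtype=np.float32)
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    def quantize(vecs: np.ndarray) -> np.ndarray:
        # Crude scalar quantization to int8 for the stored index.
        return np.clip(np.round(vecs * 127), -127, 127).astype(np.int8)

    def search(query_vec, docs_f32, docs_i8, k=5, oversample=4):
        # Cheap pass over the quantized index, oversampled to recover recall...
        coarse = (docs_i8.astype(np.float32) / 127) @ query_vec
        candidates = np.argsort(coarse)[::-1][: k * oversample]
        # ...then rerank only those candidates against full-precision vectors.
        fine = docs_f32[candidates] @ query_vec
        return candidates[np.argsort(fine)[::-1][:k]]

In a real system the full-precision vectors would typically sit in cheaper storage and only be pulled in for that rescoring step.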
0x457 · 1h ago
I get the "good" result with phi-4 and gemma-3n in RAG scenario - i.e. it only used context provided to answer and couldn't answer questions if context lacked the answer without hallucination.
potatolicious · 2h ago
This feels like honestly the biggest gain/difference. I work on things that do a lot of tool calling, and the model hallucinating fake tools is a huge problem. Worse, sometimes the model will hallucinate a response directly without ever generating the tool call.
The new training rewards that suppress hallucinations and tool-skipping hopefully push us in the right direction.
6thbit · 5m ago
Can anyone share their experience with Codex CLI? I feel like that’s not mentioned enough, and GPT-5 is already the default model there.
macawfish · 2m ago
Not good, sadly. Claude Code seems so much better in terms of overall polish but also in how it handles context. I don't really want to throw the LLM into the deep end without proper tools and context, and I get the sense that this is what was happening in Codex.
jumploops · 3h ago
If the model is as good as the benchmarks say, the pricing is fantastic:
Input: $1.25 / 1M tokens (cached: $0.125 / 1M tokens); Output: $10 / 1M tokens
For context, Claude Opus 4.1 is $15 / 1M for input tokens and $75/1M for output tokens.
The big question remains: how well does it handle tools? (i.e. compared to Claude Code)
Initial demos look good, but it performs worse than o3 on Tau2-bench airline, so the jury is still out.
addaon · 3h ago
> Output: $10 / 1M tokens
It's interesting that they're using flat token pricing for a "model" that is explicitly made of (at least) two underlying models, one with much lower compute costs than the other, and with user ability to at least influence (via prompt) if not choose which model is being used. I have to assume this pricing model is based on a predicted split between how often the underlying models get used; I wonder if that will hold up, if users will instead try to rouse the better model into action more than expected, or if the pricing is so padded that it doesn't matter.
mkozlows · 2h ago
That's how the browser-based ChatGPT works, but not the API.
simianwords · 3h ago
> that is explicitly made of (at least) two underlying models
what do you mean?
addaon · 2h ago
> a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt).
From https://openai.com/index/gpt-5-system-card/
In the API, there’s no router. Developers just pick whether they use the reasoning model or non-thinking ChatGPT model.
croemer · 3h ago
> GPT‑5 also excels at long-running agentic tasks—achieving SOTA results on τ2-bench telecom (96.7%), a tool-calling benchmark released just 2 months ago.
Yes, but it does worse than o3 on the airline version of that benchmark. The prose is totally cherry-picked.
tedsanders · 43m ago
I wrote that section and made the graphs, so you can blame me. We no doubt highlight the evals that make us look good, but in this particular case I think the emphasis on telecom isn't unprincipled cherry picking.
Telecom was made after retail & airline, and fixes some of their problems. In retail and airline, the model is graded against a ground truth reference solution. But in reality, there can be multiple solutions that solve the problem, and perfectly good answers can receive scores of 0 by the automatic grading. This, along with some user model issues, is partly why airline and retail scores haven't climbed with the latest generations of models and are stuck around 60% / 80%. Even a literal superintelligence would probably plateau here.
In telecom, the authors (Barres et al.) made the grading less brittle by grading against outcome states, which may be achieved via multiple solutions, rather than by matching against a single specific solution. They also improved the user modeling and some other things too. So telecom is the much better eval, with a much cleaner signal, which is partly why models can score as high as 97% instead of getting mired at 60%/80% due to brittle grading and other issues.
Even if I had never seen GPT-5's numbers, I would have told you ahead of time telecom is much better than airline/retail for measuring tool use.
Incidentally, another thing to keep in mind when critically looking at OpenAI and others reporting their scores on these evals is that the evals give no partial credit - so sometimes you can have very good models that do all but one thing perfectly, but that one quirk gives them a very poor score. If you tried generalizing to tasks that don't trigger that quirk, you might get much better performance than the eval scores suggest (or vice versa, if they trigger a quirk not present in the eval).
Here's the tau2-bench paper if anyone wants to read more: https://arxiv.org/abs/2506.07982
How does the cost compare though? From my understanding o3 is pretty expensive to run. Is GPT-5 less costly? If it performs close to o3 but costs less, then it may still be a good improvement.
low_tech_punk · 3h ago
I find it strange that GPT-5 is cheaper than GPT-4.1 on input tokens and only slightly more expensive on output tokens. Is it marketing, or does it actually reflect the underlying compute resources?
AS04 · 3h ago
Very likely to be an actual reflection. That's probably their real achievement here and the key reason why they are actually publishing it as GPT-5. More or less the best or near to it on everything while being one model, substantially cheaper than the competition.
ComputerGuru · 1h ago
But it can’t do audio in/out or image out. Feels like an architectural step back.
conradkay · 54m ago
My understanding is that image output is pretty separate and if it doesn’t seem that way, they’re just abstracting several models into one name
bn-l · 1h ago
Maybe with the router mechanism (to mini or standard) they estimate the average cost will be a lot lower for ChatGPT, because the capable model won’t be answering dumb questions, and then they pass that on to devs?
low_tech_punk · 1h ago
I think the router applies to the ChatGPT app. The developer APIs expose manual control to select the specific model and level of reasoning.
jstummbillig · 1h ago
I mean... they themselves included that information in the post. It's not exactly a gotcha.
mehmetoguzderin · 3h ago
Context-free grammar and regex support are exciting. I wonder what differences, if any, there are from the Lark-like CFG of llguidance, which powers the JSON schema of the OpenAI API [^1].

[^1]: https://github.com/guidance-ai/llguidance/blob/f4592cc0c783a...
Yeah that was the only exciting part of the announcement for me haha. Can't wait to play around with it.
I'm already running into a bunch of issues with the structured output APIs from other companies like Google and OpenAI have been doing a great job on this front.
chrisweekly · 1h ago
> "I'm already running into a bunch of issues with the structured output APIs from other companies like Google and OpenAI have been doing a great job on this front."
This run-on sentence swerved at the end; I really can't tell what your point is. Could you reword it for clarity?
petercooper · 1h ago
I read it as "... from other companies, like Google, and OpenAI have been doing a great job on this front"
wewewedxfgdf · 48m ago
Tried it on a tough problem.
GPT-5 solved the problem - which Gemini failed to solve - then failed 6 times in a row to write the code to fix it.
I then gave ChatGPT-5's problem analysis to Google Gemini and it immediately implemented the correct fix.
The lesson - ChatGPT is good at analysis and code reviews, not so good at coding.
cperkins · 16s ago
I have something that both Gemini (via GCA) and Copilot (Claude) analyzed and came up with the same diagnosis. Each of them proposed the exact same wrong solution, and when I pointed that out, got it even more wrong.
I haven't tried ChatGPT on it yet, hoping to do so soon.
low_tech_punk · 3h ago
The ability to specify a context-free grammar as output constraint? This blows my mind. How do you control the auto regressive sampling to guarantee the correct syntax?
evnc · 3h ago
I assume they're doing "structured generation" or "guided generation", which has been possible for a while if you control the LLM itself, e.g. when running an OSS model [0][1]. It's cool to see a major API provider offer it, though.
The basic idea is: at each auto-regressive step (each token generation), instead of letting the model generate a probability distribution over "all tokens in the entire vocab it's ever seen" (the default), only allow the model to generate a probability distribution over "this specific set of tokens I provide". And that set can change from one sampling step to the next, according to a given grammar. E.g. if you're using a JSON grammar, and you've just generated a `{`, you can provide the model a choice of only those tokens which are valid JSON immediately after a `{`, etc.

[0] https://github.com/dottxt-ai/outlines [1] https://github.com/guidance-ai/guidance
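And a toy illustration of that masking step (no real model here; the tiny "grammar" table and vocab are made up purely to show the mechanism):

    import math, random

    # Toy constrained sampling: tokens the grammar state doesn't allow get
    # probability zero before sampling. Real implementations (outlines,
    # llguidance) compile the grammar into a token-level automaton instead
    # of a hand-written table.
    VOCAB = ['{', '}', '"key"', ':', '"value"', ',', 'banana']
    ALLOWED = {                      # grammar state -> permitted next tokens
        "start":     {'{'},
        "after_{":   {'"key"', '}'},
        "after_key": {':'},
        "after_:":   {'"value"'},
        "after_val": {',', '}'},
    }

    def sample(logits, state):
        weights = [math.exp(l) if tok in ALLOWED[state] else 0.0
                   for l, tok in zip(logits, VOCAB)]
        return random.choices(VOCAB, weights=weights)[0]

    # Even if the model "loves" banana (huge logit), it can never appear here:
    print(sample([0.1, 0.2, 0.3, 0.1, 0.2, 0.1, 9.9], "after_{"))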
It was (attempted to be) solved by a human before, yet not merged...
With all the great coding models OpenAI has access to, their SDK team still feels too small for the needs.
attentive · 1h ago
"Notably, GPT‑5 with minimal reasoning is a different model than the non-reasoning model in ChatGPT, and is better tuned for developers. The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest."
hmm, they should call it gpt-5-chat-nonreasoning or something.
henriquegodoy · 2h ago
I don't think there's much difference between Opus 4.1 and GPT-5, probably just the context size. Waiting for Gemini 3.0.
macawfish · 26s ago
Claude 5 is the one I'm most excited about.
backscratches · 1h ago
GPT-5 is much cheaper
catigula · 3h ago
I thought we were going to have AGI by now.
RS-232 · 3h ago
No shot. LLMs are simple text predictors and they are too stupid to get us to real AGI.
To achieve AGI, we will need to be capable of high fidelity whole brain simulations that model the brain's entire physical, chemical, and biological behavior. We won't have that kind of computational power until quantum computers are mature.
brookst · 1h ago
Are you saying that only (human?) biological brains can be GI, and that whatever intelligence is, it would emerge from a pure physics-based simulation?
Both of those seem questionable, multiplying them together seems highly unlikely.
jplusequalt · 1h ago
Are you arguing that intelligence is not physical? Could you name a single thing in existence that fundamentally cannot be linked to physics?
evantbyrne · 2h ago
It will be interesting to see if humans can manage to bioengineer human-level general intelligence into another species before computers.
93po · 45m ago
In what way are human brains also not just predictors? Our neural pathways are built and reinforced as we have repeated exposure to inputs through any of our senses. Our brains are expert pattern-followers, to the point that it happens even when we strongly don't want it to (in the case of PTSD, for example, or people who struggle with impulse control and executive functioning).
What's the next sentence I'm going to type? Is it not just based on the millions of sentences I've typed and read before? Even the premise of me playing devil's advocate here: that's a pattern I've learned over my entire life too.
Your argument also falls apart a bit when we see emergent behavior, which has definitely happened.
nawgz · 1h ago
I don't really see any relationship between being able to model/simulate the brain and being able to exceed the brain in intelligence, can you explain more about that? Simulations sound like more of a computational and analytic problem with regards to having an accurate model.
Maybe your point is that until we understand our own intelligence, which would be reflected in such a simulation, it would be difficult to improve upon it.
machiaweliczny · 2h ago
[flagged]
bopbopbop7 · 2h ago
“some twist” is doing a lot of heavy lifting in that statement.
AppleBananaPie · 1h ago
CS will define, design and implement human level intelligence before neuroscience has done even the first.
That's what I hear when people say stuff like this anyway.
Similar to CS folks throwing around physics 'theories'
IAmGraydon · 2h ago
Not going to happen any time soon, if ever. LLMs are extremely useful, but the intelligence part is an illusion that nearly everyone appears to have fallen for.
jonplackett · 1h ago
This POV is just the opposite extremity, and it's equally nuts. If you haven't seen any intelligence at all in an LLM, you just aren't looking.
attentive · 1h ago
> scoring 74.9% on SWE-bench Verified and 88% on Aider polyglot
"When producing frontend code for web apps, GPT‑5 is more aesthetically-minded, ambitious, and accurate. In side-by-side comparisons with o3, GPT‑5 was preferred by our testers 70% of the time."
That's really interesting to me. Looking forward to trying GPT-5!
why isn't it on https://aider.chat/docs/leaderboards/?
"last updated August 07, 2025"
jngiam1 · 2h ago
I was a little bummed that there wasn't more about better MCP support in ChatGPT, hopefully soon.
cheema33 · 1h ago
MCP is overhyped and most MCP servers are useless. What specific MCP server do you find critical in your regular use? And what functionality is missing that you wish to see in ChatGPT?
low_tech_punk · 3h ago
Tried using the gpt-5 family with the Responses API and got the error "gpt-5 does not exist or you don't have access to it". I guess they are not rolling out in lockstep with the live stream and blog article?
low_tech_punk · 2h ago
Can confirm that they are rolling out. It's working for me.
diggan · 3h ago
Seems they're doing rollout over time, I'm not seeing it anywhere yet.
zaronymous1 · 2h ago
Can anyone explain to me why they've removed parameter controls for temperature and top-p in reasoning models, including gpt-5? It strikes me that it makes it harder to build with these for small tasks requiring high levels of consistency, and in the API I really value the ability to set certain tasks to a low temperature.
mwigdahl · 1h ago
Has anyone tried connecting up GPT-5 to Claude Code using the model environment variables?
I opened up the developer playground and the model selection dropdown showed GPT-5 and then it disappeared. Also I don't see it in ChatGPT Pro. What's up?
Fogest · 3h ago
It's probably being throttled due to high usage.
brookst · 1h ago
Shipping something at the moment of announcement is always hell.
IAmGraydon · 2h ago
Not showing in my Pro account either. As someone else mentioned, I’m sure it’s throttling due to high use right now.
6thbit · 3h ago
Seems they have quietly increased the context window up to 400,000
https://platform.openai.com/docs/models/gpt-5
So, at least twice the context of those.
I wonder how good it is compared to Claude Sonnet 4, and when it's coming to GitHub Copilot.
I almost exclusively wrote and released https://github.com/andrewmcwattersandco/git-fetch-file yesterday with GPT 4o and Claude Sonnet 4, and the latter's agentic behavior was quite nice. I barely had to guide it, and was able to quickly verify its output.
fleebee · 52m ago
There is an option in GitHub Copilot settings to enable GPT-5 already.
worik · 1h ago
Diminishing returns?
sberens · 2h ago
Interesting there doesn't seem to be benchmarking on codeforces
Looks like they're trying to lock us into using the Responses API for all the good stuff.
belter · 3h ago
We were promised AGI and all we got was code generators...
esafak · 2h ago
LLMs are saturating every benchmark. AGI may not be all that. I am already impressed. Perhaps you need robots to be awed.
bmau5 · 3h ago
It's a logical starting point, given there are pretty defined success/failure criteria
ehutch79 · 2h ago
The hype is real. We were told that we'd have AGI and be out of jobs 2 years ago, let alone today.
brookst · 1h ago
We were also told that AGI would never happen, that it was 6 months away, that it is 20 years away.
I’m not sure of the utility of being so outraged that some people made wrong predictions.
fatty_patty89 · 3h ago
What the fuck?
Nobody else saw the Cursor CEO looking through the GPT-5-generated code, mindlessly scrolling, saying "this looks roughly correct, I would love to merge that"? LOL
You can't make this up
bn-l · 1h ago
That explains a lot.
siva7 · 1h ago
Amazing time to be alive, if only for this clown show.
throwawaybob420 · 1h ago
if you’re not using an LLM to vibe code garbage then are you really a software developer?
isoprophlex · 1h ago
This is the ideal software engineer. You may not like it, but this is what peak software engineering looks like.
/s
ivape · 1h ago
Musk after the GPT-5 launch: "OpenAI is going to eat Microsoft alive"
https://x.com/elonmusk/status/1953509998233104649
Anyone know why he said that?
This was really a bad release for OpenAI, if benchmarks are even somewhat indicative of how the model will perform in practice.
mediaman · 41m ago
I actually don't agree. Tool use is the key to successful enterprise product integration and they have done some very good work here. This is much more important to commercialization than, for example, creative writing quality (which it reportedly is not good at).