Re-label the "Save" button to be "Publish", to better indicate to users the out (phabricator.wikimedia.org)

Between Opus aand GPT-5, it's not clear there's a substantial difference in software development expertise. The metric that I can't seem to get past in my attempts to use the systems is context awareness over long-running tasks. Producing a very complex, context-exceeding objective is a daily (maybe hourly) ocurrence for me. All I care about is how these systems manage context and stay on track over extended periods of time.

What eval is tracking that? It seems like it's potentially the most imporatnt metric for real-world software engineering and not one-shot vibe prayers.

swader999 · 47m ago

If GPT 5 truly has 400k context, that might be all it needs to meaningfully surpass Opus.

dimal · 14m ago

Even with large contexts there's diminishing returns. Just having the ability to stuff more tokens in context doesn't mean the model can effectively use it. As far as I can tell, they always reach a point in which more information makes things worse.

andrewmutz · 8m ago

Having a large context window is very different from being able to effectively use a lot of context.

To get great results, it's still very important to manage context well. It doesn't matter if the model allows a very large context window, you can't just throw in the kitchen sink and expect good results

simonw · 25m ago

It's 272,000 input tokens and 128,000 output tokens.

Byamarro · 8m ago

More of a question is its context rot tendency than the size of its context :) LLMs are supposed to load 3 bibles into their context, but they forget what they were about to do after loading a 600LoC of locales.

AS04 · 40m ago

400k context with 100% on the fiction livebench would make GPT-5 the undisputably best model IMHO. Don't think it will achieve that though, sadly.

bdangubic · 50m ago

context awareness over long-running tasks

don’t have long-running tasks, llms or not. break the problem down into small manageable chunks and then assemble it. neither humans nor llms are good at long-running tasks.

bastawhiz · 24m ago

> neither humans nor llms are good at long-running tasks.

That's a wild comparison to make. I can easily work for an hour. Cursor can hardly work for a continuous pomodoro. "Long-running" is not a fixed size.

echelon · 7m ago

Humans can error correct.

LLMs multiply errors over time.

beoberha · 39m ago

A series of small manageable chunks becomes a long running task :)

If LLMs are going to act as agents, they need to maintain context across these chunks.

vaenaes · 14m ago

You're holding it wrong

realusername · 51m ago

Personally I think I'll wait for another 10x improvement for coding because with the current way it's going, they clearly need that.

fsloth · 38m ago

From my experience when used through IDE such as Cursor the current gen Claude model enables impressive speedruns over commodity tasks. My context is a CAD application I’ve been writing as a hobby. I used to work in that field for a decade so have a pretty good touch on how long I would expect tasks to take. I’m using mostly a similar software stack as that at previous job and am definetly getting stuff done much faster on holiday at home than at that previous work. Of course the codebase is also a lot smaller, intrinsic motivation, etc, but still.

42lux · 23m ago

How often do you have to build the simple scaffolding though?

risho · 51m ago

over the last week or so I have put probably close to 70 hours into playing around with cursor and claude code and a few other tools (its become my new obsession). I've been blown away by how good and reliable it is now. That said the reality is in my experience the only models that actually work in any sort of reliable way are claude models. I dont care what any benchmark says because the only thing that actually matters is actual use. I'm really hoping that this new gpt model actually works for this usecase because competition is great and the price is also great.

throwaway_2898 · 27m ago

How much of the product were you able to build to say it was good/reliable? IME, 70 hours can get you to a PoC that "works", building beyond the initial set of features — like say a first draft of all the APIs — does it do well once you start layering features?

ralfd · 33m ago

Just replying to ask you next week what your assessment on GPT5 is.

Centigonal · 19m ago

Ditto here, except I'm using Roo and it's Claude and Gemini pro 2.5 that work for me.

pamelafox · 29m ago

I am testing out gpt-5-mini for a RAG scenario, and I'm impressed so far.

I used gpt-5-mini with reasoning_effort="minimal", and that model finally resisted a hallucination that every other model generated.

Screenshot in post here: https://bsky.app/profile/pamelafox.bsky.social/post/3lvtdyvb...

I'll run formal evaluations next.

potatolicious · 2m ago

This feels like honestly the biggest gain/difference. I work on things that do a lot of tool calling, and the model hallucinating fake tools is a huge problem. Worse, sometimes the model will hallucinate a response directly without ever generating the tool call.

The new training rewards that suppress hallucinations and tool-skipping hopefully push us in the right direction.

mehmetoguzderin · 36m ago

Context-free grammar and regex support are exciting. I wonder what, or whether, there are differences from the Lark-like CFG of llguidance, which powers the JSON schema of the OpenAI API [^1].

[^1]: https://github.com/guidance-ai/llguidance/blob/f4592cc0c783a...

msp26 · 23m ago

Yeah that was the only exciting part of the announcement for me haha. Can't wait to play around with it.

I'm already running into a bunch of issues with the structured output APIs from other companies like Google and OpenAI have been doing a great job on this front.

jumploops · 39m ago

If the model is as good as the benchmarks say, the pricing is fantastic:

Input: $1.25 / 1M tokens (cached: $0.125/1Mtok) Output: $10 / 1M tokens

For context, Claude Opus 4.1 is $15 / 1M for input tokens and $75/1M for output tokens.

The big question remains: how well does it handle tools? (i.e. compared to Claude Code)

Initial demos look good, but it performs worse than o3 on Tau2-bench airline, so the jury is still out.

addaon · 33m ago

> Output: $10 / 1M tokens

It's interesting that they're using flat token pricing for a "model" that is explicitly made of (at least) two underlying models, one with much lower compute costs than the other; and with use ability to at least influence (via prompt) if not choose which model is being used. I have to assume this pricing model is based on a predicted split between how often the underlying models get used; I wonder if that will hold up, if users will instead try to rouse the better model into action more than expected, or if the pricing is so padded that it doesn't matter.

mkozlows · 14m ago

That's how the browser-based ChatGPT works, but not the API.

simianwords · 20m ago

> that is explicitly made of (at least) two underlying models

what do you mean?

croemer · 1h ago

> GPT‑5 also excels at long-running agentic tasks—achieving SOTA results on τ2-bench telecom (96.7%), a tool-calling benchmark released just 2 months ago.

Yes, but it does worse than o3 on the airline version of that benchmark. The prose is totally cherry picker.

Fogest · 47m ago

How does the cost compare though? From my understanding o3 is pretty expensive to run. Is GPT-5 less costly? If so if the performance is close to o3 but cheaper, then it may still be a good improvement.

low_tech_punk · 44m ago

I find it strange that GPT-5 is cheaper than GPT-4.1 in input token and is only slightly more expensive in output token. Is it marketing or actually reflecting the underlying compute resources?

AS04 · 38m ago

Very likely to be an actual reflection. That's probably their real achievement here and the key reason why they are actually publishing it as GPT-5. More or less the best or near to it on everything while being one model, substantially cheaper than the competition.

hrpnk · 19m ago

The github issue showed in the livestream is getting lots of traction: https://github.com/openai/openai-python/issues/2472

It was (attempted to be) solved by a human before, yet not merged... With all the great coding models OpenAI has access to, their SDK team still feels too small for the needs.

low_tech_punk · 46m ago

Tried using gpt-5 family with response API and got error "gpt-5 does not exist or you don't have access to it". I guess they are not rolling out in lock step with the live stream and blog article?

diggan · 45m ago

Seems they're doing rollout over time, I'm not seeing it anywhere yet.

jaflo · 8m ago

I just wish their realtime audio pricing would go down but it looks like GPT-5 does not have support for that so we’re stuck with the old models.

henriquegodoy · 13m ago

I dont think there's so much difference from opus 4.1 and gpt-5, probably just the context size, waiting for the gemini 3.0

catigula · 44m ago

I thought we were going to have AGI by now.

RS-232 · 20m ago

No shot. LLMs are simple text predictors and they are too stupid to get us to real AGI.

To achieve AGI, we will need to be capable of high fidelity whole brain simulations that model the brain's entire physical, chemical, and biological behavior. We won't have that kind of computational power until quantum computers are mature.

sberens · 12m ago

Interesting there doesn't seem to be benchmarking on codeforces

timhigins · 51m ago

I opened up the developer playground and the model selection dropdown showed GPT-5 and then it disappeared. Also I don't see it in ChatGPT Pro. What's up?

Fogest · 46m ago

It's probably being throttled due to high usage.

6thbit · 39m ago

Seems they have quietly increased the context window up to 400,000

https://platform.openai.com/docs/models/gpt-5

ralfd · 31m ago

How does that compare to Claude/GPT4?

6thbit · 27m ago

4o - 128k o3 - 200k Opus 4.1 - 200k Sonnet 4 - 200k

So, at least twice larger context than those

hrpnk · 27m ago

gpt4.1 has 1M input and 32k output, Sonnet 4 200k/64k

simianwords · 19m ago

but is it for the model in chatgpt.com as well?

low_tech_punk · 47m ago

The ability to specify a context-free grammar as output constraint? This blows my mind. How do you control the auto regressive sampling to guarantee the correct syntax?

evnc · 18m ago

I assume they're doing "Structured Generation" or "Guided generation", which has been possible for a while if you control the LLM itself e.g. running an OSS model, e.g. [0][1]. It's cool to see a major API provider offer it, though.

The basic idea is: at each auto-regressive step (each token generation), instead of letting the model generate a probability distribution over "all tokens in the entire vocab it's ever seen" (the default), only allow the model to generate a probability distribution over "this specific set of tokens I provide". And that set can change from one sampling set to the next, according to a given grammar. E.g. if you're using a JSON grammar, and you've just generated a `{`, you can provide the model a choice of only which tokens are valid JSON immediately after a `{`, etc.

[0] https://github.com/dottxt-ai/outlines [1] https://github.com/guidance-ai/guidance

qsort · 44m ago

You sample only from tokens that could possibly result in a valid production for the grammar. It's an inference-only thing.

low_tech_punk · 43m ago

ah, thanks!

skepticATX · 43m ago

This was really a bad release for OpenAI, if benchmarks are even somewhat indicative of how the model will perform in practice.

sebdufbeau · 49m ago

Has the API rollout started? It's not available in our org, even if we've been verified for a few months

EDIT: It's out now

spullara · 48m ago

it is out yet. i poll the api for the models and update this GitHub hourly.

https://github.com/spullara/models

andrewmcwatters · 1h ago

I wonder how good it is compared to Claude Sonnet 4, and when it's coming to GitHub Copilot.

I almost exclusively wrote and released https://github.com/andrewmcwattersandco/git-fetch-file yesterday with GPT 4o and Claude Sonnet 4, and the latter's agentic behavior was quite nice. I barely had to guide it, and was able to quickly verify its output.

te_chris · 17m ago

https://platform.openai.com/docs/guides/latest-model

Looks like they're trying to lock us into using the Responses API for all the good stuff.

belter · 29m ago

We were promised AGI and all we got was code generators...

bmau5 · 20m ago

It's a logical starting point, given there are pretty defined success/failure criteria

fatty_patty89 · 23m ago

What the fuck? Nobody else saw the cursor ceo looking through the gpt5 generated code, mindlessly scrolling saying "this looks roughly correct, i would love to merge that" LOL

You can't make this up

Lawmakers want to end to HR ghosting during the interview process (cnbc.com)

Community Update #35 Baldur's Gate 3 Turns Two (baldursgate3.game)

A real example of how GPT-5 behaves in Amp (twitter.com)

Microsoft is cautiously onboarding Grok 4 following Hitler concerns (theverge.com)

Nearly a million more deaths than births in Japan last year (bbc.com)

Steam Trailer Player Upgrades (store.steampowered.com)

OpenAl Five vs. OG, Game 1 (2019) [video] (youtube.com)

Pushing Boundaries with Claude Code (wordfence.com)

Nim 3.0: Design Principles (nim-lang.org)

Upload Images via APIs (github.com)

Show HN: Octofriend, a cute coding agent that can swap between GPT-5 and Claude (github.com)

TSMC employees leaks 2mm info to Rapidus (tomshardware.com)

Not Everything Needs an Update (kyrylo.org)

Xero Shoes Went from Startup to Global Success (success.com)

Trump Calls on Intel CEO to Resign over China Ties (msn.com)

F1 in Belgium: The best racetrack in the world (arstechnica.com)

Cc (continuous coding) the new AI enabled prefix in the cc –> CI > CD chain

Trump administration is altering previously published climate reports (cnn.com)

Encryption Made for Police and Military Radios May Be Easily Cracked (wired.com)

The death of east London's most radical bookshop (the-londoner.co.uk)

(Re)Building Reddit Mobile CI in 2025 (old.reddit.com)

Show HN: AI-powered cover letter generator based on resume and job description (payrankjobs.ai)

Here Come the AI Worms (wired.com)

Why Server-Side Charts Suck – We Moved Rendering to the Client in Our Chat SDK (withcoherence.com)

GPT-5 (lack of) skill in orbital mechanics (twitter.com)

Reactive HTML Without JavaScript Frameworks (blog.hmpl-lang.dev)

The Eponymous Principles of Management – Coase's Ceiling and Floor (amvaishnav.wordpress.com)

Framework Desktop is a mash-up of a regular desktop PC and the Mac Studio (arstechnica.com)

Historical Tech Tree (historicaltechtree.com)

Re-label the "Save" button to be "Publish", to better indicate to users the out (phabricator.wikimedia.org)

Does Your App Need AI? (app-analyzer.stupid-ideas.com)

Jules – An Asynchronous Coding Agent (jules.google)

Exploiting Retbleed in the Real World (bughunters.google.com)

Selling Domain Www.brooklyn.ventures

DNA Casts Doubt over Theory on What Killed Napoleon's Forces (sciencealert.com)

Intel's Alleged 18A Chip Manufacturing Struggles Put Panther Lake Launch at Risk (hothardware.com)

Linux 6.17 Will Correctly Map by Default F13 to F24 Keys on PS/2 Keyboards (phoronix.com)

Ask HN: Does reporting spam/phishing help at all?

IBM Outlines Steps to Verify Claims of Quantum Advantage (nextplatform.com)

Trump to sign executive order allowing cryptocurrencies, private equity in 401k (thehill.com)

Stanford Course on AI Software Development: The Modern Software Developer (themodernsoftware.dev)

The Scam of Age Verification (pornbiz.com)

VexTrio Unveiled: Inside the Notorious Scam Enterprise (blogs.infoblox.com)

Exit Tax: Leave Germany Before Your Business Gets Big (eidel.io)

Decoding Zuck's Superintelligence Memo (om.co)

The Wolf-Krugman Exchange: The cost of losing trust in the US [video] (youtube.com)

Microchimeric cells from fetus improve mother's health and injury response (pubmed.ncbi.nlm.nih.gov)

Show HN: AI Video Creation Just Leveled Up with Google Veo 3 (nxgntools.com)

Framework Desktop Hands On: First Impressions (Benchmarks, Gaming, AI Models) (boilingsteam.com)

How do politicians view democracy? It depends on whether they win or lose (phys.org)

GPT-5 for Developers

Comments (58)