OpenAI's new GPT-5 models announced early by GitHub

64 bkolobara 56 8/7/2025, 8:06:48 AM theverge.com ↗

Comments (56)

deepdarkforest · 3h ago
> It handles complex coding tasks with minimal prompting...

I find it interesting how marketers are trying to make minimal prompting a good thing, a direction to optimize for. Even when I talk to a senior engineer, I try to be as specific as possible to avoid ambiguities etc. Pushing the models to just do what they think is best is a weird direction. There are so many subtle things/understandings of the architecture that are just in my head or a colleague's head. Meanwhile, I've found that a very good workflow is asking Claude Code to come back with clarifying questions and then a plan, before just starting to execute.

ls-a · 3h ago
This works well with managers. They think that if the task title on Jira is a one-liner, then it's that simple to implement.
consp · 2h ago
Usually it's exactly the opposite: more often than not the scope and requirements are missing, so you get a vague one-liner. Infinity hours it is ...
ls-a · 2h ago
Then you have engineers who will agree with the manager on everything.
KronisLV · 2h ago
> Meanwhile, i found that a very good workflow is asking claude code to come back with clarifying questions and then a plan, before just starting to execute.

RooCode supports various modes https://docs.roocode.com/basic-usage/using-modes

For example, you can first use the Ask mode to explore the codebase and answer your questions, as well as have it ask its own questions about what you want to do. Then you can switch over to the Code mode for the actual implementation; in the other modes the model itself will ask you to switch, because it's not allowed to change files in the Ask mode.

I think that approach works pretty well, especially when you document what needs to be done in a separate Markdown file or something along those lines, which can then be referenced if you have to clear the context, e.g. for a new refactoring task on what's been implemented.

> I find it interesting how marketers are trying to make minimal prompting a good thing, a direction to optimize.

This seems like a good thing, though. You're still allowed to be as specific as you want to, but the baseline is a bit better.

igleria · 2h ago
> I find it interesting how marketers are trying to make minimal prompting a good thing

They do that because IMHO the average person seems to prefer something to be easy, rather than correct.

c048 · 2h ago
This is why I don't listen at all to the fearmongers that say programmers will disappear. At most, our jobs will slightly change.

There will always be people who describe a problem, and you'll always need people to figure out what's actually wrong.

croes · 2h ago
The problem isn’t the AI but the management that believes the PR. It doesn’t matter whether AI can actually replace developers; what matters is whether management thinks it can.
ACCount36 · 2h ago
What makes you look at existing AI systems and then say "oh, this totally isn't capable of describing a problem or figuring out what's actually wrong"? Let alone "this wouldn't EVER be capable of that"?
benterix · 2h ago
> What makes you look at existing AI systems and then say "oh, this totally isn't capable of describing a problem or figuring out what's actually wrong"?

I wouldn't say they're completely incapable.

* They can spot (and fix) low hanging fruit instantly

* They will also "fix" things that were left out there for a reason and break things completely

* Even if the code base fits entirely in their context window, along with the complete company knowledge base, Slack conversations and all, the proposed solutions sometimes take a very strange turn, despite being correct 57.8% of the time.

ACCount36 · 2h ago
That's about right. And this kind of performance wouldn't be concerning - if only AI performance didn't go up over time.

Today's AI systems are the worst they'll ever be. If AI is already capable of doing something, you should expect it to become more capable of it in the future.

binary132 · 2h ago
Why is “the worst they’ll ever be” such a popular meme with the AI inevitabilist crowd, and how do we make their brains work again?
ACCount36 · 1h ago
It's popular because it's true.

By now, the main reason people expect AI progress to halt is cope. People say "AI progress is going to stop, any minute now, just you wait" because the alternative makes them very, very uncomfortable.

benterix · 33m ago
Well, to use the processor analogy: with models we've reached the point where the clocks can't go that much higher. So the industry switched to multiplying cores etc., but you can actually see the slope plateauing. There are wild developments for the general public, like the immediate availability of gpt-oss-120b that I'm running on my MBP right now, or Claude Code that can work for weeks doing various stuff and being right half of the time. That's all great, but we can all see that development of the SOTA models has slowed down, and what we're seeing are very nice and useful incremental improvements, not great breakthroughs like we had 3-4 years ago.

(NB I'm a very rational person, and based on my lifelong experience and on how many times life has surprised me both negatively and positively, I'd say the chance of a great breakthrough occurring short term is 50%. But that has nothing to do with, and cannot be extrapolated from, the current development, as this can go any way really. We already had multiple AI winters and I'm sure humanity will have dozens if not hundreds of them still.)

disgruntledphd2 · 39m ago
> By now, the main reason people expect AI progress to halt is cope. People say "AI progress is going to stop, any minute now, just you wait" because the alternative makes them very, very uncomfortable.

OK, so where is the new data going to come from? Fundamentally, LLMs are trained by token prediction: predicting held-out (for GPT-style models, the next) tokens from the surrounding text. This process (which doesn't require supervision, hence why it scaled) seems to be fundamental to LLM improvement. And basically all of the AI companies have already slurped up all of the text (and presumably all of the videos) on the internet. Where does the next order-of-magnitude increase in data come from?
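For the unfamiliar, a toy sketch of that objective (assuming PyTorch; the random tensors stand in for a real tokenizer and model, only the shift-and-cross-entropy step matters):

    # Self-supervised next-token prediction in miniature: the "labels"
    # are just the text itself, shifted by one position.
    import torch
    import torch.nn.functional as F

    vocab_size, seq_len = 50_000, 8
    token_ids = torch.randint(0, vocab_size, (1, seq_len))  # pretend-tokenized text
    logits = torch.randn(1, seq_len, vocab_size)             # pretend model output

    # Each position is trained to predict the *following* token.
    predictions = logits[:, :-1, :].reshape(-1, vocab_size)
    targets = token_ids[:, 1:].reshape(-1)

    loss = F.cross_entropy(predictions, targets)  # no human labels needed
    print(loss.item())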

More fundamentally, lots of the hype is about research/novel stuff, which seems to me to be very, very difficult to get from a model that's trained to produce plausible text. Like, how does one expect to see improvements in biology (for example) based on text input and output?

Remember, these models don't appear to reason much like humans, they seem to do well where the training data is sufficient (interpolation) and do badly where there isn't enough data (extrapolation).

I'd love to understand how this is all supposed to change, but haven't really seen much useful evidence (i.e. papers and experiments) on this, just AI CEOs talking their book. Happy to be corrected if I'm wrong.

fragmede · 3m ago
Fundamentally the bottleneck is on data and compute. If we accept as a given that a) some LLM is bad at writing e.g. Rust code because there's much less of it on the internet compared to, say, React JS code, but that b) the LLM is able to generate valid Rust code, c) the LLM is able to "tool use" the Rust compiler and a runtime to validate the Rust it generates, and iterate until the code is valid, and finally d) use that generated Rust code to train on, then it seems that, barring any algorithmic improvements in training, the additional data should allow later versions of the LLM to be better at writing Rust code. If you don't hold a-d to be possible then sure, maybe it's just AI CEOs talking their book.
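To put a)-d) in concrete terms, a rough sketch of that bootstrapping loop (generate_rust_snippet() is a hypothetical stand-in for whatever model API you'd actually call; the filter is a real rustc invocation):

    # Sample Rust from a model, keep only what the compiler accepts,
    # and save the survivors as new training data.
    import pathlib
    import subprocess
    import tempfile

    def generate_rust_snippet(prompt: str) -> str:
        raise NotImplementedError("call your LLM of choice here")  # hypothetical

    def compiles(rust_source: str) -> bool:
        with tempfile.TemporaryDirectory() as tmp:
            src = pathlib.Path(tmp) / "candidate.rs"
            src.write_text(rust_source)
            # Type-check as a library; we only care whether rustc accepts it.
            result = subprocess.run(
                ["rustc", "--edition", "2021", "--crate-type", "lib",
                 "--emit", "metadata", str(src), "--out-dir", tmp],
                capture_output=True,
            )
            return result.returncode == 0

    def collect_training_data(prompts, attempts_per_prompt=4):
        accepted = []
        for prompt in prompts:
            for _ in range(attempts_per_prompt):
                candidate = generate_rust_snippet(prompt)
                if compiles(candidate):  # the compiler acts as the filter
                    accepted.append({"prompt": prompt, "completion": candidate})
                    break
        return accepted  # feed back into the next training run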

The other fundamental bottleneck is compute. Moore's law hasn't gone away, so if the LLM was GPT-3 and used one supercomputer's worth of compute for 3 months back in 2020, and the supercomputer used for training today is, say, three times more powerful (3x faster CPU and 3x the RAM), then training on a latest-generation supercomputer should lead to a more powerful LLM simply by virtue of scaling that up, with no algorithmic changes. The exact nature of the improvement isn't easy to calculate on the back of an envelope, but even with a layman's understanding of how these things work, that doesn't seem like an unreasonable assumption about how things will go, and not "AI CEOs talking their book". Simply running with a bigger context window should allow the LLM to be more useful.

Finally, why do you assume that, absent papers up on arXiv, there haven't been and won't be any algorithmic improvements to training and inference? We've already seen how allowing the LLM to take longer to process the input (e.g. "ultrathink" to Claude) allows for better results. It seems unlikely that all possible algorithmic improvements have already been discovered and implemented. That OpenAI et al. aren't writing academic papers to share their discoveries with the world and are instead keeping those improvements private and proprietary, to try and gain an edge in a very competitive business, seems like a far more reasonable assumption. With literal billions of dollars on the line, would you spend your time writing a paper, or would you try to outcompete your competitors? If simply giving the LLM longer to process the input before returning user-facing output helps, what other algorithmic improvements on the inference side are possible on a bigger supercomputer with more RAM available to it?

Happy to hear opposing points of view, but I don't think any of the things I've theorized here are totally inconceivable. Of course there's a discussion to be had about diminishing returns, but we'd need a far deeper understanding of the state of the art on all three facets I raised in order to have an in-depth and practical discussion on the subject.

IsTom · 1h ago
We're somewhere on an S-curve and you can't really determine on which part by just looking at the past progress.
croes · 2h ago
That’s not how it works. There are already cases where the fix for one problem made a previously existing capability worse.
ACCount36 · 1h ago
That's exactly how it works. Every input to AI performance improves over time, and so do the outcomes.

Can you damage existing capabilities by overly specializing an AI in something? Yes. Would you expect that damage to stick around forever? No.

OpenAI damaged o3's truthfulness by frying it with too much careless RL. But Anthropic's Opus 4 proves that you can get similar task performance gains without sacrificing truthfulness. And then OpenAI comes back swinging with an algorithmic approach to train their AIs for better truthfulness specifically.

satyrun · 1h ago
At this point it is just straight denial.

Like when a relationship is obviously over. Some people enjoy the last fleeting moments, while others delude themselves that they just have to get over the hump and things will go back to normal.

I suspect a lot of the denial is from the 30-something CRUD app lottery winner. One of the smart kids all through school, graduated into a ripping CRUD app job market, and then, if they didn't even feel the 2022 downturn, they now see themselves as an irreplaceable CRUD app genius. Understandable, since the environment never signaled anything to the contrary until now.

croes · 2h ago
Turn the question around: “oh, this totally is capable of describing a problem and figuring out what's actually wrong.”

Even a broken clock is right two times a day.

The question is reliability.

What worked today may not work tomorrow and vice versa.

nojito · 2h ago
Because people are overprompting and creating crazy elaborate harnesses. My prompts are maybe 1 - 2 sentences.

There is a definite skill gap between the folks who are using these tools effectively and those who aren't.

Ratelman · 3h ago
Interesting/unfortunate/expected that GPT-5 isn't touted as AGI or with some other outlandish claim. It's just improved reasoning etc. I know it's not the actual announcement, just a single page accidentally released, but it at least seems more grounded...? We'll have to wait and see what the actual announcement entails.
throwaway525753 · 2h ago
At this point it's pretty obvious that the easy scaling gains have been made already and AI labs are scrounging for tricks to milk out extra performance from their huge matrix product blobs:

- Reasoning, which is just very long inference coupled with RL

- Tool use, aka an LLM with glue code to call programs based on its output

- "Agents", aka LLMs with tools in a loop (a bare-bones sketch follows below)

Those are pretty neat tricks, and not at all trivial to get actionable results from (from an engineering point of view), mind you. But the days of the qualitative intelligence leaps from GPT-2 to 3, or 3 to 4, are over. Sure, benchmarks do get saturated, but at incredible cost, and it forces AI researchers to make up new "dimensions of scaling" as the ones they were previously banking on stall. And meanwhile it's all your basic next-token-prediction blob running it all, just with a few optimizing tricks.
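For what it's worth, "tools in a loop" really is about that simple. A bare-bones sketch, where call_model() is a hypothetical stand-in for whatever chat API you use and the tools are deliberately trivial:

    def call_model(messages):
        # Hypothetical LLM call; assumed to return a dict that is either
        # {"tool": name, "argument": arg} or {"content": final_answer}.
        raise NotImplementedError

    TOOLS = {
        "read_file": lambda path: open(path).read(),
        "run_python": lambda code: str(eval(code)),  # toy example only
    }

    def run_agent(task, max_steps=10):
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = call_model(messages)
            if reply.get("tool") in TOOLS:
                # Model asked for a tool: run it, feed the result back in.
                result = TOOLS[reply["tool"]](reply["argument"])
                messages.append({"role": "tool", "content": result})
            else:
                return reply.get("content")  # model considers itself done
        return None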

My hunch is that there won't be a wondrous, life-changing AGI (poorly defined anyway), just consolidation of existing gains (distillation, small language models, MoE, quality datasets, etc.) and the finding of new dimensions and sources of data (biological data and 'sense data' for robotics come to mind).

binary132 · 1h ago
This is the worst they’ll ever be! It’s not just going to be an ever slower asymptotic improvement that never quite manages to reach escape velocity but keeps costing orders of magnitude more to research, train, and operate….
nialv7 · 3h ago
I wonder whether the markets will crash if gpt5 flops. Because it might be the model that cements the idea that, yes, we have hit a wall.
qsort · 3h ago
I'm the first to call out ridiculous behavior by AI companies, but short of something massively below expectations this can't be bad for OpenAI. GPT-5 is going to be positioned as a product for the general public first and foremost. Not everyone cares about coding benchmarks.
nialv7 · 2h ago
llama 4 basically (arguably) destroyed Meta's LLM lab, and it wasn't even that bad of a model.
benterix · 2h ago
> massively below expectations

Well, the problem is that the expectations are already massive, mostly thanks to sama's strategy of attracting VC.

ben_w · 2h ago
OpenAI's announcements are generally a lot more grounded than the hype surrounding them and their stuff.

e.g. if you look at Altman's blog post about "superintelligence in a few thousand days", what he actually wrote doesn't even disagree with LeCun (famously a naysayer) about the timeline.

naveen99 · 42m ago
A few thousand days is decades.
Imustaskforhelp · 3h ago
Yeah, I guess it wouldn't be that big but it will have a lot of hype around it.

I doubt it can even beat Opus 4.1.

bkolobara · 4h ago
The actual announcement (now deleted on GitHub's blog): https://archive.is/IoMEg
billytrend · 3h ago
Did they photoshop the screenshot from https://github.blog/changelog/2025-05-19-github-models-built... ? Other than the model id, it’s identical.
ukblewis · 2h ago
I get that it looks suspicious, but here’s the archive link: https://archive.is/2025.08.07-035308/https://github.blog/cha...
nxobject · 2h ago
Is the announcement implying that "mainline" GPT-5 is now a reasoning model?

> gpt-5: Designed for logic and multi-step tasks.

blixt · 2h ago
I think the promise back when all the separate reasoning / multimodal models were out was that GPT-5 would be the model to bring it all together (which mostly comes down to audio/video I think since o3/o4 do images really well).
om8 · 2h ago
Of course it is. GPT-5 is one of the most anticipated things in AI right now. To live up to the hype, it needs to be a reasoning model.
ed_mercer · 2h ago
Damn interns!
therodeoen · 4h ago
They are comparing it to Llama 4 and Cohere v2 in the image…
fnord77 · 3h ago
sama posted a picture of the Death Star yesterday.