OpenAI o3-pro

175 mfiguiere 97 6/10/2025, 8:15:47 PM help.openai.com ↗

Comments (97)

DanMcInerney · 1d ago
I'm really hoping GPT5 is a larger jump in metrics than the last several releases we've seen like Claude3.5 - Claude4 or o3-mini-high to o3-pro. Although I will preface that with the fact I've been building agents for about a year now and despite the benchmarks only showing slight improvement, I have seen that each new generation feels actively better at exactly the same tasks I gave the previous generation.

It would be interesting if there was a model that was specifically trained on task-oriented data. It's my understanding they're trained on all data available, but I wonder if it can be fine-tuned or given some kind of reinforcement learning on breaking down general tasks to specific implementations. Essentially an agent-specific model.

codingwagie · 1d ago
I'm seeing big advances that aren't shown in the benchmarks. I can simply build software now that I couldn't build before. The level of complexity that I can manage and deliver is higher.
IanCal · 8h ago
A really important thing is the distinction between performance and utility.

Performance can improve linearly and utility can be massively jumpy. For some people/tasks performance can have improved but it'll have been "interesting but pointless" until it hits some threshold and then suddenly you can do things with it.

shmoogy · 1d ago
Yeah I kind of feel like I'm not moving as fast as I did, because the complexity and features grow - constant scope creep due to moving faster.
alightsoul · 22h ago
Mind sharing some examples?
motorest · 17h ago
Not OP, but a couple of days ago I managed to vibecode my way through a small app that pulled data from a few services and did a few validation checks. By itself it's not very impressive, but my input was literally "this is what the responses from endpoints A, B and C look like. This field included somewhere in A must be somewhere in the response from B, and the response from C must feature this and that from responses A and B. If the responses include links, check that they exist". To my surprise, it generated everything in one go. No retries or Agent mode churn needed. In the not-so-distant past this would have required progressing through smaller steps, and I had to fill in tests to nudge Agent mode to not mess up. Not today.
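
For readers wondering what such a check might look like, here is a minimal sketch of the kind of cross-response validation described in the prompt. It is not the app the commenter generated: the field names (`id`, `status`), the `validate` function, and the injected `link_exists` callable are all hypothetical, chosen only to illustrate the idea.

```python
from typing import Any, Callable, Iterator


def iter_values(payload: Any) -> Iterator[Any]:
    """Yield every leaf value found anywhere in a nested JSON-like payload."""
    if isinstance(payload, dict):
        for value in payload.values():
            yield from iter_values(value)
    elif isinstance(payload, list):
        for item in payload:
            yield from iter_values(item)
    else:
        yield payload


def find_links(payload: Any) -> list[str]:
    """Collect string leaves that look like HTTP(S) URLs."""
    return [
        value
        for value in iter_values(payload)
        if isinstance(value, str) and value.startswith(("http://", "https://"))
    ]


def validate(a: dict, b: dict, c: dict,
             link_exists: Callable[[str], bool]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    errors: list[str] = []

    # Hypothetical rule 1: the "id" from A must appear somewhere in B.
    if a.get("id") not in list(iter_values(b)):
        errors.append("a.id missing from B")

    # Hypothetical rule 2: C must contain A's "id" and B's "status".
    c_values = list(iter_values(c))
    for name, expected in (("a.id", a.get("id")), ("b.status", b.get("status"))):
        if expected not in c_values:
            errors.append(f"{name} missing from C")

    # Rule 3: every link in any response must exist. The actual check
    # (for example an HTTP HEAD request) is injected by the caller.
    for payload in (a, b, c):
        for url in find_links(payload):
            if not link_exists(url):
                errors.append(f"dead link: {url}")

    return errors
```

Passing the link checker in as a parameter keeps the validation logic free of network calls, so it can be tested with a stub.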
alightsoul · 16h ago
what tools did you use?
motorest · 12h ago
> what tools did you use?

Nothing fancy. Visual Studio Code + Copilot, agent mode, a couple prompt files, and that's it.

energy123 · 1d ago
That would require AIME 2024 going above 100%.

There was always going to be diminishing returns in these benchmarks. It's by construction. It's mathematically impossible for that not to happen. But it doesn't mean the models are getting better at a slower pace.

Benchmark space is just a proxy for what we care about, but don't confuse it for the actual destination.

If you want, you can choose to look at a different set of benchmarks like ARC-AGI-2 or Epoch and observe greater than linear improvements, and forget that these easier benchmarks exist.
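
One way to see the "by construction" point is a toy model, sketched below under a loud assumption: that benchmark accuracy is a logistic function of some latent "ability" minus the benchmark's "difficulty". Both quantities, the `benchmark_score` function, and the difficulty values are made up for illustration and are not taken from any real benchmark.

```python
import math


def benchmark_score(ability: float, difficulty: float) -> float:
    """Toy assumption: expected accuracy is a logistic function of ability - difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def gains(scores: list[float]) -> list[float]:
    """Score improvement between consecutive model generations."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


# Five hypothetical model generations, each one unit of ability better than the last.
abilities = [0.0, 1.0, 2.0, 3.0, 4.0]

# An "easy" benchmark that is already half solved by the first generation,
# and a "hard" one that is far out of reach at the start.
easy_scores = [benchmark_score(ability, difficulty=0.0) for ability in abilities]
hard_scores = [benchmark_score(ability, difficulty=6.0) for ability in abilities]

easy_gains = gains(easy_scores)  # shrink as the score approaches 100%
hard_gains = gains(hard_scores)  # grow while the score is still near 0%

for name, values in (("easy", easy_gains), ("hard", hard_gains)):
    print(name, [round(value, 3) for value in values])
```

In this toy setup the underlying ability improves by the same amount every generation, yet the easy benchmark shows shrinking gains while the hard one shows accelerating gains.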

croddin · 1d ago
There is still plenty of room for growth on the ARC-AGI benchmarks. ARC-AGI 2 is still <5% for o3-pro and ARC-AGI 1 is only at 59% for o3-pro-high:

"ARC-AGI-1:
* Low: 44%, $1.64/task
* Medium: 57%, $3.18/task
* High: 59%, $4.16/task

ARC-AGI-2:
* All reasoning efforts: <5%, $4-7/task

Takeaways:
* o3-pro in line with o3 performance
* o3's new price sets the ARC-AGI-1 Frontier"

- https://x.com/arcprize/status/1932535378080395332

saberience · 1d ago
I’m not sure the ARC-AGI benchmarks are interesting. For one, they are image based, and for two, most people I show them to have issues understanding them, and in fact I had issues understanding them.

Given the models don’t even see the versions we get to see, it doesn’t surprise me they have issues with these. It’s not hard to make benchmarks so hard that neither humans nor LLMs can do them.

nipah · 1h ago
"most people I show them too have issues understanding them, and in fact I had issues understanding them" ??? those benchmarks are so extremely simple they have basically 100% human approval rates, unless you are saying "I could not grasp it immediately but later I was able to after understanding the point" I think you and your friends should see a neurologist. And I'm not mocking you, I mean seriously, those are tasks extremely basic for any human brain and even some other mammals to do.
HDThoreaun · 9h ago
ARC-AGI is the closest any widely used benchmark comes to an IQ test; it's straight logic/reasoning. Looking at the problem set, it's hard for me to choose a better benchmark for "when this is better than humans we have agi"
jstummbillig · 1d ago
It's hard to be 100% certain, but I am 90% certain that the benchmarks leveling off, at this point, should tell us that we are really quite dumb and simply not very good at either using or evaluating the technology (yet?).
motorest · 17h ago
> (...) at this point, should tell us that we are really quite dumb and simply not very good at either using or evaluating the technology (yet?).

I don't know about that. I think it's mainly because nowadays LLMs can output very inconsistent results. In some applications they can generate surprisingly good code, but during the same session they can also make missteps and shit the bed while following a prompt to make small changes. For example, sometimes I still get prompt responses that outright delete critical code. I'm talking about things like asking "extract this section of your helper method into a new method" and in response the LLM deletes the app's main function. This doesn't happen all the time, or even in the same session for the same command. How does one verify these things?

alightsoul · 22h ago
either that or the improvements aren't as large as before.
XCSme · 1d ago
I remember the saying that from 90% to 99% is a 10x increase in accuracy, but 99% to 99.999% is a 1000x increase in accuracy.

Even though it's a large 10% increase first, then only a 0.999% increase.
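
The arithmetic behind the saying is easier to see in terms of error rate rather than accuracy. A minimal sketch, where `error_reduction_factor` is a hypothetical helper name and exact fractions are used to avoid floating-point noise:

```python
from fractions import Fraction


def error_reduction_factor(accuracy_before: str, accuracy_after: str) -> Fraction:
    """How many times smaller the error rate becomes between two accuracies.

    Accuracies are given as decimal strings (e.g. "0.99") so that Fraction
    can represent them exactly.
    """
    error_before = 1 - Fraction(accuracy_before)
    error_after = 1 - Fraction(accuracy_after)
    return error_before / error_after


# 90% -> 99%: error falls from 10% to 1%.
first_step = error_reduction_factor("0.9", "0.99")

# 99% -> 99.999%: error falls from 1% to 0.001%.
second_step = error_reduction_factor("0.99", "0.99999")

print(first_step, second_step)
```

The second step adds less than one percentage point of accuracy, but it removes a far larger share of the remaining errors than the first step does.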

manmal · 1d ago
The benchmarks don’t look _that_ much better than o3. Does that mean Pro models are just incrementally better than base models, or are we approaching the higher end of a sigmoid function, with performance gains leveling off?
lhl · 12h ago
I've been using o3 extensively since release (and a lot of Deep Research). I also use a lot of Claude and Gemini 2.5 Pro (most of the times, for code I'll let all of them go at it and iterate on my fav results).

So far I've only used o3-pro a bit today, and it's a bit too heavy to use interactively (fire it off, revisit in 10-15 minutes), but it seems to generate much cleaner/more well organized code and answers.

I feel like the benchmarks aren't really doing a good job at capturing/reflecting capabilities atm. eg, while Claude 4 Sonnet appears to score about as well as Opus 4, in my usage Opus is always significantly better at solving my problem/writing the code I need.

Besides especially complex/gnarly problems, I feel like a lot of the different models are all good enough and it comes down to reliability. For example, I've stopped using Claude for work basically because multiple times now it's completely eaten my prompts and even artifacts it's generated. Also, it hits limits ridiculously fast (and does so even when on network/resource failures).

I use 4.1 as my workhorse for code interpreter work (creating graphs/charts w/ matplotlib, basic df stuff, converting tables to markdown) as it's just better integrated than the others and so far I haven't caught 4.1 transposing/having errors with numbers (which I've noticed w/ 4o and Sonnet).

Having tested most of the leading edge open and closed models a fair amount, 4.5 is still my current preferred model to actually talk to/make judgement calls (particularly with translations). Again, not reflected in benchmarks, but 4.5 is the only model that gives me the feeling I had when first talking to Opus 3 (eg, of actual fluid intelligence, and a pleasant personality that isn't overly sycophantic) - Opus 4 is a huge regression in that respect for me.

(I also use Codex, Roo Code, Windsurf, and a few other API-based tools, but tbt, OpenAI's ChatGPT UI is generally better for how I leverage the models in my workflow.)

manmal · 8h ago
Thanks for your input, very appreciated. Just in case you didn’t mean Claude Code, it’s really good in my experience and mostly stable. If something fails, it just retries and I don’t notice it much. Its autonomous discovery and tool use is really good and I’m relying more and more on it.
dyauspitr · 1d ago
Don’t they have a full fledged version of o4 somewhere internally at this point?
ankit219 · 1d ago
They do it seems. o1 and o3 were based on the same base model. o4 is going to be based on a newer (and perhaps smarter) base model.
bachittle · 1d ago
it's the same model as o3, just with thinking tokens turned up to the max.
Tiberium · 1d ago
That's simply not true, it's not just "max thinking budget o3" just like o1-pro wasn't "max thinking budget o1". The specifics are unknown, but they might be doing multiple model generations and then somehow picking the best answer each time? Of course that's a gross simplification, but some assume that they do it this way.
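
For context, "multiple generations, then pick the best answer" is a known general technique, often called self-consistency or majority voting. The sketch below illustrates only that general idea. Nothing in this thread confirms that o3-pro or o1-pro work this way, and `majority_vote` and the `generate` callable are hypothetical names standing in for whatever sampling call is used.

```python
from collections import Counter
from typing import Callable


def majority_vote(generate: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Sample n answers for the same prompt and return the most frequent one.

    `generate` is a stand-in for a single (non-deterministic) model call.
    Ties are broken in favor of the answer that was sampled first.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

This trades n times the compute for a more stable answer, which would be consistent with a slower, more expensive product tier, but that is speculation, as the comment itself notes.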
firejake308 · 1d ago
> "We also introduced OpenAI o3-pro in the API—a version of o3 that uses more compute to think harder and provide reliable answers to challenging problems"

Sounds like it is just o3 with higher thinking budget to me

cdblades · 1d ago
> That's simply not true, it's not just "max thinking budget o3"

> The specifics are unknown, but they might...

Hold up.

> but some assume that they do it this way.

Come on now.

MallocVoidstar · 1d ago
Good luck finding the tweet (I can't) but at least one OpenAI engineer has said that o1-pro was not just 'o1 thinking longer'.
PhilippGille · 1d ago
This one? Found with Kagi Assistant.

https://x.com/michpokrass/status/1869102222598152627

It says:

> hey aidan, not a miscommunication, they are different products! o1 pro is a different implementation and not just o1 with high reasoning.

cdblades · 10h ago
That's a rather crappy product naming scheme.
boole1854 · 1d ago
I also don't have that tweet saved, but I do remember it.
chad1n · 1d ago
The guys in the other thread who said that OpenAI might have quantized o3 and that's how they reduced the price might be right. This o3-pro might be the actual o3-preview from the beginning and the o3 might be just a quantized version. I wish someone would benchmark all of these models to check for drops in quality.
simonw · 1d ago
That's definitely not the case here. The new o3-pro is slow - it took two minutes just to draw me an SVG of a pelican riding a bicycle. o3-preview was much faster than that.

https://simonwillison.net/2025/Jun/10/o3-pro/

FergusArgyll · 22h ago
Wow! pelican benchmark is now saturated
esperent · 17h ago
Not until I can count the feathers, ask for a front view of the same pelican, then ask for it to be animated, all still using SVG.
AstroBen · 1d ago
That's one good looking pelican
Terretta · 20h ago
> It's only available via the newer Responses API

And in ChatGPT Pro.

CamperBob2 · 1d ago
Would you say this is the best cycling pelican to date? I don't remember any of the others looking better than this.

Of course by now it'll be in-distribution. Time for a new benchmark...

jstummbillig · 1d ago
I love that we are in the timeline where we are somewhat seriously evaluating probably super human intelligence by their ability to draw a svg of a cycling pelican.
cdblades · 10h ago
I don't love that this is the conversation and when these models bake-in these silly scenarios with training data, everyone goes "see, pelican bike! super human intelligence!"

The point is never the pelican. The point is that if a thing has information about pelicans, and has information about bicycles, then why can't it combine those ideas? Is it because it's not intelligent?

CamperBob2 · 8h ago
"I'm taking this talking dog right back to the pound. It told me to go long on AAPL. Totally overhyped"
CamperBob2 · 1d ago
I still remember my jaw hitting the floor when the first DALL-E paper came out, with the baby daikon radish walking a dog. How the actual fuck...? Now we're probably all too jaded to fully appreciate the next advance of that magnitude, whatever that turns out to be.

E.g., the pelicans all look pretty cruddy including this one, but the fact that they are being delivered in .SVG is a bigger deal than the quality of the artwork itself, IMHO. This isn't a diffusion model, it's an autoregressive transformer imitating one. The wonder isn't that it's done badly, it's that it's happening at all.

simonw · 23h ago
I like the Gemini 2.5 Pro ones a little more: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...

gkamradt · 1d ago
o3-pro is not the same as the o3-preview that was shown in Dec '24. OpenAI confirmed this for us. More on that here: https://x.com/arcprize/status/1932535380865347585
weinzierl · 1d ago
Is there a way to figure out likely quantization from the output? I mean, does quantization degrade output quality in certain ways that are different from modifications to other model properties (e.g. size or distillation)?
WhitneyLand · 1d ago
So, we currently have o4-mini and o4-mini-high, which represent medium and high usage of “thinking” or use of reasoning tokens.

This announcement adds o3-pro, which pairs with o3 in the same way the o4 models go together.

It should be called o3-high, but to align with the $200 pro membership it’s called pro instead.

That said, o3 is already an incredibly powerful model. I prefer it over the new Anthropic 4 models and Gemini 2.5. Its raw power seems similar to those others, but it’s so good at inline tool use it usually comes out ahead overall.

Any non-trivial code generation/editing should be using an advanced reasoning model, or else you’re losing time fixing more glitches or missing out on better quality solutions.

Of course the caveat is cost, but there’s value on the frontier.

boole1854 · 1d ago
No, this doesn't seem to be correct, although confusion regarding model names is understandable.

o4-mini-high is the label on chatgpt.com for what in the API is called o4-mini with reasoning={"effort": "high"}. Whereas o4-mini on chatgpt.com is the same thing as reasoning={"effort": "medium"} in the API.

o3 can also be run via the API with reasoning={"effort": "high"}.

o3-pro is different than o3 with high reasoning. It has a separate endpoint, and it runs for much longer.

See https://platform.openai.com/docs/guides/reasoning?api-mode=r...
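
The label-to-parameter mapping described above can be written down as a small lookup. This is a sketch only: the mapping is copied from this comment rather than verified against OpenAI's documentation, and `CHATGPT_LABEL_TO_API` and `build_request` are hypothetical names. No API call is made.

```python
# Mapping as stated in the comment above (an assumption, not checked against the docs).
CHATGPT_LABEL_TO_API = {
    "o4-mini": {"model": "o4-mini", "effort": "medium"},
    "o4-mini-high": {"model": "o4-mini", "effort": "high"},
}


def build_request(chatgpt_label: str, prompt: str) -> dict:
    """Build keyword arguments for a Responses API call from a chatgpt.com label."""
    entry = CHATGPT_LABEL_TO_API[chatgpt_label]
    return {
        "model": entry["model"],
        "reasoning": {"effort": entry["effort"]},
        "input": prompt,
    }


print(build_request("o4-mini-high", "Explain the difference between o3 and o3-pro."))
```

With the official `openai` Python package and an API key configured, the returned dictionary could be passed along as `OpenAI().responses.create(**build_request(...))`. o3-pro is deliberately left out of the table, since the comment says it is a separate thing rather than a reasoning-effort setting.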

swyx · 1d ago
here's a nice user review we published: https://www.latent.space/p/o3-pro

sama's highlight[0]:

> "The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that it actually changed how we are thinking about our future."

I kept nudging the team to go the whole way to just let o3 be their CEO but they didn't bite yet haha

0: https://x.com/sama/status/1932533208366608568

tomComb · 1d ago
Big fan swyx, but both here and in the article there is some bragging about being quoted by sama, and while I acknowledge that that’s not out of the ordinary, I’m concerned about where it leads: what it takes to get quoted by sama (or similar interested party) is saying something good about his product, and having a decent follower count.

Dangerous incentives IMO.

swyx · 21h ago
acked. in my defense i didnt write the article + ben already had a good track record from the o1 article. while our relationship with oai is v v v impt to us, we've also covered negative openai stories: https://www.latent.space/p/clippy-v-anton and will continue to give balanced coverage with the other labs when they do well.

we are definitely not seeking to be openai sycophants, nor would they want us to be.

alightsoul · 22h ago
if o3 is so good why aren't they using it to replace management?
mark_l_watson · 18h ago
I am still not willing to upgrade to a Pro account. I pay $20 a month for both Gemini and ChatGPT, and for what I need this is currently enough.

I have dreamed of having powerful AI ever since I read Bertram Raphael's great book Mind Inside Matter around 1978, getting hooked on AI research and sometimes practical applications for my life since then.

I can easily afford $200 for a Pro account but I get this nagging feeling that LLMs are not the final path to the powerful AI I have always dreamed of and I don't want to support this level of hype.

I have lived through a few AI winters and I worry that accountants will tally up the costs, environmental and money, versus the benefits and that we collectively have an 'oh shit' moment.

ikerino · 20h ago
https://www.latent.space/p/o3-pro

Have completed around a dozen chats with o3-pro so far. Can't say I'm impressed, output feels qualitatively very similar to regular o3.

Tried feeding in loads of context as suggested in the article but generally feels like a miss.

nickandbro · 21h ago
"create a svg of a pelican riding on a bicycle"

https://www.svgviewer.dev/s/c3j6TEAP

in case anyone is interested

ikerino · 20h ago
Am I right to say: doesn't look better than anything we've seen before?
tiahura · 1d ago
So, upgrade to Teams and pay the $50? Plus more usage of o3. Seems like it might be a shot at the $100 claude max?
dog436zkj3p7 · 1d ago
What do you mean with "pay the $50"?

Also, does anybody know what limits o3-pro has under the team plan? I don't see it available in the model picker at all (on team).

sanex · 23h ago
I believe teams is $25/user with a 2 user minimum.
dog436zkj3p7 · 23h ago
Ah, thanks for explaining!
ChrisArchitect · 1d ago
Related:

OpenAI dropped the price of o3 by 80%

https://news.ycombinator.com/item?id=44239359

mmsc · 1d ago
I understand that things are moving fast and all, but surely the.. 8? models which are currently available is a bit .. overwhelming for users that just want to get answers to their questions of life? What's the end goal with having so many models available?
nickysielicki · 1d ago
I just can’t believe nobody at the company has enough courage to tell their leadership that their naming scheme is completely stupid and insane. Four is greater than three, and so four should be better than three. The point of a name is to describe something so that you don’t confuse your users, not to be cute.
transcriptase · 1d ago
What’s worse is that the app doesn’t even have descriptions. As if I’m supposed to memorize the use case for each based on:

GPT-4o

o3

o4-mini

o4-mini-high

GPT-4.5

GPT-4.1

GPT-4.1-mini

koakuma-chan · 1d ago
Just use o4-mini for everything
browningstreet · 1d ago
At Techcrunch AI last week, the OpenAI guy started his presentation by acknowledging that OpenAI knows their naming is a problem and they're working on it, but it won't be fixed immediately.
simonw · 1d ago
Sam Altman has said the same thing on Twitter a few times. https://x.com/sama/status/1911906570835022319

> how about we fix our model naming by this summer and everyone gets a few more months to make fun of us (which we very much deserve) until then?

nickysielicki · 1d ago
I’d prefer for them to just fix it asap instead and then keep the existing endpoints around for a year as aliases.
moomin · 1d ago
I know they have a deep relationship with Microsoft, but perhaps they shouldn’t have used Microsoft’s product naming department.
orra · 1d ago
Zune .NET O3... shudders
MallocVoidstar · 1d ago
The reason their naming scheme is so bad is because their initial attempts at GPT-5 failed in training. It was supposed to be done ~1 year ago. Because they'd promised that GPT-5 would be vastly more intelligent than GPT-4, they couldn't just name any random model "GPT-5", so they suddenly had to start naming things differently. So now there's GPT-4.5, GPT-4.1, the o-series, ...
kaoD · 23h ago
Surely there's a less stupid way than naming two very different models o4 and 4o.
aetherspawn · 1d ago
Came here to say this, the naming scheme is ridiculous and is getting more impossible to follow each day.

For example the other day they released a supposedly better model with a lower number..

aetherspawn · 1d ago
I’d honestly prefer they just have 3 personas of varying cost/intelligence: Sam, Elmo and Einstein or something, and then tack on the date, elmo-2025-1 and silently delete the old ones.
levocardia · 1d ago
There's a humorous version of Poe's law that says "any sufficiently genuine attempt to explain the differences between OpenAI's models is indistinguishable from parody"
Osyris · 1d ago
This is a much more expensive model to run and is only available to users who pay the most. I don't see an issue.

However, the "plus" plan absolutely could use some trimming.

bachittle · 1d ago
free users don't have this model selector, and probably don't care which model they get so 4o is good enough. paid users at 20$/month get more models which are better, like o3. paid users at 200$/month get the best models that are also costing OpenAI the most money, like o3-pro. I think they plan to unify them with GPT-5.
stavros · 1d ago
That doesn't help much when we're asymptotically approaching GPT-5. We're probably going to be at GPT-4.9999 soon.
rfw300 · 22h ago
Not necessarily true. GPT-4.1 was released after GPT-4.5-preview. Next model might be GPT-3.7.
nikcub · 1d ago
I'd be curious what proportion of paid users ever switch models. I'd guess < 10%
CuriouslyC · 1d ago
If you're not at least switching from 4o to 4.1 you're doing it wrong.
CamperBob2 · 1d ago
I switch to o1-pro on occasion, but it is slow enough that I don't use it as much as some of the others. It is a reasonably-effective last resort when I'm not getting the answer quality that I think should be achievable. It's the best available reasoning model from any provider by a noticeable margin.

Sounds like o3-pro is even slower, which is fine as long as it's better.

o4-mini-high is my usual go-to model if I need something better than the default GPT4-du jour. I don't see much point in the others and don't understand why they remain available. If o3-pro really is consistently better, it will move o1-pro into that category for me.

paxys · 1d ago
> users that just want to get answers to their questions of life

Those users go to chat.openai.com (or download the app), type text in the box and click send.

resters · 1d ago
Models are used for actual tasks where predictable behavior is a benefit. Models are also used on cutting-edge tasks where smarter/better outputs are highly valued. Some applications value speed and so a new, smaller/cheaper model can be just right.

I think the naming scheme is just fine and is very straightforward to anyone who pays the slightest bit of attention.

macawfish · 1d ago
Overwhelming yet pretty underwhelming
AtlasBarfed · 1d ago
I'd like one to do my test use case:

Port unix-sed from c to java with a full test suite and all options supported.

Somewhere between "it answers questions of life" and "it beats PhDs at math questions", I'd like to see one LLM take this, IMO, rather "pure" language task and succeed.

It is complicated, but it isn't complex. It's string operations with a deep but not that deep expression system and flag set.

It is well-described and documented on the internet, and presumably training sets. It is succinctly described as a problem that virtually all computer coders would understand what it entailed if it were assigned to them. It is drudgerous, showing the opportunity for LLMs to show how they would improve true productivity.

GPT fails to do anything other than the most basic substitute operations. Claude was only slightly better, but to its detriment hallucinated massive amounts and made fake passing test cases that didn't even test the code.

The reaction I get to this test is ambivalence, but IMO if LLMs could help port entire software packages between languages with similar feature sets (aside from Turing Completeness), then software cross-use would explode, and maybe we could port "vulnerable" code to "safe" Rust en masse.

I get it, it's not what they are chasing customer-wise. They want to write (in n-gate terms) webcrap.

nipah · 1h ago
I have a very simple question, five lines at best, that basically no model, neither reasoning nor simpler, could grasp. For obvious reasons I'm not disclosing it here (because I fear data contamination in the long run), but it basically breaks the "reasoning" of those things. Unfortunately, I still can't try o3-pro because the API version is not easily available, and I'm certainly not willing to pay for it in pro mode, but when it comes to the plus version (if it comes) I'll try. To this date, because of this question (and similar ones), I stand very unimpressed with those models. The marketing is a thousand times larger than reality, and I suspect people in general are surprisingly less capable of detecting intelligence than they think.

The normal o3 also managed to break 3 isolated installations of Linux I was trying it with, a few days ago. The task was very simple: set up Ubuntu with btrfs, timeshift and grub-btrfs. It managed to fail every single time (even when searching the web), so it was not impressive either.

CamperBob2 · 1d ago
How does the latest Gemini 2.5 Pro Ultra Flash Max Hemi XLT release do on that task? It obviously demands a massive context window.
AtlasBarfed · 1h ago
I'll check once the nitrous tanks and the aftermarket turbos I had overnighted from Japan arrive.