O3 Turns Pro

110 jsnider3 75 6/17/2025, 2:49:50 PM thezvi.substack.com ↗

Comments (75)

vessenes · 2h ago
I'm using Pro. It's definitely a "hand it to the team and have them schedule a meeting to get back to me" speed tool. But, it "feels" better to me than o3, and significantly better than gemini/claude for that use case. I do trust it more on confabulations; my current trust hierarchy would be o3-pro -> o3 -> gemini -> claude opus -> (a bunch of stuff) -> 4o.

That said, I'd like this quality in a relatively quick tool-using model; I'm not sure what else I'd want before calling it "AGI" at that point.

qwertox · 1h ago
What are you using it for? It's not like that wouldn't matter.

With coding, any model is always hit and miss, so I prefer faster models where I can throw away the chat if it turns into an idiot.

Would I wait 15 minutes for a transcription from Python to Rust if I don't know what the result will be? No.

Would I wait 15 minutes if I'd be a mathematician working on some kind of proof? Probably yes.

AaronAPU · 1h ago
I feed most of my questions/code to 4o, Gemini, o3-pro (in that order). By the time I’ve read through 4o, Gemini is ready. Etc.

It’s the progressive jpg download of 2025. You can short circuit after the first model which gives a good enough response.
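The "progressive download" workflow above can be sketched roughly as follows. This is a minimal illustration, not anyone's actual tooling: the `ask` callables and the `good_enough` check are hypothetical stand-ins for real model clients and for the reader's own judgment. All requests start at once, answers are read fastest-first, and anything still running is cancelled once an answer is accepted.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical stand-in: an async function that sends a prompt to one model.
AskFn = Callable[[str], Awaitable[str]]


async def progressive_ask(
    prompt: str,
    models: list[tuple[str, AskFn]],
    good_enough: Callable[[str], bool],
) -> Optional[tuple[str, str]]:
    """Query every model at once, read answers fastest-first, stop early."""
    # Start all requests immediately so the slow models work while we read.
    tasks = [(name, asyncio.create_task(ask(prompt))) for name, ask in models]
    try:
        for name, task in tasks:  # assumed to be listed fastest to slowest
            answer = await task
            if good_enough(answer):
                return name, answer
        return None  # no model produced an acceptable answer
    finally:
        # Short circuit: cancel whatever is still running.
        for _, task in tasks:
            if not task.done():
                task.cancel()
```

The models are awaited in list order rather than completion order, which mirrors reading the 4o answer first even if another model happens to finish earlier.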

plufz · 22m ago
How do you reason about the energy consumption/climate impact of feeding the same question to three models? I'm not saying there is a clear answer here, it would just be interesting to hear your thinking.
themanmaran · 9m ago
The same way you might reason about the climate impact of having a youtube video on in the background I expect.
true_religion · 12m ago
How much energy does an AI model use during inferencing versus a human being?

This is a rhetorical question.

Sure we aren’t capturing every last externality, but optimization of large systems should be pushed toward the creators and operators of those systems. Customers shouldn’t have to validate environmental impact every time they spend 0.05 dollars to use a machine.

dfsegoat · 14m ago
It's a tough question and I do things the same way.

I feel like we are in an awkward phase of: "We know this has severe environmental impact - but we need to know if these tools are actually going to be useful and worth adopting..." - so it seems like just keeping the environmental question at the forefront will be important as things progress.

AaronAPU · 6m ago
I don’t have nearly a luxurious enough life for that to be a blip on my radar of concerns.
achierius · 52m ago
Ideally it should be able to do things outside of the realm of programming with strong reliability (at least as strong as human experts), as well as be able to pick up new skills and learn new facts dynamically.
Y_Y · 2h ago
Are you setting the "reasoning effort"? I find going from the default (medium) to high makes a big difference on coding tasks for openai reasoning models.
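For readers calling the API directly, setting the effort level might look like the sketch below. It only builds the request arguments, so it runs without network access. The `build_request` helper is hypothetical; the `reasoning={"effort": ...}` field is the Responses API setting as I understand it, and the allowed values listed here are an assumption that may differ by model.

```python
from typing import Any

# Assumed set of effort levels; check the API reference for your model.
ALLOWED_EFFORTS = ("low", "medium", "high")


def build_request(prompt: str, model: str = "o3-pro", effort: str = "high") -> dict[str, Any]:
    """Build keyword arguments for a Responses API call with an explicit effort level."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"effort must be one of {ALLOWED_EFFORTS}, got {effort!r}")
    return {
        "model": model,
        "input": prompt,
        "reasoning": {"effort": effort},
    }


# Usage (assumes the `openai` package is installed and OPENAI_API_KEY is set):
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.responses.create(**build_request("Review this function: ..."))
#   print(response.output_text)
```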
pas · 5m ago
what/how does that work internally?
IamLoading · 49m ago
The time o3 pro takes is so annoying. I still need some time to get used to that.
bananapub · 2h ago
what do you trust it to do?

the only example uses I see written about on HN appear to basically be Substack users asking o3 marketing questions and then writing substack posts about it, and a smattering of vague posts about debugging.

vessenes · 2h ago
Long form research reporting.

Example: Pull together a list of the top 20 startups funded in Germany this year, valuation, founder and business model. Estimate which is most likely to want to take on private equity investment from a lower mid market US PE fund, as well as which would be most suitable taking into consideration their business model, founders and market; write an approach letter in english and in german aimed at getting a meeting. make sure that it's culturally appropriate for german startup founders.

I have no idea what the output of this query would be by the way, but it's one I would trust to get right on

* the list of startups

* the letter and its cultural sensitivity

* broad strokes of what the startup is doing

Stuff I'd "trust but verify" would be

* Names of the founders

* Size of company and target market

Stuff I'd double check / keep my own counsel on

* Suitability and why (note that o3 pro is def. better at this than o3 which is already not bad; it has some genuinely novel and good ideas, but often misses things.)

leptons · 1h ago
This is all stuff I would expect an LLM to "hallucinate" about. Every bit of it.
thelock85 · 1h ago
I recently tried a version of this landscape analysis within a space I understand very well (CA college access nonprofits) and was shocked at how few organizations were named, let alone described in detail. Even worse, the scope and reach of the named orgs were pretty off the mark. My best guess is that they were the SEO winners of the past.
steveklabnik · 1h ago
These tools can search the web to find this kind of data, and show you what they searched. Double checking is essential because hallucinations are still possible, but it's not like in the past where it would just try to make up the data from its training set. That said, it also may find bad data and give you a summary of that, which isn't a direct hallucination, but can still be inaccurate. This is why checking the sources is helpful too.
majormajor · 1h ago
I wouldn't expect it to hallucinate, but how do you evaluate its ability to distinguish spam from good info? I.e. the "the first four pages of Google results are all crap nowadays" problem.
steveklabnik · 1h ago
By looking at the pages it looked at and deciding for yourself, just like you would with a web search you invoked yourself. I’ve generally found it to use trustworthy stuff like Stack Overflow, Wikipedia, and university websites. But I also haven’t used it in this way that much or for very serious things. I’d imagine more obscure questions are more likely to end up involving less trustworthy sites.
vessenes · 1h ago
Well you’d be wrong in this case: Deep research will trigger a series of web searches first then reach out to tooling for follow ups as needed; most of the facts will be grounded in the sources it finds.

With no deep research - agreed; too recent to believe info is accurately stored in the model weights.

bananapub · 1h ago
why would you trust it to get any of that right? things like "top 20 startups in Germany" sound hard to determine.

how do you validate all of that is actually correct?

jazzyjackson · 1h ago
A lot of stuff doesn't need to be accurate, it just needs to be enough information to act on.

Like how there's a ton of psychics, tarot and palm readers around Wall St.

bananapub · 48m ago
That’s fine, but no one - not Sam Altman, not the fans on HN - is promoting them as $120/million token clairvoyants; they’re claiming they are srs bzns “iq maxxing” research tools.

If OP had suggested that they were just medium-quality nonsense generators I would have just agreed and not replied.

lovich · 1h ago
I’ve been using it in my job search by handing it stuff like the HN "Who is hiring?" threads, giving it a list of criteria I care about, and having it scour those posts for matching jobs, then chase down all the companies posting and see if they have anything on their corporate site matching my descriptions.

Then I have it take those matches and try and chase down the hiring manager based on public info.

I did it at first just to see if it was possible, but I am getting direct emails that have been accurate a handful of times and I never would have gotten that on my own

bananapub · 21m ago
This is a good data point - I guess another dimension is incompleteness-tolerance. An LLM is absolutely going to miss some but for your case that doesn’t matter very much.

Thank you!

jes5199 · 1h ago
I haven’t tried pro yet but just yesterday I asked O3 to review a file and I saw a message in the chain-of-thought like “it’s going to be hard to give a comprehensive answer within the time limit” so now I’m tempted
snissn · 2h ago
I’ve found that throwing the problem at 3 o3 pros and having another one evaluate and synthesize works really well.
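A rough sketch of that fan-out-then-synthesize pattern, assuming a hypothetical blocking `ask(prompt)` function that wraps whatever model client you use. The judge prompt wording is illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fan_out_and_synthesize(prompt: str, ask: Callable[[str], str], n: int = 3) -> str:
    """Collect n independent drafts in parallel, then ask once more for a synthesis."""
    # Each draft is a separate call with the same prompt.
    with ThreadPoolExecutor(max_workers=n) as pool:
        drafts = list(pool.map(lambda _: ask(prompt), range(n)))

    numbered = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    judge_prompt = (
        "Several independent answers to the same question follow. "
        "Evaluate them, note where they disagree, and write one synthesized answer.\n\n"
        f"Question:\n{prompt}\n\n{numbered}"
    )
    # The final call plays the role of the evaluator.
    return ask(judge_prompt)
```

Note that this makes n + 1 slow calls per question, which is the cost the reply below is pointing at.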
ActionHank · 33m ago
So like, a whole forest of trees per query is what we're saying here?
rotcev · 1h ago
I use O3-pro not as a coding model, but as a strategic assistant. For me, the long delay between responses makes the model unsuitable for coding workflows, however, it is actually a feature when it comes to getting answers to hard questions impacting my (or my friend's/family's) day to day life.
metalrain · 51m ago
"'take your profits’ in quality versus quantity is up to you."

As mainly an AI investor, not an AI user, I think profitability is of great importance. It has been a race to the top so far; soon we will see a race to the bottom.

resters · 23m ago
Right! We are in a sense lucky to be getting access to actual state-of-the-art models. Soon the actual model may be kept internal and the customers will get "good enough for solid ROI" distilled versions that can be hosted profitably.
swyx · 1h ago
> Arena has gotten quite silly if treated as a comprehensive measure (as in Gemini 2.5 Flash is rated above o3)

> The problem with o3-pro is that it is slow.

well maybe Arena is not that silly then. poorly argued/organized article.

b0a04gl · 1h ago
when o3 pricing dropped 80%, most wrote the entire model family off as a downgrade (including me). but usage patterns flipped people finally ran real tasks through it. it's one of the few that holds state across fragmented prompts without collapsing context. used it to audit a messy auth flow spread over 6 services. didn't shortcut, didn't hallucinate edge cases. slow, but deliberate. in kahneman terms, it runs system 2 by default. many still benchmark on token speed, missing what actually matters
lubujackson · 13m ago
I have been using o3 almost exclusively in Cursor now for my "vibe coding" project. I was able to get to a point with faster models before hitting a thrashing problem of forgetting about structure/not updating types/not using the right types/ignoring existing functions, etc., even when providing specific context. o3 rarely hits those issues and can happily implement a full feature that touches multiple files without breaking anything. Speed is definitely an issue, but there is much less hassle on the back side.
lysecret · 40m ago
This feels very AI-generated.
motoxpro · 2m ago
I would say the opposite. Unless the person has a lot of custom instructions going on. Getting sentences like "but usage patterns flipped people finally ran real tasks through it." seem like it would take some amount of work.
mettamage · 28m ago
Some people write in similar ways yea. I've also been accused of writing as an AI.

But we're still human mate.

Stop discriminating or actually solve the problem. I've had enough of this attitude.

b0a04gl · 6m ago
yes im agi by the way
boole1854 · 21m ago
Here are my own anecdotes from using o3-pro recently.

My primary use case where I am willing to wait 10-20 minutes for an answer from the "big slow" model (o3-pro) is code review of large amounts of code. I have been comparing results on this task from three models: o3-pro, Gemini 2.5 Pro, and Claude Opus 4.

Oddly, I see many cases where each model will surface issues that the other two miss. In previous months when running this test (e.g., Claude 3.7 Sonnet vs o1-pro vs earlier Gemini), that wasn't the case. Back then, the best model (o1-pro) would almost always find all the issues that the other models found. But now it seems they each have their own blindspots (although they are also all better than the previous generation of models).

With that said, I am seeing Claude Opus 4 (w/extended thinking) be distinctly worse, missing problems which o3-pro and Gemini find. It seems fairly consistent that Opus will be the worst out of the three (despite sometimes noticing things the others do not).

Whether o3-pro or Gemini 2.5 Pro is better is less clear. o3-pro will report more issues, but it also has a tendency to confabulate problems. My workflow involves providing the model with a diff of all changes, plus the full contents of the files that were changed. o3-pro seems to have a tendency to imagine and report problems in the files that were not provided to it. It also has an odd new failure mode, which is very consistent: it gets confused by the fact that I provide both the diff and the full file contents. It "sees" parts of the same code twice and will usually report that there has accidentally been some code duplicated. Base o3 does this as well. None of the other models get confused in that way, and I also do not remember seeing that failure mode with o1-pro.
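One way to assemble the diff-plus-full-files input described above is to label the two sections explicitly. This is only a sketch: `build_review_prompt` and its section markers are made up for illustration, and whether the extra instructions actually reduce the "duplicated code" false positives is untested.

```python
def build_review_prompt(diff: str, files: dict[str, str]) -> str:
    """Combine a diff and the full contents of changed files into one labeled prompt."""
    parts = [
        "Review the change below.",
        "The DIFF section shows what changed. The FILES section repeats the same "
        "code in its final state for context, so overlap between the two is "
        "expected and is not duplicated code.",
        "Only report problems in files listed under FILES.",
        "=== DIFF ===",
        diff.strip(),
        "=== FILES ===",
    ]
    # Sort by path so the prompt is stable between runs.
    for path, content in sorted(files.items()):
        parts.append(f"--- {path} ---")
        parts.append(content.rstrip())
    return "\n\n".join(parts)
```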

Nevertheless, it seems o3-pro finds real issues that Gemini 2.5 Pro and Opus 4 miss more often than the reverse.

Back in the o1-pro days, it was fairly straightforward in my testing for this use case that o1-pro was simply better across the board. Now with o3-pro compared particularly with Gemini 2.5 Pro, it's no longer clear whether the bonus of occasionally finding a problem that Gemini misses is worth the trouble of (1) waiting way longer for an answer and (2) sifting through more false positives.

My other common code-related use case is actually writing code. Here, Claude Code (with Opus 4) is amazing and has replaced all my other use of coding models, including Cursor. I now code almost exclusively by pair programming with Claude Code, allowing it to be the code writer while I oversee and review. The OpenAI competitor to Claude Code, called Codex CLI, feels distinctly undercooked. It has a recurring problem where it seems to "forget" that it is an agent that needs to go ahead and edit files, and it will instead start to offer me suggestions about how I can make the change. It also hallucinates running commands on a regular basis (e.g., I tell it to commit the changes we've done, and it outputs that it has done so, but it has not).

So where will I spend my $200 monthly model budget? Answer: Claude, for nearly unlimited use of Claude Code. For highly complex tasks, I switch to Gemini 2.5 Pro, which is still free in AI Studio. If I can wait 10+ minutes, I may hand it to o3-pro. But once my ChatGPT Pro subscription expires this month, I may either stop using o3-pro altogether, or I may occasionally use it as a second opinion by paying on-demand through the API.

franze · 2h ago
I use Claude Code a lot. A lot lot. I make it do Atomic Git commits for me. When it gets stuck and, instead of just saying so, starts to refactor half of the codebase, I jump back to the commit where the issue first appeared and get a summary of the involved files. I paste those in full text (not as files) into o3 pro. And you can be sure it finds the issue or gives a direction where the issue does not appear. Would love o3-pro as an MCP so whenever Claude Code goes on a "let's refactor everything" coding spree it just asks o3 pro.
jgalt212 · 2h ago
> When it gets stuck and instead of just saying so starts to refactor half of the codebase

That's pretty scary.

franze · 1h ago
Atomic Commits.

I put this into Claude.md and need to remind it every other hour. But yeah, you need to jump back every few hours or so.

nevertoolate · 52m ago
Can you give an example of what claude works on autonomously for hours? I only use the chat, maybe I’m just not prompting well, but I throw away almost everything claude writes and solve it in significantly fewer lines of code using the proper abstractions.
ActionHank · 29m ago
Yeah, so far, I've only seen cases where the work is extremely simple and using pervasively used libraries and solutions to create widely implemented solutions. Add something a little out there and things start to unravel.
starik36 · 2h ago
I've tried o3 Pro for my use cases (parsing emails in the legal profession) and didn't have better results than the non pro.

In fact, o1-preview has given me more consistently correct results than any other model. But it's being sunset next month so I have to move to o3.

ActionHank · 28m ago
Out of interest, how widespread would you say this usage is amongst your peers in the legal profession?
starik36 · 16m ago
ChatGPT is pretty widespread. The only obstacle in the past was the fear that confidential documents might be used for training. OpenAI fixed that with a business account type that guarantees no training.

As far as usage of API for business processes (like document processing) - I can't say.

AaronAPU · 1h ago
IMO 4o is much better at people-parsing. The reasoning models o1-pro / o3-pro are really good at writing code and solving algorithmic problems.
starik36 · 13m ago
I've tried it with various models. And 4o is really good given that it returns data at least 10 times faster. But if you ask it to fill out a JSON document, o3 (or other reasoning models) is still better, more correct and predictable. Or at least, better enough to justify waiting a minute for the API call to return vs 3-5 seconds.
resters · 21m ago
what is people parsing?
AaronAPU · 10m ago
Things like inferring the meaning of “people parsing” when it isn’t explicitly defined but can be implied by context.

Not strict rational A+B=C, nuance.

starik36 · 12m ago
The email from the lawyer might mention lots of names. Who are the plaintiffs, who are defendants, their attorneys, assistants, or insurance adjusters. The model parses out who is who and connects names to titles to email addresses.
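The kind of structured output described here could be validated along these lines. This is a sketch under stated assumptions: the `Party` fields, the role names, and the `{"parties": [...]}` JSON shape are all hypothetical, not the commenter's actual schema.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical role vocabulary, taken from the roles named in the comment above.
ROLES = {"plaintiff", "defendant", "attorney", "assistant", "insurance_adjuster"}


@dataclass
class Party:
    name: str
    role: str
    email: Optional[str] = None


def parse_parties(model_output: str) -> list[Party]:
    """Parse the model's JSON reply and reject roles outside the expected set."""
    data = json.loads(model_output)
    parties = []
    for item in data["parties"]:
        if item["role"] not in ROLES:
            raise ValueError(f"unexpected role: {item['role']!r}")
        parties.append(Party(name=item["name"], role=item["role"], email=item.get("email")))
    return parties
```

Rejecting unknown roles up front turns a silently wrong extraction into a visible error that can be retried or reviewed by hand.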
iLoveOncall · 2h ago
> My experience so far is that waiting a long time is annoying, sufficiently annoying that you often won’t want to wait.

My solution for this has been to use non-reasoning models, and so far in 90% of the situations I have received the exact same results from both.

jasonjmcghee · 2h ago
On the complete other end of the spectrum, I found deep research (whether it's actually performing searches or not) to be a significant upgrade in quality. But you need to be cool with having to wait 15-30 minutes. It's certainly not for everything, but definitely worth trying.

It tends to produce significantly longer and more detailed output. So when you want that kind of thing, it works well. Especially if you need up-to-date stuff or want to find related sources.

joshstrange · 2h ago
Deep research is very cool, no doubt, but run it on a problem space you are familiar with and you will see the shortcomings.

Anytime I do my own “deep” research I like to then throw the same problem at OpenAI and see how well it fares. Often it misses things or gets things subtly wrong. The results look impressive so it’s easy to fool people and I’m not saying the results are useless, I’ve absolutely gotten value out of it, but I don’t love using it for anything I actually care about.

bcrosby95 · 1h ago
I view the results more as a starting point than an end unto itself. For that I think it's pretty useful.
joshstrange · 5m ago
Absolutely, I agree it's useful as a starting point, sometimes it's all I need (if it's low-stakes and I just wanted a bit more data). I was just cautioning against "trusting" it completely, since it's very easy to fall into that trap (I've done it).
matwood · 21m ago
Same, it will pull enough sources together that I end up with an idea of where to go next.
jacobsenscott · 2h ago
All I got for that is it might get a question "right" 63% of the time. But you don't know if it is right or wrong unless you already know the answer. Why do people use these things?
crubier · 2h ago
Writing a Pull Request can take me 8 hours. Reviewing a Pull Request of the same size takes me 30min. Here you go.
Y_Y · 3m ago
P ⊆ NP
infecto · 2h ago
A bit of a meta topic but the one thing that probably grinds my nerves more than it should is this style of comments that are not extremely additive and simply posit some idea without experience or backing statements. Happens a lot in these LLM discussions. Perhaps there is genuine curiosity but it seems to always read as an objection to the idea of these tools coming from someone who has not used them.
ezst · 23m ago
To me it's a useful reminder that those tools are nothing but text generating algorithms, optimised to produce compelling answers, irrespective of whether they are truthful or not, having no concept of what's factual, and completely missing the ability to give up when asked for impossible or unreasonable answers outside of their training data.

In essence, they are only adequate in niche situations (like creative writing, marketing, placeholder during iterative design, …) where there's no such social contract and assumption that people operate in good faith and do their best diligence not to deceive others.

Pretending otherwise, not pushing back when LLMs are clearly used outside of those contexts, or dressing them into what they are not (thinking machines, search engines, knowledge archives, …) is doing the work of useful idiots defending tech oligarchs and data thieves against their own interests.

And yeah, I get it, naysayers are annoying. Doesn't mean they are wrong or their voices shouldn't be heard at a time where the legality and ethics of all this are being debated.

nevertoolate · 41m ago
On the other hand I only see downvotes and _never_ an answer on how you are using llms. The anecdotal 8hours to 30 minutes PR sounds great, but in my experience it just won’t happen. How can you set up llm to work autonomously for _hours_? If it is continuous “pair” work I just don’t see the 30 min work solving a beefy PR. In 8 hours coding with a well thought out plan / re-planning, testing one can finish quite interesting stuff. 30 minutes is basically nothing - and this is kinda what you get with an llm in my experience. How do you do it?
huxley · 2h ago
Not necessarily, you don’t need to know the answer, the fabulation might:

* give an error

* return the wrong result

* not be internally consistent with the rest of the content

* be logically impossible

* be factually impossible

* have basic errors

It is entirely possible (and quite common) to know something is wrong without knowing what a right answer is.

ashdksnndck · 2h ago
If you’re referring to the first chart in OP, “comparative evaluations with human testers”, it’s measuring how often o3-pro gave a better answer than o3. It’s not reporting a 63% accuracy rate.
wahnfrieden · 1h ago
Many types of work are time-consuming to produce, and quick to verify.
Sateeshm · 1h ago
I am curious. What are a few examples?
wahnfrieden · 1h ago
Many code-writing tasks.

How long do your teams take to write vs review PRs? How long does it take to review a test case and run it vs write the implementation under test? Or to verify that a regressed test now completes successfully after a fix? How long does it take you to do a "design review" of a rendered webpage vs to create a static webpage? How long does it take to evaluate a performance optimization vs write it?

imiric · 6m ago
If your team takes a disproportionately shorter amount of time to review PRs than to write them, I guarantee that your code base has many issues that would've been caught by a more thorough reviewer. Reviewing code doesn't mean slapping a quick "LGTM!" because you trust the author.

> How long does it take to review a test case and run it vs write the implementation under test?

If you blindly trust a passing test and don't review it as production code, I have a bridge to sell you.

> How long does it take to evaluate a performance optimization vs write it?

Factoring in the time to review that the optimization didn't introduce a regression, and isn't a hack that will cause other issues later: the difference shouldn't be too large.

Yes, code usually takes more time and effort to write, but if it's not thoroughly read, understood, and reviewed, it can cause havoc someone will have to deal with later.

The idea that LLMs helping you write code quicker will make you or the team more productive is delusional. It's just kicking the can down the road. You can ignore it, but sooner or later someone will have to handle it. And you better hope that it happens before it impacts your users.

steveklabnik · 1h ago
"my tests are failing, and I don't know why. can you investigate?"
arrowsmith · 1h ago
Unless P=NP
add-sub-mul-div · 2h ago
The majority of people just want to go home at 5 after putting in as little effort as possible. Their bosses just want to save money in the short term. The interests could not be more aligned and optimized.
bananapub · 2h ago
you're being silly. there are definitely cases in life where "verifying an answer" is much less effort than "producing an answer" (public key cryptography is built on this!). an obvious example is "writing boring code". I can much more quickly review the code to a simple little custom web app than I can sit down and write it. that's great! as a bonus, no one dies if my little dashboard crashes on invalid input or whatever. another thing might be marketing copy - no one really cares if it's good or not and 500 "OK" words on a topic might take an hour to write but five minutes to read and correct the grammar of.

an example of things that are the opposite is "public policy development", which is why it's simply malicious that various corrupt oligarchs are pushing for it to be used for such things.

so, a simple model for you to understand why other people might find these tools useful for some things:

- low stakes - doesn't matter that much if the output isn't Top Quality, either because it's easy to fix or it just doesn't matter

- enormous gap in cost between generation and review - e.g. coding

- review systems exist and are used - I don't care very much if my coworkers use an LLM to write code or not, since all the code gets reviewed by someone else, and if the proposer of the change doesn't even bother to check it themselves then they pay the social cost for it
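The gap between producing and verifying an answer, which the cryptography aside above alludes to, can be shown with a toy example. This is an illustration only: trial division is a deliberately naive way to factor, and the numbers are far too small for real cryptography.

```python
from typing import Optional


def verify_factors(n: int, p: int, q: int) -> bool:
    """Checking a claimed factorization is a single multiplication."""
    return p > 1 and q > 1 and p * q == n


def find_factors(n: int) -> Optional[tuple[int, int]]:
    """Finding the factors by trial division takes up to about sqrt(n) steps."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None  # n is prime (or less than 4)


n = 10007 * 10009  # product of two primes
print(find_factors(n))                 # slow path: thousands of divisions
print(verify_factors(n, 10007, 10009)) # fast path: one multiplication
```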

A_D_E_P_T · 2h ago
Chat just isn't the best format for something that takes 15-20 minutes (on average) to come up with a response. Email would unironically be better. Send a very long and detailed prompt, like a business email, and get a response back whenever it's ready. Then you can refine the prompt in another email, etc.

But I should note that o3-pro has been getting faster for me lately. At first every damn thing, however simple, took 15+ minutes. Today I got a few answers back within 5 minutes.