So I decided I would buy some API credits with my OpenAI account. I ponied up $20 and started Aider with my new API key set and o3 as the model. I get the following after sending a request:
"litellm.NotFoundError: OpenAIException - Your organization must be verified to use the model `o3`. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate."
At that point, the frustration was beginning to creep in. I returned to OpenAI and clicked on "Verify Organization". It turns out, "Verify Organization" actually means "Verify Personal Identity With Third Party" because I was given the following:
"To verify this organization, you’ll need to complete an identity check using our partner Persona."
Sigh. I click "Start ID Check" and it opens a new tab for their "partner" Persona. The initial fine print says:
"By filling the checkbox below, you consent to Persona, OpenAI’s vendor, collecting, using, and utilizing its service providers to process your biometric information to verify your identity, identify fraud, and conduct quality assurance for Persona’s platform in accordance with its Privacy Policy and OpenAI’s privacy policy. Your biometric information will be stored for no more than 1 year."
OK, so now, we've gone from "I guess I'll give OpenAI a few bucks for API access" to "I need to verify my organization" to "There's no way in hell I'm agreeing to provide biometric data to a 3rd party I've never heard of that's a 'partner' of the largest AI company and Worldcoin founder. How do I get my $20 back?"
leetrout · 13h ago
I actually contacted the California AG to get a refund from another AI company after they failed to refund me.
The AG office followed up and I got my refund. Worth my time to file, because we should stop letting companies get away with this stuff where they show up with more requirements after you've already paid.
Separately they also do not need my phone number after having my name, address and credit card.
Has anyone got info on why they are taking everyone’s phone number?
pembrook · 2m ago
Source: have dealt with fraud at scale before.
Phone number is the only way to reliably stop MOST fraudulent actors. You can easily block VOIP numbers and ensure the person connected to this number is paying for an actual phone plan, which cuts down dramatically on abuse.
Hence why even Facebook requires a unique, non-VOIP phone number to create an account.
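A crude first-pass version of that filter doesn't even need a carrier lookup. Here's a toy sketch using the offline phonenumbers Python library to reject numbers whose metadata marks them as VOIP; this is only an illustration of the idea, not what Facebook or OpenAI actually run, and real fraud stacks layer paid carrier/line-type lookups on top of something like it:

    # pip install phonenumbers
    import phonenumbers
    from phonenumbers import NumberParseException, PhoneNumberType

    def looks_like_voip(raw: str, default_region: str = "US") -> bool:
        """Crude first-pass check: reject unparseable, invalid, or VOIP-range numbers."""
        try:
            num = phonenumbers.parse(raw, default_region)
        except NumberParseException:
            return True  # can't even parse it, treat as suspect
        if not phonenumbers.is_valid_number(num):
            return True
        return phonenumbers.number_type(num) == PhoneNumberType.VOIP

    print(looks_like_voip("+1 650 253 0000"))  # ordinary US range, expect False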
jazzyjackson · 11h ago
(having no insider info:) Because it can be used as a primary key ID across aggregated marketing databases including your voting history / party affiliation, income levels, personality and risk profiles etc etc etc. If a company wants to, and your data hygiene hasn't been tip top, your phone number is a pointer to a ton of intimate if not confidential data. Twitter was fined $150 million for asking for phone numbers under pretense of "protecting your account" or whatever but they actually used it for ad targeting.
>> Wednesday's 9th Circuit decision grew out of revelations that between 2013 and 2019, X mistakenly incorporated users' email addresses and phone numbers into an ad platform that allows companies to use their own marketing lists to target ads on the social platform.
>> In 2022, the Federal Trade Commission fined X $150 million over the privacy gaffe.
>> That same year, Washington resident Glen Morgan brought a class-action complaint against the company. He alleged that the ad-targeting glitch violated a Washington law prohibiting anyone from using “fraudulent, deceptive, or false means” to obtain telephone records of state residents.
>> X urged Dimke to dismiss Morgan's complaint for several reasons. Among other arguments, the company argued merely obtaining a user's phone number from him or her doesn't violate the state pretexting law, which refers to telephone “records.”
>> “If the legislature meant for 'telephone record' to include something as basic as the user’s own number, it surely would have said as much,” X argued in a written motion.
Tangential: please do not use a phone number as a PK. Aside from the nightmare of normalizing them, there is zero guarantee that someone will keep the same number.
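To make the normalization point concrete, here's a small sketch (again using the phonenumbers package) of why the raw string makes a poor key; even once you canonicalize, number recycling means the value can later belong to someone else, so it belongs in a mutable column behind a surrogate id:

    import phonenumbers
    from phonenumbers import PhoneNumberFormat

    # The same subscriber, as users actually type it:
    raw = ["(415) 555-2671", "+1 415-555-2671", "1.415.555.2671", "415 555 2671"]

    def to_e164(s: str, region: str = "US") -> str:
        return phonenumbers.format_number(phonenumbers.parse(s, region), PhoneNumberFormat.E164)

    print({to_e164(s) for s in raw})  # all collapse to one value: {'+14155552671'}
    # Even the canonical form can be reassigned to a new person later,
    # so store it as an attribute keyed by a surrogate id, not as the primary key.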
ponector · 40m ago
Even better: phone numbers are redistributed after some time of inactivity.
azinman2 · 8h ago
OpenAI doesn’t (currently) sell ads. I really cannot see a world where they’re wanting to sell ads to their API users only? It’s not like you need a phone number to use ChatGPT.
To me the obvious explanation is fraud/abuse protection.
lcnPylGDnU4H9OF · 2h ago
> It’s not like you need a phone number to use ChatGPT.
I’m pretty sure you do. Claude too. The only chatbot company I’ve made an account with is Mistral specifically because a phone number was not a registration requirement.
ax0ar · 49m ago
They also require it now.
hshdhdhj4444 · 40m ago
They may not sell ads.
They may still buy data from ad companies and store credit cards, etc.
Many of them link users based on phone number.
azinman2 · 28m ago
But to do what with api users? Most api users won’t be individuals…
prmoustache · 1h ago
The fact they don't sell ads doesn't mean they are not in the business of selling users data to third parties.
Also, Netflix wasn't initially selling ads either, and yet, after drastically increasing the price of their plans over the last few years, the ad-supported subscription is probably their #1 plan, because most people aren't willing to shell out 15 to 25 USD/EUR every month to watch content that is already littered with ads.
fsmv · 49m ago
If you sell ads you're actually incentivised not to sell data because then your competitors would be able to reach your users without paying you
KomoD · 1h ago
> It’s not like you need a phone number to use ChatGPT.
When I signed up I had to do exactly that.
hnaccount_rng · 7h ago
They don’t need to. It’s totally sufficient that they can correlate your chat history with your identity. That makes other identifiers more valuable, if they can extract your interests
cmenge · 2h ago
The typical use case of an API is not that you personally use it. I have hundreds of clients all going through my API key, and in most cases they themselves are companies with n clients of their own.
brookst · 2h ago
It’s a good conspiracy theory, but of course it’s scoped to only ChatGPT users who are also developers and using specifically the o3 model via API. So if it is a conspiracy, it’s a fairly non-ambitious one.
charliebwrites · 13h ago
Doesn’t Sam Altman own a crypto currency company [1] that specifically collects biometric data to identify people?
> I've never heard of that's a 'partner' of the largest AI company and Worldcoin founder
93po · 1h ago
the core tech and premise doesn't collect biometric data, but biometric data is collected for training purposes with consent and compensation. There is endless misinformation (willfully and ignorantly spread) around Worldcoin, but it is not, at its core, a biometric collection company
malfist · 1h ago
Collecting biometrics for training purposes is still collecting biometrics.
ddtaylor · 9h ago
I also am using OpenRouter because OpenAI isn't a great fit for me. I also stopped using OpenAI because they expire your API credits even if you don't use them. Yeah, it's only $10, but I'm not spending another dime with them.
numlocked · 3m ago
Hi - I'm the COO of OpenRouter. In practice we don't expire the credits, but we have to reserve the right to, or else we have an uncapped liability literally forever. Can't operate that way :) Everyone who issues credits on a platform has to have some way of expiring them. It's not a profit center for us, or part of our P&L; just a protection we have to have.
johnnyyyy · 1h ago
then you shouldn’t use OpenRouter.
ToS: 4.2 Credit Expiration; Auto Recharge
OpenRouter reserves the right to expire unused credits three hundred sixty-five (365) days after purchase
bonki · 5h ago
I wonder if they do this everywhere; in certain jurisdictions this is illegal.
cedws · 7h ago
After how long do they expire?
zeograd · 7h ago
IIRC, 1 year
0xdeafbeef · 4h ago
Same for anthropic
cactusplant7374 · 3h ago
That is so sleazy.
predkambrij · 1h ago
Interesting, it works for me through OpenRouter, without a configured OpenAI integration. Although I have an OpenAI account and did verification with OpenAI before. A conspiracy theory would say that they are exchanging PII so OpenRouter knows who I am :)
exceptione · 11m ago
Welcome to tech dystopia.
Hello Human Resource, we have all your data, please upload your bio-metric identity, as well as your personal thoughts.
Building the next phase of a corporate totalitarian state, thank you for your cooperation.
Marsymars · 11h ago
Oh I also recently got locked out of my linkedin account until I supply data to Persona.
(So I’m remaining locked out of my linkedin account.)
baq · 5h ago
Meanwhile the FSB and Mossad happily generate fake identities on demand.
romanovcode · 5h ago
The whole point of identity verification is for the same Mossad to gather your complete profile and everything else they can from OpenAI.
Since Mossad and the CIA are essentially one organization, they already do it, 100%.
shmoogy · 1h ago
I was excited about trying o3 for my apps but I'm not doing this validation.. thanks for the heads up.
conradev · 13h ago
I was more excited by the process, like, there exists a model out there so powerful it requires KYC
which, after using it, fair! It found a zero day
__float · 12h ago
I think they're probably more concerned about fake accounts and people finding ways to get free stuff.
gscott · 6h ago
China is training their AI models using ChatGPT. They want to stop or slow that down.
olalonde · 2h ago
Why? It seems counterproductive given OpenAI's mission statement: "We are building safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome."
abeindoria · 12h ago
What free stuff? It requires a paid API.
DrammBA · 11h ago
With no intention to tarnish your pure world view, paid services with low registration requirements are ideal for account laundering and subscription fraud with stolen credit cards
conradev · 11h ago
I actually think they’re worried about foreign actors using it for…
- generating synthetic data to train their own models
- hacking and exploitation research
etc
hiatus · 12h ago
> which, after using it, fair! It found a zero day
Source?
sothatsit · 11h ago
Recently, Sean Heelan wrote a post "How I used o3 to find CVE-2025-37899, a remote zeroday vulnerability in the Linux kernel’s SMB implementation". It might be what they are referring to.
KYC has been around for a few months, I believe. When they released some of the additional thought logs, you had to be verified to see them.
bratao · 11h ago
You are even lucky to be able to verify. Mine has given me a "Session expired" error for months!! Support does not reply.
csomar · 9h ago
> How do I get my $20 back?
Contact support and ask for a refund. Then a charge back.
5Qn8mNbc2FNCiVV · 11h ago
This feels eerily similar to a post I've read within the last month. Either I'm having deja vu, it's a coincidence that the exact same story is mentioned, or there's something else going on.
bgwalter · 11h ago
What should be going on? A regular Google search for "openai persona verify organization" shows withpersona.com in the second search result.
To me it looks like an extremely aggressive data pump.
wqaatwt · 8h ago
There are stories about e.g. Hetzner requiring all sorts of data from people who want to open/verify accounts, so perhaps not. Might just be an anti “money laundering” thing, especially if the credit card company ends up refunding everything.
7bit · 6h ago
What stories? Can you back up that claim with some sources please?
zuppy · 2h ago
Infomaniak did request my personal id or passport for creating a company account.
I'm not going to provide any extra source for you, this is what happened to me, you can either believe it or not.
That's not a source and it is not my responsibility to back up the claims you made. That is yours. If you don't have any sources, and admit to just saying things that are not provable, I can also live with that.
fakedang · 10h ago
As someone not in the US, I do a straight nope out whenever I see a Persona request. I advise everyone else to do the same. Afaik, it's used by LinkedIn and Doordash too.
coderatlarge · 12h ago
this reminds me of how broadcom maintains the “free” tier of vmware.
teruakohatu · 8h ago
Can you explain? Is it not actually free?
belter · 2h ago
With all this, plus the saving of all chats, they can't operate in the EU. But they do ...
nateburke · 13h ago
I don't recall Persona being in the mix last year when I signed up. Interesting development.
gwhr · 8h ago
I think it was a recent thing [1], but I thought they were only considering it
Just send them a random passport photo from the Internet, what's the deal? Probably they are just vibe-verifying the photo with an "Is this a legit passport?" prompt anyways.
_joel · 5h ago
I think modern face verification has moved on, it's been video in all my encounters.
gloosx · 5h ago
still no real human is involved, as they mention their verification is automated and probabilistic, which is especially funny to hear in the context of verification. I'm pretty sure even a kid can get around it, e.g. by showing on the video a photo of a person holding his passport, which you can find online.
tethys · 4h ago
No. You have to turn your head, and stuff. Also, even if that worked, they allow only one verification per person per 90 days.
_joel · 2h ago
No, that's not how it works.
sneak · 5h ago
It requires video and an app. They are collecting facial biometrics.
gloosx · 5h ago
App? So you cannot verify without a mobile phone?
verisimi · 2h ago
> OK, so now, we've gone from "I guess I'll give OpenAI a few bucks for API access" to "I need to verify my organization" to "There's no way in hell I'm agreeing to provide biometric data to a 3rd party I've never heard of that's a 'partner' of the largest AI company and Worldcoin founder. How do I get my $20 back?"
This should be illegal. How many are going to do the same as you, but then think that the effort/time/hassle they would waste to try to get their money back would not be worth it? At which point you've effectively donated money to a corp that implements anti-consumer anti-patterns.
sneak · 5h ago
Yeah, same. I am a paying API customer but I am not doing biometric KYC to talk to a bot.
jiggawatts · 13h ago
This is in part "abuse prevention"[1] and in part marketing. Making customers feel like they're signing up to access state secrets makes the models seem more "special". Sama is well known to use these SV marketing tricks, like invite-only access, waiting lists, etc to psychologically manipulate users into thinking they're begging for entry to an exclusive club instead of just swiping a credit card to access an API.
Google tried this with Google Plus and Google Wave, failed spectacularly, and have ironically stopped with this idiotic "marketing by blocking potential users". I can access Gemini Pro 2.5 without providing a blood sample or signing parchment in triplicate.
[1] Not really though, because a significant percentage of OpenAI's revenue is from spammers and the bulk generation of SEO-optimised garbage. Those are valued customers!
miki123211 · 11h ago
Gemini doesn't give you reasoning via API though, at least as far as I'm aware.
jjani · 10h ago
If by reasoning you mean showing CoT, Gemini and OA are the same in this regard - neither provides it, not through the UI nor through the API. The "summaries" both provide have zero value and should be treated as non-existent.
Anthropic exposes reasoning, which has become a big reason to use them for reasoning tasks over the other two despite their pricing. Rather ironic when the other two have been pushing reasoning much harder.
finebalance · 5h ago
Google exposes their reasoning. You can use their new gemini python sdk to get thought traces.
jjani · 8m ago
Google does not expose their reasoning any more. They give "thought summaries" which provide effectively zero value. [1][2]
Maybe you’re thinking of deep research mode which is web UI only for now.
paulcole · 12h ago
HN Don’t Hate Marketing Challenge
Difficulty: Impossible
34679 · 13h ago
Follow-up. I pasted the above into ChatGPT because I genuinely wanted to know how to get my 20 bucks back. That request is in and they say I'll get a response in 2-3 days. I then asked it "What are the potential risks of agreeing to their terms?" Here's the response:
Agreeing to Persona’s terms, especially for biometric identity verification, involves both privacy and long-term data security risks. Here’s a clear breakdown of the main risks you should be aware of:
1. Biometric Data Collection
Risk: Biometric identifiers (like facial recognition, voiceprints, etc.) are extremely sensitive and irreplaceable if compromised.
What they collect: Persona may collect a selfie, video, and metadata, and extract biometric templates from those for facial comparison and liveness detection.
If leaked or abused: Unlike passwords, you can't change your face. A future data breach or misuse could lead to permanent identity compromise.
2. Data Storage & Retention
Risk: Persona says biometric data is kept for up to one year, but:
You’re relying on their internal policies, not a legal guarantee.
There’s no technical detail on how securely it’s stored or whether it’s encrypted at rest.
Worst-case scenario: Poorly secured biometric templates could be stolen, reused, or matched against other data sets by bad actors or governments.
3. Third-Party Sharing and Surveillance Risks
Risk: Your biometric and ID data may be shared with subprocessors (partners/vendors) that you haven’t explicitly vetted.
Persona may transfer your data to cloud providers (like AWS, GCP), verification specialists, or fraud prevention services.
Depending on jurisdiction, data could be subject to subpoenas, surveillance laws, or government backdoors (especially in the U.S.).
4. Consent Ambiguity & Future Use
Risk: The fine print often includes vague consent for "quality assurance", "model improvement", or "fraud detection".
This opens the door to retraining algorithms on your biometric data—even if anonymized, that's still a use of your body as data.
Their privacy policy may evolve, and new uses of your data could be added later unless you opt out (which may not always be possible).
Should You Agree?
Only if:
You absolutely need the service that requires this verification.
You’re aware of the privacy tradeoff and are okay with it.
You trust that Persona and its partners won’t misuse your biometric data—even a year down the line.
If you’re uneasy about this, you’re not alone. Many developers and privacy advocates refuse to verify with biometrics for non-critical services, and companies like OpenAI are increasingly facing criticism for requiring this.
stevage · 13h ago
Huh, I never considered using ChatGPT to read terms of service for me. Nice idea.
askl · 6h ago
Crazy, I already gave up registering for chatgpt because they asked for my phone number. I'm not giving that to any random startup.
sschueller · 18h ago
Has anyone noticed that OpenAI has become "lazy"? When I ask questions now it will not give me a complete file or fix. Instead it tells me what I should do and I need to ask a second or third time to just do the thing I asked.
I don't see this happening with, for example, DeepSeek.
Is it possible they are saving on resources by having it answer that way?
tedsanders · 17h ago
Yeah, our models are sometimes too lazy. It’s not intentional, and future models will be less lazy.
When I worked at Netflix I sometimes heard the same speculation about intentionally bad recommendations, which people theorized would lower streaming and increase profit margins. It made even less sense there as streaming costs are usually less than a penny. In reality, it’s just hard to make perfect products!
(I work at OpenAI.)
ukblewis · 16h ago
Please be careful about the alternative. I’ve seen o3 doing excessive tool calls and research for relatively simple problems.
jillesvangurp · 7h ago
Yep, it defaults to doing a web search even when that doesn't make sense.
Example, I asked it to write something. And then I asked it to give me that blob of text in markdown format. So everything it needed was already in the conversation. That took a whole minute of doing web searches and what not.
I actually dislike using o3 for this reason. I keep the default to 4o. But sometimes I forget to switch back and it goes off boiling the oceans to answer a simple question. It's a bit too trigger happy with that. In general all this version and model soup is impossible to figure out for non technical users. And I noticed 4o is now sometimes starting to do the same. I guess, too many users never use the model drop down.
Hard_Space · 8h ago
After the last few weeks, where o3 seems desperate to do tool searches or re-crunch a bad gen even though I only asked a question about it, I assumed that the policy is to burn through credits at the fastest possible rate. With this price change, I don't know what's happening now...
anshumankmr · 11h ago
That was a problem in GPT 4 Turbo as well...
jazzyjackson · 11h ago
IMO it's just that the models are very nondeterministic, and people get very different kinds of responses from them. I met a number of people who tried it when it first came out and it was just useless so they stopped trying it; other people (including me) got gobsmacking great responses and it felt like AGI was around the corner, but after enough coin flips your luck runs out and you get some lazy responses. Some people have more luck than others and wonder why everyone around them says it's trash.
I am not saying they haven't improved the laziness problem, but it does happen anecdotally. I even got similar sorts of "lazy" responses for something I am building with gemini-2.5-flash.
0x1ceb00da · 10h ago
I think it's good. The model will probably make some mistake at first. Not doing the whole thing and just telling the user the direction it's going in gives us a chance to correct its mistakes.
thimabi · 13h ago
Can you share what the main challenges are that OpenAI has been facing in terms of increasing access to top-tier, non-lazy models?
TZubiri · 14h ago
but maybe you are saying that because you are a CIA plant that's trying to make the product bad because of complex reasons.
takes tinfoil hat off
Oh, nvm, that makes sense.
TillE · 14h ago
Had a fun experience the other day asking "make a graph of [X] vs [Y]" (some chemistry calculations), and the response was blah blah blah explain explain "let me know if you want a graph of this!" Yeah ok thanks for offering.
csomar · 9h ago
I don't think that's laziness but maybe agent tuning.
mythz · 6h ago
I've been turned off by OpenAI and have been actively avoiding using any of their models for a while; luckily this is easy to do given the quality of Sonnet 4 / Gemini Pro 2.5.
Although I've always wondered how OpenAI could get away with o3's astronomical pricing, what does o3 do better than any other model to justify their premium cost?
jstummbillig · 3h ago
It's just a highly unoptimized space. There is very little market consolidation at this point, everyone is trying things out that lead to wildly different outcomes and processes and costs, even though in the end it's always just a bunch of utf-8 characters. o3 was probably just super expensive to run, and now, apparently, it's not anymore and can beat sonnet/opus 4 on pricing. It's fairly wild.
behnamoh · 19h ago
how do we know it's not a quantized version of o3? what's stopping these firms from announcing the full model to perform well on the benchmarks and then gradually quantizing it (first at Q8 so no one notices, then Q6, then Q4, ...).
I have a suspicion that's how they were able to get gpt-4-turbo so fast. In practice, I found it inferior to the original GPT-4 but the company probably benchmaxxed the hell out of the turbo and 4o versions so even though they were worse models, users found them more pleasing.
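For anyone who hasn't looked at what Q8/Q6/Q4 actually means: it's storing the same weights at lower precision plus a scale factor, trading a bit of fidelity for a lot of memory and bandwidth. A toy sketch of symmetric int8 ("Q8"-style) post-training quantization, purely to illustrate the tradeoff, not how any lab serves its models:

    import numpy as np

    def quantize_int8(w: np.ndarray):
        """Symmetric per-tensor quantization: int8 weights plus one fp32 scale."""
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(1024, 1024).astype(np.float32)  # stand-in weight matrix
    q, scale = quantize_int8(w)
    err = np.abs(w - dequantize(q, scale)).mean()
    print(f"int8 uses {q.nbytes / w.nbytes:.0%} of fp32 storage, mean abs error {err:.5f}")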
CSMastermind · 19h ago
This is almost certainly what they're doing and rebranding the original o3 model as "o3-pro"
tedsanders · 17h ago
Nope, not what we’re doing.
o3 is still o3 (no nerfing) and o3-pro is new and better than o3.
If we were lying about this, it would be really easy to catch us - just run evals.
(I work at OpenAI.)
fastball · 7h ago
Anecdotal, but about a week ago I noticed a sharp drop in o3 performance. For many tasks I will compare Gemini 2.5 Pro with o3, running the same prompt in both. Generally for my personal use o3 and G2.5P have been neck-and-neck over the last months, with responses I have been very happy with.
However starting from a week ago, the o3 responses became noticeably worse, with G2.5P staying about the same (in terms of what I've come to expect from the two models).
This alongside the news that you guys have decreased the price of o3 by 80% does really make it feel like you've quantized the model or knee-capped thinking or something. If you say it is wholly unchanged I'll believe you, but not sure how else to explain the (admittedly subjective) performance drop I've experienced.
IanCal · 3h ago
Are you sure you're using the same models? G2.5P updated almost exactly a week ago.
fny · 2h ago
Unrelated: Can you all come up with a better naming scheme for your models? I feel like this is a huge UX miss.
o4-mini-high, o4-mini, o3, o3-pro, gpt-4o
Oy.
MattDaEskimo · 15h ago
What's with the dropped benchmark performance compared to the original o3 release? It was disappointing to not see o4-mini on it as well
refulgentis · 14h ago
What dropped benchmark performance?
energy123 · 10h ago
Is it o3 (low), o3 (medium) or o3 (high)? Different model names have crept into the various benchmarks over the last few months.
tedsanders · 9h ago
o3 is a model, and reasoning effort (high/medium/low) is a parameter that goes into the model.
o3 pro is a different thing - it's not just o3 with maximum reasoning effort.
tauntz · 4h ago
Why's it called o3 then if it's a different thing? There's already a rather extreme amount of confusion with the model names and it's not clear _at all_ which model would be "the best" in terms of response quality.
Here's the current state with version numbers as far as I can piece it together (using my best guess at naming of each component of the version identifier. Might be totally wrong tho):
6) date (optional): 2025-04-14, 2024-05-13, 1106, 0613, 0125, etc (I assume the last ones are a date without a year for 2024?)
7) size (optional): "16k"
Some final combinations of these version number components are as small as 1 ("o3") or as large as 6 ("gpt-4o-mini-search-preview-2024-12-17").
Given this mess, I can't blame people assuming that the "best" model is the one with the "biggest" number, which would rank the model families as: 4.5 (best) > 4.1 > 4 > 4o > o4 > 3.5 > o3 > o1 (worst).
energy123 · 36m ago
My guess is this comes from an org structure where you have multiple "pods" working on different research. Who comes up with the next shippable model and when that happens is kind of random and the chaotic naming system comes from that. It's just my speculation and could be wildly wrong.
rat9988 · 1h ago
That o3 and o3-pro aren't the same thing still makes sense though.
fragmede · 5h ago
Could someone there maybe possibly use, oh I dunno, ChatGPT and come up with some better product names?
bn-l · 16h ago
Not quantized?
tedsanders · 15h ago
Not quantized. Weights are the same.
If we did change the model, we'd release it as a new model with a new name in the API (e.g., o3-turbo-2025-06-10). It would be very annoying to API customers if we ever silently changed models, so we never do this [1].
[1] `chatgpt-4o-latest` being an explicit exception
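In API terms the difference looks like this (a minimal sketch using the OpenAI Python SDK; "o3-2025-04-16" is the dated snapshot mentioned downthread):

    from openai import OpenAI

    client = OpenAI()

    # "o3" is an alias that can be repointed when a new snapshot ships;
    # "o3-2025-04-16" is a dated snapshot and is meant to stay fixed.
    resp = client.chat.completions.create(
        model="o3-2025-04-16",
        messages=[{"role": "user", "content": "Say hello in five words."}],
    )
    print(resp.model)  # the response reports which model actually served the call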
linsomniac · 13h ago
>we'd release it as a new model with a new name
Speaking of a new name. I'll donate the API credits to run a "choose a naming scheme for AI models that isn't confusing AF" contest for OpenAI.
thegeomaster · 14h ago
Google could at least learn something from this attitude, given their recent 03-25 -> 05-06 model alias switcharoo with 0 notice :)
johnb231 · 12h ago
That is a preview / beta model with no expectation of stability. Google did nothing wrong there. No one should be using a preview model in production.
thegeomaster · 12h ago
Hard disagree. Of course technically they didn't do anything explicitly against the public guidance (the checks and balances would never let them), but naming a model with a date very strongly implies immutability.
It's the same logic of why UB in C/C++ isn't a license to do whatever the compiler wants. We're humans and we operate on implications, common-sense assumptions and trust.
johnb231 · 10h ago
The model is labelled as Preview. There are no guarantees of stability or availability for Preview models. Not intended for production workloads.
"At Preview, products or features are ready for testing by customers. Preview offerings are often publicly announced, but are not necessarily feature-complete, and no SLAs or technical support commitments are provided for these. Unless stated otherwise by Google, Preview offerings are intended for use in test environments only. The average Preview stage lasts about six months."
0xbadcafebee · 10h ago
There's a very large gulf between "what makes sense to Google" and "what makes sense to Human Beings". I have so many rants about Google's poor treatment of "customers" that they feel like Oracle to me now. Like every time I use them, I'm really just falling prey to my own misguided idea that this time I won't get screwed over.
johnb231 · 9h ago
The users aren't random "human beings" in this case. They are professional software developers who are expected to understand the basics. Deploying that model into production shows a lack of basic competence. It is clearly marked "preview" and is for test only.
0xbadcafebee · 1h ago
That may be true, but it doesn't make the customer's claims not true. What Google did was counter-intuitive. That's a fact. Pointing at some fine print and saying "uhh actually, technically it's your stupid human brain is the problem, not us! we technically are allowed to do anything we want, just look at the fine print!!" does not make things better. We are human beings; we are flawed. That much should be obvious to any human organization. If you don't know how to make things that don't piss off human beings, the problem isn't with the humans.
If the "preview release" you were using was v0.3, and suddenly it started being v0.6 without warning, that would be insane. The only point of providing a version number is to give people an indicator of consistency. The datestamp is a version number. If they didn't want us to expect consistency, they should not have given it a version number. That's the whole point of rolling release branches, they have no version. You don't have "v2.0" of a rolling release, you just have "latest". They fucked up by giving it a datestamp.
This is an extremely old and well-known problem with software interfaces. Either you version it or you don't. If you do version it, and change it, you change the version, and give people dependent on the old version some time to upgrade. Otherwise it breaks things, and that pisses people off. The alternative is not versioning it, which is a signal that there is no consistency to be expected. Any decent software developer should have known all this.
And while I'm at it: what's with the name flip-flopping? In 2014, GCP issued a PR release explaining it was no longer using "Preview", but "Alpha" and "Beta" (https://cloudplatform.googleblog.com/2014/10/new-release-pha...). But the link you showed earlier says "Alpha" and "Beta" are now deprecated. But no PR release? I guess that's our bad for not constantly reading the fine print and expecting it to revert back to something from 11 years ago.
refulgentis · 10h ago
There hasn't been a non-preview Gemini since...November? The previews are the same as everyone else's release cadence, "preview" is just a magic wand that means the Launchcal (google's internal signoff tool, i.e. "wave will never happen again") needs fewer signoffs. Then it got to the point that date-pinned models were getting swapped in, in the name of doing us a favor, and it's a...novel idea, we can both agree at the least.
I bet someone at Google would be a bit surprised to see someone jumping to legalese to act like this...novelty...is inherently due to the preview status, and based on anything more than a sense that there's no net harm done to us if it costs the same and is better.
I'm not sure they're wrong.
But it also leads to a sort of "nobody knows how anything works because we have 2^N configs and 5 bits" - for instance, 05-06 was also upgraded to 06-05. Except it wasn't, if you sent variable thinking to 05-06 after upgrade it'd fail. (and don't get me started on the 5 different thinking configurations for Gemini 2.5 flash thinking vs. gemini 05-06 vs. 06-05 and 0 thinking)
johnb231 · 10h ago
I honestly have no idea what you are trying to say.
It's a preview model - for testing only, not for production. Really not that complicated.
refulgentis · 9h ago
So you don't have anything to contribute beyond, and aren't interested in anything beyond, citing of terms?
Why are you in the comments section of a engineering news site?
(note: beyond your, excuse me while I'm direct now, boorish know-nothing reply, the terms you are citing have nothing to do with the thing people are actually discussing around you, despite your best efforts. It doesn't say "we might swap in a new service, congrats!", nor does it have anything to say about that. Your legalese at most describes why they'd pull 05-06, not forward 05-06 to 06-05. This is a novel idea.)
johnb231 · 9h ago
This case was simply a matter of people not understanding the terms of service. There is nothing more to be said. It's that simple. The "engineers" should know that before deploying to prod. Basic competence.
And I mean I genuinely do not understand what you are trying to say. Couldn't parse it.
lcnPylGDnU4H9OF · 1h ago
> And I mean I genuinely do not understand what you are trying to say. Couldn't parse it.
It’s always worth considering that this may be your problem. If you still don’t get it, the only valuable reply is one which asks a question. Also, including “it’s not that complicated” only serves to inflame.
refulgentis · 9h ago
John, do you understand that the thing you're quoting says "We reserve the right to pull things", not "We reserve the right to swap in a new service"?
Do you understand that even if it did say that, that wasn't true either? It was some weird undocumentable half-beast?
I have exactly your attitude about their cavalier use of preview for all things Gemini, and even people's use of the preview models.
But I've also been on this site for 15 years and am a bit wow'd by your interlocution style here -- it's quite rare to see someone flip "the 3P provider swapped the service on us!" into "well they said they could turn it off, of course you should expect it to be swapped for the first time ever!" insert dull sneer about the quality of other engineers
johnb231 · 9h ago
How is this so hard to understand? It's a preview service for testing only, not intended for production.
I am done with this thread. We are going around in circles.
refulgentis · 9h ago
Well, no. Well, sure. You're done, but we're not going in circles. It'd just do too much damage to you to have to answer the simple question "Where does the legalese say they can swap in a new service?", so you have to pretend this is circular and just all-so-confusing, de facto, we have to pretend it is confusing and/or obviously wrong to use any Gemini 2+ at all.
It's a cute argument, as I noted, I'm emotionally sympathetic to it even, it's my favorite "get off my lawn." However, I've also been on the Internet long enough to know you write back, at length, when people try anti-intellectualism and why-are-we-even-talking-about-this as interaction.
"b. Disclaimer. PRE-GA OFFERINGS ARE PROVIDED “AS IS” WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES OR REPRESENTATIONS OF ANY KIND. Pre-GA Offerings (i) may be changed, suspended or discontinued at any time without prior notice to Customer and (ii) are not covered by any SLA or Google indemnity. Except as otherwise expressly indicated in a written notice or Google documentation, (A) Pre-GA Offerings are not covered by TSS, and (B) the Data Location Section above will not apply to Pre-GA Offerings."
ant6n · 14h ago
It was definitely annoying when o1 disappeared overnight; my impression is that it was better at some tasks than o3.
csomar · 9h ago
I think the parent-parent poster has explained why we can't trust you (and working at OpenAI doesn't help the way you think it does).
I didn't read the ToS, like everyone else, but my guess is that degrading model performance at peak times will be one of the things that can slip through. We are not suggesting you are running a different model but that you are quantizing it so that you can support more people.
This can't happen with Open weight models where you put the model, allocate the memory and run the thing. With OpenAI/Claude, we don't know the model running, how large it is, what it is running on, etc... None of that is provided and there is only one reason that I can think of: to be able to reduce resources unnoticed.
Reubachi · 16m ago
I'm not totally sure how you at this point in your online presence associate someone stating their job as a "brag" and not what it really is, providing transparency/disclosure before stating their thoughts.
This is HN and not reddit.
"I didn't read the ToS, like everyone else, but my guess..."
Ah, there it is.
rfoo · 4h ago
An (arbitrarily) quantized model is a totally different model, compared to the original.
mliker · 13h ago
Where are you getting this information? What basis do you have for making this claim? OpenAI, despite its public drama, is still a massive brand and if this were exposed, would tank the company's reputation. I think making baseless claims like this is dangerous for HN
beering · 13h ago
I think Gell-Mann amnesia happens here too, where you can see how wrong HN comments are on a topic you know deeply, but then forget about that when reading the comments on another topic.
behnamoh · 19h ago
> rebranding the original o3 model as "o3-pro"
interesting take, I wouldn't be surprised if they did that.
anticensor · 19h ago
-pro models appear to be a best-of-10 sampling of the original full size model
Szpadel · 19h ago
how do you sample it behind the scenes?
usually best of X means you generate X outputs and choose the best result.
if you could do this automatically, it would be a game changer as you could run the top 5 best models in parallel and select the best answer every time
but it's not practical because you are the bottleneck as you have to read all 5 solutions and compare them
anticensor · 16h ago
> if you could do this automatically, it would be game changer as you could run top 5 best models in parallel and select best answer every time
remember they have access to the RLHF reward model, against which they can evaluate all N outputs and have the most "rewarded" answer picked and sent
spott · 12h ago
I believe it is a majority vote kinda thing, rather than a best single result.
joshstrange · 17h ago
I think the idea is they use another/same model to judge all the results and only return the best one to the user.
anticensor · 3h ago
I think the idea is they just feed each to the RLHF reward model used to train the model and return the most rewarded answer.
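Mechanically, the guess in this subthread amounts to something like the sketch below. Nobody outside OpenAI knows whether o3-pro really works this way, and score_with_reward_model is a made-up stand-in for an internal reward model we don't have access to (an outsider could swap in an LLM-as-judge call or a majority vote instead):

    from openai import OpenAI

    client = OpenAI()

    def score_with_reward_model(question: str, answer: str) -> float:
        # Hypothetical stand-in: the provider would score with its RLHF reward model.
        return float(len(set(answer.split())))  # toy proxy, not a real reward model

    def best_of_n(question: str, n: int = 10, model: str = "o3") -> str:
        candidates = []
        for _ in range(n):  # n independent samples from the same base model
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": question}],
            )
            candidates.append(resp.choices[0].message.content)
        # Keep whichever candidate the scorer ranks highest.
        return max(candidates, key=lambda a: score_with_reward_model(question, a))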
lispisok · 19h ago
I swear every time a new model is released it's great at first but then performance gets worse over time. I figured they were fine-tuning it to get rid of bad output which also nerfed the really good output. Now I'm wondering if they were quantizing it.
Tiberium · 19h ago
I've heard lots of people say that, but no objective reproducible benchmarks confirm such a thing happening often. Could this simply be a case of novelty/excitement for a new model fading away as you learn more about its shortcomings?
Kranar · 18h ago
I used to think the models got worse over time as well but then I checked my chat history and what I noticed isn't that ChatGPT gets worse, it's that my standards and expectations increase over time.
When a new model comes out I test the waters a bit with some more ambitious queries and get impressed when it can handle them reasonably well. Over time I take it for granted and then just expect it to be able to handle ever more complex queries and get disappointed when I hit a new limit.
echelon · 18h ago
Re-run your historical queries, or queries that are similarly shaped.
Anecdotally, it's quite clear that some models are throttled during the day (eg Claude sometimes falls back to "concise mode" - with and without a warning on the app).
You can tell if you're using Windsurf/Cursor too - there are times of the day where the models constantly fail to do tool calling, and other times they "just work" (for the same query).
Your linked article is specifically comparing two different versioned snapshots of a model and not comparing the same model across time.
You've also made the mistake of conflating what's served via API platforms which are meant to be stable, and frontends which have no stability guarantees, and are very much iterated on in terms of the underlying model and system prompts. The GPT-4o sycophancy debacle was only on the specific model that's served via the ChatGPT frontend and never impacted the stable snapshots on the API.
I have never seen any sort of compelling evidence that any of the large labs tinkers with their stable, versioned model releases that are served via their API platforms.
herval · 18h ago
Please read it again. The article is clearly comparing gpt4 to gpt4, and gpt3.5 to gpt3.5, in march vs june 2023
Deathmax · 18h ago
I did read it, and I even went to their eval repo.
> At the time of writing, there are two major versions available for GPT-4 and GPT-3.5 through OpenAI’s API, one snapshotted in March 2023 and another in June 2023.
openaichat/gpt-3.5-turbo-0301 vs openaichat/gpt-3.5-turbo-0613, openaichat/gpt-4-0314 vs openaichat/gpt-4-0613. Two _distinct_ versions of the model, and not the _same_ model over time like how people like to complain that a model gets "nerfed" over time.
drewnick · 18h ago
I feel this too. I swear some of the coding Claude Code does on weekends is superior to the weekdays. It just has these eureka moments every now and then.
herval · 18h ago
Claude has been particularly bad since they released 4.0. The push to remove 3.7 from Windsurf hasn’t helped either. Pretty evident they’re trying to force people to pay for Claude Code…
Trusting these LLM providers today is as risky as trusting Facebook as a platform, when they were pushing their “opensocial” stuff
glitch253 · 17h ago
Cursor / Windsurf's degraded functionality is exactly why I created my own system:
I assumed it was because the first week revealed a ton of safety issues that they then "patched" by adjusting the system prompt, and thus using up more inference tokens on things other than the user's request.
bobxmax · 18h ago
My suspicion is it's the personalization. Most people have things like 'memory' on, and as the models increasingly personalize towards you, that personalization is hurting quality rather than helping it.
Which is why the base model wouldn't necessarily show differences when you benchmarked them.
tshaddox · 12h ago
Yeah, it’s almost certainly hallucination (by the human user).
85392_school · 19h ago
I think it's an illusion. People have been claiming it since the GPT-4 days, but nobody's ever posted any good evidence to the "model-changes" channel in Anthropic's Discord. It's probably just nostalgia.
colordrops · 14h ago
It's probably less often quantizing and more often adding more and more to their hidden system prompt to address various issues and "issues", and as we all know, adding more context sometimes has a negative effect.
JoshuaDavid · 13h ago
I suspect what's happening is that lots of people have a collection of questions / private evals that they've been testing on every new model, and when a new model comes out it sometimes can answer a question that previous models couldn't. So that selects for questions where the new model is at the edge of its capabilities and probably got lucky. But when you come up with a new question, it's generally going to be on the level of the questions the new model is newly able to solve.
Like I suspect if there was a "new" model which was best-of-256 sampling of gpt-3.5-turbo that too would seem like a really exciting model for the first little bit after it came out, because it could probably solve a lot of problems current top models struggle with (which people would notice immediately) while failing to do lots of things that are a breeze for top models (which would take people a little bit to notice).
JamesBarney · 19h ago
I'm pretty sure this is just a psychological phenomenon. When a new model is released, all the capabilities the new model has that the old model lacks are very salient. This makes it seem amazing. Then you get used to the model, push it to the frontier, and suddenly the most salient memories of the new model are its failures.
There are tons of benchmarks that don't show any regressions. Even small and unpublished ones rarely show regressions.
nabla9 · 19h ago
It seems that at least Google is overselling their compute capacity.
You pay a monthly fee, but Gemini is completely jammed for 5-6 hours when North America is working.
baq · 19h ago
Gemini is simply that good. I’m trying out Claude 4 every now and then and go back to Gemini to fix its mess…
energy123 · 18h ago
Gemini is the best model in the world. Gemini is the worst web app in the world. Somehow those two things are coexisting. The web devs in their UI team have really betrayed the hard work of their ML and hardware colleagues. I don't say this lightly - I say this after having paid attention to critical bugs, more than I can count on one hand, that persisted for over a year. They either don't care or are grossly incompetent.
Google is best in pure AI research, both quality and volume. They have sucked at productization for years. Not just AI but other products as well. Real mystery.
energy123 · 17h ago
I don't understand why they can't just make it fast and go through the bug reports from a year ago and fix them. Is it that hard to build a box for users to type text into without it lagging for 5 seconds or throwing a bunch of errors?
fasterthanlime · 19h ago
Funny, I have the exact opposite experience! I use Claude to fix Gemini’s mess.
symfoniq · 19h ago
Maybe LLMs just make messes.
hgomersall · 19h ago
I heard that, but I'm getting consistent garbage from Gemini.
dayjah · 18h ago
For code? Use the context7 mcp.
edzitron · 17h ago
When you say "jammed," how do you mean?
mhitza · 19h ago
That was my suspicion when I first deleted my account, when it felt like the ChatGPT output got worse and I found it highly suspicious to see an errant davinci model keyword in the chatgpt url.
Now I'm feeling similarly with their image generation (which is the only reason I created a paid account two months ago, and the output looks more generic by default).
beering · 13h ago
Are you able to quantify how quickly your perception gets skewed by how long you use the models?
beering · 13h ago
It’s easy to measure the models getting worse, so you should be suspicious that nobody who claims this has scientific evidence to back it up.
solfox · 19h ago
I have seen this behavior as well.
tedsanders · 15h ago
It's the same model, no quantization, no gimmicks.
In the API, we never make silent changes to models, as that would be super annoying to API developers [1]. In ChatGPT, it's a little less clear when we update models because we don't want to bombard regular users with version numbers in the UI, but it's still not totally silent/opaque - we document all model updates in the ChatGPT release notes [2].
[1] chatgpt-4o-latest is an exception; we explicitly update this model pointer without warning.
Is this what happened to Gemini 2.5 Pro? It used to be very good, but it's started struggling on basic tasks.
The thing that gets me is it seems to be lying about fetching a web page. It will say things are there that were never on any version of the page and it sometimes takes multiple screenshots of the page to convince it that it's wrong.
SparkyMcUnicorn · 18h ago
The Aider discord community has proposed and disproven the theory that 2.5 Pro became worse, several times, through many benchmark runs.
It had a few bugs here or there when they pushed updates, but it didn't get worse.
ants_everywhere · 17h ago
Gemini is objectively exhibiting new behavior with the same prompts and that behavior is unwelcome. It includes hallucinating information and refusing to believe it's wrong.
My question is not whether this is true (it is) but why it's happening.
I am willing to believe the aider community has found that Gemini has maintained approximately equivalent performance on fixed benchmarks. That's reasonable considering they probably use a/b testing on benchmarks to tell them whether training or architectural changes need to be reverted.
But all versions of aider I've tested, including the most recent one, don't handle Gemini correctly so I'm skeptical that they're the state of the art with respect to bench-marking Gemini.
SparkyMcUnicorn · 15h ago
Gemini 2.5 Pro is the highest ranking model on the aider benchmarks leaderboard.
For benchmarks, either Gemini writes code that adheres to the required edit format, builds successfully, and passes unit tests, or it doesn't.
I primarily use aider + 2.5 pro for planning/spec files, and occasionally have it do file edits directly. Works great, other than stopping it mid-execution once in a while.
code_biologist · 5h ago
My use case is mostly creative writing.
IMO 2.5 Pro 03-25 was insanely good. I suspect it was also very expensive to run. The 05-06 release was a huge regression in quality, most people saying it was a better coder and a worse writer. They tested a few different variants and some were less bad than others, but overall it was painful to lose access to such a good model. The just-released 06-05 version seems to be uniformly better than 05-06, with far fewer "wow this thing is dumb as a rock" failure modes, but it still is not as strong as the 03-25 release.
Entirely anecdotally, 06-05 seems to exactly ride the line of "good enough to be the best, but no better than that" presumably to save costs versus the OG 03-25.
In addition, Google is doing something notably different between what you get on AI Studio versus the Gemini site/app. Maybe a different system prompt. There have been a lot of anecdotal comparisons on /r/bard and I do think the AI Studio version is better.
rfoo · 4h ago
I don't work for OAI so obviously I can't say for them. But we don't do this.
We don't make the hobbyist mistake of randomly YOLO-trying various "quantization" methods after all training is done and calling it a day. Quantization was done before it went live.
esafak · 19h ago
Are there any benchmarks that track historical performance?
behnamoh · 19h ago
good question, and I don't know of any, although it's a no brainer that someone should make it.
a proxy to that may be the anecdotal evidence of users who report back in a month that model X has gotten dumber (started with gpt-4 and keeps happening, esp. with Anthro and OpenAI models). I haven't heard such anecdotal stories about Gemini, R1, etc.
SparkyMcUnicorn · 18h ago
Aider has one, but it hasn't been updated in months. People kept claiming models were getting worse, but the results proved that they weren't.
That Deepseek price is always hilarious to see in these charts.
SparkyMcUnicorn · 15h ago
That's not the one I'm referring to. See my other comments or your sibling comment.
benterix · 19h ago
> users found them more pleasing.
Some users. For me the drop was so huge it became almost unusable for the things I had used it for.
behnamoh · 19h ago
Same here. One of my apps straight out stopped working because the gpt-4o outputs were noticeably worse than the gpt-4 that I built the app based on.
risho · 19h ago
Quantization is a massive efficiency gain for near negligible drop in quality. If the tradeoff is quantization for an 80 percent price drop I would take that any day of the week.
You may be right that the tradeoff is worth it, but it should be advertised as such. You shouldn't think you're paying for full o3, even if they're heavily discounting it.
code_biologist · 6h ago
I would like the option to pay for the unquantized version. For creative or story writing (D&D campaign materials and such) quantization seems to result in much weaker word selection and phrasing. There are small semantic missteps that break the illusion the LLM understands what it's writing. I find it jarring and deeply immersion-breaking. I'd prefer to prototype prompts on a cheaper quantized version, but I want to be able to spend 50 cents an API call to get golden output.
Bjorkbat · 17h ago
Related, when o3 finally came out ARC-AGI updated their graph because it didn’t perform nearly as well as the version of o3 that “beat” the benchmark.
The o3-preview test was run with a very expensive amount of compute, right? I remember it was north of $10k, so it makes sense it did better.
Bjorkbat · 1h ago
Point remains though, they crushed the benchmark using a specialized model that you’ll probably never have access to, whether personally or through a company.
They inflated expectations and then released to the public a model that underperforms
throwaway314155 · 22m ago
They revealed the price points for running those evaluations. IIRC the "high" level of reasoning cost tens of thousands of dollars if not more. I don't think they really inflated expectations. In fact a lot of what we learned is that ARC-AGI probably isn't a very good AGI evaluation (it claims to not be one, but the name suggests otherwise).
az226 · 13h ago
Even classic GPT-4 from March 2023 was quantized to 4.5 bits.
smusamashah · 16h ago
How about testing the same input vs. output with the same seed on different dates? If it's a different model it will return a different output.
zomnoys · 16h ago
Isn’t this not true since these models run with a non-zero temperature?
smusamashah · 15h ago
You can set the temperature too.
luke-stanley · 15h ago
I think the API has some special IDs to check for reproducibility of the environment.
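Roughly, the best you can do from the outside is something like this sketch: seed is best-effort only, system_fingerprint changes whenever OpenAI alters the serving stack, and reasoning models like o3 reject some sampling parameters, so this uses a dated non-reasoning snapshot:

    from openai import OpenAI

    client = OpenAI()

    def probe(prompt: str, model: str = "gpt-4o-2024-08-06"):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            seed=1234,       # best-effort reproducibility, not a hard guarantee
            temperature=0,
        )
        return resp.system_fingerprint, resp.choices[0].message.content

    # Run today, save the output, rerun on a later date and diff the two.
    fingerprint, answer = probe("List the first five prime numbers.")
    print(fingerprint)
    print(answer)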
resters · 19h ago
It's probably optimized in some way, but if the optimizations degrade performance, let's hope it is reflected in various benchmarks. One alternative hypothesis is that it's the same model, but in the early days they make it think "harder" and run a meta-process to collect training data for reinforcement learning for use on future models.
The main leaderboard page that you linked to is updated quite frequently, but it doesn't contain multiple benchmarks for the same exact model.
EnPissant · 18h ago
The API lists o3 and o3-2025-04-16 as the same thing with the same price. The date based models are set in stone.
jstummbillig · 19h ago
You can just give it a go for very little money (in Windsurf it's 1x right now), and see what it does. There is no room for conspiracy here, because you can simply look at what it does. If you don't like it, neither will others, and then people will not use it. People are obviously very capable of (collectively) forming opinions on models, and then voting with their wallet.
segmondy · 18h ago
you don't, so run your own model.
lvl155 · 19h ago
Google has been catching up. Funny how fast this space is evolving. Just a few months ago, it was all about DeepSeek.
ookdatnog · 6h ago
If the competition boils down to who has access to the largest amount of high quality data, it's hard to see how anyone but Google could win in the end: through Google Books they have scans of tens of millions of books, and published books are the highest quality texts there are.
itake · 1h ago
I've been learning vietnamese. Unfortunately, a lot of social media (reddit, fb, etc) has a new generation of language. The younger generation uses so much abbreviations and acronyms, ChatGPT and Google Translate can't keep up.
I think if your goal is to have properly written language using older writing styles, then you're correct.
bitpush · 19h ago
Many would say Google's Gemini models are SOTA, although Claude seems to be doing well with coding tasks.
snarf21 · 19h ago
Gemini has been better than Claude for me on a coding project. Claude kept telling me it updated some code, but the update wasn't in the output. Like, I had to re-prompt just to get updated output 5 times in a row.
jacob019 · 18h ago
I break out Gemini 2.5 Pro when Claude gets stuck; it's just so slow and verbose. Claude follows instructions better and seems to better understand its role in agentic workflows. Gemini does something different with the context: it has a deeper understanding of the control flow and can uncover edge-case bugs that Claude misses. o3 seems better at high-level thinking and planning, questioning whether it should be done at all and whether the challenge actually matches the need. They're kind of like colleagues with unique strengths. o3 does well with a lot of things, I just haven't used it as much because of the cost. Will probably use it more now.
johan914 · 18h ago
I have been using Google's models the past couple of months, and was surprised to see how sycophantic ChatGPT is now. It's not just at the start or end of responses; it's interspersed within the markdown, with little substance. Asking it to change its style makes it overuse technical terms.
resource_waste · 13h ago
Deepseek was exciting because you could download their model. They are seemingly 3rd place and have been since Gemini 2.5.
Squarex · 4h ago
I would put them on the fourth after Google, OpenAI and Anthropic. Still the best open weight llm.
BeetleB · 17h ago
Why does OpenAI require me to verify my "organization" (which requires my state issued ID) to use o3?
valleyer · 15h ago
Don't bother anyway. There are lots of cases of people trying and failing to go through the process, and there is no way to try a second time.
If only there were people with multiple passports or, I don’t know, Kyrgyzstan.
How exactly will a passport check prevent any training?
At most this will block API access for your average Ivan, not a state actor.
yyhhsj0521 · 12m ago
It's most likely for regulation compliance, instead of a sincere attempt to block anyone from training on them.
ivanmontillam · 12h ago
I'm an average Ivan, and I got access.
BeetleB · 15h ago
Yeah, I just don't see myself using o3 when I have Gemini-2.5 Pro. I don't recall if Google Cloud verified my ID in the past, though. Still, no need to let yet another organization have my data if I'm not getting something better in return.
jjani · 10h ago
> I don't recall if Google Cloud verified my ID in the past, though
It generally does not. No idea if there are edge cases where it does, but that's definitely not the norm for the average user.
lxgr · 19h ago
Is there also a corresponding increase in weekly messages for ChatGPT Plus users with o3?
In my experience, o4-mini and o4-mini-high are far behind o3 in utility, but since I’m rate-limited for the latter, I end up primarily using the former, which has kind of reinforced the perception that OpenAI’s thinking models are behind the competition altogether.
My usage has also reflected the pretty heavy rate limits on o3. I find o4-mini-high to be quite good, but I agree that I would much rather use o3. Hoping this means an increase in the limits.
sagarpatil · 9h ago
Before: 50 messages per week
Now: 100 messages per week
lxgr · 9h ago
That’s already been the case for a few weeks though, right? And it’s up from 50, whereas a price reduction of 80% would correspond to 5x the quota, extrapolating linearly.
johnnyApplePRNG · 7h ago
Agreed 100%. o3 is great, but the rate limit window and the quota itself both render it almost useless for more than one-off fixes.
It's great with those, however!
mrcwinn · 11h ago
Only at HN can the reaction to an 80% price drop be a wall of criticism.
alternatex · 6h ago
"80% price drop" is just a title. The wall of criticism is for the fine-print.
monster_truck · 2h ago
Curious that the number of usages for Plus users remained the same. I don't think they're actually doing anything material to lower the cost by a meaningful amount. It's just margin they've always had, and they cut it because Magistral is pretty incredible for being completely free.
coffeecoders · 19h ago
Despite the popular take that LLMs have no moat and are burning cash, I find OpenAI's situation really promising.
Just yesterday, they reported an annualized revenue run rate of 10B. Their last funding round in March valued them at 300B. Despite losing 5B last year, they are growing really fast - 30x revenue with over 500M active users.
It reminds me a lot of Uber in its earlier years—fast growth, heavy investment, but edging closer to profitability.
bitpush · 19h ago
The problem is that your costs also scale with revenue. Ideally you want to control costs as you scale (the first unit you build is expensive, but as you make more, your costs come down).
For OpenAI, the more people use the product, the more they spend on compute, unless they can supplement it with other ways of generating revenue.
Unfortunately, I don't think OpenAI will be able to hit sustained profitability (see Netflix for another example).
simonw · 17h ago
"... as you make more your costs come down"
I'd say dropping the price of o3 by 80% due to "engineers optimizing inferencing" is a strong sign that they're doing exactly that.
asadotzler · 15h ago
You trust their PR statements?
simonw · 10h ago
Seems more likely to me than them deciding to take a sizable loss on inference by dropping prices by 80% for no reason.
Optimizing serving isn't unlikely: all of the big AI vendors keep finding new efficiencies, it's been an ongoing trend over the past two years.
bitpush · 9h ago
This is my sense as well. You don't drop prices 80% on a random Tuesday based on scale; you do it with an explicit goal to get market share at the expense of $$.
TZubiri · 14h ago
It's not a PR statement, it's a change in price. Literally putting money where the mouth is.
theappsecguy · 14h ago
Or they are trying to gobble up market share because Anthropic has been much better than OpenAI
petesergeant · 7h ago
Providers are exceptionally easy to switch. There's no moat for enterprise-level usage. There's no "market share" to gobble up because I can change a line in my config, run the eval suite, and switch immediately to another provider.
This is marginally less true for embedding models and things you've fine-tuned, but only marginally.
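As a rough illustration of how thin that switching layer can be, assuming providers that expose OpenAI-compatible endpoints (the URLs, model names and keys below are placeholders to check against each provider's docs):

    from openai import OpenAI

    PROVIDERS = {
        # Hypothetical entries; substitute real base URLs, model names and keys.
        "openai":    {"base_url": "https://api.openai.com/v1",  "model": "o3"},
        "other-lab": {"base_url": "https://api.example.com/v1", "model": "some-model"},
    }

    def get_client(name: str, api_key: str):
        cfg = PROVIDERS[name]
        return OpenAI(base_url=cfg["base_url"], api_key=api_key), cfg["model"]

    # Switching providers is then one config change before re-running the eval suite.
    client, model = get_client("openai", api_key="sk-...")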
Davidzheng · 13h ago
o3 probably used to have a HUGE profit margin on inference, so I'd say it's unclear how much optimization was actually done.
programjames · 11h ago
I find it pretty plausible they got an 80% speedup just by making optimized kernels for everything. Even when GPUs say they're being 100% utilized, there are so many improvements to be made, like:
- Carefully interleaving shared memory loading with computation, and the whole kernel with global memory loading.
- Warp shuffling for softmax.
- Avoiding memory access conflicts in matrix multiplication.
I'm sure the guys at ClosedAI have many more optimizations they've implemented ;). They're probably eventually going to design their own chips or use photonic chips for lower energy costs, but there's still a lot of gains to be made in the software.
Davidzheng · 6h ago
yes I agree that it is very plausible. But it's just unclear whether it is more of a business decision or a real downstream effect of engineering optimizations (which I assume are happening everyday at OA)
lossolo · 13h ago
> "engineers optimizing inferencing"
They finally implemented DeepSeek open source methods for fast inference?
Legend2440 · 19h ago
>(see Netflix for another example)
Netflix has been profitable for over a decade though? They reported $8.7 billion in profit in 2024.
amazingamazing · 18h ago
They increased prices and are not selling a pure commodity tho
ACCount36 · 15h ago
The bulk of AI costs are NOT in inference. They're in R&D and frontier training runs.
The more inference customers OpenAI has, the easier it is for them to reach profitability.
tptacek · 18h ago
All costs are not equal. There is a classic pattern of dogfights for winner-take-most product categories where the long term winner does the best job of acquiring customers at the expense of things like "engineering to reduce costs". I have no idea how the AI space is going to shake out, but if I had to pick between OpenAI's mindshare in the broadest possible cohort of users vs. best/most efficient model, I'd pick the customers.
Obviously, lots of nerds on HN have preferences for Gemini and Claude, and having used all three I completely get why that is. But we should remember we're not representative of the whole addressable market. There were probably nerds on like ancient dial-up bulletin boards explaining why Betamax was going to win, too.
awongh · 17h ago
We don't even know yet if the model is the product, though, and whether OpenAI is the company that will make the AI product/model (chat that keeps expanding into other functionalities and capabilities), or whether it will be 10,000 companies using the OpenAI models. (Well, it's probably both, but in what proportion of revenue?)
tptacek · 17h ago
Right, but it might not even matter if all the competitors are in the ballpark of the final product/market fit and OpenAI holds a commanding lead in customer acquisition.
Again: I don't know. I've got no predictions. I'm just saying that the logic where OpenAI is outcompeted on the models themselves and thus automatically loses does not hold.
TZubiri · 11h ago
Unlike Uber or WhatsApp, there's no network effect. I don't think this is a winner-takes-all market; there was an article where we had this discussion earlier. Players who get a small market share are immediately profitable in proportion to that share (given a minimum size is exceeded).
Magmalgebra · 18h ago
Anyone concerned about cost should remember that those costs are dropping exponentially.
Similarly, nearly all AI products, but especially OpenAI's, are heavily _under_monetized. OpenAI is an excellent personal shopper - the ad revenue that could be generated from that could rival Facebook's or Google's.
smelendez · 17h ago
It wouldn't surprise me if they try, but ironically if GPT is a good personal shopper, it might make it harder to monetize with ads because people will trust the bot's organic responses more than the ads.
You could override its suggestions with paid ones, or nerf the bot's shopping abilities so it doesn't overshadow the sponsors, but that will destroy trust in the product in a very competitive industry.
You could put user-targeted ads on the site not necessarily related to the current query, like ads you would see on Facebook, but if the bot is really such a good personal shopper, people are literally at a ChatGPT prompt when they see the ads and will use it to comparison shop.
whiplash451 · 16h ago
Alternative: let users reduce their monthly bill by accepting a sponsored answer with a dedicated button in the UI
(with many potential variants)
marsten · 16h ago
You raise a good point that this isn't a low marginal cost business like software, telecom, or (most of) the web. Efficiency will be a big advantage for companies that can achieve it, in part because it will let them scale to new AI use cases.
With the race to get new models out the door, I doubt any of these companies have done much to optimize cost so far. Google is a partial exception – they began developing the TPU ten years ago and the rest of their infrastructure has been optimized over the years to serve computationally expensive products (search, gmail, youtube, etc.).
aizk · 18h ago
> sustained profitability (see Netflix for another example)
What? Netflix is incredibly profitable.
bitpush · 18h ago
Probably a bad example on my part, but also because of increasing prices and offering a tier with ads. I was mostly talking about Netflix as it was originally conceived: "give access to unlimited content at a flat fee," which didn't scale well.
whiplash451 · 16h ago
Isn't this exactly what they offer today?
therealdrag0 · 17h ago
As an anecdote, they have first-mover advantage on me. I pay monthly, but mostly because it's good enough and I can't be bothered to try a bunch out and switch. But if the dust settles and prices drop, I would be motivated to switch. How much that matters maybe depends on whether their revenue comes from app users or API plans. And first mover only works once. Now they may be coasting on name recognition, but otherwise new users may be load-balanced among all the options.
jillesvangurp · 6h ago
The moat is increasingly becoming having access to billions needed to finance the infrastructure needed to serve billions. That's why Google is still in the game. They have that and they are very good at massive scale and have some cost advantages there.
OpenAI is very good at this as well because of their brand name. For many people ChatGPT is all they know. That's the one that's in the news. That's the one everybody keeps talking about. They have many millions of paying users at this point.
This is a non-trivial moat. If cost reasons mean you can only serve a small slice of the market, you can't be successful. It's how Google has been able to guard its search empire for a quarter century. It's easy to match what they do algorithmically. But growing from a niche search engine with maybe a few tens of thousands of users (e.g. Kagi) to Google scale, serving essentially most of this planet (minus some firewalled countries like Russia and China), is a bit of a journey.
So Google rolling out search integration is a big deal. It means they are readying themselves for that scale and will have billions of users exposed to this soon.
> Their last funding round in March valued them at 300B. Despite losing 5B last year, they are growing really fast
Yes, they are valued based on world+dog needing agentic AIs and subscribing to the extent of tens or hundreds of dollars/month. It's going to outstrip the revenue of things like MS Office in its prime.
5B loss is peanuts compared to that. If they weren't burning that, their ambition level would be too low.
Uber now has a substantial portion of the market. They have about 3-4 billion in revenue per month. A lot of cost obviously, but they managed 10B profit last year. And they are not done growing yet. They were overvalued at some point and then they crashed, but they are still there, and it's a pretty healthy business at this point, which is reflected in their stock price. It's basically valued higher now than at the time of the SoftBank investment pre-IPO. Of course a lot of stuff needed to be sorted out for that to happen.
seydor · 19h ago
Their moat is leaky because LLM prices will be dropping forever and the only viable model will be a free model. Eventually everyone will catch up.
Plus there is the fact that "thinking models" can't really solve complex tasks / aren't really as good as they are believed to be.
Zaheer · 19h ago
I would wager most of their revenue is from the subscriptions - both consumer and business. That pricing is detached from the API pricing. The heavy emphasis on applications more recently is because they realize this as well.
rgavuliak · 19h ago
I don't think the no moat approach makes sense. In a world where more an more content and interaction is done with and via LLMs, the data of your users chatting with your LLM is a super valuable dataset.
ToucanLoucan · 19h ago
I mean sure, it's very promising if OpenAI's future is your only metric. It gets notably darker if you look at the broader picture of ChatGPT (and company)'s impact on our society.
* We have people uploading tons of zero-effort slop pieces to all manner of online storefronts, and making people less likely to buy overall because they assume everything is AI now
* We have an uncomfortable community of, to be blunt, actual cultists emerging around ChatGPT, doing all kinds of shit from annoying their friends and family all the way up to divorcing their spouses
* Education is struggling in all kinds of ways due to students using (and abusing) the tech, with already strained administrations struggling to figure out how to navigate it
Like yeah, if your only metric is OpenAI's particular line going up, it's looking alright. And much like Uber, its success seems to be corrosive to the society in which it operates. Is this supposed to be good news?
BugheadTorpeda6 · 2h ago
I absolutely agree. I find it abhorrent.
arealaccount · 18h ago
Dying for a reference on the cult stuff, a quick search didn’t provide anything interesting.
Scroll through the ChatGPT subreddit right now and tell me there isn't a TON of people in there who are legitimately unwell. Reads like the back page notes of a dystopian novel.
arandomhuman · 16h ago
I think this is less caused by ChatGPT/LLMs and more of a phenomenon in social media circles where people flock to "the thing" and have poor social skills and mental health generally speaking.
Poetic, but I don't think that really explains anything.
SlowTao · 16h ago
Yes, but in a typical western business sense they are merely optimizing for user engagement and profits. What happens to society a decade from now because of all the slop being produced is not their concern. Facebook is just about connecting friends, right? It totally won't become a series of information moats and bubbles controlled by the algorithms...
A great communicator on the risks of AI being too heavily integrated into society is Zak Stein. As someone who works in education, he sees first-hand how people are becoming dependent on this stuff rather than pursuing any kind of self-improvement - people who are just handing over all their thinking to the machine. It is very bizarre, and I am seeing it a lot more in my personal experience over the last few months.
blueblisters · 19h ago
This is the best model out there, priced level or lesser than Claude and Gemini
They’re not letting the competition breathe
Davidzheng · 13h ago
Gemini is close (if not better), so it just makes sense, no? o3-pro might be ahead of the pack, though.
blueblisters · 7h ago
o3 does better especially if you use the api (not ChatGPT)
ilaksh · 20h ago
Maybe because they also are releasing o3-pro.
seydor · 19h ago
when the race to the bottom reaches the bottom, the foundation model companies will be bought by ... energy companies. You 'll be paying for AI with your electricity bill
paxys · 18h ago
It'll be the opposite. Large tech companies are already running their own power plants.
andyferris · 14h ago
Yes - it’s common in traditional industries too. In my home town the aluminum refinery bought the power station to improve reliability (I should add - through upgrades not screwing over the rest of the power users).
JojoFatsani · 16h ago
O3 is really good. I haven’t had the same results with o4 unfortunately
muzani · 6h ago
It's one of the most unfortunate naming conventions
sagarpatil · 9h ago
Meanwhile Apple: Liquid Glass
visiondude · 19h ago
always seemed to me that efficient caching strategies could greatly reduce costs… wonder if they cooked up something new
xmprt · 19h ago
How are LLMs cached? Every prompt would be different so it's not clear how that would work. Unless you're talking about caching the model weights...
That link explains how OpenAI uses it, but doesn't really walk through how it's any faster. I thought the whole point of transformers was that inference speed no longer depended on prompt length. So how does caching the prompt help reduce latency if the outputs aren't being cached?
> Regardless of whether caching is used, the output generated will be identical. This is because only the prompt itself is cached, while the actual response is computed anew each time based on the cached prompt
catlifeonmars · 14h ago
> OpenAI routes API requests to servers that recently processed the same prompt,
My mind immediately goes to rowhammer for some reason.
At the very least this opens up the possibility of some targeted denial of service
xmprt · 13h ago
Later they mention that they have some kind of rate limiting because if over ~15 requests are being processed per minute, the request will be sent to a different server. I guess you could deny cache usage but I'm not sure what isolation they have between different callers so maybe even that won't work.
catlifeonmars · 6h ago
So the doc mentions you can influence the cache key by passing an optional user parameter. It’s unclear from the doc whether the user parameter is validated or if you can just provide an arbitrary string.
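For what it's worth, a minimal sketch of passing that field with the Python SDK (the value is an arbitrary string I made up; whether and how it is validated is exactly the open question above, and the model name is a placeholder):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="o3",  # placeholder
        messages=[
            # Keeping the long, static part of the prompt first maximizes the
            # shared prefix that prompt caching can reuse across requests.
            {"role": "system", "content": "...long static instructions..."},
            {"role": "user", "content": "today's actual question"},
        ],
        user="end-user-1234",  # optional caller-chosen identifier that can influence cache routing
    )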
catlifeonmars · 13h ago
15 requests/min is pretty low. Depending on how large the fleet is you might end up getting load balanced to the same one and if it’s round robin then it would be deterministic
hadlock · 19h ago
I've asked it a question not in its dataset three different ways, and I see the same three sentences in the response, word for word, which could imply it's caching the core answer. I hadn't seen this behavior before this last week.
amanda99 · 19h ago
You would use a KV cache to cache a significant chunk of the inference work.
xmprt · 13h ago
Using KV in the caching context is a bit confusing because it usually means key-value in the storage sense of the word (like Redis), but for LLMs it means the key and value tensors. So IIUC, the cache will store the results of the K and V matrix multiplications for a given prompt, and the only computation that needs to be done is the Q projection and the attention calculation.
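A toy numpy sketch of that idea, ignoring layers, heads and batching (all shapes and weights here are made up, purely to show which work the cache saves):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    # Prefill: project the whole prompt once and keep K/V around.
    prompt = rng.normal(size=(100, d))      # stand-in for embedded prompt tokens
    K_cache = prompt @ Wk
    V_cache = prompt @ Wv

    def next_token_attention(x):
        """Attend from one new token using the cached keys/values."""
        global K_cache, V_cache
        K_cache = np.vstack([K_cache, x @ Wk])   # only the new token's K/V are computed
        V_cache = np.vstack([V_cache, x @ Wv])
        q = x @ Wq
        scores = K_cache @ q / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V_cache

    out = next_token_attention(rng.normal(size=(d,)))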
biophysboy · 19h ago
Do you mean that they provide the same answer to verbatim-equivalent questions, and pull the answer out of storage instead of recalculating each time? I've always wondered if they did this.
koakuma-chan · 19h ago
The prompt may be the same but the seed is different every time.
biophysboy · 17h ago
Could you not cache the top k outputs given a provided input token set? I thought the randomness was applied at the end by sampling the output distribution.
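That intuition is roughly right for a single next-token step; here's a tiny sketch (the logits are invented) of why the same cached distribution still yields different continuations unless the seed is pinned, and why whole answers can't simply be cached:

    import numpy as np

    cached_logits = np.array([2.0, 1.5, 0.3, -1.0])   # pretend next-token logits for 4 tokens

    def sample(logits, temperature, rng):
        p = np.exp(logits / temperature)
        p /= p.sum()
        return int(rng.choice(len(logits), p=p))

    rng = np.random.default_rng()
    print([sample(cached_logits, 0.8, rng) for _ in range(5)])        # varies run to run

    rng_fixed = np.random.default_rng(42)
    print([sample(cached_logits, 0.8, rng_fixed) for _ in range(5)])  # same sequence every run

    # Note: after the first sampled token, the next distribution depends on it,
    # so only the shared prefix work is cacheable, not the full continuation.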
Traubenfuchs · 18h ago
I bet there is a set of repetitive one- or two-question user requests that makes up a sizeable share of all requests. The models are so expensive to run that 1% would be enough - much less than 1%. To make it less obvious they probably have a big set of response variants. I don't see how they would not do this.
They probably also have cheap code or cheap models that normalize requests to increase cache hit rate.
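A minimal sketch of the kind of normalization plus exact-match cache being speculated about here (purely hypothetical; nothing suggests this is what OpenAI actually does):

    import hashlib

    _cache: dict[str, str] = {}

    def cache_key(prompt: str) -> str:
        # Collapse whitespace/case so trivially different phrasings share a key.
        normalized = " ".join(prompt.lower().split()).rstrip(" ?!.")
        return hashlib.sha256(normalized.encode()).hexdigest()

    def answer(prompt: str, model_call) -> str:
        key = cache_key(prompt)
        if key not in _cache:
            _cache[key] = model_call(prompt)   # only pay for inference on a miss
        return _cache[key]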
koakuma-chan · 19h ago
A lot of the prompt is always the same: the instructions, the context, the codebase (if you are coding), etc.
tasuki · 18h ago
> Every prompt would be different
No? Eg "how to cook pasta" is probably asked a lot.
It used to take decades of erosion to make Google Search a hot mess; now that everything's happening at light speed, it takes days for AI models to decay to the same point.
ramesh31 · 18h ago
Anthropic will need to follow suit with Opus soon. It is simply too expensive for anything by an order of magnitude.
sagarpatil · 9h ago
Have they ever decreased the price in the past? I don’t remember.
minimaxir · 20h ago
...how? I'd understand a 20-30% price drop from infra improvements for a model as-is, but 80%?
I wonder if "we quantized it lol" would classify as false advertising for modern LLMs.
tofof · 19h ago
Presumably because the price was about 5x higher to begin with than any of the competitors at the same tier of performance? Perhaps it's better to get paid anything at all than to just lose 100% of the customers.
drexlspivey · 19h ago
Deepseek made a few major innovations allowing them to achieve major compute efficiency and then published them. My guess is that OpenAI just implemented these themselves.
vitaflo · 17h ago
Wouldn’t surprise me. And even with this price cut it’s still 4x more expensive than Deepseek R1 is.
OutOfHere · 17h ago
o3 is very much needed in VS Code GitHub Copilot for Ask/Edit/Agent modes. It is sorely missing there.
janstice · 2h ago
Sure is - and o3 is missing from the OpenAI models that Azure is serving, which I suspect isn’t a coincidence - if OpenAI has some secret sauce that lets them undercut resellers this might shake up agreements for a bit.
candiddevmike · 19h ago
It's going to be a race to the bottom, they have no moat.
rvnx · 19h ago
Especially now that they are second in the race (behind Anthropic), and a lot of free-to-download and free-to-use models are now starting to be viable competitors.
Once new MacBooks and iPhones have enough memory onboard this is going to be a disaster for OpenAI and other providers.
mattnewton · 19h ago
I'm not sure they're scared of Anthropic - they're doing great work but afaict running into some scaling issues and really focused on winning over developers at the moment.
If I was OpenAI (or Anthropic for that matter) I would remain scared of Google, who is now awake and able to dump Gemini 2.5 pro on the market at costs that I'm not sure people without their own hardware can compete with, and with the infrastructure to handle everyone switching to them tomorrow.
itomato · 19h ago
Codex Research Preview appeared in my account in the early AM.
piuantiderp · 18h ago
Google is going to lap them. The hardware muscle they have has not even started flexing
koakuma-chan · 19h ago
What do you mean, Google is number 1
slashdev · 18h ago
Third for coding, after Anthropic and Gemini, which was leading last I checked.
aerhardt · 19h ago
OpenAI is second in the race to Anthropic in some benchmarks (maybe?), but OpenAI still dwarfs Anthropic in distribution and popularity.
ratedgene · 19h ago
That's slowly changing. I know some relatively non-tech savvy young people using things like Claude for various reasons, so people are exploring options.
I wonder how much of this is brand name? Like Kleenex. Non-tech people might not search for LLM, generative AI, etc. ChatGPT may just be what people have heard of. I’m assuming OpenAI has a large advantage over Anthropic, and the name helps, but I bet the name is exaggerating the difference here a bit. Not everyone buys Kleenex branded Kleenex.
jstummbillig · 6h ago
You are not going to find Claude when googling for ChatGPT
rvnx · 18h ago
This is such a big difference, thank you for sharing it, I didn't expect the gap to be _that_ huge
jdprgm · 19h ago
While mac unified ram inference is great for prosumers+ I really don't foresee Apple making 128GB+ options affordable enough to be attractive for inference for the general public. iPhone even less so considering the latest is only at 8GB. Meanwhile the best model sizes will just keep growing.
paxys · 18h ago
Third behind Anthropic/Google. People are too quick to discount mindshare though. For the vast majority of the world's population AI = LLM = ChatGPT, and that itself will keep OpenAI years ahead of the competition as long as they don't blunder away that audience.
mrweasel · 19h ago
My understanding was that OpenAI couldn't make money at their previous price point, and I don't think operation and training costs have gone down sufficiently to make up for those shortcomings. So how are they going to make money by lowering the price by 80%?
I get that the point is to be the last man standing, poaching customers by lowering the price and perhaps attracting a few people who wouldn't have bought a subscription at the higher price. I just question how long investors can justify pouring money into OpenAI. OpenAI is also the poster child for modern AI, so if they fail the market will react badly.
Mostly I don't understand Silicon Valley venture capital, but dumping prices, making wild purchases with investor money and mostly only leading on branding - why isn't this a sign that OpenAI is failing?
That seems likely to me, all of the LLM providers have been consistently finding new optimizations for the past couple of years.
joshuajooste05 · 19h ago
There was an article on here a week or two ago on batch inference.
Do you not think that batch inference gives at least a bit of a moat whereby unit costs fall with more prompts per unit of time, especially if models get more complicated and larger in the future?
minimaxir · 19h ago
Batch inference is not exclusive to OpenAI.
Szpadel · 18h ago
For sure they are no longer clear winners, but they try to stay just barely on top of the others.
Right now the new Gemini surpassed their o3 (barely) in benchmarks for significantly less money, so they cut pricing to stay competitive.
I bet they didn't release o4 not because it's not competitive, but because they are playing the Nvidia game: release a new product that is just enough better to convince people to buy it.
So IMO they are holding the full o4 model back to have something to release after the competition releases something better than their top horse.
m3kw9 · 19h ago
LLM inferencing is a race to the bottom, but the service layers on top aren't. People always pay much more for convenience; those are the things OpenAI focuses on, and they are harder to replicate.
ninetyninenine · 19h ago
You know, because LLMs can only be built by corporations... but because they're so easy to build, I see the price going down massively thanks to competition. Consumers benefit because all the companies are trying to outrun each other.
croes · 19h ago
Easy doesn’t mean cheap.
They need lots of energy and customers don’t pay much, if they pay at all
briian · 19h ago
Exactly,
The developers of AI models do have a moat: the cost of training the model in the first place.
It's the 90% of low-effort AI wrappers with little to no value add that have no moat.
codr7 · 19h ago
And then they all go out of business, since models cost a fortune to build, and their fan club is left staring at their computers trying to remember how to do anything without getting it served on a silver plate.
wrsh07 · 16h ago
I expect they don't go out of business: at worst they don't start their next training run quite as aggressively and instead let their new very good model be profitable for a minute
Many many companies are currently thrilled to pay the current model prices for no performance improvement for 2-3 years
We still have so many features to build on top of current capabilities
merth · 19h ago
With investors pouring money in, it's probably impossible to go out of business, at least for the big ones, until investors realise this is the wrong hill to die on.
codr7 · 19h ago
Which they will eventually; so the point stands, no matter how unpopular with the AI excusers out there.
stevev · 13h ago
It was only a matter of time considering Deepseek R1’s recent release. OpenAI’s competitor is an open-source product that offers similar quality at a tenth of the cost. Now they’re just trying to prevent customers from leaving.
biophysboy · 19h ago
I don't know if this is OpenAI's intention, but the little message "you've reached your usage limit!" is actively disincentivizing me from subscribing. For my purposes, the free model is more than good enough; the difference before and after is negligible. I honestly wouldn't pay a dollar.
That said, I'm absolutely willing to hear people out on "value-adds" I am missing; I'm not a knee-jerk hater. (For context, I work with large, complex & private databases/platforms, so it's not really possible for me to do anything but ask for scripting suggestions.)
Also, I am 100% expecting a sad day when I'll be forced to subscribe, unless I want to read dick pill ads shoehorned in to the answers (looking at you, YouTube). I do worry about getting dependent on this tool and watching it become enshittified.
Traubenfuchs · 18h ago
> "you've reached your usage limit!"
Just switch to a competitors free offering. There are enough to cycle through not to be hindered by limits. I wonder how much money I have cost those companies by now?
How anyone believes there is any moat for anyone here is beyond me.
wrsh07 · 16h ago
I expect the answer is <$1 as someone who shares a discord server with a friend where we egregiously ping the models
wrsh07 · 17h ago
o3 is so good it's worth paying for a minute (just for plus) just to see what it's like
I've never used anything like it. I think new Claude is similarly capable
boyka · 8h ago
80%? So this is either the same Trump-style "art of the deal" (setting unreasonable pricing in the first place) or a sign they desperately need customers?
teaearlgraycold · 20h ago
Personally I've found these bigger models (o3/Claude 4 Opus) to be disappointing for coding.
rvnx · 19h ago
Opus is really great, but only through Claude Code. If you used Cursor or RooCode, it's normal that you'd be disappointed.
bitpush · 19h ago
This matches my experience, but I can't explain it. Do you know what's going on?
eunoia · 19h ago
My understanding is context size. Companies like Cursor are trying to minimize the amount of context sent to the models to keep their own costs down. Claude Code seems to send a lot more context with every request and that seems to make the difference.
supermdguy · 18h ago
Just guessing, but the new Opus was probably RL tuned to work better with Claude Code's tool calls
jedisct1 · 19h ago
I got the opposite experience. Not with Opus (too expensive), but with Sonnet. I got things done way more efficiently when using Sonnet with Roo than with Claude Code.
rgbrenner · 18h ago
same. i ran a few tests ($100 worth of api calls) with opus 4 and didn’t see any difference compared to sonnet 4 other than the price.
also no idea why he thinks roo is handicapped when claude code nerfs the thinking output and requires typing “think”/think hard/think harder/ultrathink just to expand the max thinking tokens.. which on ultrathink only sets it at 32k… when the max in roo is 51200 and it’s just a setting.
rvnx · 6h ago
I think I could share a trick that could help:
From my experience (so not an ultimate truth) Claude is not so great at taking the decision for planning by its own: it dives immediately into coding.
If you ask it to think step-by-step it still doesn’t do it but Gemini 2.5 Pro is good at that planning but terrible at actual coding.
So you can use Gemini as planner and Claude as programmer and you get something decent on RooCode.
This “think wisely” that you have to repeat 10x in the prompt is absolutely true
rgbrenner · 1h ago
I think you misread my comment. I wasn't asking for help. I get consistent good output from Sonnet 4 using RooCode, without needing Gemini for planning.
Edit: I think I know where our miscommunication is happening...
The "think"/"ultrathink" series of magic words are a claudecode specific feature used to control the max thinking tokens in the request. For example, in claude code, saying "ultrathink" sets the max thinking tokens to 32k.
On other clients these keywords do nothing. In Roo, max thinking tokens is a setting. You can just set it to 32k, and then that's the same as saying "ultrathink" in every prompt in claudecode. But in Roo, I can also setup different settings profiles to use for each mode (with different max thinking token settings), configure the mode prompt, system prompt, etc. No magic keywords needed.. and you have full control over the request.
Claude Code doesn't expose that level of control.
apwell23 · 19h ago
I found them all disappointing in their own ways. At least DeepSeek models actually listen to what I say instead of ignoring me and doing their own thing like a toddler.
Now cheaper than gpt-4o and same price as gpt-4.1 (!).
runako · 19h ago
> Now cheaper than gpt-4o and same price as gpt-4.1 (!).
This is where the naming choices get confusing. "Should" o3 cost more or less than GPT-4.1? Which is more capable? A generation 3 of tech intuitively feels less advanced than a 4.1 of a (similar) tech.
jacob019 · 18h ago
Do we know parameter counts? The reasoning models have typically been cheaper per token, but use more tokens. Latency is annoying. I'll keep using gpt-4.1 for day-to-day.
koakuma-chan · 17h ago
o3 is a reasoning model, GPT-4.1 is not. They are orthogonal.
runako · 15h ago
My quibble is with naming choices and differentiating. Even here they are confusing:
- o4 is reasoning
- 4o is not
They simply do not do a good job of differentiating. Unless you work directly in the field, it is likely not obvious what is the difference between "our most powerful reasoning model" and "our flagship model for complex tasks."
"Does my complex task need reasoning or not?" seems to be how one would choose. (What type of task is complex but does not require any reasoning?) This seems less than ideal!
koakuma-chan · 15h ago
This is true, and I believe apps automatically route requests to appropriate models for normie users.
agsqwe · 18h ago
Thinking models produce a lot of internal output tokens, making them more expensive than non-reasoning models for similar prompt and visible output lengths.
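A back-of-the-envelope illustration, using the new o3 output price of $8 per million tokens quoted elsewhere in the thread (the reasoning-token count is an invented example, not a measurement):

    PRICE_PER_OUTPUT_TOKEN = 8 / 1_000_000   # USD, o3 after the price drop

    visible_tokens = 500
    hidden_reasoning_tokens = 5_000          # hypothetical; billed as output tokens too

    plain_cost = visible_tokens * PRICE_PER_OUTPUT_TOKEN
    reasoning_cost = (visible_tokens + hidden_reasoning_tokens) * PRICE_PER_OUTPUT_TOKEN
    print(f"${plain_cost:.4f} vs ${reasoning_cost:.4f}")   # ~$0.004 vs ~$0.044 for the same visible answer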
vitaflo · 17h ago
Still 4x more expensive than Deepseek R1 tho.
rvnx · 19h ago
It is slower though
MallocVoidstar · 19h ago
No, people had tested it after Altman's announcement and had confirmed that they were still being billed at the original price. And I checked the docs ~1h after and they still showed the original price.
The speculation of only input pricing being lowered was because yesterday they gave out vouchers for 1M free input tokens while output tokens were still billed.
First, I tried enabling o3 via OpenRouter since I have credits with them already. I was met with the following:
"OpenAI requires bringing your own API key to use o3 over the API. Set up here: https://openrouter.ai/settings/integrations"
So I decided I would buy some API credits with my OpenAI account. I ponied up $20 and started Aider with my new API key set and o3 as the model. I get the following after sending a request:
"litellm.NotFoundError: OpenAIException - Your organization must be verified to use the model `o3`. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate."
At that point, the frustration was beginning to creep in. I returned to OpenAI and clicked on "Verify Organization". It turns out, "Verify Organization" actually means "Verify Personal Identity With Third Party" because I was given the following:
"To verify this organization, you’ll need to complete an identity check using our partner Persona."
Sigh I click "Start ID Check" and it opens a new tab for their "partner" Persona. The initial fine print says:
"By filling the checkbox below, you consent to Persona, OpenAI’s vendor, collecting, using, and utilizing its service providers to process your biometric information to verify your identity, identify fraud, and conduct quality assurance for Persona’s platform in accordance with its Privacy Policy and OpenAI’s privacy policy. Your biometric information will be stored for no more than 1 year."
OK, so now, we've gone from "I guess I'll give OpenAI a few bucks for API access" to "I need to verify my organization" to "There's no way in hell I'm agreeing to provide biometric data to a 3rd party I've never heard of that's a 'partner' of the largest AI company and Worldcoin founder. How do I get my $20 back?"
The AG office followed up and I got my refund. Worth my time to file because we should stop letting companies get away with this stuff where they show up with more requirements after paying.
Separately they also do not need my phone number after having my name, address and credit card.
Has anyone got info on why they are taking everyone’s phone number?
Phone number is the only way to reliably stop MOST fraudulent actors. You can easily block VOIP numbers and ensure the person connected to this number is paying for an actual phone plan, which cuts down dramatically on abuse.
Hence why even Facebook requires a unique, non-VOIP phone number to create an account.
>> Wednesday's 9th Circuit decision grew out of revelations that between 2013 and 2019, X mistakenly incorporated users' email addresses and phone numbers into an ad platform that allows companies to use their own marketing lists to target ads on the social platform.
>> In 2022, the Federal Trade Commission fined X $150 million over the privacy gaffe.
>> That same year, Washington resident Glen Morgan brought a class-action complaint against the company. He alleged that the ad-targeting glitch violated a Washington law prohibiting anyone from using “fraudulent, deceptive, or false means” to obtain telephone records of state residents.
>> X urged Dimke to dismiss Morgan's complaint for several reasons. Among other arguments, the company argued merely obtaining a user's phone number from him or her doesn't violate the state pretexting law, which refers to telephone “records.”
>> “If the legislature meant for 'telephone record' to include something as basic as the user’s own number, it surely would have said as much,” X argued in a written motion.
https://www.mediapost.com/publications/article/405501/None
To me the obvious example is fraud/abuse protection.
I’m pretty sure you do. Claude too. The only chatbot company I’ve made an account with is Mistral specifically because a phone number was not a registration requirement.
They may still buy data from ad companies and store credit cards, etc.
Many of them link users based on phone number.
Also, Netflix wasn't initially selling ads, and there you have it: after increasing the price of their plans drastically in the last few years, the ad-supported subscription is probably their #1 plan, because most people aren't willing to shell out 15 to 25 USD/€ every month to watch content that is already littered with ads.
When I signed up I had to do exactly that.
Seems familiar…
[1] https://www.forbes.com/advisor/investing/cryptocurrency/what...
> I've never heard of that's a 'partner' of the largest AI company and Worldcoin founder
Hello Human Resource, we have all your data, please upload your bio-metric identity, as well as your personal thoughts.
Building the next phase of a corporate totalitarian state, thank you for your cooperation.
(So I’m remaining locked out of my linkedin account.)
Since Mossad and the CIA are essentially one organization, they already do it, 100%.
which, after using it, fair! It found a zero day
- generating synthetic data to train their own models
- hacking and exploitation research
etc
Source?
Link: https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...
It's also the only LLM provider which has this.
What OpenAI has that the others don't is SamA's insatiable thirst for everyone's biometric data.
[0] https://www.aisi.gov.uk/work/replibench-measuring-autonomous...
Contact support and ask for a refund. Then a charge back.
openai persona verification site:community[.]openai[.]com
e.g. a thread with 36 posts beginning Apr 13:
"OpenAI Non-Announcement: Requiring identity card verification for access to new API models and capabilities"
But always good to be on look out for shenanigans :)
https://withpersona.com/legal/privacy-policy
To me it looks like an extremely aggressive data pump.
[1] https://techcrunch.com/2025/04/13/access-to-future-ai-models...
Just send them a random passport photo from the Internet, what's the deal? Probably they are just vibe-verifying the photo with an "Is this a legit passport?" prompt anyway.
This should be illegal. How many are going to do the same as you, but then think that the effort/time/hassle they would waste to try to get their money back would not be worth it? At which point you've effectively donated money to a corp that implements anti-consumer anti-patterns.
Google tried this with Google Plus and Google Wave, failed spectacularly, and have ironically stopped with this idiotic "marketing by blocking potential users". I can access Gemini Pro 2.5 without providing a blood sample or signing parchment in triplicate.
[1] Not really though, because a significant percentage of OpenAI's revenue is from spammers and bulk generation of SEO-optimised garbage. Those are valued customers!
Anthropic exposes reasoning, which has become a big reason to use them for reasoning tasks over the other two despite their pricing. Rather ironic when the other two have been pushing reasoning much harder.
[1] https://ai.google.dev/gemini-api/docs/thinking#summaries [2] https://discuss.ai.google.dev/t/massive-regression-detailed-...
Maybe you’re thinking of deep research mode which is web UI only for now.
Difficulty: Impossible
Agreeing to Persona's terms, especially for biometric identity verification, involves both privacy and long-term data security risks. Here's a clear breakdown of the main risks you should be aware of:
1. Biometric Data Collection
- Risk: Biometric identifiers (like facial recognition, voiceprints, etc.) are extremely sensitive and irreplaceable if compromised.
- Risk: Persona says biometric data is kept for up to one year.
- Risk: Your biometric and ID data may be shared with subprocessors (partners/vendors) that you haven't explicitly vetted.
- Risk: The fine print often includes vague consent for "quality assurance", "model improvement", or "fraud detection".
Should You Agree? Only if:
If you're uneasy about this, you're not alone. Many developers and privacy advocates refuse to verify with biometrics for non-critical services, and companies like OpenAI are increasingly facing criticism for requiring this.
I don't see this happening with, for example, DeepSeek.
Is it possible they are saving on resources by having it answer that way?
When I worked at Netflix I sometimes heard the same speculation about intentionally bad recommendations, which people theorized would lower streaming and increase profit margins. It made even less sense there as streaming costs are usually less than a penny. In reality, it’s just hard to make perfect products!
(I work at OpenAI.)
Example, I asked it to write something. And then I asked it to give me that blob of text in markdown format. So everything it needed was already in the conversation. That took a whole minute of doing web searches and what not.
I actually dislike using o3 for this reason. I keep the default at 4o. But sometimes I forget to switch back, and it goes off boiling the ocean to answer a simple question. It's a bit too trigger-happy with that. In general, all this version and model soup is impossible to figure out for non-technical users. And I've noticed 4o is now sometimes starting to do the same. I guess too many users never use the model drop-down.
I am not saying they haven't improved the laziness problem, but it does happen, anecdotally. I even got a similar sort of "lazy" response for something I am building with gemini-2.5-flash.
takes tinfoil hat off
Oh, nvm, that makes sense.
I've always wondered how OpenAI could get away with o3's astronomical pricing, though. What does o3 do better than any other model to justify the premium cost?
I have a suspicion that's how they were able to get gpt-4-turbo so fast. In practice, I found it inferior to the original GPT-4 but the company probably benchmaxxed the hell out of the turbo and 4o versions so even though they were worse models, users found them more pleasing.
o3 is still o3 (no nerfing) and o3-pro is new and better than o3.
If we were lying about this, it would be really easy to catch us - just run evals.
(I work at OpenAI.)
However starting from a week ago, the o3 responses became noticeably worse, with G2.5P staying about the same (in terms of what I've come to expect from the two models).
This alongside the news that you guys have decreased the price of o3 by 80% does really make it feel like you've quantized the model or knee-capped thinking or something. If you say it is wholly unchanged I'll believe you, but not sure how else to explain the (admittedly subjective) performance drop I've experienced.
o4-mini-high, o4-mini, o3, o3-pro, gpt-4o
Oy.
o3 pro is a different thing - it’s not just o3 with maximum remaining effort.
Here's the current state with version numbers as far as I can piece it together (using my best guess at naming of each component of the version identifier. Might be totally wrong tho):
1) prefix (optional): "gpt-", "chatgpt-"
2) family (required): o1, o3, o4, 4o, 3.5, 4, 4.1, 4.5,
3) quality? (optional): "nano", "mini", "pro", "turbo"
4) type (optional): "audio", "search"
5) lifecycle (optional): "preview", "latest"
6) date (optional): 2025-04-14, 2024-05-13, 1106, 0613, 0125, etc (I assume the last ones are a date without a year for 2024?)
7) size (optional): "16k"
Some final combinations of these version number components are as small as 1 ("o3") or as large as 6 ("gpt-4o-mini-search-preview-2024-12-17").
Given this mess, I can't blame people assuming that the "best" model is the one with the "biggest" number, which would rank the model families as: 4.5 (best) > 4.1 > 4 > 4o > o4 > 3.5 > o3 > o1 (worst).
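Just to underline how ad hoc this is, here's a toy parser for the components guessed above (my own regex, not any official grammar, and it surely misses corner cases):

    import re

    PATTERN = re.compile(
        r"^(?:(?P<prefix>gpt|chatgpt)-)?"
        r"(?P<family>o1|o3|o4|4o|3\.5|4\.5|4\.1|4)"
        r"(?:-(?P<quality>nano|mini|pro|turbo))?"
        r"(?:-(?P<type>audio|search))?"
        r"(?:-(?P<lifecycle>preview|latest))?"
        r"(?:-(?P<date>\d{4}-\d{2}-\d{2}|\d{4}))?"
        r"(?:-(?P<size>\d+k))?$"
    )

    for name in ["o3", "gpt-4o-mini-search-preview-2024-12-17", "gpt-3.5-turbo-16k"]:
        m = PATTERN.match(name)
        print(name, m.groupdict() if m else "no match")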
If we did change the model, we'd release it as a new model with a new name in the API (e.g., o3-turbo-2025-06-10). It would be very annoying to API customers if we ever silently changed models, so we never do this [1].
[1] `chatgpt-4o-latest` being an explicit exception
Speaking of a new name. I'll donate the API credits to run a "choose a naming scheme for AI models that isn't confusing AF" for OpenAI.
It's the same logic as why UB in C/C++ isn't a license to do whatever the compiler wants. We're humans and we operate on implications, common-sense assumptions and trust.
https://cloud.google.com/products?hl=en#product-launch-stage...
"At Preview, products or features are ready for testing by customers. Preview offerings are often publicly announced, but are not necessarily feature-complete, and no SLAs or technical support commitments are provided for these. Unless stated otherwise by Google, Preview offerings are intended for use in test environments only. The average Preview stage lasts about six months."
If the "preview release" you were using was v0.3, and suddenly it started being v0.6 without warning, that would be insane. The only point of providing a version number is to give people an indicator of consistency. The datestamp is a version number. If they didn't want us to expect consistency, they should not have given it a version number. That's the whole point of rolling release branches, they have no version. You don't have "v2.0" of a rolling release, you just have "latest". They fucked up by giving it a datestamp.
This is an extremely old and well-known problem with software interfaces. Either you version it or you don't. If you do version it, and change it, you change the version, and give people dependent on the old version some time to upgrade. Otherwise it breaks things, and that pisses people off. The alternative is not versioning it, which is a signal that there is no consistency to be expected. Any decent software developer should have known all this.
And while I'm at it: what's with the name flip-flopping? In 2014, GCP issued a PR release explaining It was no longer using "Preview", but "Alpha" and "Beta" (https://cloudplatform.googleblog.com/2014/10/new-release-pha...). But the link you showed earlier says "Alpha" and "Beta" are now deprecated. But no PR release? I guess that's our bad for not constantly reading the fine print and expecting it to revert back to something from 11 years ago.
I bet someone at Google would be a bit surprised to see someone jumping to legalese to act like this...novelty...is inherently due to the preview status, and based on anything more than a sense that there's no net harm done to us if it costs the same and is better.
I'm not sure they're wrong.
But it also leads to a sort of "nobody knows how anything works because we have 2^N configs and 5 bits" - for instance, 05-06 was also upgraded to 06-05. Except it wasn't: if you sent variable thinking to 05-06 after the upgrade, it'd fail. (And don't get me started on the 5 different thinking configurations for Gemini 2.5 Flash thinking vs. Gemini 05-06 vs. 06-05 and 0 thinking.)
It's a preview model - for testing only, not for production. Really not that complicated.
Why are you in the comments section of a engineering news site?
(note: beyond your, excuse me while I'm direct now, boorish know-nothing reply, the terms you are citing have nothing to do with the thing people are actually discussing around you, despite your best efforts. It doesn't say "we might swap in a new service, congrats!", nor does it have anything to say about that. Your legalese at most describes why they'd pull 05-06, not forward 05-06 to 06-05. This is a novel idea.)
And I mean I genuinely do not understand what you are trying to say. Couldn't parse it.
It’s always worth considering that this may be your problem. If you still don’t get it, the only valuable reply is one which asks a question. Also, including “it’s not that complicated” only serves to inflame.
Do you understand that even if it did say that, that wasn't true either? It was some weird undocumentable half-beast?
I have exactly your attitude about their cavalier use of preview for all things Gemini, and even people's use of the preview models.
But I've also been on this site for 15 years and am a bit wow'd by your interlocution style here -- it's quite rare to see someone flip "the 3P provider swapped the service on us!" into "well they said they could turn it off, of course you should expect it to be swapped for the first time ever!" insert dull sneer about the quality of other engineers
I am done with this thread. We are going around in circles.
It's a cute argument, as I noted, I'm emotionally sympathetic to it even, it's my favorite "get off my lawn." However, I've also been on the Internet long enough to know you write back, at length, when people try anti-intellectualism and why-are-we-even-talking-about-this as interaction.
"b. Disclaimer. PRE-GA OFFERINGS ARE PROVIDED “AS IS” WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES OR REPRESENTATIONS OF ANY KIND. Pre-GA Offerings (i) may be changed, suspended or discontinued at any time without prior notice to Customer and (ii) are not covered by any SLA or Google indemnity. Except as otherwise expressly indicated in a written notice or Google documentation, (A) Pre-GA Offerings are not covered by TSS, and (B) the Data Location Section above will not apply to Pre-GA Offerings."
No comments yet
I didn't read the ToS, like everyone else, but my guess is that degrading model performance at peak times will be one of the things that can slip through. We are not suggesting you are running a different model but that you are quantizing it so that you can support more people.
This can't happen with open-weight models, where you load the model, allocate the memory and run the thing. With OpenAI/Claude, we don't know the model running, how large it is, what it is running on, etc. None of that is provided, and there is only one reason that I can think of: to be able to reduce resources unnoticed.
This is HN and not reddit.
"I didn't read the ToS, like everyone else, but my guess..."
Ah, there it is.
interesting take, I wouldn't be surprised if they did that.
If you could do this automatically, it would be a game changer: you could run the top 5 models in parallel and select the best answer every time.
But it's not practical, because you are the bottleneck: you have to read all 5 solutions and compare them.
remember they have access to the RLHF reward model, against which they can evaluate all N outputs and have the most "rewarded" answer picked and sent
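A minimal sketch of that best-of-N idea; `generate` and `reward_model` are stand-ins for whatever sampler and reward model a provider actually runs:

    def best_of_n(prompt, generate, reward_model, n=5):
        """Sample n candidate answers and return the one the reward model scores highest."""
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda c: reward_model(prompt, c))

    # Usage sketch with trivial stand-ins:
    pick = best_of_n("2+2?", generate=lambda p: "4", reward_model=lambda p, c: len(c))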
When a new model comes out I test the waters a bit with some more ambitious queries and get impressed when it can handle them reasonably well. Over time I take it for granted, then just expect it to handle ever more complex queries, and get disappointed when I hit a new limit.
Anecdotally, it's quite clear that some models are throttled during the day (eg Claude sometimes falls back to "concise mode" - with and without a warning on the app).
You can tell if you're using Windsurf/Cursor too - there are times of the day where the models constantly fail to do tool calling, and other times they "just work" (for the same query).
Finally, there are cases where it was confirmed by the company, like GPT-4o's sycophantic tirade that very clearly impacted its output (https://openai.com/index/sycophancy-in-gpt-4o/)
You've also made the mistake of conflating what's served via API platforms, which are meant to be stable, with frontends, which have no stability guarantees and are very much iterated on in terms of the underlying model and system prompts. The GPT-4o sycophancy debacle was only on the specific model that's served via the ChatGPT frontend and never impacted the stable snapshots on the API.
I have never seen any sort of compelling evidence that any of the large labs tinkers with their stable, versioned model releases that are served via their API platforms.
> At the time of writing, there are two major versions available for GPT-4 and GPT-3.5 through OpenAI’s API, one snapshotted in March 2023 and another in June 2023.
openaichat/gpt-3.5-turbo-0301 vs openaichat/gpt-3.5-turbo-0613, openaichat/gpt-4-0314 vs openaichat/gpt-4-0613. Two _distinct_ versions of the model, and not the _same_ model over time like how people like to complain that a model gets "nerfed" over time.
Trusting these LLM providers today is as risky as trusting Facebook as a platform back when they were pushing their “OpenSocial” stuff
https://github.com/mpfaffenberger/code_puppy
Which is why the base model wouldn't necessarily show differences when you benchmarked them.
Like I suspect if there was a "new" model which was best-of-256 sampling of gpt-3.5-turbo that too would seem like a really exciting model for the first little bit after it came out, because it could probably solve a lot of problems current top models struggle with (which people would notice immediately) while failing to do lots of things that are a breeze for top models (which would take people a little bit to notice).
There are tons of benchmarks that don't show any regressions. Even small and unpublished ones rarely show regressions.
You pay a monthly fee, but Gemini is completely jammed for 5-6 hours when North America is working.
Google is best in pure AI research, both quality and volume. They have sucked at productization for years, though; not just AI but other products as well. Real mystery.
Now I'm feeling similarly with their image generation (which is the only reason I created a paid account two months ago, and the output looks more generic by default).
In the API, we never make silent changes to models, as that would be super annoying to API developers [1]. In ChatGPT, it's a little less clear when we update models because we don't want to bombard regular users with version numbers in the UI, but it's still not totally silent/opaque - we document all model updates in the ChatGPT release notes [2].
[1] chatgpt-4o-latest is an exception; we explicitly update this model pointer without warning.
[2] ChatGPT Release Notes document our updates to gpt-4o and other models: https://help.openai.com/en/articles/6825453-chatgpt-release-...
(I work at OpenAI.)
> Today, we dropped the price of OpenAI o3 by 80%, bringing the cost down to $2 / 1M input tokens and $8 / 1M output tokens.
> We optimized our inference stack that serves o3—this is the same exact model, just cheaper.
https://x.com/hyperknot/status/1932476190608036243
The thing that gets me is it seems to be lying about fetching a web page. It will say things are there that were never on any version of the page and it sometimes takes multiple screenshots of the page to convince it that it's wrong.
It had a few bugs here or there when they pushed updates, but it didn't get worse.
My question is not whether this is true (it is) but why it's happening.
I am willing to believe the aider community has found that Gemini has maintained approximately equivalent performance on fixed benchmarks. That's reasonable considering they probably use a/b testing on benchmarks to tell them whether training or architectural changes need to be reverted.
But all versions of aider I've tested, including the most recent one, don't handle Gemini correctly, so I'm skeptical that they're the state of the art with respect to benchmarking Gemini.
For benchmarks, either Gemini writes code that adheres to the required edit format, builds successfully, and passes unit tests, or it doesn't.
I primarily use aider + 2.5 pro for planning/spec files, and occasionally have it do file edits directly. Works great, other than stopping it mid-execution once in a while.
IMO 2.5 Pro 03-25 was insanely good. I suspect it was also very expensive to run. The 05-06 release was a huge regression in quality, with most people saying it was a better coder and a worse writer. They tested a few different variants and some were less bad than others, but overall it was painful to lose access to such a good model. The just-released 06-05 version seems to be uniformly better than 05-06, with far fewer "wow this thing is dumb as a rock" failure modes, but it still is not as strong as the 03-25 release.
Entirely anecdotally, 06-05 seems to exactly ride the line of "good enough to be the best, but no better than that" presumably to save costs versus the OG 03-25.
In addition, Google is doing something notably different between what you get on AI Studio versus the Gemini site/app. Maybe a different system prompt. There have been a lot of anecdotal comparisons on /r/bard and I do think the AI Studio version is better.
We don't make the hobbyist mistake of randomly YOLO-ing various "quantization" methods after all the training is done and calling it a day. Quantization was done before it went live.
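For readers unfamiliar with the term: post-training weight quantization just means storing the weights at lower precision and accepting a small rounding error. A toy NumPy sketch, purely illustrative and not any lab's actual pipeline:

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 4)).astype(np.float32)   # stand-in for model weights

    scale = np.abs(w).max() / 127.0                  # symmetric int8 scale
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    w_deq = w_q.astype(np.float32) * scale           # what inference would actually see

    print("max rounding error:", float(np.abs(w - w_deq).max()))

Whether any given provider does this after launch is exactly what's in dispute in this thread; the point of the sketch is only that the change is cheap to make and invisible in a version string.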
A proxy for that may be the anecdotal evidence of users who report back in a month that model X has gotten dumber (it started with GPT-4 and keeps happening, esp. with Anthropic and OpenAI models). I haven't heard such anecdotal stories about Gemini, R1, etc.
Some users. For me the drop was so huge it became almost unusable for the things I had used it for.
Hmm, that's evidently and anecdotally wrong:
https://github.com/ggml-org/llama.cpp/discussions/4110
https://arcprize.org/blog/analyzing-o3-with-arc-agi
They inflated expectations and then released to the public a model that underperforms
The main leaderboard page that you linked to is updated quite frequently, but it doesn't contain multiple benchmarks for the same exact model.
I think if your goal is to have properly written language using older writing styles, then you're correct.
https://community.openai.com/t/session-expired-verify-organi...
https://community.openai.com/t/callback-from-persona-id-chec...
https://community.openai.com/t/verification-issue-on-second-...
https://community.openai.com/t/verification-not-working-and-...
https://community.openai.com/t/organization-verfication-fail...
https://community.openai.com/t/help-organization-could-not-b...
https://community.openai.com/t/to-verify-an-organization-acc...
How exactly will a passport check prevent any training?
At most this will block API access to your average Ivan, not a state actor
It generally does not. No idea if there are edge cases where it does, but that's definitely not the norm for the average user.
In my experience, o4-mini and o4-mini-high are far behind o3 in utility, but since I’m rate-limited for the latter, I end up primarily using the former, which has kind of reinforced the perception that OpenAI’s thinking models are behind the competition altogether.
It's great with those, however!
Just yesterday, they reported an annualized revenue run rate of 10B. Their last funding round in March valued them at 300B. Despite losing 5B last year, they are growing really fast - 30x revenue with over 500M active users.
It reminds me a lot of Uber in its earlier years—fast growth, heavy investment, but edging closer to profitability.
For OpenAI, the more people use the product, the more they spend on compute, unless they can supplement it with other ways of generating revenue.
Unfortunately, I don't think OpenAI will be able to hit sustained profitability (see Netflix for another example)
I'd say dropping the price of o3 by 80% due to "engineers optimizing inferencing" is a strong sign that they're doing exactly that.
Optimizing serving isn't unlikely: all of the big AI vendors keep finding new efficiencies, it's been an ongoing trend over the past two years.
This is marginally less true for embedding models and things you've fine-tuned, but only marginally.
- Carefully interleaving shared memory loading with computation, and the whole kernel with global memory loading.
- Warp shuffling for softmax.
- Avoiding memory access conflicts in matrix multiplication.
I'm sure the guys at ClosedAI have many more optimizations they've implemented ;). They're probably eventually going to design their own chips or use photonic chips for lower energy costs, but there are still a lot of gains to be made in the software.
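On the warp-shuffle softmax bullet above: the usual trick is to build the normalizer in a single streaming pass so the kernel never has to materialize intermediates in global memory. A toy pure-Python version of that "online" softmax; real kernels do the same reduction per warp with shuffle instructions, but the arithmetic is identical:

    import math

    def online_softmax(xs):
        """Numerically stable softmax; the normalizer is built in one streaming pass."""
        m, s = float("-inf"), 0.0
        for x in xs:                  # track running max and rescaled running sum
            new_m = max(m, x)
            s = s * math.exp(m - new_m) + math.exp(x - new_m)
            m = new_m
        return [math.exp(x - m) / s for x in xs]

    print(online_softmax([1.0, 2.0, 3.0]))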
They finally implemented DeepSeek open source methods for fast inference?
Netflix has been profitable for over a decade though? They reported $8.7 billion in profit in 2024.
The more inference customers OpenAI has, the easier it is for them to reach profitability.
Obviously, lots of nerds on HN have preferences for Gemini and Claude, and having used all three I completely get why that is. But we should remember we're not representative of the whole addressable market. There were probably nerds on like ancient dial-up bulletin boards explaining why Betamax was going to win, too.
Again: I don't know. I've got no predictions. I'm just saying that the logic where OpenAI gets outcompeted on the models themselves and thus automatically loses doesn't hold.
Similarly, nearly all AI products but especially OpenAI are heavily _under_ monetized. OpenAI is an excellent personal shopper - the ad revenue that could be generated from that rivals Facebook or Google.
You could override its suggestions with paid ones, or nerf the bot's shopping abilities so it doesn't overshadow the sponsors, but that will destroy trust in the product in a very competitive industry.
You could put user-targeted ads on the site not necessarily related to the current query, like ads you would see on Facebook, but if the bot is really such a good personal shopper, people are literally at a ChatGPT prompt when they see the ads and will use it to comparison shop.
(with many potential variants)
With the race to get new models out the door, I doubt any of these companies have done much to optimize cost so far. Google is a partial exception – they began developing the TPU ten years ago and the rest of their infrastructure has been optimized over the years to serve computationally expensive products (search, gmail, youtube, etc.).
What? Netflix is incredibly profitable.
OpenAI is very good at this as well because of their brand name. For many people ChatGPT is all they know. That's the one that's in the news. That's the one everybody keeps talking about. They have many millions of paying users at this point.
This is a non-trivial moat. If you can only be successful by not serving most of the market for cost reasons, then you can't be successful. It's how Google has been able to guard its search empire for a quarter century. It's easy to match what they do algorithmically. But growing from a niche search engine with maybe a few tens of thousands of users (e.g. Kagi) to Google scale, serving essentially most of the planet (minus some firewalled countries like Russia and China), is a bit of a journey.
So Google rolling out search integration is a big deal. It means they are readying themselves for that scale and will have billions of users exposed to this soon.
> Their last funding round in March valued them at 300B. Despite losing 5B last year, they are growing really fast
Yes, they are valued based on world+dog needing agentic AIs and subscribing to the extent of tens or hundreds of dollars/month. The revenue is going to outstrip things like MS Office in its prime.
5B loss is peanuts compared to that. If they weren't burning that, their ambition level would be too low.
Uber now has a substantial portion of the market. They have about 3-4 billion in revenue per month. A lot of cost, obviously, but they managed 10B profit last year. And they are not done growing yet. They were overvalued at some point and then they crashed, but they are still there, and it's a pretty healthy business at this point, which reflects in their stock price. It's basically valued higher now than at the time of the SoftBank investment pre-IPO. Of course, a lot of stuff needed to be sorted out for that to happen.
Plus there is the fact that "thinking models" can't really solve complex tasks / aren't really as good as they are believed to be.
* We have people uploading tons of zero-effort slop pieces to all manner of online storefronts, and making people less likely to buy overall because they assume everything is AI now
* We have an uncomfortable community of, to be blunt, actual cultists emerging around ChatGPT, doing all kinds of shit from annoying their friends and family all the way up to divorcing their spouses
* Education is struggling in all kinds of ways due to students using (and abusing) the tech, with already strained administrations struggling to figure out how to navigate it
Like yeah, if your only metric is OpenAI's particular line going up, it's looking alright. And much like Uber, its success seems to be corrosive to the society in which it operates. Is this supposed to be good news?
A great communicator on the risks of AI being too heavily integrated into society is Zak Stein. As someone who works in education, they see firsthand how people are becoming dependent on this stuff rather than pursuing any kind of self-improvement, just handing over all their thinking to the machine. It is very bizarre, and I am seeing it a lot more in my personal experience over the last few months.
They’re not letting the competition breathe
> Regardless of whether caching is used, the output generated will be identical. This is because only the prompt itself is cached, while the actual response is computed anew each time based on the cached prompt
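That matches how prefix caching is usually described: the cache stores the work of reading the prompt, not the reply. A toy sketch with made-up encode_prompt / sample_reply stand-ins, not any real provider API:

    import hashlib

    kv_cache = {}

    def answer(prompt, encode_prompt, sample_reply):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in kv_cache:
            kv_cache[key] = encode_prompt(prompt)   # expensive prefill, done once per prefix
        return sample_reply(kv_cache[key])          # decoding still runs on every request

The reply is computed anew from the cached prompt state, which is why caching doesn't change what the model says, only how much prefill work gets repeated.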
My mind immediately goes to rowhammer for some reason.
At the very least this opens up the possibility of some targeted denial of service
They probably also have cheap code or cheap models that normalize requests to increase cache hit rate.
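It wouldn't have to be fancy; even something this cheap would collapse a lot of near-duplicate prompts onto one cache key (purely hypothetical, no provider documents doing this):

    import re

    def normalize(prompt: str) -> str:
        # collapse whitespace and case so near-duplicates share a cache key
        return re.sub(r"\s+", " ", prompt).strip().lower()

    assert normalize("How to cook  pasta?") == normalize("how to cook pasta?")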
No? Eg "how to cook pasta" is probably asked a lot.
https://x.com/cursor_ai/status/1932484008816050492
I wonder if "we quantized it lol" would classify as false advertising for modern LLMs.
Once new MacBooks and iPhones have enough memory onboard this is going to be a disaster for OpenAI and other providers.
If I was OpenAI (or Anthropic for that matter) I would remain scared of Google, who is now awake and able to dump Gemini 2.5 pro on the market at costs that I'm not sure people without their own hardware can compete with, and with the infrastructure to handle everyone switching to them tomorrow.
OpenAI vs Anthropic on Google Trends
https://trends.google.com/trends/explore?date=today%203-m&q=...
ChatGPT vs Claude on Google Trends
https://trends.google.com/trends/explore?date=today%203-m&q=...
I get that the point is to be the last man standing: poach customers by lowering the price, and perhaps attract a few people who wouldn't have bought a subscription at the higher price. I just question how long investors can justify pouring money into OpenAI. OpenAI is also the poster child for modern AI, so if they fail, the market will react badly.
Mostly I don't understand Silicon Valley venture capital, but dumping prices, making wild purchases with investor money, and mostly only leading on branding: why isn't this a sign that OpenAI is failing?
That seems likely to me, all of the LLM providers have been consistently finding new optimizations for the past couple of years.
Do you not think that batch inference gives at least a bit of a moat whereby unit costs fall with more prompts per unit of time, especially if models get more complicated and larger in the future?
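Back-of-the-envelope, with entirely made-up numbers, that's the shape of the argument: a GPU costs roughly the same per hour whether it's serving one stream or thirty-two, so cost per token falls as the batches fill up:

    gpu_cost_per_hour = 4.00                 # $/hour, made up
    tokens_per_hour_at_batch_1 = 2_000_000   # throughput at batch size 1, made up

    for batch in (1, 8, 32):
        throughput = tokens_per_hour_at_batch_1 * batch ** 0.8  # assumed sub-linear scaling
        print(f"batch={batch:>2}  ~${gpu_cost_per_hour / throughput * 1e6:.2f} per 1M tokens")

Whether that amounts to a moat depends on whether smaller players can also keep their batches full, which is a question about traffic rather than technology.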
Right now the new Gemini surpassed their o3 (barely) in benchmarks for significantly less money, so they cut pricing to stay competitive.
I bet they haven't released o4 not because it isn't competitive, but because they are playing the Nvidia game: release a new product that is just enough better to convince people to buy it. So IMO they are holding the full o4 model back so they have something to release after the competition ships something better than their top horse.
They need lots of energy and customers don’t pay much, if they pay at all
The developers of AI models do have a moat, the cost of training the model in the first place.
It's 90% of the low effort AI wrappers with little to no value add who have no moat.
Many, many companies are currently thrilled to pay current model prices, even with no performance improvement, for 2-3 years
We still have so many features to build on top of current capabilities
That said, I'm absolutely willing to hear people out on "value-adds" I am missing; I'm not a knee-jerk hater. (For context, I work with large, complex & private databases/platforms, so it's not really possible for me to do anything but ask for scripting suggestions.)
Also, I am 100% expecting a sad day when I'll be forced to subscribe unless I want to read dick pill ads shoehorned into the answers (looking at you, YouTube). I do worry about getting dependent on this tool and watching it become enshittified.
Just switch to a competitors free offering. There are enough to cycle through not to be hindered by limits. I wonder how much money I have cost those companies by now?
How anyone believes there is any moat for anyone here is beyond me.
I've never used anything like it. I think new Claude is similarly capable
Also, no idea why he thinks Roo is handicapped, when Claude Code nerfs the thinking output and requires typing “think” / “think hard” / “think harder” / “ultrathink” just to expand the max thinking tokens, which on “ultrathink” only sets it to 32k, while the max in Roo is 51200 and it's just a setting.
From my experience (so not an ultimate truth), Claude is not so great at doing the planning on its own: it dives immediately into coding.
If you ask it to think step by step, it still doesn't really do it. Gemini 2.5 Pro, on the other hand, is good at that planning but terrible at actual coding.
So you can use Gemini as the planner and Claude as the programmer, and you get something decent in RooCode.
This “think wisely” that you have to repeat 10x in the prompt is absolutely true
Edit: I think I know where our miscommunication is happening...
The "think"/"ultrathink" series of magic words are a claudecode specific feature used to control the max thinking tokens in the request. For example, in claude code, saying "ultrathink" sets the max thinking tokens to 32k.
On other clients these keywords do nothing. In Roo, max thinking tokens is a setting. You can just set it to 32k, and then that's the same as saying "ultrathink" in every prompt in claudecode. But in Roo, I can also setup different settings profiles to use for each mode (with different max thinking token settings), configure the mode prompt, system prompt, etc. No magic keywords needed.. and you have full control over the request.
Claude Code doesn't expose that level of control.
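For the curious, both clients appear to be steering the same underlying knob. A rough sketch, assuming the Anthropic Messages API's extended-thinking parameter and a current Sonnet model name (both assumptions on my part, not something either client documents):

    import anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # assumed model name
        max_tokens=64000,
        # roughly what "ultrathink" in Claude Code or the Roo setting boils down to:
        thinking={"type": "enabled", "budget_tokens": 32000},
        messages=[{"role": "user", "content": "Plan the refactor before writing any code."}],
    )
    print(response.content)

The client just picks the budget for you, either from a magic keyword or from a settings field; the request that goes over the wire looks the same either way.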
> We’ll post to @openaidevs once the new pricing is in full effect. In $10… 9… 8…
There is also speculation that they are only dropping the input price, not the output price (which includes the reasoning tokens).
Input: $2.00 / 1M tokens
Cached input: $0.50 / 1M tokens
Output: $8.00 / 1M tokens
https://openai.com/api/pricing/
Now cheaper than gpt-4o and same price as gpt-4.1 (!).
This is where the naming choices get confusing. "Should" o3 cost more or less than GPT-4.1? Which is more capable? A generation 3 of tech intuitively feels less advanced than a 4.1 of a (similar) tech.
- o4 is reasoning
- 4o is not
They simply do not do a good job of differentiating. Unless you work directly in the field, it is likely not obvious what is the difference between "our most powerful reasoning model" and "our flagship model for complex tasks."
"Does my complex task need reasoning or not?" seems to be how one would choose. (What type of task is complex but does not require any reasoning?) This seems less than ideal!
The speculation of only input pricing being lowered was because yesterday they gave out vouchers for 1M free input tokens while output tokens were still billed.
https://openai.com/api/pricing/
Or is the price drop an attempt to cover up the bad news about the outage?
This makes no sense. No way a global outage will get less coverage than the price drop.
Also the earliest sign of price drop is this tweet 20 hrs ago (https://x.com/OpenAIDevs/status/1932248668469445002), which is earlier than the earliest outage reports 13hrs ago on https://downdetector.com/status/openai/
Have you seen today's outage on any news outlet? I have not. Is there an HN thread?
(It's "contact us" pricing, so I have no idea how much that would set you back. I'm guessing it's not cheap.)