Tokens are getting more expensive

104 points by admp | 71 comments | 8/3/2025, 11:01:37 AM | ethanding.substack.com

Comments (71)

ej88 · 19s ago
The article just isn't that coherent for me.

> when a new model is released as the SOTA, 99% of the demand immediately shifts over to it

99% is in the wrong ballpark. Lots of users use Sonnet 4 over Opus 4, despite Opus being 'more' SOTA. Lots of users use 4o over o3 or Gemini over Claude. In fact it's never been a closer race on who is the 'best': https://openrouter.ai/rankings

>switch from opus ($75/m tokens) to sonnet ($15/m) when things get heavy. optimize with haiku for reading. like aws autoscaling, but for brains.

> they almost certainly built this behavior directly into the model weights

???

Overall, the article seems to argue that companies are running into issues with usage-based pricing because consumers don't accept, or aren't used to, paying per use, and that it's difficult to be the first one to crack and switch to usage-based billing.

I don't think it's as big of an issue as the author makes it out to be. We've seen this play out before in cloud hosting.

- Lots of consumers are OK with a flat fee per month and using an inferior model. 4o is objectively inferior to o3 but millions of people use it (or don't know any better). The free ChatGPT is even worse than 4o and the vast majority of chatgpt visitors use it!

- Heavy users or businesses consume via API and usage based pricing (see cloud). This is almost certainly profitable.

- Fundamentally most of these startups are B2B, not B2C

mystraline · 1h ago
From the article:

> consumers hate metered billing. they'd rather overpay for unlimited than get surprised by a bill.

Yes and no.

Take Amazon. You think your costs are known and then WHAMMO, surprise bill. Why do you get a surprise bill? Because you cannot say 'turn shit off at X dollars per month'. Can't do it. Not an option.

All of these 'Surprise Net 30' offerings are the same. You think you're getting a stable price until GOTCHA.

Now, metered billing can actually be good, when the user knows exactly where they stand on the metering AND can set maximums so their budget doesn't go over.

Taken realistically, as an AI company, you provide a 'used tokens/total tokens' bar graph, tokens per response, and an estimated number of responses before the cap is exceeded.
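A minimal sketch of the kind of meter being described, with purely hypothetical numbers and formatting:

```python
def usage_summary(used_tokens, total_tokens, avg_tokens_per_response, bar_width=20):
    """Render a used/total token bar and estimate responses remaining."""
    frac = used_tokens / total_tokens
    filled = int(frac * bar_width)
    bar = "#" * filled + "-" * (bar_width - filled)
    # Rough estimate of how many more responses fit in the budget
    remaining = max(total_tokens - used_tokens, 0) // avg_tokens_per_response
    return f"[{bar}] {used_tokens:,}/{total_tokens:,} tokens, ~{remaining} responses left"

print(usage_summary(750_000, 1_000_000, 5_000))
# → [###############-----] 750,000/1,000,000 tokens, ~50 responses left
```

The hard part isn't the rendering, it's that the provider has to expose live token counts at all.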

Again, don't surprise the user. But that's anathema to companies who want to hide the tokens-to-dollars conversion, the same way gambling companies obfuscate 'corporate bux' to USD.

mhitza · 37m ago
> Again, don't surprise the user. But that's anathema to companies who want to hide the tokens-to-dollars conversion, the same way gambling companies obfuscate 'corporate bux' to USD.

This is the exact same thing that frustrates me with GitHub's AI rollout. I've been trialing the new Copilot agent, and its cost is fully opaque: multiple references to "premium requests" that don't show up in real time in my dashboard, no clear indication of how many I have in total or have left, and when these premium requests are referenced in the UI they link to documentation that also doesn't talk about limits (instead of linking to the associated billing dashboard).

ikari_pl · 1h ago
I often find Amazon pricing to be vague and cryptic; sometimes there's literally no way to tell why, for example, your database cost is fluctuating all the time.

graemep · 36m ago
If your AWS costs are too complex for you to understand you need to employ a finops person or AWS specialist to handle it for you.

I am not saying this is desirable, but it is necessary IFF you choose to use these services. They are complex by design, and intended primarily for large-scale users who do have the expertise to handle the complexity.

motorest · 2m ago
> If your AWS costs are too complex for you to understand you need to employ a finops person or AWS specialist to handle it for you.

What a baffling comment. Is it normal to even consider hiring someone to figure out how you are being billed by a service? You started with one problem and now you have at least two? And what kind of perverse incentive are you creating? Don't you think your "finops" person has a vested interest in preserving their job by ensuring billing complexity will always be there?

Shank · 5m ago
> If your AWS costs are too complex for you to understand you need to employ a finops person or AWS specialist to handle it for you.

The point where you get sticker shock from AWS is often significantly lower than the point where you have enough money to hire in either of those roles. AWS is obviously the infrastructure of choice if you plan to scale. The problem is that scaling on expertise isn’t instant and that’s where you’re more likely to make a careless mistake and deploy something relatively costly.

lelanthran · 13m ago
> If your AWS costs are too complex for you to understand you need to employ a finops person or AWS specialist to handle it for you

At that point wouldn't it simply be cheaper to do VMs?

ajsnigrutin · 32m ago
But they're also simple and cheap if you're a "one man band" trying out some personal idea that might or might not take off. Those people have no budgets for specialists.

Pricing schemes like these just make them move back to virtual machines with "unlimited" shared cpu usage and setting up services (db,...) manually.

mort96 · 18m ago
I'm 100% on team "just rent VMs and run the software on there". It's not that hard, it has predictable price and performance, and you don't lock yourself into one provider. If you build your whole service on top of some weird Amazon-specific thing, and Amazon jacks up their prices, you don't have any recourse. With VMs, you can just spin up new VMs with another provider.

You could also have potential customers who would be interested in your solution, but don't want it hosted by an American company. Spinning up a few Hetzner VMs is easy. Finding European alternatives to all the different "serverless" services Amazon offers is hard.

joseda-hg · 1h ago
Amazon pricing is nice if you compare it to Azure...
crinkly · 53m ago
Yeah, that. We moved to AWS using their best practices and enterprise cost-estimation stuff and got a 6x cost increase on something that was supposed to be cheaper, and now we're fucked because we can't get out.

It’s nearly impossible to tell what the hell is going where and we are mostly surviving on enterprise discounts from negotiations.

The worst thing is they worked out you can blend costs in using AWS marketplace without having to raise due diligence on a new vendor or PO. So up it goes even more.

Not my department or funeral fortunately. Our AWS account is about $15 a month.

scoreandmore · 1h ago
You can set billing alerts and write a Lambda function to respond and disable resources. Of course they don't make it easy, but if you don't learn how to use limits, what do you expect? This argument amazes me. Cloud services require some degree of responsibility on the user's side.

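For the curious, a minimal sketch of the kind of Lambda this means. The alert payload shape and instance IDs here are simplified and hypothetical; in a real Lambda you'd follow the decision logic with `boto3.client("ec2").stop_instances(InstanceIds=...)`, which is omitted so the logic stands alone:

```python
import json

# Instances tagged (by you, in advance) as safe to stop when a budget alarm fires
STOPPABLE = {"i-0abc123": "dev-box", "i-0def456": "staging-runner"}

def handler(event, context=None):
    """React to an SNS-delivered billing alarm by choosing resources to stop.

    Returns the instance IDs to stop; a real handler would pass these to
    boto3's stop_instances instead of just returning them.
    """
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message.get("state") != "ALARM":
        return []  # informational notification, nothing to do
    return sorted(STOPPABLE)

# Simulated SNS-wrapped alarm payload:
alarm_event = {"Records": [{"Sns": {"Message": json.dumps({"state": "ALARM"})}}]}
print(handler(alarm_event))  # → ['i-0abc123', 'i-0def456']
```

Which rather proves the parent comments' point: the kill switch is yours to build, test, and pay for.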
mystraline · 1h ago
This is complete utter hogwash.

Up until recently, you could hit somebody else's S3 endpoint, no auth, and generate 403s that would charge them tens of thousands of dollars. Couldn't even firewall it. And no way to see it coming, or anything; the number just goes up every 15-30 minutes in the cost dashboard.

Real responsibility is 'I have $100 a month for cloud compute'. Give me an easy way to view it, and shut things down if I exceed that. That's real responsibility, and none of them (Scamazon, Azure, Google) 'permit' it.

They (and well, you) instead say "you can build some shitty clone of the functionality we should have provided, but we would make less money".

Oh, and your lambda job? That too costs money. It should not cost more money to detect and stop stuff on a 'too much cost' report.

This should be a default feature of cloud: an explicit choice between uncapped costs or stopping services.


gray_-_wolf · 1h ago
Last time I looked into this, wasn't there up to an hour of delay on billing alerts? It did not seem possible to ensure you don't run over your budget.
esafak · 1h ago
So you're okay with turning your site off...
mystraline · 52m ago
This is the logical fallacy of false dilemma.

I made it clear that you let the user choose between 'accept the risk of overrun and keep everything running', 'shut down everything on exceeding $X', or even 'shut down these specific services on exceeding $X', or other possible ways to limit and control costs.

The cloud companies do not want to permit this because they would lose money over surprise billing.

dd36 · 36m ago
Cats doing tricks has a limited budget.
verbify · 41m ago
Isn't that the definition of metered billing?
ankit219 · 6m ago
Interesting article, full of speculation and some logical follow-ons, but it feels like it falls short of admitting the true conclusion: model-building companies can build a thin wrapper/harness and offer better prices than third-party companies (the article assumes it costs Anthropic the same per token as it does their customers), because their cost per token is lower than app-layer companies'. Anthropic has a decent margin (likely higher than OpenAI's) on the sale of every token, and with more scale they can sell at a lower cost (or offer unlimited plans with limits that keep out the 1%-5% of power users).

I don't agree with the Cognition conclusion either. Enterprises are fighting super hard to not have a long term buying contract when they know SOTA (app or model) is different every 6 months. They are keeping their switching costs low and making sure they own the workflow, not the tool. This is even more prominent after Slack restricted API usage for enterprise customers.

Making money on the infra is possible, but that again misunderstands Anthropic's pricing power. Lovable, Replit, etc. work because of Claude. OpenAI had Codex, Google had Jules; neither is as good in terms of taste as Claude. It's not the CLI form factor people love, it's the outcome. When Anthropic sees the money being left on the table in the infra play, they will offer the same (at presumably better rates, given Amazon is an investor) and likely repeat this strategy. Abstraction is a good play only if you abstract to the maximum possible level.

furyofantares · 1h ago
> claude code has had to roll back their original unlimited $200/mo tier this week

The article repeats this throughout but isn't it a straight lie? The plan was named 20x because it's 20x usage limits, it always had enforced 5 hour session limits, it always had (unenforced? soft?) 50 session per month limits.

It was limited, but not enough and very very probably still isn't, judging by my own usage. So I don't think the argument would even suffer from telling the truth.

michaelbuckbee · 2h ago
A major current problem is that we're smashing gnats with sledgehammers via undifferentiated model use.

Not every problem needs a SOTA generalist model, and as we get systems/services that are more "bundles" of different models with specific purposes I think we will see better usage graphs.

empiko · 54m ago
Yeah, but the juiciest tasks are still far from solved. The amount of tasks where people are willing to accept low accuracy answers is not that high. It is maybe true for some text processing pipelines, but all the user facing use cases require good performance.
simonjgreen · 1h ago
Completely agree. It's worth spending time to experiment too. A reasonably simple chat support system I built recently uses 5 different models depending on the function it's in. Swapping out different models for different things makes a huge difference to cost, user experience, and quality.
alecco · 1h ago
If there was an option to have Claude Opus guide Sonnet I'd use it for most interactions. Doing it manually is a hassle and breaks the flow, so I end up using Opus too often.

This shouldn't be that expensive even for large prompts since input is cheaper due to parallel processing.
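Back-of-the-envelope math supports this. Output prices below are the ones quoted in the thread ($75/M Opus, $15/M Sonnet); the input prices are assumed at the usual 1:5 input-to-output ratio, so treat them as illustrative:

```python
def request_cost(in_tokens, out_tokens, in_per_m, out_per_m):
    """Dollar cost of one request given per-million-token prices."""
    return in_tokens / 1e6 * in_per_m + out_tokens / 1e6 * out_per_m

# A large prompt (50k input tokens) with a short answer (2k output tokens):
opus = request_cost(50_000, 2_000, 15, 75)   # assumed $15/M in, $75/M out
sonnet = request_cost(50_000, 2_000, 3, 15)  # assumed $3/M in, $15/M out

print(round(opus, 2), round(sonnet, 2))  # → 0.9 0.18
```

Under these assumptions the input side dominates for large prompts, which is exactly why routing the heavy-context reads to a cheaper model pays off.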

isoprophlex · 1h ago
You can define subagents that are forced to run on e.g. Sonnet, and call these from your main Opus-backed agent. /agents in CC for more info...
danielbln · 1h ago
That's what I do. I used to use Opus for the dumbest stuff, writing commits and such, but now that's all subagent business running on Sonnet (or even Haiku sometimes). Same for running tests, executing services, docker, etc. All Sonnet subagents. Positive side effect: my Opus allotment lasts a lot longer.
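For reference, a Claude Code subagent is a markdown file under `.claude/agents/`; a sketch of what a Sonnet-pinned commit-writer could look like (field names from memory, so check /agents for the current format):

```markdown
---
name: commit-writer
description: Writes commit messages from staged diffs. Use for all commit tasks.
model: sonnet
tools: Bash, Read
---

You write concise conventional-commit messages from staged diffs.
Do not modify files; only propose the commit message.
```

The `model` field is what pins the subagent to the cheaper model regardless of what the main agent runs on.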
illusive4080 · 24m ago
I’m just sitting here on my $20 subscription hoping one day we will get to use Opus
mustyoshi · 1h ago
Yeah, this is the thing people miss a lot. 7B and 32B models work perfectly fine for a lot of things, and run on previously-high-end consumer hardware.

But we're still in the hype phase, people will come to their senses once the large model performance starts to plateau

_heimdall · 1h ago
I expect people to come to their senses when LLM companies stop subsidizing cost and start charging customers what it actually costs them to train and run these models.
nateburke · 1h ago
generalist = fungible?

In the food industry is it more profitable to sell whole cakes or just the sweetener?

The article makes a great point about replit and legacy ERP systems. The generative in generative AI will not replace storage, storage is where the margins live.

Unless the C in CRUD can eventually replace the R and U, with the D a no-op.

xrd · 4m ago
This is the moment an open-source solution could pop in and say just "uv add aider", then make sure you have a 24GB card running Qwen3 for each dev, and you're future-proofed for at least the next year. It seems like the only way out.
GiorgioG · 25m ago
I tried Gemini CLI and in 2 hours somehow spent $22 just messing around with a very small codebase. I didn’t find out until the next day from Google’s billing system. That was enough for me - I won’t touch it again.
happytoexplain · 1h ago
While reading this, every time I started a paragraph and saw a lowercase, my brain and eyes were stalling or jumping up, to reflexively look for the text that got cut off. My brain has been trained for decades that, when reading full prose, a paragraph starting with lowercase means I'm starting in the middle of a sentence, and something happened in the layout or HTML to interrupt it.

And, I know this seems dramatic, but besides being cognitively distracting, it also makes me feel sad. Chatroom formatting in published writings is clearly a developing trend at this point, and I love my language so much. Not in a linguistic capacity - I'm not an English expert or anything, nor do I follow every rule - I mean in an emotional capacity.

I'm not trying to be condescending. This is a style choice, not "bad writing" in the typical sense. I realize there is often a lot of low-quality bitterness on both sides about this kind of thing.

Edit:

I also fear that this is exactly the kind of thing where any opinion in opposition to this style will feel like the kind of attack that makes a writer want to push back in a "oh yeah? fuck you" kind of way. I.e. even just my writing this opinion may give an author using the style in question the desire to "double down". Though this conundrum is appropriate (ironic?) - the intensely personal nature of language is part of why I love it.

egypturnash · 34m ago
IT COULD BE WORSE, YOU COULD BE READING A LENGTHY ESSAY PRESENTED ENTIRELY IN ALL CAPS WITH MINIMAL PUNCTUATION TO BREAK IT UP

SEARCH FOR “FILM CRIT HULK” FOR SOME EXAMPLES

simianwords · 46m ago
It’s to draw contrast against extremely polished and sterile looking slop content. Think of it like avoiding em dash but going a bit far.
djhworld · 1h ago
Over the past year or two I've just been paying for the API access and using open source frontends like LibreChat to access these models.

This has been working great for the occasional use, I'd probably top up my account by $10 every few months. I figured the amount of tokens I use is vastly smaller than the packaged plans so it made sense to go with the cheaper, pay-as-you-go approach.

But since I've started dabbling in tooling like Claude Code, hoo-boy, those tokens burn _fast_, like really fast. Yesterday I somehow burned through $5 of tokens in the space of about 15 minutes. Sure, the Code tool is vastly different from asking an LLM about a certain topic, but I wasn't expecting such a huge leap. A lot of the token usage is masked from you, I guess, wrapped up in the ever-increasing context plus back-and-forth tool orchestration. But still.

zurfer · 1h ago
The simple reason for this is that Claude Code uses way more context and repetitions than what you would use in a typical chat.
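A toy model of why this adds up so fast: every agent turn re-sends the whole conversation so far, so total input tokens grow roughly quadratically with the number of turns. The numbers are illustrative, and real tools soften this with prompt caching and context compaction:

```python
def cumulative_input_tokens(turns, tokens_per_turn):
    """Total input tokens billed when each turn re-sends the full history."""
    context = 0
    total = 0
    for _ in range(turns):
        context += tokens_per_turn  # conversation grows every turn
        total += context            # and the full context is sent again
    return total

# 30 tool-calling turns at ~2k tokens each:
print(cumulative_input_tokens(30, 2_000))  # → 930000
# versus a plain chat that sends each message once:
print(30 * 2_000)                          # → 60000
```

A 15x gap from re-sent context alone, before counting file contents and tool output dumped into the conversation.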
TechDebtDevin · 1h ago
$20.00 via Deepseek's API (yes, China can have my code, idc) has lasted me almost a year. It's slow, but better quality output than any of the independently hosted Deepseek models (ime). I don't really use agents or anything tho.
Waterluvian · 34m ago
On the topic of cost per token: is it accurate to represent a token as, ideally, a composable atomic unit of information? Because we're (often) using English as the encoding format, it can only be as efficient as English can encode the data.

Does this mean that other languages might offer better information density per token? And does this mean that we could invent a language that’s more efficient for these purposes, and something humans (perhaps only those who want a job as a prompt engineer) could be taught?

Kevin speak good? https://youtu.be/_K-L9uhsBLM?si=t3zuEAmspuvmefwz

joseda-hg · 15m ago
IIRC, in linguistics there's a hypothesis of "Uniform Information Density" that languages seem to follow at a human level (denser languages are spoken more slowly, sparser languages faster), so you might have to go for an artificial encoding that maps effectively to English.

English (and any of the dominant languages you could use in its place) works significantly better than other languages purely by having significantly larger bodies of work for the LLM to draw from.

Waterluvian · 12m ago
Yeah I was wondering about it basically being a dialect or the CoffeeScript of English.

Maybe even something anyone can read and maybe write… so… Kevin English.

Job applications will ask for how well one can read and write Kevin.

deegles · 28m ago
Human speech has a bit rate of around 39 bits per second, no matter how quickly you speak. Assuming reading is similar, I guess more "dense" tokens would just take longer for humans to read.

https://www.science.org/content/article/human-speech-may-hav...

r_lee · 28m ago
Sure. For example, Korean is Unicode-heavy: e.g. 경찰 = police, but it's just 2 Unicode chars. Not too familiar with how things are encoded, but it could be more efficient.
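Worth separating the layers here: character count and byte count already diverge for Korean, and token count is yet another axis, since BPE-style tokenizers split on learned subwords (trained mostly on English text), so a 2-character Korean word can still cost several tokens. The byte-level part is easy to check:

```python
# Character count vs UTF-8 byte count for English/Korean word pairs
pairs = [("police", "경찰"), ("library", "도서관")]

for en, ko in pairs:
    print(f"{en!r}: {len(en)} chars, {len(en.encode('utf-8'))} UTF-8 bytes")
    print(f"{ko!r}: {len(ko)} chars, {len(ko.encode('utf-8'))} UTF-8 bytes")
# '경찰' is 2 chars but 6 bytes: each Hangul syllable is 3 bytes in UTF-8
```

So "2 Unicode chars" doesn't directly mean cheap; whether it's fewer tokens depends entirely on the tokenizer's learned vocabulary.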
mark_l_watson · 2h ago
I have already thought a lot about the large packaged inference companies hitting a financial brick wall, but I was surprised by material near the end of the article: the discussions of lock in for companies that can’t switch and about Replit making money on the whole stack. Really interesting.

I managed a deep learning team at Capital One and the lock-in thing is real. Replit is an interesting case study for me: after a one-week free agent trial I signed up for a one-year subscription, had fun with their LLM-based coding agent for a few weeks, and almost never used it after that, but I still have fun with Replit as an easy way to spin up Nix-based coding environments. Replit seems to offer something for everyone.

strangescript · 46m ago
We haven't reached a peak on scaling/performance, so even if an old model can be commoditized, a new one will be created to take advantage of the newly freed infra. Until we hit a ceiling on scaling, tokens are going to remain expensive relative to what people are trying to do with them because the underlying compute is expensive.
dcre · 42m ago
Vibes-based analysis. We have no idea how much these models cost to serve.
raincole · 1h ago
First of all the title is click-bait. Tokens are getting cheaper and cheaper. People just use more and more tokens.

And everything, I mean everything, after the title is downhill:

> saying "this car is so much cheaper now!" while pointing at a 1995 honda civic misses the point. sure, that specific car is cheaper. but the 2025 toyota camry MSRPs at $30K.

Cars got cheaper. The only reason you don't feel it is the trade barriers that stop BYD from flooding your local dealerships.

> charge 10x the price point > $200/month when cursor charges $20. start with more buffer before the bleeding begins.

What does this even mean? The cheapest Cursor plan is $20, just like Claude Code. And the most expensive Cursor plan is $200, just like Claude Code. So clearly they're at the exact same price point.

> switch from opus ($75/m tokens) to sonnet ($15/m) when things get heavy. optimize with haiku for reading. like aws autoscaling, but for brains.

> they almost certainly built this behavior directly into the model weights, which is a paradigm shift we’ll probably see a lot more of

"I don't know how Claude built their models and I have no insider knowledge, but I have very strong opinions."

> 3. offload processing to user machines

What?

> ten. billion. tokens. that's 12,500 copies of war and peace. in a month.

Unironically quoting data from the viberank leaderboard, which is just user-submitted numbers...

> it's that there is no flat subscription price that works in this new world.

The author doesn't know what throttling is...?

I've stopped reading here. I should've just closed the tab when I saw the first letter in each sentence isn't capitalized. This is so far the most glaring signal of slop. More than the overuse of em-dash and lists.

WA · 39m ago
All good points, but:

> when I saw the first letter in each sentence isn't capitalized. This is so far the most glaring signal of slop.

How so? It's the exact opposite imho. Lowercase everything with a staccato writing style to differentiate from AI slop, because LLMs usually don't write lowercase.

ankit219 · 1m ago
Likely OP does not mean AI slop, but rather a signal of human carelessness: that they could not be bothered to write it properly.
lelanthran · 4m ago
I think GP is drawing a distinction between "slop" and "AI slop".

This comes across as sloppily written, but not sloppily generated.

Semaphor · 4m ago
Human slop instead of AI. Our race is catching up to the machines again.
Havoc · 1h ago
The combination of "thinking models" plus the blind focus on incremental benchmarking gains was a mistake for practical use.

You definitely want that for some tasks, but for the majority of tasks there is a lot of space for cheap & cheerful (and non-thinking)

comrade1234 · 2h ago
I'm kind of curious what IntelliJ's deal is with the different providers. I usually just keep it set to Claude but there are others that you can pick. I don't pay extra for the AI assistant - it's part of my regular subscription. I don't think I use the AI features as heavily as many others, but it does feed my code base to whoever I'm set to...
louthy · 2h ago
Are you sure you don’t pay extra? I’m on Rider and it’s an additional cost. Unless us C# and F# devs are subsidising everyone else :D

Edit: It says on the Jetbrains website:

“The AI Assistant plugin is not bundled and is not enabled in IntelliJ IDEA by default. AI Assistant will not be active and will not have access to your code unless you install the plugin, acquire a JetBrains AI Service license and give your explicit consent to JetBrains AI Terms of Service and JetBrains AI Acceptable Use Policy while installing the plugin.”

double051 · 2h ago
If you pay for the all products subscription, their AI features are now bundled in. I believe that may be a relatively recent change, and I would not have known about it if I hadn't been curious and checked.
comrade1234 · 2h ago
When they first added the assistant it was $100/yr to enable it. However, it's now part of the subscription and they even reimbursed me a portion of the $100 that I paid.
terminalbraid · 1h ago
You're one of the lucky ones. They just outright stole from many of the people who did pay for it.
terminalbraid · 1h ago
Considering they didn't significantly change their pricing when they bundled the equivalent of a ~$10-20/mo subscription to their Ultimate pack (which I pay something around $180/year for), I'm guessing they're eating a lot of the cost out of desperation for an imagined problem. That or they were fleecing everyone from the beginning.
robertclaus · 1h ago
My team is debating this exact question for a new product we have in early access. Ultimately we realized the issue early on, so even our flat subscription option would include at-cost usage limits.
abtinf · 1h ago
Lack of proper capitalization makes the text unreadable for me.
blamestross · 1h ago
https://convertcase.net/browser-extension/

This extension might make the internet more accessible for you!

senko · 1h ago
Insisting on flouting English spelling rules (by not starting a sentence with a capital letter) in a think piece is a dead giveaway that the author thinks too highly of themselves, and results in me automatically discounting whatever they're saying.

If I (and billions others) can be bothered to learn your damn language so we can all communicate, do us a service and actually use it properly, FFS.


flyinglizard · 2h ago
The truth is we're brute forcing some problems via tremendous amount of compute. Especially for apps that use AI backends (rather than chats where you interface with the LLM directly), there needs to be hybridization. I haven't used Claude Code myself but I did a screenshare session with someone who does and I think I saw it running old fashioned keyword search on the codebase. That's much more effective than just pushing more and more raw data into the chat context.

On one of the systems I'm developing, I'm using LLMs to compile user intents to a DSL, without ever looking at the real data to be examined. There are ways; increased context length is bad for speed, cost, and scalability.

ath3nd · 2h ago
Mathematics are not relevant when we have hype and vibes. We can't have facts and projections and no path to profitability distract us from our final goal.

Which, of course, is to donate money to Sama so he can create AGI and be less lonely with his robotic girlfriend, I mean...change the world for the better somehow. /s

NitpickLawyer · 2h ago
I get your point but I think it's debatable. As long as the capabilities increase (and they have, IMO) cost isn't really relevant. If you can reasonably solve problems of a given difficulty (and we're starting to see that), then suddenly you can do stuff that you simply can't with humans. You can "hire" 100 agents / servers / API bundles, whatever and "solve" all tasks with difficulty x in your business. Then you cancel and your bottom is suddenly raised. You can't do that with humans. You can't suddenly hire 100 entry-level SWEs and fire them after 3 months.

Then you can think about automated labs. If things pan out, we can have the same thing in chemistry/bio/physics. Having automated labs definitely seems closer now than 2.5 years ago. Is cost relevant when you can have a lab test formulas 24/7/365? Is cost a blocker when you can have a cure to cancer_type_a? And then _b_c...etc?

Also, remember that costs go down within a few generations. There's no reason to think this will stop.

ysofunny · 1h ago
and the AIs stupider!

I am seeing problems with formatting that seemed 'solved' already.

I mean, I have seen "the same" model get better and worse already.

clearly somebody is calibrating the stupidity level relative to energy cost and monetary gain