
- There are no on-demand GPUs at this scale. You have to rent them on multi-year contracts. So you have to lock in some number of GPUs for your maximum throughput (or some sufficiently high percentile), not your average throughput. Your peak throughput at west coast business hours is probably 2-3x higher than the throughput at tail hours (east coast morning, west coast evenings).

- GPUs are often regionally locked due to data processing issues + latency issues. Thus, it's difficult to utilize these GPUs overnight because Asia doesn't want their data sent to the US and the US doesn't want their data sent to Asia.

These two factors mean that GPU utilization comes in at 10-20%. Now, if you're a massive company that spends a lot of money on training new models, you could conceivably slot in RL inference or model training to happen in these off-peak hours, maximizing utilization.

But for those companies purely specializing in inference, I would _not_ assume that these 90% margins are real. I would guess that even when it seems "10x cheaper", you're only seeing margins of 50%.
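To make the utilization argument concrete, here is a rough sketch. The numbers are assumptions for illustration only: a cost of $0.20 per million output tokens at full utilization (the figure quoted from the post elsewhere in this thread) and a hypothetical sale price of $2.00 per million. The function names are made up for this example.

```python
def effective_cost(cost_at_full_utilization: float, utilization: float) -> float:
    """Cost per million tokens once idle GPU time is paid for as well."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return cost_at_full_utilization / utilization


def margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of the sale price."""
    return 1 - cost / price


FULL_UTIL_COST = 0.20  # USD per million output tokens (assumed, from the post)
SALE_PRICE = 2.00      # USD per million output tokens (hypothetical)

for utilization in (1.0, 0.2, 0.1):
    cost = effective_cost(FULL_UTIL_COST, utilization)
    print(f"utilization {utilization:.0%}: cost ${cost:.2f}/M, "
          f"margin {margin(SALE_PRICE, cost):.0%}")
```

Under these assumed numbers, a 90% margin at full utilization drops to 50% at 20% utilization and to zero at 10%.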

caminanteblanco · 2h ago

There was some tangentially related discussion in this post: https://news.ycombinator.com/item?id=45050415, but this cost analysis answers so many questions and gives me a better idea of how large a margin on inference many of these providers could be taking. Plus I'm sure that Google or OpenAI can get more favorable data center rates than the average Joe Schmoe.

A node of 8 H100s will run you $31.40/hr on AWS, so for all 96 you're looking at $376.80/hr. With 188 million input tokens/hr and 80 million output tokens/hr, that comes out to around $2/million input tokens, and $4.70/million output tokens.

This is actually a lot more than Deepseek r1's rates of $0.10-$0.60/million input and $2/million output, but I'm sure major providers are not paying AWS p5 on-demand pricing.

Edit: those figures were per node, so the actual input and output prices would be divided by 12: $0.17/million input tokens and $0.39/million output tokens.
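For anyone who wants to check the per-node arithmetic, here is a minimal sketch using only the figures from this comment. Like the comment, it charges the full cluster cost against input and output tokens separately instead of splitting it between them. The helper name is made up.

```python
def cost_per_million(hourly_cost_usd: float, tokens_per_hour: float) -> float:
    """USD per one million tokens at a given hourly cost and throughput."""
    return hourly_cost_usd / (tokens_per_hour / 1e6)


NODE_PRICE = 31.40       # USD/hr for one 8xH100 node (AWS figure quoted above)
NODES = 12               # 96 GPUs / 8 GPUs per node
INPUT_PER_NODE = 188e6   # input tokens per hour, per node
OUTPUT_PER_NODE = 80e6   # output tokens per hour, per node

cluster_price = NODE_PRICE * NODES
input_cost = cost_per_million(cluster_price, INPUT_PER_NODE * NODES)
output_cost = cost_per_million(cluster_price, OUTPUT_PER_NODE * NODES)

print(f"cluster: ${cluster_price:.2f}/hr")
print(f"input:   ${input_cost:.2f}/M tokens")
print(f"output:  ${output_cost:.2f}/M tokens")
```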

zipy124 · 2h ago

AWS is absolutely not cheap, and never has been. You want to look for the Hetzner of the GPU world, like runpod.io, where they are $2 an hour, so $16/hr for 8; that's already half of AWS. You can almost certainly also get a volume discount if you're looking for 96.

An H100 costs about $32k, which amortized over 3-5 years gives roughly $1.22 to $0.73 per hour, so after adding in electricity costs, CPU/RAM, etc., runpod.io is running much closer to the actual cost than AWS.
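A quick sketch of that amortization, assuming the card runs 24/7 and ignoring electricity, hosting, and resale value (the function name is just for illustration):

```python
HOURS_PER_YEAR = 24 * 365


def amortized_hourly(purchase_price_usd: float, years: float) -> float:
    """Straight-line hardware cost per hour over the given lifetime."""
    return purchase_price_usd / (years * HOURS_PER_YEAR)


for years in (3, 5):
    print(f"{years} years: ${amortized_hourly(32_000, years):.2f}/hr")
```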

matt-p · 2h ago

188M input / 80M output tokens per hour was per node I thought?

Reversing out these numbers tells us that they're paying about $2/H100/hour (or $16/hour for an 8xH100 node).

Disclaimer (one of my sites): https://www.serversearcher.com/servers/gpu says that a one-month commit on an 8xH100 node goes for $12.91/hour. The "I'm buying the servers and putting them in colo" rate usually works out at around $10/hour, so there's scope here to reduce the cost by ~30% just by doing better/more committed purchasing.

caminanteblanco · 2h ago

You were definitely right, I updated the original comment. Thanks for your correction!

caminanteblanco · 2h ago

Ok, so the authors apparently used Atlas Cloud hosting, which charges $1.80 per H100/hr. That would change the overall cost to around $0.08/million input and $0.18/million output, which seems much more in line with massive inference margins for major providers.

paxys · 2h ago

According to the post their costs were $0.20/1M output tokens (on cloud GPUs), so your numbers are off somewhere.

arnaudsm · 2h ago

Interestingly, this is 10x cheaper than the cheapest provider on OpenRouter: https://openrouter.ai/deepseek/deepseek-r1?sort=price

Inference is more profitable than I thought.

34679 · 3h ago

"By deploying this implementation locally, it translates to a cost of $0.20/1M output tokens"

Is that just the cost of electricity, or does it include the cost of the GPUs spread out over their predicted lifetime?

zipy124 · 1h ago

This is all costs included. That's 22k tokens per second per node, so per 8 H100s. With 12 nodes they get 264k tokens per second, or 950 million an hour. This gets you to roughly $0.2021 per million at $2 an hour for an H100, which is what they go for on services such as runpod.io (cheaper if not paying spot-price + volume discounts).
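The same calculation as a small sketch, using the throughput and GPU prices mentioned in this thread. The function name is made up, and the result is only as good as those quoted figures.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, gpus: int,
                            tokens_per_second: float) -> float:
    """USD per million tokens for a cluster at a steady token rate."""
    hourly_cost = gpu_price_per_hour * gpus
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1e6


PER_NODE_TPS = 22_000               # tokens per second per 8xH100 node
NODES = 12
cluster_tps = PER_NODE_TPS * NODES  # 264,000 tokens per second

for price in (2.00, 1.80):          # USD per H100-hour quoted in the thread
    cost = cost_per_million_tokens(price, 96, cluster_tps)
    print(f"${price:.2f}/GPU-hr -> ${cost:.3f} per million tokens")
```

At $2.00 per GPU-hour this lands at about $0.20 per million tokens, and at $1.80 about $0.18.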

dragonslayer56 · 3h ago

“Our implementation, shown in the figure above, runs on 12 nodes in the Atlas Cloud, each equipped with 8 H100 GPUs.”

Maybe the cost of renting?

34679 · 3h ago

I'm confused because I wouldn't consider a cloud implementation to be local.

randomjoe2 · 2h ago

Local doesn't refer to "on metal" anymore to many people

mwcz · 2h ago

"On metal" is muddied too. I've heard people refer to web apps running in an OCI container as being "bare metal" deployment, as opposed to AWS or whatever hosting platform.

That's silly, but the idea that "local" is not the opposite of remote is even sillier.

ffsm8 · 2h ago

You can run an OCI container on bare metal though. It doesn't stop being run on bare metal just because you're running in kernel namespaces, aka a Docker container.

Lots of people were advocating for running their k8s on bare metal servers to maximize the performance of their containers

Now whether that applies to your conversation... I've no clue, too little context ( ｡ ŏ ﹏ ŏ )

okasaki · 1h ago

In my opinion, if you're running k8s on bare metal, that's "k8s on bare metal" but still "<your app> on kubernetes", not "<your app> on bare metal".

ffsm8 · 32m ago

Sorry, but then your opinion is just plain wrong

Bare metal in the context of running software is a technical term with a clear meaning that hasn't become contested like "AI" or "Crypto" - and that meaning is that the software is running directly on the hardware.

As k8s isn't virtualization, processes spawned by its orchestrator are still running on bare metal. It's the whole reason why containers are more efficient compared to virtual machines

bee_rider · 26m ago

Bare metal as in, no operating system? Does Linux really get in the way of these LLM inference engines?

ffsm8 · 20m ago

No, as I said in my previous comment: bare metal as in not a virtual machine

https://en.m.wikipedia.org/wiki/Bare-metal_server

dtech · 2h ago

If you define bare metal as not being under a VM, it fits. OCI on Linux is cgroups, so that counts as not a VM, I'd say. Or at least it's a layer closer to the metal than a typical VM running OCI images.

Is a Java app running on Linux bare metal?

bee_rider · 31m ago

Local doesn’t need to be “on metal,” but I’m still confused as to what they are saying. Are they running some local cloud system?

monsieurbanana · 2h ago

I missed that train

vFunct · 2h ago

My basement server is really confused by all this...

DSingularity · 2h ago

I guess local for him is independent/private.

ollybee · 2h ago

H100s can be $2 an hour, so $192 an hour for the full cluster. They report 22k tokens per second, so ~80 million an hour; that's $16 an hour at $0.2 per million. Maybe a bit more for input tokens, but it seems a long way off.

zipy124 · 2h ago

I think you misread. That's 22k tokens per second per node, so per 8 H100s. With 12 nodes they get 264k tokens per second, or 950 million an hour. This gets you to roughly $0.2021 per million at $2 an hour.

s46dxc5r7tv8 · 2h ago

Separation of the prefill and decoding layers with SGLang is quite nifty! Normally 8xH100 would barely be able to hold the 4-bit quantization of the model without even considering the KV cache. One prefill node for 3 decode nodes is also fascinating, nice writeup.

ozgune · 1h ago

The SGLang Team has a follow-up blog post that talks about DeepSeek inference performance on GB200 NVL72: https://lmsys.org/blog/2025-06-16-gb200-part-1/

Just in case you have $3-4M lying around somewhere for some high quality inference. :)

SGLang quotes a 2.5-3.4x speedup as compared to the H100s. They also note that more optimizations are coming, but they haven't yet published a part 2 on the blog post.

abdellah123 · 3h ago

Wow, please edit the title to include Open-source!

numpad0 · 2h ago

These open models are just commercial binary distributions made available at zero cost with the intention of crippling opportunities for Western LLM providers to capitalize on investments.

These are more like really gorgeous corporate swag than FOSS.

Blahah · 2h ago

Why? Open source isn't in the original title

SV_BubbleTime · 2h ago

Also “open source” I feel covers for “open weights” which is not the same thing.

Deploying DeepSeek on 96 H100 GPUs

Comments (32)