Deploying DeepSeek on 96 H100 GPUs
116 GabrielBianconi 32 8/29/2025, 2:07:28 PM lmsys.org ↗
This throughput assumes 100% utilization. A bunch of things raise the cost at scale:
- There are no on-demand GPUs at this scale. You have to rent them on multi-year contracts, so you have to lock in enough GPUs for your maximum throughput (or some sufficiently high percentile), not your average throughput. Your peak throughput during west coast business hours is probably 2-3x higher than the throughput at tail hours (east coast morning, west coast evenings).
- GPUs are often regionally locked due to data processing issues + latency issues. Thus, it's difficult to utilize these GPUs overnight because Asia doesn't want their data sent to the US and the US doesn't want their data sent to Asia.
These two factors mean that GPU utilization comes in at 10-20%. Now, if you're a massive company that spends a lot of money on training new models, you could conceivably slot in RL inference or model training to happen in these off-peak hours, maximizing utilization.
But for those companies purely specializing in inference, I would _not_ assume that these 90% margins are real. I would guess that even when it seems "10x cheaper", you're only seeing margins of 50%.
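A rough sketch of the utilization argument above, in Python. The numbers are hypothetical units chosen for illustration (cost of 1 per token at 100% utilization, price of 10, i.e. a nominal 90% margin), and the function names are made up for this example. It assumes the GPU bill is fixed, so idle hours get charged to the tokens that are actually served.

```python
def effective_cost(cost_at_full_utilization: float, utilization: float) -> float:
    """Cost per token once idle GPU hours are charged to the tokens actually served."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return cost_at_full_utilization / utilization


def margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of the price."""
    return (price - cost) / price


# Hypothetical units: cost is 1 per token at 100% utilization,
# and the provider charges 10 (a nominal 90% margin).
price = 10.0
for utilization in (1.0, 0.2, 0.1):
    cost = effective_cost(1.0, utilization)
    print(f"utilization {utilization:.0%}: cost {cost:.1f}, margin {margin(price, cost):.0%}")
```

Under these assumptions, a nominal 90% margin drops to 50% at 20% utilization and disappears at 10%.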
A node of 8 H100s will run you $31.40/hr on AWS, so for all 96 you're looking at $376.80/hr. With 188 million input tokens/hr and 80 million output tokens/hr, that comes out to around $2/million input tokens, and $4.70/million output tokens.
This is actually a lot more than DeepSeek R1's rates of $0.10-$0.60/million input and $2/million output, but I'm sure major providers are not paying AWS p5 on-demand pricing.
Edit: those figures were per node, so the actual input and output prices would be divided by 12: $0.17/million input tokens and $0.39/million output tokens.
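The corrected arithmetic can be reproduced with a short Python sketch. It uses only the figures quoted in this comment and follows the same method, which charges the full hourly cluster cost to input tokens and to output tokens separately rather than splitting it between them. The constant and function names are made up for illustration.

```python
NODE_COST_PER_HOUR = 31.40       # USD for one 8xH100 node, the AWS figure quoted above
NODES = 12                       # 96 GPUs / 8 GPUs per node
INPUT_MTOK_PER_NODE_HOUR = 188   # million input tokens per node per hour
OUTPUT_MTOK_PER_NODE_HOUR = 80   # million output tokens per node per hour


def cost_per_million(cluster_cost_per_hour: float, mtok_per_node_hour: float, nodes: int) -> float:
    """USD per million tokens when the whole cluster bill is charged to one token type."""
    return cluster_cost_per_hour / (mtok_per_node_hour * nodes)


cluster_cost = NODE_COST_PER_HOUR * NODES
input_cost = cost_per_million(cluster_cost, INPUT_MTOK_PER_NODE_HOUR, NODES)
output_cost = cost_per_million(cluster_cost, OUTPUT_MTOK_PER_NODE_HOUR, NODES)

print(f"cluster: ${cluster_cost:.2f}/hr")
print(f"input:   ${input_cost:.2f} per million tokens")
print(f"output:  ${output_cost:.2f} per million tokens")
```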
An H100 costs about $32k; amortized over 3-5 years, that gives $1.21 to $0.70 per hour, so after adding in electricity, CPU/RAM, etc., runpod.io is running much closer to the actual cost than AWS.
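A minimal sketch of that amortization, assuming straight-line depreciation with the card running 24/7 and no resale value. It covers the purchase price only; electricity, CPU/RAM, and hosting are left out. The function name is made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # assumes the GPU is powered on around the clock


def amortized_hourly_cost(purchase_price: float, years: float) -> float:
    """Straight-line hardware cost per hour over the given lifetime."""
    return purchase_price / (years * HOURS_PER_YEAR)


for years in (3, 5):
    hourly = amortized_hourly_cost(32_000, years)
    print(f"{years} years: ${hourly:.2f}/hr")
```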
Reversing out these numbers tells us that they're paying about $2/H100/hour (or $16/hour for an 8xH100 node).
Disclaimer: one of my sites, https://www.serversearcher.com/servers/gpu, says that a one-month commit on an 8xH100 node goes for $12.91/hour. The "I'm buying the servers and putting them in colo" rate usually works out at around $10/hour, so there's scope here to reduce the cost by ~30% just by doing better/more committed purchasing.
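For a sense of where the ~30% lands, here is a small Python sketch comparing the two cheaper rates against the $16/hour baseline. It uses only the node prices quoted above; the helper name is made up for illustration.

```python
def savings(baseline_per_hour: float, alternative_per_hour: float) -> float:
    """Fractional cost reduction from switching to the alternative rate."""
    return 1 - alternative_per_hour / baseline_per_hour


BASELINE = 16.00  # USD/hour for an 8xH100 node, as reversed out above

for label, rate in (("one-month commit", 12.91), ("own servers in colo", 10.00)):
    print(f"{label}: ${rate:.2f}/hr, {savings(BASELINE, rate):.1%} cheaper")
```

The one-month commit comes out a bit under 20% cheaper and the colo rate a bit under 40%, so the ~30% estimate sits between the two.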
Inference is more profitable than I thought.
Is that just the cost of electricity, or does it include the cost of the GPUs spread out over their predicted lifetime?
Maybe the cost of renting?
That's silly, but the idea that "local" is not the opposite of remote is even sillier.
Lots of people were advocating for running their k8s on bare metal servers to maximize the performance of their containers.
Now whether that applies to your conversation... I've no clue, too little context ( 。 ŏ ﹏ ŏ )
Bare metal in the context of running software is a technical term with a clear meaning that hasn't become contested like "AI" or "Crypto" - and that meaning is that the software is running directly on the hardware.
As k8s isn't virtualization, processes spawned by its orchestrator are still running on bare metal. It's the whole reason why containers are more efficient than virtual machines.
https://en.m.wikipedia.org/wiki/Bare-metal_server
Is a Java app running on Linux bare metal?
Just in case you have $3-4M lying around somewhere for some high quality inference. :)
SGLang quotes a 2.5-3.4x speedup as compared to the H100s. They also note that more optimizations are coming, but they haven't yet published a part 2 on the blog post.
These are more like really gorgeous corporate swag than FOSS.