'I paid for the whole GPU, I am going to use the whole GPU'

139 points | mooreds | 43 comments | 5/7/2025, 9:04:39 PM | modal.com ↗

Comments (43)

J_Shelby_J · 15h ago
I’m at the integration testing and benchmarking phase of a rust crate for allocating LLMs to GPUs and System RAM. The impetus is that single models are limited in what they can achieve and more complex workflows require LLM swarms. Think of a lot of smaller models doing reasoning steps or tasks like search and then a big model to summarize it all.

It allocates a range of quants for a model across N devices, using DFS to find the best allocation for the given set of models. "Best" here means the most tokens per second and the least time to initialize the allocation. I keep track of memory capacity, PCIe bandwidth, and link bandwidth (including NVLink).
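
A toy sketch of that kind of search, in Python for brevity (the crate itself is Rust); the structs, numbers, and scoring below are made up, not the actual implementation:

    from dataclasses import dataclass

    @dataclass
    class Device:
        name: str
        mem_gb: float      # VRAM capacity
        pcie_gbps: float   # host-to-device bandwidth, used for init-time estimate

    @dataclass
    class Quant:
        label: str
        size_gb: float     # weights footprint at this quantization
        tok_per_s: float   # rough throughput estimate at this quantization

    def dfs(models, devices, used, chosen, best):
        # models: list of (name, [Quant, ...]); chosen: list of (name, Quant, Device)
        if not models:
            tps = sum(q.tok_per_s for _, q, _ in chosen)
            init_s = sum(q.size_gb / d.pcie_gbps for _, q, d in chosen)
            key = (tps, -init_s)               # max throughput, then min init time
            if best is None or key > best[0]:
                best = (key, list(chosen))
            return best
        name, quants = models[0]
        for q in quants:                       # try each quantization level
            for d in devices:                  # try each device placement
                if used[d.name] + q.size_gb <= d.mem_gb:   # memory constraint
                    used[d.name] += q.size_gb
                    chosen.append((name, q, d))
                    best = dfs(models[1:], devices, used, chosen, best)
                    chosen.pop()
                    used[d.name] -= q.size_gb
        return best

    devices = [Device("gpu0", 24, 25), Device("gpu1", 24, 25)]
    models = [
        ("summarizer-70b", [Quant("q4", 40, 15), Quant("q2", 22, 10)]),
        ("worker-7b", [Quant("q8", 8, 60), Quant("q4", 4.5, 80)]),
    ]
    print(dfs(models, devices, {d.name: 0.0 for d in devices}, [], None))

An exhaustive DFS like this explodes combinatorially, so the real thing presumably prunes branches and models interconnect (NVLink) for multi-device splits.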

I intend to serve this behind an API using llama.cpp, so you can send a request to the API and it will fetch the model to fulfill the request, or create a new allocation to accommodate it. Sort of like llama-swap, but explicitly with the goal of enabling as many LLMs as you need to run on your hardware.

Anyways, I just bring this up because I'm curious if anyone else has done something like this? Or if it's a problem worth solving? My dream is to take it out of my bedroom server and run it on something like Modal.

mooreds · 17h ago
The subtitle (which is important but was too long for the HN submission) is "A high-level guide to GPU utilization".
alexjplant · 1h ago
OT: I'm not really sure what the author meant by

> Graphics Processing Units, or GPUs, are the hottest mathematical co-processor since the FM synthesis chips that shaped the sounds of the 1990s

since FM was more of an 80s thing. Even their linked comment says

> Throughout the 90s FM was old-hat. Nobody wanted to hear those woody clangy sounds of the 80s anymore.

FM synthesis has kept being a thing in specific applications ever since, but the zeitgeist of the 90s (and its modern postmodern retreads like vaporwave) is arguably digital sampling.

charles_irl · 16h ago
Oh, I wrote this! Thanks for sharing it.
freeqaz · 15h ago
Anything you feel is worth adding for the HN crowd while you've got our attention? :)

(Thanks for writing this btw!)

charles_irl · 14h ago
Hmm, hard to say!

In the few months since I originally wrote this, I've come to an even greater appreciation of just how hard it is to maximize utilization of the Tensor Cores. It's a lot more than just kernel parameter tuning and using a few parallel programming tricks (parallel reduce, unrolling). It really borks your CUDA code -- you need warp specialization, you need to break warp uniformity, you need to work with explicit asynchrony. Hoping to write about this for the Modal blog/GPU Glossary soon!

I also spent a bit of time working with ncu/"Nsight Compute". If I rewrote the article today, I'd probably include a bit about it in the section on how to improve your MFU. But tl;dr: use the profiling tool, Luke! And a good way to learn is to watch NVIDIA's GTC talks.
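
For reference, a back-of-envelope MFU estimate for decode looks something like this; the throughput and peak numbers below are placeholders rather than measurements:

    # MFU ~= achieved FLOP/s divided by the hardware's peak FLOP/s.
    # A decoder forward pass costs roughly 2 FLOPs per parameter per token.
    params = 7e9           # 7B-parameter model
    tokens_per_s = 1200    # observed aggregate decode throughput (batched)
    peak_tflops = 989      # e.g. H100 SXM dense BF16 Tensor Core peak

    achieved_tflops = 2 * params * tokens_per_s / 1e12
    print(f"achieved ~{achieved_tflops:.1f} TFLOP/s, MFU ~{achieved_tflops / peak_tflops:.1%}")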

That said, I've also noticed even more cases where GPU kernel utilization is well below target. I think (and Horace He has argued) that that comes in part from optimized GEMMs running so fast on Tensor Cores that host overhead becomes the bottleneck (classic Amdahl). This unfortunately means more host logic needs to be compiled -- either graph-compiled as in torch.compile or moved into a compiled language.
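
A minimal illustration of that host-overhead effect, assuming PyTorch 2.x and a CUDA device: a chain of tiny eager-mode ops is dominated by per-kernel launch overhead on a fast GPU, and torch.compile claws much of it back by fusing them:

    import time
    import torch

    def many_small_ops(x):
        # Each tiny op is a separate kernel launch in eager mode, so the
        # host-side dispatch cost dominates when the GPU is fast.
        for _ in range(200):
            x = torch.relu(x) * 1.01 + 0.01
        return x

    compiled = torch.compile(many_small_ops)   # fuses ops into fewer kernels

    def bench(fn, x, iters=50):
        fn(x)                                  # warmup (and compilation)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(x)
        torch.cuda.synchronize()
        return (time.perf_counter() - t0) / iters

    x = torch.randn(1024, 1024, device="cuda")
    print("eager   :", bench(many_small_ops, x))
    print("compiled:", bench(compiled, x))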

Mockapapella · 17h ago
This is a good article on the "fog of war" for GPU inference. Modal has been doing a great job of aggregating and disseminating info on how to think about high quality AI inference. Learned some fun stuff -- thanks for posting it.

> the majority of organizations achieve less than 70% GPU Allocation Utilization when running at peak demand — to say nothing of aggregate utilization. This is true even of sophisticated players, like the former Banana serverless GPU platform, which operated at an aggregate utilization of around 20%.

Saw this sort of thing at my last job. Was very frustrating pointing this out to people only for them to respond with ¯\_(ツ)_/¯. I posted a much less tactful article (read: rant) than the one by Modal, but I think it still touches on a lot of the little things you need to consider when deploying AI models: https://thelisowe.substack.com/p/you-suck-at-deploying-ai-mo...

charles_irl · 16h ago
Nice article! I had to restrain myself from ranting on our blog :)
semessier · 16h ago
Well, related: fractional GPUs that multiplex workloads to raise aggregate utilization have been a topic for some time, with no definitive (NVIDIA) solution: https://vhpc.org
charles_irl · 14h ago
We looked into this at Modal! We put out vGPUs but didn't see demand and our internal benchmarks for MPS and Green Contexts didn't indicate a big win.

The tricky thing here is that many GPU workloads saturate at least one of the resources on the GPU -- arithmetic throughput, memory bandwidth, thread slots, registers -- and so there's typically resource contention that leads to lowered throughput/increased latency for all parties.

And in a cloud (esp serverless/auto-scaling) computing context, the variety of GPU SKUs means you can often more easily right-size your workload onto whole replicas (on our platform, from one T4 up to 8 H100s per replica).

kristianpaul · 15h ago
I'm still trying to use all my CPUs...
r3tr0 · 12h ago
We spend a lot of time on getting these measurements with eBPF.

You can check us out at https://yeet.cx

Here's an overview of our GPU-specific solution:

https://yeet.cx/solutions/maximize-ai-infra-roi

drob518 · 15h ago
And we’re back to time-sharing.
charles_irl · 14h ago
When I'm feeling sassy, I like to tell people that Modal is "Enterprise Java Beans for AI".
dehrmann · 12h ago
Tomcat wanted to be some sort of compile once, run anywhere Docker.
esperent · 13h ago
What is time-sharing and why is being back to it a bad thing?
freeqaz · 15h ago
How fast are modern GPU boxes able to spin up these days? Loading a massive blob of weights into VRAM feels like it's gotta be the bottleneck even if server provisioning is fast.

Or am I naive and my knowledge is outdated? I am genuinely curious what people see and what providers are capable of in 2025.

tedivm · 15h ago
Once you have the models on local storage you can move pretty quickly from there to VRAM; I've never found that to be the biggest bottleneck. The problem is provisioning itself, especially if you have to actually move models locally. Some of this can be avoided with extremely expensive networking (InfiniBand to a NAS with the model weights), but that's not something you're going to have fun dealing with in a cloud environment.

It might help to remember that the training process is essentially a game of "how fast can we shove data into these GPUs", and having a GPU sit idle because the data can't get into it fast enough is a challenge people have been tackling since at least the P100 series. This has resulted in improvements on the GPUs as well as all the hardware around them. Getting data into the chips is one of the most efficient processes at this point.

freeqaz · 13h ago
How do serverless GPU cloud providers deal with that then? Do they go down the InfiniBand-to-NAS rabbit hole to build all of their infrastructure? Or do they just set up an NVMe RAID cache to hold the models locally? (Maybe an LRU? With system memory also being used?)

I imagine in the real world that model usage follows a Zipfian distribution, i.e., a small number of models (<10) represents 95% of the machines for inference workloads. And, for those machines, you can just load the weights off of your ~40 Gbit Ethernet connection since they're never cycling.

But for that last 5%, I feel like that's where it becomes important. If I'm running a weird, custom model and I want Lambda-like billing... what's the stack? Is the market big enough that people care? (And do most people just use LoRAs, which are much easier to hot-swap?)

Training I imagine is a totally different ballpark because you're constantly checkpointing, transferring data at each step, etc, versus inference. That's a world I know a lot less about though!
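
A sketch of the NVMe LRU-cache idea from the question above, with a byte budget and made-up model names; real providers' caching layers are presumably more involved:

    from collections import OrderedDict

    class ModelCache:
        """Keep recently used weights on local NVMe up to a capacity budget."""
        def __init__(self, capacity_gb: float):
            self.capacity = capacity_gb
            self.used = 0.0
            self.models = OrderedDict()   # name -> size_gb, oldest first

        def get(self, name: str, size_gb: float) -> str:
            if name in self.models:
                self.models.move_to_end(name)        # mark as recently used
                return f"{name}: hit, load from local NVMe"
            while self.models and self.used + size_gb > self.capacity:
                _, evicted_size = self.models.popitem(last=False)   # evict coldest
                self.used -= evicted_size
            self.models[name] = size_gb
            self.used += size_gb
            return f"{name}: miss, pull from object storage then cache"

    cache = ModelCache(capacity_gb=60)
    print(cache.get("llama-70b-q4", 40))
    print(cache.get("mistral-7b-fp16", 14))
    print(cache.get("llama-70b-q4", 40))     # hit this time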

mountainriver · 15h ago
There is some recent work by InferX to make model loading significantly faster. They're claiming to be able to load a 7B model in under 2 seconds.

I haven’t tried it but if that’s the case it’s a game changer

charles_irl · 14h ago
We've talked to them and there's some impressive technology there!
kllrnohj · 14h ago
PCIe 5.0 x16 is ~64 GB/s of bandwidth. Real world is never perfect, but it's not exactly a small pipe here.
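
Back-of-envelope, assuming the weights are already in page-locked host RAM and you sustain ~80% of the link's peak:

    PCIE5_X16_GBPS = 64     # ~64 GB/s theoretical for a PCIe 5.0 x16 link
    SUSTAINED = 0.8         # assume ~80% of peak in practice

    def host_to_vram_seconds(params_billion, bytes_per_param):
        size_gb = params_billion * bytes_per_param
        return size_gb / (PCIE5_X16_GBPS * SUSTAINED)

    for name, params, bpp in [("7B fp16", 7, 2), ("70B fp16", 70, 2), ("70B 4-bit", 70, 0.5)]:
        print(f"{name}: ~{host_to_vram_seconds(params, bpp):.1f} s")

So the last PCIe hop is seconds at worst; getting the bytes from disk or the network into host RAM is usually the slow part, as the comments above note.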
pavelstoev · 12h ago
GPU sharing is a concern for sensitive data. It is more appropriate to increase the utilization rate of GPU chip internals via a variety of low-level (CUDA and below) optimizations.
cph123 · 2h ago
Could SR-IOV VFs be a solution to that?
mwilcox · 17h ago
Understandable
cubefox · 17h ago
For anyone thinking this is about video games:

> We’ll specifically focus on neural network inference workloads

keybored · 17h ago
It’s hard to forget the neural network application these days.
awesome_dude · 16h ago
I'm old enough to remember when people would be concerned if their CPU usage went to 100%
twoodfin · 16h ago
You’d worry about 100% CPU because even if the OS is successfully optimizing for throughput (which Linux is very good at), latency/p99 is certain to suffer as spare cycles disappear.

That’s not a concern with typical GPU workloads, which are batch/throughput-oriented.

calaphos · 16h ago
There's still a throughput/latency tradeoff curve, at least for any sort of interactive models.

That's one of the reasons inference providers sell batch discounts.

awesome_dude · 16h ago
If you never reach 100% CPU usage, why did you buy an oversized CPU?
dehrmann · 13h ago
For low-latency applications, anything from your computer's UI to a webserver, you start to see latency increase quickly at around 80%. If it's a CPU-intensive job on your own computer, you can run the hog at a lower priority. If it's a webserver, you're forced to trade off throughput and latency.
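
That rule of thumb drops out of basic queueing theory: for an M/M/1 queue the mean time in system is service_time / (1 - utilization), so latency has already doubled by 50% utilization and climbs steeply past 80%:

    # Mean time in system for an M/M/1 queue: W = S / (1 - rho),
    # where S is the mean service time and rho the utilization.
    service_ms = 10.0    # hypothetical mean time to handle one request
    for util in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
        latency = service_ms / (1 - util)
        print(f"utilization {util:4.0%}: mean latency ~{latency:6.0f} ms")

Real CPU scheduling isn't literally M/M/1, but the shape of the curve is the same.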
frainfreeze · 14h ago
Surely you're referring to occasional full load, not 24/7 load? Or 100% usage on some but not all cores? A CPU at 100% usually means an unresponsive system, and things crashing.
johnklos · 13h ago
Your point is taken, but if things are crashing because your CPU is running at 100%, you either have an Intel CPU or you have other hardware problems. There should be no issue running a CPU at 100% 24/7/365 indefinitely.
awesome_dude · 13h ago
I mean, I'm glad you could see that; it appears several other people interpret things in only one direction: 100% 24/7.

And yes, people back in the day were actively concerned if their CPUs ever hit 100% - it never made sense then, it doesn't make sense now.

Dylan16807 · 12h ago
If the load meter ever reads "100%" then that means you were at 100% for long enough to cause problems. It's measuring a period much longer than a millisecond. It depends on use case how big those problems are, and whether you want to pay money to avoid them, but they exist even before you hit 100% and even if it's only briefly that high.

Peaking at 90% over the monitoring interval does not mean you can fit 10% more load without compromises. It does not mean your CPU is oversized.

Non-zero concern is correct.

awesome_dude · 12h ago
Bollocks.

My CPU meters read 100% every day during set up time.

Dylan16807 · 12h ago
And the compromise is that set up takes longer.
awesome_dude · 11h ago
I am trying not to spit my coffee out here.

You're seriously telling people that a CPU working at 90% is processing faster than a CPU working at 100%?

Context switching, memory access, and the ability of the problem to be computed in parallel - sure, but CPU usage as the defining metric - seriously stop smoking that stuff.

Dylan16807 · 11h ago
> You're seriously telling people that a CPU working at 90% is processing faster than a CPU working at 100%?

If your CPU sits at 100% for several seconds then your task is almost certainly bottlenecked by the CPU part of the time. Therefore, if you use a faster CPU your task will get done faster.

So if we keep everything else the same, same hardware, same workload, same CPU design, only changing the clock speed of the CPU, then the CPU that reads 90% must be a faster CPU. Therefore it will reduce the bottlenecking, and your task will get done faster.

For the CPU upgrade to not matter, you'd have to have a task that is never bottlenecked by the CPU. That is very unlikely to be the case if your CPU is reading 100% for several seconds.

Edit to reply:

> Ok, so your grand solution to this is - faster CPUs process faster than slower CPUs. Wow. Who would have thunk it.

You were the one saying that a fast CPU is "oversized". So I explained in detail why avoiding 100% does not make your CPU oversized.

Yes it's obvious, glad you agree now.

> 100% of a CPU is faster than 90% of that same CPU.

It has more throughput but for most types of software it now has worse latency.

If you care about latency, then you don't want to increase throughput by pegging your CPU at 100%. You want to increase throughput by getting more/better CPUs and keeping them at a lower percent.

bdangubic · 16h ago
back in those days you weren’t renting them :)
theandrewbailey · 5h ago
You're right, you were leasing them from IBM.
XorNot · 16h ago
Well, people are also pretty bad at logistical reasoning.

From a capital expenditure perspective, you are renting the CPU you bought in terms of opportunity cost.

What people have some sense of is that there's an ascribable value to having a capability in reserve versus discovering you don't have it when you need it.