%CPU utilization is a lie

66 points | BrendanLong | 9/3/2025, 12:01:02 AM | brendanlong.com

Comments (31)

ot · 1h ago
Utilization is not a lie; it is a measurement of a well-defined quantity. But people make assumptions to extrapolate capacity models from it, and that is where reality diverges from expectations.

Hyperthreading (SMT) and Turbo (clock scaling) are only some of the variables causing non-linearity; there are a number of other resources that are shared across cores and "run out" as load increases, like memory bandwidth, interconnect capacity, and processor caches. Some bottlenecks might even come from the software, like spinlocks, which have a non-linear impact on utilization.

Furthermore, most CPU utilization metrics average over very long windows, from several seconds to a minute, but what really matters for the performance of a latency-sensitive server happens on the time scale of tens to hundreds of milliseconds, and a multi-second average will not distinguish bursty behavior from smooth behavior. The latter likely has much more capacity to scale up.
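As a rough illustration (a synthetic sketch, all numbers made up), two traces can share the same one-minute average while having completely different headroom:

    import statistics

    SAMPLES = 600                          # 600 x 100 ms = one minute
    smooth = [0.5] * SAMPLES               # steady 50% busy in every window
    bursty = [1.0, 0.0] * (SAMPLES // 2)   # alternating saturated / idle windows

    for name, trace in (("smooth", smooth), ("bursty", bursty)):
        print(name,
              "1-min avg:", round(statistics.mean(trace), 2),
              "max 100 ms window:", max(trace))

Both report 50% on the one-minute average, but the bursty trace is already pinning the CPU in individual 100 ms windows.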

Unfortunately, the suggested approach is not that accurate either, because it hinges on two inherently unstable concepts:

> Benchmark how much work your server can do before having errors or unacceptable latency.

The measurement of this is extremely noisy, as you want to detect the point where the server starts becoming unstable. Even if you look at a very simple queueing theory model, the derivatives close to saturation explode, so any nondeterministic noise is extremely amplified.

> Report how much work your server is currently doing.

There is rarely a stable definition of "work". Is it RPS? Request cost can vary even throughout the day. Is it instructions? Same, the typical IPC can vary.

Ultimately, the confidence intervals you get from the load testing approach might be as large as what you can get from building an empirical model from utilization measurement, as long as you measure your utilization correctly.
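To put numbers on how explosive the measurement near saturation is, here is a toy M/M/1 sketch (the service rate is arbitrary), using the mean time in system W = 1 / (mu - lambda):

    MU = 1000.0  # requests/second the server can complete

    for rho in (0.60, 0.80, 0.90, 0.95, 0.99):
        lam = rho * MU
        wait = 1.0 / (MU - lam)                # mean time in system, seconds
        wait_bumped = 1.0 / (MU - lam * 1.01)  # same, with 1% more offered load
        print(f"rho={rho:.2f}  wait={wait * 1e3:7.1f} ms  "
              f"at +1% load={wait_bumped * 1e3:7.1f} ms")

At 60% load a 1% wiggle in offered load barely moves the latency; at 99% the same wiggle blows it up by two orders of magnitude, which is exactly why "find the point where it falls over" benchmarks are so noisy.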

PaulKeeble · 43m ago
This is bang on: you can't count the hyperthreads as double the performance. In practice they typically only bring 15-30% more throughput, if the job works well with them, and their use will double the latency. Failing to account for the loss in clock speed as core utilisation climbs is another way it's not linear, and in modern desktop software it's really something to pay careful attention to.

It should be possible, from the information the OS exposes about a CPU, to better estimate utilisation by accounting for at least these two factors. It becomes a bit trickier to account for significantly exceeding the cache or the available memory bandwidth, and for the drop in performance to existing threads caused by the increased pipeline stalls. But it can definitely be done better than it is currently.
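A minimal sketch of that kind of correction (the SMT yield and clock numbers here are assumptions, not measurements): count the second thread on a busy core as only a fraction of a core, and scale by the current clock versus the sustained all-core clock.

    def effective_utilisation(busy_threads_per_core, current_ghz, allcore_ghz,
                              smt_yield=0.25):
        # First busy thread on a core counts as a full core; a second busy
        # thread only adds the assumed 15-30% SMT gain.
        work = sum(min(b, 1) + smt_yield * max(b - 1, 0)
                   for b in busy_threads_per_core)
        capacity = len(busy_threads_per_core) * (1 + smt_yield)
        # Work done now happens at current_ghz, but sustained full load would
        # run at the lower all-core clock, so scale the ratio accordingly.
        return (work / capacity) * (current_ghz / allcore_ghz)

    # Example: 12 cores each running one busy thread while boosting.
    print(effective_utilisation([1] * 12, current_ghz=4.5, allcore_ghz=3.7))
    # ~0.97, versus the ~50% that a naive 24-thread utilisation metric reports.

It ignores memory bandwidth and cache pressure entirely, but it already gives a much less optimistic picture than the stock metric.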

judge123 · 1h ago
This hits so close to home. I once tried to explain to a manager that a server at 60% utilization had zero room left, and they looked at me like I had two heads. I wish I had this article back then!
hinkley · 57m ago
You also want to hit him with queueing theory.

Up to a hair over 60% utilization, the queueing delays on any work queue remain essentially negligible. At 70% they become noticeable, and at 80% they've doubled. And then it just turns into a shitshow from there on.

The rule of thumb is 60% is zero, and 80% is the inflection point where delays go exponential.
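The textbook M/M/1 queue backs those numbers up (a sketch with an arbitrary 10 ms service time; mean queueing delay is rho / (mu - lambda)):

    SERVICE_MS = 10.0
    mu = 1.0 / SERVICE_MS                  # completions per millisecond

    for rho in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
        delay = rho / (mu * (1 - rho))     # mean wait in queue, ms
        print(f"{rho:.0%} utilization -> {delay:6.1f} ms average queueing delay")

That prints 10, 15, 23, 40, 90 and 190 ms: it creeps up through the 60s and then roughly doubles with each step past 70%.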

On the biggest cluster I ran, we hit about 65% CPU at our target P95 time, which is pretty much right on the theoretical mark.

BrendanLong · 47m ago
A big part of this is that CPU utilization metrics are frequently averaged over a long period of time (like a minute), but if your SLO is 100 ms, what you care about is whether there's any ~100 ms period where CPU utilization is at 100%. Measuring p99 (or even p100) CPU utilization can make this a lot more visible.
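A minimal Linux-only sketch of that kind of measurement (it assumes the /proc/stat layout): sample the overall busy fraction every 100 ms for a minute, then compare the average against the p99.

    import time

    def cpu_times():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        return fields[3] + fields[4], sum(fields)   # (idle + iowait, total)

    samples = []
    prev_idle, prev_total = cpu_times()
    for _ in range(600):                            # one minute of 100 ms windows
        time.sleep(0.1)
        idle, total = cpu_times()
        samples.append(1.0 - (idle - prev_idle) / max(total - prev_total, 1))
        prev_idle, prev_total = idle, total

    samples.sort()
    print("avg:", sum(samples) / len(samples))
    print("p99:", samples[int(0.99 * len(samples))])

On a bursty box the p99 can sit at 1.0 while the average still looks comfortable.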
tgma · 1h ago
The way they refer to cores in their system is confusing and non-standard. The author talks about a 5900X as a 24-core machine and discusses it as if there are 24 cores, 12 of which are piggybacking on the other 12. In reality, there are 24 hyperthreads, pretty much pairwise symmetric, executing on top of 12 cores, with two sets of instruction pipelines sharing the same underlying functional units.
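You can see the pairing directly in sysfs on Linux (a quick sketch; the exact sibling numbering depends on how the kernel enumerates the CPUs):

    import glob

    siblings = set()
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
        with open(path) as f:
            siblings.add(f.read().strip())
    print(sorted(siblings))

On a 5900X this prints 12 entries (e.g. '0-1', '2-3', ... or '0,12', '1,13', ...), one per physical core, even though the OS otherwise presents 24 logical CPUs.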
saghm · 54m ago
Years ago, when trying to explain hyper threading to my brother, who doesn't have any specialized technical knowledge, he came up with the analogy that it's like 2-ply toilet paper. You don't quite have 24 distinct things, but you have 12 that are roughly twice as useful as the individual ones, although you can't really separate them and expect them to work right.
nayuki · 43m ago
Nah, it's easier than that. Putting two chefs in the same kitchen doesn't let you cook twice the amount of food in the same amount of time, because sometimes the two chefs need to use the same resource at the same time - e.g. sink, counter space, oven. But, the additional chef does improve the utilization of the kitchen equipment, leaving fewer things unused.
sroussey · 34m ago
It will be interesting when (if?) Intel ships software-defined cores, which are the logical inverse of hyperthreading.

Instead of having a big core with two instruction pipelines sharing big ALUs etc, they have two (or more) cores that combine resources and become one core.

Almost the same, yet quite different.

https://patents.google.com/patent/EP4579444A1/en

tgma · 6m ago
There was the dreaded AMD FX chip which was advertised as 8 core, but shared functional units. Got sued, etc.
BrendanLong · 38m ago
Thanks for the feedback. I think you're right, so I changed a bunch of references and updated the description of the processor to 12 core / 24 thread. In some cases, I still think "cores" is the right terminology though, since my OS (confusingly) reports utilization as-if I had 24 cores.
sroussey · 16m ago
Eh, what’s a thread really? It’s a term for us humans.

The difference between two threads and one core or two cores with shared resources?

Nothing is really all that neat and clean.

It's more of a two-level NUMA-type architecture, with 2 sets of 6 SMP sets of 2.

The scheduler may look at it that way (depending), but to the end user? Or even to most of the system? Nah.

hinkley · 1h ago
How many times has hyperthreading been an actual performance benefit in processors? I cannot count how many times an article has come out saying you'll get better performance out of your <insert processor here> by turning off hyperthreading in the BIOS.

It's gotta be at least 2 out of every 3 chip generations going back to the original implementation, where you're better off without it than with.

toast0 · 17m ago
Going from 1 core to 2 hyperthreads was a big bonus in interactivity. But I think it was easy to get early systems to show worse throughput.

I think there are two kinds of loads where hyperthreads are more likely to hurt than help. If you've got a tight loop that uses all the processor's execution resources, you're not gaining anything by splitting it in two; it just makes things harder. Or if your load is mostly bound by memory bandwidth without a lot of compute, having more threads probably means you're that much more oversubscribed on I/O and caching.

But a lot of loads are grab some stuff from memory and then do some compute, rinse and repeat. There's a lot of potential for idle time while waiting on a load, being able to run something else during that time makes a lot of sense.

It's worth checking how your load performs with hyperthreads off, but I think default on is probably the right choice.

sroussey · 7m ago
Definitely measure both ways and decide.

For many years (still?) it was faster to run your database with hyper threading turned off and your app server with it turned on.

esseph · 10m ago
Intel vs AMD, you'll get a different answer on the hyperthreading question.

https://www.tomshardware.com/pc-components/cpus/zen-4-smt-fo...

twoodfin · 1h ago
For whatever it’s worth, operational database systems (many users/connections, unpredictable access patterns) are beneficiaries of modern hyperthreading.

I’m familiar with one such system where the throughput benefit is ~15%, which is a big deal for a BIOS flag.

IBM’s POWER would have been discontinued a decade ago were it not for transactional database systems, and that architecture is heavily invested in SMT, up to 8-way(!)

tom_ · 23m ago
Why do they need so many threads? This really feels like they just designed the cpu poorly, in that it can't extract enough parallelism out of the instruction stream already.

(Intel and AMD stopped at 2! Apparently more wasn't worth it for them. Presumably because the cpu was doing enough of the right thing already.)

jiggawatts · 46m ago
I've noticed an overreliance on throughput as measured during 100% load as the performance metric, which has resulted in hardware vendors "optimising to the test" at the expense of other, arguably more important metrics. For example: single-user latency when the server is just 50% loaded.
twoodfin · 31m ago
That’s more than fair.

In the system I’m most familiar with, however, the benefits of hyperthreading for throughput extend to the 50-70% utilization band where p99 latency is not stressed.

duped · 58m ago
For me today it's definitely a pessimization, because I have enough well-meaning applications that spawn `nproc` worker threads. That would be fine if they were the only process running, but they're not.
hinkley · 12m ago
I wrote a little tool for our services that could evaluate basic expressions based on nproc, configured via an environment variable at startup time.

You could do one thread for every two cores, three threads for every two cores, one thread per core ± 1, or both (2n + 1).

Unfortunately the sweet spot based on our memory usage always came out to 1:1, except for a while when we had a memory leak that was surprisingly hard to fix, and we ran n - 1 for about 4 months while a bunch of work and exploratory testing were done. We had to tune in other places to maximize throughput.
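A tiny sketch of that kind of knob (the WORKER_THREADS variable name is made up, and os.sched_getaffinity is Linux-specific): read an expression like "n/2", "n-1" or "2*n+1" from the environment and evaluate it against the core count, instead of blindly spawning nproc workers.

    import os, re

    def worker_count(default="n"):
        n = len(os.sched_getaffinity(0))   # cores this process may actually use
        expr = os.environ.get("WORKER_THREADS", default)
        if not re.fullmatch(r"[n0-9+\-*/() ]+", expr):
            raise ValueError(f"bad WORKER_THREADS expression: {expr!r}")
        return max(1, int(eval(expr, {"__builtins__": {}}, {"n": n})))

    # WORKER_THREADS="n/2+1" on a 24-thread box -> 13 workers.
    print(worker_count())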

toast0 · 24m ago
Wouldn't that be about the same badness without hyperthreads? If you're oversubscribed, there might be some benefit to having fewer tasks, but maybe you get some good throughput with two different applications' threads running on opposite hyperthreads.
hinkley · 4m ago
Oversubscribing also leads to process migration, which these days leads to memory read delays.
tom_ · 32m ago
Total throughput has always seemed better with it switched on for me, even for stuff that isn't hyperthreading-friendly. You get a free 10% at least.
tgma · 1h ago
It has a lot to do with your workload, as much as if not more so than with the chip architecture.

The primary trade-off is the cache utilization when executing two sets of instruction streams.

hinkley · 1h ago
That's likely the primary factor, but then there's thermal throttling as well. You can't run all of the logic units flat out on a bunch of models of CPU.
tgma · 1h ago
That may be true for FMA or AVX2 or similar stuff; outside the vector units it sounds implausible. Obviously multi-core thermal throttling is a thing, but that would dominate by far. Hyperthreading should have minimal impact there.
gruez · 1h ago
>but then there's thermal throttling as well. You can't run all of the logic units flat out on a bunch of models of CPU.

That doesn't make any sense. Disabling SMT likely saves a negligible amount of power, but it gives up any performance to be gained from the other thread. If there's thermal budget available, it's better to spend it by shoving more work onto the second thread than to leave it disabled. If anything, due to voltage/frequency curves, it might even be better to run your CPU at lower clocks with SMT enabled to make up for it (assuming your workload is amenable), than to run with SMT disabled.

BrendanLong · 56m ago
To be fair, in most of these tests hyperthreading did provide a significant benefit (in the general CPU stress test, the hyperthreads increased performance by ~66%). It's just confusing that utilization metrics treat hyperthread usage the same as full physical cores.
FpUser · 57m ago
In the old days, it made the difference between my multimedia, game-like application not working at all with hyperthreading off and working just fine with it on.