Performance Debugging with LLVM-mca: Simulating the CPU


Comments (5)

pornel · 2h ago
The tool has great potential, but I've always found it too limited, fiddly, or imprecise when I needed to optimize some code.

It only supports consecutive instructions in the innermost loop. It can't include setup/teardown cost, nor even skip over it, so I can't feed it a function as-is (even a tiny one); I have to manually cut out the loop body.
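
To make that workflow concrete, here is a minimal sketch (the kernel, CPU, and flags are made up for illustration): compile to assembly and mark the region of interest with llvm-mca's region markers, which can be emitted from C via inline asm.

    // Sketch: analyze just the loop using llvm-mca's region markers.
    // Build and analyze with something like (flags illustrative):
    //   clang -O3 -S kernel.c -o - | llvm-mca -mcpu=skylake -timeline
    // Note: the inline-asm markers can themselves perturb codegen slightly.
    void scale(float *restrict dst, const float *restrict src, int n, float k) {
        __asm volatile("# LLVM-MCA-BEGIN scale");
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;
        __asm volatile("# LLVM-MCA-END");
    }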

It doesn't support branches at all. I know it's a very hard problem, but that's the problem I have. Quite often I'd like to compare branchless vs branchy versions of an algorithm. I have to manually remove branches that I think are predictable and hope that doesn't alter the analysis.
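
As a hypothetical example of the kind of pair that's awkward to compare this way, consider a branchy accumulation versus a branchless one that the compiler can lower to a conditional move:

    #include <stdint.h>

    // Branchy version: fast when the condition is predictable, but llvm-mca
    // doesn't model the branch's prediction behavior at all.
    int64_t sum_positive_branchy(const int32_t *v, int n) {
        int64_t s = 0;
        for (int i = 0; i < n; i++)
            if (v[i] > 0)
                s += v[i];
        return s;
    }

    // Branchless version: the select typically becomes a conditional move,
    // which llvm-mca can treat as part of a straight-line sequence.
    int64_t sum_positive_branchless(const int32_t *v, int n) {
        int64_t s = 0;
        for (int i = 0; i < n; i++)
            s += (v[i] > 0) ? v[i] : 0;
        return s;
    }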

It's not designed to compare different versions of code, so I need to manually rescale the metrics to compare them (different versions of the loop can be unrolled a different number of times, process a different number of elements per iteration, etc.).
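
The rescaling itself is just arithmetic on the report's totals; a sketch with made-up numbers (not real measurements):

    #include <stdio.h>

    int main(void) {
        // Made-up "Total Cycles" / "Iterations" from two llvm-mca runs:
        // version A is unrolled 4x (4 elements/iteration), version B 8x.
        double cycles_a = 620.0, cycles_b = 980.0;
        double iterations = 100.0;
        double elems_a = 4.0, elems_b = 8.0;

        // Normalize to cycles per element so the two loops are comparable.
        printf("A: %.2f cycles/element\n", cycles_a / (iterations * elems_a));
        printf("B: %.2f cycles/element\n", cycles_b / (iterations * elems_b));
        return 0;
    }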

Overall that's laborious, and doesn't work well when I want to tweak the high-level C or Rust code to get the best-optimizing version.

camel-cdr · 3h ago
One thing to keep in mind with llvm-mca is that not all processors use their own scheduling model, and the different scheduling models vary in accuracy.

E.g. the Cortex-A72 uses the Cortex-A57 model, as do the Cortex-A76 and even the Cortex-A78.

The Neoverse V1 model has an issue width of 15, while the Neoverse V2 model (which the V3 also uses) has an issue width of 6.

MobiusHorizons · 2h ago
Are you saying the model used to simulate many different CPUs is the same, which makes comparing CPUs harder? Or are you saying the model is not accurate?

It’s an interesting point that the newer Neoverse cores use a model with a smaller issue width. Are you saying this doesn’t match reality? If so, do you have any idea why they model it that way?

camel-cdr · 1h ago
> Are you saying the model used to simulate many different CPUs is the same, which makes comparing CPUs harder? Or are you saying the model is not accurate?

Both, but mostly the former. You can view the scheduling models used for a given CPU here: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...

    * CortexA53Model used for: A34, A35, A320, a53, a65, a65ae
    * CortexA55Model used for: A55, r82, r82ae
    * CortexA510Model used for: a510, a520, a520ae
    * CortexA57Model used for: A57, A72, A73, A75, A76, A76ae, A77, A78, A78ae, A78c
    * NeoverseN2Model used for: a710, a715, a720, a720ae, neoverse-n2
    * NeoverseV1Model used for: X1, X1c, neoverse-v1/512tvb
    * NeoverseV2Model used for: X2, X3, X4, X925, grace, neoverse-v2/3/v3ae
    * NeoverseN3Model used for: neoverse-n3

It's even worse for Apple CPUs: all of them, from apple-a7 to apple-m4, use the same "CycloneModel", which models a 6-issue out-of-order core from 2013.

There are more fine-grained target-specific feature flags used, e.g. for fusion, but the base scheduling model often isn't remotely close to the actual processor.

> It’s an interesting point that the newer Neoverse cores use a model with a smaller issue width. Are you saying this doesn’t match reality? If so, do you have any idea why they model it that way?

Yes, I opened an issue about the Neoverse cores; since then, an independent PR has adjusted the V2 down from 16 wide to a more realistic 8 wide: https://github.com/llvm/llvm-project/issues/136374

Part of the problem is that LLVM's scheduling model can't represent all properties of the CPU.

The issue width for those cores seems to be set to the maximum number of uops the core can execute at once. If you look at the Neoverse V1 microarchitecture, it indeed has 15 independent issue ports: https://en.wikichip.org/w/images/2/28/neoverse_v1_block_diag...

But notice how it can only decode 8 instructions (5 if you exclude the MOP cache) per cycle. This is partly because some operations occupy a port for multiple cycles before it can accept new instructions, so having more execution ports than decode width is still a gain in practice. The other reason is uop cracking: complex addressing modes and things like load/store pairs are cracked into multiple uops, which execute on separate ports.

The problem is that LLVM's IssueWidth parameter is used to model both decode and issue width. The execution port count is derived from the ports specified in the scheduling model itself, which are basically correct.

---

The reason for all of this, if I had to guess, is that modeling instruction scheduling doesn't matter all that much for codegen on OoO cores. The other reason is that just putting in the "real"/theoretical numbers doesn't automatically result in the best codegen.

It does matter, however, if you use it to visualize how a core would execute instructions.

The main point I want to make is that you shouldn't run llvm-mca with -mcpu=apple-m4, compare the result against -mcpu=znver5, and expect any reasonable answers. Just be sure to check the source, so you realize you are actually comparing a scheduling model based on the Apple Cyclone (2013) core against one based on the Zen 4 (2022) core.
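
To make that concrete, a sketch of the kind of naive comparison to be careful with (kernel and flags are illustrative, not from the post); both commands run fine, but the first is scheduled with the Cyclone-based model and the second with a Zen 4-based one:

    // saxpy.c -- trivial kernel used for both analyses
    void saxpy(float *restrict y, const float *restrict x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    // Illustrative invocations:
    //   clang --target=arm64-apple-macos -O3 -S saxpy.c -o - \
    //       | llvm-mca -mtriple=arm64-apple-macos -mcpu=apple-m4
    //   clang --target=x86_64-unknown-linux-gnu -O3 -S saxpy.c -o - \
    //       | llvm-mca -mtriple=x86_64-unknown-linux-gnu -mcpu=znver5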

b0a04gl · 1h ago
llvm-mca was always one of those tools I bookmark but never touch. This post finally made it feel usable; seeing uop breakdowns and bottlenecks right in the CLI was super clarifying.