FFmpeg devs boast of another 100x leap thanks to handwritten assembly code

130 harambae 46 7/20/2025, 8:51:41 PM tomshardware.com ↗

Comments (46)

AaronAPU · 2h ago
When I spent a decade doing SIMD optimizations for HEVC (among other things), it was sort of a joke to compare the assembly versions to plain C, because you’d get some ridiculous multipliers like 100x. It’s pretty misleading: what it really means is that the baseline was extremely inefficient to begin with.

The devil is in the details: microbenchmarks typically call the same function a million times in a loop, everything gets cached, and the overhead is reduced to sheer CPU cycles.

But that’s not how it’s actually used in the wild. It might be called once in a sea of many many other things.

You can at least go out of your way to create a massive test region of memory to prevent the cache from being so hot, but I doubt they do that.

izabera · 18m ago
ffmpeg is not too different from a microbenchmark, the whole program is basically just: while (read(buf)) write(transform(buf))
torginus · 1h ago
Sorry for the derail, but it sounds like you have a ton of experience with SIMD.

Have you used ISPC, and what are your thoughts on it?

I feel it's a bit ridiculous that in this day and age you have to write SIMD code by hand, since regular compilers suck at auto-vectorizing; it's never been a problem with GPU kernels.

yieldcrv · 10m ago
> what it really means is it was extremely inefficient to begin with

I care more about the outcome than the underlying semantics; to me that's kind of a given

Aardwolf · 3h ago
The article sometimes says 100x and other times says 100% speed boost. E.g. it says "boosts the app’s ‘rangedetect8_avx512’ performance by 100.73%." but the screenshot shows 100.73x.

100x would be a 9900% speed boost, while a 100% speed boost would mean it's 2x as fast.

Which one is it?

ethan_smith · 1h ago
It's definitely 100x (or 100.73x) as shown in the screenshot, which represents a 9973% speedup - the article text incorrectly uses percentage notation in some places.
MadnessASAP · 2h ago
100x to the single function, 100% (2x) to the whole filter
pizlonator · 3h ago
The ffmpeg folks are claiming 100x not 100%. Article probably has a typo
k_roy · 13m ago
That would be quite the percentage difference with 100x
torginus · 2h ago
I'd guess the function operates on 8-bit values, judging from the name. If the previous implementation was scalar, a double-pumped AVX-512 implementation can process 128 elements at a time, making the 100x speedup plausible.
tombert · 47m ago
Actually a bit surprised to hear that assembly is faster than optimized C. I figured that compilers are so good nowadays that any gains from hand-written assembly would be infinitesimal.

Clearly I'm wrong on this; I should probably properly learn assembly at some point...

mananaysiempre · 19m ago
Looking at the linked patches, you’ll note that the baseline (ff_detect_range_c) [1] is bog-standard scalar C code while the speedup is achieved in the AVX-512 version (ff_detect_rangeb_avx512) [2] of the same computation. FFmpeg devs prefer to write straight assembly using a library of vector-width-agnostic macros they maintain, but at a glance the equivalent code looks to be straightforwardly expressible in C with intrinsics if that’s more your jam. (Granted, that’s essentially assembly except with a register allocator, so the practical difference is limited.) The vectorization is most of the speedup, not the assembly.

To a first approximation, modern compilers can’t vectorize loops beyond the most trivial (say a dot product), and even that you’ll have to ask for (e.g. gcc -O3, which in other cases is often slower than -O2). So for mathy code like this they can easily be a couple dozen times behind in performance compared to wide vectors (AVX/AVX2 or AVX-512), especially when individual elements are small (like the 8-bit ones here).

Very tight scalar code, on modern superscalar CPUs... You can outcode a compiler by a meaningful margin, sometimes (my current example is a 40% speedup). But you have to be extremely careful (think dependency chains and execution port loads), and the opportunity does not come often (why are you writing scalar code anyway?..).

[1] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346725.h...

[2] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346726.h...

mhh__ · 30m ago
Compilers are extremely good considering the amount of crap they have to churn through, but they have zero information (by default) about how the program is going to be used, so it's not hard to beat them.
haiku2077 · 14m ago
If anyone is curious to learn more, look up "profile-guided optimization" which observes the running program and feeds that information back into the compiler
ivanjermakov · 39m ago
Related: ffmpeg's guide to writing assembly: https://news.ycombinator.com/item?id=43140614
cpncrunch · 42m ago
The article is unclear about what will actually be affected. It mentions "rangedetect8_avx512" and calls it an obscure function. So in what situations is it actually used, and what is the real-world improvement for the entire conversion process?
jauntywundrkind · 1h ago
Kind of reminds me of Sound Open Firmware (SOF), which can compile with either unoptimized GCC or the proprietary Cadence XCC compiler, which can use the Xtensa HiFi SIMD intrinsics.

https://thesofproject.github.io/latest/introduction/index.ht...

pavlov · 2h ago
Only for x86 / x86-64 architectures (AVX2 and AVX512).

It’s a bit ironic that for over a decade everybody was on x86 so SIMD optimizations could have a very wide reach in theory, but the extension architectures were pretty terrible (or you couldn’t count on the newer ones being available). And now that you finally can use the new and better x86 SIMD, you can’t depend on x86 ubiquity anymore.

Aurornis · 2h ago
AVX512 is a set of extensions. You can’t even count on an AVX512 CPU implementing all of the AVX512 instructions you want to use, unless you stick to the foundation instructions.

Modern encoders also have better scaling across threads, though not infinite. I was on an embedded project a few years ago where we spent a lot of time trying to get the SoC's video encoder working reliably, until someone ran ffmpeg and we realized we could just use several of the CPU cores for a better result anyway.

shmerl · 3h ago
Still waiting for PipeWire + xdg-desktop-portal screen/window capture support in the ffmpeg CLI. They've been dragging their feet on it forever.
askvictor · 3h ago
I wonder how many optimisations like this could be created by LLMs. Obviously we should not be incorporating LLMs into compilers, but I suspect that that's eventually what will happen.
Arubis · 1h ago
This intrinsically feels like the opposite of a good use case for an LLM for code gen. This isn’t boilerplate code by any means, nor would established common patterns be helpful. A lot of what ffmpeg devs are doing at the assembly level is downright novel.
pizlonator · 2h ago
The hardest part of optimizations like this is verifying that they are correct.

We don’t have a reliable, general-purpose way of verifying that any given code transformation is correct.

LLMs definitely can’t do this (they will lie and say that something is correct even if it isn’t).

viraptor · 2h ago
But we do! For LLVM there's https://github.com/AliveToolkit/alive2. There are papers like https://people.cs.rutgers.edu/~sn349/papers/cgo19-casmverify... There's https://github.com/google/souper and https://cr.yp.to/papers/symexemu-20250505.pdf. And probably other things I'm not aware of. If you're limiting the scope to a few blocks at a time, symbolic execution will do fine.
pizlonator · 1h ago
> limiting the scope to a few blocks

Yeah so like that doesn’t scale.

The interesting optimizations involve reasoning across thousands of blocks

And my point is there is no reliable, general-purpose solution here. "Only works for a few blocks at a time" is not reliable. It's not general purpose.

hashishen · 1h ago
It doesn't matter. This is inherently better because the dev knows exactly what is being done. LLMs could cripple entire systems with assembly access
gronglo · 1h ago
You could run it in a loop, asking it to improve the code each time. I know what the ffmpeg devs have done is impressive, but I would be curious to know if something like Claude 4 Opus could make any improvements.
eukara · 11m ago
I think if it were easy for them to improve critical projects like ffmpeg, we'd have seen some patches that mattered already. The only activity I've seen is LLMs being used to farm sec-ops bounties, which get rejected because of poor quality.

https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...

minimaxir · 35m ago
That can work with inefficient languages like Python, but not raw assembly.
smj-edison · 2h ago
I'd be worried about compile times, lol. Final binaries are quite often tens to hundreds of megabytes, and an LLM processes tokens much slower than a compiler completes passes.

EDIT: another thought: non-deterministic compilation would also be an issue unless you were tracking the input seed, and it would still cause spooky action at a distance unless you had some sort of recursive seed. Compilers are supposed to be reliable and deterministic, though to be fair advanced optimizations can be surprising because of the various heuristics.

viraptor · 2h ago
There's no reason to run the optimisation discovery at compile time. Anything that changes the structure can be run to change the source ahead of time. Anything that doesn't can be generalised into a typical optimisation step in the existing compiler pipeline. Same applies to Souper for example - you really don't want everyone to run it.
smj-edison · 33m ago
I'm not quite understanding your comment, are you saying that ANNs are only useful for tuning compiler heuristics?
viraptor · 1m ago
That too. But mainly for transforming the source ahead of time to be more optimal. If there's some low-level, local optimisation, it can be found once and implemented as a stable, repeatable idea in the compiler code instead.
hansvm · 57m ago
If we're considering current-gen LLMs, approximately zero. They're bad at this sort of thing even with a human in the loop.
viraptor · 2h ago
https://arxiv.org/html/2505.11480v1 we're getting there. This is for general purpose code, which is going to be easier than heavy SIMD where you have to watch out for very specific timing, pipelines and architecture details. But it's a step in that direction.
LtWorf · 2h ago
> I wonder how many optimisations like this could be created by LLMs

Zero. There's no huge corpus of stackoverflow questions on highly specific assembly optimisations so…

astrange · 2h ago
You can run an agent in a loop, but for something this small you can already use a SAT solver or superoptimizer if you want to get out of the business of thinking about things yourself.

I've never seen anyone actually do it, mostly because modeling the problem is more work than just doing it.

ksclk · 2h ago
> you can already use a SAT solver

Could you elaborate please? How would you approach this problem, using a SAT solver? All I know is that a SAT solver tells you whether a certain formula of ANDs and ORs is true. I don't know how it could be useful in this case.

dragontamer · 2h ago
Pretty much all instructions at the assembly level are sequences of AND/OR/XOR operations.

SAT solvers can prove that some (shorter) sequences are equivalent to other (longer) sequences. But it takes a brute force search.

IIRC, these superoptimizing SAT solvers can see patterns and pick 'multiply' instructions as part of their search, so it's more than traditional SAT. But at the end of the day, it's still a SAT equivalence problem.

agent327 · 2h ago
A short look at any compiled code on godbolt will very quickly inform you that pretty much all instructions at the assembly level are, in fact, NOT sequences of AND/OR/XOR operations.
dragontamer · 1h ago
All instructions are implemented with logic gates; in fact, all instructions today are likely NAND gates underneath.

Have you ever seen a Wallace tree multiplier? It's a good example of how XOR and AND gates can implement multiply.

Now, if multiply + XOR gets you the function you want, it's likely better than the original compiler output.

viraptor · 1h ago
You're missing the point. All instructions can be simplified to short integer operations, then all integer operations are just networks of gates, then all gates can be replaced with AND/OR/NOT, or even just NAND. That's why you can SAT solve program equivalence. See SMT2 programs using BV theory for example.

Also of course all instructions are MOV anyway. https://github.com/xoreaxeaxeax/movfuscator

dataangel · 29m ago
You're a bit naive about the complexity. Commonly longer sequences are actually faster, not just because instructions vary in their speed, but also because the presence of earlier instructions that don't feed results into later instructions still affects their performance. Different instructions consume different CPU resources and can contend for them (e.g. the CPU can stall even though all the inputs needed for a calculation are ready, just because you've done too many of that operation recently). And keep in mind that when I say "earlier instructions" I don't mean earlier in the textual list, I mean in the history of instructions actually executed; you can reach the same instruction arriving from many different paths!
gametorch · 1h ago
There are literally textbooks on optimization. With tons of examples. I'm sure there are models out there trained on them.
wk_end · 1h ago
There are literally textbooks on computational theory, with tons of example proofs. I'm sure there are models trained on them. Why hasn't ChatGPT produced a valid P vs. NP proof yet?
gametorch · 55m ago
I'm just countering their claim that such code is not in the training data.

You're employing a logical fallacy to argue about something else.