This is a good set of slides. Dan is a good guy. There are a few nits I would pick. Sqrt(N) convergence comes from independence, not normality -- independence gives linearity of variance. { So, the sum of N IID samples of any distribution has N times the variance, but dividing by N you get the sqrt(N) shrinkage. } There is, of course, a higher-order relationship between the variance / "scale^2" of the distro and its tails, which statisticians refer to as "shape". He later goes on to mention the dependence problem, though, and the minimum-dt solution that is relied upon by, e.g., https://github.com/c-blake/bu/blob/main/doc/tim.md. So, it's all good. He may have covered it in voice, even.
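In symbols (just restating the independence argument, for N IID samples X_i with variance sigma^2):

```latex
% sqrt(N) from independence alone: variances of independent samples add.
\operatorname{Var}\Big(\sum_{i=1}^{N} X_i\Big) = N\sigma^2
\quad\Longrightarrow\quad
\operatorname{SD}\Big(\frac{1}{N}\sum_{i=1}^{N} X_i\Big)
  = \frac{\sqrt{N\sigma^2}}{N}
  = \frac{\sigma}{\sqrt{N}}
```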
He also mentions the Sattolo shuffle used by https://github.com/c-blake/bu/blob/main/doc/memlat.md for his memory latency measurements. One weird thing was how he said that because 1 byte/cycle is 4 GB/s, things are "easily CPU bound", while I feel like I've been "fighting The Memory Wall for at least 3 decades now..." even just from super-scalar CPUs, but he later does some vectorization stuff. That relates more to what calcs you are doing, of course, but high-bandwidth memory is a big part of what nVidia is selling.
https://x.com/lemire/status/1947615932702200138
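For the curious, the pointer-chasing trick behind that memlat approach looks roughly like this (my own sketch in C, not the actual memlat.md code): Sattolo's algorithm yields a single cycle, so every load depends on the previous one and you measure latency rather than bandwidth.

```c
/* Sketch (not the actual memlat.md code): build one random cycle with
   Sattolo's algorithm, then chase pointers through it. Each load depends
   on the previous one, so timing the chase measures latency, not bandwidth. */
#include <stdlib.h>
#include <stddef.h>

void sattolo(size_t *next, size_t n) {
    if (n < 2) return;
    for (size_t i = 0; i < n; ++i)
        next[i] = i;
    for (size_t i = n - 1; i > 0; --i) {
        size_t j = (size_t)rand() % i;   /* j < i: guarantees a single cycle */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
}

size_t chase(const size_t *next, size_t n_hops) {
    size_t p = 0;
    for (size_t i = 0; i < n_hops; ++i)
        p = next[p];                     /* serially dependent loads */
    return p;                            /* returning p defeats dead-code elimination */
}
```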
appreciatorBus · 11h ago
Looks like this was delivered earlier today at SEA 2025. I hope video will be available soon!
I don't think talks are being recorded, unfortunately.
benob · 2h ago
Are there efforts to include the necessary context in compilers to autovectorize?
yvdriess · 2h ago
What do you mean with necessary context?
Modern compilers all autovectorize really well. Usually writing plain canonical loops with plain C arrays is a good way to write portable optimal SIMD code.
The usual workflow I use is to translate the vector notation (RIP Cilk+ array syntax) in my paper notes to plain C loops. The compiler's optimization report (-qopt-report for icx, gcc has -fopt-info-vec and -fopt-info-vec-missed) gives feedback on what optimizations it considered and why it did not apply them. In more complex scenarios it can be helpful to add `#pragma omp simd` pragmas or similar to overrule the C semantics.
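For instance, what I mean by a plain canonical loop is something like this (illustrative sketch; `restrict` removes the aliasing obstacle, and the pragma needs -fopenmp-simd or an OpenMP-enabled build):

```c
/* Illustrative sketch of a canonical, autovectorizable loop.
   Try: gcc -O3 -fopenmp-simd -fopt-info-vec saxpy.c */
#include <stddef.h>

void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
    #pragma omp simd   /* overrule C semantics: assert iterations are independent */
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```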
anonymousDan · 56m ago
Is there a talk to go with the slides by any chance?
DrNosferatu · 3h ago
I’m surprised so much branching isn’t more costly.
yvdriess · 1h ago
Branch predictors have gotten really good, and it now often makes more sense to rely on them than to work away the branches.
For example, modern compilers will very rarely introduce conditional moves (cmov) on x86, because they are nearly always slower than simply branching. It might be counterintuitive, but a predicted branch breaks the dependencies between the micro-ops of the conditional and those of the clause. So if your cmov's condition depends on a load, the cmov has to wait for that load to complete before it can execute.
Always benchmark with at-scale data and measure.
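A toy illustration of that dependency point (my sketch, not from the talk):

```c
/* Toy sketch: the same counting loop written branchy vs. branchless.
   The branchless form typically compiles to setcc/cmov and carries a
   data dependency through every iteration; the branchy form lets a
   well-predicted branch speculate ahead of slow operands. */
#include <stddef.h>

size_t count_less_branchy(const int *a, size_t n, int key) {
    size_t c = 0;
    for (size_t i = 0; i < n; ++i)
        if (a[i] < key)                 /* predictable branch: speculation hides latency */
            c++;
    return c;
}

size_t count_less_branchless(const int *a, size_t n, int key) {
    size_t c = 0;
    for (size_t i = 0; i < n; ++i)
        c += (size_t)(a[i] < key);      /* dependency chain through c every iteration */
    return c;
}
```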
mycatisblack · 2h ago
Depends on the branch predictor: correct branch, everything’s loaded and set. Wrong branch: flush it all and load again.
If you know the branch predictor algorithm you can optimise for it.
Edit: it’s on p27
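The textbook demonstration of optimising for the predictor (my sketch, not from the slides) is running the same loop over random vs. sorted data:

```c
/* Sketch of making a branch predictable: the classic sorted-array trick.
   Same work either way, very different branch-miss rates. */
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

long sum_big(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; ++i)
        if (a[i] >= 128)                /* mispredicts often on random bytes */
            s += a[i];
    return s;
}

/* Sorting first -- qsort(data, n, sizeof *data, cmp_int); -- turns the
   condition into one long not-taken run followed by one taken run,
   which the predictor handles almost perfectly. */
```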
NooneAtAll3 · 9h ago
Apple still uses UTF-16?
vanderZwan · 6h ago
JavaScript does, so the web does, so by extension Apple probably does care about utf16.
jiggawatts · 6h ago
Also: Java, DotNet, and Windows all use 2-byte char types.
looperhacks · 3h ago
Akchyually! These days, Java uses Latin-1 if no characters outside Latin-1 are used. Only if full Unicode is necessary does it use UTF-16.
josephg · 3h ago
Apple does something similar for strings in objc and Swift. They do lots of other optimisations besides - like small string optimisation for short strings to avoid heap allocations.
...it's trivial to get UTF-8 strings into and out of an NSString though so the internal representation doesn't matter all that much.
More importantly, all of the actual user-facing side of macOS is UTF-8 (e.g. you can simply copy-paste a UTF-8 encoded Unicode string literal into a C source file, printf() it, and it will show up correctly in the terminal without tinkering with text editor or locale settings).
markasoftware · 7h ago
Is this talk about Apple? Regardless, lots of language runtimes still use UTF-16 (e.g. Java, Qt, Haskell), and Windows certainly still uses UTF-16.
phkahler · 8h ago
Pentium 4 didn't hit 3.8GHz. It melted at 1.4 or so.
wtallis · 7h ago
The Pentium 3 is what eventually topped out at 1.4 GHz, for the 130nm Tualatin parts introduced in 2001. The Pentium 4 started at 1.4GHz and 1.5GHz with the 180nm Willamette parts introduced in 2000. Those were eventually released with speeds up to 2.0GHz. The 130nm Pentium 4 Northwood reached 3.4GHz in 2004, and the 90nm Pentium 4 Prescott hit 3.8GHz later in 2004.
Netburst lasted a long time as Intel was floundering, before Core Duo was released in 2006.
Tom's Hardware overclocked one of these Northwood Pentium 4's to 5 GHz with liquid nitrogen and a compressor [1].
Those were the days, honestly.
[0]: https://en.wikipedia.org/wiki/Pentium_4
[1]: https://www.youtube.com/watch?v=z0jQZxH7NgM
It hit 3.8 GHz, and for a while it surpassed multiple cores' performance because games were built to run on a single core rather than being multithreaded/multicore. The same happened with some emulators.
IgnaciusMonk · 11h ago
I do not want to be rude, but this is exactly why LLVM being in the hands of the same entity that controls access to / owns the platform is insane.
edit - #64 E ! Also, I always say, the human body is the most error-prone measuring device humans have at their disposal.
bayindirh · 5h ago
Both LLVM and GCC are supported by processor manufacturers directly. Yes, Apple and Intel have their own LLVM versions, but as long as they don't break compatibility with GCC and don't explicitly prevent porting, I don't see a problem.
I personally use the GCC suite exclusively though, and while LLVM is not my favorite compiler, we can thank them for spurring the GCC team into action to improve their game.
gleenn · 11h ago
Can you be more explicit? Is it because they are optimizing too much to a single platform that isn't generalizable to other compilers or architectures? What's your specific gripe?
almostgotcaught · 10h ago
Whose hands exactly is LLVM in?
IgnaciusMonk · 11h ago
Also, to be more controversial: Red Hat deprecated x86_64-v1 & x86_64-v2, and people were crying because of that...
volf_ · 7h ago
A commercial enterprise is dropping support for older CPU architectures in their newer OSes so they can improve the average performance of the deployed software?
Don't see how that's controversial. It's something that doesn't matter to their customers or their business.
bayindirh · 5h ago
The newest x86_64-v1 server is older than a decade now, and I'm not sure -v2 is deprecated. RockyLinux 9 is running happily on -v2 hardware downstairs.
Oh, -v2 is deprecated for RH10. Not a big deal, honestly.
From a fleet perspective, I prefer that more code use more advanced instructions on my processors. Efficiency possibly goes up on hot code paths. What's not to love?
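If you want to check which level a given box supports, something like this works (assuming GCC 12 or newer, which accepts the level names in __builtin_cpu_supports):

```c
/* Sketch: query which x86-64 microarchitecture level this CPU supports.
   Assumes GCC 12+, where __builtin_cpu_supports takes the level names. */
#include <stdio.h>

int main(void) {
    __builtin_cpu_init();  /* initialize the CPU feature-detection data */
    printf("x86-64-v2: %s\n", __builtin_cpu_supports("x86-64-v2") ? "yes" : "no");
    printf("x86-64-v3: %s\n", __builtin_cpu_supports("x86-64-v3") ? "yes" : "no");
    printf("x86-64-v4: %s\n", __builtin_cpu_supports("x86-64-v4") ? "yes" : "no");
    return 0;
}
```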
homebrewer · 1h ago
One more reason to switch to a better alternative:
https://lwn.net/Articles/1010868/
https://almalinux.org/blog/2025-05-27-welcoming-almalinux-10...
tl;dr: AlmaLinux will support v2 in EL10 as a separate rebuild in the near future.
https://repo.almalinux.org/almalinux/10/isos/x86_64_v2/
We even do a full EPEL rebuild for it as well.
The newest x86_64-v1 server is older than a decade now
Did you mean v3?
bayindirh · 4h ago
No, v1. I mean, you haven't been able to buy an x86_64-v1 server for a decade now, and if you have one and it's alive, there's a very slim chance it's still working unless it's new old stock.
If it has seen any decent amount of workload during its lifetime, it probably has a couple of ICs that have reached the end of their electronic life and are malfunctioning.
anthk · 2h ago
Gemini Lake runs pretty well. If that happens, bye Fedora Bazzite with Linux-Libre on top.