Less Slow C++
184 points | ashvardanian | 92 comments | 4/18/2025, 1:09:50 PM | github.com
Earlier this year, I took a month to reexamine my coding habits and rethink some past design choices. I hope to rewrite and improve my FOSS libraries this year, and I needed answers to a few questions first. Perhaps some of these questions will resonate with others in the community, too.
- Are coroutines viable for high-performance work?
- Should I use SIMD intrinsics for clarity or drop to assembly for easier library distribution?
- Has hardware caught up with vectorized scatter/gather in AVX-512 & SVE?
- How do secure enclaves & pointer tagging differ on Intel, Arm, & AMD?
- What's the throughput gap between CPU and GPU Tensor Cores (TCs)?
- How costly are misaligned memory accesses & split-loads, and what gains do non-temporal loads/stores offer?
- Which parts of the standard library hit performance hardest?
- How do error-handling strategies compare overhead-wise?
- What's the compile-time vs. run-time trade-off for lazily evaluated ranges?
- What practical, non-trivial use cases exist for meta-programming?
- How challenging is Linux Kernel bypass with io_uring vs. POSIX sockets?
- How close are we to effectively using Networking TS or heterogeneous Executors in C++?
- What are best practices for propagating stateful allocators in nested containers, and which libraries support them?
These questions span from micro-kernel optimizations (nanoseconds) to distributed systems (micro/millisecond latencies). Rather than tackling them all in one post, I compiled my explorations into a repository—extending my previous Google Benchmark tutorial (<https://ashvardanian.com/posts/google-benchmark>)—to serve as a sandbox for performance experimentation.

Some fun observations:
- Compilers now vectorize 3x3x3 and 4x4x4 single/double-precision matrix multiplications well (see the sketch after this list)! The smaller one is ~60% slower despite 70% fewer operations, outperforming my vanilla SSE/AVX and coming within 10% of AVX-512.
- Nvidia TCs vary dramatically across generations in numeric types, throughput, tile shapes, thread synchronization (thread/quad-pair/warp/warp-groups), and operand storage. Post-Volta, manual PTX is often needed (as intrinsics lag), though the new TileIR (introduced at GTC) promises improvements for dense linear algebra kernels.
- The AI wave drives CPUs and GPUs to converge in mat-mul throughput & programming complexity. It took me a day to debug TMM register initialization, and SME is equally odd. Sierra Forest packs 288 cores/socket, and AVX10.2 drops 256-bit support for 512-bit... I wonder if discrete Intel GPUs are even needed, given CPU advances?
- In common floating-point ranges, scalar sine approximations can be up to 40x faster than standard implementations, even without SIMD. It's a bit hand-wavy, though; I wish more projects documented error bounds and had 1 & 3.5 ULP variants like Sleef.
- Meta-programming tools like CTRE can outperform typical RegEx engines by 5x and simplify building parsers compared to hand-crafted FSMs.
- The complexity and performance gap between kernel-bypass stacks (DPDK/SPDK) and io_uring, once clearly distinct, is narrowing. While pre-5.5 io_uring can boost UDP throughput by 4x on loopback IO, newer zero-copy and concurrency optimizations remain challenging.
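Regarding the first observation, the kind of fixed-size kernel that compilers auto-vectorize looks roughly like this; a generic sketch, not necessarily the repository's exact code:

```cpp
// Fixed-size 4x4 single-precision matrix multiplication. With all trip counts
// known at compile time, GCC and Clang fully unroll and vectorize this at -O2/-O3,
// no intrinsics required.
void matmul_4x4(float const a[4][4], float const b[4][4], float c[4][4]) {
    for (int i = 0; i != 4; ++i)
        for (int j = 0; j != 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k != 4; ++k)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}
```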
The repository is loaded with links to favorite CppCon lectures, GitHub snippets, and tech blog posts. Recognizing that many high-level concepts are handled differently across languages, I've also started porting examples to Rust & Python in separate repos. Coroutines look bad everywhere :(

Overall, this research project was rewarding! Most questions found answers in code — except pointer tagging and secure enclaves, which still elude me in public cloud. I'd love to hear from others, especially on comparing High-Level Synthesis for small matrix multiplications on FPGAs versus hand-written VHDL/Verilog for integral types. Let me know if you have ideas for other cool, obscure topics to cover!
Huh, ok, let's see how...
I see. "reduced accuracy" is an understatement. It's just horrifically wrong for inputs outside the range of [-2, 2]https://www.wolframalpha.com/input?i=plot+sin+x%2C+x+-+%28x%...
It cannot handle a single interval of a sin wave, much less the repeating nature? What an absolutely useless "optimization"
You can find more complete examples in my SimSIMD (https://github.com/ashvardanian/SimSIMD), but they also often assume that at a certain part of a kernel, a floating point number is guaranteed to be in a certain range. This can greatly simplify the implementation for kernels like Atan2. For general-purpose inputs, go to SLEEF (https://sleef.org). Just remember that every large, complicated optimization starts with a small example.
People have already ragged on you for doing Taylor approximation, and I'm not the best expert on the numerical analysis of implementing transcendental functions, so I won't pursue that further. But there are still several other unaddressed errors in your trigonometric code:
* If your function is going to omit range reduction, say so upfront. Saying "use me to get a 40× speedup because I omit part of the specification" is misleading to users, especially because you should assume that most users are not knowledgeable about floating-point and thus they aren't even going to understand they're missing something without you explicitly telling them!
* You're doing polynomial evaluation via `x * a + (x * x) * b + (x * x * x) * c`, which is not the common way of doing so, and it's also a slow way of doing so. If you're trying to be educational, do it via the `((c * x + b) * x + a) * x` technique (see the sketch after this list)--that's how it's done, that's how it should be done.
* Also, doing `x / 6.0` is a disaster for performance, because fdiv is one of the slowest operations you can do. Why not do `x * (1.0 / 6.0)` instead?
* Doing really, really dumb floating-point code and then relying on -ffast-math to make the compiler unpick your dumbness is... a bad way of doing stuff. Especially since you're recommending people go for it for the easy speedup and saying absolutely nothing about where it can go catastrophically wrong. As Simon Byrne said, "Friends don't let friends use -ffast-math" (and the title of my explainer on floating-point will invariably be "Floating Point, or How I Learned to Start Worrying and Hate -ffast-math").
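Since the two points above ask for it, here's what the Horner-style version with precomputed reciprocal constants looks like; a minimal sketch, assuming the plain Taylor coefficients of the sine polynomial under discussion rather than tuned minimax ones:

```cpp
// Horner-style evaluation of a degree-7 sine polynomial on a narrow range.
// The coefficients are the plain Taylor ones, shown only to illustrate the
// evaluation scheme, not as a tuned minimax approximation.
inline double sin_poly(double x) {
    constexpr double c3 = -1.0 / 6.0;    // reciprocals folded at compile time: no fdiv at runtime
    constexpr double c5 = +1.0 / 120.0;
    constexpr double c7 = -1.0 / 5040.0;
    double const x2 = x * x;
    // x + c3*x^3 + c5*x^5 + c7*x^7, evaluated as nested multiply-adds
    return (((c7 * x2 + c5) * x2 + c3) * x2 + 1.0) * x;
}
```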
I can't for the life of me find the Sony presentation, but the fastest polynomial calculation is somewhere between Horner's method (which has a huge dependency tree in terms of pipelining) and full polynomial evaluation (which has redundancy in calculation).
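For what it's worth, the in-between approach described here sounds like Estrin's scheme; a minimal sketch for a generic degree-7 polynomial, with placeholder coefficients c[0]..c[7]:

```cpp
// Estrin's scheme for c0 + c1*x + ... + c7*x^7: pairs of terms are evaluated
// independently and combined with x^2 and x^4, shortening the dependency chain
// compared to Horner's method at the cost of a few extra multiplies.
inline double estrin_7(double x, double const c[8]) {
    double const x2 = x * x;
    double const x4 = x2 * x2;
    double const p01 = c[0] + c[1] * x;   // these four pairs can execute in parallel
    double const p23 = c[2] + c[3] * x;
    double const p45 = c[4] + c[5] * x;
    double const p67 = c[6] + c[7] * x;
    return (p01 + p23 * x2) + (p45 + p67 * x2) * x4;
}
```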
Totally with you on not relying on fast math! Not that I had much choice when I was working on games because that decision was made higher up!
One of the big problems with floating-point code in general is that users are largely ignorant of floating-point issues. Even something as basic as "0.1 + 0.2 != 0.3" shouldn't be that surprising to a programmer if you spend about five minutes explaining it, but the evidence is clear that it is a shocking surprise to a very large fraction of programmers. And that's the most basic floating-point issue, the one you're virtually guaranteed to stumble across if you do anything with floating-point; there are so many more issues that you're not going to think about until you uncover them for the first time (e.g., different hardware gives different results).
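For completeness, the five-minute demonstration of that first surprise takes only a few lines:

```cpp
#include <cstdio>

int main() {
    double const sum = 0.1 + 0.2;
    // Neither 0.1 nor 0.2 is exactly representable in binary floating point,
    // so the sum is the nearest double to 0.30000000000000004..., not 0.3.
    std::printf("%.17g\n", sum);      // 0.30000000000000004
    std::printf("%d\n", sum == 0.3);  // 0
}
```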
On the plus side, x87 excess precision is largely a thing of the past, and we've seen some major pushes towards getting rid of FTZ/DAZ (I think we're at the point where even the offload architectures are mandating denormal support?). Assuming Intel figures out how to fully get rid of denormal penalties on its hardware, we're probably a decade or so out from making -ffast-math no longer imply denormal flushing, yay. (Also, we're seeing a lot of progress on high-speed implementations of correctly-rounded libm functions, so I also expect to see standard libraries require correctly-rounded implementations as well).
I'm somewhat less interested in correctness of the results, so long as they're consistent. rlibm and related are definitely neat, but I'm not optimistic they'll become mainstream.
But not having _range reduction_ is a bigger problem; I can't see many uses for a sin() approximation that's only good for half a wave. And as others have said, if you need range reduction for the approximation to work in its intended use case, that needs to be included in the benchmark, because you're going to be paying that cost relative to `std::sin()`.
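As a rough illustration of what that reduction step costs, here is the naive fold into [-π/2, π/2]; a sketch only, which loses precision for very large arguments, where a Payne-Hanek-style reduction would be needed:

```cpp
#include <cmath>

// Naive range reduction: sin(x) = (-1)^k * sin(x - k*pi), with k the nearest
// integer to x/pi, so the residual r lands in [-pi/2, pi/2]. Good enough for
// moderate |x|; either way, this work belongs inside the benchmark if the
// approximation needs it to be usable.
inline double sin_with_reduction(double x, double (*sin_core)(double)) {
    constexpr double pi = 3.14159265358979323846;
    double const k = std::nearbyint(x * (1.0 / pi));
    double const r = x - k * pi;
    double const s = sin_core(r);
    return (static_cast<long long>(k) & 1) ? -s : s;  // odd multiples of pi flip the sign
}
```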
That would be me, I’m afraid. I know little about Taylor series, but I’m pretty sure it’s less than ideal for the use case.
Here’s a better way to implement faster trigonometry functions in C++ https://github.com/Const-me/AvxMath/blob/master/AvxMath/AvxM... That particular source file implements that for 4-wide FP64 AVX vectors.
It will only take a 5-line PR to add Horner's method and Chebyshev polynomials, plus probably around 20 lines of explanation, and everyone passionate about the topic is welcome to add them.
There are more than enough examples in the libraries mentioned above ;)
I'll update the README statement in a second, and already patched the sources to explicitly focus on the [-π/2, π/2] range.
Thanks!
Also, it would be good to have even in a "production" use of a function like this, in case something outside that range reaches it by accident.
Then it should educate on the applicability and limitations of things like this instead of just saying "reduced accuracy" and hoping the reader notices the massive issues? Kinda like the ffast-math section does.
Kaze Emanuar has two entire videos dedicated to optimizing sin() on the Nintendo 64 and he's using approximations like this without issues in his homebrew:
Branches at this scale are actually significant, and so will drastically impact the ability to achieve the claimed 40x speedup.
Again, we're in a situation where we know we can tolerate a 0.5% error, we can spare a bit of time to think about what range needs to be handled fast or supported at all.
It's indeed not a substitute for sin in general, but it could be in some use-cases, and for those it could really be 40x faster (say, cases where you're already externally doing range reduction because it's necessary for some other reason (in general you don't want your angles infinitely accumulating scale)).
- CTRE is fine as long as you don't overflow the stack. I tried once to validate a string for a HTTP proxy configuration with an exhaustive regex, CTRE tried to allocate 5 KiB of stack 40 call frames in and therefore crashed the embedded system with a stack overflow. I've had to remove port validation from the regex (matching a number between 1 and 65535 was a bridge too far) and check that part by hand instead. I've also had to dumb down other CTRE regexes too in my code for similar reasons.
- Several constraints and design decisions led me to mostly ditch JSON internally and write my own BSON library. Instead of the traditional dynamically-allocated tree-of-nodes approach, it works directly in-place, so I can give it a std::vector with a chunk of reserved memory upfront and not worry about memory allocation or fragmentation later on. One major benefit is that, since there are no string escape sequences, I can return a std::string_view directly for string values (see the sketch after this list). There are downsides to this approach, mostly revolving around modifications: one needs to be very careful not to invalidate iterators (which are raw pointers to the underlying buffer) while doing so, and adding/removing entries towards the beginning of a large document is expensive due to the memmove().
- I ditched newlib for picolibc and exterminated anything that pulled in the C/C++ standard library locale code (that was over 130 kilobytes of Flash altogether IIRC), which includes among other things C++ streams (they are bad for plenty of other reasons too, but mine was program size).
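Not the commenter's actual library, but a rough sketch of the in-place idea above, assuming the standard BSON string-element layout (type byte, NUL-terminated key, little-endian int32 length, bytes) and a little-endian host:

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

// Given a pointer to a BSON string element (type byte 0x02), return a
// std::string_view aliasing the underlying buffer instead of copying or
// unescaping the value. The view is invalidated by any buffer reallocation.
inline std::string_view bson_string_value(std::uint8_t const* element) {
    // element[0] == 0x02, followed by the key as a NUL-terminated C string
    char const* key = reinterpret_cast<char const*>(element + 1);
    std::uint8_t const* p = element + 1 + std::strlen(key) + 1;
    std::int32_t length;                      // little-endian, counts the trailing NUL
    std::memcpy(&length, p, sizeof(length));
    return {reinterpret_cast<char const*>(p + 4), static_cast<std::size_t>(length - 1)};
}
```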
You seem to have mostly aimed for throughput and raw performance in your benchmarks, which is fine for a generic desktop or server-class system with an MMU and plenty of resources. Just wanna point out that other environments will have different constraints that will mandate different kinds of optimizations, like memory usage (heap/stack/program size), dynamic memory fragmentation, real-time/jitter...
I wonder how Rust is stacking up (no pun intended) in the embedded game these days.
I am still looking for a short example of such CUDA kernels, and I would love to see more embedded examples if you have thoughts ;)
I'm aware that the C++ standard library has polymorphic allocators alongside a couple of memory resource implementations. I've also heard that the dynamic dispatch for polymorphic allocators could carry a speed penalty compared to a statically dispatched allocator or the standard std::allocator that uses operator new(), but I have no concrete data to judge either way.
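For reference, a minimal std::pmr sketch of the pattern in question: nested containers picking up one memory resource, where every allocation goes through a virtual call on the resource (the dynamic-dispatch cost mentioned above), but with a monotonic arena the work per call is just a pointer bump:

```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

int main() {
    // One upfront arena; the virtual do_allocate() calls mostly just bump a pointer.
    std::array<std::byte, 64 * 1024> arena;
    std::pmr::monotonic_buffer_resource pool{arena.data(), arena.size()};

    // Nested pmr containers propagate the resource to their elements automatically:
    // inner strings and vectors allocate from the same arena as the outer vector.
    std::pmr::vector<std::pmr::string> names{&pool};
    names.emplace_back("polymorphic allocators in one arena");

    std::pmr::vector<std::pmr::vector<int>> nested{&pool};
    nested.emplace_back();            // inner vector picks up &pool via uses-allocator construction
    nested.back().assign({1, 2, 3});  // elements land in the arena, not on the global heap
}
```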
Which is to say CTRE is mostly not fine if you use it on user-provided strings, regardless of target environment. It's heavily recursion-based, never spills to the heap, and has no other safeguards for memory use or recursion depth.
Our target has ~1.5 MiB of Flash for program code and 512 KiB of RAM. We're using half of the former and maybe a third of the latter, the team barely paid any attention to program size or memory consumption. One day the project lead became slightly concerned about that and by the end of the day I shed off 20% of Flash and RAM usage going for the lowest hanging fruits.
I find it a bit amusing to call a 250 MHz STM32H5 MCU a constrained micro environment, if anything it's a bit overkill for what we need.
That's certainly a restricted dialect of C++.
> I find it a bit amusing to call a 250 MHz STM32H5 MCU a constrained micro environment, if anything it's a bit overkill for what we need.
I took an "embedded" systems class in college 15+ years ago that targeted a 32-bit ARM with megabytes of ram, so using these kBs of RAM micros in 2025 definitely feels like a constrained environment to me. The platforms I work on with C++ professionally have, ya know, hundreds of gigabytes of RAM (and our application gets ~100% of it).
I am sure it overlaps in terms of topics, maybe even some examples, but the homepage suggests that the book is about 500 pages long. I generally don’t have the time budget to read that much, and in most cases, I want to play with the code more than read the text, especially when some new io_uring feature, metaprogramming tool, or assembly instruction comes out.
Another observation is that people don’t like to read into 5000-word essays on each topic. At best, those become training materials for the next LLMs and will affect future code only indirectly…
I’m all ears for alternative formats if you have recommendations, as I generally reimplement such projects every couple of years ;)
I used such a structure in a benchmark suite https://github.com/MC-DeltaT/cpu-performance-demos
I have increasingly soured on Abseil over the past couple of years. At Google, we've seen an exodus of many of the core library maintainers, and some of the more recent design choices have been questionable at best.
"How coding FPGA differs from GPU and what is High-Level Synthesis, Verilog, and VHDL? #36" Yes please!
HW design has so many degrees of freedom which makes it fun but challenging.
What specific topics are you trying to address with this?
> High-Level Synthesis
Something that absolutely not a single professional designer uses.
Everything that Kris Jusiak has under https://github.com/qlibs/ is worth a look.
I expected C code too, however, not only .cpp and .S files.
The less_slow.cpp file uses a lot of C++-isms.
This may require fixing, or removing "C" from the list.
Does C++ have a good async ('coroutine') story for io_uring yet?
Short answer: sadly, no. I love the "usability" promise of coroutines—and even have 2–3 FOSS projects that could be rewritten entirely around C++ or Rust coroutines for better debuggability and extensibility—but my experiments show that the runtime cost of most coroutine‑like abstractions is simply too high. Frankly, I’m not even sure if a better design is possible on modern hardware.
This leads me to conclude that, despite my passion for SIMD and superscalar execution, the highest‑impact new assembly instructions that x86 and Arm could standardize would center on async execution and lightweight context switching... yet I haven’t seen any movement in that direction.
⸻
I also wrote toy examples for various range/async/stream models in C++, Rust, and Python, with measured latencies in inline comments:
Aside from coroutines (toy hand-rolled implementations and commonly used libraries), I've also played around with C++ executors, senders & receivers, but didn't have much success with them either. May be a skill issue.

Which runtime cost do you mean?
The main one I am aware of is a heap allocation per coroutine, though this can in some cases be elided if the coroutine is being called from another coroutine.
The other cost I am aware of is the initializing of the coroutine handle, but I think this is just a couple of pointers.
In both cases I would expect these overheads to be relatively modest compared to the cost of the I/O itself, though it's definitely better to elide the heap allocation when possible.
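For context, a toy eager task type showing where those two costs sit; a sketch only, not unifex and not the godbolt prototypes linked below:

```cpp
#include <coroutine>
#include <cstdio>
#include <exception>

// A deliberately minimal coroutine return type. The coroutine frame (promise +
// locals) is heap-allocated unless the compiler can prove its lifetime is nested
// in the caller's and elide the allocation (HALO); the handle itself is a pointer.
struct Task {
    struct promise_type {
        int value = 0;
        Task get_return_object() {
            return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_never initial_suspend() noexcept { return {}; }  // run eagerly
        std::suspend_always final_suspend() noexcept { return {}; }   // keep the frame alive for result()
        void return_value(int v) noexcept { value = v; }
        void unhandled_exception() noexcept { std::terminate(); }
    };

    explicit Task(std::coroutine_handle<promise_type> h) noexcept : handle(h) {}
    ~Task() { if (handle) handle.destroy(); }  // frees the (possibly heap-allocated) frame
    int result() const { return handle.promise().value; }

    std::coroutine_handle<promise_type> handle;
};

Task add_async(int a, int b) { co_return a + b; }

int main() {
    Task t = add_async(2, 3);
    std::printf("%d\n", t.result());  // 5
}
```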
I don't know much about coroutine libraries like unifex (which I think your test is using), but a hand-coded prototype I was playing with doesn't seem to add much overhead: https://godbolt.org/z/8Kc1oKf15
If we can compile with -fno-exceptions, the code is even tighter: https://godbolt.org/z/5Yo8Pqvd5
My exploration into coroutines and I/O is only in the earliest stages, so I won't claim any of this to be definitive. But I am very interested in this question of whether the overhead is low enough to be a good match for io_uring or not.
The cost of a context switch consists of two parts, one of which can be subdivided:
1. register save/restore
2. the cost of the TLB flush, which is in turn proportional to the working set size of the switched-to process (i.e. if you don't touch much memory after the context switch, the cost is lower than if you do)

I am not sure that any assembler instructions could address either of these.
What do you have in mind?
Next, it would be exciting to implement a concurrent job-stealing graph algorithm in both languages to get a feel for their ergonomics in non-trivial problems. I can imagine it looks very different in Rust and C++, but before I get there, I'm looking for best practices for implementing nested associative containers with shared stateful allocators in Rust.
In C++, I've implemented them like this: <https://github.com/ashvardanian/less_slow.cpp/blob/8f32d65cc...>, even though I haven't seen many people doing that in public codebases. Any good examples for Rust?
I am surprised about CTRE giving good results—I will admit I have thought of it more as a parlor trick than a viable engine. I will need to dig into that more. I also want to dig into the OpenMP & TBB threadpool benchmarks to see whether a Boost::ASIO threadpool can be added to them.
A word of caution, though: I remember the throughput differing vastly between GCC and MSVC builds. The latter struggles with heavy meta-programming and expression templates. I don't know why.
I think there's a case to be made for libraries like https://github.com/foonathan/lexy and/or https://github.com/boostorg/parser instead of reaching for a regex in the first place.
I’ve only avoided it in the tutorial, as I want to keep the build system lean. I wouldn’t be surprised if it’s 10x faster than Boost in the average case.
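For anyone who hasn't tried it, CTRE usage looks roughly like this; a sketch assuming CTRE v3 and C++20 string-literal template arguments:

```cpp
#include <ctre.hpp>  // https://github.com/hanickadot/compile-time-regular-expressions
#include <cstdio>
#include <string_view>

// The pattern is a template argument, so the matcher is generated as ordinary,
// inlinable code at compile time instead of being interpreted by a runtime engine.
static bool looks_like_port(std::string_view s) {
    return static_cast<bool>(ctre::match<"[0-9]{1,5}">(s));
}

int main() {
    std::printf("%d %d\n", looks_like_port("8080"), looks_like_port("http"));  // 1 0
}
```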
For general purpose usage, Google's RE2 and PCRE2 in JIT mode will offer pretty good performance. Zoltan Herczeg's work on the PCRE2's JIT is underappreciated. Both these options are widely available and portable.
--
This was a good reminder that I need to pay more attention to Unum's projects. I noticed this older blog article, https://www.unum.cloud/blog/2021-01-03-texts, and that brings up some questions. First, in 2025, is UStore a wholesale replacement for UDisk or are the two complementary? Second, what is the current Unum approach for replacing full-text search engines (e.g., ElasticSearch)?
For years, I've had a hope to build it in the form of an open-core project: open-source SotA solutions for Storage, Compute, and AI Modeling built bottom up. You can imagine the financial & time burden of building something like that with all the weird optimizations and coding practices listed above.
A few years in, with millions spent out of my pocket, without any venture support or revenue, I've decided to change gears and focus on a few niche workloads until some of the Unum tools become industry standards for something. USearch was precisely that, a toy Vector Search engine that would still, hopefully, be 10x better than alternatives, in one way or another: <https://www.unum.cloud/blog/2023-11-07-scaling-vector-search...>.
Now, ScyllaDB (through its Rust SDK) and YugaByte (through its C++ SDK) are the most recent DBMSs to announce features built on USearch, joining the ranks of many other tech products leveraging some of those optimizations. Last year I was playing around with different open-source growth & governance ideas, looking for a way to organize a more collaborative environment among our upstream users rather than a competitive one — no major releases, just occasional patches here and there.
It was an interesting period, but now I'm again deep in the "CUDA-GDB" land, and the next major release to come is precisely around Full-Text Search in StringZilla <https://github.com/ashvardanian/stringzilla>, and will be integrated into both USearch <https://github.com/unum-cloud/usearch> and somewhere else ;)
ElasticSearch has always seemed geared too much towards concurrent querying with mixed workloads, and then it gets applied to logs… and, well, with logs you care about detection of known query sets at ingest, indexing speed, compression, and ability to answer queries over large cold indices in cheap object storage. And often when searching logs, you want exact matching, preferably with regex. Part of me wants to play with rolling my own crazy FM-index library, part of me thinks it might be easier to play games with Parquet dictionary tables (get factors out of regexps, check dictionary tables for the factors for great win), and part of me thinks I will be better off waiting to see what comes of the Rust-based ES challengers.
Will definitely follow announcements to come with StringZilla.
Oh, absolutely, go for it! And check out what Pesho <https://github.com/pesho-ivanov> & Ragnar are doing <https://github.com/RagnarGrootKoerkamp> ;)
But then you come across something like this Less Slow C++. It’s not just about being clever or showy—it’s about solving real problems in the most efficient way possible. And that’s why it stands out in a world full of quick fixes and overcomplicated solutions.
Keep pushing for clarity, efficiency, and above all, results. With tools like this, the sky really is the limit.