How is Ultrassembler so fast?
118 points by netr0ute 8/31/2025, 5:42:43 PM | 49 comments | jghuff.com
That is not true - no allocator I know of (and certainly not the default glibc allocator) allocates memory that way. It only does a syscall when it doesn't have free userspace memory left to hand out, it overallocates when it does ask the kernel, and it reuses memory you've already freed.
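As a rough illustration (glibc behavior I've observed, not something the standard guarantees), a freed block of the same size is typically handed straight back with no kernel involvement:

    #include <cstdio>
    #include <cstdlib>

    int main() {
        void* a = std::malloc(64);
        std::free(a);
        void* b = std::malloc(64);  // usually the same block, recycled from
                                    // the allocator's free list, no syscall
        std::printf("%p %p\n", a, b);
        std::free(b);
    }

Run it under strace and the brk/mmap traffic shows up at startup, not per allocation.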
Including checking return codes instead of exceptions. It's even possible for exceptions as implemented by g++ in the Itanium ABI to be cheaper than the code that would be used for consistently checking return codes.
[0] https://news.ycombinator.com/item?id=22483028
[1] https://www.research.ed.ac.uk/portal/files/78829292/low_cost...
You can find that either style outperforms the other depending on the circumstances of the program. Programmers who care about optimization may someday be able to choose between the approaches to meet their performance goals, instead of pretending that tables are inherently slow and that return codes they won't even fully implement are inherently fast. (The two styles are sketched below.)
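For anyone who hasn't seen the two styles side by side, here's a minimal sketch (hypothetical parse functions, not taken from either paper):

    #include <stdexcept>

    // Return-code style: the check runs on every call, and every caller
    // up the graph has to test and propagate the result.
    int parse_rc(int x, int* out) {
        if (x < 0) return -1;
        *out = x * 2;
        return 0;
    }

    // Table-based exceptions: the happy path carries no checks; the unwind
    // tables cost space, and time is only paid if a throw actually happens.
    int parse_ex(int x) {
        if (x < 0) throw std::invalid_argument("negative input");
        return x * 2;
    }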
My point was simply that you can't just say "oh but exceptions are not zero-cost" without actually comparing to the alternative of laboriously carting return codes all through the call graph, as done in the research you show here and as also done by Khalil Estell elsewhere for ARM embedded.
By definition, that's zero-overhead because Ultrassembler doesn't care about space.
Feel free to ask me any questions to break the radio silence!
The codegen switch with the offsets is in everything; the first time I saw it was in the Rhino JS bytecode compiler in maybe 2006, and I've written it a dozen times since. Still, it's clever that you worked it out from first principles.
There are some modern C++ libraries that do frightening things with SIMD that might give your bytestring code a lift on modern stupid-wide, high-mispredict-penalty hardware (a flavor of which is sketched below). Anything by lemire or stringzilla; and take a look at zpp_bits for inspiration on theoretical-minimum data structure pack/unpack.
But I think you got damn close to what can be done, niiicccee work.
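To give a flavor of what those libraries do under the hood, here's a bare-bones SSE2 byte scan (my own sketch, not code from lemire or stringzilla; tail handling omitted for brevity):

    #include <cstddef>
    #include <cstdint>
    #include <emmintrin.h>  // SSE2 intrinsics

    // Scan 16 bytes per iteration; assumes len is a multiple of 16.
    const std::uint8_t* find_byte(const std::uint8_t* buf, std::size_t len,
                                  std::uint8_t needle) {
        const __m128i pattern = _mm_set1_epi8((char)needle);
        for (std::size_t i = 0; i < len; i += 16) {
            __m128i chunk = _mm_loadu_si128((const __m128i*)(buf + i));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, pattern));
            if (mask != 0)
                return buf + i + __builtin_ctz(mask);  // GCC/Clang builtin
        }
        return nullptr;
    }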
The “value speculation” thing looks completely weird to me, especially the “volatile”, which doesn't appear to do anything at all here (volatile is a type qualifier in C++, and here it doesn't constrain the codegen at all). If it works, I'm not really convinced it works for the reason the author thinks it works (especially since it refers to an article talking about a CPU from the relative stone age).
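For reference, my understanding of the trick (a sketch of the general idea, not the author's actual code): guess that list nodes sit contiguously in memory, so the loop doesn't have to wait for the pointer load before starting the next iteration:

    struct Node { long value; Node* next; };

    long sum_speculative(const Node* node) {
        long total = 0;
        while (node) {
            const Node* next = node->next;
            total += node->value;
            // Speculation: nodes allocated back to back usually live at
            // node + 1. When the guess is right, the next iteration doesn't
            // depend on the load of `next`; a wrong guess costs one mispredict.
            const Node* guess = node + 1;
            node = (next == guess) ? guess : next;
        }
        return total;
    }

Written naively like this, compilers tend to fold the guess back into a plain `node = next`, which is why the versions of this trick I've seen pin the comparison with inline asm. Whether a volatile accomplishes that here is exactly what I'm doubting.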
> However, in Chata's case, it needs to access a RISC-V assembler from within its C++ code. The alternative is to use some ugly C function like system() to run external software as if it were a human or script running a command in a terminal.
Have you tried LLVM's C++ API [0]?
To be fair, I do think there's merit in writing your own assembler with your own API. But you don't necessarily have to.
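For contrast, the system() route the article calls ugly amounts to this (hypothetical toolchain and file names):

    #include <cstdlib>

    // Shell out to an external assembler: spawns a shell, forks a process,
    // and round-trips the input and output through the filesystem.
    int assemble_external() {
        return std::system("riscv64-unknown-elf-as input.s -o output.o");
    }

Slow and clunky next to an in-process API, which I assume is the article's point.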
I'm not likely to go back to assembly unless my employer needs that extra level of optimization. But if/when I do, and the target platform is RISC-V, then I'll definitely consider Ultrassembler.
> It's not clear when exactly exceptions are slow. I had to do some research here.
There are plenty of cppcon presentations [1] about exceptions, performance, caveats, blah blah. There are also other C++ conferences with similar presentations (or even almost identical ones, because the presenters go to multiple conferences), though I don't have a link handy because I pretty much only attend cppcon.
[0]: https://stackoverflow.com/questions/10675661/what-exactly-is...
[1]: https://www.youtube.com/results?search_query=cppcon+exceptio...
I think I read something about this but couldn't figure out how to use it because the documentation is horrible. So, I found it easier to implement my own, and as it turns out, there are a few HORRIBLE bugs in the LLVM assembler (from cross reference testing) probably because nobody is using the C++ API.
> There are plenty of cppcon presentations [1] about exceptions, performance, caveats, blah blah.
I don't have enough time to watch these kinds of presentations.
But honestly you'd get the vast majority of the benefit just by skimming through the slides at https://github.com/CppCon/CppCon2024/blob/main/Presentations...
By defining a couple of symbols yourself, a lot of the associated g++ code size is sharply reduced while exceptions still work (slide 60 onward).
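If I remember the gist right (my paraphrase of the idea, not the slides verbatim), one such symbol on GCC is libstdc++'s verbose terminate handler, which otherwise drags in the demangler and stdio machinery:

    // Defining this yourself keeps the linker from pulling in libstdc++'s
    // version, which calls the demangler and printf-family code.
    namespace __gnu_cxx {
        void __verbose_terminate_handler() {
            while (true) {}  // halt instead of printing a diagnostic
        }
    }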
Fair enough.
> So, I found it easier to implement my own, and as it turns out, there are a few HORRIBLE bugs in the LLVM assembler (from cross reference testing)
Interesting claim, do you have any examples?
Then let me pick out some of my favorites that I found enlightening, and summarize the information I took from them.
By far, the most useful one is Khalil Estell's presentation last year [0]. It's a fairly fast-paced but relatively deep dive into exception mechanics. At the end, he advocates for a new tool that would audit a program to determine what exceptions could be thrown. I think that's a flipping fantastic idea for a tool. Unfortunately I haven't seen any progress toward it -- if someone here knows where his tool is, or a similar tool, please reply! I did send him an email a few months ago inquiring about it, but haven't received a reply. Nonetheless, the whole presentation was excellent in my opinion. I did see that he had another related presentation at ACCU this year [4] titled "C++ Exceptions are Code Compression" (which I can totally believe -- I've seen it myself in binary sizes), but I haven't seen it yet. I'll watch it later today.
Just about anything from Herb Sutter is good. I don't like that he works for Microsoft, but he does great stuff for C++, including the old Guru of the Week series [1]. In particular, his 2019 presentation [2] describes different error handling techniques, some difficulties and pitfalls in combining libraries with different error handling techniques, and leads up to explaining why std::expected came about. He does pontificate a lot though, so the presentation is fairly high level and slow paced.
Dave Watson's 2017 presentation [3] dives into a few different implementations of stack unwinding. It's good for understanding how different compilers implement exceptions with low or zero overhead, and what that "overhead" really measures.
So, there's about half a day of presentations to watch here. I hope that's not too much for you.
[0]: https://www.youtube.com/watch?v=bY2FlayomlE
[1]: https://herbsutter.com/gotw/
[2]: https://www.youtube.com/watch?v=ARYP83yNAWk
[3]: https://www.youtube.com/watch?v=_Ivd3qzgT7U
[4]: https://www.youtube.com/watch?v=LorcxyJ9zr4
If I'm honest, I've never looked into pmr, but I always thought that's where std keeps its arena allocators and such.
https://en.cppreference.com/w/cpp/header/memory_resource.htm...
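Skimming that header, it does look like where the arena allocators live. A minimal sketch, if I'm reading the docs right:

    #include <memory_resource>
    #include <vector>

    int main() {
        char buffer[4096];
        // monotonic_buffer_resource carves allocations out of `buffer` and
        // never frees them individually; everything is released at once
        // when the resource is destroyed.
        std::pmr::monotonic_buffer_resource arena(buffer, sizeof buffer);
        std::pmr::vector<int> v(&arena);
        for (int i = 0; i < 100; ++i) v.push_back(i);
    }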
https://chatgpt.com/share/68b5e0db-a6d0-8005-9101-d326d2af0a...
In any case, if you really believe mmap is great for an assembler, then sure, go ahead. But it's not.
I feel like this might mostly be useful as a reference, because currently RISC-V assembly's specification is mostly "what do GCC/Clang do?"
PTX isn't the assembly language; it's a virtual ISA, so you need a full backend compiler with tens to hundreds of passes to get to machine code.
It's imperfect, but I take any excuse to point out how bad monopolies are for customers. All you have to do is build the driver to see that "low priority" is a pretty broad term on the allegedly elite trillion dollar toolchain.
I'm not saying CUDA is unimpressive; it's a very, very, very hard problem. But if they were in an uncorrupted market, ptxas would be fast instead of devastating znver5 workstations with 6400 MT/s DDR5.
This last fact might actually support the current model, since its cost grows linearly-ish in the size of the instruction instead of being constant like a hash.
It is not part of RISC-V, nor supported by any CPUs outside that vendor's own.
Ofc, I did add my own bugs.