I wonder if you thought about perfect hashing instead of that comparison tree. Also, flex (as in flex and bison) can generate what amounts to trees like that, I believe. I haven't benchmarked it compared to a really careful explicit tree though.

netr0ute · 2h ago

I thought about hashing, but found that hashing would be enormously slow to compute compared to a perfectly crafted tree.

dafelst · 2h ago

But did you think about using a perfect hash function and table? Based on my prior research, it seems like they are almost universally faster on small strings than trees and tries due to lower cache miss rates.

dist1ll · 1h ago

Ditto. Perfect hashing strings smaller than 8 bytes has been the fastest lookup method in my experience.

netr0ute · 1h ago

Problem is, there are a lot of RISC-V instructions way longer than that (like th.vslide1down.vx), so hashing is going to be slow.

ashdnazg · 31m ago

You could copy the instruction to a 16 byte sized buffer and hash the one/two int64s. Looking at the code sample in the article, there wasn't a single instruction longer than 5 characters, and I suspect that in general instructions with short names are more common than those with long names.

This last fact might actually support the current model, as it grows linearly-ish in the size of the instruction, instead of being constant like hash.
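A minimal sketch of the "copy into a 16-byte buffer and hash one or two int64s" idea, under stated assumptions: `PackedName`, `fits_packed`, `pack_name`, and `mix` are hypothetical names, and the multiply constants are arbitrary odd numbers rather than a tuned perfect hash. Note that th.vslide1down.vx is 17 characters, so mnemonics like it would still need a fallback path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical limit: names up to 16 bytes fit in two 64-bit words.
constexpr std::size_t kMaxPacked = 16;

struct PackedName {
    std::uint64_t lo = 0;  // bytes 0..7 of the mnemonic
    std::uint64_t hi = 0;  // bytes 8..15, zero for short mnemonics
};

// Longer names must take some other lookup path.
inline bool fits_packed(std::string_view name) {
    return name.size() <= kMaxPacked;
}

inline PackedName pack_name(std::string_view name) {
    assert(fits_packed(name));
    unsigned char buf[kMaxPacked] = {};  // zero padding after the name
    if (!name.empty()) {
        std::memcpy(buf, name.data(), name.size());
    }
    PackedName p;
    std::memcpy(&p.lo, buf, 8);
    std::memcpy(&p.hi, buf + 8, 8);
    return p;
}

// Illustrative mixer only: a real perfect hash would pick its parameters
// so that every known mnemonic lands in a distinct table slot.
inline std::uint64_t mix(PackedName p) {
    std::uint64_t h = p.lo * 0x9E3779B97F4A7C15ull;
    h ^= p.hi * 0xC2B2AE3D27D4EB4Full;
    h ^= h >> 32;
    return h;
}
```

For mnemonics of 8 bytes or fewer, `hi` stays zero and the work is one multiply plus a shift, regardless of the exact length.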

Sesse__ · 59m ago

You're probably thinking of gperf, not flex and bison.

sylware · 5m ago

Oh, I remember I did a plain and simple C port of an old gperf, cgperf https://www.rocketgit.com/user/sylware/cgperf

Ofc, I did add my own bugs.

netr0ute · 4h ago

Hi everyone, I'm the author of this article.

Feel free to ask me any questions to break the radio silence!

benreesman · 3h ago

Nice work and good writeup. I think most of that is very sound practice.

The codegen switch with the offsets is in everything, first time I saw it was in the Rhino JS bytecode compiler in maybe 2006, written it a dozen times since. Still clever you worked it out from first principles.

There are some modern C++ libraries that do frightening things with SIMD that might give your bytestring stuff a lift on modern stupid-wide high mispredict penalty stuff. Anything by lemire, stringzilla, take a look at zpp_bits for inspiration about theoretical minimum data structure pack/unpack.

But I think you got damn close to what can be done, niiicccee work.

Sesse__ · 1h ago

FWIW, this is basically an implementation of perfect hashing, and there's a myriad of different strategies. Sometimes “switch on length + well-chosen characters” is good, sometimes you can do better (e.g. just looking up in a table instead of a long if chain).

The “value speculation” thing looks completely weird to me, especially with the “volatile” that doesn't do anything at all (volatile is generally a pointer qualifier in C++). If it works, I'm not really convinced it works for the reason the author thinks it works (especially since it refers to an article talking about a CPU from the relative stone age).
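To make the "switch on length + well-chosen characters" strategy concrete, here is a small sketch over a made-up set of seven mnemonics. The `Op` enum and `lookup` function are hypothetical and are not taken from Ultrassembler; the final string comparison in each branch rejects anything outside the known set.

```cpp
#include <string_view>

// Hypothetical opcode set, chosen only for illustration.
enum class Op { Unknown, Add, And, Or, Sub, Xor, Addi, Andi };

constexpr Op lookup(std::string_view s) {
    switch (s.size()) {
    case 2:
        return s == "or" ? Op::Or : Op::Unknown;
    case 3:
        switch (s[1]) {  // second character separates all four 3-letter names
        case 'd': return s == "add" ? Op::Add : Op::Unknown;
        case 'n': return s == "and" ? Op::And : Op::Unknown;
        case 'u': return s == "sub" ? Op::Sub : Op::Unknown;
        case 'o': return s == "xor" ? Op::Xor : Op::Unknown;
        default:  return Op::Unknown;
        }
    case 4:
        switch (s[1]) {  // 'd' vs 'n' separates addi from andi
        case 'd': return s == "addi" ? Op::Addi : Op::Unknown;
        case 'n': return s == "andi" ? Op::Andi : Op::Unknown;
        default:  return Op::Unknown;
        }
    default:
        return Op::Unknown;
    }
}
```

The table-based alternative would replace the inner switches with an array indexed by a small hash of the length and the chosen character, followed by the same single comparison.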

inetknght · 3h ago

Overall, this is a fantastic dive into some of RISC-V's architecture and how to use it. But I do have some comments:

> However, in Chata's case, it needs to access a RISC-V assembler from within its C++ code. The alternative is to use some ugly C function like system() to run external software as if it were a human or script running a command in a terminal.

Have you tried LLVM's C++ API [0]?

To be fair, I do think there's merit in writing your own assembler with your own API. But you don't necessarily have to.

I'm not likely to go back to assembly unless my employer needs that extra level of optimization. But if/when I do, and the target platform is RISC-V, then I'll definitely consider Ultrassembler.

> It's not clear when exactly exceptions are slow. I had to do some research here.

There are plenty of cppcon presentations [1] about exceptions, performance, caveats, blah blah. There are also other C++ conferences that have similar presentations (or even almost identical presentations, because the presenters go to multiple conferences), though I don't have a link handy because I pretty much only attend cppcon.

[0]: https://stackoverflow.com/questions/10675661/what-exactly-is...

[1]: https://www.youtube.com/results?search_query=cppcon+exceptio...

netr0ute · 3h ago

> LLVM's C++ API

I think I read something about this but couldn't figure out how to use it because the documentation is horrible. So, I found it easier to implement my own, and as it turns out, there are a few HORRIBLE bugs in the LLVM assembler (from cross reference testing) probably because nobody is using the C++ API.

> There are plenty of cppcon presentations [1] about exceptions, performance, caveats, blah blah.

I don't have enough time to watch these kinds of presentations.

NooneAtAll3 · 1h ago

isn't your MemoryBank already somewhere in std::pmr?

If I'm honest, I've never looked into pmr, but I always thought that that's where std has arena allocators and stuff

https://en.cppreference.com/w/cpp/header/memory_resource.htm...
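For reference, the arena-style allocator in std::pmr is `std::pmr::monotonic_buffer_resource`. The sketch below only shows what that standard facility looks like in use; whether it matches what MemoryBank does is an assumption, and `nops_stay_in_buffer` is a hypothetical function name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <vector>

// Fills a pmr vector with RISC-V NOP encodings (addi x0, x0, 0 = 0x00000013)
// and reports whether its storage ended up inside the stack buffer.
inline bool nops_stay_in_buffer() {
    constexpr std::size_t kCount = 64;
    alignas(std::max_align_t) std::byte buffer[4096];

    // Bump allocator: hands out slices of `buffer`, never frees individually,
    // and releases everything at once when `arena` is destroyed.
    std::pmr::monotonic_buffer_resource arena(buffer, sizeof buffer);

    std::pmr::vector<std::uint32_t> words(&arena);
    for (std::size_t i = 0; i < kCount; ++i) {
        words.push_back(0x00000013u);
    }
    assert(words.size() == kCount);

    // All growth steps together need far less than 4096 bytes here,
    // so the data pointer should lie within the buffer.
    const auto* p = reinterpret_cast<const std::byte*>(words.data());
    return p >= buffer && p < buffer + sizeof buffer;
}
```

If the buffer runs out, the resource falls back to its upstream allocator (the default resource in this sketch), so correctness does not depend on sizing the buffer exactly.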

jclarkcom · 1h ago

You might look into using memory mapped IO for reading input and writing your output files. This can save some memory allocations and file read and write times. I did this with a project where I got more than 10x speed up. For many cases file IO is going to be your bottleneck.

Sesse__ · 59m ago

mmap-based I/O still needs to go through the kernel, including memory allocation (in the page cache) and all. If you've got 10x speedup from mmap, it is usually because your explicit I/O was very inefficient; there are situations where mmap is useful, but it's rarely a high-performance strategy, as it's really hard for it to guess what your intended I/O patterns are just from the page faults it's seeing.
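For anyone who wants to measure this themselves, here is a minimal POSIX-only sketch of reading an input file through mmap. `count_lines_mmap` is a hypothetical helper, not part of any project discussed here. It avoids an explicit read buffer, but every page is still faulted in by the kernel, which is the cost described above.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <string_view>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Returns the number of '\n' bytes in the file, or -1 on any error.
inline long count_lines_mmap(const char* path) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // zero-length mappings are invalid

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;

    // Pages are loaded lazily as this loop touches them.
    std::string_view text(static_cast<const char*>(p), size);
    long lines = 0;
    for (char c : text) lines += (c == '\n');

    munmap(p, size);
    return lines;
}
```

Comparing this against a plain `read()` loop with a large buffer on the same input is the quickest way to see whether mmap helps for a given workload.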

msla · 3h ago

What's the difference between a Programming Furu and a Programming Guru? Is there a joke I'm missing?

netr0ute · 2h ago

Furus are "fake gurus." It comes from the Fintwit space where "furus" share their +1000% option trades as if they're geniuses in order to get you to sign up for their expensive Substack.

IshKebab · 4h ago

Neat, but it's not like assembly is really a bottleneck in any but the most extreme cases. LLVM and GAS are already very fast.

I feel like this might mostly be useful as a reference, because currently RISC-V assembly's specification is mostly "what do GCC/Clang do?"

drob518 · 46m ago

Exactly. I don’t know too many assembly language programmers who are griping about slow tools, particularly on today’s hardware. Yeah, Orca/M on my old Apple II with 64k RAM and floppy drives was pretty slow, but since then not so much. But sure, as a fun challenge to see how fast you can make it run, go for it.

benreesman · 3h ago

ptxas comes to mind.

gdiamos · 3h ago

ptxas is a bit of a misnomer - it actually wraps the entire NVIDIA driver backend compiler

PTX isn’t the assembly language, it is a virtual ISA, so you need a full backend compiler with 10s to 100s of passes to get to machine code

benreesman · 3h ago

I appreciate that hitting sm_70 through sm_120 in one call isn't the same as hitting RISC-V in one call, but I do a lot of builds just for sm_120 which is closer to a fair comparison.

It's imperfect, but I take any excuse to point out how bad monopolies are for customers. All you have to do is build the driver to see that "low priority" is a pretty broad term on the allegedly elite trillion dollar toolchain.

I'm not saying CUDA is unimpressive, it's a very, very, very hard problem. But if they were in an uncorrupted market ptxas would be fast instead of devastating znver5 workstations with 6400MT DDR5.

How is Ultrassembler so fast?

Comments (26)