I’ve been trying out various LLMs for working on assembly code in my toy OS kernel for a few months now. It’s mostly low-level device setup and bootstrap code, and I’ve found they’re pretty terrible at it generally. They’ll often generate code that won’t quite assemble, they’ll hallucinate details like hardware registers, and they’ll very often come up with inefficient code. The LLM attempt at an AP bootstrap (real mode to long mode) was almost comical.
All that said, I’ve recently started a RISC-V port, and I’ve found that porting bits of low-level init code from x86 (NASM) to RISC-V (GAS) is actually quite good - I guess because it’s largely a simple translation job and it already has the logic to work from.
simonw · 18h ago
> They’ll often generate code that won’t quite assemble
Have you tried using a coding agent that can run the compiler itself and fix any errors in a loop?
The first version I got here didn't compile. Firing up Claude Code and letting it debug in a loop fixed that.
noone_youknow · 11h ago
I have, and to be fair that has solved the “basically incorrect code” issue with reasonable regularity. Occasionally the error messages don’t seem helpful enough for it, which is understandable, and I’ve had a few occurrences of it getting “stuck” in a loop trying to e.g. use an invalid addressing mode (it may have gotten itself out of those situations if I were more patient) but generally, with one of the Claude 4 models in agent mode in cursor or Claude code, I’ve found it’s possible to get reasonably good results in terms of “does it assemble”.
I’m still working on a good way to integrate more feedback for this kind of workflow, e.g. for the attempt it made at AP bootstrap - debugging that is just hard, and giving an agent enough control over the running code and the ability to extract the information it would need to debug the resulting triple fault is an interesting challenge (even if probably not all that generally useful).
I have a bunch of pretty ad-hoc test harnesses and the like that I use for general hosted testing, but that can only get you so far in this kind of low-level code.
vidarh · 21h ago
Similar experience - they seem to have a lot more problems with ASM than with structured languages. I don't know whether that reflects less training data or just inherent difficulty.
73kl4453dz · 18h ago
As far as I can tell they have trouble with sustained satisfaction of multiple constraints, and asm has more of that than higher-level languages. (An old boss once said his record for bug density was in asm: he'd written 3 bugs in a single opcode.)
noone_youknow · 11h ago
I agree with this. Just the need to keep track of stack, flags and ad-hoc register allocations is something I’ve found they really struggle with. I think this may be why it does so much better at porting from one architecture to another - but even then I’ve seen it have problems with e.g. m68k assembly, where the rules for which moves affect flags are different from, say, x86.
msgodel · 21h ago
The few times I've messed with it I've noticed they're pretty bad at keeping track of registers as they move between subroutines. They're just not great at coming up with a consistent "sub language" the way human assembly programmers tend to.
LtdJorge · 22h ago
A bit tangential, but I've found 4 Sonnet to be much, much better at SIMD intrinsics (in my case, in Rust) than Sonnet 3.5 and 3.7, which were kind of atrocious. For example, 3.7 would write a scalar for loop and tell you "I've vectorized...", when I had explicitly asked it to do the operations with x86 intrinsics and given it the capabilities of the hardware. Also, telling it that AVX2 was supported wouldn't stop it from using SSE, or it would wrap the operations in runtime conditionals, which makes no sense. Claude 4 seems to solve most of that.
Edit: that -> than
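For what it's worth, the kind of explicit vectorization being asked for looks roughly like the sketch below - a hypothetical C example using the AVX2 intrinsics from immintrin.h (Rust's std::arch::x86_64 intrinsics mirror these names), compiled with AVX2 enabled, e.g. -mavx2. The axpy_* function names are just for illustration; the contrast is with a plain scalar loop presented as "vectorized":

    #include <immintrin.h>
    #include <stddef.h>

    /* Scalar loop: what a model sometimes produces while claiming
       it has "vectorized" the computation. */
    void axpy_scalar(float *y, const float *x, float a, size_t n) {
        for (size_t i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* Explicit AVX2 version: 8 floats per iteration, no runtime
       feature checks, since AVX2 support was stated up front. */
    void axpy_avx2(float *y, const float *x, float a, size_t n) {
        __m256 va = _mm256_set1_ps(a);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            vy = _mm256_add_ps(vy, _mm256_mul_ps(va, vx));
            _mm256_storeu_ps(y + i, vy);
        }
        for (; i < n; i++)  /* scalar tail for the remainder */
            y[i] += a * x[i];
    }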
noone_youknow · 11h ago
This fits my experience. I’m definitely getting considerably better results with 4 than previous Claudes. I’d essentially dropped sonnet from my rotation before 4 became available, but now it’s a go-to for this sort of thing.
userbinator · 1d ago
I wonder how many demoscene productions it was trained on. Probably not many, because stuff like this sticks out like a sore thumb:
First try worked but didn't use correct terminal size.
aargh_aargh · 20h ago
Tangent: godbolt.org greeted me with a popup but boy, I have never seen a clearer privacy notice, minimal possible data retention, including a diff with the last version. Great job, Matt!
Jare · 23h ago
Might be interesting to try this in ARM assembly, where there's likely a lot less existing code in the training set.
sitkack · 23h ago
It does fine on Arm assembly (and Neon).
ur-whale · 19h ago
Mmmmyeah, well, one thing LLMs are very decent at is translating, be it from human language to human language or from code to code, so I'm not sure your point stands.
broken_broken_ · 19h ago
The x64 assembly would probably work natively on the Mac, no need for Docker, provided the 2 syscall numbers (write and exit) are adjusted. Which LLMs can likely do.
If it’s an ARM Mac, under Rosetta. Otherwise directly.
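For reference, the two numbers in question are the values loaded into rax before the syscall instruction; a minimal sketch of the difference as C constants (the macro names are just illustrative):

    /* x86-64 syscall numbers used by a bare write/exit program. */
    #ifdef __APPLE__
    #define SYS_WRITE 0x2000004   /* macOS: BSD class (2 << 24), write = 4 */
    #define SYS_EXIT  0x2000001   /* macOS: BSD class (2 << 24), exit  = 1 */
    #else
    #define SYS_WRITE 1           /* Linux x86-64: write */
    #define SYS_EXIT  60          /* Linux x86-64: exit  */
    #endif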
worldsayshi · 23h ago
Given the price of Claude Code, I'm surprised more people don't go the route of using Claude through Aider with Copilot or something like that. Is Claude Code worth the extra expense?
petercooper · 21h ago
It's a lot more agentic. I'm an Aider fan and use it the most because I prefer its simplicity, but it tends to require you be more "involved" in development and decision making than tools like Claude Code which can cycle around more on their own to figure things out and make decisions (which might not be the ones you want).
wiz21c · 21h ago
> which might not be the ones you want
And in your experience, how often is that?
petercooper · 18h ago
Hard to quantify, but as an opinionated developer I've found that AI systems with too much leash will often head off in directions I don't appreciate and have to undo. That can be desirable in areas where I have less knowledge or want more velocity, but a tighter cycle with smaller steps lets me maintain more of my taste by making more concrete decisions rather than merely pointing in a direction.
suddenlybananas · 1d ago
Googling "Mandelbrot set in assembly" returns a bunch of examples of this.
djaychela · 1d ago
It does.... I was just surprised that it turned up as terminal output - for some reason I was expecting something in some form of GUI window for some OS or other but I guess that's orders of magnitude more complex and more likely to not work. But he did actually ask for ASCII output, so that does make sense - unlike my assumption!
ale42 · 23h ago
I think that opening a window and rendering something inside it using the native Win32 API from assembly code on Windows would not be so terrifyingly complex. It's just more code as it needs to call the appropriate GUI APIs (not just syscalls), and it's OS-specific... but such code is anyway always OS-specific (the one mentioned here seems to be for Linux, given the used syscalls). No idea how complex it would be with X or on Mac, as I don't know their low-level GUI APIs.
piker · 1d ago
I actually expected the struggle to continue based on experience. Though these things can produce some magical results sometimes.
askl · 21h ago
An ASCII-art Mandelbrot seems like the perfect toy example, with tons of examples to copy from. That's what LLMs are really good at.
abujazar · 13h ago
DeepSeek actually does this in one go.
revskill · 20h ago
LLMs are useless in a real-world codebase. Tons of hallucination and nonsense. Garbage everywhere. The dangerous thing is they mess things up randomly, with no consistency at all.
It is fine to treat it as a better autocompletion tool.
ur-whale · 22h ago
The code seems to be doing calculations with integers instead of floats.
If so, why?
danbruc · 21h ago
Floating-point calculations used to be a lot slower than integer calculations, so it was very common to use fixed-point numbers.
Also, for good performance you would usually not do what this code does, which is calculate the coordinates for each pixel explicitly. Instead you would calculate the coordinates of the starting corner and the delta between adjacent pixels, and then just add the delta each time you move to the next pixel. That is also generally easier to do with fixed-point numbers, as adding 0.1 a thousand times with floating-point numbers will not yield exactly 100, because 0.1 is not exactly representable in base-2 floating point. For this visualization that probably does not matter too much, but if you care about not being slightly off and you want to calculate things incrementally, doing it in fixed point can make things easier.
I have no clue about the first point on current hardware - whether floating-point calculations are still notably less performant, whether there is a relevant difference in the number of execution units, and so on.
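A minimal C sketch of that incremental, fixed-point approach (the 16.16 format, the view rectangle, and the 80x24 size are arbitrary choices for illustration):

    #include <stdint.h>

    /* 16.16 fixed point: value = raw / 65536 */
    #define FP_SHIFT 16
    #define FP(x)    ((int32_t)((x) * (1 << FP_SHIFT)))

    int main(void) {
        int width = 80, height = 24;

        /* Compute the corner coordinates and the per-pixel delta once... */
        int32_t x0 = FP(-2.0), x1 = FP(1.0);
        int32_t y0 = FP(-1.0), y1 = FP(1.0);
        int32_t dx = (x1 - x0) / width;
        int32_t dy = (y1 - y0) / height;

        /* ...then just add the delta for each pixel instead of recomputing
           x0 + col * step every time. With integers the accumulation is
           exact, so no drift builds up across a row. */
        int32_t cy = y0;
        for (int row = 0; row < height; row++, cy += dy) {
            int32_t cx = x0;
            for (int col = 0; col < width; col++, cx += dx) {
                /* per-pixel work (e.g. the escape-time loop) goes here */
            }
        }
        return 0;
    }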
adrian_b · 15h ago
In very old CPUs and in many microcontrollers, floating-point operations are slower than integer operations.
However, in many old but not very old CPUs, e.g. in many from the last decade of the 20th century and from the first decade of this century, floating-point multiplication and division were much faster than integer multiplication and division.
So for those CPUs, which include models like Pentium 4, or even the first Intel Core models (those older than Nehalem, which got a dedicated integer multiplier), there were many cases when converting an integer computation into a floating-point computation could increase the speed many times.
dspillett · 19h ago
This is _very_ common, so I'm not surprised an LLM would use the method.
Integers are much faster to process than floating point, so if you have a fixed, acceptable lower bound on precision, it is usually a good optimisation to scale your values so they exist in integer space - 12345678 instead of 1.2345678 - perform the mass calculations on those integers, and scale back down for subsequent display.
As well as speed, this also removes the rounding issues inherent in floating point (see https://floating-point-gui.de/), which can balloon over many iterations (assuming your scaling to ints gives enough precision for your use). This is less important for fractal calculations than the speed issue, though it can be visible at very high magnifications, and it is a bigger issue for things like monetary work. Monetary values are often stored as pennies (or the local minimum subdivision) rather than the nominal currency, so £123.45 is stored and processed as 12345, or, for calculations where fractions of pennies may be significant, as thousandths of pennies (£123.45 -> 12345000).
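A small, hypothetical C illustration of the scaled-integer idea (the amounts and loop count are made up):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Store money in pennies, not pounds: £123.45 -> 12345. */
        int64_t price_pence = 12345;

        /* Repeated integer addition stays exact. */
        int64_t total = 0;
        for (int i = 0; i < 1000; i++)
            total += price_pence;
        printf("integer total: %lld pence\n", (long long)total);  /* exactly 12345000 */

        /* The floating-point equivalent can accumulate rounding error,
           since 123.45 is not exactly representable in binary. */
        double total_f = 0.0;
        for (int i = 0; i < 1000; i++)
            total_f += 123.45;
        printf("double total:  %.10f pounds\n", total_f);  /* generally not exactly 123450 */
        return 0;
    }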
ur-whale · 19h ago
> Integers are much faster to process than floating point
This was common and accepted knowledge circa 1990-2005.
Is that still the case?
In 2025, I'm not so sure, and if speed is indeed what you're after, what you typically want is to pump calculations into both the floating point units and the integer units in parallel using SIMD instructions, something that should be easy on an embarrassingly parallel problem like Mandelbrot.
And regarding precision ... I was under the impression that proving with 100% certainty that a point (x,y) actually belongs to the Mandelbrot set is basically impossible unless you prove some sort of theorem about that specific point.
The numerical method is always something along the lines of "as long as the iteration hasn't diverged after N iterations we consider the point to be inside".
And both N and the definition of "diverged" are usually completely arbitrary ... so precision, meh, unless you're going to try to draw something at a zoom of 1e+20 on the frontier.
About performance: floating point is definitely much faster than integer arithmetic these days, for most comparable operations.
One more point about precision: integer arithmetic is exact, and so if you can work with a reasonably small bound (e.g. [-2,2]^2 for Mandelbrot) then integer arithmetic actually becomes more accurate than FP for the same number of bits.
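Concretely, the escape-time test being described is just the loop below (a sketch in C; max_iter and the radius-2 bailout are the usual arbitrary conventions, and doubles are used here only for brevity):

    /* Iterate z = z^2 + c and report the iteration at which |z| exceeded 2,
       or max_iter if it never did - in which case the point is assumed to
       be in the set, not proven to be. */
    int mandel_iters(double cx, double cy, int max_iter) {
        double zx = 0.0, zy = 0.0;
        for (int i = 0; i < max_iter; i++) {
            if (zx * zx + zy * zy > 4.0)  /* |z| > 2: guaranteed to escape */
                return i;
            double tmp = zx * zx - zy * zy + cx;
            zy = 2.0 * zx * zy + cy;
            zx = tmp;
        }
        return max_iter;
    }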
adrian_b · 15h ago
With most available software packages, floating-point arithmetic remains much faster than integer arithmetic, but only because those packages do not use the more recent CPU features - thanks to Intel's stupidity, there are still many CPUs which do not support those features.
Already in 2018, with Cannon Lake, Intel introduced instructions (IFMA) which can reuse, for integer computations, the same vector multipliers and adders that are provided for floating-point computations. However, Intel later removed AVX-512 from its consumer CPUs. The latest Intel CPU models have introduced a slower AVX-encoded variant of IFMA, but this is too little and too late compared with what AMD offers.
Using IFMA and the other AVX-512 integer vector instructions, where supported, it is possible to obtain very similar throughput regardless of whether the computation is done with integers or floating-point numbers.
Hopefully, with the growing installed base of Zen 4 and Zen 5 based computers, there will be an incentive to update many software packages to take advantage of their capabilities.
dspillett · 16h ago
> floating point is definitely much faster than integer arithmetic these days, for most comparable operations.
Is that true of code running on CPUs? My GPU can run rings around the CPU at floating point, and probably around its own integer performance too, but I find it hard to believe any CPU is faster at a floating-point version of an algorithm than at an integer version of the same (assuming both are well, or equally badly, optimised).
Maybe it would have generated floating-point code if it had been prompted for a generator in x87 assembly? x87 did originate as an extension to x86 on a separate chip, so that could explain the AI sticking to integers.