"FP32 is less common in modern ML workloads and often less optimized on recent hardware compared to FP16 or BF16, which may partly explain why it’s easier to achieve performance gains over PyTorch with FP32 kernels."
People haven't spent time optimizing the fp32 versions of these kernels in years. This will be much more interesting if they can improve the kernels that real developer effort has gone into and that are actually used.
adrian_b · 2h ago
I believe that these good results are explained at least in part by the fact that NVIDIA does not provide detailed enough documentation for their GPUs.
For a processor with well-documented microarchitecture, for which a programmer or a compiler can deterministically write an optimal program, it is much less likely that applying ML/AI can be successful, except as a substitute for searching already known solutions.
On the other hand, for less documented microarchitectures, like those of the NVIDIA GPUs, finding an optimal program may be impossible other than by doing a random search guided by examples of previous optimized programs, and possibly doing some reverse-engineering work to determine the real behavior of the GPU in some circumstances.
Improving over something like this is likely to be feasible for ML/AI, where training over known good programs may be able to extract some of the undocumented behavior that may be non-obvious for humans reading those examples.
almostgotcaught · 1h ago
What in the world are you saying
> for which a programmer or a compiler can deterministically write an optimal program
Why do you believe a programmer can't "deterministically write" a program for NV? Do you think that PTX->SASS translation is somehow non-deterministic? Do you also believe the same thing about x86...?
Why are there so many tinfoil hot-takes on hn? It's like everyone is always claiming some kind of deeply controversial insight to seem so wise.
throwaway81523 · 20m ago
The running time of a CUDA kernel is apparently impossible to determine except by experiment and measurement, and might be nondeterministic. By contrast for a more typical CPU, there's a compiler whose assembly output you can examine, and there's a processor manual that gives the cycle timing of each instruction. So you can compute the running time at least of inner loops that stay in cache, and that sort of thing.
speerer · 1h ago
The point was about being able to write an optimal program with certainty, not about just getting the thing to operate.
No comments yet
suddenlybananas · 8h ago
I wonder if it's using known improvements from the fp16/bf16 kernels that are transferable to fp32?
moralestapia · 8h ago
>People haven't spent time optimizing the fp32 versions of these kernels in years.
Wow, so, you're basically saying the AI created new algos in a domain with no pre-existing solutions? Awesome!
Aurornis · 6h ago
No one said the AI created new algorithms nor that there weren’t pre-existing solutions.
The implication was that the FP32 versions of these kernels have lagged behind the more popular versions. There was opportunity to translate the advancements from other kernels into these. Someone would need to look closely to see exactly what was done, but it’s premature to suggest anything like “new algos” or “no pre-existing solutions”
This is a great use case for LLMs, though. I often do something similar where I make improvements to something I use most frequently and ask an LLM to translate that pattern to other similar parts of the code.
moralestapia · 6h ago
>The implication was that the FP32 versions of these kernels have lagged behind the more popular versions.
Help me understand this 'cause I'm a bit slow these days ...
Does that mean optimized FP32 versions of these kernels were already there or not?
almostgotcaught · 1h ago
> Help me understand this 'cause I'm a bit slow these days ...
If I do `sed 's/f32/f16/g' kernel.cu` does this count as AI? Help me understand because I'm a little slow when it comes to all the dumb shit people attribute to LLMs these days...
vlovich123 · 4h ago
The solution not existing in PyTorch does not mean the solution doesn’t exist elsewhere on the internet. Remember - PyTorch is largely maintained by employees of companies that have their own priorities for the SW and those priorities may not include hyper optimizing fp32 kernels.
That being said, it is cool if AI is enabling lower cost adoption of better more optimized kernels with less effort.
uoaei · 5h ago
The hype cycle in action, folks. Pay heed.
No comments yet
ekelsen · 8h ago
"the reference code is in the default FP32, and given a tolerance threshold (1e-02)"
that's a huge tolerance and allows them to use fp16 operations to replace the "fp32" kernel.
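For intuition about how loose that is, here is a minimal PyTorch sketch (not from the post; the shapes and the FP16 simulation are assumptions). Merely rounding the inputs to FP16 produces error in the neighborhood of the 1e-02 threshold, while a genuine FP32 kernel sits orders of magnitude below it, so clearing that check says very little about whether FP32 precision was preserved.

```python
import torch

torch.manual_seed(0)
a, b = torch.randn(256, 256), torch.randn(256, 256)

# High-precision reference to measure both variants against.
ref = a.double() @ b.double()

# A genuine FP32 matmul.
fp32 = (a @ b).double()

# Simulated FP16 data path: round the inputs to FP16, accumulate in FP32
# (roughly what a tensor-core GEMM does).
fp16ish = (a.half().float() @ b.half().float()).double()

def max_abs_err(x):
    return (x - ref).abs().max().item()

print(f"fp32 kernel error:     {max_abs_err(fp32):.1e}")     # well below 1e-02
print(f"fp16-input path error: {max_abs_err(fp16ish):.1e}")  # comparable to 1e-02
# A correctness check with atol/rtol of 1e-02 is far looser than FP32's own
# error, so passing it says little about whether FP32 precision was preserved.
```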
unignorant · 7h ago
yeah, it seems likely the underlying task here (one reasoning step away) was: replace as many fp32 operations as possible in this kernel with fp16. i'm not sure exactly how challenging a port like that is, but intuitively seems a bit less impressive
maybe this intuition is wrong but would be great for the work to address it explicitly if so!
This means the results are useless. Did they even check the relative error at all?
Replacing float32 operations with float16 is also pointless. There is nothing to be gained by doing this, as it removes the actual accuracy advantage of float32s, which would be the single most important reason to use that version of the algorithm.
thorum · 9h ago
My takeaway - from this article, from Google’s AlphaEvolve [1], and the recent announcement about o3 finding a zero day in the Linux kernel [2] - is that Gemini Pro 2.5 and o3 in particular have reached a new level of capability, where ideas that were tried unsuccessfully with other models suddenly just work.
[1] https://deepmind.google/discover/blog/alphaevolve-a-gemini-p...
[2] https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...
In my opinion, I wouldn’t say so much that they are suddenly working. Rather, we’ve reached a point where they can iterate and test significantly faster than humans can, and can draw on significantly more immediately available information that they can make sense of. As a result, the combination of information, advancement, and intelligently applied brute force seems to be having success in certain applications.
thorum · 7h ago
Good points. I suspect that o3 is able to reason more deeply about different paths through a codebase than earlier models, though, which might make it better at this kind of work in particular.
westoncb · 6h ago
I was blown away by some debugging results I got from o3 early on and have been using it heavily since. The early results that caught my attention were from a couple cases where it tracked down some problematic cause through several indirect layers of effects in a way where you'd typically be tediously tracing step-by-step through a debugger. I think whatever's behind this capability has some overlap with really solid work it'll do in abstract system design, particularly in having it think through distant implications of design choices.
MangoToupe · 49m ago
In the context of LLMs, what do you mean by "reason"? What does reasoning look like in LLMs and how do you recognize it, and more importantly, how do you invoke it? I haven't had much success in getting LLMs to solve, well, basically any problem that involves logic.
Chain of thought at least introduces some skepticism, but that's not exactly reasoning. It makes me wonder what people refer to when they say "reason".
suddenlybananas · 44m ago
People think an approximation of a thing is the thing.
therealpygon · 7h ago
Very likely. Larger context is significantly beneficial to the LLMs when they can maintain attention, which was part of my point. Imagine being able to hold the word-for-word text of your required reading book while you are taking a test, while older models were more like a couple of chapters' worth of text. And that was only two years ago.
geraneum · 1h ago
It’s true that there are similarities between what you mentioned and what’s happening in this case. From the article:
> The result is a test-time loop that looks less like “chat with a compiler” in the case of sequential revision, and more like structured exploratory search, guided by explicit optimization hypotheses and aggressively parallel evaluation.
My conclusion would be that we’ve now learned to apply LLMs’ capabilities to shrink solution space where we have a clear evaluation function as well as solutions to problems that might follow similar patterns. This applies in this case as well.
IMO, It’s not about model X gaining on other models or model Y being able to reason about the solutions, etc. in a way that other models couldn’t.
MangoToupe · 50m ago
Interesting. Do you have stronger evidence to support your claim? A sample size of one is pretty unconvincing.
zozbot234 · 8h ago
Wait, what are you saying? These have nothing to do with the Linux kernel whatsoever, they are "kernels" in the GPU programming sense. Did you just hallucinate this whole comment or what?
thorum · 7h ago
Sorry, I added links! Just a week ago someone built a system that used o3 to find novel zero days in the Linux kernel’s SMB implementation.
stefan_ · 7h ago
There's zero days in obscure parts of the kernel nobody uses every other day. (It also of course found 100 other things that were not zero days or vulnerabilities, yet professed they were, which is why this trash, even on Gemini 9000 Pro, keeps spamming security mails)
None4U · 8h ago
There was a post on HN a bit ago from someone who used o3 to find a vulnerability in the Linux kernel's SMB server, which this person is just saying should've been tried earlier and probably recently became possible
jiggawatts · 8h ago
Gemini Pro 2.5 is the first AI that I can productively use for anything other than human language translation, but it's just barely crossed that threshold. Sometimes I get success hit rates below 20%.
When 3.0 comes out, that... that's going to start getting a little scary.
manmal · 7h ago
o3 is in my experience often even better, but too slow and too rate limited to use it all the time.
jacob019 · 7h ago
What domain?
jiggawatts · 6h ago
SRE / DevOps / coding mostly in the Azure and .NET ecosystems.
The problems I have to solve tend to be the horrible ones that nobody has answers to, anywhere on the Internet, so unsurprisingly the AIs aren't good at it either.
The trick has been to use the AIs for what they are good at, which used to be "nothing" for me at least, but now I can use them productively for certain "spot" tasks.
Random examples:
- Cross-language and cross-platform benchmarking of a bunch of different database clients to see how they stack up. I gave the AI a working example in one language and got it to whip up a series of equivalents with other DB drivers and languages. Sure, it's trivial, but it's way faster than doing it myself!
- Crash dump analysis using WinDbg. I read somewhere that "vibe debugging" of kernel dumps totally works, so when I had an actual crash I gave it a go for laughs. With AI help I managed to extract the name of the specific file that had NTFS corruption and was crashing the server. Deleted the file, restored it from backups, and the server was good to go again!
- If you ever watch the top mechanical engineers on YouTube, they all make their own tools instead of just buying them. Jigs, extenders, unusual sizes, etc... IT work is the same. As a recent example, I got Gemini to make me a code-AST rewriter for a specific issue I wanted to clean up in bulk across a huge code base. Using the Roslyn compiler SDK is a bit fiddly, but it spat out a working tool for me in under an hour. (This is not something you can solve with a script full of regex, it needed a proper parser to handle commented-out blocks and the like.)
jacob019 · 5h ago
Sounds like interesting work, thanks for sharing! "Vibe debugging", hah, I like that one. The latest crop of models is definitely unlocking new capabilities, and I totally get the desire to make your own tools. I do that to a fault sometimes, but it's nice to have a simple tool that does exactly one thing, exactly the way you want it.
I've been pair programming with the models for a while, and wrote some "agents" before I knew to call it that back in the dark days of GPT-3.5, but only recently with the latest models unlocking capabilities beyond what I could achieve with handwritten code.
mholm · 4h ago
> Sure, it's trivial, but it's way faster than doing it myself
That's the clincher for me. So much software work is just executing on a design, not inventing anything new. Being able to do 5x the trivial work in an hour is life changing, and it lets me pull my head out of that work to see how I can make larger process improvements. AI doesn't need to rewrite the Linux kernel in Rust to be extremely valuable to the average developer.
vessenes · 6h ago
By far the most interesting part (after the 400% speed up in some cases) is the methodology: rather than hill climb on operations, they forced a language reasoning step between iterations to encourage diversity of search. This seems to have worked. Very very interesting.
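For anyone curious what that loop looks like in outline, here is a rough sketch (the function names are hypothetical placeholders, not the authors' harness): each round asks the model for a natural-language optimization hypothesis before generating code, samples many candidates per hypothesis in parallel, and only correctness-checked candidates compete on measured speedup.

```python
import random

# Hypothetical stand-ins for the real pieces (an LLM client and a kernel
# benchmark harness). This is only a sketch of the loop's shape.

def propose_idea(parents):
    # LLM: produce a natural-language optimization hypothesis from strong variants.
    return f"hypothesis derived from {len(parents)} prior kernels"  # placeholder

def generate_kernel(idea, parent_src):
    # LLM: turn the hypothesis plus a parent kernel into new CUDA source.
    return parent_src + f"\n// {idea}"  # placeholder

def is_correct(src):
    # Compile and compare against the PyTorch reference on random inputs (stubbed).
    return random.random() > 0.3

def measured_speedup(src):
    # Benchmark against the PyTorch baseline (stubbed).
    return random.uniform(0.5, 2.0)

pool = [(1.0, "// baseline kernel source")]   # (speedup, source) of known-good variants

for _round in range(5):
    parents = sorted(pool, reverse=True)[:4]  # keep a few strong, diverse parents
    for _, parent_src in parents:
        idea = propose_idea(parents)          # explicit reasoning step between attempts
        for _ in range(8):                    # aggressively parallel sampling per idea
            src = generate_kernel(idea, parent_src)
            if is_correct(src):               # correctness gate before any speed ranking
                pool.append((measured_speedup(src), src))

print(f"best speedup found: {max(pool)[0]:.2f}x")
```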
lucidrains · 6h ago
oh wow, I was so focused on looking for use of islands or map-elites that I missed this.. thought it was the blandest memetic evolution possible
vessenes · 2h ago
Just anecdotally I feel like hill climbing on operations is just so slow; I’m not saying it doesn’t work, but it always feels one step away from brute force search. I really like the idea of just throwing stuff at the LLM and giving it access to old strong variants in context.
yahoozoo · 9h ago
Very cool. They used o3 and Gemini 2.5 Pro but unfortunately they don’t mention which one produced the better kernels.
Workaccount2 · 9h ago
Very fascinating result, and it seems they wrote this blog post out of pure excitement to share their findings, and maybe to have someone throw cold water on it before publishing, ha.
Who knows if this is the actual fabled path of "self improvement", but results like this are what we expect to find on such a path.
suddenlybananas · 9h ago
> Who knows if this is the actual fabled path of "self improvement"
Seems doubtful as this works only on an extremely well-defined evaluation function.
observationist · 9h ago
Each time you define another task well enough for the system to work, you generalize the system just a little bit - repeat enough times and you can start to expand, develop taxonomies of functions, precisely define function spaces and metrics for improvement. This might not be a bootstrap for recursive self improvement generally, but it could definitely inform the theory or design of a system that does bootstrap rsi.
suddenlybananas · 9h ago
That's an entirely different idea that may or may not work. This is not evidence of that.
observationist · 8h ago
The structure of their research - the process, the specific task, and the data they generate - will help inform how other research gets performed. Instead of GPU kernels, maybe the next task is something like neuron modules, looking for structures that improve on attention blocks, or things like that - each time you run through an experiment like this, you're creating foundational data upon which other experiments can be run and improved. Once you've done enough of them, you can generalize.
It could be that the end result is the knowledge of strict boundaries of LLM capabilities, that they can only operate in specific domains, or only improve to a certain extent, and some currently unspecified defect limits the level of improvement.
The underlying idea of specifying a domain and task conditions, then letting an LLM run thousands of experiments, is a great search technique. The hope is that there is no implicit defect and that the methodology will extend and generalize - it's not too complex a notion to think that you could have an LLM create a broad range of individual tasks, with a meta-goal of identifying better and more general recursive improvement processes and algorithms.
suddenlybananas · 8h ago
>The hope is that there is no implicit defect and that the methodology will extend and generalize - it's not too complex a notion to think that you could have an LLM create a broad range of individual tasks, with a meta-goal of identifying better and more general recursive improvement processes and algorithms
Again, entirely different idea that doesn't have a straightforward evaluation function. As it stands, this is more akin to genetic programming with a very good mutation function.
EMIRELADERO · 9h ago
That may be true, but this is the first example I've seen where the concept is successfully implemented in a noticeable way.
It's just like image generation: the first iteration is the worst it will ever be.
brrrrrm · 8h ago
what's going to be interesting is to see the large space of fused kernels being tackled by AI generated code. that might include gemm + relu + gemm + a norm of some kind - which would be annoyingly exhaustive to 1. sweep with a tuner and 2. handwrite as a human
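For reference, the unfused PyTorch version of that chain is just a few lines (a sketch; the shapes are made up), and without fusion each step is its own kernel launch plus a round trip through global memory:

```python
import torch

x = torch.randn(64, 512)
w1 = torch.randn(512, 2048)
w2 = torch.randn(2048, 512)
norm = torch.nn.LayerNorm(512)

def block(x):
    h = x @ w1            # gemm
    h = torch.relu(h)     # elementwise
    h = h @ w2            # gemm
    return norm(h)        # a norm of some kind

out = block(x)

# torch.compile (PyTorch 2.x) will fuse some of the elementwise/norm work
# around the GEMMs, but a single hand-written or generated kernel for the
# whole block is the exhaustive-to-sweep case the comment is pointing at.
compiled_block = torch.compile(block)
```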
AtlasBarfed · 7h ago
Uh, what is a "kernel" in the sense of AI? Because it sure looks like this isn't an OS kernel.
https://cvw.cac.cornell.edu/gpu-architecture/gpu-characteris...
A function that is meant to be executed in parallel on an attached GPU is called a kernel. In CUDA, a kernel is usually identified by the presence of the __global__ specifier in front of an otherwise normal-looking C++ function declaration.
adityamwagh · 8h ago
Sometimes I think of LLMs as kind of a hive mind. It’s trained on thought processes of so many humans. I think that’s why it’s able to do these kinds of things given the fact that it has so much information and context compressed in weights.
MangoToupe · 8h ago
The market itself is also kind of a hive-mind metaphor. Worth thinking about.
suddenlybananas · 8h ago
Maybe we could replace it with a central planning now that we can distill information.
MangoToupe · 8h ago
Whoops you just did a communism
gpm · 8h ago
A "vertical integration" in the capitalist world ;)
MangoToupe · 7h ago
This got a legitimate chortle out of me
yieldcrv · 6h ago
a non-human standing committee following the directives of a trust could work
MangoToupe · 56m ago
What like you want to govern by divining patterns of snake coils or bird guts?
MangoToupe · 52m ago
> Our results are benchmarked on an Nvidia L40S
At the very least they could have used consumer hardware. I don't even know how to parse that model name, it's so consumer-alien.
david-gpu · 5h ago
Disclaimer: This used to be my bread and butter, but I'm really rusty after five years of not working on this sort of stuff.
That said, after quickly skimming the example AI-generated kernel I am not seeing anything novel there. While working at nVidia I did see a handful of techniques that, frankly, blew my mind.
Thus, I wonder what makes this AI-generated kernel faster than the standard pyTorch kernel, which I presume is simply delegating all the heavy lifting onto cuDNN. My guess, and it's just a guess, is that they are comparing the fastest AI-generated kernel they produced for a very particular set of parameters against whatever kernel cuDNN is picking for that same scenario, and perhaps the subsystem inside cuDNN that picks which kernel to execute out of the very large database it manages chose a suboptimal candidate. Researchers tend to completely ignore this issue and assume that cuDNN is always able to choose the very best kernel in every possible scenario, something that is just not realistic.
Maybe there is something else going on, but these sort of "we have beaten this heavily optimized proprietary library" always seem to miss this very important point.
Kind regards to any NVidia insiders who may read this. You guys are the brightest people I've ever met.
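For what it's worth, one cheap way to see how much the baseline depends on cuDNN's kernel selection (a sketch, assuming a CUDA device; the conv shape is arbitrary) is to compare the default heuristic pick against autotuned selection for the same PyTorch call:

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

assert torch.cuda.is_available(), "needs a CUDA device"

x = torch.randn(32, 64, 56, 56, device="cuda")
w = torch.randn(128, 64, 3, 3, device="cuda")

def time_conv(label):
    F.conv2d(x, w, padding=1)          # warmup (triggers autotuning if enabled)
    torch.cuda.synchronize()
    t = benchmark.Timer(stmt="F.conv2d(x, w, padding=1)",
                        globals={"F": F, "x": x, "w": w})
    print(label, t.timeit(100))

# Default: cuDNN picks a kernel heuristically from the shapes.
torch.backends.cudnn.benchmark = False
time_conv("heuristic pick: ")

# Autotuned: cuDNN times several of its kernels and caches the winner.
# If these two differ noticeably, the "PyTorch baseline" in a speedup claim
# depends on which kernel the heuristic happened to choose.
torch.backends.cudnn.benchmark = True
time_conv("autotuned pick: ")
```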
JSR_FDED · 7h ago
Could this be used to create kernels for OpenCL, ROCm, etc?
reliabilityguy · 9h ago
Is my understanding correct that they assumed a fixed size of the input?
If so, why is it surprising that generic implementations in PyTorch are worse?
GaggiX · 9h ago
Pytorch uses different kernels depending on the input size. There is a reason why it's so massive to download.
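To the parent's point about fixed sizes, the same call also reaches very different throughput at different shapes (a quick sketch; shapes are arbitrary and it runs on whatever device is available), which is part of why a kernel specialized for one fixed shape has room to win over a generic dispatch:

```python
import torch
from torch.utils import benchmark

device = "cuda" if torch.cuda.is_available() else "cpu"

for n in (128, 512, 2048):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    t = benchmark.Timer(stmt="a @ b", globals={"a": a, "b": b}).timeit(50)
    gflops = 2 * n**3 / t.mean / 1e9   # t.mean is seconds per run
    print(f"{n:5d}x{n}: {t.mean * 1e6:9.1f} us   {gflops:8.1f} GFLOP/s")
```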
reliabilityguy · 8h ago
Sure, some degree of customization is expected. However, I doubt that PyTorch implements every input size separately.
constantcrying · 8h ago
>and test for correctness by checking the numerical equality of the two outputs over many random inputs.
This is fundamentally different to how any human would approach this problem. And also different to how some recent advances in this area were made, where AI actually came up with superior and correct algorithms.
This approach also seems quite unfortunate and makes many of these results somewhat doubtful.
IIRC there was another paper recently, with a similar methodology, about computing xAx. Those papers produce algorithms which aren't merely empirically correct, but provably correct. They do this by operating on a graph data structure which describes the algorithm, and then verifying the algebraic equality to the correct result.
There is a substantial difference here. And I think utilizing algorithms which only are empirically correct can be dangerous.
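As a toy illustration of that gap (nothing to do with the kernels in the post; both functions are hypothetical stand-ins): a random-input check of the kind quoted above happily passes an implementation that is wrong on inputs the sampler never produces.

```python
import torch

def var_reference(x):
    # Two-pass variance: numerically robust.
    return ((x - x.mean()) ** 2).mean()

def var_fast(x):
    # One-pass shortcut E[x^2] - E[x]^2: fewer passes over the data, but it
    # cancels catastrophically when |mean| >> std.
    return (x * x).mean() - x.mean() ** 2

# "Correctness" over many random inputs, with the post's 1e-02 tolerance:
torch.manual_seed(0)
ok = all(
    torch.allclose(var_fast(x), var_reference(x), atol=1e-2, rtol=1e-2)
    for x in (torch.randn(4096) for _ in range(100))
)
print("passes the random-input check:", ok)   # True

# ...and yet it is not correct in general:
x = torch.randn(4096) + 1e4
print(var_fast(x).item(), var_reference(x).item())  # wildly different results
```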