This is very cool. I enjoyed going through the writeup and GitHub README.
I was wondering if these same optimizations can be brought to bear on training as well, rather than only inference. I guess the challenge here is fusing backward computations with gradient communication.
I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across this paper that does exactly this:
FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667
Thanks for reading the post and GitHub README. Supporting training is definitely feasible, but the benefit may not be as significant as for low-latency inference, since training generally involves much larger kernels, making kernel launch overhead less significant.
Thanks for sharing the FlashDMoE work. Our next step is to support MoE models. Stay tuned!
kp1197 · 1h ago
After working pretty closely with vLLM and SGLang over the past few months, this is EXACTLY what I had envisioned a successor project would look like - analyzing an operation dependency graph and then fusing (or, at a minimum, scheduling tasks smarter). Congrats to the team.
zhihaojia · 1h ago
Thanks a lot for your positive feedback! We believe that MPK can enhance existing LLM serving systems, especially for low-latency LLM serving. We are very excited about the opportunity to collaborate with others in this direction.
bdbenton5255 · 15m ago
Certainly an important discovery for utilizing these models on scaled hardware. This approach could also be applied beyond LLMs to other types of neural networks. That would be an interesting space to explore.
baq · 2h ago
Next step - compile straight to verilog so I can buy some LLMs on aliexpress
But I suspect GPU-style parallel computing is going to dominate accelerated computing.
General-purpose CPUs are going to stay on as the little brain that orchestrates the GPUs.
The idea of compiling software directly to hardware may never become mainstream.
baq · 1h ago
I'm thinking more like pseudointellect over serial to attach a $3 esp32 to. Since it's basically tokens in, tokens out, let's just cut the unnecessary parts out. It's like querying the cloud models, except it's your silicon you personally soldered to the esp so nobody will break your home assistant with a system prompt update or a fine tuning run.
fxtentacle · 16m ago
Isn’t fusing ops at a fine-grained level also the core benefit of JAX over TensorFlow? How does this work compare to JAX?
Good to see the competition in this area.
(Edited):
Related paper covering the larger "mirage" project, but this doesn't cover the "megakernel" approach: https://arxiv.org/abs/2405.05751
zhihaojia · 1h ago
This is the writer of the blog post. You are right that Stanford's work is a parallel effort. The main difference is that our focus is on compilation: making it easier to generate megakernels automatically.
liuliu · 2h ago
The Qwen 8B number, if verified, is very impressive. Much more practical than the previous megakernel one.
That being said, this one-persistent-kernel-per-SM approach reminds me of Larrabee, and now I'm wondering what the world would look like if we had gone down the traditional process-thread-SIMD path rather than the CUDA path.
skavi · 1h ago
Does anyone have an intuition on why this offers significant gains over CUDA Graphs? The CPU launch cost of a graph is tiny, which implies most of the work has been offloaded to the GPU's own scheduler. I'd expect that some I/O marshalling at kernel boundaries could be avoided with megakernels. Maybe some loop fusion? Are there any more interesting optimizations they enable?
saagarjha · 37m ago
> The CPU launch cost of a graph is tiny
Absolutely not; it’s comparable to the launch overhead of a kernel.
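A rough way to sanity-check this on your own hardware (an illustrative sketch, not from the post; assumes CUDA 11.4+ for cudaGraphInstantiateWithFlags) is to time back-to-back replays of a trivial kernel launch against replays of a single-node graph:

```cuda
// Illustrative micro-benchmark: end-to-end time per trivial kernel launch vs. per
// replay of a single-node CUDA graph capturing the same launch. Numbers vary by
// GPU and driver; the point is only that the two are of the same order.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void noop() {}

int main() {
    const int iters = 10000;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm up, then time plain kernel launches.
    noop<<<1, 32, 0, stream>>>();
    cudaStreamSynchronize(stream);
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) noop<<<1, 32, 0, stream>>>();
    cudaStreamSynchronize(stream);
    auto t1 = std::chrono::high_resolution_clock::now();

    // Capture the same launch into a graph and time graph replays.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    noop<<<1, 32, 0, stream>>>();
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&exec, graph, 0);
    cudaGraphLaunch(exec, stream);          // warm-up replay
    cudaStreamSynchronize(stream);

    auto t2 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);
    auto t3 = std::chrono::high_resolution_clock::now();

    printf("kernel launch: %.2f us/iter, graph replay: %.2f us/iter\n",
           std::chrono::duration<double, std::micro>(t1 - t0).count() / iters,
           std::chrono::duration<double, std::micro>(t3 - t2).count() / iters);
    return 0;
}
```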
refulgentis · 1h ago
You've hit the nail on the head. The CPU launch cost of a pre-compiled CUDA graph is tiny.
CUDA Graphs are a huge step up from manually launching kernels, but they still treat kernels as monolithic, black-box operations. A megakernel erases the boundaries between those operations.
With CUDA Graphs, as in the example in the article, if you have Matmul -> AllReduce, the AllReduce kernel cannot start until the entire Matmul kernel has finished. The dependency is at the kernel level. With a megakernel, they break these ops into fine-grained "tasks" scheduled across SMs. An AllReduce task that needs data from the first slice of the Matmul can begin as soon as that slice is computed by a few SMs, while other SMs are still working on the rest of the Matmul. This fine-grained software pipelining and compute/communication overlap is simply not possible when the dependency unit is the entire kernel.
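To make the task-level scheduling concrete, here is a minimal, hypothetical sketch of the idea (not MPK's actual implementation): a persistent kernel in which each resident thread block pulls fine-grained tasks from a global queue, and a task is published to that queue the moment its last dependency finishes, so consumer slices can start while unrelated producer slices are still running on other SMs.

```cuda
// Illustrative sketch only (not MPK's actual code): a persistent "megakernel" in
// which each resident thread block is a worker pulling fine-grained tasks from a
// global queue. A task is published to the queue the moment its last dependency
// completes, so a consumer slice (e.g. an all-reduce chunk) can start while other
// producer slices are still running -- no kernel boundary in between.
#include <cuda_runtime.h>
#include <cstdio>

#define MAX_DEPENDENTS 4

struct Task {
    int remaining_deps;              // producers still outstanding
    int num_dependents;
    int dependents[MAX_DEPENDENTS];  // tasks unblocked when this one finishes
};

__device__ void run_task(int task_id, int* out) {
    // Stand-in for the real per-slice work (a matmul tile, an all-reduce chunk, ...).
    if (threadIdx.x == 0) out[task_id] = task_id;
}

__global__ void megakernel(Task* tasks, int num_tasks,
                           int* ready, int* head, int* tail, int* out) {
    __shared__ int slot, task_id;
    while (true) {
        if (threadIdx.x == 0) slot = atomicAdd(head, 1);   // claim next queue slot
        __syncthreads();
        if (slot >= num_tasks) return;                     // every task has an owner

        if (threadIdx.x == 0) {                            // wait until the slot is published
            while ((task_id = atomicAdd(&ready[slot], 0)) < 0) { }
        }
        __syncthreads();

        run_task(task_id, out);                            // whole block cooperates here
        __syncthreads();

        if (threadIdx.x == 0) {
            __threadfence();                               // make results visible first
            Task& t = tasks[task_id];
            for (int i = 0; i < t.num_dependents; ++i) {
                int d = t.dependents[i];
                if (atomicSub(&tasks[d].remaining_deps, 1) == 1)  // last dependency done
                    atomicExch(&ready[atomicAdd(tail, 1)], d);    // publish dependent task
            }
        }
        __syncthreads();
    }
}

int main() {
    // Toy DAG: tasks 0-3 are "matmul slices"; task 4 consumes {0,1}, task 5 consumes {2,3}.
    const int N = 6;
    Task h_tasks[N] = {};
    h_tasks[0] = {0, 1, {4}}; h_tasks[1] = {0, 1, {4}};
    h_tasks[2] = {0, 1, {5}}; h_tasks[3] = {0, 1, {5}};
    h_tasks[4] = {2, 0, {}};  h_tasks[5] = {2, 0, {}};
    int h_ready[N] = {0, 1, 2, 3, -1, -1};                 // tasks 0-3 start out ready
    int h_head = 0, h_tail = 4;

    Task* d_tasks; int *d_ready, *d_head, *d_tail, *d_out;
    cudaMalloc(&d_tasks, sizeof(h_tasks));
    cudaMalloc(&d_ready, sizeof(h_ready));
    cudaMalloc(&d_head, sizeof(int));
    cudaMalloc(&d_tail, sizeof(int));
    cudaMalloc(&d_out, N * sizeof(int));
    cudaMemcpy(d_tasks, h_tasks, sizeof(h_tasks), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ready, h_ready, sizeof(h_ready), cudaMemcpyHostToDevice);
    cudaMemcpy(d_head, &h_head, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_tail, &h_tail, sizeof(int), cudaMemcpyHostToDevice);

    // Few blocks so all workers are co-resident on the GPU (required for the spinning).
    megakernel<<<4, 128>>>(d_tasks, N, d_ready, d_head, d_tail, d_out);
    cudaDeviceSynchronize();

    int h_out[N];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) printf("task %d ran: %d\n", i, h_out[i]);
    return 0;
}
```

The toy DAG here stands in for the real task graph; the hard part the compiler takes on is generating that graph and the per-task compute automatically from the model.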
olivia111 · 1h ago
really cool. would love to try it for our 3b model.
> Traditional LLM systems often rely on sequences of GPU kernel launches and external communication calls, resulting in underutilized hardware.
What? Why? This seems like an obvious optimization if it's possible.
catlifeonmars · 2h ago
From the article:
> Despite these advantages, compiling an LLM into a megakernel is highly challenging. Existing high-level ML frameworks — such as PyTorch, Triton, and TVM — do not natively support end-to-end megakernel generation. Additionally, modern LLM systems are built from a diverse collection of specialized kernel libraries: NCCL or NVSHMEM for communication, FlashInfer or FlashAttention for efficient attention, and CUDA or Triton for custom computation. This fragmentation makes it difficult to consolidate the entire inference pipeline into a single, unified kernel.
So my naive assumption is that yes it is obvious, but nontrivial.
saagarjha · 35m ago
Your naive assumption is the right one. It’s quite hard to do this. Even doing it automatically like it’s done here runs into problems with trying to figure out data dependencies and synchronization across nontrivial computation.
liuliu · 2h ago
It really is not obvious. These launches are asynchronous, and data movement / computation is overlapped properly through the CUDA APIs. Even the per-kernel launch cost is reduced with the introduction of CUDA Graphs.
The CUDA programming model relies on each kernel being computationally expensive enough to make sense, and that is no longer true for LLM token generation. We are now talking about evaluating the network more than 1000 times per second, whereas previously, outside of recommendation systems, network evaluation happened at most ~100 times per second.
Also, nobody remembers Alex Krizhevsky's "One Weird Trick" paper, which slices the matmul into pieces to overlap device-to-device transfers with computation. That was 10 years ago.
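For reference, the chunking idea looks roughly like this (an illustrative sketch using host-device copies for brevity, whereas the paper overlaps device-to-device transfers with the sliced matmul):

```cuda
// Illustrative sketch of the chunking idea: split one big operation into slices and
// issue each slice's copy and compute on its own stream, so the transfer for slice
// i+1 overlaps the compute for slice i.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;   // stand-in for the per-slice compute (e.g. a matmul tile)
}

int main() {
    const int N = 1 << 24, CHUNKS = 4, CHUNK = N / CHUNKS;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));       // pinned memory so copies are async
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t streams[CHUNKS];
    for (int c = 0; c < CHUNKS; ++c) cudaStreamCreate(&streams[c]);

    // Copy, compute, and copy back each slice on its own stream; the slices pipeline.
    for (int c = 0; c < CHUNKS; ++c) {
        float* hp = h + c * CHUNK;
        float* dp = d + c * CHUNK;
        cudaMemcpyAsync(dp, hp, CHUNK * sizeof(float), cudaMemcpyHostToDevice, streams[c]);
        scale<<<(CHUNK + 255) / 256, 256, 0, streams[c]>>>(dp, CHUNK, 2.0f);
        cudaMemcpyAsync(hp, dp, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, streams[c]);
    }
    for (int c = 0; c < CHUNKS; ++c) cudaStreamSynchronize(streams[c]);

    printf("h[0] = %.1f (expect 2.0)\n", h[0]);
    return 0;
}
```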
shawntan · 2h ago
Systems might want to anticipate changes in LLM architectures (even small changes can make a big difference kernel-wise), so it's good not to "bake" too much in ahead of time.
That said, at some point it just depends on where the costs lie, and it might make sense to hire some GPU engineers to do what they did here for whatever architecture you're optimising for.
Not as low-hanging as you might imagine.
delusional · 1h ago
In the common case where the processor dispatching those kernel calls is much faster than the kernel calls themselves, you're not likely to see a meaningful increase in throughput.
What you need to do first is get really optimized kernels (since that makes the dispatching relatively more expensive), and THEN this becomes worth doing. People who are really good at writing optimized GPU kernels are just not that easy to get a hold of right now.
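One way to see where that crossover sits on a given GPU (an illustrative sketch; numbers vary by hardware and driver) is to time back-to-back launches of a cheap elementwise kernel at shrinking tensor sizes and watch the per-step time stop improving once launch/dispatch overhead becomes the floor:

```cuda
// Illustrative benchmark: per-step wall time of K back-to-back launches of one cheap
// elementwise kernel at different tensor sizes. For large tensors the per-step time
// tracks the memory traffic and launch cost is invisible; below some size it stops
// shrinking because launch/dispatch overhead becomes the floor -- which is when
// megakernel-style fusion starts to pay off.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void step(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.0001f + 0.5f;   // one cheap elementwise op
}

int main() {
    const int K = 2000;                        // launches per measurement
    for (int n = 1 << 26; n >= 1 << 10; n >>= 4) {
        float* x;
        cudaMalloc(&x, (size_t)n * sizeof(float));
        cudaMemset(x, 0, (size_t)n * sizeof(float));
        step<<<(n + 255) / 256, 256>>>(x, n);  // warm-up
        cudaDeviceSynchronize();

        auto t0 = std::chrono::high_resolution_clock::now();
        for (int s = 0; s < K; ++s) step<<<(n + 255) / 256, 256>>>(x, n);
        cudaDeviceSynchronize();
        auto t1 = std::chrono::high_resolution_clock::now();

        double us_per_step =
            std::chrono::duration<double, std::micro>(t1 - t0).count() / K;
        printf("n = %9d   %.2f us per step\n", n, us_per_step);
        cudaFree(x);
    }
    return 0;
}
```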