Show HN: Luminal – Open-source, search-based GPU compiler
We take high-level model code, like you'd write in PyTorch, and generate very fast GPU code. We do that without using LLMs or AI; instead, we pose it as a search problem. Our compiler builds a search space, generates millions of possible kernels, and then searches through it to minimize runtime.
You can try out a demo in `demos/matmul` on a Mac to see how Luminal takes a naive operation, represented in our IR of 12 simple operations, and compiles it to an optimized, tensor-core-enabled Metal kernel. Here’s a video showing how: https://youtu.be/P2oNR8zxSAA
Our approach differs significantly from traditional ML libraries: we compile everything ahead of time, generate a large search space of logically equivalent kernels, and search through it to find the fastest ones. This lets us leverage the Bitter Lesson to discover complex optimizations like Flash Attention entirely automatically, without manual heuristics. The best rule is no rule, the best heuristic is no heuristic; just search everything.
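Conceptually (and heavily simplified; this Rust sketch isn't our actual internals, and `Kernel`, `benchmark`, and `fastest` are illustrative names), the search boils down to: enumerate logically equivalent kernels, time each one on the real device, and keep the winner.

```rust
use std::time::Instant;

// A generated candidate kernel (illustrative only).
struct Kernel {
    source: String, // generated GPU source code
}

// Time one candidate on the actual target device (stubbed here).
fn benchmark(kernel: &Kernel) -> f64 {
    let start = Instant::now();
    let _ = &kernel.source; // placeholder: compile, dispatch, synchronize
    start.elapsed().as_secs_f64()
}

// Runtime is the objective: measure every candidate, keep the fastest.
fn fastest(candidates: Vec<Kernel>) -> Kernel {
    candidates
        .into_iter()
        .map(|k| (benchmark(&k), k))
        .min_by(|a, b| a.0.partial_cmp(&b.0).unwrap())
        .map(|(_, k)| k)
        .expect("search space must be non-empty")
}
```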
We’re working on bringing CUDA support up to parity with Metal, adding more flexibility to the search space, adding full-model examples (like Llama), and adding very exotic hardware backends.
We aim to radically simplify the ML ecosystem while improving performance and hardware utilization. Please check out our repo at https://github.com/luminal-ai/luminal; I’d love to hear your thoughts!
Instead of applying only predetermined optimization rules or patterns, the compiler formulates the problem as a search through many possible configurations or versions of the code. Each version can have different arrangements, tiling sizes, thread-block configurations, memory access patterns, and instruction sequences, right?
And from my understanding, the “search space” is just the collection of all potential versions of the code (kernels) that the compiler can generate from the original input. So, for example, the space might include:
- Different ways to partition workloads among GPU threads and blocks
- Varying memory access strategies (e.g., shared memory vs. global memory)
- Various instruction-level optimizations or reordering
- Alternative loop unroll factors or vectorization strategies
The compiler then programmatically produces a large number of candidate kernels by combining different optimizations and configurations. Among these millions of candidates, the compiler tries to find the one that performs best.
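If I had to guess at what that enumeration looks like (purely hypothetical Rust; `Candidate` and all the axis values below are made up by me, not taken from the Luminal codebase), each candidate would be one point in the cross product of axes like the ones above:

```rust
// One fully-specified kernel configuration (illustrative axes only).
#[derive(Debug, Clone)]
struct Candidate {
    tile: (usize, usize), // workload partition per thread block
    use_shared_mem: bool, // memory access strategy
    unroll: usize,        // loop unroll factor
    vector_width: usize,  // vectorization width
}

// Enumerate the full cross product of the configuration axes.
fn enumerate_candidates() -> Vec<Candidate> {
    let mut space = Vec::new();
    for &tile in &[(16, 16), (32, 8), (64, 4)] {
        for &use_shared_mem in &[false, true] {
            for &unroll in &[1, 2, 4, 8] {
                for &vector_width in &[1, 2, 4] {
                    space.push(Candidate { tile, use_shared_mem, unroll, vector_width });
                }
            }
        }
    }
    space
}
```

Even these four small axes give 3 * 2 * 4 * 3 = 72 points; with realistic axes per kernel, I can see how that multiplies into the millions of candidates mentioned above.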
In that case, can the compiler print out which GPU configuration works best for that computer? And will that configuration be applicable to all computers with the same setup?
This is such an interesting technique.
This obviously creates a combinatorial problem that we mitigate with smarter search.
The kernels are run on the computer the compiler is running on. Since runtime is our gold standard, it will search for the best configuration for your hardware target. As long as the setup is mostly the same, the optimizations should carry over, yes.
Typical search time ranges from 10 minutes to 10 hours. It won't be fast, but you only have to do it once, and those optimizations are then set for every forward pass.
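In other words, it's a tune-once-then-cache workflow. A minimal sketch of that idea (assumed shape, not our actual API; `tuned_config` and `run_search` are illustrative names):

```rust
use std::fs;
use std::path::Path;

// Load a previously tuned configuration if one exists; otherwise run
// the expensive search once and persist the result for future runs.
fn tuned_config(cache_path: &Path) -> std::io::Result<String> {
    if let Ok(cached) = fs::read_to_string(cache_path) {
        return Ok(cached); // fast path: reuse the earlier search result
    }
    let best = run_search(); // hypothetical stand-in for the full search
    fs::write(cache_path, &best)?;
    Ok(best)
}

fn run_search() -> String {
    // Placeholder for the enumerate-and-benchmark loop sketched earlier.
    "tile=32x8,unroll=4,vec=4".to_string()
}
```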
Great that some numbers are provided, but in isolation I'm not sure what they tell us. It would be helpful to also share what tok/s you'd get with llama.cpp or something else on the same hardware, so we can actually tell whether it's faster :) Including prompt processing speed would be a bonus!
Also, what about CUDA alternatives like ROCm?