Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels

110 points | nserrino | 14 comments | 9/3/2025, 5:03:35 PM | gimletlabs.ai

Comments (14)

turbo_wombat · 2h ago
They are comparing unoptimized PyTorch inference, something you would never deploy on a device, to a model with custom kernels.

Yes, of course the model with custom kernels is faster, whether it's written by a human or an AI.

Generally, PyTorch inference is meant for use during training and when running metrics, not for deployment. When deploying, you should export to ONNX and then compile the ONNX to the native format of the device.

If you aren't familiar with the pipeline for ML deployment, this is the equivalent of comparing interpreted code to compiled code.
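
A minimal sketch of that export step, assuming an illustrative model (the module, shapes, and filename here are made up for the example; on Apple devices the resulting graph would then be compiled with the platform's own toolchain, e.g. Core ML):

    # Illustrative only: a stand-in model and filename, not from the article.
    import torch
    import torch.nn as nn

    class TinyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

        def forward(self, x):
            return self.net(x)

    model = TinyModel().eval()
    example_input = torch.randn(1, 128)

    # Export the eager PyTorch module to ONNX; the ONNX graph is then compiled
    # by the target device's toolchain (Core ML, TensorRT, etc.) for deployment.
    torch.onnx.export(model, example_input, "tiny_model.onnx", opset_version=17)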

CapsAdmin · 20m ago
I have never really worked with PyTorch professionally, but it feels to me like a lot of the open source projects, especially the generative-oriented ones, just use PyTorch like this. It makes hacking on the models a whole lot easier.

comfyui is a good example of a project like this.

nserrino · 38m ago
PyTorch is the baseline because that's what people prototype in, and the most common reference point. The aim here is to show that you can start from prototype code and automatically produce lower-level kernels (in this case Metal) that are more usable in real deployments, without additional work from the developer. Frontier models are capable of generating efficient Metal kernels automatically/immediately, and will only get better. We expect to see significant improvements as we refine the approach, but it's enough to show this seems to be a tractable problem for AI.
Tiberium · 1h ago
It'd be interesting to see how those AI-generated kernels compare to kernels generated by https://github.com/tinygrad/tinygrad
earthnail · 2h ago
This is amazing. I wouldn't have thought that AI would be this good at niche topics. Very impressive experiment and write-up.

Still, I can't help but think we should bet on sth like Mojo instead for the long run.

moelf · 1h ago
>I can't help but think we should bet on sth like Mojo instead for the long run.

JAX or Julia hopefully.


ipsum2 · 2h ago
Mojo is a terrible language, and its main feature (GPU acceleration through Mojo MAX) is closed source and requires purchasing a commercial license.
earthnail · 38m ago
Why is it a terrible language? Genuinely curious question.
nikolayasdf123 · 2h ago
> non 100% correctness of kernels

wouldn't the model not work properly if the kernels are even slightly off?

weren't kernels part of the training stack for these models? am I missing anything?

ymsodev · 1h ago
The article is referring to GPU compute kernels (https://en.wikipedia.org/wiki/Compute_kernel), not the term kernel as used in ML/NN/etc.
magicalist · 59m ago
It's cool that this works so well (less the performance claims and more the correct translations).

Not to take away from the nice writeup, but for anyone not getting far enough into it, this is essentially taking https://github.com/ScalingIntelligence/KernelBench and seeing if it can generate Metal kernels in addition to the CUDA kernels the benchmark is written for. The dataset was released in November 2024, it looks like, with a paper on arXiv in February and a bunch of discussion at the time[1], so it's worth keeping the likelihood of inclusion in training data in mind when comparing models.

The different levels are interesting. Levels 1 and 3 are successfully (5-shot) translated to Metal kernels by gpt5 97% and 88% of the time, but in both cases the majority of generated kernels are slower than the reference compiled PyTorch versions. The speculation about more simple op-fusion opportunities in the Level 2 kernels vs the very simple Level 1 kernels and the complex-architecture Level 3 kernels seems plausible. From the KernelBench paper, it looks like Level 2 kernels were mostly automatically generated by randomly picking operators and then getting an LLM to generate a kernel combining them, while Level 1 kernels were mostly hand-written and Level 3 came from well-known ML architectures.

The swarm part seemed a bit of a stretch. They fired off requests to 8 different models to do the translation, and the "supervisor" benchmarked the returned kernels and picked the fastest one. Technically a swarm, I guess, but feels like we're devaluing the term :)
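
Something like this, presumably (a rough sketch of that supervisor loop; the model list and helper functions are placeholders, not anything from the article):

    # Hypothetical sketch: fan the task out to several models, keep candidates
    # that pass the correctness check, and have the supervisor pick the fastest.
    MODELS = ["model_a", "model_b", "model_c"]  # stand-ins for the 8 models used

    def ask_model(model, task):
        return f"// kernel from {model} for {task}"  # placeholder generation

    def passes_correctness(kernel, task):
        return True  # placeholder: compile and compare against reference outputs

    def benchmark_ms(kernel):
        return len(kernel) * 0.01  # placeholder: time the kernel on device

    def pick_best_kernel(task):
        candidates = [ask_model(m, task) for m in MODELS]
        valid = [k for k in candidates if passes_correctness(k, task)]
        return min(valid, key=benchmark_ms) if valid else None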

The correctness testing used made my eye twitch a bit:

> We tested the generated kernel's output against the default implementation's output on 100 random inputs. We set a 0.01 tolerance for both relative and absolute. Let a be the generated kernel output, and b be the reference kernel output. Outputs were considered equal if for every element in the output, absolute(a - b) ≤ atol + rtol * absolute(b) held true.

For a numerical kernel, this seems way too loose, but it turns out those bounds come straight from KernelBench, which only tested for correctness on 5 random inputs by default in their harness, not the 100 used here. KernelBench mentions the clear tradeoff between how strictly they define correctness and kernel performance, but for Level 1 kernels in particular, which are really just single operations, it seems like the bounds should be multiple orders of magnitude smaller to ensure a robust translation. For instance, the all-0s "optimization" mentioned in the writeup, which let a kernel be trivially "translated", looks like it's due to those loose tolerances[2], something KernelBench was already looking to make more robust.
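
(For reference, that check is the same elementwise test torch.allclose performs; a quick sketch with the tolerances from the writeup, using made-up tensors:)

    import torch

    # Elementwise check described above: |a - b| <= atol + rtol * |b|.
    atol, rtol = 0.01, 0.01
    a = torch.randn(4, 4)                  # generated-kernel output (placeholder)
    b = a + 0.001 * torch.randn(4, 4)      # reference output (placeholder)

    ok = torch.all(torch.abs(a - b) <= atol + rtol * torch.abs(b))
    ok_builtin = torch.allclose(a, b, rtol=rtol, atol=atol)  # equivalent built-in
    print(ok.item(), ok_builtin)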

[1] Like https://metr.org/blog/2025-02-14-measuring-automated-kernel-...

[2] https://github.com/ScalingIntelligence/KernelBench/pull/25

simlevesque · 3h ago
Are these kernels available? I'd love to try them!
magicalist · 57m ago
Feels like they should have released some code, yeah, but the gpt5 success rate was high enough that it looks like you can just pass the kernels they got from https://github.com/ScalingIntelligence/KernelBench/ to gpt5 (with up to five rounds of feeding compilation/correctness errors back to the model) and get the results yourself.
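
The repair loop would be something like this rough sketch (generate_kernel, compile_metal, and check_outputs are stand-ins, not functions from the article or KernelBench):

    # Up to five rounds of generate -> compile -> test, feeding errors back.
    def translate_with_feedback(reference_src, generate_kernel, compile_metal,
                                check_outputs, max_rounds=5):
        prompt = f"Translate this PyTorch module into a Metal kernel:\n{reference_src}"
        for _ in range(max_rounds):
            kernel_src = generate_kernel(prompt)
            ok, error = compile_metal(kernel_src)
            if ok:
                ok, error = check_outputs(kernel_src, reference_src)
            if ok:
                return kernel_src
            # Append the failure so the next attempt can fix it.
            prompt += f"\n\nThe previous attempt failed:\n{error}\nPlease fix it."
        return None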
pbronez · 3h ago
This is pretty cool.

I initially thought they were writing custom kernels for proprietary models like GPT-5. They aren't - they're using proprietary models to write kernels for a set of ~250 open PyTorch modules.