This looks amazing, I've been shopping for an implementation of this I could play around with for a while now
They mention promising results on Apple Silicon GPUs and even cite the contributions from Vello, but I don't see a Metal implementation in there and the benchmark only shows results from an RTX 2080. Is it safe to assume that they're referring to the WGPU version when talking about M-series chips?
m-schuetz · 1h ago
That and https://github.com/b0nes164/GPUSorting have been a tremendous help for me, since CUB does not nicely work with the Cuda Driver Api. The author is doing amazing work.
Not even close. The most important application (in today's world) is radix sort.
WJW · 2h ago
What specific application do you have in mind that radix sort is more important than matrix multiplication?
otherjason · 48m ago
I think they were trying to say “radix sort is a more important application of prefix sum than extraction of values from a sparse matrix/vector is.”
woadwarrior01 · 48m ago
Top K sampling comes to mind, although it's nowhere nearly as important as matmult.
almostgotcaught · 29m ago
ranking models benefit from gpu impls of sort but yup they're not nearly as common/important as spmm/spmv
m-schuetz · 44m ago
Is that relevant for 4x4 multiplications? Because at least for me, radix sort is way more important than multiplying matrices beyond 4x4. E.g. for Gaussian Splatting.
They mention promising results on Apple Silicon GPUs and even cite the contributions from Vello, but I don't see a Metal implementation in there and the benchmark only shows results from an RTX 2080. Is it safe to assume that they're referring to the WGPU version when talking about M-series chips?
https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...