GPUPrefixSums – state of the art GPU prefix sum algorithms

49 coffeeaddict1 11 8/28/2025, 12:49:10 PM github.com ↗

Comments (11)

luizfelberti · 31m ago
This looks amazing, I've been shopping for an implementation of this I could play around with for a while now

They mention promising results on Apple Silicon GPUs and even cite the contributions from Vello, but I don't see a Metal implementation in there and the benchmark only shows results from an RTX 2080. Is it safe to assume that they're referring to the WGPU version when talking about M-series chips?

m-schuetz · 1h ago
That and https://github.com/b0nes164/GPUSorting have been a tremendous help for me, since CUB does not nicely work with the Cuda Driver Api. The author is doing amazing work.
coffeeaddict1 · 3h ago
genpfault · 4h ago
almostgotcaught · 4h ago
this is missing the most important one (in today's world): extracting non-zero elements from a sparse vector/matrix

https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...

merope14 · 3h ago
Not even close. The most important application (in today's world) is radix sort.
WJW · 1h ago
What specific application do you have in mind that radix sort is more important than matrix multiplication?
otherjason · 25m ago
I think they were trying to say “radix sort is a more important application of prefix sum than extraction of values from a sparse matrix/vector is.”
woadwarrior01 · 25m ago
Top K sampling comes to mind, although it's nowhere nearly as important as matmult.
almostgotcaught · 5m ago
ranking models benefit from gpu impls of sort but yup they're not nearly as common/important as spmm/spmv
m-schuetz · 20m ago
Is that relevant for 4x4 multiplications? Because at least for me, radix sort is way more important than multiplying matrices beyond 4x4. E.g. for Gaussian Splatting.