GPUPrefixSums – state of the art GPU prefix sum algorithms

48 coffeeaddict1 10 8/28/2025, 12:49:10 PM github.com ↗

Comments (10)

luizfelberti · 13m ago
This looks amazing, I've been shopping for an implementation of this I could play around with for a while now

They mention promising results on Apple Silicon GPUs and even cite the contributions from Vello, but I don't see a Metal implementation in there and the benchmark only shows results from an RTX 2080. Is it safe to assume that they're referring to the WGPU version when talking about M-series chips?

m-schuetz · 1h ago
That and https://github.com/b0nes164/GPUSorting have been a tremendous help for me, since CUB does not nicely work with the Cuda Driver Api. The author is doing amazing work.
coffeeaddict1 · 3h ago
genpfault · 4h ago
almostgotcaught · 4h ago
this is missing the most important one (in today's world): extracting non-zero elements from a sparse vector/matrix

https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...

merope14 · 3h ago
Not even close. The most important application (in today's world) is radix sort.
WJW · 1h ago
What specific application do you have in mind that radix sort is more important than matrix multiplication?
m-schuetz · 2m ago
Is that relevant for 4x4 multiplications? Because at least for me, radix sort is way more important than multiplying matrices beyond 4x4. E.g. for Gaussian Splatting.
otherjason · 7m ago
I think they were trying to say “radix sort is a more important application of prefix sum than extraction of values from a sparse matrix/vector is.”
woadwarrior01 · 7m ago
Top K sampling comes to mind, although it's nowhere nearly as important as matmult.