Every couple of years I refresh my own parallel reduction benchmarks (https://github.com/ashvardanian/ParallelReductionsBenchmark), which are also memory-bound. Mine mostly focus on the boring, simple, throughput-maximizing cases on CPUs and GPUs.
Lately, as GPUs are pulled into more general data-processing tasks, I keep running into non-coalesced, pointer-chasing patterns — but I still don’t have a good mental model for estimating the cost of different access strategies. A crossover between these two topics — running MLP-style loads on GPUs — might be exactly the missing benchmark, in case someone is looking for a good weekend project!
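Not anyone's actual code, just a rough sketch of what the pointer-chasing half of such a benchmark might look like in CUDA, with every name made up for illustration: each thread walks its own data-dependent chain through a single-cycle permutation, so the loads can't be coalesced, and the interesting experiment would be sweeping the chain length and the number of independent chains per thread.

```cuda
// Hypothetical sketch: latency-bound, non-coalesced pointer chasing on a GPU.
// Names (pointer_chase_kernel, count, steps) are illustrative, not from any repo.
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

__global__ void pointer_chase_kernel(const unsigned *next, unsigned steps, unsigned *sinks) {
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned cursor = tid;                 // each thread starts its chain at its own index
    for (unsigned i = 0; i < steps; ++i)
        cursor = next[cursor];             // data-dependent, uncoalesced load
    sinks[tid] = cursor;                   // keep the chain live so it isn't optimized away
}

int main() {
    const unsigned count = 1u << 20;       // array length and thread count
    const unsigned steps = 1u << 10;       // dependent loads per thread

    // Build a single-cycle permutation (Sattolo's algorithm) so no chain
    // collapses onto a short cycle that could fit in cache.
    std::vector<unsigned> host_next(count);
    std::iota(host_next.begin(), host_next.end(), 0u);
    std::mt19937 rng{42};
    for (unsigned i = count - 1; i > 0; --i) {
        std::uniform_int_distribution<unsigned> dist(0, i - 1);
        std::swap(host_next[i], host_next[dist(rng)]);
    }

    unsigned *next = nullptr, *sinks = nullptr;
    cudaMalloc((void **)&next, count * sizeof(unsigned));
    cudaMalloc((void **)&sinks, count * sizeof(unsigned));
    cudaMemcpy(next, host_next.data(), count * sizeof(unsigned), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    pointer_chase_kernel<<<count / 256, 256>>>(next, steps, sinks);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop);
    double loads = double(count) * steps;
    std::printf("%.2f ms, %.2f G dependent loads/s\n", ms, loads / ms / 1e6);

    cudaFree(next);
    cudaFree(sinks);
    return 0;
}
```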
ericye16 · 13h ago
I wish the chart extended past 28, otherwise how do we know that it tops out there?
saagarjha · 13h ago
You don't; the author explains that testing beyond that produces noise that makes it hard to analyze.
pixelpoet · 13h ago
It's pretty trivial to keep randomising the array and plot some min/max bands, or just the average.
It really seemed like there was more to be said there.
I think the expectation of more comes from the experience of predominantly encountering articles with a different form.
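To make the randomize-and-plot-bands idea concrete, here is a minimal, self-contained sketch (CPU-side for brevity, with made-up names): rebuild the permutation with a fresh seed on every repetition, time the same chase, and report min / mean / max, which is what you would draw as bands around the curve. The same harness would wrap a GPU kernel launch.

```cpp
// Hypothetical harness: repeat a pointer chase over freshly randomized data
// and report min / mean / max across repetitions. Names are illustrative.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// One repetition: build a fresh single-cycle permutation, chase it once, return milliseconds.
static double chase_once(std::size_t n, unsigned seed) {
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937 rng{seed};
    for (std::size_t i = n - 1; i > 0; --i) {   // Sattolo's algorithm: one big cycle
        std::uniform_int_distribution<std::size_t> dist(0, i - 1);
        std::swap(next[i], next[dist(rng)]);
    }
    auto start = std::chrono::steady_clock::now();
    std::size_t cursor = 0;
    for (std::size_t i = 0; i < n; ++i) cursor = next[cursor];
    auto stop = std::chrono::steady_clock::now();
    volatile std::size_t sink = cursor;         // keep the chase from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    const std::size_t n = 1 << 24;
    const int repetitions = 16;
    double lo = 1e300, hi = 0, sum = 0;
    for (int r = 0; r < repetitions; ++r) {
        double ms = chase_once(n, /*seed=*/static_cast<unsigned>(r));
        lo = std::min(lo, ms);
        hi = std::max(hi, ms);
        sum += ms;
    }
    std::printf("min %.2f ms, mean %.2f ms, max %.2f ms over %d runs\n",
                lo, sum / repetitions, hi, repetitions);
    return 0;
}
```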