Memory optimization is the best way to write high-performing CUDA kernels for AI
1 thecongluong 0 6/6/2025, 2:20:29 PM
Is the current state of CUDA kernels for AI applications all about memory optimization? Compute units (tensor cores) are so fast now that memory is lagging behind. So the highest-performing kernels are the ones that make the best use of memory and data transfer to keep the tensor cores constantly fed. By that assumption, can one become a good kernel programmer by mastering the art of memory loading?
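To make the "memory is lagging behind" argument concrete, here is the rough roofline-style check I have in mind. It is only a sketch: the peak FLOP/s and bandwidth figures are hypothetical round numbers, not the specs of any real GPU, and the function names are my own. It also assumes the ideal case where every matrix element crosses DRAM exactly once, which only holds if the kernel reuses data well in shared memory and registers.

```python
# Back-of-the-envelope roofline check.
# ASSUMPTION: the hardware numbers below are hypothetical round figures
# chosen for illustration, not measurements or specs of a real GPU.
PEAK_FLOPS = 1000e12       # assumed tensor-core peak, in FLOP/s
PEAK_BANDWIDTH = 3e12      # assumed DRAM bandwidth, in bytes/s
BYTES_PER_ELEMENT = 2      # fp16


def ridge_point():
    """Arithmetic intensity (FLOP/byte) above which compute is the limit."""
    return PEAK_FLOPS / PEAK_BANDWIDTH


def gemm_intensity(m, n, k):
    """Ideal intensity of C[m,n] = A[m,k] @ B[k,n].

    Assumes A and B are each read once and C is written once,
    i.e. perfect on-chip reuse.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = BYTES_PER_ELEMENT * (m * k + k * n + m * n)
    return flops / bytes_moved


def elementwise_add_intensity():
    """c[i] = a[i] + b[i]: 1 FLOP for two loads and one store."""
    return 1 / (3 * BYTES_PER_ELEMENT)


def bound(intensity):
    """Classify a kernel against the assumed roofline."""
    return "compute-bound" if intensity >= ridge_point() else "memory-bound"


if __name__ == "__main__":
    cases = [
        ("large square GEMM (4096^3)", gemm_intensity(4096, 4096, 4096)),
        ("matrix-vector (m=1, n=k=4096)", gemm_intensity(1, 4096, 4096)),
        ("elementwise add", elementwise_add_intensity()),
    ]
    print(f"ridge point: {ridge_point():.1f} FLOP/byte")
    for name, intensity in cases:
        print(f"{name}: {intensity:.2f} FLOP/byte -> {bound(intensity)}")
```

Under these assumed numbers, only the large GEMM lands above the ridge point, and only if the kernel actually achieves that ideal reuse. The matrix-vector and elementwise cases stay memory-bound no matter how fast the tensor cores get, which is why I suspect data movement is where most of the work is.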
BTW, I've never used atomics. Are they useful when writing kernels for DL applications?