[CUDA in Practice] Matrix Transpose — From Padding to XOR Swizzle: The Art of Shared Memory Access Optimization
6h ago · 33 min read · Note: Text translated by AI. Code crafted by human. Matrix transpose is one of the most fundamental operations in deep learning and high-performance computing. The deceptively simple coordinate swap
Join discussion