WingEdge777 in wingedge777.hashnode.dev · May 29 · 27 min read
[CUDA in Practice] SGEMM — Beating cuBLAS: A Deep Dive into Peak-Performance Matrix Multiplication in Pure CUDA C++
0. Preface — The Last Stand of Scalar Compute. Warning: Extremely dense content ahead, with many diagrams, heavy bit-manipulation, and memory-mapping derivations. Best read on a PC. Target audience: R...

WingEdge777 in wingedge777.hashnode.dev · May 24 · 8 min read
[CUDA in Practice] RoPE — Why Kernel Fusion in Hand-Written Operators Matters: Reducing Memory Traffic and Launch Overhead
AI compilers are evolving fast. In many cases, torch.compile plus JIT optimization in PyTorch can deliver striking speedups, to the point that people often say, "hand-written operators are no longer n...

WingEdge777 in wingedge777.hashnode.dev · Apr 9 · 33 min read
[CUDA in Practice] Matrix Transpose — From Padding to XOR Swizzle: The Art of Shared Memory Access Optimization
Note: Text translated by AI. Code crafted by human. Matrix transpose is one of the most fundamental operations in deep learning and high-performance computing. The deceptively simple coordinate swap ...

WingEdge777 in wingedge777.hashnode.dev · Apr 5 · 9 min read
A Deep Dive into DeviceQuery: Understanding Your GPU Hardware
0. Preface. Before writing a single line of high-performance CUDA code, you must know your silicon. deviceQuery is often the first command a developer runs, yet its output is usually ignored. This post...