@wing-edge-777

WingEdge777

Shanghai, China · Joined April 2026

success is fleeting, failure is the constant.

About

AI Systems & Compute Engineer. Maintainer of Vitamin-CUDA.

WingEdge777's blogs

WingEdge777 | AI Systems & CPU/GPU Compute Engineering · wingedge777.hashnode.dev · 4 posts

Recently published

WingEdge777 · wingedge777.hashnode.dev · May 29 · 27 min read

[CUDA in Practice] SGEMM — Beating cuBLAS: A Deep Dive into Peak-Performance Matrix Multiplication in Pure CUDA C++

0. Preface — The Last Stand of Scalar Compute Warning: Extremely dense content ahead, with many diagrams, heavy bit-manipulation, and memory-mapping derivations. Best read on a PC. Target audience: R

WingEdge777 · wingedge777.hashnode.dev · May 24 · 8 min read

[CUDA in Practice] RoPE — Why Kernel Fusion in Hand-Written Operators Matters: Reducing Memory Traffic and Launch Overhead

AI compilers are evolving fast. In many cases, torch.compile plus JIT optimization in PyTorch can deliver striking speedups, to the point that people often say, "hand-written operators are no longer n

WingEdge777 · wingedge777.hashnode.dev · Apr 9 · 33 min read

[CUDA in Practice] Matrix Transpose — From Padding to XOR Swizzle: The Art of Shared Memory Access Optimization

Note: Text translated by AI. Code crafted by human. Matrix transpose is one of the most fundamental operations in deep learning and high-performance computing. The deceptively simple coordinate swap

WingEdge777 · wingedge777.hashnode.dev · Apr 5 · 9 min read

A Deep Dive into DeviceQuery: Understanding Your GPU Hardware

0. Preface

Before writing a single line of high-performance CUDA code, you must know your silicon. deviceQuery is often the first command a developer runs, yet its output is usually ignored. This post