May 29 · 27 min read · 0. Preface — The Last Stand of Scalar Compute Warning: Extremely dense content ahead, with many diagrams, heavy bit-manipulation, and memory-mapping derivations. Best read on a PC. Target audience: R
May 28 · 7 min read · TL;DR After del tensor; torch.cuda.empty_cache(), PyTorch’s caching allocator still holds 53.7 MB that it won’t release. We traced the CUDA Runtime and Driver APIs with eBPF uprobes to see exactly wh
May 24 · 8 min read · AI compilers are evolving fast. In many cases, torch.compile plus JIT optimization in PyTorch can deliver striking speedups, to the point that people often say, "hand-written operators are no longer n
Apr 28 · 6 min read · SDXL Turbo is genuinely impressive for what it does: real-time or near-real-time image generation in 1-4 steps. But it has a specific set of requirements and failure modes that are different from base SDXL, and if you're bringing over your SDXL setu...
Apr 9 · 33 min read · Note: Text translated by AI. Code crafted by human. Matrix transpose is one of the most fundamental operations in deep learning and high-performance computing. The deceptively simple coordinate swap
Apr 9 · 11 min read · TL;DR CUDA graphs shipped in 2018 but only became critical infrastructure in the past two years, driven by LLM inference demands and framework automation. They also create an observability blind spot