Tag feed

#gpu

393 posts · 18 followers

Trending tags this week

Fleet 1.0: Finding the One Slow Rank in a 64-GPU Job From the Cluster Side

1d ago · 7 min read · TL;DR In a distributed training job, every node can look healthy on its own dashboard while throughput across the job quietly drops. The cause is almost never visible per host, because the signal is r

Alexander Kerchum · blog.kerchum.dev

LLMs Use Just 16 of 256 Exponents — So We Compressed the Rest Away

4d ago · 9 min read · Most people compressing LLM weights are fighting the same war: squeeze 7 billion floats into less memory without wrecking the model. The standard weapons are quantization schemes — map each float to a

Ingero Team · ingero.hashnode.dev

From Kernel Scheduler to Python Source Line: Tracing a GPU Stall End to End

5d ago · 7 min read · TL;DR A GPU that reports 97% utilization can still be the slowest part of a training step, and the reason usually lives outside the GPU: a CPU scheduler preemption, a driver-level allocation, a collec

WingEdge777 · wingedge777.hashnode.dev

[CUDA in Practice] SGEMM — Beating cuBLAS: A Deep Dive into Peak-Performance Matrix Multiplication in Pure CUDA C++

5d ago · 27 min read · 0. Preface — The Last Stand of Scalar Compute Warning: Extremely dense content ahead, with many diagrams, heavy bit-manipulation, and memory-mapping derivations. Best read on a PC. Target audience: R

Ingero Team · ingero.hashnode.dev

Tracing torch.cuda.empty_cache() on an RTX 4090 - Where Do the 53 MB Go?

6d ago · 7 min read · TL;DR After del tensor; torch.cuda.empty_cache(), PyTorch’s caching allocator still holds 53.7 MB that it won’t release. We traced the CUDA Runtime and Driver APIs with eBPF uprobes to see exactly wh

Ingero Team · ingero.hashnode.dev

AllReduce Stalls Are Network Stalls. Most Tools See Neither.

May 27 · 4 min read · A slow AllReduce on rank 5 lines up against TCP retransmits on rank 5’s NIC, four ms before the collective completes. TL;DR When a multi-node training job slows down on AllReduce, both ends of the evi

Ingero Team · ingero.hashnode.dev

What GitHub Uses eBPF For (and the Layer They Have Not Ported Yet)

May 26 · 6 min read · Three eBPF patterns hyperscalers run in production today, mapped to the equivalent patterns on the GPU plane that nobody runs in production yet. TL;DR GitHub recently disclosed using eBPF in productio

Ingero Team · ingero.hashnode.dev

TCP Retransmits Are Not a Fabric Signal on InfiniBand

May 26 · 4 min read · On InfiniBand the data path never touches TCP, so the retransmit proxy reads zero. The measured signal is in sysfs and libibverbs. TL;DR On an InfiniBand cluster, NCCL moves the collective data over R

Subhanshu Mohan Gupta · blogs.subhanshumg.com

Trust the Silicon. They Said.

May 24 · 10 min read · TEE.Fail did not kill confidential compute. It narrowed it. The threat model that excluded physical access stayed safe; the threat model that included it did not. Slatewatch Cyber rebuilt its workload

Ingero Team · ingero.hashnode.dev

124x Slower: What PyTorch DataLoader Actually Does at the Kernel Level

May 21 · 6 min read · TL;DR PyTorch’s DataLoader can be 50-124x slower than direct tensor indexing for in-memory GPU workloads. We reproduced a real PyTorch issue on an RTX 4090 and traced every CUDA API call and Linux ke
