124x Slower: What PyTorch DataLoader Actually Does at the Kernel Level
TL;DR
PyTorch’s DataLoader can be 50-124x slower than direct tensor indexing for in-memory GPU workloads. We reproduced a real PyTorch issue on an RTX 4090 and traced every CUDA API call and Linux ke
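To make the comparison concrete, here is a minimal sketch of the two batching paths the TL;DR contrasts. It is not the benchmark from the post: the helper names (`batches_via_dataloader`, `batches_via_indexing`, `time_it`) and the tensor sizes are hypothetical, chosen only for illustration, and the sketch falls back to CPU when no GPU is present. With default settings, `DataLoader` fetches one sample at a time from the dataset and collates the samples into a batch, whereas direct indexing takes one slice per batch.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def batches_via_dataloader(x, y, batch_size):
    # Default DataLoader path: index the dataset once per sample,
    # then stack the samples into a batch (default collate).
    loader = DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=False)
    return [(xb, yb) for xb, yb in loader]


def batches_via_indexing(x, y, batch_size):
    # Direct path: one slice per batch on tensors that are already in memory.
    n = x.shape[0]
    return [(x[i:i + batch_size], y[i:i + batch_size]) for i in range(0, n, batch_size)]


def time_it(fn, *args):
    # Hypothetical timing helper. Synchronize so queued GPU work is counted.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start


# Illustrative sizes only (an assumption, not the article's workload).
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(20_000, 32, device=device)
y = torch.randint(0, 10, (20_000,), device=device)

t_loader = time_it(batches_via_dataloader, x, y, 256)
t_index = time_it(batches_via_indexing, x, y, 256)
print(f"DataLoader: {t_loader:.4f}s  direct indexing: {t_index:.4f}s")
```

Both helpers yield the same batches in the same order; only the way each batch is assembled differs. The measured ratio depends on hardware, batch size, and tensor shape, so it should not be expected to match the figures quoted above.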
ingero.hashnode.dev · 6 min read