Debugging Multi-Node GPU Training
Table of Contents
1. The Problem
2. Physical Hardware
3. Intra-Node Communication: NVLink
4. Inter-Node Communication: InfiniBand and RDMA
5. PCIe Topology: Why GPU-NIC Affinity Matters
6. The S
cliolabs.hashnode.dev, 19 min read