AllReduce Stalls Are Network Stalls. Most Tools See Neither.
A slow AllReduce on rank 5 lines up with TCP retransmits on rank 5’s NIC, 4 ms before the collective completes.
TL;DR
When a multi-node training job slows down on AllReduce, both ends of the evi
ingero.hashnode.dev · 4 min read