TCP Retransmits Are Not a Fabric Signal on InfiniBand
On InfiniBand the data path never touches TCP, so the retransmit proxy reads zero. The measured signal is in sysfs and libibverbs.
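As a minimal sketch of the sysfs half of that claim, the snippet below reads the per-port counters that the Linux RDMA stack exposes under `/sys/class/infiniband/<device>/ports/<port>/counters/`. The helper name `read_ib_counters` and the short list of counters printed at the end are illustrative assumptions, not taken from the article. Which counters matter for a given fabric is a judgment call.

```python
from pathlib import Path

# Standard Linux sysfs location for RDMA devices. It is a parameter below so
# the function can also be pointed at a test directory.
IB_SYSFS_ROOT = Path("/sys/class/infiniband")


def read_ib_counters(root=IB_SYSFS_ROOT):
    """Return {(device, port): {counter_name: int}} for every port under root."""
    results = {}
    root = Path(root)
    if not root.is_dir():
        return results  # no RDMA devices visible on this host

    # Layout: <root>/<device>/ports/<port>/counters/<counter_name>
    for counters_dir in sorted(root.glob("*/ports/*/counters")):
        port = counters_dir.parent.name
        device = counters_dir.parent.parent.parent.name
        values = {}
        for counter_file in sorted(counters_dir.iterdir()):
            try:
                values[counter_file.name] = int(counter_file.read_text().strip())
            except (OSError, ValueError):
                continue  # skip counters that are unreadable or non-numeric
        results[(device, port)] = values
    return results


if __name__ == "__main__":
    # Hypothetical shortlist of error-type counters; adjust for your fabric.
    watched = ("symbol_error", "link_error_recovery",
               "port_rcv_errors", "port_xmit_discards")
    for (dev, port), counters in read_ib_counters().items():
        for name in watched:
            if name in counters:
                print(f"{dev} port {port} {name}={counters[name]}")
```

On a host with no InfiniBand devices the function returns an empty dict, so it is safe to run on a mixed fleet. Sampling it twice and diffing the values gives a rate rather than a lifetime total.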
TL;DR
On an InfiniBand cluster, NCCL moves the collective data over R
ingero.hashnode.dev · 4 min read