GPU Deadlock on EKS: What Gang Scheduling Actually Is, Why the Default Scheduler Fails You, and Three Ways to Fix It
There's a class of production incident that doesn't page anyone. No error rate spikes. No latency alert fires. The cluster health dashboard shows green. GPU nodes are online. Pods are running.
And yet
aditmodi.hashnode.dev · 30 min read