1d ago · 9 min read · The problem an eBPF GPU agent has to solve, when a real workload stalls, is not "what is happening on this host" but "which rank in this cluster is dragging the rest, and why." Across seven weeks and
Join discussion1d ago · 6 min read · TL;DR A .cpu().numpy() call buried inside a forward pass was forcing a full CPU-GPU synchronization on every batch, every loop iteration. The GPU would finish its work in milliseconds, then sit idle
Join discussion
May 7 · 4 min read · An MCP tool call is a tiny line of agent code that fans out to syscalls, library calls, and kernel paths the agent has no view of. TL;DR Through April and early May, vendors shipped MCP servers in bat
Join discussion
May 6 · 11 min read · Eight ranks on two hosts. Every per-host metric reads healthy. Rank 5 enters the barrier 290ms late. The cause lives in a cross-rank query, not in any single host’s trace. TL;DR Eight ranks on two hos
Join discussion
May 5 · 12 min read · During kernelEye detection rule adjustments, I encountered an interesting bug worth sharing. The issue can be reproduced and understood within a few minutes through the write-up or the debugging video
Join discussion