2h ago · 3 min read · The Backup That Wasn't · We had backups. Daily snapshots to S3. Perfectly configured. Never tested. When we needed to restore after a data corruption incident, we discovered the backups had been silently failing for 3 weeks. The S3 bucket policy had ch...
18h ago · 3 min read · Everyone's Debugging, Nobody's Leading · Five engineers in an incident channel. All debugging independently. Nobody coordinating. Three people checking the same dashboard. Two trying conflicting fixes. Customers waiting. This is what incidents look lik...
2d ago · 11 min read · GitHub repo: https://github.com/SubhanshuMG/agents-as-state-machines · The thesis in one paragraph: Stop calling them agents. They are state machines that invoke LLMs at certain transitions. The multi-a
1d ago · 3 min read · The Monday Morning Disaster · Every Monday, the same story: the incoming on-call engineer has no idea what happened over the weekend. The outgoing engineer left a cryptic Slack message at 11pm and went to bed. We lost 2 hours every Monday rebuilding co...
2d ago · 3 min read · The SLO Translation Problem · You define an SLO: 99.95% availability with p99 latency under 200ms. Engineering loves it. Product managers glaze over. The problem isn't the SLO. It's how we communicate it. Speaking Product Language: Translate technical S...
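One common way to put an availability figure like 99.95% into plainer terms is to convert it to allowed downtime. The sketch below is only an illustration, not the post's own method: the function name and the 30-day window are assumptions made for the example.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Convert an availability SLO into minutes of permitted downtime.

    window_days is an assumed reporting window, not something the excerpt states.
    """
    window_minutes = window_days * 24 * 60
    # The share of the window that may be unavailable is (100 - SLO)%.
    return window_minutes * (100 - slo_percent) / 100


budget = allowed_downtime_minutes(99.95)
print(f"A 99.95% SLO allows about {budget:.1f} minutes of downtime per 30 days")
```

Under these assumptions a 99.95% target works out to roughly 21.6 minutes of downtime in a 30-day window, which is usually easier to discuss than a percentage.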
2d ago · 4 min read · MTTR Is a Lagging Indicator · Everyone tracks Mean Time to Resolve. Few understand what actually drives it. MTTR isn't one metric, it's four: MTTR = MTTD + MTTA + MTTI + MTTF. MTTD: Mean Time to Detect (monitoring fired); MTTA: Mean Time to Acknowl...
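As a rough illustration of the decomposition in the excerpt, the sketch below averages four per-incident phase durations and sums them. The incident records and minute values are invented for the example, and the dictionary keys simply reuse the acronyms from the formula; this is not the author's tooling.

```python
from statistics import mean

# The four terms from the formula MTTR = MTTD + MTTA + MTTI + MTTF.
PHASES = ("MTTD", "MTTA", "MTTI", "MTTF")

# Hypothetical incidents: minutes spent in each phase (illustrative values only).
incidents = [
    {"MTTD": 4, "MTTA": 6, "MTTI": 30, "MTTF": 20},
    {"MTTD": 10, "MTTA": 2, "MTTI": 50, "MTTF": 18},
]


def phase_means(records):
    """Average each phase across all incidents."""
    return {phase: mean(r[phase] for r in records) for phase in PHASES}


def mttr(records):
    """MTTR computed as the sum of the four phase means."""
    return sum(phase_means(records).values())


breakdown = phase_means(incidents)
total = mttr(incidents)
print(breakdown, total)
```

Keeping the per-phase means next to the total shows which phase dominates, which the single MTTR number hides.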