Tag feed

#incidents

9 posts · 0 followers

Samson Tanimawo · novaaiops.hashnode.dev · May 5 · 5 min read

Building an Incident Response Playbook Library

The Folder Full of Stale Runbooks
Every engineering org has a Confluence folder of incident runbooks. Every runbook was written during or after an incident. Each is a snapshot of how to fix one specific thing. After 2 years, the folder has 400 runboo...

Samson Tanimawo · novaaiops.hashnode.dev · May 2 · 4 min read

Incident Severity Levels: SEV-1 to SEV-5 Calibration

Why Severity Is Broken at Most Companies
Everyone has severity levels. Almost nobody agrees on what they mean. Ask ten engineers what SEV-2 means and you'll get eight different answers. This causes:
- Under-paged incidents (people thought SEV-3 meant ...

Samson Tanimawo · novaaiops.hashnode.dev · Apr 20 · 3 min read

The Incident Commander Role: Running Incidents Without Chaos

Everyone's Debugging, Nobody's Leading
Five engineers in an incident channel. All debugging independently. Nobody coordinating. Three people checking the same dashboard. Two trying conflicting fixes. Customers waiting. This is what incidents look lik...

Jordan Bourbonnais · clawpulse.hashnode.dev · Apr 20 · 3 min read

A Practical Guide to Managing AI Agent Incidents: From Detection to Resolution

What you'll learn:
- How to identify the root causes of AI agent failures before they impact users
- A structured incident response framework specifically designed for autonomous agents
- How to implement monitoring strategies that catch anomalies in real...

Samson Tanimawo · novaaiops.hashnode.dev · Apr 19 · 4 min read

MTTR Optimization: The 7 Levers That Actually Move the Needle

MTTR Is a Lagging Indicator
Everyone tracks Mean Time to Resolve. Few understand what actually drives it. MTTR isn't one metric, it's four:
MTTR = MTTD + MTTA + MTTI + MTTF
- MTTD: Mean Time to Detect (monitoring fired)
- MTTA: Mean Time to Acknowl...

Samson Tanimawo · novaaiops.hashnode.dev · Apr 18 · 3 min read

AI in Incident Response: Hype vs. Reality in 2024

Every Vendor Claims AI Magic
Open any monitoring vendor's website and you'll see:
- "AI-powered incident detection!"
- "ML-driven root cause analysis!"
- "Intelligent alerting!"
After evaluating a dozen AI ops tools and running three in production, here's ...

Samson Tanimawo · novaaiops.hashnode.dev · Apr 15 · 3 min read

Post-Mortem Best Practices That Actually Drive Change

The Post-Mortem Nobody Learns From
I've sat through hundreds of post-mortems. Most follow the same pattern: something breaks, someone writes a Google Doc, we have a meeting, we list action items, nobody follows up, the same thing happens again in 3 m...

Samson Tanimawo · novaaiops.hashnode.dev · Apr 13 · 3 min read

3am Incident Response: What I Learned from 200+ Pages

The First 5 Minutes Matter Most
I've been paged over 200 times in my career. The pattern is always the same: the first 5 minutes determine whether you resolve in 15 minutes or 3 hours. Here's what I've learned.
The 3am Brain Problem
At 3am, your cogn...

Jono Herrington · jonoherrington.hashnode.dev · Mar 21 · 6 min read

I Added a Meeting to Feel Like a Leader

A release broke payments for 10 minutes. So I added a meeting. Release retrospective. Every deployment gets a review. Sounds like leadership. Here's what actually fixed the problem. I got into the logs with the developer who shipped the feature. It w...
