Tag feed

#sre

919 posts · 277 followers


Abhishek Gangurde · ragsystemedutech.hashnode.dev · 1d ago · 13 min read

Zero to Hero - Linux Production Troubleshooting Commands

⚙️ Real-Time Linux Production Scenarios: SRE / DevOps Field Reference, Full Command Guide with Explanations. Server is 100% Full. Diagnose: 🧰 • df -hT: Shows disk usage per mounted filesystem in hum


Luyu Fang · codezwin.hashnode.dev · 1d ago · 3 min read

A 30-Minute Java Production Incident Triage Workflow

A production incident becomes harder when facts, guesses, and actions are mixed into the same chat thread. Someone sees a latency spike and calls the database the root cause. Another person notices a


OnyiGlobal2025 · onyiglobal2025.hashnode.dev · 2d ago · 8 min read

I Deliberately Broke My Kubernetes Cluster 8 Times — Here's What It Taught Me

Why build a lab to break things on purpose Most portfolio projects prove you can deploy something. Fewer prove you know what to do when it breaks — and in production, it eventually does. I'd already b


jasminepark · jas-blogs.hashnode.dev · 2d ago · 18 min read

Our LLM service had no backpressure. The provider got 7x slower and our p99 got 25x worse.

TL;DR. Our summarization endpoint holds a p99 under 3 seconds. On 21/05 our provider degraded: median call time went from 620ms to about 4.3 seconds, roughly 7x. Our p99 went to 35 seconds, roughly 25


Pillars · pillars.hashnode.dev · 2d ago · 12 min read

Kubernetes Cost Optimization: Why Your Cluster Is Overprovisioned

Every workload in Kubernetes is defined by two resource profiles: what it requests and what it actually consumes. The scheduler, autoscaler, and ultimately your cloud bill are driven by the first. Mos


Mayank Maurya · ai-k8s.hashnode.dev · 3d ago · 12 min read

What if the safest Kubernetes fix is no fix at all?

AI-assisted DevOps and SRE tools are becoming more common. Tools like K8sGPT can scan Kubernetes clusters, detect issues, and explain what might be going wrong. Most evaluations of these tools focus o


Alex Nyambura · blog.lxmwaniky.me · 3d ago · 6 min read

The SRE Postmortem: Rebuilding Systems on the Scars of Production Failures

Let’s be honest: when a production system crashes, our natural, raw human instinct is to look for a scapegoat. Imagine a high-traffic payment gateway goes down for 45 minutes on a Friday afternoon. Th


abhinav sharma · fieldnoteswithabhinav.hashnode.dev · 3d ago · 5 min read

Field Notes Weekly #1 - DevOps, Kubernetes & Azure Digest

Field Note AI agent infrastructure is starting to look a lot like early Kubernetes: a lot of hand-rolled glue, a few emerging runtimes, and everyone pretending standards exist. Reading through Google’


ProjiQ App · projiq.hashnode.dev · 3d ago · 6 min read

Incident Postmortem Template & Guide for Engineering Teams

Every engineering team has outages. The teams that improve fastest are not the ones that have the fewest incidents — they are the ones that extract the most learning from each one. A disciplined, blam


Rudra Ponkshe · blog.realrudrap.dev · 5d ago · 7 min read

The Attacker's Discipline

There is a specific kind of cognitive dissonance that comes from reverse engineering a malware sample in one terminal while your own app's backend runs in another. I spent the better part of the last

