So we had a shared API key in a single env file. Rotating it meant updating prod, staging, and three client apps. We did prod first, thought we were done, deployed a change that referenced the old key, and suddenly 40% of traffic started failing.
The fire: spent three hours debugging because logs didn't show auth failures clearly. We were just seeing 401s with no context. Had to manually grep through middleware to figure out what happened.
what i'd do differently:
stop using shared secrets. we migrated to mTLS for service-to-service after this. way cleaner. client certs rotated separately.
graceful key rollover. keep old and new keys valid for 24h during rotation. saved us twice since.
actually log auth rejections properly. we added a structured log that shows which key/cert failed and why. takes five minutes to debug now instead of three hours.
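for the graceful rollover point, here's a rough sketch of what "keep old and new keys valid for 24h" can look like on the verifying side. the names (`ApiKey`, `retired_at`, `authenticate`) are made up for illustration, not our actual code, and it assumes keys are compared as plain shared secrets:

```python
import hmac
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

OVERLAP = timedelta(hours=24)  # overlap window mentioned in the post


@dataclass
class ApiKey:
    key_id: str                            # hypothetical label, e.g. "v1"
    secret: str
    retired_at: Optional[datetime] = None  # set when a newer key replaces this one


def is_accepted(key: ApiKey, now: datetime) -> bool:
    """A key is accepted while current, or for OVERLAP after it was retired."""
    return key.retired_at is None or now < key.retired_at + OVERLAP


def authenticate(presented: str, keys: List[ApiKey], now: datetime) -> Optional[str]:
    """Return the key_id that matched, or None if no accepted key matches."""
    for key in keys:
        # compare_digest avoids leaking match position through timing
        if is_accepted(key, now) and hmac.compare_digest(
            presented.encode(), key.secret.encode()
        ):
            return key.key_id
    return None


# Example: v1 was retired at t0, v2 is the current key.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
keys = [ApiKey("v1", "old-secret", retired_at=t0), ApiKey("v2", "new-secret")]
matched = authenticate("old-secret", keys, t0 + timedelta(hours=1))
```

returning the matched key ID (instead of just true/false) is deliberate: it tells you which clients are still on the old key before the window closes.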
the real kicker: we had terraform code to rotate these secrets but never ran it because the runbook was in confluence. now it's in the CI pipeline. runs every 30 days automatically.
authentication bugs hide in plain sight because every failure looks the same from the outside: a bare 401 with no context. worth spending the time upfront to make failures obvious.
Yeah, this is rough but pretty textbook. The "stop using shared secrets" take is right, but here's what actually mattered in your incident: you deployed before verifying all clients were ready.
We use a simple pattern now: issue the new key, keep the old one valid for 48 hours, deploy the new key to everything that matters, and only then revoke the old one. Clients get a grace period to switch over instead of failing mysteriously.
For the logging piece, log the actual auth decision (valid, invalid, or expired, plus the key ID), not just the HTTP status. Saved us hours on similar incidents.
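Something like this is what I mean by logging the decision. It's a minimal sketch, not our production code: the field names (`decision`, `key_id`, `request_id`, `reason`) and the `auth` logger name are assumptions, so adapt them to whatever your log pipeline expects.

```python
import json
import logging
from enum import Enum
from typing import Optional

logger = logging.getLogger("auth")  # assumed logger name


class Decision(str, Enum):
    VALID = "valid"
    INVALID = "invalid"
    EXPIRED = "expired"


def log_auth_decision(
    decision: Decision,
    key_id: Optional[str],
    request_id: str,
    reason: str = "",
) -> str:
    """Emit one structured line per auth decision and return it."""
    line = json.dumps(
        {
            "event": "auth_decision",
            "decision": decision.value,
            "key_id": key_id,          # which key was presented, never the secret
            "request_id": request_id,
            "reason": reason,
        },
        sort_keys=True,
    )
    # Rejections go out at WARNING so they stand out from normal traffic.
    level = logging.INFO if decision is Decision.VALID else logging.WARNING
    logger.log(level, line)
    return line
```

Call it right where the middleware decides to return the 401, so the log line and the status code always come from the same branch.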
The real lesson though: any change touching auth needs a checklist across all systems before you touch prod. No exceptions at 2am.
Alex Petrov
Systems programmer. Rust evangelist.
Good post-mortem. Here's what actually matters though: your logging was the real problem, not the key rotation process.
We had a nearly identical incident. The "fix" wasn't moving to per-service secrets (overkill for most orgs), it was making auth failures loud and traceable. Log the actual key hash or request ID that failed, not just "401". Include which secret version was used.
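For the "key hash" part, a rough sketch of what I'd log. This assumes the keys are long random strings; a plain hash of a short or guessable secret can be brute-forced, so don't use it for those. The function names and the `secret_version` label are illustrative only.

```python
import hashlib


def key_fingerprint(secret: str, length: int = 12) -> str:
    """Short SHA-256 prefix that identifies a key without revealing it."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()[:length]


def rejection_context(presented: str, secret_version: str, request_id: str) -> dict:
    """Fields to attach to a 401 so the failing key can be traced."""
    return {
        "key_fingerprint": key_fingerprint(presented),
        "secret_version": secret_version,  # version the server expected
        "request_id": request_id,
    }
```

If the fingerprint in the rejection log matches the fingerprint of the previous key, you know immediately that a client is still on the old secret.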
That catches this in staging before 2am even happens. The rotation itself can stay simple.
Shared secrets are fine if your deployment process enforces that they're in sync. Git + CI/CD does this. The manual, one-at-a-time "update prod, staging, and three clients" pattern is the actual failure mode. Fix that first.
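A sketch of the kind of sync check a CI job could run before a deploy goes out. How you collect each target's deployed secret version is up to your tooling; here it's just a dict, and the target names are made up.

```python
from typing import Dict, List


def out_of_sync(expected_version: str, deployed: Dict[str, str]) -> List[str]:
    """Return the targets whose secret version differs from the expected one."""
    return sorted(
        target for target, version in deployed.items() if version != expected_version
    )


def check_or_fail(expected_version: str, deployed: Dict[str, str]) -> None:
    """Raise SystemExit with a non-zero code so the CI job fails loudly."""
    stale = out_of_sync(expected_version, deployed)
    if stale:
        raise SystemExit(
            f"secret version mismatch, expected {expected_version}: {', '.join(stale)}"
        )
```

Run `check_or_fail` as a gate before the prod deploy step, so a client that's still on the old version blocks the rollout instead of finding out at 2am.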