Good post-mortem. Here's what actually matters though: your logging was the real problem, not the key rotation process.
We had a nearly identical incident. The "fix" wasn't moving to per-service secrets (overkill for most orgs), it was making auth failures loud and traceable. Log the actual key hash or request ID that failed, not just "401". Include which secret version was used.
That catches this in staging before 2am even happens. The rotation itself can stay simple.
Shared secrets are fine if your deployment process enforces they're in sync. Git + CI/CD does this. The async manual "update prod, staging, and three clients" pattern is the actual failure mode. Fix that first.
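A sketch of the "loud and traceable" auth failure described in the comment above, added here as an example and not code from the commenter or the original post-mortem. The names (`check_key`, `ACTIVE`, `RETIRED_FPS`), the version labels and the sample keys are all assumptions, as is keeping fingerprints of retired secrets so that a stale client can be told apart from an unknown one.

```python
import hashlib
import hmac
import json
import logging

log = logging.getLogger("auth")

def fingerprint(key: str) -> str:
    """Short SHA-256 prefix: enough to correlate keys, never the key itself."""
    return hashlib.sha256(key.encode()).hexdigest()[:12]

# Constructed sample data (hypothetical). Active secrets by version; for
# retired versions only the fingerprint is kept, so the secret is gone but
# a client still presenting it can be identified in the log.
ACTIVE = {"v4": "constructed-sample-key-v4"}
RETIRED_FPS = {fingerprint("constructed-sample-key-v3"): "v3"}

def check_key(presented: str, request_id: str, client: str) -> dict:
    """Validate a shared secret and return the structured auth event logged."""
    fp = fingerprint(presented)
    event = {"event": "auth", "request_id": request_id, "client": client,
             "presented_fp": fp, "active_versions": sorted(ACTIVE)}
    for version, secret in ACTIVE.items():
        # Constant-time comparison of the presented key against each version.
        if hmac.compare_digest(presented.encode(), secret.encode()):
            event.update(result="ok", matched_version=version)
            log.info(json.dumps(event))
            return event
    # Failure path: record why it failed, not just "401".
    stale = RETIRED_FPS.get(fp)
    event.update(result="denied", matched_version=None,
                 reason=f"retired_secret:{stale}" if stale else "unknown_secret")
    log.warning(json.dumps(event))
    return event
```

A denied event with `reason` set to `retired_secret:v3` points straight at a client that missed the rotation, which is the signal a bare 401 hides.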