Yeah, this is rough but pretty textbook. The "stop using shared secrets" take is right, but here's what actually mattered in your incident: you deployed before verifying all clients were ready.
We use a simple pattern now. Issue new key, keep old one valid for 48 hours, deploy everything that matters first, then rotate. Clients get a grace period to fail gracefully instead of mysteriously.
For the logging piece, log the actual auth decision (valid/invalid/expired key ID) not just the HTTP status. Saved us hours on similar incidents.
The real lesson though: any change touching auth needs a checklist across all systems before you touch prod. No exceptions at 2am.