Last week we had an outage where our API was accepting requests with invalid or missing signatures for about 90 minutes. A junior dev added a .IsValid() check but forgot to actually return early on failure, and the code just... continued. Requests hit the DB anyway.
The scary part: we only caught it because someone got an alert on weird query patterns, not because of anything in our auth logs. We had logging, but it was going to a separate system and no one was watching it.
What I'd do differently:
Make invalid auth a hard stop. We're switching to middleware that returns 401 before hitting handlers. The old approach let business logic decide if auth mattered.
Auth failures go to a different log stream than normal traffic, with immediate alerting. Splunk has saved us before, but we were too lazy to set it up for auth specifically.
Write a stupid integration test that just curls the endpoint with a bad token and asserts non-200. Should've had this day one.
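For the middleware point, here is a rough sketch of the shape, not our real code. It assumes plain net/http; validateRequest, the X-Signature header, and the /orders path are all placeholders made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// validateRequest is a hypothetical stand-in for the real signature check.
// It returns an error rather than a bool so the caller has to look at it.
func validateRequest(r *http.Request) error {
	if r.Header.Get("X-Signature") != "valid-signature" {
		return errors.New("invalid or missing signature")
	}
	return nil
}

// requireAuth wraps next and rejects bad requests before next ever runs.
func requireAuth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if err := validateRequest(r); err != nil {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return // the early return that was missing in the incident
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one request through the middleware and reports the status
// code and whether the wrapped handler was reached.
func statusFor(signature string) (status int, handlerRan bool) {
	h := requireAuth(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		handlerRan = true
		w.WriteHeader(http.StatusOK)
	}))
	req := httptest.NewRequest(http.MethodGet, "/orders", nil)
	if signature != "" {
		req.Header.Set("X-Signature", signature)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, handlerRan
}

func main() {
	for _, sig := range []string{"", "garbage", "valid-signature"} {
		status, ran := statusFor(sig)
		fmt.Printf("signature=%q status=%d handlerRan=%t\n", sig, status, ran)
	}
}
```

The point of the wrapper is that business logic never gets a vote: a handler behind requireAuth cannot run for a request that failed validation.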
We also did a code review on every auth-touching file, which revealed another place where we were checking permissions but not failing closed. I hate that we needed a fire to do this, but at least it's done now.
Silent failures in auth are the worst. The logging-somewhere-no-one-watches problem is real, though, and that's what got you.
What actually helped us: treat auth validation failure as a panic-level event. Not a log line buried in rotation. We send these to a dedicated channel that pages on-call immediately, separate from normal observability.
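A minimal sketch of the paging trigger, under stated assumptions: the one-minute window and threshold of 5 are made-up numbers, failureWindow and firstAlert are hypothetical names, and in practice this logic may live in the alerting system rather than in the service itself. Setting the threshold to 1 pages on every single failure.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// failureWindow counts auth failures inside a sliding time window.
type failureWindow struct {
	mu        sync.Mutex
	window    time.Duration
	threshold int
	stamps    []time.Time
}

func newFailureWindow(window time.Duration, threshold int) *failureWindow {
	return &failureWindow{window: window, threshold: threshold}
}

// Record notes one failure at time now and reports whether the number of
// failures inside the window has reached the threshold.
func (f *failureWindow) Record(now time.Time) bool {
	f.mu.Lock()
	defer f.mu.Unlock()

	cutoff := now.Add(-f.window)
	kept := f.stamps[:0]
	for _, t := range f.stamps {
		if t.After(cutoff) { // drop anything older than the window
			kept = append(kept, t)
		}
	}
	f.stamps = append(kept, now)
	return len(f.stamps) >= f.threshold
}

// firstAlert replays failures at the given second offsets against a
// one-minute window with threshold 5. It returns the index of the first
// failure that trips the alert, or -1 if none does.
func firstAlert(offsetsSec []int) int {
	fw := newFailureWindow(time.Minute, 5)
	base := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	for i, s := range offsetsSec {
		if fw.Record(base.Add(time.Duration(s) * time.Second)) {
			return i
		}
	}
	return -1
}

func main() {
	fw := newFailureWindow(time.Minute, 5)
	start := time.Now()
	for i := 0; i < 6; i++ {
		at := start.Add(time.Duration(i) * 5 * time.Second)
		if fw.Record(at) {
			// A real service would call the paging integration here.
			fmt.Printf("failure %d: threshold reached, page on-call\n", i+1)
		}
	}
}
```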
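Also: your .IsValid() returning bool is the real culprit. We switched to returning errors and making them hard to ignore. When validation returns its result together with an error, Go won't let you use the result without either binding the error or discarding it explicitly with an underscore, which shows up in code review. A bare error return can still be dropped silently, so a linter such as errcheck is worth adding on top.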
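To make the bool-versus-error point concrete, here is a small sketch. Verify, Sign, and the two error values are names invented for the example, and the HMAC scheme is only an assumption about what the signature check might look like. The caller can only get the payload through a call that also hands back an error.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

var (
	ErrMissingSignature = errors.New("signature missing")
	ErrBadSignature     = errors.New("signature mismatch")
)

// rawMAC computes HMAC-SHA256 of payload under key.
func rawMAC(key []byte, payload string) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(payload))
	return mac.Sum(nil)
}

// Sign returns the hex-encoded signature for payload.
func Sign(key []byte, payload string) string {
	return hex.EncodeToString(rawMAC(key, payload))
}

// Verify returns the payload only when the signature checks out. Because it
// returns (string, error), a caller that wants the payload must either bind
// the error or discard it with `_`.
func Verify(key []byte, payload, signature string) (string, error) {
	if signature == "" {
		return "", ErrMissingSignature
	}
	got, err := hex.DecodeString(signature)
	if err != nil {
		return "", ErrBadSignature
	}
	if !hmac.Equal(got, rawMAC(key, payload)) { // constant-time comparison
		return "", ErrBadSignature
	}
	return payload, nil
}

func main() {
	key := []byte("demo-key")
	good := Sign(key, "order=42")

	for _, sig := range []string{"", "deadbeef", good} {
		payload, err := Verify(key, "order=42", sig)
		if err != nil {
			fmt.Println("rejected:", err)
			continue // fail closed: nothing below runs for this request
		}
		fmt.Println("accepted:", payload)
	}
}
```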
The query pattern alert catching this instead of auth logs is telling. Your monitoring is backwards.
Jake Morrison
DevOps engineer. Terraform and K8s all day.
That's a rough one. The real problem here isn't the bug itself, it's that invalid auth silently succeeded. A few thoughts from the trenches.
First, make auth failures loud and fail-closed. If JWT validation fails, your handler should panic or return 401 immediately. No "continue anyway" paths. Go's error handling makes this easy if you're intentional about it.
Second, auth logs need to feed into your main alerting pipeline. A spike in auth failures should page someone in under a minute. That separate logging system is useless if nobody's looking at it.
Third, integration tests that deliberately send bad JWTs and assert 401s would have caught this in pre-deploy. One bad signature, one missing header, one expired token. Takes 10 minutes to write.
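Here is a sketch of those three negative cases plus one positive control. To keep it standard-library only it uses a toy token, an expiry timestamp plus an HMAC, as a stand-in for a real JWT library. Every name in it (mint, check, protected, alwaysOK, failures) is made up for the example. In a real repo this would be a _test.go file using t.Run instead of a main function.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
	"time"
)

var testKey = []byte("test-only-key")

// mint builds a toy token "<expiry-unix>.<hex HMAC>", standing in for a JWT.
func mint(key []byte, exp time.Time) string {
	body := strconv.FormatInt(exp.Unix(), 10)
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(body))
	return body + "." + hex.EncodeToString(mac.Sum(nil))
}

// check returns nil only for a correctly signed, unexpired token.
func check(token string, now time.Time) error {
	body, _, ok := strings.Cut(token, ".")
	if !ok {
		return errors.New("malformed token")
	}
	exp, err := strconv.ParseInt(body, 10, 64)
	if err != nil {
		return errors.New("malformed expiry")
	}
	// Re-mint with the real key and compare in constant time.
	if !hmac.Equal([]byte(token), []byte(mint(testKey, time.Unix(exp, 0)))) {
		return errors.New("bad signature")
	}
	if !now.Before(time.Unix(exp, 0)) {
		return errors.New("token expired")
	}
	return nil
}

// protected is the endpoint under test: 401 unless the bearer token checks out.
func protected(w http.ResponseWriter, r *http.Request) {
	auth := r.Header.Get("Authorization")
	if !strings.HasPrefix(auth, "Bearer ") ||
		check(strings.TrimPrefix(auth, "Bearer "), time.Now()) != nil {
		http.Error(w, "unauthorized", http.StatusUnauthorized)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// alwaysOK mimics the incident: nothing is ever rejected.
func alwaysOK(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) }

// failures runs the auth cases against h and describes each mismatch.
func failures(h http.HandlerFunc) []string {
	now := time.Now()
	cases := []struct {
		name, header string // empty header means no Authorization header at all
		want         int
	}{
		{"missing header", "", http.StatusUnauthorized},
		{"bad signature", "Bearer " + mint([]byte("wrong-key"), now.Add(time.Hour)), http.StatusUnauthorized},
		{"expired token", "Bearer " + mint(testKey, now.Add(-time.Hour)), http.StatusUnauthorized},
		{"valid token", "Bearer " + mint(testKey, now.Add(time.Hour)), http.StatusOK},
	}
	var failed []string
	for _, c := range cases {
		req := httptest.NewRequest(http.MethodGet, "/", nil)
		if c.header != "" {
			req.Header.Set("Authorization", c.header)
		}
		rec := httptest.NewRecorder()
		h(rec, req)
		if rec.Code != c.want {
			failed = append(failed, fmt.Sprintf("%s: got %d, want %d", c.name, rec.Code, c.want))
		}
	}
	return failed
}

func main() {
	fmt.Println("protected handler failures:", failures(protected))
	fmt.Println("always-200 handler failures:", failures(alwaysOK))
}
```

The alwaysOK handler is there to show that the same cases flag a handler that never rejects anything, which is the behavior the incident produced.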
The query pattern alert saved you, but that's luck. You want to fail at the auth layer, not detect the blast radius afterwards.