We pushed a new service version that changed our component prop API. The deployment went smoothly on staging, but production had far more concurrent consumers than we tested with. Old clients couldn't parse the new response shape; by the time we noticed the error spike, half our customers were getting 422s.
The rollback took forever because we had zero automation around it. Someone had to manually revert in ArgoCD, wait for the new pods to spin up, then check dashboards. Nightmare.
What I'm doing differently now:
Canary deployments for anything touching APIs. We're using Argo Rollouts alongside ArgoCD to push to 10% of replicas first, run synthetic checks against real traffic patterns for five minutes, then proceed or auto-rollback.
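For reference, here's roughly what that looks like as an Argo Rollouts canary strategy (Argo Rollouts is the canary engine commonly paired with ArgoCD). This is a sketch, not our exact config; the names (api-service, api-success-rate) and the analysis template are illustrative.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api-service
  strategy:
    canary:
      steps:
      - setWeight: 10            # shift 10% to the new version first
      - pause: {duration: 5m}    # hold while synthetic checks run
      - analysis:
          templates:
          - templateName: api-success-rate   # failing analysis triggers auto-rollback
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
      - name: api-service
        image: api-service:latest
```

The analysis step is what makes the rollback automatic: if the referenced AnalysisTemplate's metric query fails, Rollouts aborts and scales the stable version back up without anyone touching ArgoCD.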
Contract tests before merge. We generate a client SDK from our OpenAPI spec and run it against the new service in CI. Catches breaking changes before they leave your branch.
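The core idea behind the contract test is simple enough to show in a few lines: decode the new service's response using the shape the old clients expect, and fail CI if a field they depend on has gone missing. The real setup generates a full SDK from the OpenAPI spec; the struct and JSON below are illustrative stand-ins.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OldClientResponse is the shape the previously released clients expect.
type OldClientResponse struct {
	UserID string `json:"user_id"`
	Name   string `json:"name"`
}

// breaksOldClients reports whether newResponse drops a field old clients need.
// json.Unmarshal silently leaves missing fields zero-valued, so we check the
// required fields explicitly.
func breaksOldClients(newResponse []byte) bool {
	var old OldClientResponse
	if err := json.Unmarshal(newResponse, &old); err != nil {
		return true
	}
	return old.UserID == "" || old.Name == ""
}

func main() {
	// The new version renamed user_id to id — old clients silently lose it.
	newShape := []byte(`{"id": "u123", "name": "Ada"}`)
	fmt.Println(breaksOldClients(newShape)) // prints true

	oldShape := []byte(`{"user_id": "u123", "name": "Ada"}`)
	fmt.Println(breaksOldClients(oldShape)) // prints false
}
```

The "silently zero-valued" behavior is exactly why this class of break doesn't show up as a deserialization error in staging — the decode succeeds and the data is just gone.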
Pod disruption budgets. We had minAvailable: 0 like idiots. Now we enforce minAvailable: 1 so k8s never terminates everything at once during deployments.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: api-service
The canary thing alone saved us from three bad deploys in the last month. Would've paid for the engineering time ten times over by now.
Marcus Chen
Full-stack engineer. Building with React and Go.
This is exactly why we always deploy behind feature flags now. Changed our Go API response last year, wrapped it in a flag defaulting to the old shape, then slowly rolled out the new format over a week while monitoring client compatibility.
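A minimal sketch of that flag-gated response pattern in Go, assuming a flag check that defaults to the old shape (newShapeEnabled here is a stand-in for whatever flag provider you use, and the field names are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// legacyResponse is the shape existing clients already parse.
type legacyResponse struct {
	UserID string `json:"user_id"`
	Name   string `json:"name"`
}

// newResponse is the shape we're migrating to.
type newResponse struct {
	User struct {
		ID   string `json:"id"`
		Name string `json:"name"`
	} `json:"user"`
}

// newShapeEnabled stands in for a real feature-flag lookup. It defaults to
// false, so clients keep getting the old shape until the flag is rolled out.
func newShapeEnabled(userID string) bool {
	return false // widen the cohort gradually while watching client errors
}

// renderUser picks the response shape per request based on the flag.
func renderUser(id, name string) ([]byte, error) {
	if newShapeEnabled(id) {
		var r newResponse
		r.User.ID, r.User.Name = id, name
		return json.Marshal(r)
	}
	return json.Marshal(legacyResponse{UserID: id, Name: name})
}

func main() {
	body, _ := renderUser("u123", "Ada")
	fmt.Println(string(body)) // old shape while the flag is off
}
```

Because the decision happens per request, you can ramp by user cohort and roll back instantly by flipping the flag, with no deploy in the loop.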
The rollback automation piece matters too though. We have a simple bash script that reverts the last commit and triggers CI/CD. Takes 90 seconds. Not fancy but it's saved us multiple times when something hits prod that didn't show up in staging load tests.
The gap between staging load and production is real. We now run load tests that mirror actual traffic patterns pulled from prod metrics. Changed everything.