We'd switched to the Node 18 runtime and figured cold starts would be fine. They weren't. A function that usually runs in 200ms was taking 8-12 seconds on cold starts. Our traffic spikes at night (Asia timezone), so we hit enough concurrent invocations that we were constantly spinning up fresh containers. Requests were timing out. Simple.
The fix was obvious in hindsight but we skipped it during the migration: provisioned concurrency. We added 10 reserved instances to the function and the problem evaporated. Cost went from $12/month to $60/month for that function. Worth it.
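For reference, the same fix can be applied from the AWS CLI. This is a sketch; the function name and alias are placeholders, and note that provisioned concurrency must target a published version or alias, never `$LATEST`.

```shell
# Reserve 10 warm instances for the function (name and alias are placeholders)
aws lambda put-provisioned-concurrency-config \
  --function-name my-night-spike-fn \
  --qualifier prod \
  --provisioned-concurrent-executions 10

# Check warm-up progress; Status should move from IN_PROGRESS to READY
aws lambda get-provisioned-concurrency-config \
  --function-name my-night-spike-fn \
  --qualifier prod
```

The same settings are available in Terraform and the console; the CLI just makes the version/alias requirement explicit.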
What I'd do differently next time:
Also: stop using Node for Lambdas where you can. A compiled Go binary with minimal dependencies cold-starts in about 50ms. We're converting this function to Go next sprint. The Node.js overhead isn't worth it.
Node 18's startup overhead catches people constantly. The runtime itself is heavier, and your dependencies likely have more initialization code than you realize. Cold starts feel theoretical until it's your on-call getting paged.
Provisioned concurrency is the band-aid, though. The real fix is probably code-splitting your handler dependencies and being aggressive about what loads at startup. We cut cold starts from 6s to 1.2s by moving heavy stuff out of module scope. Worth measuring where your 8-12s is actually going before throwing reserved capacity at it. Sometimes it's just one library doing expensive I/O on import.
This is a classic "measure before and after" miss. Node 18 does have heavier cold starts than 16, but 8-12 seconds suggests something else was also loaded. Were you bundling differently, or did dependencies balloon?
That said, provisioned concurrency is expensive and masks the real problem. Better approach: profile what's actually initializing slowly in those cold starts. I'd bet it's either a large dependency tree, top-level async work, or connection pooling happening at import time.
For future deploys, always run cold start benchmarks against your actual traffic pattern before pushing. Asia traffic spikes are predictable. Could have caught this in staging.
Jake Morrison
DevOps engineer. Terraform and K8s all day.
Yeah, that's rough, but it's a textbook cold start problem. Node 18 is heavier than 16, especially if you're bundling anything substantial.
Before you celebrate, double-check what those 10 reserved instances are costing monthly. Sometimes the math doesn't work out, and you're better off optimizing the code path instead. Tree-shake your dependencies, defer non-critical imports, that kind of thing.
Also worth setting up CloudWatch alarms for p99 latency and duration percentiles going forward. Catch drift before it becomes a 3am incident next time.
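A sketch of that alarm via the AWS CLI; the alarm name, function name, SNS topic ARN, and the 3000ms threshold are all placeholders to adapt.

```shell
# Alarm when p99 duration for one function exceeds 3s for two
# consecutive 5-minute periods (all names and ARNs are placeholders)
aws cloudwatch put-metric-alarm \
  --alarm-name my-fn-p99-duration \
  --namespace AWS/Lambda \
  --metric-name Duration \
  --dimensions Name=FunctionName,Value=my-night-spike-fn \
  --extended-statistic p99 \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 3000 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:oncall-topic
```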