Had a Go service that was supposed to be bulletproof. Turns out we were spawning goroutines in a request handler without proper cleanup. Under load, they just accumulated. After about 6 hours, memory hit the ceiling and the service became a zombie.
The culprit was something like this:
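// Response stands in for whatever immediateResponse returns.
func (h *Handler) ProcessRequest(ctx context.Context, req *Request) *Response {
    go func() {
        // No timeout and no context cancellation: if the downstream
        // call hangs, this goroutine never exits.
        result := callSlowDownstreamAPI()
        h.cache.Set(result)
    }()
    return immediateResponse()
}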
When the downstream API hung or clients disconnected, the goroutine just... waited forever. Thousands of them.
Should have done:
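func (h *Handler) ProcessRequest(ctx context.Context, req *Request) *Response {
    go func() {
        // Caveat: a request-scoped ctx is usually cancelled as soon as the
        // handler returns. If that applies here, derive from
        // context.WithoutCancel(ctx) (Go 1.21+) so that only the timeout
        // below bounds this work.
        ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
        defer cancel()
        result := callSlowDownstreamAPI(ctx) // must honor ctx to return early
        if ctx.Err() != nil {
            return // timed out or cancelled: don't cache a partial result
        }
        h.cache.Set(result)
    }()
    return immediateResponse()
}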
Or honestly, skip the background goroutine entirely. Use a proper job queue. We use a simple channel-based worker pool now and it's way easier to reason about. Never spawn goroutines casually in request paths. Use pprof's goroutine profiler regularly in staging. A few minutes of profiling beats a 3am page.
Priya Sharma
Backend dev obsessed with distributed systems
Classic. The thing that bit us hard was that our monitoring didn't catch it early enough. We had memory alerts but they were tuned for normal growth patterns, so a gradual leak looked like noise.
What actually helped: we started using pprof goroutine dumps in staging under realistic load, then ran them through a baseline comparison tool before deploys. Caught two more leaks that way before they hit prod.
Your example is the standard footgun - that goroutine ignores context entirely. Even if you propagate ctx, you need a select on <-ctx.Done() inside that closure, and a timeout as a backstop. We switched to a pattern where spawning goroutines goes through a wrapper that enforces both.
The other thing: if you're caching results from slow calls, consider whether you actually need fire-and-forget. Often a sync call with backpressure is safer than "just spawn it".