That's the critical insight most people miss. Cold start pain is usually a symptom of doing too much work at invocation time rather than a platform limitation.
What you did (module-level initialization) is exactly right, but the real lesson is that serverless forces you to think about initialization differently than traditional servers. Your gRPC stubs and DB connections should live outside the handler scope anyway for connection pooling.
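A minimal sketch of that pattern, with hypothetical stand-ins (the `DatabasePool` class, `DB_DSN` env var, and handler shape are illustrative, not from your actual stack):

```python
import os

# Hypothetical stand-in for a real gRPC stub or DB driver.
class DatabasePool:
    def __init__(self, dsn):
        # In a real client, the expensive handshake/TLS/auth happens here.
        self.dsn = dsn

# Module scope: this runs once per container cold start, then is reused
# by every warm invocation that lands on the same container.
DB = DatabasePool(os.environ.get("DB_DSN", "postgres://localhost/app"))

def handler(event, context):
    # Warm invocations skip straight here and reuse DB from above.
    return {"statusCode": 200, "pool": DB.dsn}
```

The point is the scoping: anything at module level is amortized across the container's lifetime, anything inside `handler` is paid on every request.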
The expensive solutions (provisioned concurrency, upsizing) are band-aids. They work, but you're paying for idle capacity. Better approach: profile what's actually slow in your cold path, split the work that's genuinely needed before the first request from work that can be deferred, and make sure expensive resources are initialized once per container lifecycle, not per request.
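One way to get both the profiling and the once-per-container guarantee is to wrap each expensive resource in a timed, cached factory. This is a sketch, assuming Python; the resource names and factories are placeholders for your real clients:

```python
import functools
import time

def timed_init(name, factory):
    """Wrap an expensive factory so init runs once per container and is timed."""
    @functools.lru_cache(maxsize=1)
    def get():
        start = time.perf_counter()
        resource = factory()
        # Logs one line per container, telling you exactly what's slow.
        print(f"{name} init took {time.perf_counter() - start:.3f}s")
        return resource
    return get

# Placeholder factories; swap in your real DB/gRPC constructors.
get_db = timed_init("db", lambda: {"conn": "ok"})
get_stub = timed_init("grpc", lambda: {"stub": "ok"})

def handler(event, context):
    db = get_db()      # first call on this container pays init; later calls are cached
    stub = get_stub()
    return {"db": db["conn"], "stub": stub["stub"]}
```

A nice side effect of lazy factories over plain module-level globals: a handler path that never touches the DB never pays for the DB connection at all.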
That said, if your p99 is still hitting seconds after optimization, serverless might genuinely be the wrong tool for that workload. Sometimes it's worth switching to ECS or K8s where cold starts aren't a factor at all.