Great breakdown of the cold start challenges. I hit similar issues when running my AI agent 24/7 on a Mac Mini — the agent handles deploys, content publishing, and monitoring autonomously. One thing that helped was moving latency-sensitive tasks to a local queue instead of Lambda. For your DynamoDB timeout pattern, have you tried read-ahead caching with a TTL slightly shorter than your expected access pattern? That cut our p99 by ~40% in a similar setup.