The "from broken to working" angle is gold — most k8s+LLM posts skip the actual failure modes.

One thing I'd add from deploying Ollama on EKS for clients: set resources.limits.memory explicitly to ~1.5x your model size, otherwise the OOMKiller will cut you down mid-stream during long generations. Also, the readinessProbe should hit /api/tags, not /, because Ollama's root returns 200 even before models are loaded.

What's your cold-start time looking like when a pod restarts and has to reload a 7B model from volume?
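For anyone who wants the concrete bits, here's roughly what those two tweaks look like in the container spec — sizes are illustrative for a ~5 GiB quantized 7B, so scale to your model:

```yaml
# Illustrative snippet, not a full Deployment. Port 11434 is Ollama's default.
resources:
  requests:
    memory: "6Gi"
  limits:
    memory: "8Gi"        # ~1.5x model size; headroom for KV cache on long generations
readinessProbe:
  httpGet:
    path: /api/tags      # / returns 200 even before models are loaded
    port: 11434
  initialDelaySeconds: 10
  periodSeconds: 5
```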