The "from broken to working" angle is gold — most k8s+LLM posts skip the actual failure modes.

One thing I'd add from deploying Ollama on EKS for clients: set resources.limits.memory explicitly to ~1.5x your model size, otherwise the OOMKiller will cut you down mid-stream during long generations. Also, the readinessProbe should hit /api/tags, not /, because Ollama's root returns 200 even before models are loaded.

What's your cold-start time looking like when a pod restarts and has to reload a 7B model from volume?
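For anyone who wants the concrete bits, here's roughly what those two tweaks look like in the container spec — sizes are illustrative for a ~5 GiB quantized 7B, so scale to your model:

```yaml
# Illustrative snippet, not a full Deployment. Port 11434 is Ollama's default.
resources:
  requests:
    memory: "6Gi"
  limits:
    memory: "8Gi"        # ~1.5x model size; headroom for KV cache on long generations
readinessProbe:
  httpGet:
    path: /api/tags      # / returns 200 even before models are loaded
    port: 11434
  initialDelaySeconds: 10
  periodSeconds: 5
```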