Great stack choice. Ollama + FastAPI is what we landed on too for clients without a GPU budget. One tip that saved us hours: mount the Ollama models directory as an external volume so rebuilds don't re-pull 4-7 GB models every time (rough compose sketch below). Also worth adding a /health endpoint that pings Ollama's /api/tags; that makes k8s liveness probes far more reliable than just checking whether FastAPI is alive (sketch below as well). Curious what inference latency you're seeing on CPU-only vs a small GPU.
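
For the volume tip, a minimal compose sketch, assuming the stock `ollama/ollama` image (which stores models under `/root/.ollama`); the volume name `ollama-models` and the port mapping are illustrative:

```yaml
# docker-compose.yml -- sketch, not a drop-in config
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      # Named volume persists pulled models across image rebuilds,
      # so `docker compose up --build` doesn't re-download them.
      - ollama-models:/root/.ollama

volumes:
  ollama-models:
```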
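
And the health endpoint, as a rough FastAPI sketch. `OLLAMA_URL` and the 2-second timeout are assumptions for illustration, not anything from your setup:

```python
# health.py -- sketch of a liveness endpoint that checks Ollama, not just FastAPI
import os

import httpx
from fastapi import FastAPI, Response

app = FastAPI()

# Assumed env var; defaults to the compose service name above.
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://ollama:11434")


@app.get("/health")
async def health(response: Response):
    # Report healthy only if Ollama itself answers /api/tags.
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            r = await client.get(f"{OLLAMA_URL}/api/tags")
            r.raise_for_status()
    except httpx.HTTPError:
        # 503 tells the k8s probe to fail and restart the pod.
        response.status_code = 503
        return {"status": "unhealthy", "ollama": "unreachable"}
    return {"status": "ok"}
```

Point the k8s livenessProbe at /health and the pod gets restarted when Ollama stops answering, not only when the FastAPI process dies.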