11-Second Time to First Token on a Healthy vLLM Server
TL;DR
A vLLM health endpoint says "ok." nvidia-smi says 95% utilization. But a user just waited 11 seconds for their first token. We reproduced a real vLLM issue on an RTX 4090 and traced every CUDA
ingero.hashnode.dev, 7 min read