KV Cache Is the Real Bottleneck in Long Context Inference
If you have been benchmarking LLM inference and wondering why latency and batch size collapse as context grows, the answer is usually not FLOPs. It is memory, and specifically the key-value (KV) cache.
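A quick back-of-envelope calculation shows why the KV cache dominates at long context. The sketch below is illustrative only: the model dimensions (32 layers, 32 KV heads, head dimension 128, fp16) are assumptions roughly in the shape of a 7B-class dense transformer, not figures from this article.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, dtype_bytes: int = 2) -> int:
    """Total KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values
    at every layer, for every token in the sequence.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
print(per_token)                 # 524288 bytes, i.e. 0.5 MiB per token

at_32k = kv_cache_bytes(32, 32, 128, seq_len=32768, batch_size=1)
print(at_32k / 2**30)            # 16.0 GiB for one 32k-token sequence
```

At half a mebibyte per token, a single 32k-token request pins 16 GiB of accelerator memory before any weights or activations are counted, which is why batch size collapses as context grows.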