Four Tricks That Make Long Context Inference Actually Work in Production
Most performance talk about large language models still fixates on raw compute, but long-context serving is usually a memory problem first. During decoding, the model must reuse the key-value cache for every token it has already processed, so the cache grows linearly with context length and quickly comes to dominate GPU memory.
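To make that concrete, here is a back-of-the-envelope sketch of the cache footprint. The formula (two tensors, K and V, per layer, each of shape batch × heads × sequence × head dimension) is standard; the specific model shape below is an assumption chosen for illustration, roughly a Llama-2-7B-style configuration.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV-cache size: 2 tensors (K and V) per layer,
    each of shape [batch, num_kv_heads, seq_len, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Assumed shape for illustration: 32 layers, 32 KV heads, head_dim 128,
# one request at a 32k-token context, fp16 (2 bytes per element).
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=32_000, batch_size=1)
print(f"{size / 2**30:.1f} GiB")  # ~15.6 GiB for the cache alone
```

At these (assumed) settings the cache alone approaches 16 GiB per request, which is why memory, not FLOPs, is the first wall long-context serving hits.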