KV Cache and Prompt Caching: How to Leverage Them to Cut Time and Costs
Introduction
A Problem of LLM Inference
In the transformer architecture, the model computes the \(\mathbf{K}\) and \(\mathbf{V}\) matrices using learned weight matrices \(\mathbf{W}\). When an input vector \(\mathbf{x}_0\) enters the model, it is first multiplied by...
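As a minimal sketch of the idea behind a KV cache (using NumPy, with illustrative dimensions and random weights standing in for the learned \(\mathbf{W}\) matrices), each new token's key and value projections are computed once and appended to a cache, so earlier tokens are never re-projected:

```python
import numpy as np

d_model, d_head = 8, 8
rng = np.random.default_rng(0)

# Stand-ins for the learned projection weights W_K and W_V
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))

k_cache, v_cache = [], []

def step(x_t):
    """Project only the newest token and append its K/V rows to the cache.

    Without a cache, every decoding step would recompute K and V for
    the entire sequence; with it, each step does one projection.
    """
    k_cache.append(x_t @ W_K)
    v_cache.append(x_t @ W_V)
    # Attention at this step uses the full cached K and V matrices
    return np.stack(k_cache), np.stack(v_cache)

# Feed three tokens one at a time, as in autoregressive decoding
for _ in range(3):
    K, V = step(rng.standard_normal(d_model))

print(K.shape, V.shape)  # each cache grows by one row per token
```

This is a sketch under stated assumptions, not the attention implementation itself: a real model keeps one such cache per layer and per head, and the cached \(\mathbf{K}\), \(\mathbf{V}\) feed directly into the attention score computation.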