Apr 14 · 6 min read · Google's TurboQuant: What It Is and Why It Actually Matters The numbers are absurd. For one user running a single Llama-3.1-8B model at 128,000 tokens of context, the KV cache alone chews up 16 gigabytes of VRAM. On a GPU that might have 24GB total. ...
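As a back-of-the-envelope check on that 16 GB figure, here is a minimal sketch assuming Llama-3.1-8B's published architecture (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and an fp16 cache:

```python
# Rough KV cache size for Llama-3.1-8B at 128K tokens of context.
# Assumed config: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 (2 bytes).
layers, kv_heads, head_dim = 32, 8, 128
context_len, bytes_per_elem = 128_000, 2  # fp16

# Factor of 2 up front: one K tensor and one V tensor per layer.
kv_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem
print(f"{kv_bytes / 1e9:.1f} GB")  # ≈ 16.8 GB
```

That lands right around the article's 16 GB claim, which is why KV-cache quantization schemes like TurboQuant target exactly this tensor.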
Apr 7 · 8 min read · Google published the TurboQuant paper on March 25. It's April 7. There are already five independent implementations, a llama.cpp fork running 104B parameter models on a MacBook, and an active vLLM integration effort. Google hasn't released a single l...