Feed
Pro
Search

Author

Write
Drafts

Bug0 - The AI-native e2e QA regression testing Passmark - The open-source AI framework for regression testing Hackathons Changelog Brand Hashnode gql skill - let your AI agent publish to your Hashnode blog The Foreword by Hashnode - official blog from the Hashnode team @hashnode on X Hashnode on LinkedIn Support - hello+support@hashnode.com Code of Conduct Terms Privacy Sitemap
Sign in

Search Hashnode

Search posts, tags, users, and pages

FeedDiscussion

Jangwook Kim

May 9

Adaptive KV-Cache Quantization: How 'Don't Waste Bits' Cuts On-Device LLM Latency by 17%

Running LLMs on-device means fighting two constraints simultaneously: memory and latency. The KV-cache — the buffer that stores past token representations so the model does not recompute them — is often the bottleneck on both fronts. A paper publishe...

effloow.hashnode.dev7 min read

#kv-cache #llm-inference #on-device-ai #paper-poc #quantization

Responses

No responses yet.