Tag feed

#quantization

42 posts · 0 followers

EECaricato · zerotomodel.hashnode.dev

Turbovec: A Practical Manual for Training-Free Vector Quantization in Rust and Python

16h ago · 13 min read · A hands-on guide for AI engineers who care about embedding compression, retrieval latency, and not having to rebuild their index every time the data shifts. If you build retrieval augmented generatio

Jangwook Kim · effloow.hashnode.dev

Adaptive KV-Cache Quantization: How 'Don't Waste Bits' Cuts On-Device LLM Latency by 17%

May 9 · 7 min read · Running LLMs on-device means fighting two constraints simultaneously: memory and latency. The KV-cache — the buffer that stores past token representations so the model does not recompute them — is often the bottleneck on both fronts. A paper publishe...

Vlad Butacu · omniforge.online

Green AI at the Edge: When Local LLMs Save Electricity and Water

May 4 · 9 min read · AI sustainability keeps getting framed around the wrong question: does a local LLM use less electricity and water than ChatGPT or Claude? The sharper question is: how many frontier-model calls are we

Oliver Weng · edge-insights.hashnode.dev

🚀 Edge AI & Efficient Computing | Bi-Weekly News: # 3

Apr 27 · 7 min read · 🎉Welcome to the 3rd post of Edge AI & Efficient Computing! 👋 I’m Yuqin (Oliver) Weng, and I’m incredibly passionate about pushing the boundaries of AI on constrained devices. 🧠I created this space

PROTIK MONDAL · nila.mndl.eu.org

The April 2026 Private AI Revolution: Running DeepSeek-R2-7B at 60 tokens/sec on Your Gaming Rig

Apr 26 · 5 min read · Real numbers: I'm getting 61.4 tokens per second on a consumer RTX 4080 with DeepSeek-R2-7B. Cost? $0.00/month vs $240/month cloud inference. Zero depend...

O Meling · gothartech.hashnode.dev

Edge AI Inference: Running Models at the CDN Layer

Apr 26 · 18 min read · Originally published at Gothar Tech. Part of our 2025 software architecture series. The fastest inference call is the one that never crosses an ocean. For two decades, CDNs existed to cache bytes: i...

Alan West · alan-west.hashnode.dev

Traditional Quantization vs 1.58-Bit Ternary Models: A Practical Comparison

Apr 18 · 6 min read · If you've been running local LLMs, you already know the drill: download a 70B model, quantize it to 4-bit with GPTQ or GGUF, cross your fingers, and hope your GPU doesn't catch fire. It works. It's practical. But there's a fundamentally different app...

Kartik Thakore · blog.hammer.ai

Implementing TurboQuant in llama.cpp: CUDA Scars and What Actually Ships

Apr 6 · 11 min read · Part 1 of 2. Why We Did This: Hammer.ai runs an industrial research lab hyper-focused on regulated-domain document understanding at extremely efficient margins. Private equity self funded companies like f

Ayush Pandey · aryannovice.hashnode.dev

1-Bit Evolution: Building a Decentralized Memory Layer from the Metal Up

Mar 25 · 5 min read · One weekend in my hostel room, I found myself spiraling into a question that felt both obvious and oddly underexplored: if data is everything, and memory is the channel through which intelligence flow

Sharvari Raut · qubridai.hashnode.dev

GLM-4.7-FP8: Architecture, Benchmarks, Capabilities, and Real-World Applications

Mar 19 · 7 min read · GLM-4.7-FP8 is one of the latest models focused on this new generation of developer-centric AI. Developed by Z.ai, GLM-4.7 introduces improvements in agentic coding, reasoning, and tool usage, while t
