Tag feed

#speculative-decoding

9 posts · 0 followers

PARSE: Faster LLM Inference via Parallel Prefix Speculative Decoding

6d ago · 5 min read · Speculative decoding became the standard inference speedup technique through 2024 and 2025. The idea: a small draft model generates a sequence of candidate tokens, and a larger target model verifies them in parallel — accepting the longest valid pref...
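For readers new to the idea in the excerpt above, the draft-and-verify step can be sketched in a few lines. This is a minimal illustration of the common greedy-matching variant, not PARSE's method (the excerpt is truncated before it describes that). The function name `verify_draft`, and the convention that the target supplies one more token than the draft, are assumptions made for the example.

```python
from typing import List, Sequence


def verify_draft(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Greedy verification of one speculative step (illustrative sketch).

    draft:  tokens proposed by the small draft model.
    target: the large model's own greedy choice at each drafted position,
            plus one extra token for the position after the last draft
            token (all obtained from a single parallel forward pass).
    Returns the tokens actually emitted for this step.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must contain exactly one more token than draft")

    emitted: List[int] = []
    for drafted, wanted in zip(draft, target):
        if drafted != wanted:
            # First mismatch: keep the accepted prefix and emit the
            # target's token in place of the rejected draft token.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)

    # Every draft token matched, so the extra target token comes for free.
    emitted.append(target[-1])
    return emitted
```

For example, `verify_draft([5, 7, 9, 2], [5, 7, 4, 8, 1])` returns `[5, 7, 4]`: two draft tokens are accepted and the target's token replaces the first mismatch.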

Jangwook Kim · effloow.hashnode.dev

Gemma 4 MTP Drafters: How Multi-Token Prediction Delivers 2x+ Faster Local Inference

May 9 · 6 min read · On May 5, 2026, Google released Multi-Token Prediction (MTP) drafters for the Gemma 4 family. The headline claim — up to 3x inference speedup — is technically accurate on specific hardware. The more realistic number for most developer setups is 1.7x ...

Jangwook Kim · effloow.hashnode.dev

SpecKV: Adaptive Speculative Decoding with Dynamic Gamma

May 9 · 10 min read · Every production LLM deployment using speculative decoding is likely running a fixed speculation length of γ=4. That number comes from early benchmarks, it has been copy-pasted across blog posts and framework defaults, and almost nobody questions it....
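A back-of-the-envelope model shows why a single fixed γ is unlikely to suit every workload. Under the usual simplifying assumption that each draft token is accepted independently with probability α, one verification pass yields (1 − α^(γ+1)) / (1 − α) tokens on average. The sketch below evaluates that formula; it is not SpecKV's algorithm, and `expected_tokens_per_step`, `alpha`, and `gamma` are names chosen for the example.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes (for illustration only) that each of the `gamma` draft tokens
    is accepted independently with probability `alpha`. The count includes
    the one token the target model always contributes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Limit of the geometric sum: all gamma drafts plus the bonus token.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Compare a few speculation lengths at two acceptance rates.
    for alpha in (0.5, 0.8):
        for gamma in (2, 4, 8):
            tokens = expected_tokens_per_step(alpha, gamma)
            print(f"alpha={alpha} gamma={gamma} -> {tokens:.2f} tokens/pass")
```

With α = 0.8 and γ = 4 this gives about 3.36 tokens per pass; at α = 0.5 the same γ gives about 1.94, so longer drafts pay off only when acceptance is high.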

강문규 · devsnack.hashnode.dev

Gemma 4 MTP Drafter on DGX Spark: 2.89x Speedup for Dense 31B — No Quality Loss

May 6 · 8 min read · An 870 MB drafter model turned Dense 31B from 6.5 → 18.8 tok/s. No model swap, no training, no quality degradation. If you have a DGX Spark, there's no reason not to use this. Key Results Model Fra

강문규 · devsnack.hashnode.dev

Qwen3.6 on DGX Spark: vLLM + NVFP4 + DFlash vs llama.cpp — 2x Faster at 88–104 tok/s

May 6 · 10 min read · TL;DR — I was happily running Qwen3.6 on llama.cpp. Then I saw claims of 2× speed with vLLM + NVFP4 + DFlash. So I installed it, fought through crashes, and measured it myself. Verdict: it's real. 88–

Adam Yang · gradient.network

Turning Latency into Throughput: Speculative Decoding for the Decentralized Inference

Nov 24, 2025 · 6 min read · https://arxiv.org/abs/2511.11733 · The Latency Wall: In centralized inference, speed is mostly a function of compute. You optimize by saturating HBM bandwidth, fusing kernels, and keeping GPUs close to their roofline. In decentralized inference, where...

Gasym A. Valiyev · ai-engineering-study.hashnode.dev

The AI Engineer's Guide to Inference Optimization: Making Models Faster & Cheaper

Aug 1, 2025 · 47 min read · Welcome to a deep dive into one of the most critical and fascinating areas of AI Engineering: Inference Optimization. While building powerful models is one part of the equation, making them run efficiently—faster, cheaper, and at scale—is what makes ...

NovitaAI · novita.hashnode.dev

Revolutionizing Large Language Model Inference: Speculative Decoding and Low-Precision Quantization

Dec 20, 2024 · 8 min read · With the rapid advancement of artificial intelligence (AI), large language models (LLMs) have emerged as a cornerstone of natural language processing (NLP). These models demonstrate remarkable capabilities in language generation and understanding, mak...

Gabi Dobocan · blog.telepat.io

Continuous Speculative Decoding for Efficient Image Generation

Nov 24, 2024 · 4 min read · arXiv: https://arxiv.org/abs/2411.11925v1 · PDF: https://arxiv.org/pdf/2411.11925v1.pdf · Authors: Shiming Xiang, Fei Li, Qi Yang, Kun Ding, Robert Zhang, Zili Wang · Published: 2024-11-18 · Introduction: Enhancing Autoregressive Image Generation. Autoregres...
