5d ago · 10 min read · Every production LLM deployment using speculative decoding is likely running a fixed speculation length of γ=4. That number comes from early benchmarks; it has been copy-pasted across blog posts and framework defaults, and almost nobody questions it...
May 6 · 10 min read · TL;DR — I was happily running Qwen3.6 on llama.cpp. Then I saw claims of 2× speed with vLLM + NVFP4 + DFlash. So I installed it, fought through crashes, and measured it myself. Verdict: it's real. 88–
May 3 · 5 min read · The Windows local LLM story just got interesting. Someone recently demonstrated Qwen3's 27B model running at 72 tokens per second on an RTX 3090 — natively on Windows. No WSL. No Docker. Just a portable vLLM launcher. If you've been running local mod...
Apr 28 · 11 min read · Mixture-of-Experts models have dominated the open-weight frontier in 2026. Llama 4 Scout (17B-16E), Llama 4 Maverick (17B-128E), DeepSeek V4-Pro (1.6T-49B active), and Qwen3.6-Plus all use sparse expert routing to scale parameters without proportiona...
Apr 22 · 11 min read · Serving a large language model in production is a solved problem — until your traffic doubles, your structured output pipeline slows to a crawl, or your cloud bill arrives. The choice of inference engine determines how many GPUs you actually need, ho...
Apr 17 · 16 min read · TLDR: Most teams should start with managed LLM APIs because they buy speed, reliability, model quality, and low operational burden. Move to self-hosted or open-weight models only when you have stable workloads, hard privacy or compliance constraints,...