Tag feed

#inference

107 posts · 0 followers

Manu Shukla · ecorpit.hashnode.dev · 3d ago · 15 min read

Diffusion LLMs in 2026: when NVIDIA Nemotron tri-mode serving beats autoregressive

Summary. NVIDIA's Nemotron-Labs Diffusion, released in May 2026, is an open-weight language model family at 3B, 8B, a

0

Sakshi Tyagi · sakshityagi.hashnode.dev · 4d ago · 6 min read

The Inference Wall: Where LLM Latency and Cost Go

Part 1 of 4: Serving LLMs in Production. Here's a thing that surprises almost every team the first time they ship an LLM: the model that felt fast in the demo gets expensive and sluggish in productio

1

Sakshi Tyagi · sakshityagi.hashnode.dev · 4d ago · 5 min read

Self-Host vs. API for LLMs, Actually Costed

Part 4 of 4: Serving LLMs in Production. The first three parts were about making inference fast: where the cost actually hides, batching to fill the GPU, and hauling fewer bytes per word. All of it fee

0

Sakshi Tyagi · sakshityagi.hashnode.dev · 4d ago · 6 min read

Making LLM Decode Cheaper: Quantization and More

Part 3 of 4: Serving LLMs in Production. Part 2 filled the idle GPU by serving many users at once. But we still haven't touched the core problem from Part 1: every single word the model writes drags th

0

Sakshi Tyagi · sakshityagi.hashnode.dev · 4d ago · 6 min read

Batching and the Idle GPU: Serving LLMs Faster

Part 2 of 4: Serving LLMs in Production. Part 1 left us with an uncomfortable fact: when the model writes its answer (the slow, one-word-at-a-time decode phase), the GPU's powerful math units sit mostly i

0

Sanjana · under-the-hood-ai.hashnode.dev · 6d ago · 19 min read

From Prompt to Token: Inside Modern LLM Inference

Why This Matters: Modern LLMs can remember thousands of previous tokens while continuing a conversation. But if every new response needs to consider everything we have said so far, why doesn't the mo

0

Ivan Porta · todea.hashnode.dev · Jul 24 · 20 min read

LLM Inference Explained: What Actually Happens When You Serve a Model

As spending on frontier AI services like ChatGPT, Claude, and Gemini climbs, usage caps hit developers, and open-source models gain popularity, most platform teams eventually need to serve an LLM on t

0

Ivan Porta · todea.hashnode.dev · Jul 14 · 12 min read

HAMi Explained: Why Your GPUs Sit Idle While Your Bill Doesn't

Before the release of Dynamic Resource Allocation and its alpha partitionable-device and consumable-capacity extensions, Kubernetes scheduled GPUs only as whole cards, with no native option to allocat

0

wesley scholl · konjo.hashnode.dev · Jul 13 · 21 min read

I Couldn't Find a Local LLM Tool Fast Enough, So I Built My Own Called Squish

This requires Apple Silicon (M-series) and macOS 13 (Ventura) or later. If you're on Intel, Linux, or Windows, the numbers in this article will not apply to your hardware. TL;DR: Squish is a local LLM

0

Yash Karecha · yash-karecha.hashnode.dev · Jul 5 · 21 min read

I Built My Own ChatGPT Backend to Understand How AI Streaming Actually Works

Every time I use Claude or ChatGPT I get stuck on the same question: how does the text just show up, word by word, like the model is thinking out loud in real time? Why does it feel instant when it's

0
