May 6 · 10 min read · TL;DR — I was happily running Qwen3.6 on llama.cpp. Then I saw claims of 2× speed with vLLM + NVFP4 + DFlash. So I installed it, fought through crashes, and measured it myself. Verdict: it's real. 88–…
Apr 29 · 26 min read · I have an RTX 3090 sitting in a Xeon Silver 4314 box at home. I wanted to: Stand up a local inference stack (vLLM nightly with all the bells and whistles: speculative decoding, FlashInfer, prefix caching). Use t…
Apr 13 · 11 min read · Welcome to Module B6 — The Next Layer. Four posts that sit just past the edge of the mainstream stack. Local models, fine-tuning honestly, multimodal in practice, and the frontier worth following. The module is less tactical than the others — fewer "...