Why local LLM inference stalls on Apple Silicon (and how to fix it)
I spent a chunk of last month trying to run a 30B-class model locally on my M2 Max. 64GB of unified memory, a stack of GPU cores, no other apps running. Should be smooth. Instead I got around 3 tokens per second, a fan that sounded like a leaf blower...