Same model, same GPU, 4× the context: a weekend of inference-stack dogfooding
I have an RTX 3090 sitting in a Xeon Silver 4314 box at home. I wanted to:
Stand up a local inference stack (vLLM nightly with all the bells and whistles: speculative decoding, FlashInfer, prefix caching).
Use t
fulatoro.hashnode.dev · 26 min read