I Made a Single CUDA Kernel Speak: Streaming Qwen3-TTS at 50ms Latency on an RTX 5090
My first measurement said 35,932 milliseconds. The target was 90.
That's not a typo. Thirty-five seconds to produce the first chunk of audio from a text-to-speech system that was supposed to feel like
jayanthkumar777.hashnode.dev · 14 min read