I Made a Single CUDA Kernel Speak: Streaming Qwen3-TTS at 50ms Latency on an RTX 5090
Feb 21 · 14 min read

My first measurement said 35,932 milliseconds. The target was 90. That's not a typo. Thirty-five seconds to produce the first chunk of audio from a text-to-speech system that was supposed to feel like

