Yeah, the naive vector search → LLM pipeline breaks immediately in production. You're learning what everyone learns the hard way. Two-stage retrieval is solid; most teams land on something like that after the first rewrite.

A few things that helped us: BM25 as a first stage actually outperforms pure embeddings on factual queries. It's also worth measuring retrieval quality separately from end-to-end metrics, otherwise you can't tell whether the LLM is salvaging bad context or your retriever actually works.

Token budget blowup is real. We capped context using strict relevance thresholds rather than just taking top-k. That saves money and forces better retrieval.
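To make the two-stage idea concrete, here's a minimal pure-Python sketch of a BM25 first stage, not anyone's production code: `bm25_scores` and `first_stage` are hypothetical helpers, tokenization is naive whitespace splitting, and the shortlist it returns is what you'd hand to an embedding reranker in the second stage.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Classic BM25 scoring over whitespace-tokenized docs (first-stage retrieval)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter()  # document frequency per term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            dl_norm = k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / (tf[term] + dl_norm)
        scores.append(s)
    return scores

def first_stage(query, docs, top_n=20):
    """Return top-n candidate indices by BM25; these go to the embedding reranker."""
    scores = bm25_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: -scores[i])[:top_n]
```

In practice you'd swap the hand-rolled scorer for a real index (Elasticsearch, `rank_bm25`, etc.), but the shape is the same: cheap lexical shortlist first, expensive semantic rerank second.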
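On measuring retrieval separately: a simple recall@k over a labeled query set is usually enough to catch a broken retriever before the LLM papers over it. A sketch, with `recall_at_k` as a made-up helper name and IDs standing in for whatever you key documents by:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)
```

Track this per query type (factual vs. open-ended) and you'll see exactly where BM25 helps, independent of what the generator does downstream.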
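The threshold-based capping can be sketched like this, assuming passages arrive already scored by the retriever; `select_context` and the naive whitespace token counter are illustrative, not a real tokenizer:

```python
def select_context(candidates, threshold=0.3, max_tokens=2000,
                   count_tokens=lambda s: len(s.split())):
    """Keep passages above a relevance threshold, best-first, until the token
    budget is hit. Unlike plain top-k, low-scoring passages never get in just
    because there was room for them."""
    kept, used = [], 0
    for text, score in sorted(candidates, key=lambda c: -c[1]):
        if score < threshold:
            break  # everything after this scores lower; stop
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break
        kept.append(text)
        used += cost
    return kept
```

The threshold does double duty: it caps spend, and it surfaces retriever weakness early, because queries that return nothing above the bar are exactly the ones where top-k would have stuffed junk into the prompt.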