The retrieval quality → hallucination → latency tradeoff triangle is real — pick any two. One thing I'd add from running RAG for a client's support ops: invest in query rewriting before you invest in a better embedding model. We saw a bigger MRR lift from a small LLM-based query expansion pass than from upgrading to a larger embedding model.

Also, reranking the top-50 with a cross-encoder is almost always worth the ~100ms — especially if you cache aggressively on repeated queries.

What's your chunking strategy? We've found semantic chunking beats fixed-size by a wide margin on technical docs.
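For anyone curious what "rerank top-50 + cache" looks like, here's a minimal sketch. The `score` function is a hypothetical stand-in — in practice you'd use an actual cross-encoder (e.g. `sentence_transformers.CrossEncoder.predict` on `(query, doc)` pairs); the cache-keying idea is the part that matters:

```python
from functools import lru_cache

def score(query: str, doc: str) -> float:
    # Placeholder scorer (token overlap) for illustration only.
    # Swap in a real cross-encoder's relevance score in production.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

@lru_cache(maxsize=10_000)
def rerank(query: str, docs: tuple[str, ...], k: int = 5) -> tuple[str, ...]:
    # Score only the top-50 retrieval candidates, keep the best k.
    # docs is a tuple (hashable) so repeated queries hit the LRU cache
    # and skip the ~100ms cross-encoder pass entirely.
    candidates = docs[:50]
    ranked = sorted(candidates, key=lambda d: score(query, d), reverse=True)
    return tuple(ranked[:k])
```

One caveat: if your index updates frequently, key the cache on the retrieved candidate set too (as above) or add a TTL, otherwise you'll serve stale rankings after a reindex.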