Great point, Archit. That tradeoff triangle is very real, and I completely agree on query rewriting: improving the query often delivers more impact than upgrading the embedding model, especially when it comes to capturing intent correctly. Also +1 on cross-encoder reranking with caching in place; the extra latency is usually a fair trade for the improvement in precision.
For chunking I’m using a hybrid approach: fixed-size 512-token chunks with 10% overlap, plus a semantic splitter. Pure fixed-size chunks tend to break context, especially in technical documentation. Curious how you’re handling caching for reranking: are you caching at the query level or closer to the embeddings?
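Roughly, the hybrid strategy looks like this (a hypothetical sketch, not my actual pipeline code): split on paragraph boundaries first as a crude stand-in for the semantic splitter, then pack each piece into ~512-token windows with 10% overlap. Tokens are approximated here by whitespace words; a real pipeline would count with the embedding model's tokenizer.

```python
# Hybrid chunking sketch: semantic boundaries (paragraphs) first,
# then fixed-size 512-token windows with 10% overlap inside each.
CHUNK_SIZE = 512
OVERLAP = int(CHUNK_SIZE * 0.10)  # 10% overlap ~= 51 tokens

def hybrid_chunk(text: str) -> list[str]:
    chunks = []
    for para in text.split("\n\n"):  # paragraph = cheap semantic boundary
        tokens = para.split()        # whitespace "tokens" for illustration
        if not tokens:
            continue
        start = 0
        while start < len(tokens):
            window = tokens[start:start + CHUNK_SIZE]
            chunks.append(" ".join(window))
            if start + CHUNK_SIZE >= len(tokens):
                break
            start += CHUNK_SIZE - OVERLAP  # slide window, keep overlap
    return chunks
```

The point of the overlap is that a sentence cut at a window edge still appears whole in the neighboring chunk, which is exactly the context-breaking problem pure fixed-size chunking has on technical docs.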
Archit Mittal
I Automate Chaos — AI workflows, n8n, Claude, and open-source automation for businesses. Turning repetitive work into one-click systems.
The retrieval quality → hallucination → latency tradeoff triangle is real — pick any two. One thing I'd add from running RAG for client support ops: invest in query rewriting before you invest in a better embedding model. We saw a bigger MRR lift from a small LLM-based query expansion pass than from upgrading to a larger embedding model. Also, reranking with a cross-encoder on the top-50 is almost always worth the ~100ms — especially if you cache aggressively on repeated queries. What's your chunking strategy? We've found semantic chunking beats fixed-size by a wide margin on technical docs.
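The cached top-50 rerank pass can be sketched like this (an illustrative sketch, not Archit's actual setup): `score` stands in for any cross-encoder scorer, and results are memoized per (query, doc) pair so repeated queries skip the expensive forward pass entirely.

```python
# Cross-encoder reranking with a per-(query, doc) score cache.
from typing import Callable

def make_cached_reranker(score: Callable[[str, str], float]):
    cache: dict[tuple[str, str], float] = {}

    def rerank(query: str, docs: list[str], top_k: int = 50) -> list[str]:
        candidates = docs[:top_k]  # only rerank the top-k retrieved docs

        def cached_score(doc: str) -> float:
            key = (query, doc)
            if key not in cache:           # miss: pay the ~100ms model cost
                cache[key] = score(query, doc)
            return cache[key]              # hit: free on repeated queries

        return sorted(candidates, key=cached_score, reverse=True)

    return rerank
```

Keying the cache on the (query, doc) pair is what makes this a query-level cache: an identical repeated query costs zero model calls, while a novel query still reuses nothing, which is why aggressive caching only pays off on workloads with repeated queries.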