The retrieval quality → hallucination → latency tradeoff triangle is real — pick any two. One thing I'd add from running RAG for a client's support ops: invest in query rewriting before you invest in a better embedding model. We saw a bigger MRR lift from a small LLM-based query expansion pass than from upgrading to a larger embedding model.

Also, reranking the top-50 with a cross-encoder is almost always worth the ~100ms — especially if you cache aggressively on repeated queries.

What's your chunking strategy? We've found semantic chunking beats fixed-size by a wide margin on technical docs.
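For anyone curious what "rerank top-50 + cache" looks like, here's a minimal sketch. The `score` function is a hypothetical stand-in — in practice you'd use an actual cross-encoder (e.g. `sentence_transformers.CrossEncoder.predict` on `(query, doc)` pairs); the cache-keying idea is the part that matters:

```python
from functools import lru_cache

def score(query: str, doc: str) -> float:
    # Placeholder scorer (token overlap) for illustration only.
    # Swap in a real cross-encoder's relevance score in production.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

@lru_cache(maxsize=10_000)
def rerank(query: str, docs: tuple[str, ...], k: int = 5) -> tuple[str, ...]:
    # Score only the top-50 retrieval candidates, keep the best k.
    # docs is a tuple (hashable) so repeated queries hit the LRU cache
    # and skip the ~100ms cross-encoder pass entirely.
    candidates = docs[:50]
    ranked = sorted(candidates, key=lambda d: score(query, d), reverse=True)
    return tuple(ranked[:k])
```

One caveat: if your index updates frequently, key the cache on the retrieved candidate set too (as above) or add a TTL, otherwise you'll serve stale rankings after a reindex.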