Steriani Karamanlis
Co-founder & CMO at ATOM. Building the global price benchmark for AI inference.
The chunking strategy point hits hard. I've built RAG pipelines for client document processing, and the number one issue is always naive chunking destroying context across sections. Fixed-size chunks with overlap sound fine in theory but fall apart with tables, multi-part instructions, or documents where context carries across pages. The biggest wins I've seen come from semantic chunking based on document structure, plus a re-ranking step before the LLM sees the results. Also, evaluation is massively underinvested in: most teams have no idea how their retrieval quality changes as the corpus grows.
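A minimal sketch of that pattern, with everything here hypothetical: splitting on headings stands in for real structure-aware chunking (which would also handle tables and page-spanning sections), and a toy lexical-overlap scorer stands in for a proper re-ranker (a cross-encoder or embedding similarity in practice).

```python
import re

def semantic_chunks(doc: str) -> list[str]:
    """Split a markdown-ish document on headings so each chunk keeps
    a whole section together. Stand-in for structure-aware chunking."""
    parts = re.split(r"(?m)^(?=#{1,6} )", doc)
    return [p.strip() for p in parts if p.strip()]

def rerank(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Toy re-ranker: score chunks by token overlap with the query.
    Production systems would use a cross-encoder or embeddings."""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: -len(q & set(c.lower().split())))
    return scored[:top_k]

doc = """# Setup
Install the CLI and authenticate.

# Billing
Invoices are issued monthly. Refunds take 5 business days.

# Limits
Rate limits apply per API key."""

chunks = semantic_chunks(doc)          # three section-level chunks
top = rerank("how long do refunds take", chunks)
```

The point of the section-level split is that "Refunds take 5 business days" never gets separated from its "Billing" heading, which is exactly the context a fixed-size window with overlap tends to sever.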
The cost dimension of RAG failures in production is underappreciated, and we see it reflected in the pricing data. Most teams prototype RAG with a retrieval step that pulls generous context windows to make sure nothing gets missed. Then they hit production and realize that output tokens currently cost roughly 4x input tokens on average across the market. The "retrieve more to be safe" instinct that worked in testing becomes an expensive habit at scale. The systems that survive production are usually the ones that got ruthless about retrieval precision early, not because of latency but because of the inference bill. We track inference costs across 50+ vendors weekly at a7om.com if the numbers are useful.
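A back-of-the-envelope version of that bill. The prices here are purely illustrative (output priced at 4x input, per the ratio above; real numbers vary by vendor and model), and the chunk counts are a made-up comparison of "retrieve everything" versus a precise top-5 after re-ranking:

```python
def call_cost(input_tokens: int, output_tokens: int,
              price_in: float, price_out: float) -> float:
    """Cost of one inference call in dollars, given per-million-token prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Illustrative prices per million tokens, output at 4x input.
PRICE_IN, PRICE_OUT = 1.00, 4.00

# "Retrieve more to be safe": 20 chunks of 500 tokens vs a re-ranked
# top-5, plus a 200-token prompt; same 300-token answer either way.
generous = call_cost(20 * 500 + 200, 300, PRICE_IN, PRICE_OUT)
precise  = call_cost(5 * 500 + 200, 300, PRICE_IN, PRICE_OUT)

savings_per_million_calls = (generous - precise) * 1_000_000
```

Under these toy numbers the difference is $0.0075 per call, which is invisible in testing and $7,500 per million calls in production. That gap is why retrieval precision shows up on the invoice before it shows up in latency dashboards.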