Started building a straightforward RAG setup for customer support queries. Figured we'd do: embed query, vector search, feed top results to LLM, done. Shipped v1 in two weeks.
Ran into immediate issues. Context relevance was trash because vector similarity doesn't actually correlate with answer quality. We'd get documents semantically related but completely unhelpful. Also blew through token budgets fast.
Ended up rebuilding with a two-stage retrieval. First stage pulls like 20 candidates from vectors, second stage uses a small model (claude-3.5-haiku) to filter and rank for actual relevance to the query intent. Cost went up marginally but accuracy jumped from 62% to 84% on our test set.
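Roughly, the second stage can be sketched like this. The `judge` callable stands in for the haiku call (pass in whatever scores query/doc relevance); all names here are illustrative, not our actual code:

```python
from typing import Callable

def two_stage_retrieve(query: str,
                       candidates: list[str],
                       judge: Callable[[str, str], float],
                       top_n: int = 5,
                       threshold: float = 0.6) -> list[str]:
    """Stage two: score each vector-search candidate for relevance to the
    query intent, drop anything below the threshold, keep the best top_n."""
    scored = [(judge(query, doc), doc) for doc in candidates]
    kept = [(s, d) for s, d in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in kept[:top_n]]
```

The key property: stage one only needs decent recall over ~20 candidates, so the judge can afford to be slow and picky.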
Also learned to chunk differently. Fixed-size chunks were stupid. Switched to semantic chunking based on sentence boundaries and topic shifts. Makes a real difference when your documents have mixed structure.
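A minimal sketch of the chunking idea, assuming a pluggable similarity function (a crude word-overlap score here as a stand-in for embedding similarity; real topic-shift detection would use embeddings):

```python
import re
from typing import Callable

def jaccard(a: str, b: str) -> float:
    """Crude lexical stand-in for an embedding similarity score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunks(text: str,
                    similarity: Callable[[str, str], float] = jaccard,
                    shift_threshold: float = 0.15,
                    max_sentences: int = 8) -> list[str]:
    """Split on sentence boundaries; start a new chunk when the next
    sentence looks like a topic shift (low similarity to the chunk so far)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sent in sentences:
        if current and (similarity(" ".join(current), sent) < shift_threshold
                        or len(current) >= max_sentences):
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```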
```python
# what actually worked
retriever = HybridRetriever(
    vector_store=pinecone_index,
    ranking_model=small_llm,
    chunk_size="semantic",
    reranking_threshold=0.6,
)
```
The lesson: vector search is a retrieval tool, not a filtering tool. Treat it as the first pass and validate relevance separately. Would've saved us a week if we'd done that from the start.
Yeah, the naive vector search → LLM pipeline breaks immediately in production. You're learning what everyone learns the hard way.
Two-stage retrieval is solid. Most teams land on something like that after the first rewrite. A few things that helped us: BM25 as first stage actually outperforms pure embeddings on factual queries. Also worth measuring retrieval quality separately from end-to-end metrics, otherwise you won't know if the LLM is salvaging bad context or if your retriever actually works.
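Concretely, "measure retrieval separately" just means computing something like recall@k against labeled relevant docs before the LLM ever sees the context (sketch; names are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the labeled-relevant doc IDs that appear in the top k
    retrieved results -- a retriever-only metric, independent of the LLM."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)
```

If recall@20 is high but end-to-end accuracy is low, your reranker or prompt is the problem, not the retriever.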
Token budget blowup is real. We capped context to strict relevance thresholds rather than just taking top-k. Saves money and forces better retrieval.
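Something like this, as a sketch (the word-count token estimate is a placeholder; swap in a real tokenizer):

```python
def select_context(scored_chunks: list[tuple[float, str]],
                   min_score: float = 0.6,
                   token_budget: int = 2000) -> list[str]:
    """Keep only chunks above a relevance threshold, best first,
    and stop once the token budget is spent -- instead of a fixed top-k."""
    selected, used = [], 0
    for score, chunk in sorted(scored_chunks, reverse=True):
        if score < min_score:
            break
        cost = len(chunk.split())  # rough token estimate; use a real tokenizer
        if used + cost > token_budget:
            break
        selected.append(chunk)
        used += cost
    return selected
```

On thin queries this returns one or two chunks instead of padding to k, which is where the savings come from.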
Yeah, the naive vector-search-straight-to-LLM approach breaks down fast in production. Semantic similarity and actual usefulness are different problems.
Two-stage retrieval makes sense. What I've found works better: treat the first stage as recall (broad, cheap) and second stage as precision. BM25 + vector hybrid search in stage one catches stuff pure vectors miss. Then your reranker gets meaningful signal.
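For merging the BM25 and vector lists in stage one, reciprocal rank fusion is the usual trick, since the two scoring scales aren't comparable (sketch):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge BM25 and vector rankings by rank
    position alone, avoiding incompatible score scales. k=60 is the
    conventional damping constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A doc that's mid-ranked in both lists beats one that's top-ranked in only one, which is usually what you want for recall.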
Token budget was always going to bite you though. Most gains come from aggressive result filtering and truncation, not better retrieval. How'd you handle context length limits?
Nina Okafor
ML engineer working on LLMs and RAG pipelines
Yeah, this tracks with what we've seen. Vector-only retrieval sounds clean but it's basically filtering by vibes, not by whether the doc actually answers the question.
Two-stage is solid. We added a cheap BM25 pre-filter before vector search and cut irrelevant results by like 40%. Also started ranking retrieved chunks by answer-ability (fine-tuned a small reranker on our support logs) rather than just cosine score.
The token budget thing gets you every time. We were feeding 8-10 chunks at 500 tokens each. Switched to query expansion plus selective context injection and halved the cost while improving answer quality.
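The shape of that pipeline, sketched with injectable stubs (both `expand` and `retrieve` are placeholders for whatever expansion model and retriever you run; names are illustrative):

```python
from typing import Callable

def expand_and_retrieve(query: str,
                        expand: Callable[[str], list[str]],
                        retrieve: Callable[[str], list[tuple[float, str]]],
                        min_score: float = 0.7) -> list[str]:
    """Retrieve for the original query plus each expansion, dedupe by
    keeping each chunk's best score, and inject only chunks that clear
    the relevance bar -- selective injection rather than fixed top-k."""
    seen: dict[str, float] = {}
    for q in [query, *expand(query)]:
        for score, chunk in retrieve(q):
            seen[chunk] = max(score, seen.get(chunk, 0.0))
    keep = [(s, c) for c, s in seen.items() if s >= min_score]
    keep.sort(reverse=True)
    return [c for _, c in keep]
```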
What reranker approach did you land on? Cross-encoder, LLM-as-judge, or something else?