Yeah, the naive vector-search-straight-to-LLM approach breaks down fast in production. Semantic similarity and actual usefulness are different problems.
Two-stage retrieval makes sense. What I've found works better: treat the first stage as recall (broad, cheap) and second stage as precision. BM25 + vector hybrid search in stage one catches stuff pure vectors miss. Then your reranker gets meaningful signal.
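For merging the BM25 and vector rankings in stage one, reciprocal rank fusion is a common choice since it needs no score normalization. Rough sketch (names and the toy doc IDs are mine, not from your setup; k=60 is the conventional constant):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs into one ranking.

    rankings: list of lists, each ordered best-first.
    Each doc scores 1/(k + rank) per list it appears in; docs that
    rank well in both lexical and vector results float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7", "d2"]    # lexical ranking
vector_hits = ["d1", "d5", "d3", "d9"]  # embedding ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
# d1 and d3 appear in both lists, so they lead the fused ranking
```

Feed the top of the fused list to the reranker — that's where it gets signal from both retrievers instead of just one.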
The token budget was always going to bite you, though. In my experience, most of the gains come from aggressive result filtering and truncation, not better retrieval. How did you handle context-length limits?
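For what it's worth, the filtering-plus-truncation step can be as simple as greedy packing of the reranked results into a fixed budget. Sketch under my own assumptions (the whitespace token counter is a stand-in — swap in your real tokenizer):

```python
def pack_results(results, budget, count_tokens=lambda s: len(s.split())):
    """Greedily pack reranked chunks into a token budget.

    results: text chunks ordered best-first (post-rerank).
    budget: max tokens of retrieved context to keep.
    Oversized chunks are skipped rather than truncated, so smaller
    lower-ranked chunks can still fill the remaining budget.
    """
    packed, used = [], 0
    for chunk in results:
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue
        packed.append(chunk)
        used += cost
    return packed

chunks = ["a b c", "d e f g h", "i j"]  # best-first after reranking
kept = pack_results(chunks, budget=6)   # keeps 1st and 3rd, skips the 5-token chunk
```

Hard-truncating the last chunk mid-sentence is the other option, but skipping tends to hurt answer quality less since partial chunks confuse the generator.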