Yeah, this tracks with what we've seen. Vector-only retrieval sounds clean but it's basically filtering by vibes, not by whether the doc actually answers the question.
Two-stage is solid. We added a cheap BM25 pre-filter before vector search and cut irrelevant results by roughly 40%. We also started ranking retrieved chunks by answerability (fine-tuned a small reranker on our support logs) rather than by raw cosine similarity.
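Roughly what our two-stage setup looks like, as a toy sketch: the BM25 scorer is a from-scratch textbook implementation, and `embed`/`cosine` are bag-of-words stand-ins for a real embedding model, so treat the names and numbers as illustrative only.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Classic BM25 over whitespace-tokenized docs (toy tokenizer)."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / N
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

def embed(text):
    """Stand-in embedding: bag-of-words counts. Swap in a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(c * b[t] for t, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def two_stage_retrieve(query, docs, prefilter_k=10, final_k=3):
    """Stage 1: cheap BM25 pre-filter. Stage 2: vector similarity on survivors."""
    lexical = bm25_scores(query, docs)
    survivors = sorted(range(len(docs)), key=lambda i: lexical[i], reverse=True)[:prefilter_k]
    qv = embed(query)
    reranked = sorted(survivors, key=lambda i: cosine(qv, embed(docs[i])), reverse=True)
    return [docs[i] for i in reranked[:final_k]]
```

The point of the pre-filter is that docs sharing zero query terms never even reach the (more expensive) vector stage.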
The token budget thing gets you every time. We were feeding 8-10 chunks at 500 tokens each. Switched to query expansion plus selective context injection and halved the cost while improving answer quality.
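Selective injection for us boils down to: expand the query, score chunks, then greedily pack the highest-scoring ones until the budget runs out, instead of always sending a fixed chunk count. A minimal sketch, where the synonym map and the word-count tokenizer are placeholders (you'd use a real tokenizer to count tokens):

```python
def expand_query(query, synonyms):
    """Naive query expansion: append known synonyms for each query term."""
    terms = query.lower().split()
    extra = [s for t in terms for s in synonyms.get(t, [])]
    return " ".join(terms + extra)

def select_context(chunks, scores, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily keep the highest-scoring chunks that still fit the budget,
    rather than stuffing in a fixed 8-10 chunks regardless of cost."""
    chosen, used = [], 0
    for chunk, _score in sorted(zip(chunks, scores), key=lambda p: p[1], reverse=True):
        cost = count_tokens(chunk)
        if used + cost <= budget_tokens:
            chosen.append(chunk)
            used += cost
    return chosen
```

Greedy-by-score isn't optimal packing, but it's dead simple and in practice the top one or two chunks carry most of the answer anyway.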
What reranker approach did you land on? Cross-encoder, LLM-as-judge, or something else?