Everyone's treating RAG like it needs orchestration, vector databases, retrieval scoring, re-ranking. I've watched teams spend three months on a "robust" pipeline that could've been solved in a week with boring stuff.
Here's what actually works: dump your docs in S3, chunk them dumb (fixed size with overlap), embed with the OpenAI API, store vectors in Postgres with pgvector, retrieve top-k, stuff them into the context window, done. No exotic tooling.
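To make "dumb chunking" and the pgvector lookup concrete, here's a minimal sketch. The table name `docs` and the 1536-dim embedding column are assumptions (1536 matches OpenAI's `text-embedding-3-small`), not anything from a real schema:

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Dumb fixed-size chunking with overlap. No sentence splitting,
    no semantic boundaries -- and that's the point."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# pgvector top-k retrieval: <=> is cosine distance.
# Assumes: CREATE TABLE docs (id serial, body text, embedding vector(1536));
TOP_K_SQL = """
SELECT body
FROM docs
ORDER BY embedding <=> %(query_embedding)s
LIMIT %(k)s;
"""
```

That's the whole retrieval layer: one function and one query. An IVFFlat or HNSW index on the embedding column keeps the query fast once the table grows, but you can add that later.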
The problem is people optimize for the wrong thing. They think retrieval quality matters most. It doesn't. Your LLM can handle mediocre retrieval if your prompt is tight. What actually kills RAG is latency and cost, and every millisecond and every API call compounds.
I watched a team ditch their fancy LlamaIndex setup for a Lambda that does embedding -> Postgres query -> Claude in ~800ms. Cost dropped by 80%. Context quality stayed identical because they focused on better prompts instead of tweaking retrieval logic.
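The "better prompts instead of retrieval logic" part is just the stuffing step. A sketch of what that can look like; the prompt wording and the character budget are my assumptions, not that team's actual code:

```python
def build_prompt(question: str, chunks: list[str], budget_chars: int = 12000) -> str:
    """Stuff retrieved chunks into the context until a character budget
    is hit. Chunks arrive best-match-first, so truncation drops the
    weakest matches. (A token budget would be more precise; chars keep
    the sketch dependency-free.)"""
    context, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget_chars:
            break
        context.append(chunk)
        used += len(chunk)
    return (
        "Answer using only the context below. "
        "If the context doesn't contain the answer, say so.\n\n"
        + "\n---\n".join(context)
        + f"\n\nQuestion: {question}"
    )

# The full Lambda flow is then roughly:
#   embedding = openai_client.embeddings.create(...)   # ~1 API call
#   rows = cursor.execute(TOP_K_SQL, ...)              # 1 Postgres query
#   answer = anthropic_client.messages.create(
#       ..., messages=[{"role": "user",
#                       "content": build_prompt(q, rows)}])
```

Two API calls and one SQL query per request. That's where the latency and cost numbers come from: there's simply nothing else in the hot path to pay for.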
Start simple. Add complexity when you have actual evidence it's needed. Most don't.