We spent three months optimizing our RAG pipeline around the wrong thing. We started with a fancy hierarchical chunking setup (recursive splitters, overlap tuning, the whole thing) paired with Postgres + pgvector. Retrieval was slow, and latency got worse as the corpus grew.
Then we switched approaches entirely: dropped the hierarchical stuff, went back to simple fixed-size chunks (1024 tokens, 128 overlap), but completely changed how we index. Instead of indexing every chunk independently, we index semantic blocks: one embedding per logical section, not one per sliding window.
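For concreteness, here's a minimal sketch of the fixed-size baseline. The helper name and token representation are my own (the post doesn't show its splitter); it just illustrates the 1024-token window with 128-token overlap:

```python
def fixed_size_chunks(tokens, chunk_size=1024, overlap=128):
    # Slide a window of chunk_size tokens, stepping by chunk_size - overlap
    # so consecutive chunks share `overlap` tokens of context.
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
```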
The difference is nuts. Retrieval went from 600ms to 180ms at p99 on the same dataset. Postgres and pgvector stayed exactly the same; the win was the chunking strategy, not the database.
# what we were doing (slow)
chunks = recursive_split(doc, chunk_size=1024, overlap=128)
for chunk in chunks:
    embed_and_store(chunk)

# what worked (fast)
sections = extract_semantic_sections(doc)
for section in sections:
    embedding = embed(section.full_text)
    store(embedding, metadata=section.boundaries)
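A minimal sketch of what extract_semantic_sections might look like, assuming plain-text docs with markdown-style `#` headings as section boundaries (the Section class and regex are my own; the real splitter depends on your document format):

```python
import re
from dataclasses import dataclass

@dataclass
class Section:
    full_text: str
    boundaries: tuple  # (start_line, end_line)

def extract_semantic_sections(doc: str):
    # One section per heading-delimited block: the whole logical
    # section gets a single embedding instead of many sliding windows.
    lines = doc.splitlines()
    sections, start = [], 0
    for i, line in enumerate(lines):
        if re.match(r"#{1,6} ", line) and i > start:
            sections.append(Section("\n".join(lines[start:i]), (start, i - 1)))
            start = i
    sections.append(Section("\n".join(lines[start:]), (start, len(lines) - 1)))
    return sections
```

Storing the line boundaries as metadata lets you fetch the exact source span at query time without re-chunking.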
If you're tuning RAG latency, look at your chunking first. The vector DB is usually fine; we're still on Postgres, and it's doing the job.