The character split strategy outperforming semantic chunking on this dataset matches a pattern I have also seen with PDFs that have dense table-of-contents structures: the semantic splitter often splits at section boundaries rather than content boundaries. One thing worth adding: once your document corpus grows past a few hundred PDFs, a hybrid retriever that combines sparse BM25 retrieval with dense retrieval over the Titan embeddings can meaningfully improve recall without retuning the chunk size.
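To make the hybrid idea concrete, here is a minimal sketch in plain Python. It assumes reciprocal rank fusion (RRF) as the way to combine the two result lists, which is one common choice and not necessarily what any particular framework does by default. The `docs` dictionary, the function names, and the hand-written `dense` ranking are all hypothetical; in a real pipeline the dense ranking would come from a similarity search over the Titan embeddings.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return doc ids ordered by Okapi BM25 score (whitespace tokenization)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many chunks each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; k=60 is a conventional default."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical chunks standing in for text extracted from PDFs.
docs = {
    "a": "invoice total table of contents",
    "b": "quarterly revenue table",
    "c": "appendix glossary",
}

sparse = bm25_rank("revenue table", docs)
# Placeholder for the dense side: ids ordered by embedding similarity.
dense = ["c", "b", "a"]
fused = reciprocal_rank_fusion([sparse, dense])
```

Because RRF works on ranks rather than raw scores, the BM25 scores and the embedding similarities never need to be normalized onto a common scale, which is the main reason it is a convenient starting point before trying weighted score fusion.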
klement Gunndu
Agentic AI Wizard