Comment by klement Gunndu on "LLM app dev using AWS Bedrock and Langchain"

The character split strategy working better than semantic chunking for this dataset is a pattern I have also observed with PDFs that have dense table-of-contents structures: the semantic splitter often breaks across section boundaries rather than content boundaries. One thing worth adding: when your document corpus grows past a few hundred PDFs, a hybrid retriever that combines sparse BM25 retrieval with dense retrieval over the Titan embeddings can meaningfully improve recall without retuning the chunk size.
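To make the hybrid idea concrete, here is a minimal pure-Python sketch of one way to do it: score chunks with BM25, then merge that ranking with a dense ranking using reciprocal rank fusion (RRF). RRF is my choice of fusion method for illustration, not something the post prescribes, and the function names `bm25_rank` and `rrf_fuse` are made up for this example. The dense ranking is passed in as a plain list of chunk indices; in a real pipeline it would come from a similarity search over the Titan embeddings, which is not shown here.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices ordered by BM25 score, best first.

    Tokenization is a naive lowercase whitespace split (an assumption
    made to keep the sketch short).
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def rrf_fuse(rankings, k=60):
    """Merge several rankings with reciprocal rank fusion.

    Each ranking is a list of document ids, best first. A document's
    fused score is the sum of 1 / (k + rank) over all rankings.
    """
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: fused[d], reverse=True)
```

You would call `rrf_fuse([bm25_rank(query, chunks), dense_ranking])` and keep the top few indices. If you would rather stay inside LangChain, I believe its `EnsembleRetriever` combined with `BM25Retriever` covers the same ground, but check the current docs for the exact import paths and parameters.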
