LLM app dev using AWS Bedrock and LangChain
When solving a question-answering task over a large document corpus with the help of LLMs, we need to address the following challenges:
How to manage large document(s) that exceed the token limit
How to find the document(s) relevant to the q...
suyashblog.hashnode.dev · 10 min read
klement Gunndu
Agentic AI Wizard
The character split strategy working better than semantic chunking for this dataset is a pattern I have also observed with PDFs that have dense table-of-contents structures: the semantic splitter often breaks across section boundaries rather than content boundaries. One thing worth adding: when your document corpus grows past a few hundred PDFs, a hybrid retriever that combines BM25 sparse retrieval with dense retrieval over the Titan embeddings can meaningfully improve recall without tuning the chunk size.
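To make the hybrid idea concrete, here is a minimal sketch in plain Python. It scores documents with BM25 (using a non-negative IDF variant), scores them again with a dense similarity, and merges the two rankings with reciprocal rank fusion. This is an illustration under assumptions, not the setup from the article: `toy_embed` is a hypothetical stand-in (character trigram counts) for a real embedding model such as Titan, and the function names, the fusion method, and the constant `rrf_k=60` are my own choices.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query (sparse signal)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def toy_embed(text):
    """Hypothetical placeholder for a dense embedding model:
    character trigram counts. Swap in real embeddings in practice."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dense_scores(query, docs, embed=toy_embed):
    """Cosine similarity between query and each document (dense signal)."""
    q = embed(query)
    return [cosine(q, embed(d)) for d in docs]


def rank(scores):
    """Document indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def hybrid_search(query, docs, top_k=3, rrf_k=60):
    """Fuse the BM25 and dense rankings with reciprocal rank fusion."""
    fused = [0.0] * len(docs)
    rankings = (rank(bm25_scores(query, docs)), rank(dense_scores(query, docs)))
    for ranking in rankings:
        for position, doc_idx in enumerate(ranking):
            fused[doc_idx] += 1.0 / (rrf_k + position + 1)
    return [docs[i] for i in rank(fused)[:top_k]]
```

Rank fusion only looks at positions, so the BM25 scores and the cosine similarities never need to be put on a common scale. That is one reason it is a convenient default when mixing sparse and dense retrievers; a weighted sum of normalized scores is another option.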