Discussion on "Reduce RAG Token Waste: Optimize Scraping to Markdown & JSON"

AlterLab · 2026-05-01T13:57:45.018Z

Raw HTML bloats Retrieval-Augmented Generation (RAG) pipelines. An average web page consists of 80% markup and 20% actual content. Passing this raw Document Object Model (DOM) to a Large Language Model wastes tokens, increases latency, and severely d...

First, the "Markdown is the universal solution" angle kind of falls apart when you're dealing with multi-column layouts or side-by-side product comparisons. You lose important structural relationships when you convert everything to Markdown. A hybrid of Markdown plus YAML metadata gets you the best of both worlds without ballooning your tokens.

Also, their 1000-token JSON estimate is way too optimistic. A well-structured product page with descriptions and nested attributes realistically hits 2500-5000 tokens, so the actual savings come from being smart about which fields you grab, not just from picking a different format.

The wait_for_network_idle validation also seems risky to me. It's pretty fragile for SPAs that keep polling forever, and you're better off checking for explicit DOM mutations and verifying content-length instead.

But tbh the header-based chunking and edge transformation stuff they recommend is chef's kiss. That's where you actually see real money saved once you're scaling up to something like 2M+ requests a month.

One more thing: they should probably dig into change-detection strategies for incremental updates, since you can save 60-70% on embeddings for sites that update constantly. They should also give people clearer cost breakdowns for building in-house versus just paying for an API, since going DIY only makes sense financially once you hit around 5M pages a month.
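To make the header-based chunking and change-detection points a bit more concrete, here is a rough Python sketch of how the two could fit together. It is not the article's implementation. The function names (`chunk_by_headers`, `chunk_hash`, `chunks_to_embed`) and the idea of keeping a set of SHA-256 digests are assumptions for illustration, and the heading regex is naive (it would also split on `#` lines inside fenced code blocks).

```python
import hashlib
import re

# ATX-style Markdown headings ("# ", "## ", ...) at the start of a line.
HEADING = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def chunk_by_headers(markdown: str) -> list[str]:
    """Split Markdown into chunks that each start at a heading line."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(markdown)]
    chunks = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]


def chunk_hash(chunk: str) -> str:
    """Hash a chunk after collapsing whitespace (hypothetical normalization)."""
    normalized = " ".join(chunk.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def chunks_to_embed(markdown: str, seen: set[str]) -> list[str]:
    """Return only chunks whose hash is not in `seen`, recording new hashes."""
    fresh = []
    for chunk in chunk_by_headers(markdown):
        digest = chunk_hash(chunk)
        if digest not in seen:
            seen.add(digest)
            fresh.append(chunk)
    return fresh


if __name__ == "__main__":
    seen: set[str] = set()
    v1 = "# Title\nIntro\n\n## Specs\nWeight: 2kg\n\n## Price\n$10\n"
    v2 = v1.replace("$10", "$12")
    print(len(chunks_to_embed(v1, seen)))  # first crawl: every chunk is new
    print(chunks_to_embed(v2, seen))       # second crawl: only the edited section
```

In this toy run only the "Price" section gets re-embedded on the second crawl. How much that saves in practice depends on how localized a site's edits are, so treat the 60-70% figure as an estimate rather than something this sketch demonstrates.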

Responses(1)