The Engineered Notes · inginexys.hashnode.dev

May 15 · 13 min read
The Empty Quadrant: Mapping the Design Space of Frontend PDF Extraction
A user asked me a sharp question yesterday: Looking at your extraction pipeline, pdfjs + geometryWorker + lattice + visualGridMapper, what makes this any different from any other extraction approach

May 13 · 5 min read
How to Stop PDF Parsers from Hallucinating Tables out of Thin Air
PDF extraction is usually blind. If you've ever tried to write a script to scrape a PDF, you know exactly what I mean. You run the PDF through a generic text extractor, and instead of a clean table, y

Apr 10 · 8 min read
Cleaning Broken HTML Tables from PDFs, Scrapes, and Legacy Exports in Vanilla JS
HTML tables are liars. If you haven't worked deeply with HTML tables, you might think a table is just a simple 2D array: table[row][col]. The moment an HTML table introduces a colspan or a rowspan, th