Tag feed

#scraping

316 posts · 60 followers

HANZALA SALEEM · verid.hashnode.dev · Jul 12 · 16 min read

API Monitoring vs Scraping: Why the Loop Wins

If you have ever set a cron job to hit a competitor's pricing page, parse the HTML, diff it against yesterday's copy, and email yourself when something looks different, you already know the punchline...

Nityanand Thakur · blog.lsnnt.dev · Jun 4 · 3 min read

Building a Massive Q&A Dataset from Sarthaks.com

Long before starting my college journey, I wanted to scrape Sarthaks.com. I tried multiple times, but I kept failing. The biggest problem was rate limiting. Every time I tried traditional scraping met...

AlterLab · alterlab.hashnode.dev · May 10 · 6 min read

RAG Pipelines: Why Markdown Extraction Beats HTML for Token Efficiency

Feeding raw HTML into a Retrieval-Augmented Generation (RAG) pipeline is computationally expensive and highly inefficient. Large Language Models (LLMs) operate on tokens, and HTML DOM structures are notoriously token-heavy. When you pipe raw HTML int...

qcrao · qcrao.hashnode.dev · May 10 · 5 min read

How I scraped 50k YouTube subtitles in 2 weeks for $7 (and the legal gray zones)

When I started building TubeVocab, I had a chicken-and-egg problem. I needed a corpus of YouTube subtitles to mine ESL vocabulary from, but the official YouTube Data API v3 doesn't return subtitle bo...

AlterLab · alterlab.hashnode.dev · May 9 · 9 min read

Build Web-Aware AI Agents in n8n Using Clean Markdown Extraction

The Token Economics of HTML vs. Markdown: Autonomous AI agents require access to real-time web data to make informed decisions. However, the standard approach of feeding raw HTML directly into a Large Language Model (LLM) is a critical architectural f...

AlterLab · alterlab.hashnode.dev · May 5 · 8 min read

Evaluating Web Scraping APIs for RAG Pipelines

Building a Retrieval-Augmented Generation (RAG) pipeline requires feeding raw web data into a vector database. But web data is messy, HTML is bloated, and public endpoints aggressively rate-limit incoming traffic. Selecting the right web scraping API...

George Kioko · theaientrepreneur.hashnode.dev · May 5 · 6 min read

I just 3x'd the price on my LinkedIn scraper. Here's the math.

I pulled up my Apify dashboard this morning before coffee. The number I was hoping for was a positive margin. The number I got was negative 33 percent. $20.94 in compute and proxy cost. $15.63 in revenue. I was paying users to run my actor. The actor...

AlterLab · alterlab.hashnode.dev · May 1 · 5 min read

Reduce RAG Token Waste: Optimize Scraping to Markdown & JSON

Raw HTML bloats Retrieval-Augmented Generation (RAG) pipelines. An average web page consists of 80% markup and 20% actual content. Passing this raw Document Object Model (DOM) to a Large Language Model wastes tokens, increases latency, and severely d...

AlterLab · alterlab.hashnode.dev · Apr 30 · 4 min read

How to Scrape Glassdoor Data: Complete Guide for 2026

Disclaimer: This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping. Extracting job market data requires navigating complex front-end architectures. Public job boards like Glassdoo...

AlterLab · alterlab.hashnode.dev · Apr 30 · 8 min read

How to Scrape Airbnb Data: Complete Guide for 2026

Disclaimer: This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping. When building data pipelines to monitor the short-term rental market, raw HTML extraction is only the first ste...
