Building a G2 and Capterra Scraper in Python: Handling Cloudflare and Pagination
Scraping G2 and Capterra Review Data in Python
When you're building datasets for sentiment analysis or competitive benchmarking, scraping review data from sites like G2 and Capterra hits roadblocks fast. The core technical problem is bypassing anti-b...
vhubhashnodedev.hashnode.dev · 7 min read
cloudscraper is basically dead weight now. Cloudflare's Turnstile is way smarter these days: it's checking TLS 1.3 JA4 fingerprints and HTTP/2 frame ordering, and cloudscraper just can't fake that stuff anymore. Swap it out for curl_cffi with impersonate="chrome120" to get past the passive checks, though if you hit an actual Turnstile challenge you're still going to need a real browser via Playwright.

That hardcoded time.sleep(2) is fragile as hell and will get you IP-banned quick. Go with exponential backoff instead, and actually parse the Retry-After header on 429 and 503 responses so you're playing nice with rate limits.

The guide's also missing three things: cascading selectors (fallbacks) for when the DOM inevitably shifts, normalization of dates and ratings (which will bite you hard downstream), and any dedup logic for reviews that syndicate across multiple platforms. Skip those at your peril; enterprise folks lose weeks debugging this stuff.

For small projects you can probably skate by, but throw in some safe-extract patterns and respect the rate limits (think 1–3 requests per second, stricter if you're going async). If you're playing for keeps, stack up curl_cffi + Pydantic validation + PostgreSQL JSONB for raw storage before you normalize.
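To make the backoff, normalization, and dedup points concrete, here's a rough stdlib-only sketch. None of it comes from the guide: the function names (retry_delay, normalize_rating, normalize_date, dedupe), the date formats, the 0 to 5 rating scale, and the "text" field on each review are all my own assumptions, so treat it as a starting point and adjust to whatever the pages actually return.

```python
import hashlib
import random
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    A parseable Retry-After header value wins; otherwise fall back to
    exponential backoff with full jitter, capped at `cap` seconds.
    """
    if retry_after:
        value = retry_after.strip()
        if value.isdigit():  # delta-seconds form, e.g. "7"
            return float(value)
        try:  # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
            when = parsedate_to_datetime(value)
            wait = (when - datetime.now(timezone.utc)).total_seconds()
            return max(wait, 0.0)
        except (TypeError, ValueError):
            pass  # unparseable header: use the backoff below
    return random.uniform(0, min(cap, base * 2 ** attempt))


def normalize_rating(raw, scale=5.0):
    """Map strings like '4.5 out of 5' or '9/10' onto a 0..scale float."""
    nums = re.findall(r"\d+(?:[.,]\d+)?", raw)
    if not nums:
        return None
    value = float(nums[0].replace(",", "."))
    out_of = float(nums[1].replace(",", ".")) if len(nums) > 1 else scale
    if out_of <= 0 or value > out_of:
        return None
    return round(value / out_of * scale, 2)


# Assumed formats; month names rely on an English/C locale.
DATE_FORMATS = ("%Y-%m-%d", "%b %d, %Y", "%B %d, %Y", "%d %B %Y")


def normalize_date(raw):
    """Return an ISO 8601 date string, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def review_fingerprint(text):
    """Hash of the review body with case, spacing and punctuation ignored."""
    canonical = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(reviews):
    """Keep the first copy of each review, comparing by body fingerprint."""
    seen, unique = set(), []
    for review in reviews:
        key = review_fingerprint(review["text"])
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique
```

In a fetch loop you'd call something like time.sleep(retry_delay(attempt, response.headers.get("Retry-After"))) after a 429 or 503. The fingerprint only catches copies whose text matches after cleanup; if a platform truncates or edits syndicated reviews, you'd need fuzzy matching on top.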