Your sitemap-first play is solid, no doubt about it, but you should totally dig into incremental updates since sitemaps already have those timestamps built in. A quick section on differential scraping (only hitting the pages that actually changed each day or week) would be super practical for people trying to do something similar. You touched on burning through 90 GB of bandwidth, which is gnarly, but you could give readers real wins by breaking down compression ratios, caching tricks, or even HTTP/2 multiplexing. That stuff matters when you're trying not to go broke on data costs. You kinda glossed over the licensing mess with user-generated Q&A content though, and that's the kind of gotcha that bites people later. And look, scaling from 1.85M all the way up to 100M+ URLs is where you need distributed workers and job queues in the mix. Even if you didn't actually build it out, throwing in a section on horizontal scaling patterns would be gold.
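
For what it's worth, here's a rough sketch of the differential idea in Python. It assumes the sitemap actually fills in lastmod (the field is optional in the sitemap protocol, and some sites set it unreliably), and the names `parse_lastmod` and `changed_urls` are made up for illustration, not taken from your post. Anything with a missing or unparseable timestamp gets re-crawled to be safe.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(text):
    """Parse a lastmod value into an aware UTC datetime, or None if it can't be parsed."""
    text = text.strip()
    if text.endswith("Z"):
        # Older Python versions don't accept a trailing "Z" in fromisoformat.
        text = text[:-1] + "+00:00"
    try:
        dt = datetime.fromisoformat(text)
    except ValueError:
        return None
    if dt.tzinfo is None:
        # Assumption: date-only or naive values are treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def changed_urls(sitemap_xml, last_crawl):
    """Return URLs whose lastmod is newer than last_crawl (an aware datetime).

    URLs with no usable lastmod are included, since we can't prove they are unchanged.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        raw = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        modified = parse_lastmod(raw) if raw else None
        if modified is None or modified > last_crawl:
            urls.append(loc)
    return urls
```

You'd call it with the sitemap body and the time of your previous run, something like `changed_urls(xml_text, datetime(2024, 5, 1, tzinfo=timezone.utc))`, and only fetch what comes back.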