Hardware setup for running concurrent Selenium scraping jobs
In my previous post I discussed how to create and run Selenium scraping jobs with Node.js. In this post I'll discuss how to run multiple concurrent jobs on a single server.
The Scraping Server
Let's talk about a single scraping server. I'm using an m3.large Amazon EC2 instance, which comes with 2 vCPUs, 7.5 GB of RAM, and SSD storage.
As you probably know, Chrome is memory-hungry. I budgeted about 1 GB of RAM for each Selenium node so that Chrome has enough memory to launch and scrape. The ideal setup I found on this instance type was six Selenium nodes, each running one scraping job at a time. That uses roughly 6 GB of the 7.5 GB available and allows six scraping jobs to run concurrently.
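The sizing arithmetic can be sketched in a few lines of Node.js. This is only a rough illustration: the function name maxSeleniumNodes is hypothetical, and the 1.5 GB held back for the operating system and the Selenium hub is my assumption, not a measured figure. The 7.5 GB total and 1 GB per node come from the setup described above.

```javascript
// Rough memory budget for Selenium nodes on a single server.
// maxSeleniumNodes is a hypothetical helper written for this illustration.
function maxSeleniumNodes(totalRamGb, ramPerNodeGb, reservedGb) {
  if (ramPerNodeGb <= 0) {
    throw new RangeError('ramPerNodeGb must be positive');
  }
  // Memory left over after holding some back for the OS and the hub.
  const usableGb = totalRamGb - reservedGb;
  // Each node needs its full budget, so round down; never return a negative count.
  return Math.max(0, Math.floor(usableGb / ramPerNodeGb));
}

// m3.large: 7.5 GB total, about 1 GB per node,
// 1.5 GB reserved (assumed value, adjust for your own server).
const nodes = maxSeleniumNodes(7.5, 1, 1.5);
console.log(nodes); // 6
```

Treat the result as a starting point rather than a guarantee. Actual Chrome memory use depends on the pages being scraped, so it's worth watching the server under load before settling on a node count.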
Conclusion
A Selenium Grid hub with six nodes was the ideal setup for an m3.large instance on Amazon's platform, allowing six scraping jobs to run concurrently. Next up, I discuss how I created a job queue to process tens of thousands of jobs.