In my previous post, I discussed the ideal hardware setup for running multiple concurrent scraping jobs. In this post I'll describe my setup for processing thousands of scraping jobs.
The Producer / Consumer Pattern
In this post I won't go into coding specifics, since you may want to do it differently, but I'll discuss the patterns I used to process thousands of scraping jobs. I used the producer / consumer design pattern, where my "producer" server assigned new scraping jobs to available "consumer" scraping servers. Between the producer and the consumers sat a job queue stored in a database.
I had a single Amazon instance whose job was to poll my database job queue for any jobs whose status was pending. When the server found a job, it queried my consumer scraping servers to see how many jobs each was running. If one had fewer than my ideal six currently running, as mentioned in my previous post, it assigned the new scraping job to that server and set its status to queued.
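The producer step described above can be sketched roughly as follows. This is only an illustration, not the code I actually ran: the `jobs` table, its columns, the server names and every function name are hypothetical, SQLite stands in for whatever database you use, and the load on each server is counted from the queue itself rather than by asking each server.

```python
import sqlite3

MAX_JOBS_PER_SERVER = 6  # the "ideal six" concurrent jobs from the previous post


def make_queue():
    """Create an in-memory job queue (hypothetical schema, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL, server TEXT)"
    )
    return conn


def active_count(conn, server):
    """Count the jobs a server currently holds (queued or processing)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM jobs "
        "WHERE server = ? AND status IN ('queued', 'processing')",
        (server,),
    ).fetchone()
    return row[0]


def assign_pending_jobs(conn, servers):
    """One producer pass: give each pending job to a server with spare capacity."""
    assigned = 0
    pending = conn.execute(
        "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for (job_id,) in pending:
        target = next(
            (s for s in servers if active_count(conn, s) < MAX_JOBS_PER_SERVER),
            None,
        )
        if target is None:
            break  # every server is full; leave the rest for the next poll
        conn.execute(
            "UPDATE jobs SET status = 'queued', server = ? WHERE id = ?",
            (target, job_id),
        )
        assigned += 1
    conn.commit()
    return assigned
```

In a real setup you would call `assign_pending_jobs` in a loop with a short sleep between passes, so jobs left pending get picked up as soon as a server frees a slot.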
I had a variable number of consumer scraping servers, depending on how many jobs needed to be processed. Each scraping server polled the database job queue for any jobs assigned to it with the status queued. When it found a queued job, the server set the job's status to processing. It then fired up a new thread, as discussed in my previous post, ran the scraping job, and finally updated the job's status to mark it as finished.
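Here is a minimal sketch of one consumer pass, again under assumptions of my choosing: the same hypothetical `jobs` table (`id`, `status`, `server`), a `scrape` callable you supply, and a placeholder `"done"` for the final status. For simplicity the threads only do the scraping, and the main thread handles all database updates.

```python
import sqlite3
import threading

FINISHED = "done"  # placeholder name for the final status (an assumption)


def claim_queued_jobs(conn, server):
    """Flip this server's queued jobs to processing and return their ids."""
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM jobs WHERE server = ? AND status = 'queued' ORDER BY id",
            (server,),
        )
    ]
    for job_id in ids:
        conn.execute("UPDATE jobs SET status = 'processing' WHERE id = ?", (job_id,))
    conn.commit()
    return ids


def poll_once(conn, server, scrape):
    """One consumer pass: claim jobs, scrape each in its own thread, mark finished."""
    ids = claim_queued_jobs(conn, server)
    threads = [threading.Thread(target=scrape, args=(job_id,)) for job_id in ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # this sketch waits here; a real loop would keep polling meanwhile
    for job_id in ids:
        conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (FINISHED, job_id))
    conn.commit()
    return ids
```

Each scraping server would run `poll_once` repeatedly with its own name, so it only ever touches jobs the producer assigned to it.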
Selenium is really bad with memory leaks. Because of this, you constantly have to restart and reboot your servers. In my case, every 20 jobs I killed Selenium Grid, Xvfb, and the parent node process, and restarted them all. Then, every 80 jobs, I rebooted the entire server. This process is described in more detail here. Doing this allowed me to process thousands of jobs a day with no memory problems.
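The cleanup schedule boils down to a job counter per server. A small sketch of that decision, with hypothetical names and without the actual kill and reboot commands (those depend on how you launch everything):

```python
RESTART_EVERY = 20  # jobs between restarts of Selenium Grid, Xvfb and the node process
REBOOT_EVERY = 80   # jobs between full server reboots


def maintenance_action(jobs_completed):
    """Return which cleanup step, if any, is due after this many finished jobs."""
    if jobs_completed <= 0:
        return None
    if jobs_completed % REBOOT_EVERY == 0:
        return "reboot"  # a reboot also covers the restart due at the same count
    if jobs_completed % RESTART_EVERY == 0:
        return "restart"
    return None
```

Checking the reboot first matters because 80 is also a multiple of 20; without that ordering the server would only ever restart its processes.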
With this design pattern, I was able to process ~30k scraping jobs a day. Selenium is great for scraping, as long as you clean up your mess along the way. I hope these posts will be of help to you whenever you may need to rule the web.
Tuesday, 22 November 2016, 08:50 pm