In my previous post, I discussed the ideal hardware setup for running multiple concurrent scraping jobs. In this post I'll describe my setup for processing thousands of scraping jobs.

The Producer / Consumer Pattern

I won't go into coding specifics in this post, since you may want to do it differently, but I'll discuss the patterns I used to process thousands of scraping jobs. I used the producer / consumer design pattern, where my "producer" server assigned new scraping jobs to available "consumer" scraping servers. Between the producer and the consumers sat a job queue stored in a database.
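Purely as an illustration of the job queue in the middle (not the code behind this setup), here is a minimal sketch of a queue table and its status lifecycle. It assumes SQLite as the database, and the table name `jobs`, the columns `url` and `assigned_server`, and the helper names are all hypothetical; only the four status values come from the description in this post.

```python
import sqlite3
from enum import Enum


class JobStatus(str, Enum):
    """Statuses a job moves through, in order."""
    PENDING = "pending"        # created, not yet assigned to a server
    QUEUED = "queued"          # assigned to a consumer by the producer
    PROCESSING = "processing"  # picked up by the consumer
    COMPLETED = "completed"    # scrape finished


def create_job_queue(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) jobs table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS jobs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending',
            assigned_server TEXT
        )
        """
    )
    conn.commit()


def add_job(conn: sqlite3.Connection, url: str) -> int:
    """Insert a new job in the pending state and return its id."""
    cur = conn.execute("INSERT INTO jobs (url) VALUES (?)", (url,))
    conn.commit()
    return cur.lastrowid
```

A new job starts as pending with no server assigned; the producer and consumers then move it through the remaining statuses.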

The Producer

I had a single Amazon instance whose job was to poll my database job queue for any jobs whose status was pending. When it found one, it queried my consumer scraping servers to see how many jobs each was running. If one had fewer than my ideal six running, as mentioned in my previous post, it assigned the new scraping job to that server and set the job's status to queued.
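One producer pass could look roughly like the sketch below. It makes a few assumptions that are not from the post: the queue lives in SQLite, the table and column names are hypothetical, and a server's load is estimated by counting its queued and processing jobs in the database rather than by asking the server directly. The limit of six is the figure mentioned above.

```python
import sqlite3
from typing import List

MAX_JOBS_PER_SERVER = 6  # the "ideal six" concurrent jobs per server

# Hypothetical schema, repeated here so the sketch is self-contained.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    assigned_server TEXT
)
"""


def active_job_count(conn: sqlite3.Connection, server: str) -> int:
    """Assumption: a server's load is its queued + processing jobs."""
    row = conn.execute(
        "SELECT COUNT(*) FROM jobs "
        "WHERE assigned_server = ? AND status IN ('queued', 'processing')",
        (server,),
    ).fetchone()
    return row[0]


def assign_pending_jobs(conn: sqlite3.Connection, servers: List[str]) -> int:
    """Assign pending jobs to servers with spare capacity; return how many."""
    pending = conn.execute(
        "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    assigned = 0
    for (job_id,) in pending:
        # Pick the first server that is below the concurrency limit.
        target = next(
            (s for s in servers
             if active_job_count(conn, s) < MAX_JOBS_PER_SERVER),
            None,
        )
        if target is None:
            break  # every server is full; try again on the next poll
        conn.execute(
            "UPDATE jobs SET status = 'queued', assigned_server = ? WHERE id = ?",
            (target, job_id),
        )
        assigned += 1
    conn.commit()
    return assigned
```

Calling `assign_pending_jobs` on a timer gives the polling behaviour: jobs that do not fit stay pending until a server frees up.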

The Consumer

I had a variable number of consumer scraping servers, depending on how many jobs needed to be processed. Each scraping server polled the database job queue for any jobs assigned to it with the status of queued. When it found a queued job, the server set the job's status to processing. It then fired up a new thread as discussed in my previous post, did the scraping job, and set the status of the job to completed.
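The consumer side can be sketched along the same lines. This is an assumption-laden illustration rather than the original implementation: it uses a SQLite file at `db_path`, hypothetical table and column names, and a caller-supplied `scrape` function standing in for the real Selenium work. Each thread opens its own connection, since SQLite connections should not be shared across threads by default.

```python
import sqlite3
import threading
from typing import Callable, List, Optional, Tuple


def claim_next_job(db_path: str, server_name: str) -> Optional[Tuple[int, str]]:
    """Move one queued job for this server to processing and return it."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT id, url FROM jobs "
            "WHERE assigned_server = ? AND status = 'queued' "
            "ORDER BY id LIMIT 1",
            (server_name,),
        ).fetchone()
        if row is None:
            return None
        # The status check in WHERE guards against claiming a job twice.
        cur = conn.execute(
            "UPDATE jobs SET status = 'processing' "
            "WHERE id = ? AND status = 'queued'",
            (row[0],),
        )
        conn.commit()
        return (row[0], row[1]) if cur.rowcount == 1 else None
    finally:
        conn.close()


def run_job(db_path: str, job_id: int, url: str,
            scrape: Callable[[str], object]) -> None:
    """Run the scrape, then mark the job completed."""
    scrape(url)  # hypothetical callable that does the actual scraping
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("UPDATE jobs SET status = 'completed' WHERE id = ?", (job_id,))
        conn.commit()
    finally:
        conn.close()


def poll_once(db_path: str, server_name: str,
              scrape: Callable[[str], object]) -> List[threading.Thread]:
    """Claim every queued job for this server and start one thread per job."""
    threads = []
    while True:
        job = claim_next_job(db_path, server_name)
        if job is None:
            break
        t = threading.Thread(target=run_job, args=(db_path, job[0], job[1], scrape))
        t.start()
        threads.append(t)
    return threads
```

This sketch does not handle a scrape that raises; a real consumer would also need a failed status or a retry path, which the post does not describe.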

Common Gotchas

Selenium is really bad with memory leaks. Because of this, you constantly have to restart and reboot your servers. In my case, every 20 jobs I killed Selenium Grid, Xvfb, and the parent node process, and restarted them all. Then every 80 jobs I rebooted the entire server. This process is described in more detail here. Doing this allowed me to process thousands of jobs a day with no memory problems.
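The restart schedule boils down to a counter check after each finished job. The sketch below only decides which action is due; the function name and the returned labels are hypothetical, and it assumes the thresholds are applied at exact multiples of the per-server job count, which the description above does not spell out. How the processes are actually killed and the machine rebooted is left out.

```python
from typing import Optional

RESTART_EVERY = 20  # restart Selenium Grid, Xvfb, and the parent process
REBOOT_EVERY = 80   # reboot the whole server


def maintenance_action(jobs_completed: int) -> Optional[str]:
    """Return the maintenance step due after this many completed jobs."""
    if jobs_completed <= 0:
        return None
    # A reboot supersedes a restart when both thresholds line up.
    if jobs_completed % REBOOT_EVERY == 0:
        return "reboot"
    if jobs_completed % RESTART_EVERY == 0:
        return "restart"
    return None
```

A consumer would call this after each completed job and act on the result before picking up more work.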

Conclusion

With this design pattern, I was able to process ~30k scraping jobs a day. Selenium is great for scraping, as long as you clean up your mess along the way. I hope these posts will be of help to you whenever you may need to rule the web.

Sean Hill

Full Stack Developer at Pingplot Inc.

Joined

Mar 12, 2016