In my previous post I discussed how to get Selenium Grid running in a Linux environment. In this post I'll discuss how I create and run scraping jobs with Node.js.

The Parent thread

My scraping server listens for new scraping jobs. I'll explain how that works in another post; here I'll cover how I create and run each job once it arrives.

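Whenever my main Node process finds a new scraping job, I fork a child Node process from it. I call these "threads" throughout this post, but each one is a separate operating system process created with child_process.fork.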

// Fork a new node process
var path = require('path');
var threadFile = path.resolve('/path/to/thread');
var thread = require('child_process').fork(threadFile, [], { silent: true });

// With silent: true the child's stdout and stderr are piped to the parent
thread.stderr.on('data', function (data) {
  console.log('Thread err:', data.toString());
});

thread.stdout.on('data', function (data) {
  console.log('Thread log:', data.toString());
});

// Send data to that new thread
thread.send({
  any: 'data',
  you: 'want'
});

// Wait for the response from the thread
thread.on('message', function (response) {
  console.log('Thread response:', response);
});
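
The fork, send, and wait-for-message steps above can also be wrapped in a single function that returns a Promise. This is only a minimal sketch of one way to do it, not necessarily how my server is written: runJob is a hypothetical helper name, and it assumes the child replies with exactly one message per job.

```javascript
const { fork } = require('child_process');

// Hypothetical helper: runs one job in a forked child process and
// resolves with the first message the child sends back.
function runJob(threadFile, jobData) {
  return new Promise(function (resolve, reject) {
    const child = fork(threadFile, [], { silent: true });
    let settled = false;

    // Settle the promise once, then make sure the child is stopped.
    function finish(settle, value) {
      if (settled) return;
      settled = true;
      child.kill();
      settle(value);
    }

    child.stdout.on('data', function (data) {
      console.log('Thread log:', data.toString());
    });
    child.stderr.on('data', function (data) {
      console.log('Thread err:', data.toString());
    });

    child.on('message', function (response) {
      finish(resolve, response);
    });
    child.on('error', function (err) {
      finish(reject, err);
    });
    child.on('exit', function (code) {
      // Only reached first if the child died without replying.
      finish(reject, new Error('Child exited before replying (code ' + code + ')'));
    });

    child.send(jobData);
  });
}

// Usage (the path is the same placeholder as above):
// runJob('/path/to/thread', { any: 'data', you: 'want' })
//   .then(function (response) { console.log('Thread response:', response); })
//   .catch(function (err) { console.log('Job failed:', err.message); });
```

Rejecting on an early exit means a crashed child shows up as a failed job instead of a promise that never settles.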

The Child thread

Running each job in its own child process keeps the parent's event loop free, so the parent can keep picking up new jobs as they arrive.
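
To illustrate why a busy main process is a problem, here is a small sketch of the general effect; it is not a measurement of the scraper itself. measureTimerDelay is a hypothetical helper that schedules a 0 ms timer and then blocks the process with synchronous work, so the timer can only fire once that work is finished.

```javascript
// Hypothetical helper: schedules a 0 ms timer, then blocks synchronously
// for blockMs. Resolves with how long the timer actually took to fire.
function measureTimerDelay(blockMs) {
  return new Promise(function (resolve) {
    const start = Date.now();

    setTimeout(function () {
      resolve(Date.now() - start);
    }, 0);

    // Simulate synchronous, CPU-bound work on the main process.
    while (Date.now() - start < blockMs) {
      // busy wait
    }
  });
}

measureTimerDelay(200).then(function (delay) {
  console.log('Timer scheduled for 0 ms fired after ' + delay + ' ms');
});
```

Anything queued on the parent, including a newly arrived job, waits the same way while synchronous work is running. Moving that work into a child process avoids the wait.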

Here's a sample of what the thread file could look like.

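// Scraper Thread

// Uncaught Exceptions immediately stop the thread
process.on('uncaughtException', function (err) {
  // An Error object serializes to {} over the IPC channel, so send its stack as a string
  process.send({
    err: (err && err.stack) || String(err)
  }, function () {
    // Exit only after the message has been handed off to the parent
    process.exit(1);
  });
});

// Fires when the thread is created and sent data from the parent thread
process.on('message', function (data) {

  console.log('Data from parent:', data);

  var webdriverio = require('webdriverio');
  var options = { desiredCapabilities: { browserName: 'chrome' } };
  var client = webdriverio.remote(options);

  client
    .init()
    // Assumed URL: the selector below matches the DuckDuckGo homepage
    .url('https://duckduckgo.com/')
    .setValue('#search_form_input_homepage', 'WebdriverIO')
    .getTitle().then(function (title) {
      console.log('Title is: ' + title);
    })
    // Close the browser session before reporting back
    .end()
    .then(function () {
      process.send({
        success: true,
        data: {
          its: 'done'
        }
      });
    });
});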


Node.js is fantastic, but you don't want to create a bottleneck on your main process. Creating a child process for each job keeps the main process responsive and lets you scrape more efficiently. In my next post I'll discuss the ideal hardware setup to run multiple concurrent scraping jobs.

Mar 12, 2016