
Creating and running Selenium scraping jobs with Node.js

Sean Hill
Nov 22, 2016

In my previous post I discussed how to get Selenium Grid running on a Linux environment. In this post I'll discuss how I create and run scraping jobs with Node.js.

The Parent process

My scraping server listens for new scraping jobs. I'll cover how jobs reach the server in another post; here I'll focus on how I create and run them.

Whenever my main Node process finds a new scraping job, I fork a child process from the parent. (Node's child_process.fork creates a separate process rather than a thread, so each job gets its own event loop and memory.)

// Fork a new node process
var path = require('path');
var threadFile = path.resolve('/path/to/thread');
var thread = require('child_process').fork(threadFile, [], { silent: true });

thread.stderr.on('data', function (data) {
  console.log('Thread err:', data.toString());
});

thread.stdout.on('data', function (data) {
  console.log('Thread log:', data.toString());
});

// Send data to that new thread
thread.send({
  any: 'data',
  you: 'want'
});

// Wait for the response from the thread, then clean it up
thread.on('message', function (response) {
  console.log('Thread response:', response);
  thread.kill();
});

The Child process

Forking the work into a child process keeps the parent's event loop free to accept new jobs as they arrive.

Here's a sample of what the thread file could look like.

// Scraper Thread

// Report uncaught exceptions to the parent, then stop this process.
// Error objects don't survive process.send's JSON serialization, so
// send the message and stack explicitly.
process.on('uncaughtException', function (err) {
  process.send({
    err: err.message,
    stack: err.stack
  });
  process.exit(1);
});

// Fires when the thread is created and sent data from the parent thread
process.on('message', function (data) {

  console.log('Data from parent:', data);

  var webdriverio = require('webdriverio');
  var options = { desiredCapabilities: { browserName: 'chrome' } };
  var client = webdriverio.remote(options);

  client
    .init()
    .url('https://duckduckgo.com/')
    .setValue('#search_form_input_homepage', 'WebdriverIO')
    .click('#search_button_homepage')
    .getTitle().then(function(title) {
      console.log('Title is: ' + title);
    })
    .end()
    .then(function () {
      process.send({
        success: true,
        data: {
          its: 'done'
        }
      });
    })
    .catch(function (err) {
      // Promise rejections aren't caught by the uncaughtException handler
      process.send({ err: err.message });
    });

});
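On the parent side, it also helps to guard against a child that hangs and never reports back. The helper below is a hypothetical sketch (the names `awaitResponse` and `timeoutMs` are mine, not from the post): it kills the child either when the job's message arrives or when a deadline passes, whichever comes first.

```javascript
// Sketch: wait for a forked child's response, with a deadline.
// `child` is the object returned by child_process.fork.
function awaitResponse(child, timeoutMs, callback) {
  var finished = false;

  var timer = setTimeout(function () {
    if (finished) return;
    finished = true;
    child.kill();                    // child never reported back
    callback(new Error('scrape job timed out'));
  }, timeoutMs);

  child.on('message', function (response) {
    if (finished) return;
    finished = true;
    clearTimeout(timer);
    child.kill();                    // job done, release the process
    callback(null, response);
  });
}
```

The `finished` flag guarantees the callback fires exactly once even if the timeout and the message race each other.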

Conclusion

Node.js is fantastic, but you don't want to create a bottleneck on your main process. Creating a child process for each job lets you scrape more efficiently. In my next post I'll discuss the ideal hardware setup for running multiple concurrent scraping jobs.