In my previous post I discussed how to get Selenium Grid running on Linux. In this post I'll cover how I create and run scraping jobs with Node.js.

The Parent Process

For the scraping server to do any work, my main Node process listens for new scraping jobs. I'll explain how jobs get queued in another post; here I'll focus on how I create and run them.

Whenever my main Node process finds a new scraping job, it forks a child to handle it. Note that child_process.fork spawns a separate Node process (not a thread) with an IPC channel back to the parent.

// Fork a new Node child process
var path = require('path');
var fork = require('child_process').fork;

var childFile = path.resolve('/path/to/child');
var child = fork(childFile, [], { silent: true });

// With { silent: true }, the child's stdout/stderr are piped
// back to the parent instead of being shared with it
child.stderr.on('data', function (data) {
  console.log('Child err:', data.toString());
});

child.stdout.on('data', function (data) {
  console.log('Child log:', data.toString());
});

// Send the job data to the new child over the IPC channel
child.send({
  any: 'data',
  you: 'want'
});

// Wait for the child's response, then clean it up
child.on('message', function (response) {
  child.kill();
  console.log('Child response:', response);
});
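
One gap in the snippet above: if the child crashes or hangs before it sends a message, the parent never hears back. Here's a minimal sketch of how you could guard against that by also listening for the child's exit event and enforcing a timeout. The jobTimeout value is just an illustrative choice, not something from my actual setup.

// Guard against children that die or hang without responding
var responded = false;
var jobTimeout = 5 * 60 * 1000; // hypothetical 5 minute cap per job

// If the child exits without ever sending a message, flag the job
child.on('exit', function (code, signal) {
  if (!responded) {
    console.log('Child exited early. code:', code, 'signal:', signal);
  }
});

// If the child hangs (e.g. a stuck Selenium session), kill it
var timer = setTimeout(function () {
  if (!responded) {
    child.kill();
  }
}, jobTimeout);

child.on('message', function () {
  responded = true;
  clearTimeout(timer);
});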

The Child Process

Running each job in its own child process keeps the parent's event loop free, so new jobs can be picked up as soon as they arrive.

Here's a sample of what the child file could look like.

// Scraper child process

// Report uncaught exceptions to the parent, then exit.
// Error objects don't survive the JSON serialization used by
// the IPC channel, so send the message and stack explicitly.
process.on('uncaughtException', function (err) {
  process.send({
    err: { message: err.message, stack: err.stack }
  });
  process.exit(1);
});

// Fires when the parent sends the job data over the IPC channel
process.on('message', function (data) {

  console.log('Data from parent:', data);

  // webdriverio connects to a Selenium server on localhost:4444
  // by default, which is where the Selenium Grid from the
  // previous post is listening
  var webdriverio = require('webdriverio');
  var options = { desiredCapabilities: { browserName: 'chrome' } };
  var client = webdriverio.remote(options);

  client
    .init()
    .url('https://duckduckgo.com/')
    .setValue('#search_form_input_homepage', 'WebdriverIO')
    .click('#search_button_homepage')
    .getTitle().then(function (title) {
      console.log('Title is: ' + title);
    })
    .end()
    .then(function () {
      // Report success back to the parent
      process.send({
        success: true,
        data: {
          its: 'done'
        }
      });
    })
    .catch(function (err) {
      // Report Selenium errors instead of leaving them unhandled
      process.send({
        err: { message: err.message, stack: err.stack }
      });
    });

});
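
Putting the two pieces together, the fork-send-respond dance is easy to wrap in a reusable helper. This is just a sketch of one way to do it; runJob is a hypothetical helper name, not something from my actual codebase, and '/path/to/child' is the same placeholder as above.

// Sketch: wrap one scraping job in a Promise
var path = require('path');
var fork = require('child_process').fork;

function runJob(jobData) {
  return new Promise(function (resolve, reject) {
    var child = fork(path.resolve('/path/to/child'), [], { silent: true });

    child.on('message', function (response) {
      child.kill();
      if (response.err) {
        reject(response.err);
      } else {
        resolve(response.data);
      }
    });

    // Reject if the child died before responding
    // (a no-op if the promise already settled)
    child.on('exit', function (code) {
      reject(new Error('Child exited with code ' + code));
    });

    child.send(jobData);
  });
}

From there the parent can just call runJob({ any: 'data', you: 'want' }).then(...) for each job it finds.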

Conclusion

Node.js is fantastic, but you don't want to create a bottleneck on your main process. Forking a child process for each job keeps the main process responsive and lets you run jobs in parallel. In my next post I'll discuss the ideal hardware setup for running multiple concurrent scraping jobs.
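
As a rough illustration, "in parallel" still needs a cap so the box isn't overwhelmed by browsers. Here's a sketch of limiting concurrency using the hypothetical runJob helper from above; maxConcurrent is an illustrative value, and the right number depends on your hardware (the subject of that next post).

// Sketch: cap the number of concurrent scraping jobs
var maxConcurrent = 4; // hypothetical cap, tune to your hardware
var running = 0;
var queue = [];

function enqueue(jobData) {
  queue.push(jobData);
  next();
}

function next() {
  if (running >= maxConcurrent || queue.length === 0) {
    return;
  }
  running++;
  runJob(queue.shift())
    .then(function (data) { console.log('Job done:', data); })
    .catch(function (err) { console.log('Job failed:', err); })
    .then(function () {
      running--;
      next();
    });
}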
