Creating and running Selenium scraping jobs with Node.js
In my previous post I discussed how to get Selenium Grid running in a Linux environment. In this post I'll cover how I create and run scraping jobs with Node.js.
The Parent thread
For the scraping server to scrape websites, my main server process listens for new scraping jobs. I'll explain how those jobs arrive in another post, but here I'll cover how I create and run them.
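As a purely illustrative sketch of that dispatch loop (the real job source is a topic for that other post), the parent might poll for work on an interval; getNextJob and runJob here are hypothetical placeholders, not real APIs:

// Hypothetical dispatch loop: poll for work every few seconds.
// getNextJob() and runJob() are illustrative placeholders.
setInterval(function () {
    getNextJob(function (err, job) {
        if (!err && job) {
            runJob(job); // forks a child process, as shown next
        }
    });
}, 5000);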
Whenever my main Node.js process finds a new scraping job, I fork a child process from it. (Strictly speaking, fork creates a separate process rather than a thread, but I'll keep the "thread" naming below to match the code.)
// Fork a new node process
var path = require('path');
var threadFile = path.resolve('/path/to/thread');
var thread = require('child_process').fork(threadFile, [], { silent: true });

// With { silent: true }, the child's stdio is piped back to the parent,
// so we can log its output and errors here
thread.stderr.on('data', function (data) {
    console.log('Thread err:', data.toString());
});
thread.stdout.on('data', function (data) {
    console.log('Thread log:', data.toString());
});

// Send data to that new thread
thread.send({
    any: 'data',
    you: 'want'
});

// Wait for the response from the thread, then clean it up
thread.on('message', function (response) {
    console.log('Thread response:', response);
    thread.kill();
});
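In practice it can help to wrap this pattern in a reusable runner. Here's a minimal sketch along those lines, not the code from my setup: runJob and the 30-second timeout are assumptions I'm adding so a hung child doesn't live forever.

// Hypothetical wrapper around the fork pattern above.
// Kills the child if it hasn't responded within a timeout.
function runJob(jobData, callback) {
    var path = require('path');
    var threadFile = path.resolve('/path/to/thread');
    var thread = require('child_process').fork(threadFile, [], { silent: true });

    var timer = setTimeout(function () {
        thread.kill();
        callback(new Error('Job timed out'));
    }, 30000); // 30 seconds is an arbitrary choice

    thread.on('message', function (response) {
        clearTimeout(timer);
        thread.kill();
        callback(null, response);
    });

    thread.send(jobData);
}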
The Child thread
Forking a child process for each job keeps the parent's event loop free, so new jobs can be picked up as soon as they come through.
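Since each child is a full process, you'll likely want to cap how many run at once. This is an illustrative sketch of one simple approach, reusing the hypothetical runJob from above; MAX_THREADS and the pending queue are my own additions:

// Illustrative concurrency cap: queue jobs beyond MAX_THREADS
var MAX_THREADS = 4; // arbitrary; tune for your hardware
var active = 0;
var pending = [];

function schedule(jobData) {
    if (active >= MAX_THREADS) {
        pending.push(jobData);
        return;
    }
    active++;
    runJob(jobData, function () {
        active--;
        if (pending.length) {
            schedule(pending.shift());
        }
    });
}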
Here's a sample of what the thread file could look like.
// Scraper Thread

// An uncaught exception should immediately stop the thread.
// Error objects don't serialize cleanly over process.send(),
// so send the message and stack explicitly, then exit
process.on('uncaughtException', function (err) {
    process.send({
        err: { message: err.message, stack: err.stack }
    }, function () {
        process.exit(1);
    });
});

// Fires when the thread is created and sent data from the parent thread
process.on('message', function (data) {
    console.log('Data from parent:', data);

    var webdriverio = require('webdriverio');
    var options = { desiredCapabilities: { browserName: 'chrome' } };
    var client = webdriverio.remote(options);

    client
        .init()
        .url('https://duckduckgo.com/')
        .setValue('#search_form_input_homepage', 'WebdriverIO')
        .click('#search_button_homepage')
        .getTitle().then(function (title) {
            console.log('Title is: ' + title);
        })
        .end()
        .then(function () {
            // Report success back to the parent thread
            process.send({
                success: true,
                data: {
                    its: 'done'
                }
            });
        });
});
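One caveat worth noting: a failed webdriver command rejects the chain, which may surface as an unhandled rejection rather than an uncaught exception, so the handler above won't necessarily catch it. A hedged addition, assuming the chainable v4-style API used above supports .catch, is to report failures to the parent explicitly:

// Hedged sketch: report webdriver failures to the parent explicitly.
// The steps elided here are the same as in the chain above.
client
    .init()
    .url('https://duckduckgo.com/')
    // ...same setValue/click/getTitle steps as above...
    .catch(function (err) {
        process.send({ success: false, err: err.message });
        return client.end();
    });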
Conclusion
Node.js is fantastic, but you don't want to create a bottleneck on your main process. Forking a child process for each job lets you scrape more efficiently. In my next post I'll discuss the ideal hardware setup for running multiple concurrent scraping jobs.