In early 2014 I built a massive scraper (nicknamed The Kraken) focused on fetching data from thousands of online portals. When I first began, I spent days researching a sturdy scraping framework that would suit my needs. I tried my hand at PhantomJS, but soon found it wasn't robust enough for what I needed to accomplish.

I tried Nightmare, a high-level browser automation library, which at the time was just a handy wrapper around PhantomJS. It was a little better, but I still found myself writing code to pause while waiting for JavaScript to load on the page. I kept looking around and came across Selenium and Webdriver.io. I gave them a shot, and was amazed when a Chrome instance popped up and I could actually see with my own eyes what was being automated. Even better, my commands would wait to execute until the page had fully loaded all the JavaScript it needed.

I had found my framework. Selenium would be my savior, I thought.

And it was, while I was developing locally on my computer. For instance, getting data from an online portal was as simple as this:

var webdriverio = require('webdriverio');

// Drive a local Chrome instance.
var options = { desiredCapabilities: { browserName: 'chrome' } };
var client = webdriverio.remote(options);

client
    .init()                          // start the browser session
    .url('http://onlineportal.com')  // navigate to the portal
    .login()                         // custom command: authenticate
    .getData()                       // custom command: pull out the data we want
    .then(function (data) {
        console.log(data);
    })
    .end();                          // close the browser session

Magically, a Chrome instance would pop up, log into the online portal, fetch my desired data, and return it in a clean manner. It was superb.
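
The .login() and .getData() calls above aren't part of WebdriverIO itself; they're custom commands registered on the client before the chain runs. A minimal sketch of how that registration might look, with placeholder selectors and credentials standing in for the real portal's login form:

// Placeholder selectors and credentials -- the real portal's markup will differ.
client.addCommand('login', function () {
    return this
        .setValue('#username', 'my-username')
        .setValue('#password', 'my-password')
        .click('#login-button');
});

client.addCommand('getData', function () {
    // Grab the text of whatever elements hold the data we're after.
    return this.getText('.data-row');
});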

Then came the day of releasing this beast into the wild. I fired up a Linux instance, deployed the code, and tried to run a test. CAN I GET A BOOM BOOM? Linux didn't have Chrome natively installed, or any way to display the instance. After installing Google Chrome on the machine and Xvfb to emulate a virtual screen, I tried again. This time it failed complaining that there was no screen to attach to. I fired up Xvfb, tried once more, and hurrah! It worked.

Now, I thought, I can't be watching my Linux box all the time to make sure Xvfb is running 24/7. So I came up with a handy Node.js script that boots up Selenium Grid and Xvfb.
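
Roughly, the idea looks like this (a simplified sketch, not the exact script; it assumes Xvfb is installed and a selenium-server-standalone jar sits next to the script, and the display number, screen size, and jar name are placeholders):

var spawn = require('child_process').spawn;

// Placeholder values -- adjust to match your own box.
var DISPLAY = ':99';
var SELENIUM_JAR = 'selenium-server-standalone.jar';

// Make sure Chrome (and everything else we spawn) renders to the virtual screen.
process.env.DISPLAY = DISPLAY;

// Boot the virtual framebuffer so Chrome has a screen to attach to.
var xvfb = spawn('Xvfb', [DISPLAY, '-screen', '0', '1280x1024x24']);

// Boot the Selenium Grid hub, then a node that registers with it.
var hub = spawn('java', ['-jar', SELENIUM_JAR, '-role', 'hub']);
var node = spawn('java', [
    '-jar', SELENIUM_JAR,
    '-role', 'node',
    '-hub', 'http://localhost:4444/grid/register'
]);

var children = [xvfb, hub, node];

// Pipe output through so we can see what the grid is doing.
children.forEach(function (child) {
    child.stdout.on('data', function (data) { process.stdout.write(data); });
    child.stderr.on('data', function (data) { process.stderr.write(data); });
});

// If any process dies, tear everything down so a supervisor can restart the script.
children.forEach(function (child) {
    child.on('exit', function (code) {
        children.forEach(function (other) { other.kill(); });
        process.exit(code || 1);
    });
});

Running this under something like forever or a systemd unit keeps the whole stack up without babysitting the box.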

Conclusion

Selenium is a mature scraping framework that I chose to scrape tens of thousands of jobs a day. Next up, I'll dive into the details of this script to see what each section is doing.
