
Finding a sturdy scraping framework

Sean Hill · Nov 22, 2016

In early 2014 I built a massive scraper (nicknamed The Kraken) focused on fetching data from thousands of online portals. When I first began, I researched for days looking for a sturdy scraping framework that would suit my needs. I tried my hand at PhantomJS, but soon found it wasn't strong enough for what I needed to accomplish.

I tried Nightmare, a high-level browser automation library, which at the time was just a handy wrapper around PhantomJS. It was a little better, but I still found myself writing code to pause while waiting for JavaScript to load on the page. I continued looking around and came across Selenium and Webdriver.io. I gave them a shot, and was amazed when a Chrome instance popped up and I could actually see with my own eyes what was being automated. Even better, my commands would wait until the page had fully loaded all the JavaScript needed to execute.

I had found my framework. Selenium would be my savior, I thought.

And it was, while developing locally on my computer. For instance, getting data from an online portal was as simple as this:

var webdriverio = require('webdriverio');
var options = { desiredCapabilities: { browserName: 'chrome' } };
var client = webdriverio.remote(options);

client
    .init()
    .url('http://onlineportal.com')
    .login()    // custom command registered on the client
    .getData()  // custom command that scrapes the page
    .then(function (data) {
        console.log(data);
    })
    .end();
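The login() and getData() calls above aren't built-in WebdriverIO commands; they would be registered as custom commands on the client. A minimal sketch of what that registration might look like with the v4-era addCommand API — the selectors, environment variables, and function name here are placeholders, not the portal's real ones:

```javascript
// Sketch (not the original portal code): registering the custom
// login() and getData() commands via WebdriverIO's addCommand API.
function registerPortalCommands(client) {
    client.addCommand('login', function () {
        // Fill in the portal's login form and submit it.
        // Selectors and credential env vars are illustrative.
        return this
            .setValue('#username', process.env.PORTAL_USER)
            .setValue('#password', process.env.PORTAL_PASS)
            .click('button[type=submit]');
    });

    client.addCommand('getData', function () {
        // getText over a selector matching several elements
        // resolves to an array of strings, one per matched row.
        return this.getText('.data-row');
    });
}
```

Once registered, both commands chain like any built-in WebdriverIO command, which is what makes the snippet above read so cleanly.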

Magically, a Chrome instance would pop up, log into the online portal, fetch my desired data, and return it in a clean manner. It was superb.

Then came the day of releasing this beast into the wild. I fired up a Linux instance, deployed the code, and tried to run a test. CAN I GET A BOOM BOOM? Linux didn't natively have Chrome installed, or any way to display the instance. After installing Google Chrome on the machine, along with Xvfb to emulate a virtual screen, I tried again. It failed again, saying there was no screen to attach to. I started Xvfb, tried once more, and hurrah! It worked.

Now, I thought, I can't be watching my Linux box all the time to make sure Xvfb is running 24/7. So I came up with a handy script in Node.js that boots up Selenium Grid and Xvfb.

Conclusion

Selenium is a mature framework that I chose to scrape tens of thousands of jobs a day. Next up, I'll dive into the details of that script to see what each section is doing.