Finding a sturdy scraping framework
In early 2014 I built a massive scraper (nicknamed The Kraken) to fetch data from thousands of online portals. When I first began, I spent days researching a sturdy scraping framework that would suit my needs. I tried my hand at PhantomJS, but soon found it wasn't strong enough for what I needed to accomplish.
I tried Nightmare, a high-level browser automation library, which at the time was just a handy wrapper around PhantomJS. It was a little better, but I still found myself writing code to pause while waiting for JavaScript to load on the page. I continued looking around and came across Selenium and Webdriver.io. I gave them a shot, and was amazed when a Chrome instance popped up and I could actually see with my own eyes what was being automated. Even better, my commands would wait to execute until the page had fully loaded all the JavaScript it needed.
I had found my framework. Selenium would be my savior, I thought.
And it was, while I was developing locally on my computer. For instance, getting data from an online portal was as simple as this:
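var webdriverio = require('webdriverio');

// Ask Selenium for a Chrome session.
var options = { desiredCapabilities: { browserName: 'chrome' } };
var client = webdriverio.remote(options);

// login() and getData() are custom commands, not built into WebdriverIO;
// they are assumed to be registered elsewhere with client.addCommand().
client
  .init()
  .url('onlineportal.com')
  .login()
  .getData()
  .then(function (data) {
    console.log(data);
  })
  .end();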
Magically, a Chrome instance would pop up, log into the online portal, fetch my desired data, and return it in a clean manner. It was superb.
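The login() and getData() calls in the snippet above are not part of WebdriverIO itself, so they have to be defined somewhere. Here is a minimal sketch of how they might be registered with addCommand in the WebdriverIO v4 style used above. The function name registerPortalCommands, the CSS selectors, the 10-second timeout, and the credentials object are all hypothetical and only there for illustration; a real portal would need its own.

```javascript
// Sketch only: registers two custom commands on a WebdriverIO (v4-style) client.
// Every selector below is a made-up placeholder, not taken from a real portal.
function registerPortalCommands(client, credentials) {
  // login(): fill in the sign-in form and submit it.
  client.addCommand('login', function () {
    return this
      .setValue('#username', credentials.username)
      .setValue('#password', credentials.password)
      .click('#login-button');
  });

  // getData(): wait for the results to be rendered, then return their text.
  client.addCommand('getData', function () {
    return this
      .waitForExist('.result-row', 10000)
      .getText('.result-row');
  });

  return client;
}
```

Calling registerPortalCommands(client, { username: '...', password: '...' }) before client.init() would make the chain above work as written, since each custom command returns the client's own chain.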
Then came the day of releasing this beast into the wild. I fired up a Linux instance, deployed the code, and tried to run a test. CAN I GET A BOOM BOOM? The Linux box didn't have Chrome installed, or any way to display the instance. I installed Google Chrome and Xvfb (to emulate a virtual screen) and tried again. It failed again, this time saying there was no screen to attach to on the device. I fired up Xvfb and tried once more. Hurrah! It worked.
Now, I thought, I can't be watching my Linux box all the time, making sure Xvfb is running 24/7. So I came up with a handy script in Node.js that boots up Selenium Grid and Xvfb.
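To give a rough idea of the approach, here is a minimal sketch of a supervisor that starts Xvfb and a Selenium Grid hub and node, and restarts any of them that exits. It is an illustration under assumptions, not the exact script: the display number :99, the screen size, the jar file name selenium-server-standalone.jar, the default hub port 4444, and the 2-second restart delay are all placeholders, and the -role flags are the ones used by Selenium 2 and 3.

```javascript
var childProcess = require('child_process');

// Assumed settings; change them to match the machine.
var DISPLAY = ':99';
var SELENIUM_JAR = 'selenium-server-standalone.jar';

// Every process that has to stay alive while the scraper runs.
var services = [
  { name: 'xvfb', command: 'Xvfb', args: [DISPLAY, '-screen', '0', '1280x1024x24'] },
  { name: 'selenium-hub', command: 'java', args: ['-jar', SELENIUM_JAR, '-role', 'hub'] },
  { name: 'selenium-node', command: 'java',
    args: ['-jar', SELENIUM_JAR, '-role', 'node', '-hub', 'http://localhost:4444/grid/register'] }
];

// Start one service and restart it whenever it exits or fails to launch.
// options.spawn can replace child_process.spawn (useful for testing).
function supervise(service, options) {
  var opts = options || {};
  var spawnFn = opts.spawn || childProcess.spawn;
  var delay = opts.restartDelayMs === undefined ? 2000 : opts.restartDelayMs;
  var stopped = false;
  var child = null;
  var timer = null;

  function start() {
    var current = spawnFn(service.command, service.args, {
      // Point browsers started by the Selenium node at the virtual screen.
      env: Object.assign({}, process.env, { DISPLAY: DISPLAY }),
      stdio: 'ignore'
    });
    child = current;

    function restart() {
      // Ignore events after stop(), from an old child, or while a restart is pending.
      if (stopped || current !== child || timer) { return; }
      timer = setTimeout(function () {
        timer = null;
        start();
      }, delay);
    }

    current.on('error', restart); // e.g. the binary is not installed
    current.on('exit', restart);  // the process died or was killed
  }

  start();

  return {
    stop: function () {
      stopped = true;
      clearTimeout(timer);
      timer = null;
      if (child) { child.kill(); }
    }
  };
}

// To run everything:
// services.forEach(function (service) { supervise(service); });
```

The restart delay is there so that a missing binary does not turn into a tight respawn loop. The sketch starts all three services at once, so the node may need a moment and a retry before the hub is ready to accept its registration.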
Conclusion
Selenium is a mature browser automation framework, and it is what I chose to scrape tens of thousands of jobs a day. Next up, I'll dive into the details of the script to see what each section is doing.