I have heard the Beautiful Soup library in Python is great, but Cheerio is a great option as well. What language and tool are apt for this task?
Many languages compete for the top spot. Nobody can claim that any one language is the best for web scraping.
Node.js is often preferred for crawling pages that rely on dynamic JavaScript, and it also supports distributed crawling. It is a great choice for basic web scraping projects, but it is not advisable if you need to scrape data at large scale.
I think a lot depends on which language you find comfortable. I am comfortable with both Python's Beautiful Soup and Scrapy. I also like Selenium, which has bindings for Java, Python, JavaScript, .NET, etc. For example, you can use Selenium to scrape email addresses from a page.
If you want to scrape pages that use JavaScript, you have options like PhantomJS and CasperJS. You can perform tasks like fetching a page's source code with PhantomJS and then scrolling through to the specific data.
My suggestion is to pick one language, then find frameworks that exist in both ecosystems. From there you can easily switch languages within the same kind of framework and accomplish specific scraping tasks.
Use JamDrop. JamDrop is a JavaScript library that lets you do web scraping on the client side. It's as simple as scrapeElementById(yourUrl, yourID);. Visit their GitHub page for more: github.com/codejoy1234/JamDrop/blob/master/jamdro… Scroll to the very bottom to find code examples on web scraping.
All you need is PHP + cURL + experience with complex regular expressions.
Jsoup is better, but Python is easier to program in.
Java is better, but it is difficult for new programmers.
In my own comparison on static pages, Java won by 33.4%.
For dynamic pages, Java won by 57.8%.
I have used a lot of web scraping software. My personal favorites are "FMiner" and "Content Grabber". FMiner lets you draw an action diagram that the software then replays on websites, while Content Grabber is unique in that it lets you export scraping agents. By http://prowebscraping.com
I used Python with pyquery (a jQuery-like library built on lxml). But then I turned to Go with goquery (same idea) to improve parallelism and enjoy a modern, type-safe language. Both were adequate, but I think I'll stick with Go for personal taste.
I'd like to know if there is any web scraping package that uses async/await (ES2017) syntax; it would make the code very easy to read.
I've done a lot of web scraping with Mojolicious' Mojo::UserAgent:
http://mojolicious.org/perldoc/Mojo/UserAgent
As the example shows, it's as simple as:
use Mojo::Base -strict; # enables strict, warnings and say()
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
say $ua->get('blogs.perl.org')->res->dom->find('h2 > a')->map('text')->join("\n");
If you need to execute JS, then CasperJS looks good.
It seems the general consensus is in favor of Python, and I agree for a number of reasons. You mentioned Python's Beautiful Soup, which is likely one of the most widely used and actively developed scraping libraries. As some of the other comments have mentioned, it takes a lot of the pain out of the process for you, and with the size of the Python (and Beautiful Soup) community you will likely never run into a problem that doesn't already have a solution. Aside from Beautiful Soup, Scrapy is another widely used crawling/scraping framework in Python.
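As a minimal sketch of what Beautiful Soup makes easy (the HTML fragment here is made up for illustration; in a real project you'd fetch it with something like requests first):

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded page.
html = """
<html><body>
  <h2><a href="/post/1">First post</a></h2>
  <h2><a href="/post/2">Second post</a></h2>
  <p class="byline">by Alice</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors pull out exactly the elements you want.
titles = [a.get_text() for a in soup.select("h2 > a")]
links = [a["href"] for a in soup.select("h2 > a")]

print(titles)  # ['First post', 'Second post']
print(links)   # ['/post/1', '/post/2']
```

The same selector-based approach carries over to Scrapy, which adds crawling, scheduling, and export pipelines on top.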
Python also plays well with MongoDB, in that there are a lot of packages for using them together; there are even some for Beautiful Soup that bring MongoDB along. And MongoDB is probably the database you'll work with if you're planning on web scraping; at least, that's the DB I've seen clients use the most when doing scraping work.
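A sketch of why the pairing is natural: each scraped record is just a Python dict, which is exactly the shape MongoDB stores. The HTML and field names below are made up for illustration, and the pymongo calls are shown in comments since they need a running server:

```python
from bs4 import BeautifulSoup

# Made-up listing page standing in for a downloaded product index.
html = """
<ul>
  <li data-id="101"><a href="/item/101">Widget</a> <span class="price">9.99</span></li>
  <li data-id="102"><a href="/item/102">Gadget</a> <span class="price">24.50</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Each <li> becomes one document; dicts go into MongoDB as-is.
docs = [
    {
        "_id": int(li["data-id"]),
        "name": li.a.get_text(),
        "url": li.a["href"],
        "price": float(li.find("span", class_="price").get_text()),
    }
    for li in soup.find_all("li")
]

print(docs[0])  # {'_id': 101, 'name': 'Widget', 'url': '/item/101', 'price': 9.99}

# With a MongoDB server running, storing the batch is one call:
# from pymongo import MongoClient
# MongoClient()["scraping"]["items"].insert_many(docs)
```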
Ruby isn't a bad choice either, with Nokogiri and Mechanize, both Ruby gems commonly seen in web scraping applications.
Both Ruby and Python are simple enough to jump into; you wouldn't go wrong with either. Still, I'd suggest Python in the end.
If you feel like it, share your web scraping project with us, I'd be interested in seeing it!
As most people here have said, Python is good for scraping static websites. For those rendered by JavaScript, I would recommend PhantomJS or CasperJS, which do the job very well.
Personally, I used mechanize (which was originally a Perl library before being ported to Python) some years ago, and it was simply awesome. It is not only a "scraper": you can also submit information to websites, because it emulates a whole web browser. However, it does not seem to be under active development anymore, so perhaps you should stick with Beautiful Soup, which works just fine. About mechanize: http://wwwsearch.sourceforge.net/mechanize/ pythonforbeginners.com/cheatsheet/python-mechaniz…
Best.
Although I'm only using their results, and not working directly on the scraping side of the data acquisition, my co-workers rely extensively on Scrapy (and the related http://scrapinghub.com/ services for distributed proxies and everything needed to work around blacklisting issues, though you can run Scrapy on your own infrastructure for low volumes).
Do make sure you have permission to scrape the website in question.
I make every effort to monitor for this activity at my day job, going so far as to straight-up ban the IP address, and sometimes the entire subnet, of the offender. Bandwidth isn't cheap on large websites; nor is having our product stolen and stored somewhere else for all to download for free.
I have used both Beautiful Soup (Python) and Cheerio (Node). Both worked well. I'd recommend going with the language you're most familiar with, Python or Node, and then picking the tool that's available. If you are equally familiar with both, choose Beautiful Soup, because it's slightly richer as far as I remember. If you are familiar with neither, still pick Python, since it's easier to learn than Node.
I think it really depends on what kind of website you are scraping. If the site serves static pages (e.g., PHP + MySQL rendering on the server), you can use any language (like Python) to download the page and parse the data. However, if the site uses JavaScript to load its data dynamically, you have to use a tool that can simulate a JavaScript engine.
I recently used CasperJS (http://casperjs.org/) to parse a big database website, which loads its data through JavaScript. You can find many video tutorials online like this one.
With Python, Beautiful Soup works very well to correct broken HTML, used it many years ago and I'm sure it's improved a lot since.
With Java I'm using Jsoup; same great experience and ease of use. Getting Java to scrape many web pages concurrently with HttpClient is much easier to build, thanks to Java's threading model.
The last time I used Python for heavy-duty scraping, I had to dig deep to find thread-safe classes in order to get it scraping many pages at the same time. I got it working, scraping 7-10 GB of BBC.co.uk content per night (this was 7 years ago, so it was a great scraper considering the limitations of the server it ran on), but it was somewhat messy due to Python not having that many thread-safe classes at the time.
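For what it's worth, Python's standard library has since grown concurrent.futures, which hides most of that thread bookkeeping. A sketch of the pattern, using a made-up in-memory "fetch" so it runs standalone (in practice the worker would do a real HTTP request with urllib or requests):

```python
import concurrent.futures
import re

# Stand-ins for downloaded pages; URLs and contents are invented.
pages = {
    "http://example.com/a": "<title>Page A</title>",
    "http://example.com/b": "<title>Page B</title>",
    "http://example.com/c": "<title>Page C</title>",
}

def scrape(url):
    """Fetch (here: look up) a page and pull out its <title>."""
    html = pages[url]
    match = re.search(r"<title>(.*?)</title>", html)
    return url, (match.group(1) if match else None)

# ThreadPoolExecutor (stdlib since Python 3.2) replaces the
# hand-rolled thread-safe worker pools of the old days.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(scrape, sorted(pages)))

print(results["http://example.com/a"])  # Page A
```

Since scraping is I/O-bound, threads work fine despite the GIL; the workers spend most of their time waiting on the network.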
Avi khisa
Professional Digital Marketer and SEO expert
Very good, knowledgeable content. This really helps with web scraping our website; I really appreciate it. Thanks for sharing your knowledge.