Scraping an Amazon Product page in Python


Dirk Hoekstra
Dec 8, 2020


In this tutorial, we'll set up a simple Python web scraper that scrapes data from an Amazon product page!

[Image: Amazon detail page]

Getting the product page HTML

The first step is to get the HTML of a product page.

Let's use this Amazon product as a test page.

To get the HTML I'm going to use scraperbox.com. This way I can avoid the Amazon robot-check captchas.

import urllib.parse
import urllib.request
import ssl
# Caution: this turns off HTTPS certificate verification for the whole script.
ssl._create_default_https_context = ssl._create_unverified_context

# URL-encode the product page URL so it can be passed as a query parameter.
url = urllib.parse.quote_plus("https://amazon.com/LAMAZE-L27901-Lamaze-Peek-A-Boo…")

# Create the query URL. urllib needs the scheme (https://) to open it.
query = "https://api.scraperbox.com/scrape"
query += "?token=%s" % "YOUR_API_TOKEN"
query += "&url=%s" % url
query += "&javascript_enabled=true"

# Call the API.
request = urllib.request.Request(query)
raw_response = urllib.request.urlopen(request).read()
html = raw_response.decode("utf-8")
print(html)

This program calls the ScraperBox API. That API will spin up a real Chrome browser and return the HTML of an Amazon detail page.

Be sure to replace YOUR_API_TOKEN with your ScraperBox API token!
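
The query string in the program above is concatenated by hand. As a minimal sketch of an alternative, `urllib.parse.urlencode` can build and escape all parameters in one step. The helper name `build_query_url` is hypothetical; the endpoint and parameter names are taken from the program above, and the `https://` scheme is an assumption.

```python
import urllib.parse

# Endpoint as used in this tutorial; the https:// scheme is an assumption.
API_ENDPOINT = "https://api.scraperbox.com/scrape"


def build_query_url(token, target_url):
    """Return the ScraperBox query URL for target_url (hypothetical helper)."""
    params = {
        "token": token,
        "url": target_url,
        # Same flag and value as in the hand-built query string.
        "javascript_enabled": "true",
    }
    # urlencode percent-encodes every value with quote_plus by default.
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


if __name__ == "__main__":
    print(build_query_url("YOUR_API_TOKEN", "https://example.com/some page"))
```

Because every value goes through the same encoder, a token or URL containing `&` or `=` cannot break the query string.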

Running this program prints the product page HTML, nice!

python3 amazon_scraper.py
<!doctype>
<HTML>
// The amazon HTML
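
The API call can also fail, for example with an invalid token or a network problem, and the program as written would then stop with a traceback. Here is a rough sketch of a more defensive fetch, assuming a hypothetical helper named `fetch_html` and an arbitrary 30-second timeout:

```python
import urllib.error
import urllib.request


def fetch_html(query_url, timeout=30):
    """Return the decoded response body, or None if the request fails."""
    request = urllib.request.Request(query_url)
    try:
        # The timeout keeps the script from waiting forever on a stalled connection.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status code.
        print("Request failed with HTTP status %s" % error.code)
    except urllib.error.URLError as error:
        # The request never got a usable answer (DNS failure, refused connection, ...).
        print("Could not reach the server: %s" % error.reason)
    return None
```

`HTTPError` is a subclass of `URLError`, so it has to be caught first. The caller can then check for `None` before handing the HTML to the parser.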

Extracting data from the HTML

The next step is to extract data from the Amazon page.

To do this I'm going to use the Beautiful Soup Python package.

At the top of the Python file, I import the package.

from bs4 import BeautifulSoup

Then, I create the soup object.

soup = BeautifulSoup(html, 'html.parser')

Next, we should find the element that we want to extract from the HTML.

I open up the product page in Chrome. Next, I right-click on the page title and select Inspect.

This opens the Chrome dev tools, where we can see that the title element has the id title, which the CSS selector #title matches.

[Image: Amazon detail page with dev tools open]

We can use this id to get the title.

# Other code removed for simplicity.

soup = BeautifulSoup(html, 'html.parser')
title = soup.select_one('#title')
print("Title = %s" % title.text.strip())

And when running the program it shows the correct title 🎉

python3 amazon_scraper.py
Title = Lamaze Peek-A-Boo Forest, Fun Interactive Baby Book with Inspiring Rhymes and Stories
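
One thing to watch out for: `select_one` returns `None` when nothing matches the selector, for example if the response is not the product page, and `title.text` would then raise an `AttributeError`. Below is a minimal sketch of a guard, using a hypothetical helper `extract_text` and small inline HTML snippets in place of the real Amazon response:

```python
from bs4 import BeautifulSoup


def extract_text(soup, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = soup.select_one(selector)
    if element is None:
        return None
    return element.get_text(strip=True)


# Stand-in documents; the real HTML comes from the API call earlier in the tutorial.
product_html = "<html><body><h1 id='title'>  Example Product  </h1></body></html>"
other_html = "<html><body><p>No title here</p></body></html>"

for sample in (product_html, other_html):
    sample_soup = BeautifulSoup(sample, "html.parser")
    found = extract_text(sample_soup, "#title")
    if found is None:
        print("Title element not found")
    else:
        print("Title = %s" % found)
```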

Conclusion

We've set up a basic scraper application that gets the title of an Amazon product.

You could easily expand this scraper to get way more details about the product!
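
As a rough sketch of such an expansion, a dictionary can map field names to CSS selectors and be looped over. Only `#title` comes from this tutorial; the `#price` and `#rating` selectors, the helper name `extract_fields`, and the sample HTML are made up for illustration, so the real ids would need to be looked up in the dev tools.

```python
from bs4 import BeautifulSoup

# Only "#title" is taken from the tutorial; the other selectors are hypothetical.
FIELD_SELECTORS = {
    "title": "#title",
    "price": "#price",
    "rating": "#rating",
}


def extract_fields(html, selectors):
    """Return a dict of field name -> stripped text, or None where nothing matched."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for name, selector in selectors.items():
        element = soup.select_one(selector)
        fields[name] = element.get_text(strip=True) if element is not None else None
    return fields


# Made-up snippet standing in for a product page.
sample_html = "<div><h1 id='title'> Example Product </h1><span id='price'>$9.99</span></div>"
print(extract_fields(sample_html, FIELD_SELECTORS))
```

Adding another field is then a one-line change to the dictionary rather than another block of parsing code.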

Happy coding!