Web-scraping
files needed = (none)
So far, we have only considered cases where we have a data set available to investigate. While there are certainly a lot of data sets out there to explore, many interesting questions require us to develop new data sets from scratch. The internet presents some interesting possibilities: It is full of data. Web-scraping is the means by which we collect that data into a single database (or, more specifically, a DataFrame) for analysis. It can also be a useful way to reverse-engineer a website's underlying database through its web interface.
Web-scraping lives in a legal gray area. Most sites would not like you to scrape their data: This data is often a valuable asset of the firm. Irresponsible web-scraping can also disrupt the website and cause other problems for the firm. We will discuss below how to know if a firm allows scraping and we will practice on a website that allows scraping.
There are many potential methods to conduct web scraping. A few:
- Manually copy and paste data from a webpage into a file on your computer.
- Use software. For example, there's a Chrome extension to automate web-scraping. While this approach is easy and non-technical, it's also clunky and very limited.
- Write the code yourself. This provides more flexibility. There are python libraries that facilitate this.
Guess which route we are taking?
Check Website Permissions
Robots! The internet we interact with is only the tip of a huge iceberg: underneath lies a mass of information we generally cannot access. Search engine crawlers such as Google's only have access to this surface web. Websites disallow crawling (or crawling certain parts of the site) by stating it in their robots.txt file, which is read by the crawlers (or spiders) that would like to visit the site. If a site disallows crawling, search engines won't go there, and the site won't be indexed. If you decide to extract data from the web, respecting the wishes of robots.txt is necessary to avoid legal ramifications and to be a good internet citizen.
The robots.txt file is located at http://www.website.com/robots.txt and it is the very first location on a website that a search engine will visit. The file may:
- Allow full access
User-agent: *
Disallow:
If you find this in the robots.txt file of a website you are trying to crawl, you are in luck. This means all pages on the site are crawlable by bots, including the one you'll be using.
- Block all access
User-agent: *
Disallow: /
You should steer clear of a site with this in its robots.txt. It states that no part of the site should be visited by an automated crawler, and violating this could mean legal trouble.
- Partial access
User-agent: *
Disallow: /folder/

User-agent: *
Disallow: /file.html
Some sites disallow crawling particular sections or files on their site. In such cases, you should direct your bots to leave the blocked areas untouched.
- Crawl rate limiting
Crawl-delay: 11
This is used to limit crawlers from hitting the site too frequently, as frequent hits by crawlers could place unwanted stress on the server and make the site slow for human visitors. In this case, the site can be crawled with a delay of 11 seconds.
- Visit time
Visit-time: 0400-0845
This tells the crawlers about hours when crawling is allowed. In this example, the site can be crawled between 04:00 and 08:45 UTC. Sites do this to avoid load from bots during their peak hours.
- Request rate
Request-rate: 1/10
Some websites do not entertain bots trying to fetch multiple pages simultaneously. The request rate is used to limit this behavior. 1/10 means the site allows crawlers to request one page every 10 seconds.
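If you would rather check these rules programmatically than read them by eye, Python's built-in urllib.robotparser module can do it. Here is a minimal, self-contained sketch: it feeds a made-up set of rules straight to the parser so it runs without a network call; in real use you would point the parser at the live file with set_url() and read() instead of parse().
import urllib.robotparser
# A made-up robots.txt, parsed directly so the example runs offline.
rules = """
User-agent: *
Disallow: /folder/
Crawl-delay: 11
""".splitlines()
rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)
# Ask whether a generic bot ('*') may fetch particular pages.
print(rp.can_fetch('*', 'http://www.website.com/file.html'))         # True: not disallowed
print(rp.can_fetch('*', 'http://www.website.com/folder/page.html'))  # False: /folder/ is blocked
print(rp.crawl_delay('*'))                                            # 11 seconds between requests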
For this lecture, we'll be good citizens and work with a website that allows crawling and is designed for practicing data collection.
An Example: Delta.com
Enter www.delta.com/robots.txt in your browser and you will see that Delta allows robots on some pages but not others. For example, it does not allow bots to visit pages generated by flight searches, which is to say it won't allow bots to record flight availability or pricing information.
Disallow: /flight-search/book-a-flight?cacheKeySuffix=
Disallow: /flight-search/search?&tripType=
Disallow: /flight-search/search?action=
HTML Basics
Scraping is part coding and part detective work. We need to look at the code that underlies a webpage to understand how it is structured. Then we can figure out how to scrape it.
Each webpage that you view in your browser is actually structured in HyperText Markup Language (HTML). HTML code has two parts:
- the head which includes the title and any imports for styling and JavaScript
- the body which includes the content that gets displayed as a webpage.
We'll be interested in the body of the webpage. HTML code is composed of tags, where a tag is written between an opening < and a closing > angle bracket with the name of the tag in between; e.g., <div></div>, <p>Some text</p>, etc.
The useful tags for us will be:
- <div>: This tag groups together elements into a single entity. It can act as the parent for a lot of different elements, so style changes applied here will also reflect in child elements.
- <a>: URL links are described in this tag. The webpage that will get loaded when the link is clicked on is given in its href property.
- <p>: Used when information is displayed on the webpage as a block of text (\(\approx\) "paragraph").
- <span>: This tag is used when information is to be displayed inline. Moreover, when two such tags are placed side by side, they'll appear on the same line, unlike the paragraph tag.
- <table>: Tables are displayed in HTML with the help of this tag; i.e., data are displayed in cells formed by the intersection of rows and columns.
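To see how a few of these tags fit together, here is a small sketch that parses a made-up HTML fragment with the BeautifulSoup package (introduced below) and pulls pieces out of it. The fragment and its class names are invented purely for illustration.
from bs4 import BeautifulSoup
# A made-up HTML fragment using some of the tags described above.
snippet = """
<div class="book">
  <h3><a href="example.html" title="An Example Title">An Example Title</a></h3>
  <p class="price">$9.99</p>
  <span class="stock">In stock</span>
</div>
"""
soup = BeautifulSoup(snippet, 'html.parser')
print(soup.div.a['title'])   # an attribute of the <a> tag: 'An Example Title'
print(soup.div.p.text)       # the text inside the <p> tag: '$9.99'
print(soup.div.span.text)    # the text inside the <span> tag: 'In stock'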
Detective work
Every website has a different structure and there are a few important things to think about when building a web scraper:
- What is the structure of the data contained on the page?
- How do we get to those web pages?
- Will you need to gather more data from the next page?
- Is the structure repeatable?
Scraping
We will scrape a webpage built specifically for practicing scraping: toscrape.com. We will scrape the fake bookstore page.
Go to the site. Right click on part of the page and choose "inspect" if you are using the Chrome browser. This will open a new window pane that displays the HTML underneath the page.
Inspect different parts of the site and try to find the parts that list a single book's information. We will need to find the tags that contain the information that we want.
This can take some time...
Practice: Finding the data in the page
Our goal is to scrape the prices, titles, and ratings.
- Use the inspector to find the article class that contains the book: "A Light in the Attic." What is the class name?
- Within the article class, find the book's title. What HTML element contains the title?
- Within the article class, find the book's price. What HTML element contains the price?
- Within the article class, find the book's rating. What HTML element contains the rating?
Packages
We need some new packages:
- requests: We use this library to open the url from which we would like to extract the data. It is not part of the python standard library, but it comes with the Anaconda distribution, so you won't have to install it.
- BeautifulSoup: This library helps us get the HTML structure of the page that we want to work with. We can then use its functions to access specific elements and extract relevant information. Anaconda includes this package.
If you need to install it using anaconda:
conda install beautifulsoup4
import pandas as pd
import requests
from bs4 import BeautifulSoup
Retrieve the webpage
This code takes a url as input, opens the url using the .get() method, and parses (i.e., breaks up) the corresponding HTML file into a usable data type.
We saw this package when we retrieved data from the Census API, too.
url = 'http://books.toscrape.com/catalogue/category/books_1/index.html'
# Retrieve the content at the url
results = requests.get(url)
# Make the content we grabbed easy to read by using BeautifulSoup.
content = BeautifulSoup(results.text, 'html.parser')
type(content)
So now we have a BeautifulSoup object that holds the parsed HTML. We can use the .prettify() method to take a look.
print(content.prettify())
Get all the books on the page
Each book's information lives in an article tag of the form
<article class="product_pod">
We use the .find_all() method and tell it to find all the article tags with class of product_pod. We pass find_all() the type of element to look for and a dictionary of attribute values.
books = content.find_all('article', {'class' : 'product_pod'})
type(books)
The ResultSet object is iterable. We will be able to loop over all the books in this object and extract the data we need.
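Since the ResultSet behaves like a list, a quick sanity check is to count the matches. As a rough guide, each catalogue page on this site appears to list 20 books, but verify the count against the page you scraped.
# The ResultSet supports len() and indexing, just like a list.
print(len(books))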
As usual, we start by experimenting with a single book to understand how to extract the data we need. Once we have figured out how to do it once, it will be easy to construct a loop to get all the books' data.
Let's experiment with the first book.
# The first article with class="product_pod".
books[0]
1. Get the title of the book
After studying the HTML code above, I see that the book title is contained in the link (an a tag) in the third-level header (the h3 tag).
We use the . notation to retrieve the first instance of the h3 tag (there is only one instance). Think of the tags nested within the books[0] tag as attributes of books[0]. As usual in python, we use the . to access attributes.
We use the square brackets to reference a key within a tag, like we would do with a dict.
So the code
books[0].h3.a['title']
asks for the first h3 tag in books[0] and the first a tag within the h3 tag. From the a tag, we ask for the value associated with the key 'title'.
books[0].h3.a
books[0].h3.a['title']
2. Get the price of the book
There is more than one div tag here, so we use .find() to find the one with class 'product_price'. Inside this div, the first p tag has the price (although the p is of class price_color...) so we use the . to grab it.
Within the p tag, we need the text attribute.
# This works, too. It is more robust to a change in the order of the `p` tags.
# print(books[0].find('div', class_='product_price').find('p', class_='price_color').text)
books[0].find('div', {'class' : 'product_price'}).p.text
We need to take care of the non-numeric data. We will take care of that once we have scraped all the data.
3. Get the book's rating
The star rating is encoded in the p tag with a class of 'star-rating X' where X could be Zero, One, Two, etc...
# Find the `p` that contain the string 'star-rating'.
x = books[0].find('p', {'class':'star-rating'})
x['class'][1]
Create the DataFrame
- Loop over the books and extract the information we want. Store the information in lists.
- Convert the set of lists to a DataFrame.
titles, prices, stars = [], [], []
for book in books:
titles.append(book.h3.a['title'])
prices.append(book.find('div', {'class' : 'product_price'}).p.text)
stars.append(book.find('p', {'class' : 'star-rating'})['class'][1])
print(titles[19], prices[19], stars[19], sep = '; ')
books_df = pd.DataFrame({
'title': titles,
'price': prices,
'rating': stars,
})
books_df.head(2)
Data cleanup
I see two issues: the non-numeric characters in 'price' and in 'rating'. Let's fix them.
Let's start by using a regex to clean up the price data.
#books_df['price'] = books_df['price'].str.slice(2,).astype(float)
# If you are familiar with regex, you could try this...
# This code extracts the string. It still needs to be converted to a float.
# books_df['price'].str.extract(r'(\d+\.\d+)')
# Escaping the dot makes it match the literal decimal point.
books_df['price'] = books_df['price'].str.extract(r'(\d+\.\d+)').astype(float)
books_df.dtypes
I am going to brute-force the ratings column. I wonder if there is a python library somewhere that reads alphabetic numbers and converts them to integers... Since there are only six potential values, this isn't too costly.
convert = {'Zero':0, 'One':1, 'Two':2, 'Three':3, 'Four':4, 'Five':5}
books_df['rating'] = books_df['rating'].replace(convert).astype(int)
print(books_df.head(), '\n')
print(books_df.dtypes)
Practice
- http://books.toscrape.com/ has a directory of book genres in the left-hand sidebar. Scrape the topics list and create a pandas Series that contains the topics.
HTML jargon: ul is 'unordered list' and li is 'list item'. See how much easier markdown is to write?
Steps:
A. Get the page content. Any page will do, since they all have the sidebar. I scraped page 1 again.
B. Get the ul with the list of headings.
C. Use find_all() to get all the list items (li).
D. From each list item, extract the genre text. I stored mine in a list.
E. Turn the list into a pandas Series.
F. Clean up the text. Try str.strip() to remove the whitespace.
Finish early? Jump down to the bottom of the notebook for some tougher scrapes.
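If you get stuck, here is one possible sketch of steps A through F. The class name 'side_categories' for the sidebar div is an assumption based on inspecting the page; confirm it (and the nesting of the ul and li tags) with the inspector before relying on it.
import pandas as pd
import requests
from bs4 import BeautifulSoup
# A. Get the page content. Any page works; they all carry the sidebar.
url = 'http://books.toscrape.com/catalogue/category/books_1/index.html'
content = BeautifulSoup(requests.get(url).text, 'html.parser')
# B. Get the sidebar div and the ul inside it. The class name is an assumption -- check it.
sidebar = content.find('div', {'class': 'side_categories'})
genre_list = sidebar.find('ul')
# C./D. Get every list item and pull out its link text.
# Note: the first item may be the top-level 'Books' link; drop it if you only want the genres.
topics = [li.a.text for li in genre_list.find_all('li')]
# E./F. Turn the list into a Series and strip the surrounding whitespace.
topics = pd.Series(topics).str.strip()
print(topics.head())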
Retrieving all the books
We can extend the above code to grab more books by leveraging the pattern in the url. Click on the 'next' button at the bottom of the Books to Scrape webpage. Look at the url. It reads
http://books.toscrape.com/catalogue/category/books_1/page-2.html
Each additional page increments the 'page-x' part of the url. We simply need to loop over all the pages and scrape away.
Be a good citizen!
Since we'll be making several calls to the Books to Scrape server, we want to be respectful and insert a pause between calls. Failing to do this would result in your computer generating a series of quick calls to the host server. If the server is not clever, it will respond to each call and your scraper will monopolize the server, limiting others' ability to access the information. This is the basis for a Denial of Service (DoS) attack. If the server is clever, it will block your IP and your scrape will end.
We'll use the following packages to delay and randomize the url calls:
from time import sleep
from random import randint
Specifically, before each url call in the loop, we'll pause the program for a random period of time; i.e.,
sleep(randint(2,10))
The randomness is to mimic human behavior in the event the host web server is looking for bots.
from time import sleep
from random import randint
def get_page(page, titles, prices, stars):
# Get the contents of the webpage.
print('scraping page', page)
url = 'http://books.toscrape.com/catalogue/category/books_1/page-' + str(page) + '.html'
content = BeautifulSoup(requests.get(url).text, 'html.parser')
books = content.find_all('article', class_='product_pod')
# Extract the data we need.
for book in books:
titles.append(book.h3.a['title'])
prices.append(book.find('div', {'class':'product_price'}).p.text)
stars.append(book.find('p', {'class':'star-rating'})['class'][1])
return titles, prices, stars
titles, prices, stars = [], [], []
#for page in range(1, 51):
for page in range(1, 3):
# Pause for a random time between 2 and 5 seconds. Look less like a bot.
wait_time = randint(2,5)
print('waiting',wait_time,'seconds...', end='')
sleep(wait_time)
titles, prices, stars = get_page(page, titles, prices, stars)
# Create the DataFrame and clean it up.
books_df = pd.DataFrame({'title': titles, 'price': prices, 'rating': stars})
books_df['price'] = books_df['price'].str.slice(2,).astype(float)
convert = {'Zero':0, 'One':1, 'Two':2, 'Three':3, 'Four':4, 'Five':5}
books_df['rating'] = books_df['rating'].replace(convert).astype(int)
books_df.shape
books_df.tail(2)
If you have time
This one is a bit harder because you need to traverse a table.
- Go to http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html. It is the product page for a particular book.
- From this page, scrape the UPC and the number of reviews. Print them out.
HTML jargon: tr is 'table row'.
Steps:
A. Get the page content.
B. Get all the table rows (tr).
C. Check the table rows for the ones that contain "UPC" and "Number of reviews." Extract the data and print it out.
url = 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
results = requests.get(url)
# Make the content we grabbed easy to read by using BeautifulSoup.
content = BeautifulSoup(results.text, 'html.parser')
# The data we want are in the table at the bottom of the page.
# Get all the table rows on the page.
rows = content.find_all('tr')
type(rows)
# Loop over the rows. When you find the 'th' (table header) that
# matches the item we want, print out the associated 'td' (table cell).
for r in rows:
if r.th.text == 'UPC':
print('The UPC is: ', r.td.text, '.', sep='')
if r.th.text == 'Number of reviews':
print('The number of reviews is: ', r.td.text, '.', sep='')