download notebook
view notebook w/ solutions
Web APIs for data
files needed = (None! It's all coming from the cloud.)
We have been loading data from files using pd.read_csv()
and pd.read_excel()
. Another way to input data into pandas is by directly downloading data from a web server through an application programming interface or api.
The wikipedia page isn't that insightful, but an api is a way to directly query a webserver and (in our case) ask for data. An api provides several advantages
- You only download the data you need
- You do not need to distribute data files with your code
- You have access to the 'freshest data'
There are downsides, to using apis, too.
- You need to be online to retrieve the data
- The group hosting the data may 'revise' the data, making it difficult to replicate you results
- The api may change, breaking your code.
On the whole, I find apis very convenient and useful. Let's dig in. We will cover
- Using pandas datareader
- Building our own queries and using the requests package
The packages
The package pandas_datareader
collects functions that interact with several popular data sources to access their apis. These include
- ~~Google finance~~
- Morningstar
- St. Louis Fed's Fred (one of my favorites)
- The World Bank
- Eurostat
- Quandl
API keys
Many data providers do not want some anonymous account connecting to the api and downloading data. These providers ask you to create an account and you are given an api key that you pass along with your request. Sometimes keys are free, sometimes they are not.
In this notebook, we will go through a few examples that do not require api keys. At the end of the notebook are a few examples that need api keys. If you sign up for (free) keys, you can try the examples out. We will not go through them in class, because your api key should only be used by you.
A quick detour: Installing packages with pip
We use the Anaconda distribution, which bundles python with many other useful packages. pandas_datareader
, however, is not one that is installed by default. [Remember, we need to install a package before we can import it. We only need to install the package once. We need to import a package every time we want to use it.]
We will install the package using 'pip' the python package manager. Before we install a package, let's see what packages we already have installed.
- Open an Anaconda prompt (open the start menu and type: 'ana')
- Run the command
pip list
Do you see pandas-datareader in the list? If not, it is not installed and not ready to be imported.
import pandas_datareader
### Installing a package
- At the Anaconda prompt: run the command
pip install --user pandas_datareader
and hit enter - In your jupyter notebook, restart the kernel from the Kernel menu
That should do it. It might take a minute, and fill the command window with text, but in the end it should have installed. You will probably see a message about updating pip. We can safely ignore it. We can confirm that the package is installed by running pip list
at the Anaconda prompt again. [The option --user
is important. It tells pip to install the package into our user drive, in which we have permission to save files.]
Now that the package is installed, we can import it into our program like usual.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # for plotting
import datetime as dt # for time and date
# We are importing the data method from the package and calling it web.
import pandas_datareader.data as web
myfigsize=(15,8)
Data through pandas datareader
FRED
The FRED database is hosted by the St. Louis FRB. It houses lots of economic and financial data. It is US-centric but has some international data, too.
To use the FRED api you need to know the variable codes. The easiest way to do it to search on the FRED website.
The pandas_datareader documentation for FRED is here.
# These codes are for real US gdp and the working-age population.
# The first code seems intuitive; the second does not.
codes = ['GDPCA', 'LFWA64TTUSA647N']
# If you do not spec an end date it returns up to the most recent date.
# You can also pass datetime objects as start and end.
fred = web.DataReader(codes, 'fred', start='1960-01-01')
fred
Fred already assigned the date variable to the index. You can check to see that it is a datetime index. Nice.
# Give the variables some reasonable names
fred = fred.rename(columns={'GDPCA':'gdp', 'LFWA64TTUSA647N':'wap'})
# Let's plot real gdp per working age person. The data are in billions, so we multiply out GDP.
fred['gdp_wap'] = fred['gdp']*1000000000/fred['wap']
fred.head(2)
fig, ax = plt.subplots(figsize=myfigsize)
ax.plot(fred.index, fred['gdp_wap'], color='red')
ax.set_ylabel('2012 dollars')
ax.set_title('U.S. real GDP per working-age person')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Adding the last observation notation
When I am working with an api, I sometimes like to keep track of the latest observation, since it can change when I rerun the code in the future. Let's get the last date from our DataFrame.
# First, peel off the last observation's index. Note that this is a DatetimeIndex object.
fred.tail(1).index
# Second, the date attribute of the index gives us an array with the datetime in it.
fred.tail(1).index.date
# Third, there is only one observation, so access it using [0].
fred.tail(1).index.date[0]
# All at once.
fred.tail(1).index.date[0]
Now I have a datetime object. We know what to do with those. Now my figure will automatically update the text with the correct date.
The data is annual, so the year is enough.
fig, ax = plt.subplots(figsize=myfigsize)
ax.plot(fred.index, fred['gdp_wap'], color='red')
ax.set_ylabel('2012 dollars', fontsize=14)
ax.set_title('U.S. real GDP per working-age person', fontsize=16)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.text(dt.datetime(2011,1,1), 35000, 'Latest data: {}'.format(fred.tail(1).index.date[0].strftime('%Y')), fontsize=14)
plt.show()
Stooq for indexes
Stooq provides historical pricing for indexes (docs). Let's grab the data for the Dow Jones Industrial Index. The ticker symbol for the Dow is '^DJI'.
# If I do not pass a start or end, it gives the last 5 years.
dj = web.DataReader('^DJI', 'stooq') # The Dow
sp = web.DataReader('^SPX', 'stooq') # The S&P 500
print(sp.head(2))
print(sp.tail(2))
print(dj.head(2))
print(dj.tail(2))
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(dj.index, dj['Close'], color='blue')
ax.set_ylabel('closing price')
ax.set_title('Dow Jones Industrials')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Practice: APIs
Take a few minutes and try the following. Feel free to chat with those around if you get stuck. The TA and I are here, too.
How has inflation in the United States evolved over the last 60 years? Let's investigate.
- Go the FRED website and find the code for the 'Consumer price index for all urban consumers: All items less food and energy'
-
Use the api to get the data from 1960 to the most recent. What frequency is the data?
-
Create a variable in your DataFrame that holds the growth rate of the CPI—the inflation rate. Compute it in percentage terms.
Remember .pct_change()
?
-
Plot it. What patterns do you see?
-
Challenging. We computed the month-to-month inflation rate above. This is not the inflation rate we usually report. Can you compute and plot the year-over-year inflation rate? For example, the inflation rate for 1962-05-01 would be the cpi in 1962-05-01 divided by the cpi in 1961-05-01.
[Hint: Check the documentation for pct_change()
.]
- Annotate the decrease in inflaton around 1983 as 'Volker disinflation'
Data without datareader
Some data providers have apis that do not have packages written for them. These apis take a bit more work, including figuring out how to provide the search terms.
Let's look at the Census Bureau's trade data api. The Census (and many of the national statistical agencies) has an api that takes a bit of work to set up.
The documentation: https://www.census.gov/foreign-trade/reference/guides/Guide%20to%20International%20Trade%20Datasets.pdf
We are going to retrieve data on the value of U.S. imports of goods, classified by the Harmonized Sytem, from Canada.
We start with an endpoint. Think of it as the address of the server that we want to ask for data (query). Then we build up a string with the requests of the data we need.
# This is the harmonized system endpoint. I learned this from reading the documentation.
base_url = 'https://api.census.gov/data/timeseries/intltrade/imports/hs?'
For this particular api (they are all different...read the docs) we need to specify which variables we want, at what level of disagregation we want the data reported, which countries we want imports from, and from what time period.
variables = 'GEN_VAL_MO,I_COMMODITY' # general merchandise value, classification number
level = 'HS10' # 10-digit harmonized system classification
country = '1220' # Canada
dates = '2020-01' # 2021-01
Now we build the url that asks for the data. This is a string.
- We use
=
to specify the parameters - We use
&
to add together the different conditional statements
query = base_url + 'get=' + variables + '&COMM_LVL=' + level + '&CTY_CODE=' + country + '&time=' + dates
query
Copy url (without the quotes) and put it into a web browser. What do you get?
We use the requests package to get the url contents through python (docs).
import requests
response = requests.get(query)
type(response)
The Response
object has an attribute .status_code
which we can check. Status codes are list here.
response.status_code
Success.
Okay, now we have the data...how do we get it into a DataFrame? Response.content
looks promising...
response.content
That gave use the data formatted for text—what we saw in the browser. Let's try again.
JSON stands for JavaScript Object Notation and is a format that is commonly used to store data transmitted on the internet. The .json()
method of the response object tries to convert the content of the webpage into a standard format.
data = response.json()
data
This looks like a nested list. The first sub-list is the column headers. Each following sub-list is a row of data.
data[0]
data[1]
imports = pd.DataFrame(data[1:], columns=data[0])
imports.head()
That took more work than pandas datareader, but wasn't terrible. Now that we know how to use this api, we can easily download exactly what we need in a programatic way.
Practice: APIs on hard mode
Let's try some more downloads from Census.
- Try this string in a browser:
'https://api.census.gov/data/timeseries/intltrade/imports/hs?get=GEN_VAL_MO,CTY_NAME&CTY_CODE=2010&time=from+2015-01'
What kind of data is this? What is the unit of observation?
- Use
requests.get()
to retrieve the contents of the webpage at the url. -
Check the status code, was this successful?
-
Use the
.json()
method of the response to turn the data into nested lists. - Turn the nested lists into a DataFrame.
-
Check the dtypes. Make the dates datetime and the GEN_VAL_MO float.
-
Plot imports from Mexico with time on the x-axis. Does anything stand out?
-
Retrieve the content at this url using
requests.get()
. What is the status code?
'https://api.census.gov/data/timeseries/intltrade/imports/hs?get=GEN_VAL_MO,CTY_NAME&CTY_CODE=5070'
- What does the status code mean? Paste the link into the browser.
- Then, try the
.content
attribute of your response object.
The examples below requires an api key
Where you see the code your_key
is where you should enter your own api key. You need to go to their websites and sign up. The examples should work then.
Stock prices with iex
According to the docs:
The Investors Exchange (IEX) provides a wide range of data through an API. Historical stock prices are available for up to 5 years.
IEX is one of those companies that requires an api key. These keys can be pricey because their target market are business and app developers. They offer a free key, though for experimentation. Very cool of them.
You can sign up here: https://iexcloud.io/cloud-login#/register/
# Sometimes you pass your key in the DataReader method. In this case, you save your key as an environmental variable.
# This means you do not need to keep specifying your key each time you use IEX.
import os
os.environ["IEX_API_KEY"] = 'your_key'
# Harley Davidson prices. The ticker is 'HOG'
start = dt.datetime(2016, 9, 1)
end = dt.datetime(2022, 3, 21)
harley = web.DataReader('HOG', 'iex', start, end)
harley.tail()
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(harley.index, harley['close'], color='blue')
ax.set_ylabel('closing price')
ax.set_title('Harley Davidson stock prices')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
# Ahhh! Not a good looking figure.
# We need to set the index to a datetime object so mpl can get the axis right...
# FRED did this for us.
harley.index = pd.to_datetime(harley.index)
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(harley.index, harley['close'], color='blue')
ax.set_ylabel('closing price')
ax.set_title('Harley Davidson stock prices')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Quandl (now named Nasdaq Data Link)
Quandl is another data aggregation site. Like FRED or IEX, it collects data from several sources and makes them available.
The premier source for financial, economic, and alternative datasets, serving investment professionals. Quandl’s platform is used by over 400,000 people, including analysts from the world’s top hedge funds, asset managers and investment banks.
Quandl has free and pay-to-use data. You can sign up for a free api key. Quandl has also developed their own python package.
The U.S. Department of Energy data is available through this facility (and is free). The quandl code for natural gas futures is EIA/NG_RNGC3_M
. Let's take a look.
# You will need to provide your quandl key.
my_quandl_key = 'your_key'
gas = web.DataReader('EIA/NG_RNGC3_M', 'quandl', api_key=my_quandl_key)
print(gas.head(2))
print(gas.tail(2))
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(gas.index, gas['Value'], color='blue')
ax.set_ylabel('dollars per million BTU')
ax.set_title('3-month natural gas future prices (monthly)')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()