Plotting 1
files needed = ('gdp_components.csv')
We have a handle on Python now: we understand the data structures and enough about working with them to move on to stuff more directly relevant to data analysis. We know how to get data into pandas from files, how to manipulate DataFrames, and how to do basic statistics.
Let's get started on making figures, arguably the best way to convey information about our data.
The packages
import pandas as pd
# load the pyplot set of tools from the package matplotlib.
import matplotlib.pyplot as plt
# This following is a jupyter magic command. It tells jupyter to insert
# the plots into the notebook rather than a new window.
%matplotlib inline
Matplotlib is a popular package that bundles tools for creating visualizations. The documentation is here. We will look at some specific plot types in class, but you can learn about many different types in the thumbnail gallery. [Warning: not all the figures in the thumbnail gallery are good figures. Yikes!]
Copy the 'gdp_components.csv' file into your cwd (or load it using a file path to its location) and load it into pandas.
gdp = pd.read_csv('gdp_components.csv', index_col=0) # load data from file, make date the index
print(gdp.head(2)) # print the first and last few rows to make sure all is well
print('\n', gdp.tail(2))
I don't like these variable names.
gdp.rename(columns = {'GDPA':'gdp', 'GPDIA':'inv', 'GCEA':'gov', 'EXPGSA':'ex', 'IMPGSA':'im' }, inplace=True)
gdp.head()
Let's get plotting. matplotlib graphics are based around two new object types.
- The figure object: think of this as the canvas we will draw figures onto
- The axes object: think of this as the figure itself and all the components
To create a new figure, we call the subplots() method of plt. Notice the use of multiple assignment.
# passing no arguments gets us one fig object and one axes object
fig, ax = plt.subplots()
print(type(fig))
print(type(ax))
We apply methods to the axes to actually plot the data. Here is a line plot. [Try typing ax. and hitting TAB...]
fig, ax = plt.subplots()
# A line plot of gdp vs. time.
ax.plot(gdp.index, gdp['gdp'])
#plt.show()
First, note that the plot is a Line2D object. This is absolutely not important for us, but when you see jupyter print out <matplotlib.lines.Line2D at ...> that is what it is telling us. Everything in Python is an object. To suppress this part of the output, use plt.show() as the last line of your code cell.
Second, a line plot needs two columns of data, one for the x-coordinate and one for the y-coordinate. I am using gdp for the y-coordinate and the years for the x-coordinate. I set years as the index variable, so to retrieve it I used the .index attribute.
Third, this plot needs some work. I do not like this line color. More importantly, I am missing labels and a title. These are extremely important.
fig, ax = plt.subplots()
ax.plot(gdp.index, gdp['gdp'], # line plot of gdp vs. time
color='red' # set the line color to red
)
ax.set_ylabel('billions of dollars') # add the y-axis label
ax.set_xlabel('year') # add the x-axis label
ax.set_title('U.S. Gross Domestic Product')
plt.show()
This is looking pretty good. While I am a fanatic when it comes to labeling things, I probably wouldn't label the x-axis. You have to have some faith in the reader.
I also do not like 'boxing' my plots. There is a philosophy about visualizations that says: Every mark on your figure should convey information. If it does not, then it is clutter and should be removed. I am not sure who developed this philosophy (Marie Kondo?) but I think it is a useful benchmark.
fig, ax = plt.subplots()
ax.plot(gdp.index, gdp['gdp'], # line plot of gdp vs. time
color='red' # set the line color to red
)
ax.set_ylabel('billions of dollars') # add the y-axis label
# ax.set_xlabel('year') # add the x-axis label
ax.set_title('U.S. Gross Domestic Product')
ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False) # get rid of the line on top
plt.show()
Practice: Plots
Take a few minutes and try the following. Feel free to chat with those around if you get stuck. I am here, too.
- Copy the code from the last plot and add a second line that plots 'gov'. To do this, just add a new line of code to the existing code.
ax.plot(gdp.index, gdp['gov'])
- Modify your code to give the figure a better title
- Modify your code to make government consumption blue
- Modify your code to add the argument alpha=0.5 to the plot method for gov. What does it change? If you want to learn more try 'alpha composite' in Google.
- Modify your code to make the 'gov' line dashed. Try the argument linestyle='--'. What is linestyle '-.' or ':' ?
A few more options to get us started
We have two lines on our figure. Which one is which? Not labeling our lines is malpractice. Two approaches:
- Add a legend. We use the 'label' option to give each line a label for the legend.
- Add text directly to the figure
Both are good options. I prefer the second for simple plots.
# The first option. Add labels to your plot commands, then call ax.legend.
fig, ax = plt.subplots()
ax.plot(gdp.index, gdp['gdp'], # line plot of gdp vs. time
color='red', # set the line color to red
label = 'GDP'
)
ax.plot(gdp.index, gdp['gov'], # line plot of government spending vs. time
color='blue', # set the line color to blue
alpha = 0.5,
linestyle = ':',
label = 'Gov. Spending'
)
ax.set_ylabel('billions of dollars') # add the y-axis label
ax.set_title('U.S. Gross Domestic Product and Government Spending')
ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False) # get rid of the line on top
ax.legend(frameon=False) # Show the legend. frameon=False kills the box around the legend
plt.show()
Ah, I feel much better now that I know which line is which. Whatever we put in the label argument in ax.plot() ends up in the legend. There are lots of ways to customize the legend.
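Here is a minimal sketch of a few legend options: loc picks the location, fontsize sets the text size, and title adds a heading. The DataFrame demo holds made-up numbers standing in for our gdp data, and the particular option values are just for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up numbers standing in for the gdp DataFrame (an assumption for this sketch)
demo = pd.DataFrame({'gdp': [100, 110, 125, 140], 'gov': [20, 22, 27, 30]},
                    index=[2000, 2001, 2002, 2003])

fig, ax = plt.subplots()
ax.plot(demo.index, demo['gdp'], color='red', label='GDP')
ax.plot(demo.index, demo['gov'], color='blue', linestyle=':', label='Gov. Spending')

# loc sets the position, fontsize the text size, title adds a legend heading
leg = ax.legend(loc='upper left', frameon=False, fontsize=9, title='Series')

plt.show()
```

The legend entries come straight from the label arguments, in the order the lines were plotted.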
Here is the second approach. You have to experiment with the x-y coordinates to get the text placement correct.
# The second option. Add text using the .text() method. Note that I can leave the labels in the plot commands.
fig, ax = plt.subplots()
ax.plot(gdp.index, gdp['gdp'],color='red', label = 'GDP')
ax.plot(gdp.index, gdp['gov'], color='blue', alpha = 0.5, linestyle = ':', label = 'Gov. Spending')
ax.set_ylabel('billions of dollars')
ax.set_title('U.S. Gross Domestic Product and Government Spending')
ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False) # get rid of the line on top
# Put text on the figure.
ax.text(1989, 8500, 'GDP') # text(x, y, string)
ax.text(1999, 4500, 'Gov. Spending') # text(x, y, string)
plt.show()
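If you get tired of guessing x-y coordinates, one option is to compute them from the data. This sketch puts each label at the last observation of its series. The DataFrame demo is made-up data standing in for gdp, and the leading space in each string is just a cheap way to nudge the text off the line.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up numbers standing in for the gdp DataFrame (an assumption for this sketch)
demo = pd.DataFrame({'gdp': [100, 110, 125, 140], 'gov': [20, 22, 27, 30]},
                    index=[2000, 2001, 2002, 2003])

fig, ax = plt.subplots()
ax.plot(demo.index, demo['gdp'], color='red')
ax.plot(demo.index, demo['gov'], color='blue', linestyle=':')

# Anchor each label at the last observation of its series
last_year = demo.index[-1]
ax.text(last_year, demo['gdp'].iloc[-1], ' GDP', va='center')
ax.text(last_year, demo['gov'].iloc[-1], ' Gov. Spending', va='center')

ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

plt.show()
```

The labels now move with the data, so the figure still works if the sample changes.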
Getting plots out of your notebook
While I love jupyter notebooks, my research output is usually an article distributed as a pdf or presentation slides. So we need to get our figures out of the notebook and into files.
fig, ax = plt.subplots()
ax.plot(gdp.index, gdp['gdp'],color='red', label = 'GDP')
ax.plot(gdp.index, gdp['gov'], color='blue', alpha = 0.5, linestyle = ':', label = 'Gov. Spending')
ax.set_ylabel('billions of dollars')
ax.set_title('U.S. Gross Domestic Product and Government Spending')
# Remove the top and right lines
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# Put text on the figure.
ax.text(1989, 8500, 'GDP')
ax.text(1999, 4500, 'Gov. Spending')
# Create a pdf and save to cwd
plt.savefig('gdp.pdf', bbox_inches='tight')
# Create a png and save to the folder that contains the cwd
plt.savefig('../gdp.png')
plt.show()
Check your directories. Can you find the two figures?
When saving a pdf, I use the bbox_inches='tight' argument to kill extra whitespace around the figure. You can also set things like orientation, dpi, and metadata. Check the documentation if you need to tweak your output.
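As a rough sketch of those extra options: dpi controls the resolution of raster formats like png, and metadata takes a dictionary of document properties for the pdf. The file names, the scratch folder, and the data are all made up for this example. I call savefig on the figure object here; plt.savefig works the same way on the current figure.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([2000, 2001, 2002], [1.0, 1.5, 1.2], color='red')   # made-up data
ax.set_title('Demo figure')

# Save into a scratch folder so this sketch does not clutter the cwd
out_dir = tempfile.mkdtemp()
pdf_path = os.path.join(out_dir, 'demo.pdf')
png_path = os.path.join(out_dir, 'demo.png')

# A pdf with a title stored in the file's metadata
fig.savefig(pdf_path, bbox_inches='tight', metadata={'Title': 'Demo figure'})

# A png at 300 dots per inch
fig.savefig(png_path, dpi=300, bbox_inches='tight')

print(os.listdir(out_dir))
```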
More plot types: Histograms
The line plot is the tip of the iceberg. Matplotlib supports many plot types. Let's take a look at histograms.
A histogram shows us the empirical distribution of a variable. We use it to visualize:
- The range of a variable. What are the approximate min and max?
- The mode of the variable. Is the distribution single-peaked?
- The skew of a variable. Does the variable trend towards one end of the range?
- The mean — if it is obvious.
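Before we plot anything, here is a small sketch of what a histogram computes: it splits the range of the data into equal-width bins and counts the observations in each one. I use numpy's histogram() function on made-up numbers; the choice of 4 bins is arbitrary.

```python
import numpy as np

# Made-up observations (an assumption for this sketch)
x = np.array([-2.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 8.0])

# counts[i] is the number of observations between edges[i] and edges[i+1]
counts, edges = np.histogram(x, bins=4)

print(edges)    # the bin boundaries, from the min to the max of the data
print(counts)   # how many observations land in each bin
```

The bars we draw with ax.hist() are just these counts plotted over these bins.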
Let's work through an example. How variable is US gdp growth?
# First, we compute the yearly growth rates of GDP.
# pct_change() returns the growth rate as a fraction (0.05), NOT as a percent (5).
# We need to multiply by 100. Not a self-documenting name.
gdp['gdp_growth'] = gdp['gdp'].pct_change()*100
gdp.head()
We could have used the diff() or the shift() methods to do something similar, but wow, pct_change is so luxe.
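To see that the three approaches agree, here is a quick sketch on a made-up series (the numbers are chosen so the growth rates are easy to check by hand).

```python
import pandas as pd

# Made-up series (an assumption for this sketch)
s = pd.Series([100.0, 110.0, 121.0, 108.9], index=[2000, 2001, 2002, 2003])

g_pct = s.pct_change() * 100               # the luxe way
g_shift = (s / s.shift(1) - 1) * 100       # this year over last year, minus one
g_diff = s.diff() / s.shift(1) * 100       # change divided by last year's level

print(pd.DataFrame({'pct_change': g_pct, 'shift': g_shift, 'diff': g_diff}))
```

All three columns start with NaN, since the first year has no previous year to compare against.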
A quick plot to take a look.
fig, ax = plt.subplots()
ax.plot(gdp.index, gdp['gdp_growth'], color='red', label = 'GDP Growth')
ax.set_ylabel('percent growth') # add the y-axis label
ax.set_title('U.S. Gross Domestic Product Growth Rates')
ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False) # get rid of the line on top
# Add a horizontal line at y=0
ax.axhline(y=0, color='black', linewidth=0.75)
plt.show()
The Great Depression and the WWII buildup really stick out.
Notice that I added a line at zero. My thinking is that this line adds information: The reader can easily see that growth rates are mostly positive and that the Great Depression was really bad.
It is also obvious that the volatility of GDP growth has fallen over time, but let's approach this a bit differently.
fig, ax = plt.subplots()
# Create a histogram of GDP growth rates.
# hist() does not like NaN. (I'm a bit surprised.)
# I use the dropna() method to kill off the missing value
ax.hist(gdp['gdp_growth'].dropna(), bins=20, color='red', alpha=0.75)
ax.set_ylabel('Frequency')
ax.set_xlabel('Annual growth rate (%)')
ax.set_title('Frequency of US GDP growth rates (1929-2017)')
ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False) # get rid of the line on top
plt.show()
Practice: Histograms
Take a few minutes and try the following. Feel free to chat with those around if you get stuck. I am here, too.
- Break the data up into two periods: 1929–1985 and 1985–2017. (I would use .loc).
- Compute the mean and the standard deviation for the gdp growth rate in each sample.
- Create a separate histogram for each sample. Make the early period histogram blue and the late histogram black. Make any changes to them that you deem appropriate.
- Use .text() to add the mean and std to a blank area of the histograms. Just type the values of the mean and std into the text.
- Save the two histograms as pdfs. Give them reasonable names.
Challenging. Can you find a way to store the value of the mean and std to a variable and print the variable out on the histogram? This way our figure will automatically update with new or different data. Redo part 4.
Subplots
We can generate several axes in one figure using the subplots() method.
fig, ax = plt.subplots(1,2) # one row, two columns of axes
print(type(ax))
So ax is now an array that holds the axes for each plot. Each axes works just like before. Now we just have to tell Python which axes to act on.
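With more than one row and more than one column, the array becomes two-dimensional and we index it with [row, column]. A minimal sketch, with a 2-by-2 grid picked just for illustration:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 2, figsize=(6, 4))   # two rows, two columns of axes
print(ax.shape)                                # the array now has two dimensions

ax[0, 1].set_title('top right')                # index with [row, column]

# .flat walks through every axes in the grid, whatever its shape
for a in ax.flat:
    a.spines['top'].set_visible(False)
    a.spines['right'].set_visible(False)

plt.show()
```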
# Set a variable for plot color so I can change it everywhere easily
my_plot_color = 'blue'
# I am using the figsize parameter here. It takes (width, height) in inches.
fig, ax = plt.subplots(1, 2, figsize=(10,4)) # one row, two columns of axes
# The first plot
ax[0].plot(gdp.index, gdp['gdp_growth'], color=my_plot_color, label = 'GDP Growth')
ax[0].axhline(y=0, color='black', linewidth=0.75) # Add a horizontal line at y=0
ax[0].set_xlabel('year')
ax[0].set_title('GDP growth rates')
#ax[0].spines['right'].set_visible(False) # get rid of the line on the right
#ax[0].spines['top'].set_visible(False) # get rid of the line on top
# The second plot
ax[1].hist(gdp['gdp_growth'].dropna(), bins=20, color=my_plot_color, alpha=0.25)
ax[1].set_xlabel('annual growth rate')
ax[1].set_title('Histogram of GDP growth rates')
#ax[1].spines['right'].set_visible(False) # get rid of the line on the right
#ax[1].spines['top'].set_visible(False) # get rid of the line on top
for a in ax:
    a.spines['top'].set_visible(False)
    a.spines['right'].set_visible(False)
plt.savefig('double.pdf')
plt.show()
You can imagine how useful this can be. We can loop over sets of axes and automate making plots if we have several variables.
I changed a couple other things here, too.
- I used the figsize parameter to subplots(). This is a tuple of figure width and height in inches. (Inches! Take that rest of the world!) The height and width are of the printed figure. You will notice that jupyter notebook scaled it down for display. This is useful when you are preparing graphics for a publication and you need to meet an exact figure size.
- I made the line color a variable, so it is easy to change all the line colors at once. For example, I like red figures when I am giving presentations, but black figures when I am creating pdfs that will be printed out on a black and white printer.
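Coming back to the idea of looping over axes to automate plots for several variables, here is a minimal sketch. The DataFrame demo holds made-up numbers standing in for our gdp data, and the list of column names is my choice for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up numbers standing in for the gdp DataFrame (an assumption for this sketch)
demo = pd.DataFrame({'gdp': [100, 110, 125, 140],
                     'inv': [15, 18, 17, 22],
                     'gov': [20, 22, 27, 30]},
                    index=[2000, 2001, 2002, 2003])

cols = ['gdp', 'inv', 'gov']
my_plot_color = 'blue'

# One axes per variable, all in one row
fig, axes = plt.subplots(1, len(cols), figsize=(12, 3))

# zip pairs each axes with the column it should plot
for a, col in zip(axes, cols):
    a.plot(demo.index, demo[col], color=my_plot_color)
    a.set_title(col)
    a.spines['top'].set_visible(False)
    a.spines['right'].set_visible(False)

plt.show()
```

Adding a variable to the figure is now a matter of adding its name to cols.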
Bonus practice problem
- Modify the code above to add a vertical line to the histogram in the second subfigure. The line should be located at the mean. Try searching for 'vertical line matplotlib'.