
Plotting 1

files needed = ('gdp_components.csv')

We have a handle on python now: we understand the data structures and enough about working with them to move on to stuff more directly relevant to data analysis. We know how to get data into Pandas from files, how to manipulate DataFrames and how to do basic statistics.

Let's get started on making figures, arguably the best way to convey information about our data.

The packages

import pandas as pd             
# load the pyplot set of tools from the package matplotlib. 
import matplotlib.pyplot as plt   

# This following is a jupyter magic command. It tells jupyter to insert 
# the plots into the notebook rather than a new window.
%matplotlib inline      

Matplotlib is a popular package that bundles tools for creating visualizations. The documentation is here. We will look at some specific plot types in class, but you can learn about many different types in the thumbnail gallery. [Warning: not all the figures in the thumbnail gallery are good figures. Yikes!]

Copy the 'gdp_components.csv' file into your cwd (or load it using a file path to its location) and load it into pandas.

gdp = pd.read_csv('gdp_components.csv', index_col=0)  # load data from file, make date the index

print(gdp.head(2))                                    # print the first and last few rows to make sure all is well
print('\n', gdp.tail(2))
         GDPA   GPDIA    GCEA  EXPGSA  IMPGSA
DATE                                         
1929  104.556  17.170   9.622   5.939   5.556
1930   92.160  11.428  10.273   4.444   4.121

            GDPA     GPDIA      GCEA    EXPGSA    IMPGSA
DATE                                                   
2016  18707.189  3169.887  3290.979  2217.576  2738.146
2017  19485.394  3367.965  3374.444  2350.175  2928.596

I don't like these variable names.

gdp.rename(columns = {'GDPA':'gdp', 'GPDIA':'inv', 'GCEA':'gov', 'EXPGSA':'ex', 'IMPGSA':'im' }, inplace=True)
gdp.head()
          gdp     inv     gov     ex     im
DATE
1929  104.556  17.170   9.622  5.939  5.556
1930   92.160  11.428  10.273  4.444  4.121
1931   77.391   6.549  10.169  2.906  2.905
1932   59.522   1.819   8.946  1.975  1.932
1933   57.154   2.276   8.875  1.987  1.929

Let's get plotting. matplotlib graphics are based around two new object types.

  1. The figure object: think of this as the canvas we will draw plots onto
  2. The axes object: think of this as the plot itself and all of its components (lines, labels, ticks, ...)

To create a new figure, we call the subplots() function of plt. It returns a figure object and an axes object, so we use multiple assignment to catch both.

# passing no arguments gets us one fig object and one axes object
fig, ax = plt.subplots()    


print(type(fig))

print(type(ax))
<class 'matplotlib.figure.Figure'>
<class 'matplotlib.axes._subplots.AxesSubplot'>

We apply methods to the axes to actually plot the data. Here is a line plot. [Try typing ax. and hitting TAB...]

fig, ax = plt.subplots() 

# A line plot of gdp vs. time.
ax.plot(gdp.index, gdp['gdp'])       

#plt.show()
[<matplotlib.lines.Line2D at 0x11b02bd0190>]


First, note that the plot is a Line2D object. This is absolutely not important for us, but when you see jupyter print out <matplotlib.lines.Line2D at ...> that is what it is telling us. Everything in python is an object. To suppress this part of the output, use plt.show() as the last line of your code cell.

Second, a line plot needs two columns of data, one for the x-coordinate and one for the y-coordinate. I am using gdp for the y-coordinate and the years for the x-coordinate. I set years as the index variable, so to retrieve it I used the .index attribute.

Third, this plot needs some work. I do not like this line color. More importantly, I am missing labels and a title. These are extremely important.
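As an aside on the first two points, here is a minimal sketch using a tiny hand-built DataFrame. The name `small` is just for illustration, and the values are the first three years of the gdp column typed in by hand. It shows that ax.plot() hands back a list of Line2D objects and that the x-coordinates come straight from the index.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A small stand-in DataFrame: the first three years of the gdp column.
small = pd.DataFrame({'gdp': [104.556, 92.160, 77.391]},
                     index=pd.Index([1929, 1930, 1931], name='DATE'))

fig, ax = plt.subplots()

# ax.plot() returns a list with one Line2D object per line drawn.
lines = ax.plot(small.index, small['gdp'])

print(type(lines[0]))          # the Line2D object jupyter reported
print(lines[0].get_xdata())    # x-coordinates: the index (years)
print(lines[0].get_ydata())    # y-coordinates: the gdp column
```

Assigning the return value to a variable (here `lines`) is another way to keep jupyter from echoing it under the cell.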

fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp'],        # line plot of gdp vs. time
        color='red'                   # set the line color to red
       )                  

ax.set_ylabel('billions of dollars')  # add the y-axis label
ax.set_xlabel('year')                 # add the x-axis label
ax.set_title('U.S. Gross Domestic Product')

plt.show()


This is looking pretty good. While I am a fanatic when it comes to labeling things, I probably wouldn't label the x-axis. You have to have some faith in the reader.

I also do not like 'boxing' my plots. There is a philosophy about visualizations that says: Every mark on your figure should convey information. If it does not, then it is clutter and should be removed. I am not sure who developed this philosophy (Marie Kondo?) but I think it is a useful benchmark.

fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp'],        # line plot of gdp vs. time
        color='red'                   # set the line color to red
       )  

ax.set_ylabel('billions of dollars')  # add the y-axis label
# ax.set_xlabel('year')                 # add the x-axis label
ax.set_title('U.S. Gross Domestic Product')

ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top

plt.show()


Practice: Plots

Take a few minutes and try the following. Feel free to chat with those around if you get stuck. I am here, too.

  1. Copy the code from the last plot and add a second line that plots 'gov'. To do this, just add a new line of code to the existing code: ax.plot(gdp.index, gdp['gov'])
fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp'],        # line plot of gdp vs. time
        color='red'                   # set the line color to red
       )  

ax.plot(gdp.index, gdp['gov'],        # line plot of government spending vs. time
        color='blue',                   # set the line color to blue
        alpha = 0.5,
        linestyle = ':'
       )  
ax.set_ylabel('billions of dollars')  # add the y-axis label
# ax.set_xlabel('year')                 # add the x-axis label
ax.set_title('U.S. Gross Domestic Product and Government Spending')

ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top

plt.show()


  2. Modify your code to give the figure a better title.
  3. Modify your code to make government consumption blue.
  4. Modify your code to add the argument alpha=0.5 to the plot method for gov. What does it change? If you want to learn more, try searching for 'alpha compositing'.
  5. Modify your code to make the 'gov' line dashed. Try the argument linestyle='--'. What do linestyle '-.' and ':' do?

A few more options to get us started

We have two lines on our figure. Which one is which? Not labeling our lines is malpractice. There are two approaches:

  1. Add a legend. We use the 'label' option to give each line a label for the legend.
  2. Add text directly to the figure

Both are good options. I prefer the second for simple plots.

# The first option. Add labels to your plot commands, then call ax.legend.

fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp'],        # line plot of gdp vs. time
        color='red',                   # set the line color to red
       label = 'GDP'
       )  

ax.plot(gdp.index, gdp['gov'],        # line plot of government spending vs. time
        color='blue',                   # set the line color to blue
        alpha = 0.5,
        linestyle = ':',
        label = 'Gov. Spending'
       )  
ax.set_ylabel('billions of dollars')  # add the y-axis label

ax.set_title('U.S. Gross Domestic Product and Government Spending')

ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top

ax.legend(frameon=False)                           # Show the legend. frameon=False kills the box around the legend

plt.show()


Ah, I feel much better now that I know which line is which. Whatever we put in the label argument in ax.plot() ends up in the legend. There are lots of ways to customize the legend.
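To give a flavor of that customization, here is a small sketch with made-up data (the series names and numbers are placeholders, not part of our GDP data). It uses a few arguments of ax.legend() that I find handy; the legend documentation lists many more.

```python
import matplotlib.pyplot as plt

# Made-up data, only to have two lines to label.
years = [2015, 2016, 2017]
series_a = [1.0, 2.0, 3.0]
series_b = [0.5, 0.7, 0.9]

fig, ax = plt.subplots()
ax.plot(years, series_a, color='red', label='Series A')
ax.plot(years, series_b, color='blue', linestyle=':', label='Series B')

legend = ax.legend(
    loc='upper left',      # where to put the legend inside the axes
    frameon=False,         # no box around the legend
    fontsize='small',      # shrink the text a bit
    title='Made-up data'   # optional heading above the entries
)
```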

Here is the second approach. You have to experiment with the x-y coordinates to get the text placement correct.

# The second option. Add text using the .text() method. Note that I can leave the labels in the plot commands.

fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp'],color='red', label = 'GDP')  

ax.plot(gdp.index, gdp['gov'], color='blue', alpha = 0.5, linestyle = ':', label = 'Gov. Spending')  

ax.set_ylabel('billions of dollars')  
ax.set_title('U.S. Gross Domestic Product and Government Spending')

ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top


# Put text on the figure. 
ax.text(1989, 8500, 'GDP')            # text(x, y, string)
ax.text(1999, 4500, 'Gov. Spending')  # text(x, y, string)

plt.show()
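The coordinates in the ax.text() calls above were found by trial and error. One alternative, sketched here with made-up data, is to take the coordinates from the data itself so the label follows the line if the data change. The small offset of 0.1 is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt

# Made-up stand-in for a year index and one data column.
years = [2014, 2015, 2016, 2017]
values = [10.0, 12.0, 15.0, 19.0]

fig, ax = plt.subplots()
ax.plot(years, values, color='red')

# Take the coordinates of the last observation...
x_last, y_last = years[-1], values[-1]

# ...and put the label just to the right of it, vertically centered.
label = ax.text(x_last + 0.1, y_last, 'Series', va='center')

ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
```

This works best when the lines end far enough apart that the labels do not overlap.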


Getting plots out of your notebook

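While I love jupyter notebooks, my research output is usually an article distributed as a pdf or a set of presentation slides, so I need to get figures out of the notebook and into files. The savefig() method does this.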

fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp'],color='red', label = 'GDP')  

ax.plot(gdp.index, gdp['gov'], color='blue', alpha = 0.5, linestyle = ':', label = 'Gov. Spending')  

ax.set_ylabel('billions of dollars')  
ax.set_title('U.S. Gross Domestic Product and Government Spending')

# Remove the top and right lines
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)  


# Put text on the figure. 
ax.text(1989, 8500, 'GDP')           
ax.text(1999, 4500, 'Gov. Spending') 

# Create a pdf and save to cwd 
plt.savefig('gdp.pdf', bbox_inches='tight')        

# Create a png and save to the parent folder of the cwd ('..')
plt.savefig('../gdp.png')                          

plt.show()


Check your directories. Can you find the two figures?

When saving a pdf, I use the bbox_inches='tight' argument to kill extra whitespace around the figure. You can also set things like orientation, dpi, and metadata. Check the documentation if you need to tweak your output.
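Here is a rough sketch of two of those extra options, dpi and metadata. The file names and the temporary output folder are assumptions made for the example so that it does not write into your cwd. I call savefig() on the figure object here; it is the figure-level counterpart of plt.savefig().

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])   # made-up data

# Write into a temporary folder so the example does not clutter the cwd.
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'example.png')
pdf_path = os.path.join(out_dir, 'example.pdf')

# dpi matters for raster formats such as png.
fig.savefig(png_path, dpi=300, bbox_inches='tight')

# metadata is written into the pdf file's document properties.
fig.savefig(pdf_path, bbox_inches='tight', metadata={'Title': 'Example figure'})

print(os.listdir(out_dir))
```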

More plot types: Histograms

The line plot is the tip of the iceberg. Matplotlib supports many plot types. Let's take a look at histograms.

A histogram shows us the empirical distribution of a variable. We use it to visualize:

  • The range of a variable. What are the approximate min and max?
  • The mode of the variable. Is the distribution single-peaked?
  • The skew of a variable. Does the variable trend towards one end of the range?
  • The mean — if it is obvious.

Let's work through an example. How variable is US gdp growth?

# First, we compute the yearly growth rates of GDP.
# pct_change() returns the change as a fraction (e.g., 0.05), NOT in percent (5). 
# We need to multiply by 100. Not a self-documenting name.

gdp['gdp_growth'] = gdp['gdp'].pct_change()*100 
gdp.head()
          gdp     inv     gov     ex     im  gdp_growth
DATE
1929  104.556  17.170   9.622  5.939  5.556         NaN
1930   92.160  11.428  10.273  4.444  4.121  -11.855848
1931   77.391   6.549  10.169  2.906  2.905  -16.025391
1932   59.522   1.819   8.946  1.975  1.932  -23.089248
1933   57.154   2.276   8.875  1.987  1.929   -3.978361

We could have used the diff() or the shift() methods to do something similar, but wow, pct_change is so luxe.
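For the curious, here is a sketch of what "something similar" might look like. The Series `s` is a stand-in holding the first three years of gdp, typed in from the table above. All three calculations should agree up to floating-point rounding.

```python
import pandas as pd

# First three years of the gdp column, typed in by hand.
s = pd.Series([104.556, 92.160, 77.391], index=[1929, 1930, 1931])

growth_pct = s.pct_change() * 100            # the luxe way
growth_diff = s.diff() / s.shift(1) * 100    # change divided by last year's level
growth_shift = (s / s.shift(1) - 1) * 100    # ratio to last year's level, minus one

print(pd.DataFrame({'pct_change': growth_pct,
                    'diff': growth_diff,
                    'shift': growth_shift}))
```

In every version the first year is NaN, because there is no earlier year to compare it to.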

A quick plot to take a look.

fig, ax = plt.subplots() 

ax.plot(gdp.index, gdp['gdp_growth'], color='red', label = 'GDP Growth')  

ax.set_ylabel('percent growth')  # add the y-axis label
ax.set_title('U.S. Gross Domestic Product Growth Rates')

ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top

# Add a horizontal line at y=0
ax.axhline(y=0, color='black', linewidth=0.75)  

plt.show()


The Great Depression and the WWII buildup really stick out.

Notice that I added a line at zero. My thinking is that this line adds information: The reader can easily see that growth rates are mostly positive and that the Great Depression was really bad.

It is also obvious that the volatility of gdp has fallen over time, but let's approach this a bit differently.

fig, ax = plt.subplots() 

# Create a histogram of GDP growth rates.
# hist() does not like NaN. (I'm a bit surprised.) 
# I use the dropna() method to kill off the missing value

ax.hist(gdp['gdp_growth'].dropna(), bins=20, color='red', alpha=0.75)       


ax.set_ylabel('Frequency') 
ax.set_xlabel('Annual growth rate (%)')
ax.set_title('Frequency of US GDP growth rates (1929-2017)')

ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top



plt.show()


Practice: Histograms

Take a few minutes and try the following. Feel free to chat with those around if you get stuck. I am here, too.

  1. Break the data up into two periods: 1929–1985 and 1986–2017. (I would use .loc).
  2. Compute the mean and the standard deviation for the gdp growth rate in each sample.
  3. Create a separate histogram for each sample. Make the early period histogram blue and the late histogram black. Make any changes to them that you deem appropriate.
  4. Use .text() to add the mean and std to a blank area of the histograms. Just type the values of the mean and std into the text.
  5. Save the two histograms as pdfs. Give them reasonable names.

Challenging. Can you find a way to store the value of the mean and std to a variable and print the variable out on the histogram? This way our figure will automatically update with new or different data. Redo part 4.

gdp_early = gdp.loc[gdp.index < 1986, 'gdp_growth']
gdp_late = gdp.loc[gdp.index > 1985, 'gdp_growth']

print('The (mean, std) in the early sample are ({0:3.2f}, {1:3.2f}).'.format(gdp_early.mean(), gdp_early.std()))
print('The (mean, std) in the late sample are ({0:3.2f}, {1:3.2f}).'.format(gdp_late.mean(), gdp_late.std()))
The (mean, std) in the early sample are (7.22, 8.39).
The (mean, std) in the late sample are (4.82, 1.89).
fig, ax = plt.subplots() 

ax.hist(gdp_early.dropna(), bins=20, color='blue', alpha=0.75)

ax.set_ylabel('Frequency')  
ax.set_xlabel('Annual growth rate (%)')
ax.set_title('Frequency of US GDP growth rates (1929-1985)')

ax.spines['right'].set_visible(False) 
ax.spines['top'].set_visible(False)   

ax.text(-20, 15, 'mean = 7.22, std = 8.39')            # text(x, y, string)

plt.savefig('gdp_hist_early.pdf', bbox_inches='tight')          # Create a pdf and save to cwd 

plt.show()


fig, ax = plt.subplots() 

ax.hist(gdp_late.dropna(), bins=20, color='black', alpha=0.75) 

ax.set_ylabel('Frequency') 
ax.set_xlabel('Annual growth rate (%)')
ax.set_title('Frequency of US GDP growth rates (1986-2017)')

ax.spines['right'].set_visible(False) 
ax.spines['top'].set_visible(False)   

ax.text(-1.5, 3, 'mean = 4.82, std = 1.89')

plt.savefig('gdp_hist_late.pdf', bbox_inches='tight') 

plt.show()


# Take on the challenge! I'll just do the early period, but the late period works the same way. 

fig, ax = plt.subplots() 

ax.hist(gdp_early.dropna(), bins=20, color='blue', alpha=0.75)   

ax.set_ylabel('Frequency')
ax.set_xlabel('Annual growth rate (%)')
ax.set_title('Frequency of US GDP growth rates (1929-1985)')

ax.spines['right'].set_visible(False) 
ax.spines['top'].set_visible(False)   

# text(x, y, string)
ax.text(-20, 15, 'mean = {0:.2f}, std = {1:.2f}'.format(gdp_early.mean(), gdp_early.std()))           

# bbox_inches='tight' removes extra whitespace around the figure
plt.savefig('gdp_hist_early.pdf', bbox_inches='tight')

plt.show()
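A note on the format specifications used in this exercise. In a spec like {0:3.2f}, the f asks for fixed-point notation and .2 asks for two digits after the decimal point. Leave the f off and .2 means two significant digits instead, which is easy to miss. Here is a quick sketch; the two numbers are illustrative values, not computed from the data.

```python
# Illustrative values, not computed from the data.
mean, std = 7.2249, 8.3871

# .2f: fixed-point notation with two digits after the decimal point
fixed = 'mean = {0:.2f}, std = {1:.2f}'.format(mean, std)

# .2 without the f: two significant digits
general = 'mean = {0:.2}, std = {1:.2}'.format(mean, std)

# The same thing written as an f-string
fstring = f'mean = {mean:.2f}, std = {std:.2f}'

print(fixed)
print(general)
print(fstring)
```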


Subplots

We can generate several axes in one figure using the subplots() method.

fig, ax = plt.subplots(1,2)  # one row, two columns of axes


print(type(ax))
<class 'numpy.ndarray'>

So ax is now an array that holds the axes for each plot. Each axes works just like before. Now we just have to tell python which axes to act on.

# Set a variable for plot color so I can change it everywhere easily
my_plot_color = 'blue'

# I am using the figsize parameter here. It takes (width, height) in inches. 
fig, ax = plt.subplots(1, 2, figsize=(10,4))  # one row, two columns of axes

# The first plot
ax[0].plot(gdp.index, gdp['gdp_growth'], color=my_plot_color, label = 'GDP Growth')
ax[0].axhline(y=0, color='black', linewidth=0.75)  # Add a horizontal line at y=0
ax[0].set_xlabel('year')
ax[0].set_title('GDP growth rates')
#ax[0].spines['right'].set_visible(False) # get rid of the line on the right
#ax[0].spines['top'].set_visible(False)   # get rid of the line on top

# The second plot
ax[1].hist(gdp['gdp_growth'].dropna(), bins=20, color=my_plot_color, alpha=0.25)        
ax[1].set_xlabel('annual growth rate')
ax[1].set_title('Histogram of GDP growth rates')
#ax[1].spines['right'].set_visible(False) # get rid of the line on the right
#ax[1].spines['top'].set_visible(False)   # get rid of the line on top

for a in ax:
    a.spines['top'].set_visible(False)
    a.spines['right'].set_visible(False)

plt.savefig('double.pdf')
plt.show()


You can imagine how useful this can be. We can loop over sets of axes and automate making plots if we have several variables.
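Here is a sketch of what that automation might look like. The DataFrame `df` is a stand-in with the first three years of inv and gov typed in by hand; with the real data you would use gdp and whichever columns you like.

```python
import pandas as pd
import matplotlib.pyplot as plt

# First three years of two columns of the data, typed in by hand.
df = pd.DataFrame({'inv': [17.170, 11.428, 6.549],
                   'gov': [9.622, 10.273, 10.169]},
                  index=[1929, 1930, 1931])

columns = ['inv', 'gov']
fig, ax = plt.subplots(1, len(columns), figsize=(10, 4))

# Pair each axes with one column and draw the same kind of plot on each.
for a, col in zip(ax, columns):
    a.plot(df.index, df[col], color='blue')
    a.set_title(col)
    a.spines['top'].set_visible(False)
    a.spines['right'].set_visible(False)
```

Adding a variable then only means adding a name to the `columns` list.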

I changed a couple other things here, too.

  1. I used the figsize parameter of subplots(). This is a tuple of figure width and height in inches. (Inches! Take that, rest of the world!) The height and width are those of the printed figure; you will notice that jupyter notebook scaled it down for display. This is useful when you are preparing graphics for a publication and you need to meet an exact figure size.
  2. I made the line color a variable, so it is easy to change all the line colors at once. For example, I like red figures when I am giving presentations, but black figures when I am creating pdfs that will be printed out on a black and white printer.
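If you want to check the size of a figure, a small sketch like the following may help. I pass dpi explicitly (an assumption for this example; normally I leave it at the default) so that the conversion from inches to pixels is unambiguous.

```python
import matplotlib.pyplot as plt

# dpi is set explicitly here so the arithmetic below is unambiguous.
fig, ax = plt.subplots(1, 2, figsize=(10, 4), dpi=100)

width_in, height_in = fig.get_size_inches()

print(width_in, height_in)                       # size in inches
print(width_in * fig.dpi, height_in * fig.dpi)   # size in pixels at this dpi
```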

Bonus practice problem

  1. Modify the code above to add a vertical line to the histogram in the second subfigure. The line should be located at the mean. Try searching for 'vertical line matplotlib'.
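If you get stuck, here is one possible sketch. It uses made-up growth rates in place of gdp['gdp_growth'].dropna() and draws on a single axes rather than ax[1], so you will need to adapt it to the subplot code above.

```python
import matplotlib.pyplot as plt

# Made-up growth rates standing in for gdp['gdp_growth'].dropna().
growth = [2.0, 3.5, -1.0, 4.0, 6.5, 3.0]
mean_growth = sum(growth) / len(growth)

fig, ax = plt.subplots()
ax.hist(growth, bins=5, color='blue', alpha=0.25)

# axvline() draws a vertical line across the axes at the given x value.
vline = ax.axvline(x=mean_growth, color='black', linewidth=0.75)

ax.set_xlabel('annual growth rate')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
```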