download notebook
view notebook w/ solutions
More on plotting
files needed = ('chile.xlsx', 'broadband_size.xlsx', 'auto_data.dta')
The seaborn package
We have been using matplotlib to plot. This is a very low-level package, meaning that we have a lot of fine-grained control of the elements of our plots. I like fine-grained control because it means I can makes things look exactly how I want it. It also means that we have to type a lot of code to create a figure.
In this notebook we introduce the seaborn package. Seaborn is written on top of matplotlib and automates some of the creation of plots. This can be very helpful, but remember do not trust the defaults.
Seaborn also includes some plot types that are not easy to do in matplotlib. Great!
We will cover:
* regplot()
which adds the line-of-best-fit to a scatter plot
* jointplot()
which adds the marginal distributions to the axes
* Scaling scatter markers with a third variable
* Facet plots
Along the way, we will take a look at the World Development Indicators.
import pandas as pd # our go-to for data handling
import matplotlib.pyplot as plt # make plots, but doesn't automate much
import seaborn as sns # some new plot types, more automation
pd.set_option('precision', 3) # this tells pandas to print out 3 decimal places when we print a DataFrame
World Development Indicators
The World Bank's World Development Indicators is a great source of economic and social data for many countries and many years. I have already downloaded output and consumption data for a few countries and saved it as 'wdi.csv'.
The database is great: lots of variables, countries and time. I have extracted a small data file for Chile. The files that come out of the download facility are a mess. Let's clean them up.
# I looked at the workbook and noticed the footer and that nas are '..'.
wdi = pd.read_excel('chile.xlsx', na_values='..', skipfooter=5)
wdi.head()
There is a lot to not like about this DataFrame.
- There are unneeded variables.
- The year data are a mix of numbers and letters.
- The unit of observation is a year, so the I want the years in the index.
1. Drop unneeded variables
wdi = wdi.drop(['Country Name', 'Country Code', 'Series Code'], axis=1)
wdi.head()
2. Clean up the years and make numbers
Cleaning up the dates is bit complicated but not impossible. I will slice each column name and take the first four characters. Then convert those first four characters to an int
. The first column name is not a date, so I need to skip that.
temp = [wdi.columns[0]] + [int(c[0:4]) for c in wdi.columns[1:]]
temp
wdi.columns = temp
wdi.head()
3. Years in the index
Looking better! Let's set the 'Series Name' as the index and then transpose the DataFrame. We will learn more about reshaping data in the future.
wdi = wdi.set_index('Series Name')
wdi
wdi = wdi.transpose()
wdi.head()
wdi.columns = ['gdp_real', 'cons_real']
wdi.head()
Lastly, let's compute the growth rates of real gdp and consumption.
wdi_gr = wdi.pct_change()*100
wdi_gr.head()
sns.regplot( )
Our go-to scatterplot in matplotlib does not offer an easy way to add a simple line of best fit. We can separately estimate the regression and then plot the fitted values, but seaborn provides a simple way to get that same look.
The regplot( )
function (docs) adds a fitted regression and confidence interval to a scatter.
my_fig, my_ax = plt.subplots(figsize=(10,5))
sns.regplot(x='gdp_real', # column to put on x axis
y='cons_real', # column to put on y axis
data=wdi_gr, # the data
ax = my_ax, # an axis object
color = 'black',
ci = 0) # confidence interval, 0 supresses it
# Easier than matplotlib!
sns.despine(ax = my_ax)
# Since this is all in a matplotlib axis/figure, our usual labeling applies.
my_ax.set_title('Consumption and output growth in Chile')
my_ax.set_ylabel('real consumption growth (%)')
my_ax.set_xlabel('real output growth (%)')
plt.show()
Looking good! The default regression spec is OLS, but you can specify a more complicated model. Interestingly, you cannot recover the coefficients (slope, intercept) of the estimated regression. The person who developed seaborn is apparently quite adamant about this.
Notice the different syntax seaborn exposes.
sns.regplot(x, y, data, ax)
We do not pass the columns of data directly, like we would with matplotlib. Instead, we pass the DataFrame and the column names of the variables we want to plot.
We can pass regplot()
an axis object, ax = my_ax
in the code above, to attach the plot to a specific axis. If we omit the ax argument, regplot()
will create an axis object for us.
Confidence interval
Let's specify a 95 percent confidence interval.
my_fig, my_ax = plt.subplots(figsize=(10,5))
sns.regplot(x='gdp_real', # column to put on x axis
y='cons_real', # column to put on y axis
data=wdi_gr, # the data
ax = my_ax, # an axis object
color = 'red',
ci = 95) # confidence interval, 0 supresses it
# Easier than matplotlib!
sns.despine(ax = my_ax)
# Since this is all in a matplotlib axis/figure, our usual labeling applies.
my_ax.set_title('Consumption and output growth in Chile')
my_ax.set_ylabel('real consumption growth (%)')
my_ax.set_xlabel('real output growth (%)')
plt.show()
sns.jointplot( )
jointplot( )
adds the marginal distributions of the plotted variables to the axis of a regplot (docs).
This may be useful for visualizing the marginal distributions, but is it important that your reader see the marginal distributions? Is it telling them something important? Potentially, but remember, just because you can do something, doesn't mean you always should.
I occasionally use this plot when I am doing preliminary data exploration.
# Rather than call plt.subplots, let seaborn create the fig and axes.
# h is the axis-like object.
h = sns.jointplot(x='gdp_real',
y='cons_real',
kind='reg', # specify a regplot in the main plot area
data=wdi_gr,
ci=95,
color = 'black',
height = 5 # we can still control the figure size
)
h.set_axis_labels('real gdp growth', 'real consumption growth')
# what is h?
print('h is a', type(h))
plt.show()
Notice that I did not start my figure by creating fig and axes objects. Instead, I let seaborn create the fig and axes for me. The return from sns.jointplot()
is a JointGrid object created by seaborn. This is a more complicated figure, so it needs more complicated axes ojects.
Practice
The OECD has a project studying broadband internet coverage across countries. It tracks data on numbers of subscribers, speed, and prices.
- Load 'broadband_size.xlsx'. It contains data on broadband accounts per 100 people, GDP per capita, and population (in thousands) for several countries. Are all your variables okay?
-
Give the columns some reasonable names.
-
Create a
.regplot()
with broadband penetration on the y axis and GDP per capita on the x axis. Add the 95 percent confidence interval. Apply the principles of graphical excellence to your figure. -
The relationship doesn't look very linear to me. Replot your solution from 3. but try adding the
logx=True
option to regplot to regress y = log(x). As always, consult the documentation if you need help.
Bubble plot (and passing keywords)
A bubble plot is a scatter plot in which the size of the data markers (usually a circle) varies with a third variable.
We can actually make these plots in matplotlib. The syntax is
ax.plot(x, y, s)
where s
is the variable corresponding to marker size. Since seaborn is built on top of matplotlib, we can pass scatter keyword arguments to .regplot( )
and these get passed through to the underlying scatter.
If we pass a single number to s
it changes the size of all the bubbles. If we pass it a Series of data, then each bubble gets scaled according to its value in the series.
The syntax for the option is scatter_kws={'s': data_var}
. This sets the s
argument of scatter to data_var
.
fig, ax = plt.subplots(figsize=(10,5))
sns.regplot(x='gdp_cap', y='broad_pen', data=broad, # the data
ax = ax, # an axis object
scatter_kws={'s': broad['pop']/1000}, # make the marker proportional to population
#scatter_kws={'s': 25},
color = 'blue', # make it blue
ci = 95, # confidence interval: pass it the percent
logx = True) # semi-log regression
# We need to let the reader know what the bubble sizes represent.
ax.text(50000, 20, 'Marker size proportional to population size')
sns.despine(ax = ax)
ax.set_title('Broadband penetration and income')
ax.set_ylabel('broadband subscribers per 100 people')
ax.set_xlabel('GDP per capita')
plt.show()
Notice that I have scaled population by 1000. The issue is that s
is interpreted as points^2 (points squared) [docs]. The idea is that the area of the marker increases proportional to the square of the width. There is a good discussion of it at stack overflow.
If you try to use s
and your whole figure turns the color of your marker, you probably need to scale your measure for s
.
Another example of the scatter_kws useage is to customize the scatter colors and alpha.
fig, ax = plt.subplots(figsize=(10,5))
# To keep the call to regplot from getting out of control, I define the scatter keywords dict here.
my_kws={'s': broad['pop']/1000, 'alpha':0.25, 'color':'black'}
sns.regplot(x='gdp_cap', y='broad_pen', data=broad, # the data
ax = ax, # an axis object
scatter_kws = my_kws, # pass parameters to scatter
color = 'blue', # make it blue
ci = 95, # confidence interval: pass it the percent
logx = True) # semi-log regression
# We need to let the reader know what the bubble sizes represent.
ax.text(50000, 20, 'Marker size proportional to population size')
sns.despine(ax = ax)
ax.set_title('Broadband penetration and income')
ax.set_ylabel('broadband subscribers per 100 people')
ax.set_xlabel('GDP per capita')
plt.show()
Facet plots
Facet plots are grids of plots with the same x- and y-axes. Each plot in the grid is a different subset of the sample. Seaborn gives us simple way to make these plots.
We often use facet plots in initial exploratory analysis. If we do not know what we are looking for, a facet plot is a good way to start "eye-balling" relationships. Once we have some ideas, we can narrow down our focus and use more precise tools. In general, we do not include large grids of figures in our finished analysis. They contain too much unnecessary information.
Load the file 'auto_data.dta' which contains data on automobile characteristics in the European market. These data are from Miravete, Moral, and Thurk.
df = pd.read_stata('auto_data.dta')
df.head(3)
# Recode the FUEL variable so I can easily understand it.
df['FUEL'] = df['FUEL'].replace({0:'gasoline', 1:'diesel'})
df.sample(5)
Looking at the data, we see that a unit of observation is a model at a point in time. We see prices and quantities sold and characteristics about the model. Let's cut the data down to VW and try some plots.
vw = df[df['FIRM']=='Volkswagen']
vw.sample(8)
Q: How are vehicle weight and fuel efficiency related? Does it vary by fuel type? Does it vary by brand?
- Volkswagen has four brands during this period: Audi, Seat, Skoda, and Volkswagen.
- There are two fuel types: gasoline and diesel.
Let's make a grid of plots where the rows are the brands and columns are the fuel types. This is a 4x2 grid.
In each plot, we will scatter weight vs. mpg.
g = sns.FacetGrid(vw, row='BRAND', col='FUEL')
g.map(plt.scatter, 'WEIGHT', 'MPG', color='blue')
plt.show()
What is a point in a plot? It is a model-year.
What do we see?
- Diesel vehicles tend to get higher mpg
- Within a fuel type, the range of mpg are similar
- Weight and mpg are usually negatively correlated (Skoda diesel?)
- There is heterogeneity in the number of models per brand
Facet plot syntax
We first create the grid using FacetGrid()
. We specify which DataFrame we are plotting and which variables we want for the rows and columns. These variables should be categorical and should have relatively few potential values. Otherwise, the grid would get very large and it would be hard to interpret.
g = sns.FacetGrid(vw, row='BRAND', col='FUEL')
Next, we map a plot type to the grid using map()
. We can make many types of plots. In this case we have used the scatter()
plot from matplotlib. Notice the plt
that precedes the scatter
. We can also pass any keyword arguments that plt.scatter
accepts.
g.map(plt.scatter, 'WEIGHT', 'MPG', color='blue')
Now lets try a different plot type, regplot()
from Seaborn. It's easier to see the relationship between weight and mpg.
g = sns.FacetGrid(vw, row='BRAND', col='FUEL', height=4)
g.map(sns.regplot, 'WEIGHT', 'MPG', color='red', ci=90)
plt.show()
Q: Are more powerful cars more expensive? Does it depend on fuel type? Brand?
We are not limited to just one type of data in each plot. We can use color to differentiate further. In the next figure we add Ford, PSA, and Fiat to the firms in our DataFrame. Each firm has several brands and each brand has several models.
- Columns are still fuel type
- Rows are now firms (VW, Ford, PSA, Fiat)
- Hue (color) is brand (Ford's Volvo; Fiat's Alfa Romeo, etc.)
g = sns.FacetGrid(to_plot, hue='BRAND', col='FUEL', row='FIRM')
In each plot we have * Price vs HP * I'm using a scatter plot
g.map(plt.scatter, 'PRICE', 'HP')
# Create a dataframe with just these firms
firms = ['Ford', 'PSA', 'Volkswagen', 'Fiat']
to_plot = df[df['FIRM'].isin(firms)]
g = sns.FacetGrid(to_plot, hue='BRAND', col='FUEL', row='FIRM')
g.map(plt.scatter, 'PRICE', 'HP')
g.add_legend()
plt.show()
Practice: Facet Plots
- Q: How is size related to price? Does it differ by firm? By brand? By fuel type?
Use a facet plot to explore these questions. Restrict the DataFrame to include only Ford, PSA, Volkswagen, and Fiat.
- Let's explore a related concept, the
pairplot
[docs]. Try
g=sns.pairplot(df, vars=['PRICE', 'HP', 'WEIGHT'])
What does pairplot
do? Why do we only need to look at the upper or lower triangle of the figure? What is on the diagonal?
- How do these relationships differ by fuel type? Use the 'hue' option (and the documentation).