download notebook
view notebook w/ solutions

Reshaping DataFrames

Files needed = ('dogs.csv', 'WEOOct2021all.csv', 'zillow.csv')

We are learning how to deal with complex datasets in pandas. We began by tackling the multiIndex and now we turn to panel data and its various representations.

Panel data have (at least) three dimensions, for example, a panel might record

  1. several variables (income, age, spending on gasoline)
  2. regarding several people
  3. over time

The multiIndex gives us a natural way to handle this data. In this notebook, we work on ways to get our datasets into shapes that are amenable to analysis.

Reshaping is important.

A standard structure for a DataFrame has observations as rows and columns are variables. Many methods expect DataFrames to be organized this way but raw data are often not organized in this manner. This DataFrame structure is often referred to as tidy data. This article by Wickham lays out the ideas. The first three sections are great reading. The rest of the paper shows how to tidy up data in R. We can do the same things in python. With enough suffering, you can do it in STATA, too.

import pandas as pd                 # load pandas and shorten it to pd
import datetime as dt               # load datetime and shorten it to dt
import matplotlib.pyplot as plt     # for making figures
import seaborn as sns

Some panel data

Load the 'dogs.csv' file. The different dimensions of the data are: variables (walks, snacks); dogs (Buster, Su, Jax); and time. The column value are the data associated with the dog-variable-time triplet.

[For reference, here are Buster and Su and Jax the wild dog.]

# load a data file with the number of walks and snacks my dogs have had 

dogs = pd.read_csv('dogs.csv') 
dogs
dogs.dtypes

Long vs. wide (or stacked vs. unstacked)

The data in the file we just loaded is formatted as long: There are lots of rows and not many columns. Moving between long and wide (lots of columns, fewer rows) is a common task in setting up panel data sets.

Pandas calls long data stacked and wide data unstacked. We use the .stack() and .unstack() methods for moving between long and wide with multiIndexed data. Stack and unstack do not work in place. They always return a copy, so we need to assign it to a variable.

The typical workflow is:

  1. Set up the index
  2. Stack or unstack

Set up the index

.stack() and .unstack() work from the index. Since our data may be many-dimensional, multiIndexes are often useful. We start by getting the index set so that we can reshape our data.

# Time veriables should be datetimes 
dogs['time'] = pd.to_datetime(dogs['time'])

# move everything but the data values to the index
dogs = dogs.set_index(['dog', 'time', 'var'])
dogs = dogs.sort_index()
dogs

Unstack

We unstack the data to put the variables into columns. This creates a multiIndex over the columns.

We can unstack several variables. As usual, we pass a list of variable names.

dogs.unstack(['dog', 'var'])

For our data, though, we only want to unstack 'var'.

# We pass unstack the name of the index level to take out of the index and spread across the rows. 
dogs_us = dogs.unstack('var') 
dogs_us

The data are now wide (or, at least, wider) and tidy. We have shortened the number of rows and increased the number of columns. This is my preferred way to shape this data because it puts the unit of observation in the row index. In this case, the unit of observation is a dog-day.

Notice that Buster has an entry for snacks for 10/3, but not for walks. When pandas unstacked the data, it inserted NaNs for the missing observation.

The column index is now a multiIndex:

dogs_us.columns

Removing a level from a multiIndex

The outermost level of the column multiIndex is 'value' and is not very useful. Let's drop it.

Note that the level does not have a name associated with it (None is not a name) so we need to use the number associated with the level.

dogs_us.columns = dogs_us.columns.droplevel(level = 0)
dogs_us
dogs_us.columns

Creating new variables

I want to know the walk-to-snacks ratio to make sure they are getting enough exercise. We want to compute the walk-snack ratio for each dog. This works exactly like you think it does.

dogs_us['ws_ratio'] = dogs_us['walks']/dogs_us['snacks']
dogs_us

As usual, pandas handles our missing observations gracefully.

Plotting with panels

Let's plot the walk-snack ratio. The plotting is straightforward, but we need to use our multiIndex skills to select the data.

fig, ax = plt.subplots(figsize=(15,6))

ax.plot(dogs_us.loc['Buster'].index, dogs_us.loc['Buster', 'ws_ratio'], color='red', marker = 'o', label='Buster')
ax.plot(dogs_us.loc['Su'].index, dogs_us.loc['Su', 'ws_ratio'], color='blue', marker = 'o', label='Su')
ax.plot(dogs_us.loc['Jax'].index, dogs_us.loc['Jax', 'ws_ratio'], color='black', marker = 'o', label='Jax')


ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.set_title('Walk-snack ratio', fontsize=20)
ax.legend(frameon=False)

plt.show()

Looking at my code for this figure I see that I basically repeated the same code three times. Not good! I should automate the plotting or use a package like seaborn to automate it for me.

for dog, c in zip(['Buster', 'Su', 'Jax'], ['red', 'blue', 'black']):
    ax.plot(dogs_us.loc[dog].index, dogs_us.loc[dog, 'ws_ratio'], color=c, marker = 'o', label=dog)
fig, ax = plt.subplots(figsize=(15,6))

for dog, c in zip(['Buster', 'Su', 'Jax'], ['red', 'blue', 'black']):
    ax.plot(dogs_us.loc[dog].index, dogs_us.loc[dog, 'ws_ratio'], color=c, marker = 'o', label=dog)


ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.set_title('Walk-snack ratio', fontsize=20)
ax.legend(frameon=False)

plt.show()

Stack

We can stack the data to put the variables back on the rows.

When we stack the data, pandas defaults to dropping the NaNs, which saves space. We can override this if we choose with the dropna=False option.

# stack() shifts the columns to rows. 
dogs_s = dogs_us.stack(dropna=True)       
dogs_s

This is a Series object—it only has one column of data and the rest is the index.

Let's get this ready to save as long data, like it was when we read it in.

  1. Reset the index to move them back to columns. Now we have a DataFrame rather than a Series.
  2. Give the 'value' column a name.
dogs_s = dogs_s.reset_index()
dogs_s.head()
dogs_s.rename(columns={0 :'value'}, inplace=True)
dogs_s.head()
# Now save it. 
dogs_s.to_csv('dog_data_updated.csv', index=False)  #index = False does not print the index

Practice #1

Let's review multiIndexing with some real world data. The data are messy and will require some cleaning up and 'wrangling.'

We will work with the IMF's World Economic Outlook, which contains historical data and the IMF's forecasts for many countries and variables. The file is the 'By Countries' file. Do not, however download it. Use the one that I have provided. Here's why:

  • Even on the webpage, it says the file is tab delimited, yet, the file extension is 'xls'. If you tried to load it with .read_excel() it would not work.
  • There are a lot of missing values, header and footer hassles. I would expect you all to handle this stuff in a coding practice, but it would take too much in-class time.

To make our lives simpler, I turned the file into a csv file that is reasonably clean, but still has the wrong shape. If you want to know how to handle all of this directly in pandas drop me a line.

The data are in the file 'WEOOct2021all.csv'.

There are two ways to do this practice.

The hard way. The final DataFrame should have a row index with two levels: 'ISO' and the date. The column index should have three levels: 'Variable', 'Description', and 'Units'.

You figure out how to get it that way.

The less hard way. Follow the steps below.

Remember, the typical workflow is:

  1. Set up the index
  2. Stack or unstack

  3. Load the file 'WEOOct2021all.csv' into a DataFrame. Check your dtypes.

  4. Set the index to be 'Variable', 'ISO', 'Description', and 'Units'. Now the index will have everything in it that is not data. (workflow step #1)

  5. Stack the data. (workflow step #2)

Great. Now are data are stacked, or long. Let's get the DataFrame into the shape we want: Observations in the row index and variables in the column index.

  1. Unstack 'Variable', 'Description', and 'Units', in that order. We want the variable names on the outermost level.

We are almost ready to go. One problem remains...

  1. Check the index. What kind of variable are the dates?
  2. Reset the index and convert the dates to datetime.
  3. Set the index back to the country name and year.

Practice #2: Try at home

The data are from Zillow. It is housing inventory by metro area—i.e., the number of unique listings that were active at any time in a given month.

Now that you have had some practice, let's do this the hard way.

  1. The final DataFrame should have a row index with two levels: 'RegionName' and the date. The column index has only one level.

You figure out how to get it that way.

The data are in the file 'zillow.csv'

https://documenter.getpostman.com/view/9197254/UVsFz93V

  1. Once you get your DataFrame in good shape, try plotting the inventory level for Madison, WI. What patterns do you notice? How does Madison compare to the United States as a whole?