Cleaning and transforming data
Files needed = ('atussum_2017.dat', 'movies.csv')
In this notebook we will continue practicing techniques for manipulating data into forms that are amenable to analysis. We will cover:
- .replace() and .unique() for recoding variables
- .map() and .applymap() for working element-wise on DataFrames
- String methods for working with strings in DataFrames
We will also learn about the American Time Use Survey.
import pandas as pd # pandas for data handling
import matplotlib.pyplot as plt # matplotlib for plots
import numpy as np # numpy for numerical methods
American Time Use Survey (ATUS)
The Bureau of Labor Statistics oversees the American Time Use Survey, which asks a sample of Americans to complete detailed diaries keeping track of each minute of their day.
Follow this link www.bls.gov/tus/data/datafiles_2017.htm to the page for the 2017 survey. Download the ATUS 2017 Activity summary file (zip), located in the 2017 Basic ATUS Data Files section of the page. Alternatively, download it directly from www.bls.gov/tus/datafiles/atussum_2017.zip.
Unzip the file. We are looking for atussum_2017.dat. It is a comma-separated file (even though it has a '.dat' extension). Let's get it loaded.
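Before trimming things down, you could load the whole file to take a look. [A quick sketch I've added; the variable name atus is my own choice.]

import pandas as pd

# Read the full file. Despite the '.dat' extension, read_csv handles it fine.
atus = pd.read_csv('atussum_2017.dat')
print(atus.shape)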
Variables
This data set has 421 variables! That's too many for us today. Let's just keep a few.
The demographic variables are all uppercase letters. The time variables are of the form 'txxyyzz' where xx is the major category code, yy is the second-tier code, and zz is the third-tier code. (docs)
Let's keep some demographic data and some data about working and sleeping.
variables = {'TEAGE':'age', 'TESEX':'sex', 'PTDTRACE':'race', 'PEEDUCA':'edu', 'GTMETSTA':'metro', 'TELFS':'employ',
'TUDIARYDAY':'day', 't050101':'work_main', 't050102':'work_other', 't010101':'sleep', 't050201':'work_soc', 't010102':'no_sleep'}
atus_small = pd.read_csv('atussum_2017.dat', usecols=variables.keys())
atus_small.rename(columns=variables, inplace=True)
atus_small.info()
Okay, let's start transforming the data to get it ready for analysis.
A word on loops
We have seen many places where a loop saved us time and effort. It might seem natural to use a loop to perform an operation on each element of a column or a DataFrame.
In general, however, we want to avoid this. Instead, we have used pandas vectorized operations such as
x['new_var'] = x['var_1'] / x['var_2']
to perform element-wise division. Using pandas' native operations is much (much much) faster than looping over the rows of a DataFrame. A lot of optimization has been written into the native functions that is not there when we loop ourselves. If you find yourself looping over the rows of a DataFrame, take a step back and think about other ways forward.
Fortunately, pandas provides methods that let us do very complex and very general operations to a DataFrame without resorting to a loop. We consider a few of these below.
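To make the point concrete, here is a toy comparison. [The DataFrame df below is made up for illustration; it is not part of the ATUS data.]

import pandas as pd

df = pd.DataFrame({'var_1': [10.0, 20.0, 30.0], 'var_2': [2.0, 4.0, 5.0]})

# The loop version: walk the rows one at a time. Slow on real data.
ratios = []
for i in df.index:
    ratios.append(df.loc[i, 'var_1'] / df.loc[i, 'var_2'])
df['loop_version'] = ratios

# The vectorized version: one line, and pandas does the element-wise work.
df['vectorized_version'] = df['var_1'] / df['var_2']

print(df)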
replace( )
The sex variable is coded 1 for male and 2 for female. I do not want to have to remember that!
The replace( ) method replaces one value with another. One syntax is
atus_small['sex'] = atus_small['sex'].replace(1, 'male')
but a more powerful one passes a dict or a list.
atus_small['sex'] = atus_small['sex'].replace({1:'male', 2:'female'})
sex_codes = {1:'male', 2:'female'}
atus_small['sex'] = atus_small['sex'].replace(sex_codes)
atus_small.head()
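The list syntax works like this: each value in the first list is replaced by the corresponding value in the second. [A minimal sketch I've added; it is equivalent to the dict version above.]

atus_small['sex'] = atus_small['sex'].replace([1, 2], ['male', 'female'])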
unique( )
Let's code race, too. What codes are in our data? The method unique( )
returns the unique values.
races = atus_small['race'].unique()
races.sort()
print(races)
# I looked up the codes in the documentation...
race_codes = {1:'white', 2:'black', 3:'native_am', 4:'asian', 5:'hawaiian', 6:'wh_bl',
7:'wh_na', 8:'wh_as', 9:'wh_ha', 10:'bl_na', 11:'bl_as', 13:'na_as', 16:'wh_bl_as', 18:'wh_as_ha'}
atus_small['race'] = atus_small['race'].replace(race_codes)
atus_small.head()
Apply a function to a Series [a single column]: map( )
We can apply functions to the individual elements in a column, too. The map() method handles this for us. It applies the given function to each element of the column. These can be built-in functions or user-defined functions. This is quite powerful: we can write a function that performs a complicated transformation and apply it to each element of a column in one simple line.
Let's define a function that converts minutes to hours.
def minutes_to_hours(x):
return x/60
Now apply map( ) to the work column.
[This example is a bit contrived. We could have just as easily done atus_small['work_main']/60, but that wouldn't have given us practice with map( ).]
atus_small['work_main'] = atus_small['work_main'].map(minutes_to_hours)
atus_small.head()
Apply a function to a DataFrame: applymap( )
If we want to apply the same function to each element in several columns of a DataFrame (rather than a Series), we use applymap( ). It works the same way, applying the function to each element.
[I am not sure why the developers of pandas created separate methods for a Series vs a DataFrame. My best guess is that each method takes advantage of Series- or DataFrame-specific functionality for better performance.]
# We can map to several columns at once.
atus_small[['work_other', 'sleep', 'no_sleep']] = atus_small[['work_other', 'sleep', 'no_sleep']].applymap(minutes_to_hours)
atus_small.head()
.map() and .applymap() are very powerful methods. They can do a lot more than we have covered, and they can be used to do very complex manipulations. We won't need those advanced capabilities, but they are out there if you someday do.
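One example of that extra flexibility: map( ) also accepts anonymous (lambda) functions. [A small sketch I've added; the column name 'long_day' is my own invention.]

# Flag diaries that report more than 8 hours of work at the main job.
atus_small['long_day'] = atus_small['work_main'].map(lambda x: x > 8)
atus_small['long_day'].head()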
Practice
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.
1. edu in 'atus_small' holds the highest level of education obtained. It can take values between 31 and 46. Convert the numeric values to labels:
   - 39 = 'high school'
   - 40 = 'some college'
   - 41 & 42 = 'associate'
   - 43 = 'bachelor'
   - 44 = 'master'
   - 45 = 'prof'
   - 46 = 'phd'
2. Print out a list of the unique values in the column 'edu'.
3. Apply the minutes_to_hours function to the work_soc variable. This variable is the time spent 'Socializing, relaxing, and leisure as part of job'. I have no idea what kind of job lets you relax and take leisure.
4. Create a column named 'work' that is the sum of 'work_main' and 'work_other'.
5. Create a histogram of 'work' for only those observations with work>0.
Extra practice
If you finish early, try rewriting your solution to part 1, but code values less than 39 as "less than high school".
String methods
We have seen some of these before, when we learned about strings. These are analogous to the string methods in standard Python, but they have been optimized for DataFrames. These vectorized string methods work element-wise over an entire column. The method call looks like
data['var'].str.method()
where .str.method( )
is the method we are applying. A list of vectorized string methods is on page 218 in McKinney and in the documentation. Below, we try a few out.
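As a quick taste, here is one applied to the sex column we recoded earlier. [This snippet is my illustration, not part of the original notebook.]

# Upper-case every element of the column in one vectorized call.
atus_small['sex'].str.upper().head()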
MovieLens data set
We are going to work with the MovieLens ml-latest-small dataset. The GroupLens organization released this data. It is meant to help build recommendation algorithms, like the ones you see on Netflix or Spotify. [In 2006, Netflix started a contest, with a $1 mil. reward, for an algorithm that could beat their own.] They have other ratings datasets, too, on music, jokes, and books.
An observation is a movie.
movies = pd.read_csv('movies.csv')
movies.sample(10)
str.contains( )
The genres are mixed together. Let's get all the comedies. The .contains( )
method returns a bool Series with True for observations in which the string contains the search term.
movies['genres'].str.contains('Comedy')
print(movies.shape)
comedies = movies[movies['genres'].str.contains('Comedy')]
print(comedies.shape)
str.split( )
This method splits the string up at the delimiter that is passed to .split( ). It returns a list of the chunks that fall between the delimiters.
This could be useful for processing name data that come in the form last,first or city,state.
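For instance: [a toy Series I've made up to show the idea]

import pandas as pd

names = pd.Series(['Lovelace,Ada', 'Hopper,Grace'])

# Split each string at the comma. Each element becomes a list: [last, first].
print(names.str.split(','))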
# The movie genres are separated with the '|' character.
# Remember, DataFrames can have columns of lists...
movies['genre_split'] = movies['genres'].str.split('|')
movies.head()
movies.loc[0,'genre_split']
str.join( )
Put strings together. Separate the pieces with a delimiter of your choosing.
movies['with_colons'] = movies['genre_split'].str.join('::')
movies.sample(5)
Practice
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.
Our data does not have a column for the year the movie was released. Let's create one. The year the movie was released is in the title string.
1. Reload 'movies.csv'.
2. Use .str.strip() (docs) to remove any leading or trailing spaces from 'title'.
3. Extract the four-digit year from the titles and put them into a new column named 'year'. Notice that the year, including the parentheses, is always the last 6 characters of the title. You might try str.slice() and work with negative indexes to count from the end of 'title'. If there is any extra space at the end of a title, it will mess up my algorithm! That's why we strip the extra spaces first.
4. There are 12 movies that do not have a year in their title. Find them in your DataFrame. You might try the str.isdigit() (docs) method to see if the year you extracted in step 3 is numeric.
5. For the movies without a year, set the 'year' variable to np.nan. Use .unique() to check that all the 'year' values are either numbers (as strings) or nan. np.nan is the "not a number" object. Pandas will interpret it as a missing value.
6. Convert the year variable to datetime. Check your dtypes to verify.
Extra practice (try at home)
If you finished early, compute the number of movies in the DataFrame for each year. You might try .groupby().
- Which year has the most observations?
Here is something I have not figured out yet...
- How many movies are there in each decade? i.e., 1900–1909, 1910–1919, 1920–1929...