download notebook
view notebook w/o solutions

Cleaning and transforming data

Files needed = ('atussum_2017.dat', 'movies.csv')

In this notebook we will continue practicing techniques for manipulating data into forms that are amenable to analysis. We will cover:

  1. .replace() and .unique() for recoding variables
  2. .map() and .applymap() for working element-wise on DataFrames
  3. String methods for working with strings in DataFrames

We will also learn about the American Time Use Survey.

import pandas as pd                 # pandas for data handling
import matplotlib.pyplot as plt     # matplotlib for plots
import numpy as np                  # numpy for numerical methods 

American Time Use Survey (ATUS)

The Bureau of Labor Statistics oversees the American Time Use Survey, which asks a sample of Americans to complete detailed diaries keeping track of each minute of their day.

Follow this link www.bls.gov/tus/data/datafiles_2017.htm to the page for the 2017 survey. Download the ATUS 2017 Activity summary file (zip) located in the 2017 Basic ATUS Data Files section of the page. Alternatively, download it directly from www.bls.gov/tus/datafiles/atussum_2017.zip.

Unzip the file. We are looking for atussum_2017.dat. It is a comma-separated file (even though it has a '.dat' extension). Let's get it loaded.

Variables

This data set has 421 variables! That's too many for us today. Let's just keep a few.

The demographic variables are all uppercase letters. The time variables are of the form 'txxyyzz' where xx is the major category code, yy is the second-tier code, and zz is the third-tier code. (docs)

Let's keep some demographic data and some data about working and sleeping.

variables = {'TEAGE':'age', 'TESEX':'sex', 'PTDTRACE':'race', 'PEEDUCA':'edu',
             'GTMETSTA':'metro', 'TELFS':'employ', 'TUDIARYDAY':'day',
             't050101':'work_main', 't050102':'work_other', 't050201':'work_soc',
             't010101':'sleep', 't010102':'no_sleep'}
atus_small = pd.read_csv('atussum_2017.dat', usecols=variables.keys())
atus_small.rename(columns=variables, inplace=True)
atus_small.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10223 entries, 0 to 10222
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   age         10223 non-null  int64
 1   sex         10223 non-null  int64
 2   edu         10223 non-null  int64
 3   race        10223 non-null  int64
 4   metro       10223 non-null  int64
 5   employ      10223 non-null  int64
 6   day         10223 non-null  int64
 7   sleep       10223 non-null  int64
 8   no_sleep    10223 non-null  int64
 9   work_main   10223 non-null  int64
 10  work_other  10223 non-null  int64
 11  work_soc    10223 non-null  int64
dtypes: int64(12)
memory usage: 958.5 KB

Okay, let's start transforming the data to get it ready for analysis.

A word on loops

We have seen many places where a loop saved us time and effort. It might seem natural to use a loop to perform an operation on each element of a column or a DataFrame.

In general, however, we want to avoid this. Instead, we have used pandas vectorized operations such as

x['new_var'] = x['var_1'] / x['var_2']

to perform element-wise division. Using pandas' native operations is much (much much) faster than looping over the rows of a DataFrame. A lot of optimization has been written into the native functions that is not there when we loop ourselves. If you find yourself looping over the rows of a DataFrame, take a step back and think about other ways forward.

Fortunately, pandas provides methods that let us do very complex and very general operations to a DataFrame without resorting to a loop. We consider a few of these below.
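To make the comparison concrete, here is a minimal sketch (with throwaway data, not the ATUS file) of the loop we want to avoid next to the vectorized version. On a DataFrame of any real size, the difference in run time is dramatic.

# a throwaway DataFrame to make the comparison concrete
df = pd.DataFrame({'var_1': range(1, 10001), 'var_2': range(10001, 20001)})

# the loop version: works, but slow
ratios = []
for _, row in df.iterrows():
    ratios.append(row['var_1'] / row['var_2'])

# the vectorized version: one line, and much faster
df['new_var'] = df['var_1'] / df['var_2']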

replace( )

The sex variable is coded 1 for male and 2 for female. I do not want to have to remember that!

The replace( ) method replaces one value with another. One syntax is

atus_small['sex'] = atus_small['sex'].replace(1, 'male')

but a more powerful one passes a dict (or a list) of replacements. Let's recode with a dict:

sex_codes = {1:'male', 2:'female'}
atus_small['sex'] = atus_small['sex'].replace(sex_codes)
atus_small.head()
age sex edu race metro employ day sleep no_sleep work_main work_other work_soc
0 34 female 39 2 1 1 1 728 0 450 0 0
1 28 female 40 1 1 5 7 385 0 0 0 0
2 15 male 35 3 1 5 4 570 0 0 0 0
3 46 male 39 1 1 1 2 525 0 480 0 0
4 85 male 44 1 1 1 7 756 0 0 0 0
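As an aside, the list form mentioned above pairs old values with new values by position. A minimal sketch on a throwaway Series (our 'sex' column is already recoded, so we do not run this on atus_small):

codes = pd.Series([1, 2, 1])
codes.replace([1, 2], ['male', 'female'])   # -> male, female, male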

unique( )

Let's recode race, too. What codes are in our data? The method unique( ) returns the unique values in a column.

races = atus_small['race'].unique()
races.sort()
print(races)
[ 1  2  3  4  5  6  7  8  9 10 11 13 16 18]
# I looked up the codes in the documentation...

race_codes = {1:'white', 2:'black', 3:'native_am', 4:'asian', 5:'hawaiian', 6:'wh_bl', 
              7:'wh_na', 8:'wh_as', 9:'wh_ha', 10:'bl_na', 11:'bl_as', 13:'na_as', 16:'wh_bl_as', 18:'wh_as_ha'}
atus_small['race'] = atus_small['race'].replace(race_codes)
atus_small.head()
age sex edu race metro employ day sleep no_sleep work_main work_other work_soc
0 34 female 39 black 1 1 1 728 0 450 0 0
1 28 female 40 white 1 5 7 385 0 0 0 0
2 15 male 35 native_am 1 5 4 570 0 0 0 0
3 46 male 39 white 1 1 2 525 0 480 0 0
4 85 male 44 white 1 1 7 756 0 0 0 0

Apply a function to a Series [a single column]: map( )

We can apply functions to the individual elements in a column, too. The map( ) method handles this for us. It applies the given function to each element of the column. The function can be built-in or user-defined. This is quite powerful: we can write a function that performs a complicated transformation and apply it to each element of a column in one simple line.

Let's define a function that converts minutes to hours.

def minutes_to_hours(x):
    return x/60

Now apply map( ) to the work_main column.

[This example is a bit contrived. We could have just as easily done atus_small['work_main']/60. But that wouldn't have given us practice with map( ).]

atus_small['work_main'] = atus_small['work_main'].map(minutes_to_hours)
atus_small.head()
age sex edu race metro employ day sleep no_sleep work_main work_other work_soc
0 34 female 39 black 1 1 1 728 0 7.5 0 0
1 28 female 40 white 1 5 7 385 0 0.0 0 0
2 15 male 35 native_am 1 5 4 570 0 0.0 0 0
3 46 male 39 white 1 1 2 525 0 8.0 0 0
4 85 male 44 white 1 1 7 756 0 0.0 0 0
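As an aside, .map( ) also accepts an anonymous (lambda) function, so a one-off transformation does not need a named function. A minimal sketch with throwaway data:

minutes = pd.Series([450, 0, 480])
minutes.map(lambda x: x / 60)   # same conversion, no named function needed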

Apply a function to a DataFrame: applymap( )

If we want to apply the same function to each element in several columns of a DataFrame (rather than a Series) we use applymap( ). It works the same way, applying the function to each element.

[I am not sure why the developers of pandas created separate methods for a Series vs. a DataFrame. My best guess is that each method takes advantage of Series- or DataFrame-specific functionality for better performance. Note that newer versions of pandas (2.1+) deprecate .applymap() in favor of a DataFrame-level .map() method.]

# We can map to several columns at once.
cols = ['work_other', 'sleep', 'no_sleep']
atus_small[cols] = atus_small[cols].applymap(minutes_to_hours)
atus_small.head()
age sex edu race metro employ day sleep no_sleep work_main work_other work_soc
0 34 female 39 black 1 1 1 12.133333 0.0 7.5 0.0 0
1 28 female 40 white 1 5 7 6.416667 0.0 0.0 0.0 0
2 15 male 35 native_am 1 5 4 9.500000 0.0 0.0 0.0 0
3 46 male 39 white 1 1 2 8.750000 0.0 8.0 0.0 0
4 85 male 44 white 1 1 7 12.600000 0.0 0.0 0.0 0

.map() and .applymap() are very powerful methods. They can do a lot more than we have covered here and can handle very complex manipulations. We won't need those advanced capabilities, but they are out there if you someday do.
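One taste of those extra capabilities: .map( ) also accepts a dict, which makes it an alternative to .replace( ) for recoding. A minimal sketch with throwaway data; note that, unlike .replace( ), values missing from the dict become NaN.

sexes = pd.Series(['male', 'female', 'male'])
sexes.map({'male': 1, 'female': 2})   # -> 1, 2, 1
# sexes.map({'male': 1}) would turn 'female' into NaN; .replace() would keep it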

Practice

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.

  1. edu in 'atus_small' holds the highest level of education obtained. It can take values between 31 and 46. Convert the numeric values to labels:

     39 = 'high school'
     40 = 'some college'
     41 & 42 = 'associate'
     43 = 'bachelor'
     44 = 'master'
     45 = 'prof'
     46 = 'phd'
educ_codes = {39:'high school', 40:'some college', 41:'associate', 42:'associate', 
              43: 'bachelor', 44: 'master', 45:'prof', 46:'phd'}

atus_small['edu'] = atus_small['edu'].replace(educ_codes)
atus_small.sample(10)
age sex edu race metro employ day sleep no_sleep work_main work_other work_soc
5348 77 female some college white 1 5 6 10.500000 0.0 0.000000 0.0 0
4819 52 female some college white 1 4 5 9.500000 1.0 0.000000 0.0 0
2639 48 female bachelor white 1 1 5 7.916667 0.0 0.000000 0.0 0
175 65 female master white 1 5 2 12.866667 0.0 0.000000 0.0 0
9725 66 female bachelor white 1 2 5 5.316667 0.0 0.000000 0.0 0
2365 33 female master white 1 1 4 8.500000 0.0 13.533333 0.0 0
1773 61 female bachelor black 1 5 5 6.500000 0.0 0.000000 0.0 0
3595 49 female bachelor white 1 1 3 8.833333 0.0 8.583333 0.0 0
3085 85 female some college white 2 5 7 12.583333 0.0 0.000000 0.0 0
3647 61 female associate white 1 1 6 8.833333 0.0 9.000000 0.0 0
# Another way to do it is to write a function like this

def recode(x):
    educ_codes = {39:'high school', 40:'some college', 41:'associate', 42:'associate', 
              43: 'bachelor', 44: 'master', 45:'prof', 46:'phd'}
    return educ_codes[x]

# and use .map() 
# atus_small['edu'] = atus_small['edu'].map(recode)
  2. Print out a list of the unique values in the column 'edu'.
atus_small['edu'].unique()
array(['high school', 'some college', 35, 'master', 'bachelor', 36,
       'prof', 'associate', 'phd', 37, 34, 33, 38, 31, 32], dtype=object)
  3. Apply the minutes_to_hours function to the work_soc variable. This variable is the time spent 'Socializing, relaxing, and leisure as part of job'. I have no idea what kind of job makes you relax and take leisure.
atus_small['work_soc'] = atus_small['work_soc'].map(minutes_to_hours)
atus_small.head()
age sex edu race metro employ day sleep no_sleep work_main work_other work_soc
0 34 female high school black 1 1 1 12.133333 0.0 7.5 0.0 0.0
1 28 female some college white 1 5 7 6.416667 0.0 0.0 0.0 0.0
2 15 male 35 native_am 1 5 4 9.500000 0.0 0.0 0.0 0.0
3 46 male high school white 1 1 2 8.750000 0.0 8.0 0.0 0.0
4 85 male master white 1 1 7 12.600000 0.0 0.0 0.0 0.0
  4. Create a column named 'work' that is the sum of 'work_main' and 'work_other'.
atus_small['work'] = atus_small['work_main'] + atus_small['work_other']
atus_small.sample(10)
age sex edu race metro employ day sleep no_sleep work_main work_other work_soc work
9025 60 male 35 black 1 5 6 4.500000 0.00 0.0 0.0 0.0 0.0
7969 32 male bachelor asian 1 1 7 10.500000 0.75 0.0 0.0 0.0 0.0
8267 53 male associate white 1 1 7 9.416667 0.00 1.0 0.0 0.0 1.0
8706 49 male high school white 1 1 1 10.000000 0.00 0.0 0.0 0.0 0.0
3293 29 male some college white 1 5 5 9.000000 0.00 0.0 0.0 0.0 0.0
9549 28 male some college white 1 1 1 8.500000 0.00 0.0 0.0 0.0 0.0
3738 31 female high school black 1 4 7 8.500000 0.00 0.0 0.0 0.0 0.0
8301 48 female bachelor white 1 1 7 5.166667 0.00 0.5 0.0 0.0 0.5
5719 19 female high school white 1 1 7 8.750000 0.00 0.0 0.0 0.0 0.0
4924 41 female associate white 2 1 7 7.250000 0.00 0.0 0.0 0.0 0.0
  5. Create a histogram of 'work' for only those observations with work>0.
fig, ax = plt.subplots(figsize=(10,6))


ax.hist(atus_small.loc[ atus_small['work']>0, 'work'], bins=40, alpha = 0.4, color = 'red')

ax.set_xlabel('hours of work')
ax.set_title('Distribution of hours worked, conditional on working at least one minute')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()

png

Extra practice

If you finish early, try rewriting your solution to part 1, but code values less than 39 as "less than high school".

String methods

We have seen some of these before, when we learned about strings. These are analogous to the string methods in standard Python, but they have been optimized for DataFrames. These vectorized string methods work element-wise over an entire column. The method call looks like

data['var'].str.method()

where .str.method( ) is the method we are applying. A list of vectorized string methods is on page 218 in McKinney and in the documentation. Below, we try a few out.
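As a quick warm-up before the real data, a minimal sketch on a throwaway Series showing the element-wise behavior:

s = pd.Series(['Comedy', 'DRAMA', 'sci-fi'])
s.str.lower()   # -> 'comedy', 'drama', 'sci-fi', one element at a time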

MovieLens data set

We are going to work with the MovieLens ml-latest-small dataset, released by the GroupLens organization. It is meant to help build recommendation algorithms, like the ones behind Netflix or Spotify recommendations. [In 2006, Netflix started a contest, with a $1 million prize, for an algorithm that could beat its own.] GroupLens has other ratings datasets, too, on music, jokes, and books.

An observation is a movie.

movies = pd.read_csv('movies.csv')
movies.sample(10)
movieId title genres
7317 77233 Union: The Business Behind Getting High, The (... Comedy|Documentary
9213 151763 Death Note Rewrite: Genshisuru Kami (2007) Mystery
8661 121097 To Grandmother's House We Go (1992) Adventure|Children|Comedy
1928 2557 I Stand Alone (Seul contre tous) (1998) Drama|Thriller
4916 7371 Dogville (2003) Drama|Mystery|Thriller
233 271 Losing Isaiah (1995) Drama
285 327 Tank Girl (1995) Action|Comedy|Sci-Fi
6945 65350 General Died at Dawn, The (1936) Adventure|Crime|Thriller
7797 92094 Einstein and Eddington (2008) Drama
1877 2495 Fantastic Planet, The (Planète sauvage, La) (1... Animation|Sci-Fi

str.contains( )

The genres are mixed together. Let's get all the comedies. The .contains( ) method returns a bool Series with True for observations in which the string contains the search term.

movies['genres'].str.contains('Comedy')
0        True
1       False
2        True
3        True
4        True
        ...  
9737     True
9738     True
9739    False
9740    False
9741     True
Name: genres, Length: 9742, dtype: bool
print(movies.shape)
comedies = movies[movies['genres'].str.contains('Comedy')]
print(comedies.shape)
(9742, 3)
(3756, 3)
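One thing to keep in mind (not used above): .str.contains( ) treats its argument as a regular expression by default. For a search term with special characters, such as the '(no genres listed)' entries in this data, pass regex=False to match literally.

# match '(no genres listed)' as a literal string, not a regex group
movies['genres'].str.contains('(no genres listed)', regex=False)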

str.split( )

This method splits the string at the delimiter passed to .split( ). It returns a list of the chunks that fall between the delimiters.

This could be useful for processing data that come in the form last,first or city,state.
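For instance, a minimal sketch with hypothetical city,state data:

places = pd.Series(['Madison,WI', 'Austin,TX'])
places.str.split(',')          # -> [Madison, WI], [Austin, TX]
places.str.split(',').str[1]   # the .str indexer grabs the state from each list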

# The movie genres are separated with the '|' character. 
# Remember, DataFrames can have columns of lists...

movies['genre_split'] = movies['genres'].str.split('|')
movies.head()
movieId title genres genre_split
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy [Adventure, Animation, Children, Comedy, Fantasy]
1 2 Jumanji (1995) Adventure|Children|Fantasy [Adventure, Children, Fantasy]
2 3 Grumpier Old Men (1995) Comedy|Romance [Comedy, Romance]
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance [Comedy, Drama, Romance]
4 5 Father of the Bride Part II (1995) Comedy [Comedy]
movies.loc[0,'genre_split']
['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']

str.join( )

Put strings together, separating the pieces with a delimiter of your choosing. Note that .str.join( ) works on a column whose elements are lists (like 'genre_split'), joining each list into a single string.

movies['with_colons'] = movies['genre_split'].str.join('::')
movies.sample(5)
movieId title genres genre_split with_colons
8720 126420 American Heist (2015) Action [Action] Action
7721 90430 Carnage (2011) Comedy|Drama [Comedy, Drama] Comedy::Drama
9710 187595 Solo: A Star Wars Story (2018) Action|Adventure|Children|Sci-Fi [Action, Adventure, Children, Sci-Fi] Action::Adventure::Children::Sci-Fi
3686 5076 Adventures of Huck Finn, The (1993) Adventure|Children|Comedy|Drama [Adventure, Children, Comedy, Drama] Adventure::Children::Comedy::Drama
8412 110669 Honest Liar, An (2014) Comedy|Documentary [Comedy, Documentary] Comedy::Documentary

Practice

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.

Our data does not have a column for the year the movie was released. Let's create one. The year is part of the title string.

  1. Reload 'movies.csv'
  2. Use .str.strip() (docs) to remove any leading or trailing spaces from 'title'.

  3. Extract the four-digit year from the titles and put them into a new column named 'year'.

Notice that the year, including the parentheses, is always the last 6 characters of the title. You might try str.slice() and work with negative indexes to count from the end of 'title'.

If there is any extra space at the end of a title, it will mess up my algorithm! That's why we strip the extra spaces first.

# The negative notation counts backwards. I am relying on the fact that each title ends in '(year)'. 
# Not the most robust method.
movies = pd.read_csv('movies.csv')

movies['title'] = movies['title'].str.strip()
movies['year'] = movies['title'].str.slice(-5, -1)
movies.head(2)
movieId title genres year
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1995
1 2 Jumanji (1995) Adventure|Children|Fantasy 1995
  4. There are 12 movies that do not have a year in their title. Find them in your DataFrame. You might try the str.isdigit() (docs) method to see if the year you extracted in step 3 is numeric.
movies.loc[~movies['year'].str.isdigit()]
movieId title genres year
6059 40697 Babylon 5 Sci-Fi lon
9031 140956 Ready Player One Action|Sci-Fi|Thriller r On
9091 143410 Hyena Road (no genres listed) Roa
9138 147250 The Adventures of Sherlock Holmes and Doctor W... (no genres listed) atso
9179 149334 Nocturnal Animals Drama|Thriller imal
9259 156605 Paterson (no genres listed) erso
9367 162414 Moonlight Drama ligh
9448 167570 The OA (no genres listed) he O
9514 171495 Cosmos (no genres listed) osmo
9515 171631 Maria Bamford: Old Baby (no genres listed) Bab
9525 171891 Generation Iron 2 (no genres listed) ron
9611 176601 Black Mirror (no genres listed) irro
  5. For the movies without a year, set the 'year' variable to np.nan. Use .unique() to check that all the 'year' values are either numbers (as strings) or nan.

np.nan is the "not a number" object. Pandas will interpret it as a missing value.

movies.loc[~movies['year'].str.isdigit(), 'year'] = np.nan
movies['year'].unique()
array(['1995', '1994', '1996', '1976', '1992', '1967', '1993', '1964',
       '1977', '1965', '1982', '1990', '1991', '1989', '1937', '1940',
       '1969', '1981', '1973', '1970', '1955', '1959', '1968', '1988',
       '1997', '1972', '1943', '1952', '1951', '1957', '1961', '1958',
       '1954', '1934', '1944', '1960', '1963', '1942', '1941', '1953',
       '1939', '1950', '1946', '1945', '1938', '1947', '1935', '1936',
       '1956', '1949', '1932', '1975', '1974', '1971', '1979', '1987',
       '1986', '1980', '1978', '1985', '1966', '1962', '1983', '1984',
       '1948', '1933', '1931', '1922', '1998', '1929', '1930', '1927',
       '1928', '1999', '2000', '1926', '1919', '1921', '1925', '1923',
       '2001', '2002', '2003', '1920', '1915', '1924', '2004', '1916',
       '1917', '2005', '2006', '1902', nan, '1903', '2007', '2008',
       '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018', '1908'], dtype=object)
  6. Convert the year variable to datetime. Check your dtypes to verify.
movies['year'] = pd.to_datetime(movies['year'])
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   movieId  9742 non-null   int64         
 1   title    9742 non-null   object        
 2   genres   9742 non-null   object        
 3   year     9730 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 304.6+ KB

Extra practice (try at home)

If you finished early, compute the number of movies in the DataFrame for each year. You might try .groupby().

  1. Which year has the most observations?
movie_counts = movies[['year', 'movieId']].groupby('year').count()
most = movie_counts['movieId'].sort_values().tail(1).index.date[0].year
f'The year {most} has the most movies in the dataset.'
'The year 2002 has the most movies in the dataset.'

Here is something I have not figured out yet...

  2. How many movies are there in each decade? i.e., 1900–1909, 1910–1919, 1920–1929...
# I tried this, but I want the first bin to be 1900-01-01...
movie_counts.resample('10AS').sum()

# I know that I can use pd.cut(), but I think there is some way to get this to work by just resampling. 
# If you figure it out, let me know!
movieId
year
1902-01-01 3
1912-01-01 10
1922-01-01 53
1932-01-01 160
1942-01-01 197
1952-01-01 307
1962-01-01 410
1972-01-01 601
1982-01-01 1290
1992-01-01 2495
2002-01-01 2774
2012-01-01 1430
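A possible workaround, sketched under the assumption that 'year' is the datetime column built above: compute each movie's decade with integer division and group on that. It sidesteps resampling entirely, so it is closer in spirit to the pd.cut() route than to the pure-resample answer I am after.

movies['decade'] = movies['year'].dt.year // 10 * 10   # 1995 -> 1990, etc.
movies.groupby('decade')['movieId'].count()            # missing years drop out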