download notebook
view notebook w/o solutions

Merging

Files needed = (dogs.csv, steps.csv and ml-latest-small.zip from here)

We will often find ourselves with variables spread across different datasets and files. We merge datasets together by matching up the two datasets on one or more variables. For example, I might have GDP data by country from the Penn World Tables, and demographic data by country from the World Bank. We would merge these two datasets and match up the observations by country.

Some of the most powerful analysis comes from combining data from different sources.

import pandas as pd
import matplotlib.pyplot as plt

The data

Let's load the 'dogs.csv' file that tracks the snacks and walks for Buster, Su and Jax.

dogs = pd.read_csv('dogs.csv', parse_dates=['time'])
dogs.dtypes

dog              object
var              object
time     datetime64[ns]
value             int64
dtype: object

dogs.sample(5)

	dog	var	time	value
18	Buster	snacks	2018-10-07	2
15	Buster	walks	2018-10-07	3
17	Buster	snacks	2018-10-06	3
2	Buster	snacks	2018-10-01	4
33	Jax	snacks	2018-10-04	5

The data are stacked. I move everything but the values into the index, and then unstack 'var'.

dogs = dogs.set_index(['dog', 'time', 'var'])
dogs = dogs.unstack('var')
dogs

		value
	var	snacks	walks
dog	time
Buster	2018-10-01	4.0	2.0
	2018-10-02	3.0	2.0
	2018-10-03	4.0	NaN
	2018-10-04	5.0	2.0
	2018-10-05	4.0	3.0
	2018-10-06	3.0	1.0
	2018-10-07	2.0	3.0
Jax	2018-10-04	5.0	3.0
	2018-10-05	5.0	2.0
	2018-10-06	5.0	2.0
	2018-10-07	7.0	3.0
Su	2018-10-01	3.0	2.0
	2018-10-02	4.0	2.0
	2018-10-03	4.0	2.0
	2018-10-04	3.0	3.0
	2018-10-05	3.0	2.0
	2018-10-06	4.0	4.0
	2018-10-07	2.0	1.0

I do not like the nuisance index level 'value' so I remove it.

Then I reset the index.

dogs = dogs.droplevel(0, axis=1)
dogs = dogs.reset_index()
dogs

var	dog	time	snacks	walks
0	Buster	2018-10-01	4.0	2.0
1	Buster	2018-10-02	3.0	2.0
2	Buster	2018-10-03	4.0	NaN
3	Buster	2018-10-04	5.0	2.0
4	Buster	2018-10-05	4.0	3.0
5	Buster	2018-10-06	3.0	1.0
6	Buster	2018-10-07	2.0	3.0
7	Jax	2018-10-04	5.0	3.0
8	Jax	2018-10-05	5.0	2.0
9	Jax	2018-10-06	5.0	2.0
10	Jax	2018-10-07	7.0	3.0
11	Su	2018-10-01	3.0	2.0
12	Su	2018-10-02	4.0	2.0
13	Su	2018-10-03	4.0	2.0
14	Su	2018-10-04	3.0	3.0
15	Su	2018-10-05	3.0	2.0
16	Su	2018-10-06	4.0	4.0
17	Su	2018-10-07	2.0	1.0

Steps data

I bought activity trackers for the team and their steps are recorded in an app. I downloaded their step data, and would like to add it to the dogs DataFrame. Let's load the steps file into a DataFrame and merge the two DataFrames.

steps = pd.read_csv('steps.csv', parse_dates=['date'])
steps.sample(5)

	date	dog	steps
21	2018-10-05	Jax	510
24	2018-10-08	Jax	605
6	2018-10-07	Buster	405
3	2018-10-04	Buster	550
5	2018-10-06	Buster	377

Steps is already in good shape.

Merge the two DataFrames into one DataFrame

We want to match the two DataFrames together according to the dog-date pairs. In database-ese, we refer to variables we are matching as keys. So, in our case, the keys are dog and date.

We also need to tell pandas how to treat keys that are not present in both databases. The different types of 'join' (more database-ese) are

inner: keep the intersection of the keys
left: keep all the keys from the left DataFrame
right: keep all the keys from right DataFrame
outer: keep all the keys from both DataFrames

We specify the join type with the how parameter. The default is inner, but for sanity's sake, be explicit about your join.

Before we merge, let's review the two DataFrames.

print(dogs.head(2), '\n\n')
print(steps.head(2), '\n\n')
print('dogs has {0} rows and steps has {1} rows'.format(dogs.shape, steps.shape))

var     dog       time  snacks  walks
0    Buster 2018-10-01     4.0    2.0
1    Buster 2018-10-02     3.0    2.0


        date     dog  steps
0 2018-10-01  Buster    503
1 2018-10-02  Buster    457


dogs has (18, 4) rows and steps has (26, 3) rows

Both DataFrames have a column named dog that holds the dog names
The columns holding the date data have different names in the DataFrames
The DataFrames have different numbers of observations

Let's see how .merge() handles these issues.

Inner merge

Use left and right to label the DataFrames to merge
Use left_on and right_on to identify the columns with keys. It's not a problem that one column is called 'dates' and the other 'time'

dogs_inner = pd.merge(left=dogs, right=steps, left_on=['dog', 'time'], right_on=['dog', 'date'], how='inner')
dogs_inner

	dog	time	snacks	walks	date	steps
0	Buster	2018-10-01	4.0	2.0	2018-10-01	503
1	Buster	2018-10-02	3.0	2.0	2018-10-02	457
2	Buster	2018-10-03	4.0	NaN	2018-10-03	299
3	Buster	2018-10-04	5.0	2.0	2018-10-04	550
4	Buster	2018-10-05	4.0	3.0	2018-10-05	476
5	Buster	2018-10-06	3.0	1.0	2018-10-06	377
6	Buster	2018-10-07	2.0	3.0	2018-10-07	405
7	Jax	2018-10-04	5.0	3.0	2018-10-04	465
8	Jax	2018-10-05	5.0	2.0	2018-10-05	510
9	Jax	2018-10-06	5.0	2.0	2018-10-06	563
10	Jax	2018-10-07	7.0	3.0	2018-10-07	549
11	Su	2018-10-01	3.0	2.0	2018-10-01	499
12	Su	2018-10-02	4.0	2.0	2018-10-02	468
13	Su	2018-10-04	3.0	3.0	2018-10-04	350
14	Su	2018-10-05	3.0	2.0	2018-10-05	537
15	Su	2018-10-06	4.0	4.0	2018-10-06	325
16	Su	2018-10-07	2.0	1.0	2018-10-07	387

The merged DataFrame has 17 rows—less than either of the DataFrames we started with. There are some rows that are only in dogs and some rows that are only in steps. Those rows are not included in the merged DataFrame when we use an inner merge.

Outer merge

An outer merge keeps all the keys—we do not lose any observations. I am adding the indicator to help us see how the two DataFrames compare.

dogs_outer = pd.merge(left=dogs, right=steps, 
                      left_on=['dog', 'time'], right_on=['dog', 'date'], 
                      how='outer', indicator=True)
dogs_outer

	dog	time	snacks	walks	date	steps	_merge
0	Buster	2018-10-01	4.0	2.0	2018-10-01	503.0	both
1	Buster	2018-10-02	3.0	2.0	2018-10-02	457.0	both
2	Buster	2018-10-03	4.0	NaN	2018-10-03	299.0	both
3	Buster	2018-10-04	5.0	2.0	2018-10-04	550.0	both
4	Buster	2018-10-05	4.0	3.0	2018-10-05	476.0	both
5	Buster	2018-10-06	3.0	1.0	2018-10-06	377.0	both
6	Buster	2018-10-07	2.0	3.0	2018-10-07	405.0	both
7	Jax	2018-10-04	5.0	3.0	2018-10-04	465.0	both
8	Jax	2018-10-05	5.0	2.0	2018-10-05	510.0	both
9	Jax	2018-10-06	5.0	2.0	2018-10-06	563.0	both
10	Jax	2018-10-07	7.0	3.0	2018-10-07	549.0	both
11	Su	2018-10-01	3.0	2.0	2018-10-01	499.0	both
12	Su	2018-10-02	4.0	2.0	2018-10-02	468.0	both
13	Su	2018-10-03	4.0	2.0	NaT	NaN	left_only
14	Su	2018-10-04	3.0	3.0	2018-10-04	350.0	both
15	Su	2018-10-05	3.0	2.0	2018-10-05	537.0	both
16	Su	2018-10-06	4.0	4.0	2018-10-06	325.0	both
17	Su	2018-10-07	2.0	1.0	2018-10-07	387.0	both
18	Buster	NaT	NaN	NaN	2018-10-08	425.0	right_only
19	Buster	NaT	NaN	NaN	2018-10-09	512.0	right_only
20	Su	NaT	NaN	NaN	2018-10-08	450.0	right_only
21	Su	NaT	NaN	NaN	2018-10-09	515.0	right_only
22	Jax	NaT	NaN	NaN	2018-10-01	475.0	right_only
23	Jax	NaT	NaN	NaN	2018-10-02	450.0	right_only
24	Jax	NaT	NaN	NaN	2018-10-03	501.0	right_only
25	Jax	NaT	NaN	NaN	2018-10-08	605.0	right_only
26	Jax	NaT	NaN	NaN	2018-10-09	465.0	right_only

The _merge variables helps us understand the inner merge, too. The observation Su 2018-10-03 is only in dogs, so it was dropped in the inner merge. The last nine rows are only in the steps data. Notice that pandas filled in NaN and NaT (not a time) where needed. Nice.

Left merge

This merge keeps all the keys in the left DataFrame and only keeps the keys in the right DataFrame that have matches in the left DataFrame.

dogs_left = pd.merge(left=dogs, right=steps, 
                     left_on=['dog', 'time'], right_on=['dog', 'date'], 
                     how='left', indicator=True)
dogs_left

	dog	time	snacks	walks	date	steps	_merge
0	Buster	2018-10-01	4.0	2.0	2018-10-01	503.0	both
1	Buster	2018-10-02	3.0	2.0	2018-10-02	457.0	both
2	Buster	2018-10-03	4.0	NaN	2018-10-03	299.0	both
3	Buster	2018-10-04	5.0	2.0	2018-10-04	550.0	both
4	Buster	2018-10-05	4.0	3.0	2018-10-05	476.0	both
5	Buster	2018-10-06	3.0	1.0	2018-10-06	377.0	both
6	Buster	2018-10-07	2.0	3.0	2018-10-07	405.0	both
7	Jax	2018-10-04	5.0	3.0	2018-10-04	465.0	both
8	Jax	2018-10-05	5.0	2.0	2018-10-05	510.0	both
9	Jax	2018-10-06	5.0	2.0	2018-10-06	563.0	both
10	Jax	2018-10-07	7.0	3.0	2018-10-07	549.0	both
11	Su	2018-10-01	3.0	2.0	2018-10-01	499.0	both
12	Su	2018-10-02	4.0	2.0	2018-10-02	468.0	both
13	Su	2018-10-03	4.0	2.0	NaT	NaN	left_only
14	Su	2018-10-04	3.0	3.0	2018-10-04	350.0	both
15	Su	2018-10-05	3.0	2.0	2018-10-05	537.0	both
16	Su	2018-10-06	4.0	4.0	2018-10-06	325.0	both
17	Su	2018-10-07	2.0	1.0	2018-10-07	387.0	both

The left merge keeps the observation Su 2018-10-03 even though it doesn't have a match in steps.

The right merge works analogously.

Merging on the index

We can also merge on the index. replace the left_on and right_on with left_index and right_index.

The names of index columns need to match in this case.

dogs.rename(columns={'time':'date'}, inplace=True)
dogs.set_index(['dog', 'date'], inplace=True)
steps.set_index(['dog', 'date'], inplace=True)

x = pd.merge(left=dogs, right=steps, left_index=True, right_index=True, how='outer', indicator=True)
x.head(30)

		snacks	walks	steps	_merge
dog	date
Buster	2018-10-01	4.0	2.0	503.0	both
	2018-10-02	3.0	2.0	457.0	both
	2018-10-03	4.0	NaN	299.0	both
	2018-10-04	5.0	2.0	550.0	both
	2018-10-05	4.0	3.0	476.0	both
	2018-10-06	3.0	1.0	377.0	both
	2018-10-07	2.0	3.0	405.0	both
	2018-10-08	NaN	NaN	425.0	right_only
	2018-10-09	NaN	NaN	512.0	right_only
Jax	2018-10-01	NaN	NaN	475.0	right_only
	2018-10-02	NaN	NaN	450.0	right_only
	2018-10-03	NaN	NaN	501.0	right_only
	2018-10-04	5.0	3.0	465.0	both
	2018-10-05	5.0	2.0	510.0	both
	2018-10-06	5.0	2.0	563.0	both
	2018-10-07	7.0	3.0	549.0	both
	2018-10-08	NaN	NaN	605.0	right_only
	2018-10-09	NaN	NaN	465.0	right_only
Su	2018-10-01	3.0	2.0	499.0	both
	2018-10-02	4.0	2.0	468.0	both
	2018-10-03	4.0	2.0	NaN	left_only
	2018-10-04	3.0	3.0	350.0	both
	2018-10-05	3.0	2.0	537.0	both
	2018-10-06	4.0	4.0	325.0	both
	2018-10-07	2.0	1.0	387.0	both
	2018-10-08	NaN	NaN	450.0	right_only
	2018-10-09	NaN	NaN	515.0	right_only

Practice

Head over to MovieLens and download the ml-latest-small dataset. It will come as a zipped file. Put it in your user drive (and cwd) and unzip it. You will find 5 files.

Each user ranks movies and can tag movies. Users have a userId and movies have a movieId. * 'movies.csv' holds the description of movies. An observation is a movie. * 'ratings.csv' holds user ratings of movies. An observation is a user-movie * 'tags.csv' holds the tags a user assigns to a movie. A observation is a user-movie. A user can add more than one tag per movie.

Movies ratings by genre

Suppose we would like to know if comedies are, on average, rated better than adventure movies. We need data on both ratings and the movie genre.

Load the 'movies.csv' and the 'ratings.csv' files as DataFrames
What size are the two DataFrames?

movies = pd.read_csv('ml-latest-small/ml-latest-small/movies.csv')
print(movies.head(2))
print(movies.shape)

   movieId             title                                       genres
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2    Jumanji (1995)                   Adventure|Children|Fantasy
(9742, 3)

ratings = pd.read_csv('ml-latest-small/ml-latest-small/ratings.csv')
print(ratings.loc[ratings['userId']==1,])
print(ratings.shape)

     userId  movieId  rating  timestamp
0         1        1     4.0  964982703
1         1        3     4.0  964981247
2         1        6     4.0  964982224
3         1       47     5.0  964983815
4         1       50     5.0  964982931
..      ...      ...     ...        ...
227       1     3744     4.0  964980694
228       1     3793     5.0  964981855
229       1     3809     4.0  964981220
230       1     4006     4.0  964982903
231       1     5060     5.0  964984002

[232 rows x 4 columns]
(100836, 4)

Merge the two DataFrames. In the new DataFrame, each row should hold the rating of a movie and the movies genres. We want to keep all the observations in the ratings file. Call your merged DataFrame 'ratings_by_genre'.
What size is the resulting DataFrame? Are there any unmatched ratings?

# Each movie shows up several times in the ratings file (each movie is reviewed by many raters)
# We use a left join to keep all the observations in the ratings file 

ratings_by_genre = pd.merge(left=ratings, right=movies, on='movieId', how='left', indicator=True)
print(ratings_by_genre.head(2))
print(ratings_by_genre.shape)

   userId  movieId  rating  timestamp                    title  \
0       1        1     4.0  964982703         Toy Story (1995)   
1       1        3     4.0  964981247  Grumpier Old Men (1995)

                                        genres _merge  
0  Adventure|Animation|Children|Comedy|Fantasy   both  
1                               Comedy|Romance   both  
(100836, 7)

Movie ratings by tags

Suppose we would like to know if Disney movies are, on average, rated better than superhero movies. We do not have a genre named "Disney" or "superhero", but some users have tagged movies as Disney and superhero.

Load the 'tags.csv' file.
Merge it with the 'ratings' DataFrame. Name this DataFrame ratings_by_tags.

tags = pd.read_csv('ml-latest-small/ml-latest-small/tags.csv')
print(tags.head(2))
print(tags.shape)

   userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
(3683, 4)

ratings_by_tags = pd.merge(left=ratings, right=tags, on=['userId', 'movieId'], how='right', indicator=True)
ratings_by_tags.head(2)

	userId	movieId	rating	timestamp_x	tag	timestamp_y	_merge
0	2	60756	5.0	1.445715e+09	funny	1445714994	both
1	2	60756	5.0	1.445715e+09	Highly quotable	1445714996	both

How many tags could be matched with a movie rating? How many rating are untagged?

ratings_by_tags['_merge'].value_counts()

both          3476
right_only     207
left_only        0
Name: _merge, dtype: int64

What are the 10 most used tags?

ratings_by_tags['tag'].value_counts().head(10)

In Netflix queue     131
atmospheric           36
thought-provoking     24
superhero             24
funny                 23
surreal               23
Disney                23
religion              22
sci-fi                21
quirky                21
Name: tag, dtype: int64

Extra practice (try at home)

Are movies tagged Disney better than movies tagged superhero?

Compute the average rating for movies with superhero tags and for movies tagged Disney.

dis = ratings_by_tags[ratings_by_tags['tag']=='Disney']
dis_mean = dis['rating'].mean()

sup = ratings_by_tags[ratings_by_tags['tag']=='superhero']
super_mean = sup['rating'].mean()

# It's close!
print('The Disney mean rating is {0:.3} and the superhero rating is {1:.3}.'.format(dis_mean, super_mean))

The Disney mean rating is 3.54 and the superhero rating is 3.63.

Let's see how the ratings of Disney movies compare to superhero movies. We will plot two histograms on the same axes.
1. Plot the ratings histogram of movies tagged 'Disney'. Use 10 bins. Make the plot blue.
2. Plot the ratings histogram of superhero movies. Use 10 bins. make the plot red.
3. Add a legend
4. Add the mean ratings for superhero movies and for superhero movies as text to the histogram

fig, ax = plt.subplots(figsize=(10,4))

# Two histograms one-atop the other
ax.hist(dis['rating'], bins=10, color = 'blue', alpha = 0.25, density=True, label = 'Tagged Disney', rwidth=0.9)
ax.hist(sup['rating'], bins=10, color = 'red', alpha = 0.25, density=True, label = 'Superhero movies', rwidth=0.9)

ax.set_title('MovieLens Ratings')
ax.set_xlabel('Rating')
ax.set_ylabel('Density')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.text( 1.4, 0.6, 'Mean rating for superhero movies: {0:.3}'.format(super_mean))
ax.text( 1.4, 0.52, 'Mean rating for disney movies: {0:.3}'.format(dis_mean))

ax.legend(frameon=False)
plt.show()

png

# We do not have many observations, so we shouldn't read too much into this...but, Disney movies have had fewer flops. 
# You could try doing this exercise with the full movie lens dataset. (https://files.grouplens.org/datasets/movielens/ml-latest.zip)  

# We did get some good practice in, though.