download notebook
view notebook w/o solutions
Merging
Files needed = (dogs.csv, steps.csv and ml-latest-small.zip from here)
We will often find ourselves with variables spread across different datasets and files. We merge datasets together by matching up the two datasets on one or more variables. For example, I might have GDP data by country from the Penn World Tables, and demographic data by country from the World Bank. We would merge these two datasets and match up the observations by country.
Some of the most powerful analysis comes from combining data from different sources.
import pandas as pd
import matplotlib.pyplot as plt
The data
Let's load the 'dogs.csv' file that tracks the snacks and walks for Buster, Su and Jax.
dogs = pd.read_csv('dogs.csv', parse_dates=['time'])
dogs.dtypes
dog object
var object
time datetime64[ns]
value int64
dtype: object
dogs.sample(5)
dog | var | time | value | |
---|---|---|---|---|
18 | Buster | snacks | 2018-10-07 | 2 |
15 | Buster | walks | 2018-10-07 | 3 |
17 | Buster | snacks | 2018-10-06 | 3 |
2 | Buster | snacks | 2018-10-01 | 4 |
33 | Jax | snacks | 2018-10-04 | 5 |
The data are stacked. I move everything but the values into the index, and then unstack 'var'.
dogs = dogs.set_index(['dog', 'time', 'var'])
dogs = dogs.unstack('var')
dogs
value | |||
---|---|---|---|
var | snacks | walks | |
dog | time | ||
Buster | 2018-10-01 | 4.0 | 2.0 |
2018-10-02 | 3.0 | 2.0 | |
2018-10-03 | 4.0 | NaN | |
2018-10-04 | 5.0 | 2.0 | |
2018-10-05 | 4.0 | 3.0 | |
2018-10-06 | 3.0 | 1.0 | |
2018-10-07 | 2.0 | 3.0 | |
Jax | 2018-10-04 | 5.0 | 3.0 |
2018-10-05 | 5.0 | 2.0 | |
2018-10-06 | 5.0 | 2.0 | |
2018-10-07 | 7.0 | 3.0 | |
Su | 2018-10-01 | 3.0 | 2.0 |
2018-10-02 | 4.0 | 2.0 | |
2018-10-03 | 4.0 | 2.0 | |
2018-10-04 | 3.0 | 3.0 | |
2018-10-05 | 3.0 | 2.0 | |
2018-10-06 | 4.0 | 4.0 | |
2018-10-07 | 2.0 | 1.0 |
I do not like the nuisance index level 'value' so I remove it.
Then I reset the index.
dogs = dogs.droplevel(0, axis=1)
dogs = dogs.reset_index()
dogs
var | dog | time | snacks | walks |
---|---|---|---|---|
0 | Buster | 2018-10-01 | 4.0 | 2.0 |
1 | Buster | 2018-10-02 | 3.0 | 2.0 |
2 | Buster | 2018-10-03 | 4.0 | NaN |
3 | Buster | 2018-10-04 | 5.0 | 2.0 |
4 | Buster | 2018-10-05 | 4.0 | 3.0 |
5 | Buster | 2018-10-06 | 3.0 | 1.0 |
6 | Buster | 2018-10-07 | 2.0 | 3.0 |
7 | Jax | 2018-10-04 | 5.0 | 3.0 |
8 | Jax | 2018-10-05 | 5.0 | 2.0 |
9 | Jax | 2018-10-06 | 5.0 | 2.0 |
10 | Jax | 2018-10-07 | 7.0 | 3.0 |
11 | Su | 2018-10-01 | 3.0 | 2.0 |
12 | Su | 2018-10-02 | 4.0 | 2.0 |
13 | Su | 2018-10-03 | 4.0 | 2.0 |
14 | Su | 2018-10-04 | 3.0 | 3.0 |
15 | Su | 2018-10-05 | 3.0 | 2.0 |
16 | Su | 2018-10-06 | 4.0 | 4.0 |
17 | Su | 2018-10-07 | 2.0 | 1.0 |
Steps data
I bought activity trackers for the team and their steps are recorded in an app. I downloaded their step data, and would like to add it to the dogs DataFrame. Let's load the steps file into a DataFrame and merge the two DataFrames.
steps = pd.read_csv('steps.csv', parse_dates=['date'])
steps.sample(5)
date | dog | steps | |
---|---|---|---|
21 | 2018-10-05 | Jax | 510 |
24 | 2018-10-08 | Jax | 605 |
6 | 2018-10-07 | Buster | 405 |
3 | 2018-10-04 | Buster | 550 |
5 | 2018-10-06 | Buster | 377 |
Steps is already in good shape.
Merge the two DataFrames into one DataFrame
We want to match the two DataFrames together according to the dog-date pairs. In database-ese, we refer to variables we are matching as keys. So, in our case, the keys are dog and date.
We also need to tell pandas how to treat keys that are not present in both databases. The different types of 'join' (more database-ese) are
- inner: keep the intersection of the keys
- left: keep all the keys from the left DataFrame
- right: keep all the keys from right DataFrame
- outer: keep all the keys from both DataFrames
We specify the join type with the how
parameter. The default is inner, but for sanity's sake, be explicit about your join.
Before we merge, let's review the two DataFrames.
print(dogs.head(2), '\n\n')
print(steps.head(2), '\n\n')
print('dogs has {0} rows and steps has {1} rows'.format(dogs.shape, steps.shape))
var dog time snacks walks
0 Buster 2018-10-01 4.0 2.0
1 Buster 2018-10-02 3.0 2.0
date dog steps
0 2018-10-01 Buster 503
1 2018-10-02 Buster 457
dogs has (18, 4) rows and steps has (26, 3) rows
- Both DataFrames have a column named
dog
that holds the dog names - The columns holding the date data have different names in the DataFrames
- The DataFrames have different numbers of observations
Let's see how .merge()
handles these issues.
Inner merge
- Use
left
andright
to label the DataFrames to merge - Use
left_on
andright_on
to identify the columns with keys. It's not a problem that one column is called 'dates' and the other 'time'
dogs_inner = pd.merge(left=dogs, right=steps, left_on=['dog', 'time'], right_on=['dog', 'date'], how='inner')
dogs_inner
dog | time | snacks | walks | date | steps | |
---|---|---|---|---|---|---|
0 | Buster | 2018-10-01 | 4.0 | 2.0 | 2018-10-01 | 503 |
1 | Buster | 2018-10-02 | 3.0 | 2.0 | 2018-10-02 | 457 |
2 | Buster | 2018-10-03 | 4.0 | NaN | 2018-10-03 | 299 |
3 | Buster | 2018-10-04 | 5.0 | 2.0 | 2018-10-04 | 550 |
4 | Buster | 2018-10-05 | 4.0 | 3.0 | 2018-10-05 | 476 |
5 | Buster | 2018-10-06 | 3.0 | 1.0 | 2018-10-06 | 377 |
6 | Buster | 2018-10-07 | 2.0 | 3.0 | 2018-10-07 | 405 |
7 | Jax | 2018-10-04 | 5.0 | 3.0 | 2018-10-04 | 465 |
8 | Jax | 2018-10-05 | 5.0 | 2.0 | 2018-10-05 | 510 |
9 | Jax | 2018-10-06 | 5.0 | 2.0 | 2018-10-06 | 563 |
10 | Jax | 2018-10-07 | 7.0 | 3.0 | 2018-10-07 | 549 |
11 | Su | 2018-10-01 | 3.0 | 2.0 | 2018-10-01 | 499 |
12 | Su | 2018-10-02 | 4.0 | 2.0 | 2018-10-02 | 468 |
13 | Su | 2018-10-04 | 3.0 | 3.0 | 2018-10-04 | 350 |
14 | Su | 2018-10-05 | 3.0 | 2.0 | 2018-10-05 | 537 |
15 | Su | 2018-10-06 | 4.0 | 4.0 | 2018-10-06 | 325 |
16 | Su | 2018-10-07 | 2.0 | 1.0 | 2018-10-07 | 387 |
The merged DataFrame has 17 rows—less than either of the DataFrames we started with. There are some rows that are only in dogs
and some rows that are only in steps
. Those rows are not included in the merged DataFrame when we use an inner merge.
Outer merge
An outer merge keeps all the keys—we do not lose any observations. I am adding the indicator
to help us see how the two DataFrames compare.
dogs_outer = pd.merge(left=dogs, right=steps,
left_on=['dog', 'time'], right_on=['dog', 'date'],
how='outer', indicator=True)
dogs_outer
dog | time | snacks | walks | date | steps | _merge | |
---|---|---|---|---|---|---|---|
0 | Buster | 2018-10-01 | 4.0 | 2.0 | 2018-10-01 | 503.0 | both |
1 | Buster | 2018-10-02 | 3.0 | 2.0 | 2018-10-02 | 457.0 | both |
2 | Buster | 2018-10-03 | 4.0 | NaN | 2018-10-03 | 299.0 | both |
3 | Buster | 2018-10-04 | 5.0 | 2.0 | 2018-10-04 | 550.0 | both |
4 | Buster | 2018-10-05 | 4.0 | 3.0 | 2018-10-05 | 476.0 | both |
5 | Buster | 2018-10-06 | 3.0 | 1.0 | 2018-10-06 | 377.0 | both |
6 | Buster | 2018-10-07 | 2.0 | 3.0 | 2018-10-07 | 405.0 | both |
7 | Jax | 2018-10-04 | 5.0 | 3.0 | 2018-10-04 | 465.0 | both |
8 | Jax | 2018-10-05 | 5.0 | 2.0 | 2018-10-05 | 510.0 | both |
9 | Jax | 2018-10-06 | 5.0 | 2.0 | 2018-10-06 | 563.0 | both |
10 | Jax | 2018-10-07 | 7.0 | 3.0 | 2018-10-07 | 549.0 | both |
11 | Su | 2018-10-01 | 3.0 | 2.0 | 2018-10-01 | 499.0 | both |
12 | Su | 2018-10-02 | 4.0 | 2.0 | 2018-10-02 | 468.0 | both |
13 | Su | 2018-10-03 | 4.0 | 2.0 | NaT | NaN | left_only |
14 | Su | 2018-10-04 | 3.0 | 3.0 | 2018-10-04 | 350.0 | both |
15 | Su | 2018-10-05 | 3.0 | 2.0 | 2018-10-05 | 537.0 | both |
16 | Su | 2018-10-06 | 4.0 | 4.0 | 2018-10-06 | 325.0 | both |
17 | Su | 2018-10-07 | 2.0 | 1.0 | 2018-10-07 | 387.0 | both |
18 | Buster | NaT | NaN | NaN | 2018-10-08 | 425.0 | right_only |
19 | Buster | NaT | NaN | NaN | 2018-10-09 | 512.0 | right_only |
20 | Su | NaT | NaN | NaN | 2018-10-08 | 450.0 | right_only |
21 | Su | NaT | NaN | NaN | 2018-10-09 | 515.0 | right_only |
22 | Jax | NaT | NaN | NaN | 2018-10-01 | 475.0 | right_only |
23 | Jax | NaT | NaN | NaN | 2018-10-02 | 450.0 | right_only |
24 | Jax | NaT | NaN | NaN | 2018-10-03 | 501.0 | right_only |
25 | Jax | NaT | NaN | NaN | 2018-10-08 | 605.0 | right_only |
26 | Jax | NaT | NaN | NaN | 2018-10-09 | 465.0 | right_only |
The _merge
variables helps us understand the inner merge, too. The observation Su 2018-10-03 is only in dogs
, so it was dropped in the inner merge. The last nine rows are only in the steps
data. Notice that pandas filled in NaN and NaT (not a time) where needed. Nice.
Left merge
This merge keeps all the keys in the left DataFrame and only keeps the keys in the right DataFrame that have matches in the left DataFrame.
dogs_left = pd.merge(left=dogs, right=steps,
left_on=['dog', 'time'], right_on=['dog', 'date'],
how='left', indicator=True)
dogs_left
dog | time | snacks | walks | date | steps | _merge | |
---|---|---|---|---|---|---|---|
0 | Buster | 2018-10-01 | 4.0 | 2.0 | 2018-10-01 | 503.0 | both |
1 | Buster | 2018-10-02 | 3.0 | 2.0 | 2018-10-02 | 457.0 | both |
2 | Buster | 2018-10-03 | 4.0 | NaN | 2018-10-03 | 299.0 | both |
3 | Buster | 2018-10-04 | 5.0 | 2.0 | 2018-10-04 | 550.0 | both |
4 | Buster | 2018-10-05 | 4.0 | 3.0 | 2018-10-05 | 476.0 | both |
5 | Buster | 2018-10-06 | 3.0 | 1.0 | 2018-10-06 | 377.0 | both |
6 | Buster | 2018-10-07 | 2.0 | 3.0 | 2018-10-07 | 405.0 | both |
7 | Jax | 2018-10-04 | 5.0 | 3.0 | 2018-10-04 | 465.0 | both |
8 | Jax | 2018-10-05 | 5.0 | 2.0 | 2018-10-05 | 510.0 | both |
9 | Jax | 2018-10-06 | 5.0 | 2.0 | 2018-10-06 | 563.0 | both |
10 | Jax | 2018-10-07 | 7.0 | 3.0 | 2018-10-07 | 549.0 | both |
11 | Su | 2018-10-01 | 3.0 | 2.0 | 2018-10-01 | 499.0 | both |
12 | Su | 2018-10-02 | 4.0 | 2.0 | 2018-10-02 | 468.0 | both |
13 | Su | 2018-10-03 | 4.0 | 2.0 | NaT | NaN | left_only |
14 | Su | 2018-10-04 | 3.0 | 3.0 | 2018-10-04 | 350.0 | both |
15 | Su | 2018-10-05 | 3.0 | 2.0 | 2018-10-05 | 537.0 | both |
16 | Su | 2018-10-06 | 4.0 | 4.0 | 2018-10-06 | 325.0 | both |
17 | Su | 2018-10-07 | 2.0 | 1.0 | 2018-10-07 | 387.0 | both |
The left merge keeps the observation Su 2018-10-03 even though it doesn't have a match in steps
.
The right merge works analogously.
Merging on the index
We can also merge on the index. replace the left_on
and right_on
with left_index
and right_index
.
The names of index columns need to match in this case.
dogs.rename(columns={'time':'date'}, inplace=True)
dogs.set_index(['dog', 'date'], inplace=True)
steps.set_index(['dog', 'date'], inplace=True)
x = pd.merge(left=dogs, right=steps, left_index=True, right_index=True, how='outer', indicator=True)
x.head(30)
snacks | walks | steps | _merge | ||
---|---|---|---|---|---|
dog | date | ||||
Buster | 2018-10-01 | 4.0 | 2.0 | 503.0 | both |
2018-10-02 | 3.0 | 2.0 | 457.0 | both | |
2018-10-03 | 4.0 | NaN | 299.0 | both | |
2018-10-04 | 5.0 | 2.0 | 550.0 | both | |
2018-10-05 | 4.0 | 3.0 | 476.0 | both | |
2018-10-06 | 3.0 | 1.0 | 377.0 | both | |
2018-10-07 | 2.0 | 3.0 | 405.0 | both | |
2018-10-08 | NaN | NaN | 425.0 | right_only | |
2018-10-09 | NaN | NaN | 512.0 | right_only | |
Jax | 2018-10-01 | NaN | NaN | 475.0 | right_only |
2018-10-02 | NaN | NaN | 450.0 | right_only | |
2018-10-03 | NaN | NaN | 501.0 | right_only | |
2018-10-04 | 5.0 | 3.0 | 465.0 | both | |
2018-10-05 | 5.0 | 2.0 | 510.0 | both | |
2018-10-06 | 5.0 | 2.0 | 563.0 | both | |
2018-10-07 | 7.0 | 3.0 | 549.0 | both | |
2018-10-08 | NaN | NaN | 605.0 | right_only | |
2018-10-09 | NaN | NaN | 465.0 | right_only | |
Su | 2018-10-01 | 3.0 | 2.0 | 499.0 | both |
2018-10-02 | 4.0 | 2.0 | 468.0 | both | |
2018-10-03 | 4.0 | 2.0 | NaN | left_only | |
2018-10-04 | 3.0 | 3.0 | 350.0 | both | |
2018-10-05 | 3.0 | 2.0 | 537.0 | both | |
2018-10-06 | 4.0 | 4.0 | 325.0 | both | |
2018-10-07 | 2.0 | 1.0 | 387.0 | both | |
2018-10-08 | NaN | NaN | 450.0 | right_only | |
2018-10-09 | NaN | NaN | 515.0 | right_only |
Practice
Head over to MovieLens and download the ml-latest-small dataset. It will come as a zipped file. Put it in your user drive (and cwd) and unzip it. You will find 5 files.
Each user ranks movies and can tag movies. Users have a userId
and movies have a movieId
.
* 'movies.csv' holds the description of movies. An observation is a movie.
* 'ratings.csv' holds user ratings of movies. An observation is a user-movie
* 'tags.csv' holds the tags a user assigns to a movie. A observation is a user-movie. A user can add more than one tag per movie.
Movies ratings by genre
Suppose we would like to know if comedies are, on average, rated better than adventure movies. We need data on both ratings and the movie genre.
- Load the 'movies.csv' and the 'ratings.csv' files as DataFrames
- What size are the two DataFrames?
movies = pd.read_csv('ml-latest-small/ml-latest-small/movies.csv')
print(movies.head(2))
print(movies.shape)
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
(9742, 3)
ratings = pd.read_csv('ml-latest-small/ml-latest-small/ratings.csv')
print(ratings.loc[ratings['userId']==1,])
print(ratings.shape)
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
.. ... ... ... ...
227 1 3744 4.0 964980694
228 1 3793 5.0 964981855
229 1 3809 4.0 964981220
230 1 4006 4.0 964982903
231 1 5060 5.0 964984002
[232 rows x 4 columns]
(100836, 4)
-
Merge the two DataFrames. In the new DataFrame, each row should hold the rating of a movie and the movies genres. We want to keep all the observations in the ratings file. Call your merged DataFrame 'ratings_by_genre'.
-
What size is the resulting DataFrame? Are there any unmatched ratings?
# Each movie shows up several times in the ratings file (each movie is reviewed by many raters)
# We use a left join to keep all the observations in the ratings file
ratings_by_genre = pd.merge(left=ratings, right=movies, on='movieId', how='left', indicator=True)
print(ratings_by_genre.head(2))
print(ratings_by_genre.shape)
userId movieId rating timestamp title \
0 1 1 4.0 964982703 Toy Story (1995)
1 1 3 4.0 964981247 Grumpier Old Men (1995)
genres _merge
0 Adventure|Animation|Children|Comedy|Fantasy both
1 Comedy|Romance both
(100836, 7)
Movie ratings by tags
Suppose we would like to know if Disney movies are, on average, rated better than superhero movies. We do not have a genre named "Disney" or "superhero", but some users have tagged movies as Disney and superhero.
-
Load the 'tags.csv' file.
-
Merge it with the 'ratings' DataFrame. Name this DataFrame
ratings_by_tags
.
tags = pd.read_csv('ml-latest-small/ml-latest-small/tags.csv')
print(tags.head(2))
print(tags.shape)
userId movieId tag timestamp
0 2 60756 funny 1445714994
1 2 60756 Highly quotable 1445714996
(3683, 4)
ratings_by_tags = pd.merge(left=ratings, right=tags, on=['userId', 'movieId'], how='right', indicator=True)
ratings_by_tags.head(2)
userId | movieId | rating | timestamp_x | tag | timestamp_y | _merge | |
---|---|---|---|---|---|---|---|
0 | 2 | 60756 | 5.0 | 1.445715e+09 | funny | 1445714994 | both |
1 | 2 | 60756 | 5.0 | 1.445715e+09 | Highly quotable | 1445714996 | both |
- How many tags could be matched with a movie rating? How many rating are untagged?
ratings_by_tags['_merge'].value_counts()
both 3476
right_only 207
left_only 0
Name: _merge, dtype: int64
- What are the 10 most used tags?
ratings_by_tags['tag'].value_counts().head(10)
In Netflix queue 131
atmospheric 36
thought-provoking 24
superhero 24
funny 23
surreal 23
Disney 23
religion 22
sci-fi 21
quirky 21
Name: tag, dtype: int64
Extra practice (try at home)
Are movies tagged Disney better than movies tagged superhero?
- Compute the average rating for movies with superhero tags and for movies tagged Disney.
dis = ratings_by_tags[ratings_by_tags['tag']=='Disney']
dis_mean = dis['rating'].mean()
sup = ratings_by_tags[ratings_by_tags['tag']=='superhero']
super_mean = sup['rating'].mean()
# It's close!
print('The Disney mean rating is {0:.3} and the superhero rating is {1:.3}.'.format(dis_mean, super_mean))
The Disney mean rating is 3.54 and the superhero rating is 3.63.
-
Let's see how the ratings of Disney movies compare to superhero movies. We will plot two histograms on the same axes.
- Plot the ratings histogram of movies tagged 'Disney'. Use 10 bins. Make the plot blue.
- Plot the ratings histogram of superhero movies. Use 10 bins. make the plot red.
- Add a legend
- Add the mean ratings for superhero movies and for superhero movies as text to the histogram
fig, ax = plt.subplots(figsize=(10,4))
# Two histograms one-atop the other
ax.hist(dis['rating'], bins=10, color = 'blue', alpha = 0.25, density=True, label = 'Tagged Disney', rwidth=0.9)
ax.hist(sup['rating'], bins=10, color = 'red', alpha = 0.25, density=True, label = 'Superhero movies', rwidth=0.9)
ax.set_title('MovieLens Ratings')
ax.set_xlabel('Rating')
ax.set_ylabel('Density')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.text( 1.4, 0.6, 'Mean rating for superhero movies: {0:.3}'.format(super_mean))
ax.text( 1.4, 0.52, 'Mean rating for disney movies: {0:.3}'.format(dis_mean))
ax.legend(frameon=False)
plt.show()
# We do not have many observations, so we shouldn't read too much into this...but, Disney movies have had fewer flops.
# You could try doing this exercise with the full movie lens dataset. (https://files.grouplens.org/datasets/movielens/ml-latest.zip)
# We did get some good practice in, though.