download notebook
view notebook w/ solutions

Merging

Files needed = (dogs.csv, steps.csv and ml-latest-small.zip from here)

We will often find ourselves with variables spread across different datasets and files. We merge datasets together by matching up the two datasets on one or more variables. For example, I might have GDP data by country from the Penn World Tables, and demographic data by country from the World Bank. We would merge these two datasets and match up the observations by country.

Some of the most powerful analysis comes from combining data from different sources.

import pandas as pd
import matplotlib.pyplot as plt

The data

Let's load the 'dogs.csv' file that tracks the snacks and walks for Buster, Su and Jax.

dogs = pd.read_csv('dogs.csv', parse_dates=['time'])
dogs.dtypes
dogs.sample(5)

The data are stacked. I move everything but the values into the index, and then unstack 'var'.

dogs = dogs.set_index(['dog', 'time', 'var'])
dogs = dogs.unstack('var')
dogs

I do not like the nuisance index level 'value' so I remove it.

Then I reset the index.

dogs = dogs.droplevel(0, axis=1)
dogs = dogs.reset_index()
dogs

Steps data

I bought activity trackers for the team and their steps are recorded in an app. I downloaded their step data, and would like to add it to the dogs DataFrame. Let's load the steps file into a DataFrame and merge the two DataFrames.

steps = pd.read_csv('steps.csv', parse_dates=['date'])
steps.sample(5)

Steps is already in good shape.

Merge the two DataFrames into one DataFrame

We want to match the two DataFrames together according to the dog-date pairs. In database-ese, we refer to variables we are matching as keys. So, in our case, the keys are dog and date.

We also need to tell pandas how to treat keys that are not present in both databases. The different types of 'join' (more database-ese) are

  1. inner: keep the intersection of the keys
  2. left: keep all the keys from the left DataFrame
  3. right: keep all the keys from right DataFrame
  4. outer: keep all the keys from both DataFrames

We specify the join type with the how parameter. The default is inner, but for sanity's sake, be explicit about your join.

Before we merge, let's review the two DataFrames.

print(dogs.head(2), '\n\n')
print(steps.head(2), '\n\n')
print('dogs has {0} rows and steps has {1} rows'.format(dogs.shape, steps.shape))
  • Both DataFrames have a column named dog that holds the dog names
  • The columns holding the date data have different names in the DataFrames
  • The DataFrames have different numbers of observations

Let's see how .merge() handles these issues.

Inner merge

  • Use left and right to label the DataFrames to merge
  • Use left_on and right_on to identify the columns with keys. It's not a problem that one column is called 'dates' and the other 'time'
dogs_inner = pd.merge(left=dogs, right=steps, left_on=['dog', 'time'], right_on=['dog', 'date'], how='inner')
dogs_inner

The merged DataFrame has 17 rows—less than either of the DataFrames we started with. There are some rows that are only in dogs and some rows that are only in steps. Those rows are not included in the merged DataFrame when we use an inner merge.

Outer merge

An outer merge keeps all the keys—we do not lose any observations. I am adding the indicator to help us see how the two DataFrames compare.

dogs_outer = pd.merge(left=dogs, right=steps, 
                      left_on=['dog', 'time'], right_on=['dog', 'date'], 
                      how='outer', indicator=True)
dogs_outer

The _merge variables helps us understand the inner merge, too. The observation Su 2018-10-03 is only in dogs, so it was dropped in the inner merge. The last nine rows are only in the steps data. Notice that pandas filled in NaN and NaT (not a time) where needed. Nice.

Left merge

This merge keeps all the keys in the left DataFrame and only keeps the keys in the right DataFrame that have matches in the left DataFrame.

dogs_left = pd.merge(left=dogs, right=steps, 
                     left_on=['dog', 'time'], right_on=['dog', 'date'], 
                     how='left', indicator=True)
dogs_left

The left merge keeps the observation Su 2018-10-03 even though it doesn't have a match in steps.

The right merge works analogously.

Merging on the index

We can also merge on the index. replace the left_on and right_on with left_index and right_index.

The names of index columns need to match in this case.

dogs.rename(columns={'time':'date'}, inplace=True)
dogs.set_index(['dog', 'date'], inplace=True)
steps.set_index(['dog', 'date'], inplace=True)
x = pd.merge(left=dogs, right=steps, left_index=True, right_index=True, how='outer', indicator=True)
x.head(30)

Practice

Head over to MovieLens and download the ml-latest-small dataset. It will come as a zipped file. Put it in your user drive (and cwd) and unzip it. You will find 5 files.

Each user ranks movies and can tag movies. Users have a userId and movies have a movieId. * 'movies.csv' holds the description of movies. An observation is a movie. * 'ratings.csv' holds user ratings of movies. An observation is a user-movie * 'tags.csv' holds the tags a user assigns to a movie. A observation is a user-movie. A user can add more than one tag per movie.

Movies ratings by genre

Suppose we would like to know if comedies are, on average, rated better than adventure movies. We need data on both ratings and the movie genre.

  1. Load the 'movies.csv' and the 'ratings.csv' files as DataFrames
  2. What size are the two DataFrames?

  3. Merge the two DataFrames. In the new DataFrame, each row should hold the rating of a movie and the movies genres. We want to keep all the observations in the ratings file. Call your merged DataFrame 'ratings_by_genre'.

  4. What size is the resulting DataFrame? Are there any unmatched ratings?

Movie ratings by tags

Suppose we would like to know if Disney movies are, on average, rated better than superhero movies. We do not have a genre named "Disney" or "superhero," but some users have tagged movies as Disney and superhero.

  1. Load the 'tags.csv' file.

  2. Merge it with the 'ratings' DataFrame. Name this DataFrame ratings_by_tags.

  3. How many tags could be matched with a movie rating? How many rating are untagged?

  4. What are the 10 most used tags?

Extra practice (try at home)

Are movies tagged Disney better than movies tagged superhero?

  1. Compute the average rating for movies with superhero tags and for movies tagged Disney.

  2. Let's see how the ratings of Disney movies compare to superhero movies. We will plot two histograms on the same axes.

    1. Plot the ratings histogram of movies tagged 'Disney'. Use 10 bins. Make the plot blue.
    2. Plot the ratings histogram of superhero movies. Use 10 bins. make the plot red.
    3. Add a legend
    4. Add the mean ratings for superhero movies and for superhero movies as text to the histogram