download notebook

Fifteen-minute Friday #4

files needed = 'industrial_production.xlsx'

Fifteen-minute Fridays are brief, nongraded workbooks that provide some extra practice and introduce new topics for self-guided study. This week we are working on

  1. Working with DataFrames
  2. Practice working with messy data

The solutions to the practice problems are at the end of the notebook.

1. Helpful DataFrame functions

Let's look at few methods that help us learn about the contents of a DataFrame. Some of these we know already, some will be new.

Describing and learning about a DataFrame is typically the first thing you do after getting it defined. We might want to know:

  • How big is this DataFrame?
  • What are the column or row names?
  • What are the data types? Are they the types we expected?
  • How do we peek at small parts of the DataFrame?

Once we think our DataFrame is set up correctly, we can move on to the analysis. We might still find some gremlins later...

Give the following a try.

Loading and setting up your DataFrame

  1. Load the file 'industrial_production.xlsx' into a DataFrame. We want the sheet 'Quarterly'.
  2. Print out the first 4 rows and the last 4 rows of the DataFrame.
  3. Set the index to 'DATE'.
  4. Those variable names are terrible. Check out the 'README' tab in the excel workbook for the definitions. Rename the columns with sensible names.

DataFrame attributes

Any object, including DataFrame, has attributes. We access attributes using the . operator.

  1. Try dtypes. What does it tell you?
  2. Try shape. What is the return type? What does it tell you?
  3. Try columns. What is the return type? What does it tell you?
  4. Try index. What is the return type? What does it tell you?

DataFrame methods

Objects also have methods. The following are methods of DataFrame. They also use the . to access, but are like functions.

  1. Try sample(5). What does it tell you?
  2. Try describe(). What does it tell you?

2. Working with messy data: Pisa Scores

The pisa test is a test given to 15-year olds around the world. It evaluates reading, math, and science skills.

  1. In a web browser, go to dx.doi.org/10.1787/888932937035 This should initiate a download of an excel file with pisa scores across countries. Open the workbook up and take a look. This is a bit of a mess.

The issue here is that the workbook was formatted for humans to read. Since it is not a neat rectangular block of data, we will need to 'wrangle' it into shape. This is a common task in the real world, so let's practice some more.

  1. Use the read_excel() function to create a DataFrame named pisa with the mean scores in math, reading, and science. Do not set an index yet.

  2. There is some junk at the bottom of the spreadsheet. Try the skipfooter argument. Use the documentation for read_excel() to learn more.

  3. There is some junk at the top of the spreadsheet. Try the skiprows argument.
  4. You can pick out the columns from the spreadsheet you want to use using the usecols argument. Alternatively, you could delete columns you do not want once you have the data in a DataFrame.

You can either read in the file that you downloaded to your computer or you can read the file directly from the Internet. Try it both ways!

  1. Look up dropna() in the pandas documentation.
    1. Clean up your DataFrame. Drop any rows that have at least one NaN. Save the result into a new DataFrame named pisa2.

How many rows are in pisa2?

  1. Using pisa2, make the country names the index.

  2. Using pisa2, print out the ratios of the United States pisa scores (math, reading, science) relative to the OECD average.

  3. Challenging. Use pisa2. How correlated are the math, reading, and science scores with each other? Write the correlation matrix to a file called 'pisa_corrs.xlsx'.

    This is a challenging question because, depending on how you read in the data, your columns are probably of type 'Object' (strings, basically) and .corr() won't work. Take a look at the first row of pisa to see why the data are stored as strings. Google around and see if you can convert the three columns to numbers. Then find the correlations.










Helpful DataFrame functions: Solutions

# Import pandas so that we can use it. The accepted short name for pandas is pd.  
import pandas as pd
# 1. Load the dataframe.
# 2. First four rows.
ind = pd.read_excel('industrial_production.xlsx', sheet_name='Quarterly')
ind.head(4)
DATE IPB51000SQ IPB51100SQ IPG211111CSQ IPG21SQ IPG2211A2SQ IPG3361T3SQ IPGMFSQ IPN31152NQ IPN3311A2RNQ
0 1919-01-01 NaN NaN NaN 22.7730 NaN NaN NaN NaN NaN
1 1919-04-01 NaN NaN NaN 23.4306 NaN NaN NaN NaN NaN
2 1919-07-01 NaN NaN NaN 25.8175 NaN NaN NaN NaN NaN
3 1919-10-01 NaN NaN NaN 24.0151 NaN NaN NaN NaN NaN
# 2. Last four rows.

ind.tail(4)
DATE IPB51000SQ IPB51100SQ IPG211111CSQ IPG21SQ IPG2211A2SQ IPG3361T3SQ IPGMFSQ IPN31152NQ IPN3311A2RNQ
400 2019-01-01 105.9090 119.8992 180.0533 130.8243 105.4272 127.3016 106.5002 100.4274 104.5278
401 2019-04-01 105.0156 120.0236 186.3754 133.3726 103.1247 127.8380 105.6580 118.1640 102.4692
402 2019-07-01 105.2506 121.1178 188.3671 132.6772 105.3323 128.7360 105.8973 104.3486 100.2971
403 2019-10-01 105.4923 119.4774 194.4878 133.4932 105.7513 123.9082 105.8004 82.5920 97.4628

Notice that I did not use print() when outputing ind.head(4) and ind.tail(4). The benefit is that we get the extra formatting from Jupyter on the DataFrame. The cost is that we can only print one thing like this out per cell. Enter a new code cell and enter the following:

ind.head(4)
ind.tail(4)

What do you get?

# 3. Set the index to 'DATE'.

ind = ind.set_index('DATE')

Changing column names

We need to change nine column names. This is a bit tedious, but at least we only have to do this once. Once written, we can reuse the code anytime we deal with this file. Below are three different ways to do it.

# 4. Change the column names (method 1). 

# This is the slick way. 
# Grab the column names from the index. 
# Make a list of the new column names.
# Zip them together (check the documentation for zip) and then create a dict. 
# The columns have to be in the correct order with respect to new_names.

old_names = ind.columns.to_list()
new_names = ['consumer', 'consumer durables', 'crude oil', 'mining', 'elec and gas', 'cars', 'manuf', 'ice cream', 'steel']
names = dict(zip(old_names, new_names))
ind = ind.rename(columns=names)

ind.head(2)
consumer consumer durables crude oil mining elec and gas cars manuf ice cream steel
DATE
1919-01-01 NaN NaN NaN 22.7730 NaN NaN NaN NaN NaN
1919-04-01 NaN NaN NaN 23.4306 NaN NaN NaN NaN NaN
# 4. Change the column names (method 2). 

# If you know for sure that that the columns are in the correct order, you can also do this.
# In method 1, you can look at `names` and make sure the new and old names line up before changing the columns names. 

# new_names = ['consumer', 'consumer durables', 'crude oil', 'mining', 'elec and gas', 'cars', 'manuf', 'ice cream', 'steel']
# ind.columns = new_names
# 4. Change the column names (method 3).

# The most tedious, but robust way. No matter what order the columns are in, this will replace the names properly. 

# names = {'IPB51000SQ':'consumer', 'IPB51100SQ':'consumer durables', 'IPG211111CSQ':'crude oil', 
#         'IPG21SQ':'mining', 'IPG2211A2SQ':'elec and gas', 'IPG3361T3SQ':'cars', 'IPGMFSQ':'manuf', 
#         'IPN31152NQ':'ice cream', 'IPN3311A2RNQ':'steel'}
# ind = ind.rename(columns=names)
# 5. dtypes

# Each column is made up of floats. (float64 is the same as float)
ind.dtypes
consumer             float64
consumer durables    float64
crude oil            float64
mining               float64
elec and gas         float64
cars                 float64
manuf                float64
ice cream            float64
steel                float64
dtype: object
# 6. shape

# This gives us the number of rows and columns in a tuple.
ind.shape
(404, 9)
#7. (column) index 

# An index object. The column names. 
ind.columns
Index(['consumer', 'consumer durables', 'crude oil', 'mining', 'elec and gas',
       'cars', 'manuf', 'ice cream', 'steel'],
      dtype='object')
# 8. (row) index 

# Another index object. A different type (a DatetimeIndex, to hold dates). 
ind.index
DatetimeIndex(['1919-01-01', '1919-04-01', '1919-07-01', '1919-10-01',
               '1920-01-01', '1920-04-01', '1920-07-01', '1920-10-01',
               '1921-01-01', '1921-04-01',
               ...
               '2017-07-01', '2017-10-01', '2018-01-01', '2018-04-01',
               '2018-07-01', '2018-10-01', '2019-01-01', '2019-04-01',
               '2019-07-01', '2019-10-01'],
              dtype='datetime64[ns]', name='DATE', length=404, freq=None)
# 9. sample()

# A random sample of rows.
ind.sample(5)
consumer consumer durables crude oil mining elec and gas cars manuf ice cream steel
DATE
1968-01-01 50.5306 39.9326 NaN 85.2319 35.3282 NaN NaN NaN NaN
1923-01-01 NaN NaN NaN 32.1014 NaN NaN NaN NaN NaN
1920-01-01 NaN NaN NaN 27.2301 NaN NaN NaN NaN NaN
1985-10-01 73.1879 58.5867 137.7776 95.1897 62.9601 55.9413 51.9295 77.1728 73.7127
1929-01-01 NaN NaN NaN 36.3880 NaN NaN NaN NaN NaN
# 10. describe()

# Summary statistics. If a column is not made up of numbers, it will not appear here. 
ind.describe()
consumer consumer durables crude oil mining elec and gas cars manuf ice cream steel
count 324.000000 292.000000 192.000000 404.000000 324.000000 192.000000 192.000000 192.000000 192.000000
mean 65.047151 65.033777 115.684291 71.476008 54.800645 79.146544 74.033846 109.750889 104.150758
std 32.738490 36.161831 25.731753 27.447983 35.141579 28.333690 24.334594 26.297457 23.704066
min 12.149600 14.197100 73.701500 17.560800 3.421900 32.428100 36.094600 63.286800 57.771000
25% 33.051625 35.750100 90.038175 45.412000 19.624250 54.963425 49.640475 91.246650 90.773525
50% 66.471100 57.756800 117.488500 84.690050 56.409250 78.134450 72.335600 105.492500 98.896600
75% 100.139600 100.364475 133.501100 91.499200 89.436600 105.321125 98.524600 126.651475 105.842975
max 113.684700 124.179300 194.487800 133.493200 107.675900 132.507800 108.265900 179.319100 182.248200


Working with messy data: Solutions

# 2. Reading from the internet.

url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
                     skiprows=18,             # skip the first 18 rows
                     skipfooter=7,            # skip the last 7
                     usecols=[0,1,9,13],      # select columns of interest
                     )

# Rather than use the 'usecols' argument, you could have loaded all the columns and dropped the ones you do not want.
# Notice that the first row is a bunch of text. That will cause some problems later...

pisa.head()
Unnamed: 0 Mathematics Reading Science
0 NaN Mean score in PISA 2012 Mean score in PISA 2012 Mean score in PISA 2012
1 NaN NaN NaN NaN
2 OECD average 494.046 496.463 501.16
3 NaN NaN NaN NaN
4 Shanghai-China 612.676 569.588 580.118
# 3. Drop NaNs.

pisa2 = pisa.dropna()
pisa2.shape
(66, 4)
# 4. Set the index.

pisa2 = pisa2.set_index('Unnamed: 0')
pisa2.head(2)
Mathematics Reading Science
Unnamed: 0
OECD average 494.046 496.463 501.16
Shanghai-China 612.676 569.588 580.118
# 5. US relative to the average. 
# The US is pretty average...

print(pisa2.loc['United States']/pisa2.loc['OECD average'])
Mathematics       0.974335
Reading            1.00225
Science           0.992517
dtype: object
# 6. Convert types/compute correlations
# Strings! 

pisa2.dtypes
Mathematics       object
Reading           object
Science           object
dtype: object
# New column names
pisa2.columns = ['math', 'read', 'sci']  

# There are several ways to convert strings to numeric values. This is one of them.
pisa2=pisa2[['math', 'read', 'sci']].apply(pd.to_numeric)  
pisa2.dtypes
math    float64
read    float64
sci     float64
dtype: object
# Now we are in good shape. What does .corr() do?
# What is the return type of .corr()?

pisa_corrs = pisa2.corr()
print(pisa_corrs)
print(type(pisa_corrs))

# Now save the DataFrame of results to a file. 
pisa_corrs.to_excel('pisa_corrs.xlsx')
          math      read       sci
math  1.000000  0.959806  0.972131
read  0.959806  1.000000  0.978559
sci   0.972131  0.978559  1.000000
<class 'pandas.core.frame.DataFrame'>