Fifteen-minute Friday #4
files needed = 'industrial_production.xlsx'
Fifteen-minute Fridays are brief, nongraded workbooks that provide some extra practice and introduce new topics for self-guided study. This week we are working on
- Working with DataFrames
- Practice working with messy data
The solutions to the practice problems are at the end of the notebook.
1. Helpful DataFrame functions
Let's look at a few methods that help us learn about the contents of a DataFrame. Some of these we know already, some will be new.
Describing and learning about a DataFrame is typically the first thing you do after getting it defined. We might want to know:
- How big is this DataFrame?
- What are the column or row names?
- What are the data types? Are they the types we expected?
- How do we peek at small parts of the DataFrame?
Once we think our DataFrame is set up correctly, we can move on to the analysis. We might still find some gremlins later...
Give the following a try.
Loading and setting up your DataFrame
- Load the file 'industrial_production.xlsx' into a DataFrame. We want the sheet 'Quarterly'.
- Print out the first 4 rows and the last 4 rows of the DataFrame.
- Set the index to 'DATE'.
- Those variable names are terrible. Check out the 'README' tab in the Excel workbook for the definitions. Rename the columns with sensible names.
DataFrame attributes
Any object, including a DataFrame, has attributes. We access attributes using the . (dot) operator.
- Try dtypes. What does it tell you?
- Try shape. What is the return type? What does it tell you?
- Try columns. What is the return type? What does it tell you?
- Try index. What is the return type? What does it tell you?
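If you want to see what these attributes return before trying them on the exercise data, here is a quick sketch on a small made-up DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd

# A tiny made-up DataFrame, just to inspect the attributes.
df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': ['a', 'b', 'c']})

print(df.dtypes)    # the data type of each column (float64, object, ...)
print(df.shape)     # a tuple: (number of rows, number of columns)
print(df.columns)   # an Index object holding the column names
print(df.index)     # the row labels (a RangeIndex here, since we never set one)
```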
DataFrame methods
Objects also have methods. The following are methods of DataFrame. They are also accessed with the . (dot) operator, but they work like functions.
- Try sample(5). What does it tell you?
- Try describe(). What does it tell you?
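On a small made-up DataFrame, the two methods look like this (column names invented; your output from sample() will differ since the rows are chosen at random):

```python
import pandas as pd

df = pd.DataFrame({'x': range(10), 'y': [v * 2 for v in range(10)]})

print(df.sample(3))    # 3 randomly chosen rows
print(df.describe())   # count, mean, std, min, quartiles, max for each numeric column
```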
2. Working with messy data: PISA scores
The PISA test is given to 15-year-olds around the world. It evaluates reading, math, and science skills.
- In a web browser, go to dx.doi.org/10.1787/888932937035. This should initiate a download of an Excel file with PISA scores across countries. Open the workbook and take a look. This is a bit of a mess.
The issue here is that the workbook was formatted for humans to read. Since it is not a neat rectangular block of data, we will need to 'wrangle' it into shape. This is a common task in the real world, so let's practice some more.
- Use the read_excel() function to create a DataFrame named pisa with the mean scores in math, reading, and science. Do not set an index yet.
- There is some junk at the bottom of the spreadsheet. Try the skipfooter argument. Use the documentation for read_excel() to learn more.
- There is some junk at the top of the spreadsheet. Try the skiprows argument.
- You can pick out the columns from the spreadsheet you want using the usecols argument. Alternatively, you could delete columns you do not want once you have the data in a DataFrame.
You can either read in the file that you downloaded to your computer or you can read the file directly from the Internet. Try it both ways!
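To see how these arguments behave before tackling the real file, here is a sketch on a small made-up 'messy' file. It uses read_csv() so the example is self-contained, but read_excel() accepts the same skiprows, skipfooter, and usecols arguments (read_csv() additionally needs engine='python' for skipfooter to work):

```python
from io import StringIO
import pandas as pd

# A made-up messy file: two junk lines at the top, one at the bottom,
# and a column ('notes') we do not want.
messy = StringIO(
    "Report generated 2020-01-01\n"
    "Source: example\n"
    "country,score,notes\n"
    "A,500,ok\n"
    "B,480,ok\n"
    "Downloaded from example.org\n"
)

# skiprows drops junk at the top, skipfooter drops junk at the bottom,
# and usecols keeps only the columns we want (here, by position).
tidy = pd.read_csv(messy, skiprows=2, skipfooter=1, usecols=[0, 1], engine='python')
print(tidy)
```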
- Look up dropna() in the pandas documentation.
- Clean up your DataFrame. Drop any rows that have at least one NaN. Save the result into a new DataFrame named pisa2. How many rows are in pisa2?
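A small made-up example of what dropna() does with its default settings:

```python
import numpy as np
import pandas as pd

# Two of the three rows contain at least one NaN.
df = pd.DataFrame({'math': [500.0, np.nan, 480.0],
                   'read': [490.0, np.nan, np.nan]})

# By default, dropna() removes every row containing at least one NaN.
clean = df.dropna()
print(clean)
print(clean.shape)
```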
- Using pisa2, make the country names the index.
- Using pisa2, print out the ratios of the United States pisa scores (math, reading, science) relative to the OECD average.
- Challenging. Use pisa2. How correlated are the math, reading, and science scores with each other? Write the correlation matrix to a file called 'pisa_corrs.xlsx'. This is a challenging question because, depending on how you read in the data, your columns are probably of type 'object' (strings, basically) and .corr() won't work. Take a look at the first row of pisa to see why the data are stored as strings. Google around and see if you can convert the three columns to numbers. Then find the correlations.
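One way the string-to-number conversion can look, sketched on a made-up DataFrame (the column names and scores here are invented, not the real PISA data):

```python
import pandas as pd

# Columns of strings, as you might get after reading a human-formatted sheet.
df = pd.DataFrame({'math': ['500', '612', '480'],
                   'read': ['496', '570', '475']})
print(df.dtypes)                # both columns are 'object' (strings)

# apply() runs pd.to_numeric on each column, converting strings to numbers.
nums = df.apply(pd.to_numeric)
print(nums.dtypes)              # now numeric
print(nums.corr())              # the correlation matrix is itself a DataFrame
```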
Helpful DataFrame functions: Solutions
# Import pandas so that we can use it. The accepted short name for pandas is pd.
import pandas as pd
# 1. Load the dataframe.
# 2. First four rows.
ind = pd.read_excel('industrial_production.xlsx', sheet_name='Quarterly')
ind.head(4)
DATE | IPB51000SQ | IPB51100SQ | IPG211111CSQ | IPG21SQ | IPG2211A2SQ | IPG3361T3SQ | IPGMFSQ | IPN31152NQ | IPN3311A2RNQ | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1919-01-01 | NaN | NaN | NaN | 22.7730 | NaN | NaN | NaN | NaN | NaN |
1 | 1919-04-01 | NaN | NaN | NaN | 23.4306 | NaN | NaN | NaN | NaN | NaN |
2 | 1919-07-01 | NaN | NaN | NaN | 25.8175 | NaN | NaN | NaN | NaN | NaN |
3 | 1919-10-01 | NaN | NaN | NaN | 24.0151 | NaN | NaN | NaN | NaN | NaN |
# 2. Last four rows.
ind.tail(4)
DATE | IPB51000SQ | IPB51100SQ | IPG211111CSQ | IPG21SQ | IPG2211A2SQ | IPG3361T3SQ | IPGMFSQ | IPN31152NQ | IPN3311A2RNQ | |
---|---|---|---|---|---|---|---|---|---|---|
400 | 2019-01-01 | 105.9090 | 119.8992 | 180.0533 | 130.8243 | 105.4272 | 127.3016 | 106.5002 | 100.4274 | 104.5278 |
401 | 2019-04-01 | 105.0156 | 120.0236 | 186.3754 | 133.3726 | 103.1247 | 127.8380 | 105.6580 | 118.1640 | 102.4692 |
402 | 2019-07-01 | 105.2506 | 121.1178 | 188.3671 | 132.6772 | 105.3323 | 128.7360 | 105.8973 | 104.3486 | 100.2971 |
403 | 2019-10-01 | 105.4923 | 119.4774 | 194.4878 | 133.4932 | 105.7513 | 123.9082 | 105.8004 | 82.5920 | 97.4628 |
Notice that I did not use print() when outputting ind.head(4) and ind.tail(4). The benefit is that we get the extra formatting Jupyter applies to DataFrames. The cost is that we can only display one thing like this per cell. Enter a new code cell and enter the following:
ind.head(4)
ind.tail(4)
What do you get?
# 3. Set the index to 'DATE'.
ind = ind.set_index('DATE')
Changing column names
We need to change nine column names. This is a bit tedious, but at least we only have to do this once. Once written, we can reuse the code anytime we deal with this file. Below are three different ways to do it.
# 4. Change the column names (method 1).
# This is the slick way.
# Grab the column names from the index.
# Make a list of the new column names.
# Zip them together (check the documentation for zip) and then create a dict.
# The columns have to be in the correct order with respect to new_names.
old_names = ind.columns.to_list()
new_names = ['consumer', 'consumer durables', 'crude oil', 'mining', 'elec and gas', 'cars', 'manuf', 'ice cream', 'steel']
names = dict(zip(old_names, new_names))
ind = ind.rename(columns=names)
ind.head(2)
consumer | consumer durables | crude oil | mining | elec and gas | cars | manuf | ice cream | steel | |
---|---|---|---|---|---|---|---|---|---|
DATE | |||||||||
1919-01-01 | NaN | NaN | NaN | 22.7730 | NaN | NaN | NaN | NaN | NaN |
1919-04-01 | NaN | NaN | NaN | 23.4306 | NaN | NaN | NaN | NaN | NaN |
# 4. Change the column names (method 2).
# If you know for sure that the columns are in the correct order, you can also do this.
# In method 1, you can look at `names` and make sure the new and old names line up before changing the columns names.
# new_names = ['consumer', 'consumer durables', 'crude oil', 'mining', 'elec and gas', 'cars', 'manuf', 'ice cream', 'steel']
# ind.columns = new_names
# 4. Change the column names (method 3).
# The most tedious, but robust way. No matter what order the columns are in, this will replace the names properly.
# names = {'IPB51000SQ':'consumer', 'IPB51100SQ':'consumer durables', 'IPG211111CSQ':'crude oil',
# 'IPG21SQ':'mining', 'IPG2211A2SQ':'elec and gas', 'IPG3361T3SQ':'cars', 'IPGMFSQ':'manuf',
# 'IPN31152NQ':'ice cream', 'IPN3311A2RNQ':'steel'}
# ind = ind.rename(columns=names)
# 5. dtypes
# Each column is made up of floats. (float64 is the same as float)
ind.dtypes
consumer float64
consumer durables float64
crude oil float64
mining float64
elec and gas float64
cars float64
manuf float64
ice cream float64
steel float64
dtype: object
# 6. shape
# This gives us the number of rows and columns in a tuple.
ind.shape
(404, 9)
# 7. (column) index
# An index object. The column names.
ind.columns
Index(['consumer', 'consumer durables', 'crude oil', 'mining', 'elec and gas',
'cars', 'manuf', 'ice cream', 'steel'],
dtype='object')
# 8. (row) index
# Another index object. A different type (a DatetimeIndex, to hold dates).
ind.index
DatetimeIndex(['1919-01-01', '1919-04-01', '1919-07-01', '1919-10-01',
'1920-01-01', '1920-04-01', '1920-07-01', '1920-10-01',
'1921-01-01', '1921-04-01',
...
'2017-07-01', '2017-10-01', '2018-01-01', '2018-04-01',
'2018-07-01', '2018-10-01', '2019-01-01', '2019-04-01',
'2019-07-01', '2019-10-01'],
dtype='datetime64[ns]', name='DATE', length=404, freq=None)
# 9. sample()
# A random sample of rows.
ind.sample(5)
consumer | consumer durables | crude oil | mining | elec and gas | cars | manuf | ice cream | steel | |
---|---|---|---|---|---|---|---|---|---|
DATE | |||||||||
1968-01-01 | 50.5306 | 39.9326 | NaN | 85.2319 | 35.3282 | NaN | NaN | NaN | NaN |
1923-01-01 | NaN | NaN | NaN | 32.1014 | NaN | NaN | NaN | NaN | NaN |
1920-01-01 | NaN | NaN | NaN | 27.2301 | NaN | NaN | NaN | NaN | NaN |
1985-10-01 | 73.1879 | 58.5867 | 137.7776 | 95.1897 | 62.9601 | 55.9413 | 51.9295 | 77.1728 | 73.7127 |
1929-01-01 | NaN | NaN | NaN | 36.3880 | NaN | NaN | NaN | NaN | NaN |
# 10. describe()
# Summary statistics. If a column is not made up of numbers, it will not appear here.
ind.describe()
consumer | consumer durables | crude oil | mining | elec and gas | cars | manuf | ice cream | steel | |
---|---|---|---|---|---|---|---|---|---|
count | 324.000000 | 292.000000 | 192.000000 | 404.000000 | 324.000000 | 192.000000 | 192.000000 | 192.000000 | 192.000000 |
mean | 65.047151 | 65.033777 | 115.684291 | 71.476008 | 54.800645 | 79.146544 | 74.033846 | 109.750889 | 104.150758 |
std | 32.738490 | 36.161831 | 25.731753 | 27.447983 | 35.141579 | 28.333690 | 24.334594 | 26.297457 | 23.704066 |
min | 12.149600 | 14.197100 | 73.701500 | 17.560800 | 3.421900 | 32.428100 | 36.094600 | 63.286800 | 57.771000 |
25% | 33.051625 | 35.750100 | 90.038175 | 45.412000 | 19.624250 | 54.963425 | 49.640475 | 91.246650 | 90.773525 |
50% | 66.471100 | 57.756800 | 117.488500 | 84.690050 | 56.409250 | 78.134450 | 72.335600 | 105.492500 | 98.896600 |
75% | 100.139600 | 100.364475 | 133.501100 | 91.499200 | 89.436600 | 105.321125 | 98.524600 | 126.651475 | 105.842975 |
max | 113.684700 | 124.179300 | 194.487800 | 133.493200 | 107.675900 | 132.507800 | 108.265900 | 179.319100 | 182.248200 |
Working with messy data: Solutions
# 2. Reading from the internet.
url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
skiprows=18, # skip the first 18 rows
skipfooter=7, # skip the last 7
usecols=[0,1,9,13], # select columns of interest
)
# Rather than use the 'usecols' argument, you could have loaded all the columns and dropped the ones you do not want.
# Notice that the first row is a bunch of text. That will cause some problems later...
pisa.head()
Unnamed: 0 | Mathematics | Reading | Science | |
---|---|---|---|---|
0 | NaN | Mean score in PISA 2012 | Mean score in PISA 2012 | Mean score in PISA 2012 |
1 | NaN | NaN | NaN | NaN |
2 | OECD average | 494.046 | 496.463 | 501.16 |
3 | NaN | NaN | NaN | NaN |
4 | Shanghai-China | 612.676 | 569.588 | 580.118 |
# 3. Drop NaNs.
pisa2 = pisa.dropna()
pisa2.shape
(66, 4)
# 4. Set the index.
pisa2 = pisa2.set_index('Unnamed: 0')
pisa2.head(2)
Mathematics | Reading | Science | |
---|---|---|---|
Unnamed: 0 | |||
OECD average | 494.046 | 496.463 | 501.16 |
Shanghai-China | 612.676 | 569.588 | 580.118 |
# 5. US relative to the average.
# The US is pretty average...
print(pisa2.loc['United States']/pisa2.loc['OECD average'])
Mathematics 0.974335
Reading 1.00225
Science 0.992517
dtype: object
# 6. Convert types/compute correlations
# Strings!
pisa2.dtypes
Mathematics object
Reading object
Science object
dtype: object
# New column names
pisa2.columns = ['math', 'read', 'sci']
# There are several ways to convert strings to numeric values. This is one of them.
pisa2=pisa2[['math', 'read', 'sci']].apply(pd.to_numeric)
pisa2.dtypes
math float64
read float64
sci float64
dtype: object
# Now we are in good shape. What does .corr() do?
# What is the return type of .corr()?
pisa_corrs = pisa2.corr()
print(pisa_corrs)
print(type(pisa_corrs))
# Now save the DataFrame of results to a file.
pisa_corrs.to_excel('pisa_corrs.xlsx')
math read sci
math 1.000000 0.959806 0.972131
read 0.959806 1.000000 0.978559
sci 0.972131 0.978559 1.000000
<class 'pandas.core.frame.DataFrame'>