Fifteen-minute Friday #4
files needed = 'industrial_production.xlsx'
Fifteen-minute Fridays are brief, nongraded workbooks that provide some extra practice and introduce new topics for self-guided study. This week we are working on
- Working with DataFrames
- Practice working with messy data
The solutions to the practice problems are at the end of the notebook.
1. Helpful DataFrame functions
Let's look at a few methods that help us learn about the contents of a DataFrame. Some of these we know already, some will be new.
Describing and learning about a DataFrame is typically the first thing you do after getting it defined. We might want to know:
- How big is this DataFrame?
- What are the column or row names?
- What are the data types? Are they the types we expected?
- How do we peek at small parts of the DataFrame?
Once we think our DataFrame is set up correctly, we can move on to the analysis. We might still find some gremlins later...
Give the following a try.
Loading and setting up your DataFrame
- Load the file 'industrial_production.xlsx' into a DataFrame. We want the sheet 'Quarterly'.
- Print out the first 4 rows and the last 4 rows of the DataFrame.
- Set the index to 'DATE'.
- Those variable names are terrible. Check out the 'README' tab in the Excel workbook for the definitions. Rename the columns with sensible names.
DataFrame attributes
Any object, including a DataFrame, has attributes. We access attributes using the . (dot) operator.
- Try dtypes. What does it tell you?
- Try shape. What is the return type? What does it tell you?
- Try columns. What is the return type? What does it tell you?
- Try index. What is the return type? What does it tell you?
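If you want to see what these attributes return before trying them on the exercise data, here is a quick sketch on a small made-up DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd

# A tiny made-up DataFrame, just to inspect the attributes.
df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': ['a', 'b', 'c']})

print(df.dtypes)    # the data type of each column (float64, object, ...)
print(df.shape)     # a tuple: (number of rows, number of columns)
print(df.columns)   # an Index object holding the column names
print(df.index)     # the row labels (a RangeIndex here, since we never set one)
```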
DataFrame methods
Objects also have methods. The following are methods of DataFrame. They are also accessed with the . (dot) operator, but they work like functions.
- Try sample(5). What does it tell you?
- Try describe(). What does it tell you?
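On a small made-up DataFrame, the two methods look like this (column names invented; your output from sample() will differ since the rows are chosen at random):

```python
import pandas as pd

df = pd.DataFrame({'x': range(10), 'y': [v * 2 for v in range(10)]})

print(df.sample(3))    # 3 randomly chosen rows
print(df.describe())   # count, mean, std, min, quartiles, max for each numeric column
```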
2. Working with messy data: PISA scores
The PISA test is given to 15-year-olds around the world. It evaluates reading, math, and science skills.
- In a web browser, go to dx.doi.org/10.1787/888932937035. This should initiate a download of an Excel file with PISA scores across countries. Open the workbook and take a look. This is a bit of a mess.
The issue here is that the workbook was formatted for humans to read. Since it is not a neat rectangular block of data, we will need to 'wrangle' it into shape. This is a common task in the real world, so let's practice some more.
- Use the read_excel() function to create a DataFrame named pisa with the mean scores in math, reading, and science. Do not set an index yet.
- There is some junk at the bottom of the spreadsheet. Try the skipfooter argument. Use the documentation for read_excel() to learn more.
- There is some junk at the top of the spreadsheet. Try the skiprows argument.
- You can pick out the columns from the spreadsheet you want using the usecols argument. Alternatively, you could delete columns you do not want once you have the data in a DataFrame.
You can either read in the file that you downloaded to your computer or you can read the file directly from the Internet. Try it both ways!
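To see how these arguments behave before tackling the real file, here is a sketch on a small made-up 'messy' file. It uses read_csv() so the example is self-contained, but read_excel() accepts the same skiprows, skipfooter, and usecols arguments (read_csv() additionally needs engine='python' for skipfooter to work):

```python
from io import StringIO
import pandas as pd

# A made-up messy file: two junk lines at the top, one at the bottom,
# and a column ('notes') we do not want.
messy = StringIO(
    "Report generated 2020-01-01\n"
    "Source: example\n"
    "country,score,notes\n"
    "A,500,ok\n"
    "B,480,ok\n"
    "Downloaded from example.org\n"
)

# skiprows drops junk at the top, skipfooter drops junk at the bottom,
# and usecols keeps only the columns we want (here, by position).
tidy = pd.read_csv(messy, skiprows=2, skipfooter=1, usecols=[0, 1], engine='python')
print(tidy)
```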
- Look up dropna() in the pandas documentation.
- Clean up your DataFrame. Drop any rows that have at least one NaN. Save the result into a new DataFrame named pisa2. How many rows are in pisa2?
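A small made-up example of what dropna() does with its default settings:

```python
import numpy as np
import pandas as pd

# Two of the three rows contain at least one NaN.
df = pd.DataFrame({'math': [500.0, np.nan, 480.0],
                   'read': [490.0, np.nan, np.nan]})

# By default, dropna() removes every row containing at least one NaN.
clean = df.dropna()
print(clean)
print(clean.shape)
```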
- Using pisa2, make the country names the index.
- Using pisa2, print out the ratios of the United States pisa scores (math, reading, science) relative to the OECD average.
- Challenging. Use pisa2. How correlated are the math, reading, and science scores with each other? Write the correlation matrix to a file called 'pisa_corrs.xlsx'. This is a challenging question because, depending on how you read in the data, your columns are probably of type 'object' (strings, basically) and .corr() won't work. Take a look at the first row of pisa to see why the data are stored as strings. Google around and see if you can convert the three columns to numbers. Then find the correlations.
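One way the string-to-number conversion can look, sketched on a made-up DataFrame (the column names and scores here are invented, not the real PISA data):

```python
import pandas as pd

# Columns of strings, as you might get after reading a human-formatted sheet.
df = pd.DataFrame({'math': ['500', '612', '480'],
                   'read': ['496', '570', '475']})
print(df.dtypes)                # both columns are 'object' (strings)

# apply() runs pd.to_numeric on each column, converting strings to numbers.
nums = df.apply(pd.to_numeric)
print(nums.dtypes)              # now numeric
print(nums.corr())              # the correlation matrix is itself a DataFrame
```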
Helpful DataFrame functions: Solutions
# Import pandas so that we can use it. The accepted short name for pandas is pd.
import pandas as pd
# 1. Load the dataframe.
# 2. First four rows.
ind = pd.read_excel('industrial_production.xlsx', sheet_name='Quarterly')
ind.head(4)
DATE | IPB51000SQ | IPB51100SQ | IPG211111CSQ | IPG21SQ | IPG2211A2SQ | IPG3361T3SQ | IPGMFSQ | IPN31152NQ | IPN3311A2RNQ | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1919-01-01 | NaN | NaN | NaN | 22.7730 | NaN | NaN | NaN | NaN | NaN |
1 | 1919-04-01 | NaN | NaN | NaN | 23.4306 | NaN | NaN | NaN | NaN | NaN |
2 | 1919-07-01 | NaN | NaN | NaN | 25.8175 | NaN | NaN | NaN | NaN | NaN |
3 | 1919-10-01 | NaN | NaN | NaN | 24.0151 | NaN | NaN | NaN | NaN | NaN |
# 2. Last four rows.
ind.tail(4)
DATE | IPB51000SQ | IPB51100SQ | IPG211111CSQ | IPG21SQ | IPG2211A2SQ | IPG3361T3SQ | IPGMFSQ | IPN31152NQ | IPN3311A2RNQ | |
---|---|---|---|---|---|---|---|---|---|---|
400 | 2019-01-01 | 105.9090 | 119.8992 | 180.0533 | 130.8243 | 105.4272 | 127.3016 | 106.5002 | 100.4274 | 104.5278 |
401 | 2019-04-01 | 105.0156 | 120.0236 | 186.3754 | 133.3726 | 103.1247 | 127.8380 | 105.6580 | 118.1640 | 102.4692 |
402 | 2019-07-01 | 105.2506 | 121.1178 | 188.3671 | 132.6772 | 105.3323 | 128.7360 | 105.8973 | 104.3486 | 100.2971 |
403 | 2019-10-01 | 105.4923 | 119.4774 | 194.4878 | 133.4932 | 105.7513 | 123.9082 | 105.8004 | 82.5920 | 97.4628 |
Notice that I did not use print() when outputting ind.head(4) and ind.tail(4). The benefit is that we get the extra formatting Jupyter applies to DataFrames. The cost is that we can only display one thing like this per cell. Enter a new code cell and enter the following:
ind.head(4)
ind.tail(4)
What do you get?
# 3. Set the index to 'DATE'.
ind = ind.set_index('DATE')
Changing column names
We need to change nine column names. This is a bit tedious, but at least we only have to do this once. Once written, we can reuse the code anytime we deal with this file. Below are three different ways to do it.
# 4. Change the column names (method 1).
# This is the slick way.
# Grab the column names from the index.
# Make a list of the new column names.
# Zip them together (check the documentation for zip) and then create a dict.
# The columns have to be in the correct order with respect to new_names.
old_names = ind.columns.to_list()
new_names = ['consumer', 'consumer durables', 'crude oil', 'mining', 'elec and gas', 'cars', 'manuf', 'ice cream', 'steel']
names = dict(zip(old_names, new_names))
ind = ind.rename(columns=names)
ind.head(2)
consumer | consumer durables | crude oil | mining | elec and gas | cars | manuf | ice cream | steel | |
---|---|---|---|---|---|---|---|---|---|
DATE | |||||||||
1919-01-01 | NaN | NaN | NaN | 22.7730 | NaN | NaN | NaN | NaN | NaN |
1919-04-01 | NaN | NaN | NaN | 23.4306 | NaN | NaN | NaN | NaN | NaN |
# 4. Change the column names (method 2).
# If you know for sure that the columns are in the correct order, you can also do this.
# In method 1, you can look at `names` and make sure the new and old names line up before changing the columns names.
# new_names = ['consumer', 'consumer durables', 'crude oil', 'mining', 'elec and gas', 'cars', 'manuf', 'ice cream', 'steel']
# ind.columns = new_names
# 4. Change the column names (method 3).
# The most tedious, but robust way. No matter what order the columns are in, this will replace the names properly.
# names = {'IPB51000SQ':'consumer', 'IPB51100SQ':'consumer durables', 'IPG211111CSQ':'crude oil',
# 'IPG21SQ':'mining', 'IPG2211A2SQ':'elec and gas', 'IPG3361T3SQ':'cars', 'IPGMFSQ':'manuf',
# 'IPN31152NQ':'ice cream', 'IPN3311A2RNQ':'steel'}
# ind = ind.rename(columns=names)
# 5. dtypes
# Each column is made up of floats. (float64 is the same as float)
ind.dtypes
consumer float64
consumer durables float64
crude oil float64
mining float64
elec and gas float64
cars float64
manuf float64
ice cream float64
steel float64
dtype: object
# 6. shape
# This gives us the number of rows and columns in a tuple.
ind.shape
(404, 9)
# 7. (column) index
# An index object. The column names.
ind.columns
Index(['consumer', 'consumer durables', 'crude oil', 'mining', 'elec and gas',
'cars', 'manuf', 'ice cream', 'steel'],
dtype='object')
# 8. (row) index
# Another index object. A different type (a DatetimeIndex, to hold dates).
ind.index
DatetimeIndex(['1919-01-01', '1919-04-01', '1919-07-01', '1919-10-01',
'1920-01-01', '1920-04-01', '1920-07-01', '1920-10-01',
'1921-01-01', '1921-04-01',
...
'2017-07-01', '2017-10-01', '2018-01-01', '2018-04-01',
'2018-07-01', '2018-10-01', '2019-01-01', '2019-04-01',
'2019-07-01', '2019-10-01'],
dtype='datetime64[ns]', name='DATE', length=404, freq=None)
# 9. sample()
# A random sample of rows.
ind.sample(5)
consumer | consumer durables | crude oil | mining | elec and gas | cars | manuf | ice cream | steel | |
---|---|---|---|---|---|---|---|---|---|
DATE | |||||||||
1968-01-01 | 50.5306 | 39.9326 | NaN | 85.2319 | 35.3282 | NaN | NaN | NaN | NaN |
1923-01-01 | NaN | NaN | NaN | 32.1014 | NaN | NaN | NaN | NaN | NaN |
1920-01-01 | NaN | NaN | NaN | 27.2301 | NaN | NaN | NaN | NaN | NaN |
1985-10-01 | 73.1879 | 58.5867 | 137.7776 | 95.1897 | 62.9601 | 55.9413 | 51.9295 | 77.1728 | 73.7127 |
1929-01-01 | NaN | NaN | NaN | 36.3880 | NaN | NaN | NaN | NaN | NaN |
# 10. describe()
# Summary statistics. If a column is not made up of numbers, it will not appear here.
ind.describe()
consumer | consumer durables | crude oil | mining | elec and gas | cars | manuf | ice cream | steel | |
---|---|---|---|---|---|---|---|---|---|
count | 324.000000 | 292.000000 | 192.000000 | 404.000000 | 324.000000 | 192.000000 | 192.000000 | 192.000000 | 192.000000 |
mean | 65.047151 | 65.033777 | 115.684291 | 71.476008 | 54.800645 | 79.146544 | 74.033846 | 109.750889 | 104.150758 |
std | 32.738490 | 36.161831 | 25.731753 | 27.447983 | 35.141579 | 28.333690 | 24.334594 | 26.297457 | 23.704066 |
min | 12.149600 | 14.197100 | 73.701500 | 17.560800 | 3.421900 | 32.428100 | 36.094600 | 63.286800 | 57.771000 |
25% | 33.051625 | 35.750100 | 90.038175 | 45.412000 | 19.624250 | 54.963425 | 49.640475 | 91.246650 | 90.773525 |
50% | 66.471100 | 57.756800 | 117.488500 | 84.690050 | 56.409250 | 78.134450 | 72.335600 | 105.492500 | 98.896600 |
75% | 100.139600 | 100.364475 | 133.501100 | 91.499200 | 89.436600 | 105.321125 | 98.524600 | 126.651475 | 105.842975 |
max | 113.684700 | 124.179300 | 194.487800 | 133.493200 | 107.675900 | 132.507800 | 108.265900 | 179.319100 | 182.248200 |
Working with messy data: Solutions
# 2. Reading from the internet.
url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
skiprows=18, # skip the first 18 rows
skipfooter=7, # skip the last 7
usecols=[0,1,9,13], # select columns of interest
)
# Rather than use the 'usecols' argument, you could have loaded all the columns and dropped the ones you do not want.
# Notice that the first row is a bunch of text. That will cause some problems later...
pisa.head()
Unnamed: 0 | Mathematics | Reading | Science | |
---|---|---|---|---|
0 | NaN | Mean score in PISA 2012 | Mean score in PISA 2012 | Mean score in PISA 2012 |
1 | NaN | NaN | NaN | NaN |
2 | OECD average | 494.046 | 496.463 | 501.16 |
3 | NaN | NaN | NaN | NaN |
4 | Shanghai-China | 612.676 | 569.588 | 580.118 |
# 3. Drop NaNs.
pisa2 = pisa.dropna()
pisa2.shape
(66, 4)
# 4. Set the index.
pisa2 = pisa2.set_index('Unnamed: 0')
pisa2.head(2)
Mathematics | Reading | Science | |
---|---|---|---|
Unnamed: 0 | |||
OECD average | 494.046 | 496.463 | 501.16 |
Shanghai-China | 612.676 | 569.588 | 580.118 |
# 5. US relative to the average.
# The US is pretty average...
print(pisa2.loc['United States']/pisa2.loc['OECD average'])
Mathematics 0.974335
Reading 1.00225
Science 0.992517
dtype: object
# 6. Convert types/compute correlations
# Strings!
pisa2.dtypes
Mathematics object
Reading object
Science object
dtype: object
# New column names
pisa2.columns = ['math', 'read', 'sci']
# There are several ways to convert strings to numeric values. This is one of them.
pisa2=pisa2[['math', 'read', 'sci']].apply(pd.to_numeric)
pisa2.dtypes
math float64
read float64
sci float64
dtype: object
# Now we are in good shape. What does .corr() do?
# What is the return type of .corr()?
pisa_corrs = pisa2.corr()
print(pisa_corrs)
print(type(pisa_corrs))
# Now save the DataFrame of results to a file.
pisa_corrs.to_excel('pisa_corrs.xlsx')
math read sci
math 1.000000 0.959806 0.972131
read 0.959806 1.000000 0.978559
sci 0.972131 0.978559 1.000000
<class 'pandas.core.frame.DataFrame'>