
Exam 2 (100 pts total)

files needed = ('covid.csv', 'wdi_e2.csv', 'clubs.csv', 'roster.csv', 'q2a.csv') You can find them in 'exam2_data_prac.zip' on the course webpage.

You have 75 minutes to complete this exam.

Answer all the questions below in this notebook. You should insert as many cells into the notebook as you need. When you are finished, upload your finished notebook to Canvas.

  • You may use your notes and the Internet, but you cannot work with others.
  • Import any packages you need to complete this exam.
  • Do not modify the data files directly. All data manipulation should happen in your code.

Remember, Jupyter notebooks and Python have lots of built-in help facilities.

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import pandas_datareader.data as web

Question 0 (5 pts): Ruhl, Kim

Replace 'Last, First' above with your actual name. Enter it as: last name, first name.

Question 1 (5 points): Saving figures

  1. If you are saving a figure to use on a website or to embed in an MS Word document, which format should you use? Why?

Insert a cell below and type your answer there.

# When creating a figure for use in a Word document or on the web, use the "scalable vector graphics" (SVG) file format. 
# When saved as an SVG, the figure will scale as the user zooms in or out, while remaining clear. 
# A figure saved as a PNG, for example, will become blurry when a user zooms in.
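As a minimal sketch (the filenames 'cases.svg' and 'cases.png' are just examples): the extension passed to savefig determines the format.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])

# The file extension picks the format: .svg stays sharp at any zoom level,
# while a raster format like .png pixelates when enlarged.
fig.savefig('cases.svg')
fig.savefig('cases.png', dpi=150)  # raster alternative; dpi fixes the resolution
```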

Question 2A (10 pts): Split-apply-combine

The file 'covid.csv' contains data on covid tests in Wisconsin by county. Each observation in the data is a county-day. The data cover March 15, 2020 to November 16, 2020.

  1. The POS_NEW variable contains the day's new cases. Use this variable to compute the cumulative number of cases in each county.
  2. Sort your results by county name (a to z) and print out the first three counties.
c19 = pd.read_csv('covid.csv', parse_dates=['DATE'])
c19.head(2)
GEOID GEO NAME DATE POSITIVE POS_NEW POS_7DAYAVG NEGATIVE NEG_NEW NEG_7DAYAVG ... DTH_WHT DTH_MLTOTH DTH_UNK DTH_E_HSP DTH_E_NHSP DTH_E_UNK POS_HC_Y POS_HC_N POS_HC_UNK OBJECTID
0 55001 County Adams 2020-03-15 19:00:00+00:00 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 55003 County Ashland 2020-03-15 19:00:00+00:00 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 3

2 rows × 66 columns

cases_cum = c19[['NAME', 'POS_NEW']].groupby('NAME').sum().sort_index()
cases_cum.to_csv('q2a.csv', index_label='NAME')
cases_cum.head(3)
POS_NEW
NAME
Adams 852.0
Ashland 514.0
Barron 2721.0
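The sum above gives each county's total through the last date in the sample. If you instead wanted the running cumulative total on every county-day, a grouped cumsum is one way to get it. A sketch on made-up data (not the actual covid file):

```python
import pandas as pd

# Toy data standing in for covid.csv: two counties, three days each.
toy = pd.DataFrame({
    'NAME':    ['Adams', 'Adams', 'Adams', 'Ashland', 'Ashland', 'Ashland'],
    'POS_NEW': [1.0, 2.0, 3.0, 5.0, 0.0, 2.0],
})

# cumsum within each county gives the running total as of each day...
toy['POS_CUM'] = toy.groupby('NAME')['POS_NEW'].cumsum()

# ...and the last value per county matches the groupby-sum approach.
totals = toy.groupby('NAME')['POS_CUM'].last()
print(totals)
```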

Question 2B (15 pts): Bar graphs and label formatting

If you completed question 2A, you may use your results in this question. If not, use the file 'q2a.csv' to complete this question.

  1. Create two subplots arranged as one column and two rows. Make the figure size 15 inches by 15 inches.
  2. In the top subplot, create a bar plot of the cumulative cases for the counties whose names start with A-L. The x axis should be the names of the counties.
  3. In the bottom subplot, create a bar plot of the cumulative cases for the counties whose names start with M-Z. The x axis should be the names of the counties.
  4. Rotate the county names on the x axis 90 degrees so that they are vertical.
  5. Label the y axes 'cumulative number of cases.'
  6. The bars should be silver and the major y-axis grid lines should be white.
  7. Give your graph a title.
  8. Remove the top and right spines from each subplot.
# If needed:
cases_cum = pd.read_csv('q2a.csv')
cases_cum = cases_cum.set_index('NAME')

low = cases_cum[cases_cum.index.str.slice(0,1) <= 'L']
high = cases_cum[cases_cum.index.str.slice(0,1) > 'L']
fig, ax = plt.subplots(2, 1, figsize=(15,15))

ax[0].bar(low.index, low['POS_NEW'], color='silver')
ax[1].bar(high.index, high['POS_NEW'], color='silver')

ax[0].set_title('Cumulative COVID-19 cases in Wisconsin (November 16, 2020)')

for a in ax:
    a.tick_params(axis='x', labelrotation=90)
    a.set_ylabel('cumulative number of cases')
    a.grid(color='white', axis='y')
    sns.despine(ax=a)

[Figure: two bar charts of cumulative COVID-19 cases by Wisconsin county, A–L (top) and M–Z (bottom)]
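sns.despine is a convenience wrapper; if seaborn is not available, the same top/right spine removal can be done in plain matplotlib. A minimal sketch:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(['a', 'b'], [1, 2], color='silver')

# Hide the top and right spines directly -- equivalent to sns.despine(ax=ax).
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
```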

Question 4 (15 pts): Long and wide data

The file 'wdi_e2.csv' contains data on real GDP and the working age population (15-64 years) for China and Japan from 1970 to 2019.

  1. Compute the ratio of GDP to the working age population for both countries for all years. Name the variable 'gdpc'.
  2. Make sure that your DataFrame is organized so that the units of observation (country-date pairs) are rows and the variables are columns.
  3. Sort the index of your DataFrame.
  4. Print out the first two rows of your DataFrame.
wdi = pd.read_csv('wdi_e2.csv')
wdi = wdi.set_index(['var', 'DATE', 'cty'])
wdi.head(2)
value
var DATE cty
gdp 1970-01-01 chn 228.510558
jpn 18699.742958
wdi = wdi.unstack(['var'])
wdi.head(2)
value
var gdp pop
DATE cty
1970-01-01 chn 228.510558 55.844752
jpn 18699.742958 69.010653
# You can leave the 'value' level and then add 'value' in all of the variable names, e.g., ('value', 'gdp')
wdi.columns = wdi.columns.droplevel(0)
wdi['gdpc'] = wdi['gdp'] / wdi['pop']
wdi.sort_index(inplace=True)
wdi.head(2)
var gdp pop gdpc
DATE cty
1970-01-01 chn 228.510558 55.844752 4.091890
jpn 18699.742958 69.010653 270.968933
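unstack is one route from long to wide; pivot_table reaches the same shape in a single call. Sketched on toy data (the numbers mimic the first rows of wdi_e2.csv but the frame is built by hand):

```python
import pandas as pd

# Toy long-format data shaped like wdi_e2.csv.
long = pd.DataFrame({
    'var':   ['gdp', 'gdp', 'pop', 'pop'],
    'DATE':  ['1970-01-01'] * 4,
    'cty':   ['chn', 'jpn', 'chn', 'jpn'],
    'value': [228.5, 18699.7, 55.8, 69.0],
})

# Rows become (DATE, cty) pairs; each value of 'var' becomes a column.
wide = long.pivot_table(index=['DATE', 'cty'], columns='var', values='value')
wide['gdpc'] = wide['gdp'] / wide['pop']
wide = wide.sort_index()
print(wide.head(2))
```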

Question 5 (10 pts): FRED and working with dates

  1. Download the weekly initial (unemployment) claims data for Wisconsin from FRED. Get the data from the beginning of 2019 to the most recent. [Use the FRED website to find the variable code for 'Initial Claims in Wisconsin'.]
  2. Compute the average weekly claims for 2019 and the average weekly claims for 2020. Print them out as below, replacing the xs with the appropriate numbers.

'The 2019 average weekly claims are xxxx. The 2020 average weekly claims are xxxxx.'

Note that the numbers are printed out as integers.

claims = web.DataReader('WIICLAIMS', 'fred', start='2019-01-01') 
# Two ways to do this. 

# 1. Slice
claims19 = claims.loc['2019-01-01':'2019-12-31'].mean()
claims20 = claims.loc['2020-01-01':'2020-12-31'].mean()

# 2. Groupby!
means = claims.groupby(claims.index.year).mean()
print('The 2019 average weekly claims are {0:.0f}. The 2020 average weekly claims are {1:.0f}.'
      .format(means.loc[2019, 'WIICLAIMS'], means.loc[2020, 'WIICLAIMS']))
The 2019 average weekly claims are 5520. The 2020 average weekly claims are 23129.
print('The 2019 average weekly claims are {0:.0f}. The 2020 average weekly claims are {1:.0f}.'
      .format(claims19.values[0], claims20.values[0]))
The 2019 average weekly claims are 5520. The 2020 average weekly claims are 23129.
claims19
WIICLAIMS    5520.057692
dtype: float64
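If FRED is unreachable, the groupby-by-year pattern can still be checked on synthetic weekly data. A sketch (the claims numbers here are made up, standing in for WIICLAIMS):

```python
import pandas as pd

# Synthetic Saturday-dated weekly series spanning 2019-2020.
idx = pd.date_range('2019-01-05', '2020-12-26', freq='W-SAT')
fake = pd.DataFrame({'claims': range(len(idx))}, index=idx)

# Grouping on index.year averages within each calendar year.
means = fake.groupby(fake.index.year).mean()
print('The 2019 average weekly claims are {0:.0f}. '
      'The 2020 average weekly claims are {1:.0f}.'
      .format(means.loc[2019, 'claims'], means.loc[2020, 'claims']))
```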

Done!

This was originally an exam that I gave in the past. I think it was a little too long. Let's call Q1–Q5 about the right length for the exam. I am leaving Q6 below, though, so you can use it for practice.

Question 6 (20 pts): Merging and cleaning

  1. Load 'clubs.csv' into a DataFrame named clubs. The file contains data about four BIG10 schools.
  2. Load 'roster.csv' into a DataFrame named roster. The file contains data about football players at four BIG10 schools.
  3. Merge the two DataFrames. The new DataFrame should contain all the players in the roster dataset, matched to their team information. Name the new DataFrame data.
  4. How many matches do you have? Print out the number as "There are xx matches in data."
  5. There should be 50 matches but you probably found fewer. Let's fix this. Clean up the clubs and roster DataFrames using pandas commands (do not modify the data files) and perform another merge. Name this new matched DataFrame data_clean.
  6. Print out the number as "There are xx matches in data_clean." (The xx should be equal to 50 at this point!)
clubs = pd.read_csv('clubs.csv')
roster = pd.read_csv('roster.csv')
print(clubs.head(2))
print(roster.head(2))
   team   mascot      city state
0   wis   badger   Madison    WI
1   nwn  wildcat  Evanston    IL
   NUMBER          NAME POSITION  WEIGHT CLASS TEAM
0      17     Jack Coan       QB     221    Jr  wis
1       5  Graham Mertz       QB     215    Fr  wis
data = pd.merge(left=clubs, right=roster, left_on='team', right_on='TEAM', how='outer', indicator=True)
print('There are {} matches in data.'.format(data['_merge'].value_counts().loc['both']))
There are 25 matches in data.
# Extra whitespace!
clubs['team'].unique()
array([' wis', 'nwn', 'umi ', 'umn', 'msu'], dtype=object)
# Extra whitespace!
roster['TEAM'].unique()
array(['wis', 'wis ', 'umn', 'umi', 'umi ', 'nwn', 'utah'], dtype=object)
# Remove the extra whitespace.
clubs['team'] = clubs['team'].str.strip()
roster['TEAM'] = roster['TEAM'].str.strip()
data_clean = pd.merge(left=clubs, right=roster, left_on='team', right_on='TEAM', how='outer', indicator=True)
print('There are {} matches in data_clean.'.format(data_clean['_merge'].value_counts().loc['both']))
There are 50 matches in data_clean.
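Before stripping, the indicator column is also a quick diagnostic: rows tagged left_only or right_only point straight at the keys that failed to match. A sketch on toy frames with the same stray-whitespace problem as clubs and roster:

```python
import pandas as pd

# Toy frames: ' wis' has a leading space, so it won't match 'wis'.
clubs_t = pd.DataFrame({'team': [' wis', 'nwn'], 'city': ['Madison', 'Evanston']})
roster_t = pd.DataFrame({'TEAM': ['wis', 'nwn'], 'NAME': ['Coan', 'Smith']})

m = pd.merge(clubs_t, roster_t, left_on='team', right_on='TEAM',
             how='outer', indicator=True)

# Non-'both' rows expose the bad keys on each side.
bad = m.loc[m['_merge'] != 'both', ['team', 'TEAM']]
print(bad)
```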