download notebook
view notebook w/o solutions
Exam 2 Fall 2023 (100 pts total)
files needed = ('fish.csv', 'room_char.xlsx', 'cpi.csv', 'cb_2022_55_tract_500k.zip', 'wdi.csv'), which can be found in exam2_data.zip
You have 75 minutes to complete this exam.
Answer all the questions below in this notebook. You should insert as many cells into the notebook as you need. When you are finished, upload your finished notebook to Canvas.
- You may use your notes and the Internet.
- You cannot work with others on the exam. You cannot post questions online and solicit answers, e.g., through Chegg or ChatGPT.
- Import any packages you need to complete this exam.
- Do not modify the data files directly. All data manipulation should happen in your code.
Question 0 (5 pts): Last, First
Replace 'Last, First' above with your actual name. Enter it as: last name, first name.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import geopandas as gpd
Question 1 (15 points):
Below are two DataFrames. Insert a markdown cell below this one, and use the DataFrames to answer the following questions. You do not need to write any code for this question.
- How many rows and columns does Z have after running the following code?
Z = pd.merge(left=X, right=Y, on='id', how='inner')
- How many rows and columns does Z have after running the following code? How many missing values are in Z['price']?
Z = pd.merge(left=X, right=Y, on='id', how='right')
DataFrame X
id | qty |
---|---|
pear | 2 |
orange | 13 |
Banana | 89 |
apple | 1 |
DataFrame Y
id | price |
---|---|
orange | 0.5 |
pear | 1.2 |
Bannana | 0.1 |
apple | 1.25 |
kiwi | 1.75 |
- Inner join: Z has 3 columns and 3 rows (pear, orange, and apple appear in both DataFrames; 'Banana' and 'Bannana' are spelled differently, so they do not match).
- Right join: Z has 3 columns and 5 rows, since every row of Y is kept. Z['price'] has no missing values, because price comes from Y, the right table. The two unmatched rows (Bannana and kiwi) instead have missing values in Z['qty'].
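As a quick check (not required by the question), the two merges can be reproduced in code. The DataFrames below are rebuilt from the tables above:

X = pd.DataFrame({'id': ['pear', 'orange', 'Banana', 'apple'], 'qty': [2, 13, 89, 1]})
Y = pd.DataFrame({'id': ['orange', 'pear', 'Bannana', 'apple', 'kiwi'], 'price': [0.5, 1.2, 0.1, 1.25, 1.75]})
Z_inner = pd.merge(left=X, right=Y, on='id', how='inner')   # 3 rows x 3 columns
Z_right = pd.merge(left=X, right=Y, on='id', how='right')   # 5 rows x 3 columns
print(Z_inner.shape, Z_right.shape)
print(Z_right.isna().sum())   # the NaNs appear in 'qty', not 'price'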
Question 2 (10 points): MultiIndex
- Load the file fish.csv
- The column index should be a MultiIndex with 'Family' at level 0 and 'Type' at level 1
- The row index should be a MultiIndex with date at level 0 and state at level 1
- Print out a DataFrame that contains only the observations from 2023
# header=[0,1] builds the (Family, Type) column MultiIndex; index_col=[0,1] builds the (date, state) row MultiIndex
fish = pd.read_csv('fish.csv', header=[0,1], index_col=[0,1], parse_dates=[0])
fish.sort_index(inplace=True)   # sort the index so partial date selection works cleanly
fish.loc['2023']                # all observations from 2023
Date | State | Bass: Large mouth | Bass: Small mouth | Bass: Striped | Snapper: Nothern red | Snapper: Mutton | Snapper: Lane | Snapper: Mangrove red |
---|---|---|---|---|---|---|---|---|
2023-01-01 | FL | 1 | 0 | 0 | 0 | 0 | 0 | 3 |
2023-02-15 | WI | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2023-04-04 | AK | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2023-05-07 | FL | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
2023-08-09 | WI | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2023-09-05 | GA | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
2023-09-06 | WI | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
2023-10-07 | AL | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Question 3 (20 points): Grouping
- Load the file room_char.xlsx
- Create a DataFrame that holds the number of rooms with windows and without windows for each building
- The row index should have BLDG at level 0 and WINDOWS at level 1
- There is only one column, which holds the room counts. Name the column ROOM COUNT
- Print out your DataFrame

For example, the first two rows of this DataFrame would be

CHAMBERLIN | NO | 7 |
CHAMBERLIN | YES | 5 |
room = pd.read_excel('room_char.xlsx')
room.head(2)
 | BLDG | ROOM | CAPACITY | WINDOWS |
---|---|---|---|---|
0 | SEWELL | 1 | 3 | YES |
1 | SEWELL | 2 | 3 | YES |
room_counts = room[['BLDG', 'WINDOWS', 'ROOM']].groupby(['BLDG', 'WINDOWS']).count()
room_counts.rename(columns={'ROOM':'ROOM COUNT'}, inplace=True)
room_counts
BLDG | WINDOWS | ROOM COUNT |
---|---|---|
CHAMBERLIN | NO | 7 |
CHAMBERLIN | YES | 5 |
SEWELL | NO | 4 |
SEWELL | YES | 6 |
VILAS | YES | 4 |
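An equivalent approach (a sketch, not required) uses groupby with size(), which counts rows per group directly. size() counts every row in a group, while count() skips missing values; with no missing ROOM values the two agree:

room_counts_alt = room.groupby(['BLDG', 'WINDOWS']).size().to_frame('ROOM COUNT')
room_counts_alt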
Question 4 (15 points): Time series
- Load the file cpi.csv. It contains data on the consumer price index.
- Create a CPI variable at the quarterly frequency
- Plot the quarterly CPI. Make your figure graphically excellent.
- Add a vertical line to your plot at August 1, 2020
cpi = pd.read_csv('cpi.csv', parse_dates=['DATE'], index_col='DATE')
cpiq = cpi.resample('Q').mean()   # average the monthly index within each quarter

fig, ax = plt.subplots()
ax.plot(cpiq.index, cpiq['CPIAUCSL'], color='black')
ax.axvline(x=dt.datetime(2020,8,1), color='black', ls='--')   # vertical line at August 1, 2020
ax.set_title('Consumer Price Index (1982-84 = 100)')
ax.set_ylabel('index')
sns.despine()
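Note: in recent pandas releases (2.2 and later), the quarter-end resampling alias is 'QE'. If 'Q' raises a FutureWarning, the resampling line can be written as:

cpiq = cpi.resample('QE').mean()   # quarter-end alias in newer pandas; 'Q' works in older versions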
Question 5 (15 points):
The file cb_2022_55_tract_500k.zip contains the shapefile for the census tracts in Wisconsin, downloaded from here.
- Create a map of only the census tracts in Dane County. Use a white background and black edges.
- Lake Mendota is tract 9917.02 and Lake Monona is tract 9917.03. Color these two tracts skyblue.
# wi holds every census tract in Wisconsin; subset to Dane County when plotting
wi = gpd.read_file('cb_2022_55_tract_500k.zip')

fig, ax = plt.subplots(figsize=(15,8))
wi[wi['NAMELSADCO']=='Dane County'].plot(color='white', edgecolor='black', ax=ax)
wi[wi['NAME'].isin(['9917.03','9917.02'])].plot(color='skyblue', edgecolor='black', ax=ax)   # Lakes Mendota and Monona
ax.set_title('Dane County census tracts')
ax.axis('off')
plt.show()
Question 6 (10 points):
- Use the code below to create a DataFrame
ads = pd.DataFrame({'address':['1180 Observatory Dr.;Madison;WI;53706', '1220 Linden Dr.;Madison;WI;53706']})
- Extract the street, city, state, and zip out of the address variable and save them in columns named Street, City, State, and Zip.
The first row should look like:
address | Street | City | State | Zip |
---|---|---|---|---|
1180 Observatory Dr.;Madison;WI;53706 | 1180 Observatory Dr. | Madison | WI | 53706 |
ads = pd.DataFrame({'address':['1180 Observatory Dr.;Madison;WI;53706', '1220 Linden Dr.;Madison;WI;53706']})
ads[['Street', 'City', 'State', 'Zip']] = ads['address'].str.split(';', expand=True)
ads
 | address | Street | City | State | Zip |
---|---|---|---|---|---|
0 | 1180 Observatory Dr.;Madison;WI;53706 | 1180 Observatory Dr. | Madison | WI | 53706 |
1 | 1220 Linden Dr.;Madison;WI;53706 | 1220 Linden Dr. | Madison | WI | 53706 |
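An alternative (not required) is a regular expression with str.extract. The pattern below assumes the same four ';'-separated fields as in the addresses above:

ads[['Street', 'City', 'State', 'Zip']] = ads['address'].str.extract(r'([^;]+);([^;]+);([^;]+);([^;]+)')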
Question 7 (10 points):
The file wdi.csv contains a panel of data. It covers the United States and Germany from 1998 to 2022. For each country and year, it measures service exports and GDP.
- Insert a markdown cell below this one and write which part of the data should be columns and which should be rows in order to make the data "tidy."

Insert a code cell below your answer from part 1 and
- Load the wdi.csv file into a DataFrame.
- Reshape the DataFrame to match your answer from part 1.
- Sort your row index and print out your DataFrame.
Tidy data has observations as rows and variables as columns. In this case, I want the country and time to be the rows and service exports and GDP to be the columns.
# stack() moves the year columns into the row index; unstack() moves the series identifiers into the column index
wdi = pd.read_csv('wdi.csv', index_col=['Country Name', 'Country Code', 'Series Name', 'Series Code']).stack()
wdi.unstack(['Series Name', 'Series Code']).sort_index()
Country Name | Country Code | Year | Commercial service exports (current US$) (TX.VAL.SERV.CD.WT) | GDP (current US$) (NY.GDP.MKTP.CD) |
---|---|---|---|---|
Germany | DEU | 1998 | 7.716637e+10 | 2.238990e+12 |
Germany | DEU | 1999 | 7.648444e+10 | 2.194950e+12 |
Germany | DEU | 2000 | 7.785096e+10 | 1.947980e+12 |
Germany | DEU | 2001 | 7.843019e+10 | 1.945790e+12 |
Germany | DEU | 2002 | 9.281096e+10 | 2.078480e+12 |
Germany | DEU | 2003 | 1.127450e+11 | 2.501640e+12 |
Germany | DEU | 2004 | 1.438780e+11 | 2.814350e+12 |
Germany | DEU | 2005 | 1.563170e+11 | 2.846860e+12 |
Germany | DEU | 2006 | 1.782880e+11 | 2.994700e+12 |
Germany | DEU | 2007 | 2.080620e+11 | 3.425580e+12 |
Germany | DEU | 2008 | 2.392090e+11 | 3.745260e+12 |
Germany | DEU | 2009 | 2.216440e+11 | 3.411260e+12 |
Germany | DEU | 2010 | 2.259540e+11 | 3.399670e+12 |
Germany | DEU | 2011 | 2.505440e+11 | 3.749310e+12 |
Germany | DEU | 2012 | 2.508420e+11 | 3.527140e+12 |
Germany | DEU | 2013 | 2.713350e+11 | 3.733800e+12 |
Germany | DEU | 2014 | 2.979800e+11 | 3.889090e+12 |
Germany | DEU | 2015 | 2.758300e+11 | 3.357590e+12 |
Germany | DEU | 2016 | 2.883330e+11 | 3.469850e+12 |
Germany | DEU | 2017 | 3.173810e+11 | 3.690850e+12 |
Germany | DEU | 2018 | 3.511200e+11 | 3.974440e+12 |
Germany | DEU | 2019 | 3.610550e+11 | 3.888230e+12 |
Germany | DEU | 2020 | 3.266040e+11 | 3.889670e+12 |
Germany | DEU | 2021 | 4.013260e+11 | 4.259930e+12 |
Germany | DEU | 2022 | 4.226560e+11 | 4.072190e+12 |
United States | USA | 1998 | 2.404860e+11 | 9.062820e+12 |
United States | USA | 1999 | 2.697780e+11 | 9.631170e+12 |
United States | USA | 2000 | 2.888670e+11 | 1.025090e+13 |
United States | USA | 2001 | 2.758420e+11 | 1.058190e+13 |
United States | USA | 2002 | 2.804080e+11 | 1.092910e+13 |
United States | USA | 2003 | 2.887030e+11 | 1.145640e+13 |
United States | USA | 2004 | 3.325520e+11 | 1.221720e+13 |
United States | USA | 2005 | 3.629070e+11 | 1.303920e+13 |
United States | USA | 2006 | 4.038650e+11 | 1.381560e+13 |
United States | USA | 2007 | 4.747370e+11 | 1.447420e+13 |
United States | USA | 2008 | 5.217060e+11 | 1.476990e+13 |
United States | USA | 2009 | 5.019200e+11 | 1.447810e+13 |
United States | USA | 2010 | 5.628350e+11 | 1.504900e+13 |
United States | USA | 2011 | 6.231990e+11 | 1.559970e+13 |
United States | USA | 2012 | 6.626740e+11 | 1.625400e+13 |
United States | USA | 2013 | 6.975640e+11 | 1.684320e+13 |
United States | USA | 2014 | 7.373560e+11 | 1.755070e+13 |
United States | USA | 2015 | 7.493160e+11 | 1.820600e+13 |
United States | USA | 2016 | 7.646530e+11 | 1.869510e+13 |
United States | USA | 2017 | 8.175490e+11 | 1.947730e+13 |
United States | USA | 2018 | 8.434140e+11 | 2.053310e+13 |
United States | USA | 2019 | 8.686460e+11 | 2.138100e+13 |
United States | USA | 2020 | 7.043200e+11 | 2.106050e+13 |
United States | USA | 2021 | 7.781880e+11 | 2.331510e+13 |
United States | USA | 2022 | 9.000040e+11 | 2.546270e+13 |
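A melt-then-pivot reshape gives the same result (a sketch, assuming wdi.csv has one column per year, as implied by the output above; wdi_long and wdi_tidy are illustrative names):

# Melt the year columns into long form, then pivot the series identifiers into columns
wdi_long = pd.read_csv('wdi.csv').melt(
    id_vars=['Country Name', 'Country Code', 'Series Name', 'Series Code'],
    var_name='Year', value_name='Value')
wdi_tidy = wdi_long.pivot_table(
    index=['Country Name', 'Country Code', 'Year'],
    columns=['Series Name', 'Series Code'],
    values='Value').sort_index()
wdi_tidy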