download notebook
view notebook w/o solutions
Exam 1 (100 pts total)
files needed = ('GFDEGDQ188S.csv', 'Property_Tax_Roll.csv', 'go-by-industry-82-97.csv'), which can be found in exam1_data.zip
You have 75 minutes to complete this exam.
Answer all the questions below in this notebook. You should insert as many cells into the notebook as you need. When you are finished, upload your finished notebook to Canvas.
- You may use your notes and the Internet, but you cannot work with others.
- Import any packages you need to complete this exam.
- Do not modify the data files directly. All data manipulation should happen in your code.
Remember, jupyter notebooks and python have lots of built in help facilities.
import matplotlib.pyplot as plt
import pandas as pd
Question 0 (5 pts): Last, First
Replace 'Last, First' above with your actual name. Enter it as: last name, first name.
Question 1 (10 pts): Visualization
Go to this link. It is the UW's covid dashboard.
Find the figure titled "Positive PCR Test Results on Campus By Day" and look at the "UW-Madison" tab.
Insert a markdown cell below, and answer the following questions.
-
What is the message you take away from this figure?
-
What audience is this figure created for? Is the figure appropriate for this audience? Why or why not?
-
Is the figure well-suited for the medium in which it is presented? Why or why not?
-
Since mid-February, positive results (i.e., cases) have been falling. Levels now are very low compared to the beginning of the year. Employee cases are always lower than student cases.
-
This is created for students and facutly/staff. The figure is appropriate, it uses only reports of the data and 7-day averages, which the general public can understand.
-
Yes, the figure is suited to the medium, which is a webpage. The use of interactive features is useful for retrieving the exact values.
Not needed for credit, but other observations:
-
This would not look good printed in black and white. The colors would not translate well and would be hard to tell apart. This is clearly, though, meant to be viewed online.
-
I wonder how the 7-day average treats the missing values on the weekends.
-
Thoughful use of color in that the student data are always blue and the emplyee data always green.
-
Here is the original dashboard. That pie chart could be reworked. I like that they reported the data in a table, but the table should be set up to be downloaded.
Responses from students:
- Stacked bar graphs can be tough to read. [I agree.]
- Students (a major part of the audience) will likely be using a mobile device. Is the figure mobile-friendly? [We should check!].
Question 2 (20 pts): Functions and flow control
Write a function named top_5_average
that takes one argument: a list of numbers. Your function should do two things.
-
Check that the variable passed to the function is of the type
list
. If it is not a list, print out "The input is not a list." You do not need to check that the list is only made up of numbers. -
If the variable passed to the function is a
list
, the function should return the average value of the five largest elements of the list. Fory1
, it would be the average of the numbers: 98, 124, 1632, 8715, 9815.
Test your code on these lists of numbers:
y1 = [2, 65, 8715, 12.5, 124, 77, 45.23, 1632, 0, 98, 9815]
y2 = [26, 48, 123.89, 78, 5894, 3654, 59, 12.7, 8994]
and report your answers as
"The 5-element average maximum for y1 is ????.??"
"The 5-element average maximum for y2 is ????.??"
Replace the ????.?? with the average of the five largest elements. Note the two numbers to the right of the decimal point.
import numpy as np
def top_5_average(x):
'''Return the average value of the 5 largest elements of a list.'''
if type(x) != list:
print('The input is not a list.')
else:
x.sort()
return np.mean(x[-5:])
y1 = [2, 65, 8715, 12.5, 124, 77, 45.23, 1632, 0, 98, 9815]
y2 = [26, 48, 123.89, 78, 5894, 3654, 59, 12.7, 8994]
print('The 5-element average maximum for y1 is {:.2f}'.format(top_5_average(y1)) )
print('The 5-element average maximum for y2 is {:.2f}'.format(top_5_average(y2)) )
The 5-element average maximum for y1 is 4076.80
The 5-element average maximum for y2 is 3748.78
Question 3 (20 pts): Selecting data from a DataFrame
Load the file 'Property_Tax_Roll.csv'. It contains property tax information for Madison properties.
Use python and pandas to answer the following questions.
- How many properties have total tax growth (
PctTaxChangeTotal
) of more than 20 percent? Print the answer as
"There are ?,??? properties with a greater than 20 percent increase in property tax." (Note the comma in the reported number, which is an integer.)
tax = pd.read_csv('Property_Tax_Roll.csv')
print('There are {:,} properties with a greater than 20 percent increase in property tax.'.format(tax[tax['PctTaxChangeTotal']>20].shape[0]))
There are 1,379 properties with a greater than 20 percent increase in property tax.
- Create a new DataFrame that contains only the parcels: 60801103172, 71010108118, and 81026423137 and only the columns
address
,TotalAssessedValue
, andEstFairMkt
.
Print out the DataFrame.
tax.set_index('Parcel', inplace=True)
parcels = [60801103172, 71010108118, 81026423137]
cols = ['address', 'TotalAssessedValue', 'EstFairMkt']
tax.loc[parcels, cols]
address | TotalAssessedValue | EstFairMkt | |
---|---|---|---|
Parcel | |||
60801103172 | 6106 Mulberry Cir | 224800 | 226500 |
71010108118 | 5318 Queensbridge Rd | 309200 | 311500 |
81026423137 | 6023 Woodbine Dr | 100 | 100 |
Question 4 (10 pts): Loading messy data
The file 'go-by-industry-82-97.csv' contains annual gross output by industry.
- Load the file into a DataFrame. Print out only the first two rows from your DataFrame.
go = pd.read_csv('go-by-industry-82-97.csv', header=4, na_values=['...', 'n/a'])
go.head(2)
Industry | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | All industries | 6175.0 | 6631.0 | 7313.8 | 7775.7 | 8031.0 | 8707.5 | 9434.2 | 10069.8 | 10624.6 | 10808.0 | 11381.0 | 12024.4 | 12826.8 | 13653.2 | 14463.4 | 15393.3 |
1 | Private industries | 5468.1 | 5868.2 | 6496.1 | 6883.4 | 7075.2 | 7702.6 | 8366.9 | 8919.9 | 9386.8 | 9490.3 | 9998.7 | 10592.7 | 11334.5 | 12104.9 | 12864.4 | 13728.2 |
- Print out:
"The private industries gross output average is ????.? billion dollars."
Replace the ????.? with the average (over all the years) of gross output for the "Private industries" sector.
Note the single number to the right of the decimal point.
go.set_index('Industry', inplace=True)
avg_all = go.loc['Private industries'].mean()
print('Private industries gross output average {:.1f} billion dollars.'.format(avg_all))
Private industries gross output average 9142.6 billion dollars.
Question 5 (30 pts): Plotting
The file 'GFDEGDQ188S.csv' contains the U.S. debt-gdp ratio, in precent. Create a line plot with the date on the x-axis and the debt-gdp ratio on the y-axis.
- The figure size should be 12 inches wide and 8 inches tall.
- The line should be black, solid, and have a width of 3.
- Add a vertical line at 2020. Make the line black and dashed.
- Next to the vertical line, place the text: 'Pandemic onset'.
Make any further adjustments you find neccessary.
debt = pd.read_csv('GFDEGDQ188S.csv')
debt.columns = ['date', 'debt']
debt.set_index('date', inplace=True)
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(debt.index, debt['debt'], color='black', lw=3)
ax.set_ylabel('share of GDP (percent)', fontsize=16)
ax.set_title('U.S. Debt-GDP Ratio', fontsize=20)
ax.axvline(x = 2020, color='black', linestyle='--')
ax.text(2008, 40, 'Pandemic onset', fontsize=16)
for s in ['top', 'right']:
ax.spines[s].set_visible(False)
plt.show()
Question 6 (5 pts): Calculating in DataFrames
The code below works—it will compute the total of the column y
—but it is inefficient.
- Insert a markdown cell below and, in a sentence or two, describe why the code is not efficient.
df = pd.DataFrame({'x':[100, 200, 300], 'y':[1,2,3]})
ytot = 0
for i in range(0,3):
ytot = ytot + df.iloc[i, 1]
The code is not efficient because we are explicitly looping over the rows of the DataFrame. We should avoid looping directly over the rows of a DataFrame whenever possible.
You are finished!
Upload your completed notebook to Canvas.