
Coding practice #5: Due midnight, December 8

files needed = ('state_gdp.csv', 'state_unemp.xlsx')

Answer the questions below in a Jupyter notebook. You can simply add cells to this notebook and enter your answers. When you are finished, upload the completed notebook to Canvas.

My office hours are Tuesdays 9:00AM-10:00AM and Tuesdays 3:30PM-4:30PM in 7444 Soc Sciences. Satyen's are Mondays 3:00PM-4:00PM in 6413 Soc Sciences and Mitchell's are Thursdays 3:00PM-4:00PM in 7308 Soc Sciences.

You should feel free to discuss the coding practice with your classmates, but the work you turn in should be your own.

Cite any code that is not yours: badgerdata.org/pages/citing-code

Read the class policy on AI: badgerdata.org/pages/ai

Q1 Cleaning and merging panel data from multiple sources

Economists often use data from multiple sources to conduct their analysis. It is almost never the case that all the data required for a project are available from one source. This is an artifact of how economic data are collected in the United States. For example, the Bureau of Labor Statistics (BLS) is responsible for measuring the unemployment rate, while the Bureau of Economic Analysis (BEA) is responsible for measuring gross domestic product. (These bureaus are themselves parts of different departments of the executive branch: the Department of Labor in the case of the BLS and the Department of Commerce in the case of the BEA.) As a result, data are often published in different formats that require some extra work to merge.

In this exercise, your task is to merge state-level unemployment rate data from the BLS with state-level GDP data from the BEA. You will then make a scatterplot showing Okun's Law for Wisconsin and Michigan.

Note: An easier way to do this is to use the FRED API to retrieve each series directly. However, the purpose of this exercise is to practice cleaning and merging data, so let's get to it. (In the bad old days, we didn't have FRED...)

Part (a): Importing and cleaning the GDP data

  1. Read in the file "state_gdp.csv" as a Pandas data frame called gdp. This file (downloaded from https://apps.bea.gov/regional/Downloadzip.cfm) contains annualized real GDP (in millions of chained 2009 dollars) for each quarter from 2005Q1 to 2018Q1 for each state and industry. Note that the data are in "wide" format.

  2. The data contain a breakdown of each industry's contribution to GDP in each state. For this exercise, we are only interested in total real GDP for each state. Thus, keep only the rows for which the column 'Description' is equal to "All industry total".

  3. The data contain real GDP for the United States as a whole as well as for several subregions. We are only interested in states, so we will drop the others. To do this, drop any row for which the column 'GeoName' contains "United States", "New England", "Mideast", "Great Lakes", "Plains", "Southeast", "Southwest", "Rocky Mountain", or "Far West".

  4. Clean up the columns:

    1. Drop the columns 'GeoFIPS', 'Region', 'ComponentId', 'ComponentName', 'IndustryId', 'IndustryClassification', and 'Description'.
    2. Rename the column 'GeoName' to 'state'.

  5. Reshape the data from wide to long format:

    1. Convert the data so the unit of observation is a (state, date) pair, with the GDP values in a single column.
    2. Reset the index. Rename the column 'variable' to 'date' and 'value' to 'real_gdp'.
    3. Convert 'real_gdp' to a floating-point number. (The columns containing the GDP data were originally imported as strings because GDP is censored or missing for some industries and states.)

  6. Finally, we should convert the date to a datetime object. Unfortunately, the datetime parser doesn't like the format of our dates ("2005:Q1", etc.), so this will take two steps (see the sketch after this list):

    1. Use the .str.replace() method to replace ":Q" with "-q".
    2. Convert 'date' to a datetime object as usual.
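
Here is a minimal sketch of part (a), assuming the column names described above match your download; it uses melt(), which produces the 'variable' and 'value' columns mentioned in step 5. Check each intermediate frame as you go.

```python
import pandas as pd

# Step 1: read the raw file.
gdp = pd.read_csv('state_gdp.csv')

# Step 2: keep only the all-industry totals.
gdp = gdp[gdp['Description'] == 'All industry total']

# Step 3: drop the national and regional aggregates.
regions = ['United States', 'New England', 'Mideast', 'Great Lakes', 'Plains',
           'Southeast', 'Southwest', 'Rocky Mountain', 'Far West']
gdp = gdp[~gdp['GeoName'].str.contains('|'.join(regions))]

# Step 4: drop the identifier columns and rename 'GeoName'.
gdp = gdp.drop(columns=['GeoFIPS', 'Region', 'ComponentId', 'ComponentName',
                        'IndustryId', 'IndustryClassification', 'Description'])
gdp = gdp.rename(columns={'GeoName': 'state'})

# Step 5: wide to long. melt() creates 'variable' and 'value' columns directly.
gdp = gdp.melt(id_vars='state')
gdp = gdp.rename(columns={'variable': 'date', 'value': 'real_gdp'})

# The GDP column came in as strings; coerce turns censored values into NaN.
gdp['real_gdp'] = pd.to_numeric(gdp['real_gdp'], errors='coerce')

# Step 6: '2005:Q1' -> '2005-q1', which pandas can parse as a datetime.
gdp['date'] = pd.to_datetime(gdp['date'].str.replace(':Q', '-q'))
```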

Part (b): Importing and cleaning the unemployment rate data

  1. Import the file "state_unemp.xlsx" as a Pandas data frame called unemp. This file (downloaded from https://www.bls.gov/lau/rdscnp16.htm) contains measures of the population, labor force, employment, unemployment, and unemployment rate by state (plus New York City and Los Angeles County) in each month from January 1976 to September 2018. Notice how different the formatting is compared to the GDP data! When importing, do the following:

    1. The column names are not friendly for importing, so keep only columns 1, 2, 3, and 10. Try the usecols option in read_excel.
    2. Use skiprows=8 to avoid importing the messy headers.
    3. Set the column names to 'state', 'year', 'month', and 'unemp_rate'.
  2. Drop the observations where state is either "Los Angeles County" or "New York City".

  3. Next, we need to create a date variable that we can use to merge with the gdp data. The gdp data are quarterly.

  4. Create a column that is 'yyyy-mm' constructed from the year and month columns.

  5. Convert your string date variable to a datetime. Name your datetime variable date.
  6. Drop the year and month columns.

  7. Now we need to turn the monthly data into quarterly data. This is a little more complicated because we need to do this for each state. We are combining a groupby and a resample.

    1. Set the index to date. We need a datetime index in order to resample.
    2. Group by 'state' and use .resample('QS').mean() (a sketch follows the note below).

We are grouping by state. Then, within each group, we are resampling, and taking the mean over the 3 months in each quarter. The 'QS' tells pandas to assign the first day of the quarter ('S' for start) as the date, e.g., 1999-04-01 for the second quarter.
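
A sketch of part (b), with the caveat that the column positions and the header row should be checked against your copy of the file:

```python
import pandas as pd

# Steps 1-2: import, keeping only the columns we need, then drop the
# non-state rows. header=None assumes the data start immediately after
# the eight skipped rows; adjust if your file differs.
unemp = pd.read_excel('state_unemp.xlsx', usecols=[1, 2, 3, 10], skiprows=8,
                      header=None,
                      names=['state', 'year', 'month', 'unemp_rate'])
unemp = unemp[~unemp['state'].isin(['Los Angeles County', 'New York City'])]

# Steps 3-6: build a 'yyyy-mm' string, convert it, and drop the pieces.
# The int casts guard against year/month being read in as floats.
unemp['date'] = pd.to_datetime(unemp['year'].astype(int).astype(str) + '-' +
                               unemp['month'].astype(int).astype(str))
unemp = unemp.drop(columns=['year', 'month'])

# Step 7: monthly -> quarterly. Within each state, average the three
# months of each quarter; 'QS' stamps each quarter with its first day.
unemp = unemp.set_index('date')
unemp = unemp.groupby('state').resample('QS').mean()
```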

Part (c): Merging and plotting the data

Wow, that was a lot of work. Bending data to our will takes careful effort.

At this point, we have two data sets. In each dataset, the unit of observation is a state-quarter.

  1. Create a new data frame by merging the gdp and unemployment rate data on 'state' and 'date'. Set 'state' and 'date' as the index on the new data frame and sort the index. Use an inner merge.
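
One way to do this, assuming gdp is the long frame from part (a) and unemp is the grouped, resampled frame from part (b), whose 'state' and 'date' live in the index (hence the reset_index()):

```python
# Inner merge keeps only the state-quarters present in both sources.
data = pd.merge(gdp, unemp.reset_index(), on=['state', 'date'], how='inner')
data = data.set_index(['state', 'date']).sort_index()
```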

  2. Okun's Law is about changes in GDP and unemployment.

    1. Compute the percent change in real GDP for each state and quarter. You will need to group by state and then use .pct_change().
    2. Compute the difference in the unemployment rate for each state and quarter. Again, group by state; this time, use .diff(). (A sketch follows the note below.)

Notice that we are taking the growth rate of gdp, but the difference of unemployment. This is because it is awkward to talk about the percent change of variables that are already in percentage terms.
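
A sketch of both computations. The column names gdp_growth and unemp_change are my own choices, not required ones; grouping by state first keeps one state's first quarter from being differenced against another state's last quarter.

```python
# Growth rate of real GDP, scaled to percent (drop the *100 for fractions).
data['gdp_growth'] = data.groupby('state')['real_gdp'].pct_change() * 100

# Change in the unemployment rate, in percentage points.
data['unemp_change'] = data.groupby('state')['unemp_rate'].diff()
```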

  3. Change the index to 'state'.

  4. Now we can make the desired scatterplot. Plot the GDP growth rate against the change in the unemployment rate for Wisconsin and Michigan on the same figure.

    1. Put GDP growth on the horizontal axis and the change in the unemployment rate on the vertical axis.
    2. Make the markers for Wisconsin red x's and the markers for Michigan blue squares that aren't filled in (only the outline should be shown in blue; the interior should be blank).
    3. Make other aspects of the plot look nice, including nice labels and a legend.
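
Putting these last two steps together, here is a sketch using the hypothetical gdp_growth and unemp_change columns created above:

```python
import matplotlib.pyplot as plt

# Step 3: move 'state' to the index so we can pull out one state at a time.
data = data.reset_index().set_index('state')

fig, ax = plt.subplots(figsize=(8, 5))

wi = data.loc['Wisconsin']
mi = data.loc['Michigan']

# Red x's for Wisconsin; open blue squares for Michigan.
ax.scatter(wi['gdp_growth'], wi['unemp_change'], color='red', marker='x',
           label='Wisconsin')
ax.scatter(mi['gdp_growth'], mi['unemp_change'], marker='s',
           facecolors='none', edgecolors='blue', label='Michigan')

ax.set_xlabel('Real GDP growth (percent)')
ax.set_ylabel('Change in unemployment rate (percentage points)')
ax.legend(frameon=False)

plt.show()
```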

Q2: How well does Okun's Law work?

Okun's Law states that there is a strong linear relationship between changes in unemployment and real GDP. This seems empirically plausible at the national level. How well does it work at the state level?

Part (a): regplot

Let's create two seaborn regplot()s: one for Wisconsin and one for Michigan.

  1. Make two subplots on the same figure: one row, with two columns of axes.
  2. regplot() Michigan's data on the first axis. Make the color blue.
  3. regplot() Wisconsin's data on the second axis. Make the color red.

For both plots, keep the GDP growth rate on the x-axis and the change in unemployment on the y-axis (a sketch follows).
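
A sketch, again using the hypothetical column names from Q1:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

# Michigan in blue on the left, Wisconsin in red on the right.
sns.regplot(x='gdp_growth', y='unemp_change', data=data.loc['Michigan'],
            color='blue', ax=axes[0])
axes[0].set_title('Michigan')

sns.regplot(x='gdp_growth', y='unemp_change', data=data.loc['Wisconsin'],
            color='red', ax=axes[1])
axes[1].set_title('Wisconsin')

plt.show()
```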

Part (b): Analysis

What do you conclude about Okun's Law?

[If you are interested in more about Okun's Law, check this out.]

Optional

If you have some experience with econometrics, you could estimate the relationship in the figures you just plotted. Are the coefficients significant? Is the r-squared large?
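
If you want to try it, here is one way with statsmodels, again using the hypothetical column names from Q1; the summary table reports the coefficient estimates, their p-values, and the R-squared.

```python
import statsmodels.formula.api as smf

# Regress the change in unemployment on GDP growth, one state at a time.
# dropna() removes each state's first quarter, where the change is NaN.
for state in ['Wisconsin', 'Michigan']:
    res = smf.ols('unemp_change ~ gdp_growth',
                  data=data.loc[state].dropna()).fit()
    print(state)
    print(res.summary())
```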