Workflow: An example

files needed = None

In this notebook I am going to take you through a real-world example. I was asked to prepare some background data for a UW Now broadcast on the restaurant industry during covid.

The goal is to illustrate, from beginning to end, taking an idea to a presentation. For brevity, I will focus here on some of the time-series figures

1. Pose a preliminary question(s)

How did restaurants (ahem) fare during the pandemic?

Some of my initial thoughts: I want to look at measures of sales and employment; maybe business start up and closure? Can I find data on Wisconsin? I should compare the restaurant industry to national averages or maybe other sectors.

2. Find the appropriate data

Sales and employment data are from the BLS, accessed through the FRED database. Some is at the national level, some at the Wisconsin level.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import pandas_datareader as pdr

bls = pdr.get_data_fred(['CES7072200001', 'USPRIV', 'RSFSDP', 'RSGCS', 'WINA'], start='2016-01-01', end='2021-01-01')

# Industry 70722 is "Food Services and Drinking Places" 
bls.columns = ['emp_70722', 'emp_all', 'sales_70722', 'sales_grocery', 'emp_all_wi']

3. Get the data into a usable form

Not much to do here yet. The FRED data are pretty clean. Let's check on a few things.

print(bls.head(2), '\n')
print(bls.columns)
print(bls.dtypes)
print(bls.index)

            emp_70722  emp_all  sales_70722  sales_grocery  emp_all_wi
DATE                                                                  
2016-01-01    11318.6   121091        53045          51655      2917.1
2016-02-01    11337.1   121279        54237          51530      2922.0

Index(['emp_70722', 'emp_all', 'sales_70722', 'sales_grocery', 'emp_all_wi'], dtype='object')
emp_70722        float64
emp_all            int64
sales_70722        int64
sales_grocery      int64
emp_all_wi       float64
dtype: object
DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01',
               '2016-05-01', '2016-06-01', '2016-07-01', '2016-08-01',
               '2016-09-01', '2016-10-01', '2016-11-01', '2016-12-01',
               '2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01',
               '2017-05-01', '2017-06-01', '2017-07-01', '2017-08-01',
               '2017-09-01', '2017-10-01', '2017-11-01', '2017-12-01',
               '2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01',
               '2018-05-01', '2018-06-01', '2018-07-01', '2018-08-01',
               '2018-09-01', '2018-10-01', '2018-11-01', '2018-12-01',
               '2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01',
               '2019-05-01', '2019-06-01', '2019-07-01', '2019-08-01',
               '2019-09-01', '2019-10-01', '2019-11-01', '2019-12-01',
               '2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01',
               '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01',
               '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01',
               '2021-01-01'],
              dtype='datetime64[ns]', name='DATE', freq=None)

The data types look good; I expected them all to be numeric. The data are monthly, and start in 2016. The last observation is January 2021.

4. Preliminary analysis

Let's make some figures. These will be ugly, because they are just for me to get look through. These would not be included in my final output or in my code that I post that backs-up my final output.

I made a lot more figures than these, but I hope this gives you some idea of the "casting about" that typically happens at the beginning of a project.

People are generally not interested in how you found the answer. They are generally interested in understanding the answer.

fig, ax = plt.subplots(3,2, figsize=(16,9))

for a, c in zip(fig.axes, bls.columns):
    a.plot(bls.index, bls[c], color='black')
    a.set_title(c)

png

Okay, so employment has the same general shape in the national data for all firms, the Wisconsin data for all firms, and the national data for restaurants. I need to find a way to compare them.

Grocery sales growth actually increases and spikes at the beginning of the pandemic. This is everyone running out to stock-up on supplies. Sales in restaurants looks a lot like employment. No surprise there.

How did WI compare to the nation?

fig, ax = plt.subplots(figsize=(10,6))
ax.plot(bls.index, bls['emp_all_wi'], color='blue')
ax.plot(bls.index, bls['emp_all'], color='black')

plt.show()

png

Hmmmmm, that is not very useful. Let's go back to step 3 and transform these variables.

The data are monthly, so, below, I'm computing the year-over-year growth rates. This takes care of any month-specific issues. For example, people may dine out more during December for the holidays.

bls_yy= bls.pct_change(12)*100

fig, ax = plt.subplots(figsize=(10,6))
ax.plot(bls_yy.index, bls_yy['emp_all_wi'], color='blue')
ax.plot(bls_yy.index, bls_yy['emp_all'], color='black')

plt.show()

png

The growth rate figure is much easier to read. I see that the covid period was very similar in WI and the nation, so I will focus on the national data.

Now let's compare restaurants to all industries.

fig, ax = plt.subplots(figsize=(10,6))
ax.plot(bls_yy.index, bls_yy['emp_70722'], color='blue')
ax.plot(bls_yy.index, bls_yy['emp_all'], color='black')

plt.show()

png

Okay, restaurants did much worse than the private sector in total. This is a good (partial) answer to my question.

Now let's look at sales.

fig, ax = plt.subplots(figsize=(10,6))
ax.plot(bls_yy.index, bls_yy['sales_70722'], color='blue')
ax.plot(bls_yy.index, bls_yy['sales_grocery'], color='black')

plt.show()

png

Yeah, this figure tells the big story. Restaurant meals and meals at home are pretty close substitutes. People bought more groceries and spent less time eating in restaurants.

Even though this is the fifth figure I made, this is the first figure I want to show. This is the one that drives the story.

I want to dig a bit deeper on the employment data. Let's grab the JOLTs data. Also from FRED.

jolts = pdr.get_data_fred(['JTS7200JOR', 'JTS7200HIR', 'JTU7200LDR', 'JTS7200QUR', 'JTS7200TSR', 'JTSHIR', 'JTSTSR'], 
                         start='2016-01-01', end='2021-01-01')
jolts.columns = ['opens', 'hires', 'layoffs', 'quits', 'sep', 'hires_all', 'sep_all']

fig, ax = plt.subplots(figsize=(10,6))
ax.plot(jolts.index, jolts['sep_all'], color='blue')
ax.plot(jolts.index, jolts['hires_all'], color='black')
ax.set_title('all industries')
plt.show()

png

fig, ax = plt.subplots(figsize=(10,6))
ax.plot(jolts.index, jolts['sep'], color='blue')
ax.plot(jolts.index, jolts['hires'], color='black')
ax.set_title('restaurants')
plt.show()

png

I see a big spike in the separation rate (the number of people who lost a job / number employed) followed by a bit of a hiring spike as some of them are rehired. The spike is much larger in the restaurant data.

5. Revisit and revise your question(s)

Time to collect my findings:

Covid leads to a shift away from spending on restaurants towards groceries.
This leads to a large decrease in employment in the restaurant industry (compared to the whole economy)
Almost 35 percent of restaurant workers separated from jobs in March (April?) 2020
Separations are increasing again at the end of 2020

6. Final analysis

In this case, I'm pretty much finished.

I would like to add something about business closures and openings, but I did not. This notebook is already getting pretty long. If I push this project further, that is the first thing I will do.

7. Report your findings

Now it is time to choose which figures I want to use and how I want to make them look.

Message: The restaurant industry has suffered much more than the total economy.
Audience: My audience is UW alums from all backgrounds. This will not be a technical presentation.
Medium: I'm presenting to people watching on a computer or maybe a mobile device. I can use color, but I should keep the visualizations sparse. Some of the screens may be small.

Some notes on the figures

The group sent me a slide template. It is a 16x9 perspective, but I'm going with 16x10 since I will include the titles in my figures.
The template uses a color red that I can match with the 'firebrick' color.
I'm making the labels with large fonts so they show up on small screens.
I'm saving the figures as svg files so that they will look good when someone zooms in.

fig, ax = plt.subplots(figsize=(16,10))
ax.plot(bls_yy.index, bls_yy['sales_70722'], color='black', lw=2.5)
ax.plot(bls_yy.index, bls_yy['sales_grocery'], color='firebrick',lw=2.5)
ax.axhline(0, color='black')
ax.tick_params(axis='both', labelsize=18 )
ax.set_xlim(dt.datetime(2020,1,1), dt.datetime(2021,1,1))


ax.text(dt.datetime(2020,4,1), 18, 'Grocery', fontsize=18, color='firebrick')
ax.text(dt.datetime(2020,6,1), -32, 'Food services \nand drinking places', fontsize=18)

ax.set_title('Percent change in sales (national, y-o-y)', fontsize=25)

sns.despine()

plt.savefig('nat_sales.svg', bbox_inches='tight')

png

fig, ax = plt.subplots(figsize=(16,10))
ax.plot(bls_yy.index, bls_yy['emp_70722'], color='black', lw=2.5)
ax.plot(bls_yy.index, bls_yy['emp_all'], color='firebrick',lw=2.5)
ax.axhline(0, color='black')
ax.tick_params(axis='both', labelsize=18 )
ax.set_xlim(dt.datetime(2020,1,1), dt.datetime(2021,1,1))


ax.text(dt.datetime(2020,5,5), -15, 'All private', fontsize=18, color='firebrick')
ax.text(dt.datetime(2020,6,1), -30, 'Food services \nand drinking places', fontsize=18)

ax.set_title('Percent change in employment (national, y-o-y)', fontsize=25)

sns.despine()

plt.savefig('nat_emp.svg', bbox_inches='tight')

png

More notes

Putting all four lines on one axes would have looked very cluttered, so I'm putting them on side-by-side subplots.
The shorter x axes mean I need to finesse the tick labels
I'm not putting tick labels on the right-side y axis to remove some clutter

from matplotlib.ticker import FormatStrFormatter
import matplotlib.dates as mdates

years_fmt = mdates.DateFormatter('%Y-%m')

fig, ax = plt.subplots(1,2, figsize=(18,10))

ax[0].plot(jolts.index, jolts['sep_all'], color='firebrick', lw=2.5)
ax[0].plot(jolts.index, jolts['hires_all'], color='black', lw=2.5)

ax[1].plot(jolts.index, jolts['sep'], color='firebrick', lw=2.5)
ax[1].plot(jolts.index, jolts['hires'], color='black', lw=2.5)

ax[0].set_title('All private', fontsize=25, pad=20)
ax[1].set_title('Accommodation and food service', fontsize=25, pad=20)

for a in ax:
    a.set_xlim(dt.datetime(2020,1,1), dt.datetime(2021,1,1))
    a.set_ylim(0,37)
    sns.despine(ax=a)
    a.tick_params(axis='both', labelsize=18 )
    a.xaxis.set_major_formatter(years_fmt)
    a.set_xticks([dt.datetime(2020,3,1), dt.datetime(2020,6,1), dt.datetime(2020,9,1), dt.datetime(2020,12,1)])

ax[1].tick_params(axis='y', left=False, labelleft=False)
ax[0].set_ylabel('percent', fontsize=18)

ax[0].text(dt.datetime(2020,3,1), 11.5, 'Separations/Employment', fontsize=18, color='firebrick')
ax[0].text(dt.datetime(2020,5,1), 6.75, 'Hires/Employment', fontsize=18, color='black')
ax[1].text(dt.datetime(2020,4,1), 30, 'Separations/Employment', fontsize=18, color='firebrick')
ax[1].text(dt.datetime(2020,7,1), 13, 'Hires/Employment', fontsize=18, color='black')

plt.savefig('jolts_compare.svg', bbox_inches='tight')
plt.show()

png

I am now ready to insert these svg files into the slide template that I was given. I have posted the finished version of my slides on my github. They use more figures than just the ones created in this notebook.

The github repo also has my code that created the figures (bls_figures.ipynb). Notice that they do not include all of the intermediate figures that are in this notebook. I only submit the code that gets used. This is the same way you should "prune" your final project code before submitting it.