download notebook
view notebook w/ solutions

Working with time series

files needed = ('vix.csv', 'osk.csv')

Time series data are very common in finance and economics. Date and time systems have their own set of rules and logic, which makes them potentially complicated but powerful. Pandas/python have a very powerful set of tools to handle dates and time. We will work through the basics here. As always, the Internet if full of the details.

In this workbook we will cover 1. The datetime data type, which is python's representation of a date and time 2. Setting up a DataFrame to index with datetime objects 3. Plotting time series data 4. Changing the frequency of our date, e.g., converting daily data to monthly 5. Slicing on dates

The packages

import pandas as pd               
import matplotlib.pyplot as plt   

# Data types and methods for dealing with dates and time. Name it dt for short.
import datetime as dt

The datetime type

Python uses the datetime type of object to hold date and time data. This allows for very sophisticated handling of time series data. Most of this is beyond the scope of our work, but if you ever find yourself having to deal with time zones or which week of the month a date lies in, datetime has you covered. [Python was developed to do a lot more than data analysis...]

We will mostly need to create datetime objects to hold dates. We pass datetime objects to methods that need to know about dates.

# year, month, date (datetime will add a time code of 00:00:00)
time_1 = dt.datetime(1776, 7, 4)  
print('Declare independence on:', time_1)

What kind of object do we have?

type(time_1)

datetime to str

We can convert a datetime object to different string outputs using the strftime() method. It takes formatting codes as arguments and you can format the string however you wish. I added a comma. A list of codes is in table 11-3 of McKinney. You can find them in the documentation here.

# I have no idea why month is 'B'.
print('Declare independence on:', time_1.strftime('%B %d, %Y'))

str to datetime

We will often read dates in as strings. Converting them to datetime objects will be one of our first orders of business. We can convert strings to datetime using the same codes and the strptime() method.

We pass it the template the date is written in, and it parses the string to create a datetime. It's flexibility let's us handle strange formats.

# Format codes make for very flexible date input.

string_date = '1815/1/8'                                 

time_2 = dt.datetime.strptime(string_date, '%Y/%m/%d')

print('Battle in New Orleans on:', time_2.strftime('%B %d, %Y'))

We might need some more datetime functionality, but that should be enough for now.

Datatimes in pandas

We just learned about the datetime package, which allows us to explicity create datetime objects. This datetime functionality is built into pandas, too, and works similarly.

The data

The file 'vix.csv' contains daily end-of-trading values of the 'VIX,' a measure of expected market volatility as implied by S&P 500 options. Business-news types like to refer to it as the 'fear index'. [That's a bit too dramatic for my tastes, but those guys have ads to sell.] The idea is that expected volatility rises when people are worried about the future.

The data are from FRED.

vix = pd.read_csv('vix.csv', na_values='.')   # There are some missing values marked as '.'
vix

The data look about like I would expect. What kind of variables do we have?

print(vix.dtypes)

The 'DATE' variable is stored as a string right now.

We use the .to_datetime() method of pandas to convert it to a datetime object.

vix['DATE'] = pd.to_datetime(vix['DATE'])
vix.dtypes

So the to_datetime() conversion creates datetime64[ns] objects.

As we have seen with float and int, pandas add some extra information to the type, but it is still a datetime.

Notice that we did not pass a template to .to_datetime() like we did with .strptime(). If we do not pass a template, pandas will try to guess the format. If the format is "standard" it typically works. If we have data with a strange date setup, we can always supply a template.

A datetime in the index

It makes sense to set the index as the unit of observation. In this data, the unit of observation is a time period—a day.

There is nothing special here. We set the index as we normally do.

vix_di = vix.set_index('DATE')
vix_di.index

Notice that the index object is now a DatetimeIndex object.

A datetime index at import

pd.read_csv() and pd.read_excel() can handle dates when we import the data.

vix = pd.read_csv('vix.csv', parse_dates=['DATE'], na_values='.')
vix.dtypes

vix = pd.read_csv('vix.csv', parse_dates=['DATE'], index_col='DATE', na_values='.')
vix.index

Now that pandas knows the index is a datetime, it can handle time calculations easily. Let's look at three applications:

Plotting time series data
Changing the frequency of data
Slicing using dates

Plotting

Now that we have our index set as datetime objects, we can plot the data in sensible ways. In particular, matplotlib can gracefully handle data at different frequencies.

fig, ax = plt.subplots(figsize=(14,7))

ax.plot(vix.index, vix['VIXCLS'], color = 'red')

ax.set_title('CBOE Volatility Index: VIX')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()

If we zoom in tight enough, the time axis will relabel in a natural way to reflect the changed time scale. I am going to "zoom in" by changing the limits of the x-axis so that it covers a short period of time.

When we refer to coordinates in a time series figure, we pass the x-coordinates as a datetime object. Below, we use datetimes to limit the x-axis and to make some annotations.

fig, ax = plt.subplots(figsize=(14,7))

ax.plot(vix.index, vix['VIXCLS'], color = 'red')

ax.set_title('CBOE Volatility Index: VIX')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Change the x-axis limits to focus on the recession period. 
# The since the x-axis are dates, the limits are datetime objects.
ax.set_xlim( dt.datetime(2008,1,1), dt.datetime(2010, 12, 31) ) 

plt.show()

fig, ax = plt.subplots(figsize=(14,7))

ax.plot(vix.index, vix['VIXCLS'], color = 'red')

ax.set_title('CBOE Volatility Index: VIX')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Change the x-axis limits to focus on the recession period. 
# The since the x-axis are dates, the limits are datetime objects.
ax.set_xlim( dt.datetime(2008,1,1), dt.datetime(2010, 12, 31) ) 

# Add some annotations. The x-coordinate needs to be a datetime object.
ax.annotate('Lehman Collapse', 
            xy=(dt.datetime(2008,9,15), 70), 
            xytext=(dt.datetime(2008,2,1), 70), 
            arrowprops={'facecolor':'black',  'arrowstyle':'->'}
           ) 

plt.show()

Practice: Dates

Take a few minutes and try the following. Feel free to chat with those around if you get stuck. The TA and I are here, too.

Read the file 'osk.csv' into a DataFrame and name it "prices". The file contains daily closing prices for OshKosh Corp. and the S&P500.
Set the index to the date. Make sure the dates are datetime objects!
Normalize each series (osk, sp500) by dividing every observation by the value at 01-02-2018. Try

prices['osk_n'] = prices['osk'] / prices.loc['2018-01-02','osk']

and something similar for the S&P500.

Plot the two normalized series. What does normalizing do for us? [You can try plotting the non-normalized data to see the difference.]

Changing the frequency

We can resample the data to change its frequency.

We are downsampling data if we are reducing the number of observations (daily\(\rightarrow\)monthly) and we are upsampling the data if we are increasing the number of observations (monthly\(\rightarrow\)daily). In my work, I have downsampled a lot. I do not think I have ever upsampled.

When we downsample, we need to tell pandas how we want the data aggregated: for example, we might want it summed, or averaged.
When we resample, we create a new DataFrame to hold the new data. The index in the new data will be of a different frequency than the old data, so it is easiest to have different DataFrames.

Let's resample to the weekly frequency and take the average over the observations within a week.

vix = pd.read_csv('vix.csv', parse_dates=['DATE'], index_col='DATE', na_values='.')

# Create a new DataFrame with a weekly index. 
vix_w = vix.resample('w').mean()

print(vix_w.index)
print('\n', vix_w.head())

We had 8,559 daily observations. Now we have 1,712 weekly observations. Note the weeks are Sunday-based (W-SUN). Let's downsample some more.

vix_m = vix_di.resample('m').mean()   # monthly
vix_y = vix_di.resample('y').mean()   # yearly

vix_y.head()

fig, ax = plt.subplots(figsize=(14,7))

ax.plot(vix_m.index, vix_m['VIXCLS'], color = 'red', label = 'monthly average')
ax.plot(vix_y.index, vix_y['VIXCLS'], color = 'blue', label = 'yearly average')

ax.set_title('CBOE Volatility Index: VIX')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.legend(frameon=False)

plt.show()

Notice that plotting at different frequencies was handled without a problem. Matplotlib/Pandas is handling all the datetime manipulation for us.

Slicing with dates

We can slice on a datetime index. This is surprisingly easy.

Let's start by taking a slice of a single data: 6/18/2008.

Here are two ways.

# Option 1: Create a datetime and use .loc
date_to_extract = dt.datetime(2008, 6, 18)  

vix.loc[date_to_extract]

# Option 2: Query directly with a string
vix_di.loc['2008-06-18']

Notice that in the second method, I need to use the standard format of yyyy-mm-dd.

We can slice, too. Using the weekly data, let's 1. We can grab every observation with a year of 2008 with '2008'. 2. We can grab every observation between Jan 1, 2008 and Jan 31, 2008 by slicing with '2008/01/01':'2008/01/31'. Python is smart enough to understand references to dates that are not in the DataFrame. There is no January 1 or January 31 observation.

print('Every week in 2008:', vix_w.loc['2008'])  # all the weeks in 2008

# All of the weeks in January 2008.
print('\nEvery week in Jan 2008:', vix_w.loc['2008/01/01':'2008/01/31'])

Practice: Dates (continued)

Continuing work with the prices DataFrame you created earlier. It containes the normalized prices of OshKosh and the SP500.

If you didn't finish the last practice, run this code to create the daily data:

prices = pd.read_csv('osk.csv', parse_dates=['Date'], index_col='Date')
prices['osk_n'] = prices['osk'] / prices.loc['2018-01-02,','osk']
prices['sp500_n'] = prices['sp500'] / prices.loc['2018-01-02,','sp500']

Resample the (non-normalized) data for OshKosh and SP500 to a weekly frequency, creating average weekly prices. Name the new DataFrame with the weekly data prices_w.
Normalize the two weekly series, like you did in step 3., but this time dividing by the first week of the year.
Add the weekly data to your figure from part 4. Use the alpha option to lighten up the daily data lines.
How does your plot look? Are the titles correct? Is there a legend or some other way of identifying the lines?
How has fading the daily changed the message of the figure? Why leave the daily data at all?

Fading the daily data puts the reader's focus on the weekly data. The message is about the trends: "OshKosh's price fell while the SP500 grew."

I have left th daily data in the figure so that the reader can get a sense of the volatility in the underlying data&nmdash;which is lost when take weekly averages.