Pandas and packages
files needed = none
We have a handle on python now: we understand the data structures and enough about working with them to move on to stuff more directly relevant to data analysis. That means digging into pandas, the package that provides the core data structures and functions for data work.
Pandas is a package. We have seen the numpy and scipy packages before, but we haven't discussed packages in any detail yet. Let's brush up on packages before we jump into pandas.
Packages
We can think of 'python' as a set of core libraries (the python standard libraries) that contain the basic python data structures (int, str, list, dict, etc.) and functions (print(), len(), sum(), etc). We can add additional data structures and functions to core python. Developers group related functions and data structures into packages, which we add to core python to increase its functionality.
On winstat, we are working with the Anaconda distribution of python. The Anaconda distribution bundles together basic python with common and useful packages such as
* numpy: functions for numerical computing (ln(), exp(), sqrt(),...)
* scipy: functions for math and engineering (statistical functions, signal processing, linear algebra...)
* matplotlib: a plotting interface (bar, scatter, time series plots,...)
* pandas: functions and data structures for data analysis (data management, merging, grouping...)
At the end of the semester, when you move away from winstat, you can install Anaconda (it's free) on your own computer and continue your python adventures.
Anaconda already installed the packages for us, but we still need to tell python that we want to use the packages. This way, we only load into memory the packages we need.
To add a package we use the import statement. For example, to load the numpy package, we use:
import numpy as np
The statement says to add the numpy package (import it) and to give it a shortened name of np. We don't have to use the name np—we could name it George—but np is short and informative. Many packages have "agreed upon" short names: numpy is np, pandas is pd, and matplotlib.pyplot is plt.
Why do we want a shorter name than numpy? When we use a function from the numpy package, we use the dot notation. Here, we use the log function:
y = 10
x = np.log(y)
The np. tells python to look for the log() function in the package called np. Python knows that np means numpy, and everything works. [See why calling it George would have looked strange?]
We only need to import a package once. We typically group all of our imports at the beginning of the code. Here we go:
import pandas as pd #load the pandas package and call it pd
import numpy as np #load the numpy package and call it np
When you run the cell, you might notice that the text to the left of the cell looked like In[*] for a while. That * means that python was working. It takes a few seconds to load all those functions. Now run whos to see what's in the namespace:
whos
Variable Type Data/Info
------------------------------
np module <module 'numpy' from 'C:\<...>ges\\numpy\\__init__.py'>
pd module <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>
We see two module variables ready for action. As usual, we can use ? to check them out.
A full list of the packages included in the Anaconda distribution is here. If we need a package that is not already a part of Anaconda, we will need to install it. More on that later, but the big idea is that python can be extended by installing and adding packages.
Pandas
Pandas (panel data) is our workhorse package for data analysis. In addition to all the things we can do with core python, pandas gives us many powerful functions and data types that we can use for working with data. If you are familiar with STATA, you will find many parallels between pandas and STATA. This makes sense: both are meant to handle panel data. Pandas was invented by Wes McKinney (sound familiar?) at the quantitative hedge fund AQR. One of the reasons we have chosen python for this course is its prevalence in the financial sector.
To make the pandas package available, we import it into our code. We have already imported it above, but we can do it again. It won't hurt.
# The most common short name for pandas is pd. We will go with that.
import pandas as pd
Pandas data structures: DataFrames
Pandas is built around two main data structures: the series and the dataframe. A dataframe is a 'rectangular' data structure, like a Microsoft Excel spreadsheet. It is made up of rows and columns. [But it is much, much, more powerful than a spreadsheet.]
Let's create a dataframe. To keep things simple (they will get complicated later!) we will create a dataframe from a dict.
data_dict = {'year': [1990, 1995, 2000, 2005 ], 'gdp':[5.9, 7.6, 10.2, 13.0], 'cons':[3.8, 4.9, 6.8, 8.7]}
print('First, print the dict:')
print(data_dict)
# We create a data frame called 'df' from the dict.
# We use the 'pd.' syntax to call the DataFrame constructor.
df = pd.DataFrame(data_dict)
print('\nNext, print the DataFrame and its type\n')
print(df)
print('\n', type(df))
First, print the dict:
{'year': [1990, 1995, 2000, 2005], 'gdp': [5.9, 7.6, 10.2, 13.0], 'cons': [3.8, 4.9, 6.8, 8.7]}
Next, print the DataFrame and its type
year gdp cons
0 1990 5.9 3.8
1 1995 7.6 4.9
2 2000 10.2 6.8
3 2005 13.0 8.7
<class 'pandas.core.frame.DataFrame'>
Nice. Compared to the dict, the dataframe is much easier to read when printed. Let's break it down:
df = pd.DataFrame(data_dict)
We are creating a DataFrame object. We can see that when we print out df's type. The stuff to the left of DataFrame in the print out is the hierarchy: DataFrame is part of the frame group which is part of the core group of the package pandas. The hierarchy is not important for our work.
We will use this function sometimes, but, more often, we will be building our DataFrames from spreadsheets, csv files, or api (application programming interface) calls directly to servers to retrieve data. Very powerful stuff.
Columns and rows in a DataFrame
A DataFrame is made up of columns and rows. How big is our DataFrame? We use the shape attribute of a DataFrame to see this. [Reminder: attributes and methods are associated with objects. Attributes return features of the object and methods are like functions that act on the object. Hence, attributes don't need parentheses to be called.]
# .shape returns the (rows, columns) of the DataFrame
print('The shape is:', df.shape)
# I often mess up and try this when what I really want is the shape of the DataFrame. What is it telling us?
print('The size is:', df.size)
The shape is: (4, 3)
The size is: 12
What type does df.shape return?
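We can check directly. A quick sketch, rebuilding the df from above: shape returns a tuple, so we can unpack it into separate row and column counts.

```python
import pandas as pd

df = pd.DataFrame({'year': [1990, 1995, 2000, 2005],
                   'gdp': [5.9, 7.6, 10.2, 13.0],
                   'cons': [3.8, 4.9, 6.8, 8.7]})

# .shape returns a tuple, which we can unpack like any other tuple.
print(type(df.shape))      # <class 'tuple'>
rows, cols = df.shape
print(rows, cols)          # 4 3
```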
Back to our DataFrame. Let's print it out again, but this time, without the print function.
df
| | year | gdp | cons |
|---|---|---|---|
| 0 | 1990 | 5.9 | 3.8 |
| 1 | 1995 | 7.6 | 4.9 |
| 2 | 2000 | 10.2 | 6.8 |
| 3 | 2005 | 13.0 | 8.7 |
The jupyter notebook adds some shading and formatting, making it even easier to read. I still often use the print() function. Old habits die hard.
The DataFrame Index
The left-most 'column' of numbers is not a column. It is the index that pandas assigned to our DataFrame: similar to the row 'numbers' of a spreadsheet. I put numbers in quotes, because an index can be more than just a sequence of integers. In this DataFrame, it makes sense for our index to be the year variable. A little later, we will discuss changing the index. For now, we will go with what pandas gave us.
The index is important; we will work on it more in a bit.
Pandas data structures: Series
We reference a column in the DataFrame by its name. Notice the similarity to the syntax for referencing a dict key.
df['gdp']
print(df['gdp'])
0 5.9
1 7.6
2 10.2
3 13.0
Name: gdp, dtype: float64
When we print the column, we get the index, the column, and the type of data contained in the column. The gdp data in the column are floats. Suppose we instead ask for the type of the column itself — not what's in the column. This reveals the second major data structure in pandas:
gdp = df['gdp']
print(gdp)
print(type(gdp))
0 5.9
1 7.6
2 10.2
3 13.0
Name: gdp, dtype: float64
<class 'pandas.core.series.Series'>
When we extract a single column from a DataFrame, we are given a Series.
print(type(df['gdp']))
<class 'pandas.core.series.Series'>
We can think of a Series as a DataFrame with only one column: It is one column and an index. In the same way, we can think of a DataFrame as a collection of Series that all have the same index.
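We can see this relationship directly. A small sketch, reusing the df from above: the to_frame() method wraps a Series back into a one-column DataFrame that shares the same index.

```python
import pandas as pd

df = pd.DataFrame({'year': [1990, 1995, 2000, 2005],
                   'gdp': [5.9, 7.6, 10.2, 13.0],
                   'cons': [3.8, 4.9, 6.8, 8.7]})

gdp = df['gdp']              # extracting one column gives a Series
gdp_frame = gdp.to_frame()   # .to_frame() wraps it back into a DataFrame

print(type(gdp))             # Series
print(type(gdp_frame))       # DataFrame
print(gdp_frame.shape)       # (4, 1): one column plus the shared index
```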
Learning more about a DataFrame
We have seen the size and shape attributes of DataFrame. Let's look at a few more. If we want to see what variables we have in the DataFrame (imagine you are working with a dataset containing 1,000 variables...) we use the columns attribute.
print(df.columns)
print(type(df.columns))
Index(['year', 'gdp', 'cons'], dtype='object')
<class 'pandas.core.indexes.base.Index'>
So the columns attribute returns the list of columns in a somewhat complicated way. (It is returning an Index object that is useful in some situations.) We can use the tolist() method to create a list object. We can do something similar to the index as well. The axes attribute tells us about both at once.
print(df.columns.tolist())
['year', 'gdp', 'cons']
print(df.index.tolist())
[0, 1, 2, 3]
print(df.axes)
[RangeIndex(start=0, stop=4, step=1), Index(['year', 'gdp', 'cons'], dtype='object')]
We can learn about the data types in DataFrame with dtypes [note the 's'].
This is a very important attribute of DataFrame. This is the first thing I check when working with a new dataset.
print(df.dtypes)
year int64
gdp float64
cons float64
dtype: object
Practice: DataFrames
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.
Below is a dict with data about U.S. states.
state_data = {'state':['CA','MI','WI','MN'], 'pop':[37,9.8,5.7, '5.3'],
'size':[163.7, 96.7, 65.5, 86.9], 'bird':['Quail', 'Redbreast Robin', 'American Robin', 'Loon']}
state_data
{'state': ['CA', 'MI', 'WI', 'MN'],
'pop': [37, 9.8, 5.7, '5.3'],
'size': [163.7, 96.7, 65.5, 86.9],
'bird': ['Quail', 'Redbreast Robin', 'American Robin', 'Loon']}
- Convert the dict to a DataFrame named `states`.
- What are the types of the variables? Does anything look funny?
states = pd.DataFrame(state_data)
print(states.dtypes)
# Why is population an object and not a float?
state object
pop object
size float64
bird object
dtype: object
states
| | state | pop | size | bird |
|---|---|---|---|---|
| 0 | CA | 37 | 163.7 | Quail |
| 1 | MI | 9.8 | 96.7 | Redbreast Robin |
| 2 | WI | 5.7 | 65.5 | American Robin |
| 3 | MN | 5.3 | 86.9 | Loon |
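Why is pop an object? The dict mixed numbers with the string '5.3', so pandas fell back to the generic object type for the whole column. One way to clean this up (a sketch, not the only approach) is pd.to_numeric, which parses the strings into floats:

```python
import pandas as pd

states = pd.DataFrame({'state': ['CA', 'MI', 'WI', 'MN'],
                       'pop': [37, 9.8, 5.7, '5.3'],   # note the string '5.3'
                       'size': [163.7, 96.7, 65.5, 86.9],
                       'bird': ['Quail', 'Redbreast Robin', 'American Robin', 'Loon']})

print(states['pop'].dtype)                  # object, because of the stray string

# Convert the column to a numeric type; strings like '5.3' are parsed as floats.
states['pop'] = pd.to_numeric(states['pop'])
print(states['pop'].dtype)                  # float64
```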
- How many rows and columns are in this DataFrame?
- Print out the index as a list.
- Print out the columns as a list.
print("state's rows and columns =", states.shape)
print("the index is ", states.index.tolist())
print("the columns are ", states.columns.tolist())
state's rows and columns = (4, 4)
the index is [0, 1, 2, 3]
the columns are ['state', 'pop', 'size', 'bird']
More on dataframe columns
To extract a column from a DataFrame, we have been using states['bird']. This is the preferred way to do so, in that it always works. You will come across two other ways to extract a column.
1. The first uses a . like: states.bird.
2. The second uses the iloc method states.iloc[:,3].
print(states['bird'])
0 Quail
1 Redbreast Robin
2 American Robin
3 Loon
Name: bird, dtype: object
print(states.bird)
0 Quail
1 Redbreast Robin
2 American Robin
3 Loon
Name: bird, dtype: object
# iloc works like a 2-D slice with iloc[rows, cols].
print(states.iloc[:,3])
0 Quail
1 Redbreast Robin
2 American Robin
3 Loon
Name: bird, dtype: object
All three lines of code returned the same column. Why do I not recommend the second two approaches?
Let's try to grab the 'size' column using the dot method.
# Ask for the column 'size'.
print(states.size)
#print(states['size'])
16
Well, that's not what I expected. On further review, however, it is exactly what I should have expected. size is an attribute of DataFrame. We saw this earlier: It returns the number of elements in the DataFrame. Any column name that conflicts with a DataFrame method or attribute is going to cause problems.
What about iloc? iloc references a column by its position. This is not very robust. What if I change my code and add an extra column to the beginning of the DataFrame? What if I change my code to sort the columns alphabetically? The 'bird' column would no longer be the fourth column and states.iloc[:,3] would no longer return the 'bird' column. We might still use iloc here and there, but we do so at our own peril.
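To see the fragility concretely, here is a small sketch: after reordering the columns, the same positional reference points at a different column, while the name-based reference still works.

```python
import pandas as pd

states = pd.DataFrame({'state': ['CA', 'MI', 'WI', 'MN'],
                       'pop': [37, 9.8, 5.7, '5.3'],
                       'size': [163.7, 96.7, 65.5, 86.9],
                       'bird': ['Quail', 'Redbreast Robin', 'American Robin', 'Loon']})

print(states.iloc[0, 3])     # 'Quail': position 3 is 'bird'... for now

# Reorder the columns alphabetically: bird, pop, size, state.
states = states[sorted(states.columns)]

print(states.iloc[0, 3])     # now 'CA': position 3 is 'state'
print(states['bird'][0])     # 'Quail': the name-based reference survives
```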
Subsets of columns
Okay, back to referencing columns. We can take several columns by passing a list of the column names.
cols_to_get = ['state', 'bird', 'pop']
got_cols = states[cols_to_get]
print(got_cols)
print(type(got_cols))
state bird pop
0 CA Quail 37
1 MI Redbreast Robin 9.8
2 WI American Robin 5.7
3 MN Loon 5.3
<class 'pandas.core.frame.DataFrame'>
When we take more than one column, we are creating a DataFrame object. A more compact notation creates the list in place.
got_cols_2 = states[['state', 'bird', 'pop']]
got_cols_2
| | state | bird | pop |
|---|---|---|---|
| 0 | CA | Quail | 37 |
| 1 | MI | Redbreast Robin | 9.8 |
| 2 | WI | American Robin | 5.7 |
| 3 | MN | Loon | 5.3 |
Renaming columns
Often data sets come with poor variable names. How do we rename a column? If we are only changing a few variables, the dictionary approach works well.
# Is it population or state soda? Let's get a better name on that variable.
states = states.rename(columns={'pop':'population'})
states
| | state | population | size | bird |
|---|---|---|---|---|
| 0 | CA | 37 | 163.7 | Quail |
| 1 | MI | 9.8 | 96.7 | Redbreast Robin |
| 2 | WI | 5.7 | 65.5 | American Robin |
| 3 | MN | 5.3 | 86.9 | Loon |
The only column that was changed was the pop column. Let's take a closer look at the syntax.
states = states.rename(columns={'pop':'population'})
- We are calling the `rename()` method from the DataFrame object `states`
- We pass to `rename()` the argument `columns` (passing `index` would rename the index)
- We use a dict to give the `{old name : new name}` key-value pairs
Notice that we had to assign the result of state.rename() back to states. The rename() method does not act on the original data, but creates a copy, which we assign back to the states variable.
We can ask rename() to perform the action on the original data with the inplace argument. [You can see if a method supports in-place operations by checking the documentation, or using ? in jupyter notebook.] Remember, inplace = True is used when we don't want to make a new copy of our data through assignment to a new variable. Think about this as eliminating lines like states = states.rename(...).
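One gotcha worth noting (a quick sketch): with inplace=True the method modifies the DataFrame itself and returns None, so do not assign the result back to the variable.

```python
import pandas as pd

states = pd.DataFrame({'state': ['CA', 'MI'], 'population': [37, 9.8]})

# With inplace=True the rename happens on states itself and the call returns None.
result = states.rename(columns={'population': 'people'}, inplace=True)

print(result)                   # None
print(states.columns.tolist())  # ['state', 'people']
```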
#Let's rename population again...
states.rename(columns={'population':'people'}, inplace=True)
states
| | state | people | size | bird |
|---|---|---|---|---|
| 0 | CA | 37 | 163.7 | Quail |
| 1 | MI | 9.8 | 96.7 | Redbreast Robin |
| 2 | WI | 5.7 | 65.5 | American Robin |
| 3 | MN | 5.3 | 86.9 | Loon |
More on DataFrame Rows
Now we know how to manipulate columns. How do we take a row, or a subset of rows? We can resort to iloc again...
# Take the third row, all columns
print(states.iloc[2,:])
state WI
people 5.7
size 65.5
bird American Robin
Name: 2, dtype: object
...but this is still subject to problems with reordering rows. It is also less likely that we want to take a row in a specific position. It is more likely we want a row corresponding to a set of characteristics.
states
| | state | people | size | bird |
|---|---|---|---|---|
| 0 | CA | 37 | 163.7 | Quail |
| 1 | MI | 9.8 | 96.7 | Redbreast Robin |
| 2 | WI | 5.7 | 65.5 | American Robin |
| 3 | MN | 5.3 | 86.9 | Loon |
# Take the row corresponding to Wisconsin.
states[states['state']=='WI']
| | state | people | size | bird |
|---|---|---|---|---|
| 2 | WI | 5.7 | 65.5 | American Robin |
That's a one-row DataFrame.
Let's break that down. First, we ask which rows have 'state' equal to 'WI'.
temp = states['state']=='WI'
temp
0 False
1 False
2 True
3 False
Name: state, dtype: bool
states[temp]
| | state | people | size | bird |
|---|---|---|---|---|
| 2 | WI | 5.7 | 65.5 | American Robin |
That is returning a Series of bools. When we give the Series to the DataFrame, it only grabs the rows that are true.
We will use this technique—choosing a subset of data based on a conditional statement—often. Do not worry, there will be plenty of practice!
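As a first taste of that practice, conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. A sketch using the df DataFrame from earlier:

```python
import pandas as pd

df = pd.DataFrame({'year': [1990, 1995, 2000, 2005],
                   'gdp': [5.9, 7.6, 10.2, 13.0],
                   'cons': [3.8, 4.9, 6.8, 8.7]})

# Rows where gdp is above 7 AND cons is below 8. Each condition gets parentheses.
subset = df[(df['gdp'] > 7) & (df['cons'] < 8)]
print(subset)     # the 1995 and 2000 rows
```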
Changing the row index
We can simplify our lives and get rid of an index that doesn't tell us anything useful about the data by changing the index. The unit of observation in states is a state. That seems like a good index.
# Make the column named 'state' the new index.
# Save the DataFrame with the new index to a different variable.
states_new_index = states.set_index('state')
print(states_new_index)
print('\n')
print(states)
people size bird
state
CA 37 163.7 Quail
MI 9.8 96.7 Redbreast Robin
WI 5.7 65.5 American Robin
MN 5.3 86.9 Loon
state people size bird
0 CA 37 163.7 Quail
1 MI 9.8 96.7 Redbreast Robin
2 WI 5.7 65.5 American Robin
3 MN 5.3 86.9 Loon
The index is now the state abbreviations and not the sequence of integers. What does this buy us? A simpler way to reference a row using the loc method. This is loc, not iloc! iloc indicates reference by integer position, hence the name.
[Again, notice that set_index did not operate in place. It created a copy that I assigned to the new variable. set_index also takes the inplace argument.]
Let's use loc to extract the data for Wisconsin.
print(states_new_index.loc['WI']) # ask for the row corresponding to WI
people 5.7
size 65.5
bird American Robin
Name: WI, dtype: object
Very nice. Even if the order of the rows changed, that line of code still works.
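loc also accepts a column label after the row label, so we can pull out a single cell, or several rows, by name. A sketch with the state-indexed DataFrame:

```python
import pandas as pd

states_new_index = pd.DataFrame({'state': ['CA', 'MI', 'WI', 'MN'],
                                 'people': [37, 9.8, 5.7, 5.3],
                                 'size': [163.7, 96.7, 65.5, 86.9],
                                 'bird': ['Quail', 'Redbreast Robin', 'American Robin', 'Loon']}
                                ).set_index('state')

# loc[row label, column label] returns a single value.
print(states_new_index.loc['WI', 'bird'])    # American Robin

# A list of row labels returns a DataFrame with just those rows.
print(states_new_index.loc[['WI', 'MN']])
```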
Suppose you regret your choice of index. Are you ruined? Nope.
# Reset the index to integers and return 'state' to a column.
states_undo = states_new_index.reset_index()
states_undo
| | state | people | size | bird |
|---|---|---|---|---|
| 0 | CA | 37 | 163.7 | Quail |
| 1 | MI | 9.8 | 96.7 | Redbreast Robin |
| 2 | WI | 5.7 | 65.5 | American Robin |
| 3 | MN | 5.3 | 86.9 | Loon |
Deleting rows and columns
To remove a row or column, we use the drop() method. The drop method is our first introduction to the axis argument. DataFrames have two axes. We have been calling them rows and columns, but pandas thinks of them as rows = 0 and columns = 1.
In the following, we want to remove the column 'size', but there could be a row index 'size', too, so we need to tell drop where to look for 'size'. Again, we use inplace if we don't want to create a copy.
states_subset = states.drop('size', axis = 1)
states_subset
| | state | people | bird |
|---|---|---|---|
| 0 | CA | 37 | Quail |
| 1 | MI | 9.8 | Redbreast Robin |
| 2 | WI | 5.7 | American Robin |
| 3 | MN | 5.3 | Loon |
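drop() also accepts a list of labels, so several columns can go at once. A sketch:

```python
import pandas as pd

states = pd.DataFrame({'state': ['CA', 'MI', 'WI', 'MN'],
                       'people': [37, 9.8, 5.7, 5.3],
                       'size': [163.7, 96.7, 65.5, 86.9],
                       'bird': ['Quail', 'Redbreast Robin', 'American Robin', 'Loon']})

# Drop two columns at once by passing a list of names.
trimmed = states.drop(['size', 'bird'], axis=1)
print(trimmed.columns.tolist())    # ['state', 'people']
```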
Deleting a row works just as you would expect...
# Why is this code not very robust?
states_no_mn = states.drop(3, axis = 0)
states_no_mn
| | state | people | size | bird |
|---|---|---|---|---|
| 0 | CA | 37 | 163.7 | Quail |
| 1 | MI | 9.8 | 96.7 | Redbreast Robin |
| 2 | WI | 5.7 | 65.5 | American Robin |
Practice: DataFrames
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.
- Explain why `states = states[['state', 'size', 'bird', 'pop']]` reordered the DataFrame `states`.
- Create a new DataFrame from the dict `data_dict` we used earlier. Name it `data`.
# The code states[['state', 'size', 'bird', 'pop']] is creating a new DataFrame with the columns state, size, bird, and pop,
# in that particular order. We could name that new DataFrame anything, but by naming it 'states' we are effectively reordering
# the original DataFrame.
data = pd.DataFrame(data_dict)
data
| | year | gdp | cons |
|---|---|---|---|
| 0 | 1990 | 5.9 | 3.8 |
| 1 | 1995 | 7.6 | 4.9 |
| 2 | 2000 | 10.2 | 6.8 |
| 3 | 2005 | 13.0 | 8.7 |
- Change the index to be the year variable and print the DataFrame
# data = data.set_index('year')
data.set_index('year', inplace=True)
data
| | gdp | cons |
|---|---|---|
| year | | |
| 1990 | 5.9 | 3.8 |
| 1995 | 7.6 | 4.9 |
| 2000 | 10.2 | 6.8 |
| 2005 | 13.0 | 8.7 |
- Change the remaining column names to 'consumption' and 'gross domestic product'
# data.columns = ['gross domestic product', 'consumption']
data.rename(columns = {'gdp': 'gross domestic product', 'cons':'consumption'}, inplace = True)
data
| | gross domestic product | consumption |
|---|---|---|
| year | | |
| 1990 | 5.9 | 3.8 |
| 1995 | 7.6 | 4.9 |
| 2000 | 10.2 | 6.8 |
| 2005 | 13.0 | 8.7 |
- Extract the rows corresponding to 1990 and 2000 and assign them to a DataFrame named `data_dec`. Hint: When we wanted to take two columns, we passed a list of the column names...
data_dec = data.loc[[1990, 2000]]
data_dec
| | gross domestic product | consumption |
|---|---|---|
| year | | |
| 1990 | 5.9 | 3.8 |
| 2000 | 10.2 | 6.8 |
- Create `data_dec_2` from `data` by deleting the 1995 and 2005 rows.
data_dec_2 = data.drop(1995, axis=0)
data_dec_2 = data_dec_2.drop(2005, axis=0)
data_dec_2
# You can drop both at once by passing a list.
# data_dec_2 = data.drop([1995, 2005], axis=0)
| | gross domestic product | consumption |
|---|---|---|
| year | | |
| 1990 | 5.9 | 3.8 |
| 2000 | 10.2 | 6.8 |