Calculations on DataFrames
files needed = none
Now that we understand DataFrames, let's look at basic calculations on DataFrames. We will cover more advanced calculations later.
The big idea here is that we never want to loop over the rows of a DataFrame to perform a calculation.
Pandas gives us lots of ways to perform simple and complex computations in a DataFrame. When we use pandas' tools, there is a lot of optimized code working in the background. When we loop over the DataFrame on our own, we lose all that fast code and our program grinds away slowly.
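To make the contrast concrete, here is a minimal sketch that computes the same ratio two ways. The DataFrame `df` and its columns `x` and `y` are made-up names for this illustration only. Both approaches give the same numbers, but the second hands the work to pandas in one step.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [2.0, 4.0, 8.0]})

# The pattern to avoid: visit each row in Python and compute one ratio at a time.
ratios_loop = []
for _, row in df.iterrows():
    ratios_loop.append(row['x'] / row['y'])

# The preferred pattern: one expression over whole columns.
ratios_vec = df['x'] / df['y']

print(ratios_loop)
print(ratios_vec.tolist())
```

On three rows you will not notice a difference in speed. On a large DataFrame, the loop version is the one that grinds.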
import pandas as pd
A DataFrame naturally understands how to perform operations element-wise. For example, let's compute the share of consumption in GDP. We get started by loading up our little data dictionary and creating a DataFrame.
data_dict = {'year': [1990, 1995, 2000, 2005 ], 'gdp':[5.9, 7.6, 10.2, 13.0], 'cons':[3.8, 4.9, 6.8, 8.7]}
data = pd.DataFrame(data_dict)
data
# This will divide cons by gdp: row by row.
data['cons_share'] = data['cons'] / data['gdp']
data
Notice the left-hand side of the assignment. I am creating a new variable (column) in the DataFrame and assigning the consumption share to it.
# Oops, I wanted it in percentage
data['cons_share'] = data['cons_share']*100
data
The +, -, /, and * operators all work element-wise. As we have seen, multiplying and dividing by a scalar works fine, too.
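Here is a short sketch of the other operators at work. I rebuild the GDP and consumption numbers in a separate DataFrame called `ex` so the block runs on its own. The new column names `non_cons` and `gdp_doubled` are hypothetical, chosen just for this example.

```python
import pandas as pd

# Same values as the data dictionary above, rebuilt so this block runs on its own.
ex = pd.DataFrame({'gdp': [5.9, 7.6, 10.2, 13.0], 'cons': [3.8, 4.9, 6.8, 8.7]})

# Column minus column: the subtraction is matched up row by row.
ex['non_cons'] = ex['gdp'] - ex['cons']    # hypothetical column name

# Column times scalar: the scalar is applied to every row.
ex['gdp_doubled'] = ex['gdp'] * 2          # hypothetical column name

print(ex)
```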
DataFrame methods for simple operations
DataFrame has many methods for computing various statistics on the data. Note that some of them take an axis argument. For example, you could compute .sum() across a row or a column. You have to tell pandas which one you want.
# Sums
print('Sum across columns')
print(data.sum(axis=1)) # summing across columns. Not terribly useful here.
print('\nSum across rows')
print(data.sum(axis=0)) # summing across rows. Cumulative GDP, consumption
print('\nSum up gdp')
print(data['gdp'].sum()) # Sum a single column. No axis necessary because this is a 1-D series.
# Means
print('\nMean of each column')
print(data.mean(axis=0))
print('\nMean gdp and cons')
print(data[['gdp', 'cons']].mean(axis=0)) # We could omit the axis here as well
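One way to keep the axis argument straight is to ask which dimension disappears. The sketch below uses a small stand-in DataFrame, `ex`, and selects only the gdp and cons columns first so that year is not added into the totals. Adding gdp and cons within a row is not meaningful economics. It is here only to show the shape of the result.

```python
import pandas as pd

# A small stand-in DataFrame so this block runs on its own.
ex = pd.DataFrame({'year': [1990, 1995], 'gdp': [5.9, 7.6], 'cons': [3.8, 4.9]})

# axis=0 collapses the rows: one result per column.
col_totals = ex[['gdp', 'cons']].sum(axis=0)

# axis=1 collapses the columns: one result per row.
row_totals = ex[['gdp', 'cons']].sum(axis=1)

print(col_totals)
print(row_totals)
```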
Try TAB completion or the documentation to see the available methods.
Here are a few: sum, mean, var, std, skew, rank, quantile, mode, min, max, kurtosis, cumsum, cumprod, ...
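Not all of these methods return the same kind of result. Some collapse a column to a single number, and others return one value per row. A minimal sketch on the GDP numbers, stored here in a stand-alone Series named `gdp` for illustration:

```python
import pandas as pd

# GDP values from above, indexed by year so the output is easy to read.
gdp = pd.Series([5.9, 7.6, 10.2, 13.0], index=[1990, 1995, 2000, 2005])

print(gdp.cumsum())        # running total: one value per row
print(gdp.rank())          # rank of each value: the smallest gets 1.0
print(gdp.quantile(0.5))   # a single number: the median
```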
These will be even more powerful once we learn how to group data within a DataFrame and compute statistics by group.
One very useful method is .describe().
# Type data. and press TAB to list the available methods.
# .describe() is a good place to start with a new data set.
print(data.describe())
print('\n\n') # Print a few blank lines.
print(data)
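If the default summary is not quite what you want, .describe() accepts a percentiles argument. The sketch below asks for the 10th and 90th percentiles on a stand-in DataFrame named `ex`. The particular percentiles are an arbitrary choice for illustration.

```python
import pandas as pd

# Stand-in DataFrame so this block runs on its own.
ex = pd.DataFrame({'gdp': [5.9, 7.6, 10.2, 13.0], 'cons': [3.8, 4.9, 6.8, 8.7]})

# Ask for the 10th and 90th percentiles in place of the default 25th and 75th.
# The median (50%) is still reported.
summary = ex.describe(percentiles=[0.1, 0.9])
print(summary)
```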
Practice: Calculations on DataFrames
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.
- Compute the mean of the consumption share for 1990 and 1995. You might try using .loc[] with two arguments: .loc[rows, cols]
- Try desc = data.describe() What is the return type?
- Looking ahead, try out the following code. What does it do? Can you find the file? What is inside of it?
desc.to_csv('desc.csv')
desc.to_excel('desc.xlsx')