download notebook
view notebook w/o solutions

Calculations on DataFrames

files needed = none

Now that we understand DataFrames, let's look at basic calculations on DataFrames. We will cover more advanced calculation later.

The big idea here is that we never want to loop over the rows of a DataFrame to perform a calculation.

Pandas gives us lots of ways to perform simple and complex computations in a DataFrame. When we use the pandas' tools, there is a lot of optimized code working in the background. When we loop over the DataFrame on our own, we lose all that fast code and our program grinds away slowly.

import pandas as pd

A DataFrame naturally understands how to perform operations element-wise. For example, let's compute the share of consumption in GDP. We get started by loading up our little data dictionary and creating a DataFrame.

data_dict = {'year': [1990, 1995, 2000, 2005 ], 'gdp':[5.9, 7.6, 10.2, 13.0], 'cons':[3.8, 4.9, 6.8, 8.7]}
data = pd.DataFrame(data_dict)   
data
year gdp cons
0 1990 5.9 3.8
1 1995 7.6 4.9
2 2000 10.2 6.8
3 2005 13.0 8.7
# This will divide cons by gdp: row by row.
data['cons_share'] = data['cons'] / data['gdp'] 

data
year gdp cons cons_share
0 1990 5.9 3.8 0.644068
1 1995 7.6 4.9 0.644737
2 2000 10.2 6.8 0.666667
3 2005 13.0 8.7 0.669231

Notice the left-hand side of the assignment. I am creating a new variable (column) in the DataFrame and assigning the consumption share to it.

# Oops, I wanted it in percentage
data['cons_share'] = data['cons_share']*100
data
year gdp cons cons_share
0 1990 5.9 3.8 64.406780
1 1995 7.6 4.9 64.473684
2 2000 10.2 6.8 66.666667
3 2005 13.0 8.7 66.923077

The +,-,/,* operators all work element wise. As we have seen, multiplying and dividing by scaler works fine, too.

DataFrame methods for simple operations

DataFrame has many methods for computing various statistics on the data. Note that some of them take an axis argument: For example, you could compute .sum() across a row or a column. You have to tell pandas which one you want.

# Sums 
print('Sum across columns')
print(data.sum(axis=1)) # summing across columns. Not terribly useful here.

print('\nSum across rows')
print(data.sum(axis=0)) # summing across rows. Cumulative GDP, consumption

print('\nSum up gdp')
print(data['gdp'].sum()) # Sum a single column. No axis necessary because this is a 1-D series.

# Means
print('\nMean of each column')
print(data.mean(axis=0)) 

print('\nMean gdp and cons')
print(data[['gdp', 'cons']].mean(axis=0)) # We could also omit the axis here as well
Sum across columns
0    2064.106780
1    2071.973684
2    2083.666667
3    2093.623077
dtype: float64

Sum across rows
year          7990.000000
gdp             36.700000
cons            24.200000
cons_share     262.470207
dtype: float64

Sum up gdp
36.7

Mean of each column
year          1997.500000
gdp              9.175000
cons             6.050000
cons_share      65.617552
dtype: float64

Mean gdp and cons
gdp     9.175
cons    6.050
dtype: float64

Try TAB completion to see the methods available or the documentation.

Here are a few: sum, mean, var, std, skew, rank, quantile, mode, min, max, kurtosis, cumsum, cumprod...

These will be even more powerful once we learn how to group data within a DataFrame and compute statistics by group.

One very useful one...

data.
# .describe() is a good place to start with a new data set.
print(data.describe())   

print('\n\n')  # Print a few blank lines.
print(data)
              year       gdp      cons  cons_share
count     4.000000   4.00000  4.000000    4.000000
mean   1997.500000   9.17500  6.050000   65.617552
std       6.454972   3.10309  2.157931    1.363750
min    1990.000000   5.90000  3.800000   64.406780
25%    1993.750000   7.17500  4.625000   64.456958
50%    1997.500000   8.90000  5.850000   65.570175
75%    2001.250000  10.90000  7.275000   66.730769
max    2005.000000  13.00000  8.700000   66.923077



   year   gdp  cons  cons_share
0  1990   5.9   3.8   64.406780
1  1995   7.6   4.9   64.473684
2  2000  10.2   6.8   66.666667
3  2005  13.0   8.7   66.923077

Practice: Calculations on DataFrames

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.

  1. Compute the mean of the consumption share for 1990 and 1995. You might try using .loc[] with two arguments .loc[rows, cols]
data.set_index('year', inplace=True)
data
gdp cons cons_share
year
1990 5.9 3.8 64.406780
1995 7.6 4.9 64.473684
2000 10.2 6.8 66.666667
2005 13.0 8.7 66.923077
data.loc[1990:1995,'cons_share']
year
1990    64.406780
1995    64.473684
Name: cons_share, dtype: float64
mean_c = data.loc[1990:1995,'cons_share'].mean()

print(data.loc[1990:1995,'cons_share'])
print('\n\n')
print(mean_c)
year
1990    64.406780
1995    64.473684
Name: cons_share, dtype: float64



64.44023193577164
data.loc[[1990,1995], 'cons_share'].mean()
64.44023193577164
  1. Try desc = data.describe() What is the return type?
desc = data.describe()
print(type(desc))
desc
<class 'pandas.core.frame.DataFrame'>
gdp cons cons_share
count 4.00000 4.000000 4.000000
mean 9.17500 6.050000 65.617552
std 3.10309 2.157931 1.363750
min 5.90000 3.800000 64.406780
25% 7.17500 4.625000 64.456958
50% 8.90000 5.850000 65.570175
75% 10.90000 7.275000 66.730769
max 13.00000 8.700000 66.923077
  1. Looking ahead, try out the following code. What does it do? Can you find the file? What is inside of it?
desc.to_csv('desc.csv')
desc.to_excel('desc.xlsx')