download notebook
view notebook w/o solutions
Calculations on DataFrames
files needed = none
Now that we understand DataFrames, let's look at basic calculations on DataFrames. We will cover more advanced calculation later.
The big idea here is that we never want to loop over the rows of a DataFrame to perform a calculation.
Pandas gives us lots of ways to perform simple and complex computations in a DataFrame. When we use the pandas' tools, there is a lot of optimized code working in the background. When we loop over the DataFrame on our own, we lose all that fast code and our program grinds away slowly.
import pandas as pd
A DataFrame naturally understands how to perform operations element-wise. For example, let's compute the share of consumption in GDP. We get started by loading up our little data dictionary and creating a DataFrame.
data_dict = {'year': [1990, 1995, 2000, 2005 ], 'gdp':[5.9, 7.6, 10.2, 13.0], 'cons':[3.8, 4.9, 6.8, 8.7]}
data = pd.DataFrame(data_dict)
data
year | gdp | cons | |
---|---|---|---|
0 | 1990 | 5.9 | 3.8 |
1 | 1995 | 7.6 | 4.9 |
2 | 2000 | 10.2 | 6.8 |
3 | 2005 | 13.0 | 8.7 |
# This will divide cons by gdp: row by row.
data['cons_share'] = data['cons'] / data['gdp']
data
year | gdp | cons | cons_share | |
---|---|---|---|---|
0 | 1990 | 5.9 | 3.8 | 0.644068 |
1 | 1995 | 7.6 | 4.9 | 0.644737 |
2 | 2000 | 10.2 | 6.8 | 0.666667 |
3 | 2005 | 13.0 | 8.7 | 0.669231 |
Notice the left-hand side of the assignment. I am creating a new variable (column) in the DataFrame and assigning the consumption share to it.
# Oops, I wanted it in percentage
data['cons_share'] = data['cons_share']*100
data
year | gdp | cons | cons_share | |
---|---|---|---|---|
0 | 1990 | 5.9 | 3.8 | 64.406780 |
1 | 1995 | 7.6 | 4.9 | 64.473684 |
2 | 2000 | 10.2 | 6.8 | 66.666667 |
3 | 2005 | 13.0 | 8.7 | 66.923077 |
The +,-,/,* operators all work element wise. As we have seen, multiplying and dividing by scaler works fine, too.
DataFrame methods for simple operations
DataFrame has many methods for computing various statistics on the data. Note that some of them take an axis argument: For example, you could compute .sum()
across a row or a column. You have to tell pandas which one you want.
# Sums
print('Sum across columns')
print(data.sum(axis=1)) # summing across columns. Not terribly useful here.
print('\nSum across rows')
print(data.sum(axis=0)) # summing across rows. Cumulative GDP, consumption
print('\nSum up gdp')
print(data['gdp'].sum()) # Sum a single column. No axis necessary because this is a 1-D series.
# Means
print('\nMean of each column')
print(data.mean(axis=0))
print('\nMean gdp and cons')
print(data[['gdp', 'cons']].mean(axis=0)) # We could also omit the axis here as well
Sum across columns
0 2064.106780
1 2071.973684
2 2083.666667
3 2093.623077
dtype: float64
Sum across rows
year 7990.000000
gdp 36.700000
cons 24.200000
cons_share 262.470207
dtype: float64
Sum up gdp
36.7
Mean of each column
year 1997.500000
gdp 9.175000
cons 6.050000
cons_share 65.617552
dtype: float64
Mean gdp and cons
gdp 9.175
cons 6.050
dtype: float64
Try TAB completion to see the methods available or the documentation.
Here are a few: sum
, mean
, var
, std
, skew
, rank
, quantile
, mode
, min
, max
, kurtosis
, cumsum
, cumprod
...
These will be even more powerful once we learn how to group data within a DataFrame and compute statistics by group.
One very useful one...
data.
# .describe() is a good place to start with a new data set.
print(data.describe())
print('\n\n') # Print a few blank lines.
print(data)
year gdp cons cons_share
count 4.000000 4.00000 4.000000 4.000000
mean 1997.500000 9.17500 6.050000 65.617552
std 6.454972 3.10309 2.157931 1.363750
min 1990.000000 5.90000 3.800000 64.406780
25% 1993.750000 7.17500 4.625000 64.456958
50% 1997.500000 8.90000 5.850000 65.570175
75% 2001.250000 10.90000 7.275000 66.730769
max 2005.000000 13.00000 8.700000 66.923077
year gdp cons cons_share
0 1990 5.9 3.8 64.406780
1 1995 7.6 4.9 64.473684
2 2000 10.2 6.8 66.666667
3 2005 13.0 8.7 66.923077
Practice: Calculations on DataFrames
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.
- Compute the mean of the consumption share for 1990 and 1995. You might try using
.loc[]
with two arguments.loc[rows, cols]
data.set_index('year', inplace=True)
data
gdp | cons | cons_share | |
---|---|---|---|
year | |||
1990 | 5.9 | 3.8 | 64.406780 |
1995 | 7.6 | 4.9 | 64.473684 |
2000 | 10.2 | 6.8 | 66.666667 |
2005 | 13.0 | 8.7 | 66.923077 |
data.loc[1990:1995,'cons_share']
year
1990 64.406780
1995 64.473684
Name: cons_share, dtype: float64
mean_c = data.loc[1990:1995,'cons_share'].mean()
print(data.loc[1990:1995,'cons_share'])
print('\n\n')
print(mean_c)
year
1990 64.406780
1995 64.473684
Name: cons_share, dtype: float64
64.44023193577164
data.loc[[1990,1995], 'cons_share'].mean()
64.44023193577164
- Try
desc = data.describe()
What is the return type?
desc = data.describe()
print(type(desc))
desc
<class 'pandas.core.frame.DataFrame'>
gdp | cons | cons_share | |
---|---|---|---|
count | 4.00000 | 4.000000 | 4.000000 |
mean | 9.17500 | 6.050000 | 65.617552 |
std | 3.10309 | 2.157931 | 1.363750 |
min | 5.90000 | 3.800000 | 64.406780 |
25% | 7.17500 | 4.625000 | 64.456958 |
50% | 8.90000 | 5.850000 | 65.570175 |
75% | 10.90000 | 7.275000 | 66.730769 |
max | 13.00000 | 8.700000 | 66.923077 |
- Looking ahead, try out the following code. What does it do? Can you find the file? What is inside of it?
desc.to_csv('desc.csv')
desc.to_excel('desc.xlsx')