
Revisiting Gapminder

files needed = 'conts.csv'

On the first day of class, we looked at the gapminder website, in particular, this figure.

We talked about all the information we could squeeze out of the figure. After three months of class, we would say that the figure is graphically excellent.

On this last day of class, I thought I would walk through constructing this figure and highlight all the skills we have picked up along the way.

Our roadmap:

  1. Get the data from the World Bank (apis)
  2. Extract the data we need (working with dicts, loops, functions; creating DataFrames)
  3. Create one data set (merging DataFrames; reading from files; dropping observations; calculations in DataFrames)
  4. Create the figure (interactive figures; visualization best practices; graphical excellence)
import requests
import pandas as pd
import plotly.express as px
import numpy as np

1. Get the data and build a DataFrame: apis, merging, and cleaning

We need data on populations, life expectancy and GDP. I will get them from the World Bank World Development Indicators dataset through the api.

After looking up the variable names, I build the urls, request the data, and parse the dicts that are returned. In theory, I should be able to ask for many variables in one api call. I kept getting an error when I tried to, so I am retrieving each series individually. There is always something to debug...

This is (unnecessarily) repetitive, so I will write a short function and use a loop.

urlpop = 'http://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL?format=json&date=2020&per_page=300'
urlgdp = 'http://api.worldbank.org/v2/country/all/indicator/NY.GDP.MKTP.PP.CD?format=json&date=2020&per_page=300'
urlexp = 'http://api.worldbank.org/v2/country/all/indicator/SP.DYN.LE00.IN?format=json&date=2020&per_page=300'
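The three urls differ only in the indicator code, so they could also be built from one template. Here is a minimal sketch of that idea. The helper build_url and the dict indicators are hypothetical names I am introducing for illustration; they are not used elsewhere in this notebook.

```python
# Hypothetical helper: build_url is not part of the notebook's own code.
BASE = 'http://api.worldbank.org/v2/country/all/indicator/{code}?format=json&date={year}&per_page={per_page}'

def build_url(code, year=2020, per_page=300):
    '''Return the World Bank api url for one indicator code and one year.'''
    return BASE.format(code=code, year=year, per_page=per_page)

# The indicator codes are the ones used in the three urls above.
indicators = {'pop': 'SP.POP.TOTL', 'gdp': 'NY.GDP.MKTP.PP.CD', 'exp': 'SP.DYN.LE00.IN'}
urls = {name: build_url(code) for name, code in indicators.items()}

print(urls['pop'])
```

Changing the year or adding a fourth series then means editing one argument or one dict entry rather than copying a whole url.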
def json_df(j, col):
    '''
    Given the parsed json retrieved from the WB api (a list whose second element holds one dict per country),
    extract the country name, iso, date, and the variable. Return a DataFrame.
    '''
    names, codes, dates, var = [], [], [], []
    for c in j[1]:
        names.append(c['country']['value'])
        codes.append(c['countryiso3code'])
        dates.append(c['date'])
        var.append(c['value'])

    return pd.DataFrame({'country name':names, 'iso':codes, 'date':dates, col:var})
# Loop through the three urls. Each call to json_df returns a DataFrame. I am collecting the DataFrames into a dict.
dfs = {}
for url, v in zip([urlpop, urlgdp, urlexp], ['pop', 'gdp', 'exp']):
    response = requests.get(url)
    dfs[v] = json_df(response.json(), v)
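To see what json_df is digging through without calling the api, here is a small sketch with a made-up payload in the same shape: element 0 is metadata and element 1 is a list with one dict per country. The country names, codes, and values below are placeholders I invented, not World Bank data.

```python
# A made-up payload in the shape json_df expects. The entries are placeholders.
payload = [
    {'page': 1, 'pages': 1},   # element 0: metadata, which json_df ignores
    [                          # element 1: one dict per country
        {'country': {'id': 'AA', 'value': 'Country A'},
         'countryiso3code': 'AAA', 'date': '2020', 'value': 100},
        {'country': {'id': 'BB', 'value': 'Country B'},
         'countryiso3code': 'BBB', 'date': '2020', 'value': None},
    ],
]

# The same four lookups json_df makes, collected as tuples instead of a DataFrame.
rows = [(c['country']['value'], c['countryiso3code'], c['date'], c['value'])
        for c in payload[1]]

for row in rows:
    print(row)
```

The None in the second row is how a missing observation shows up; it becomes a missing value once the data are in a DataFrame.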

3. Create one DataFrame

Now I have three DataFrames in the dict dfs. Merge them together.

data = pd.merge(left=dfs['pop'], right=dfs['gdp'], on=['country name', 'iso', 'date'])
data = pd.merge(left=data, right=dfs['exp'], on=['country name', 'iso', 'date'])
data.set_index('iso', inplace=True)
data.head()

Data cleaning

The data include a bunch of country aggregates. I bet there is a way to drop these when I get the data from the World Bank. Maybe I can figure it out for version 2.0 of this notebook. I'm also dropping any countries with missing data.

aggs = ['AFE', 'AFW', 'ARB', 'CSS', 'CEB', 'EAR', 'EAS', 'EAP', 'TEA', 'EMU', 'ECS', 'ECA', 'EUU', 'FCS', 'TEC',
       'HPC', 'IBD', 'IBT', 'IDB', 'IDX', 'IDA', 'LTE', 'LCN', 'LAC', 'TLA', 'LDC', 'LMY', 'MEA', 'MNA', 'TMN', 'MIC',
       'NAC', 'OED', 'OSS', 'PSS', 'PST', 'PRE', 'SST', 'SAS', 'TSA', 'SSF', 'SSA', 'TSS', 'WLD', '']

data.drop(aggs, inplace=True)
data.dropna(subset=['pop', 'gdp', 'exp'], inplace=True)
data.head(2)
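One thing to watch: by default, .drop() raises a KeyError if any label in the list is not in the index. A boolean mask with .isin() is one alternative that skips missing labels quietly. This is a sketch on toy data, not a replacement for the cell above; the names toy, toy_aggs, and countries are hypothetical and the numbers are placeholders.

```python
import pandas as pd

# Toy data: the numbers are placeholders, not World Bank values.
toy = pd.DataFrame({'pop': [1.0, 2.0, 3.0]}, index=['USA', 'WLD', 'AFG'])
toy_aggs = ['WLD', 'EUU']   # 'EUU' is not in the toy index

# Keep the rows whose index label is NOT in the list of aggregates.
countries = toy[~toy.index.isin(toy_aggs)]

print(countries.index.tolist())
```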

The WB data do not include a 'continent' variable. The file 'conts.csv' has the information I need in it. One more merge.

data = pd.merge(left=data, right=pd.read_csv('conts.csv'), left_index=True, right_on='iso')
data = data.rename(columns={'cont':'regions'})
data.set_index('iso', inplace=True)
data.head(2)

Looking ahead, I know that I want a log base-2 x axis. I'm creating the gdp per capita value and then taking the log. I take a look at the US as a sanity check. gdp per capita of 63K looks about right. Notice that \(2^{15.9477}=63206.\)

data['gdpcap'] = data['gdp']/data['pop']
data['gdpcaplog2'] = np.log2(data['gdpcap'])
data.loc['USA']
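The sanity check can be reproduced by hand. This sketch uses only the rounded US figure quoted above (63206), so the log may differ from the DataFrame value in the last decimal place.

```python
import math

gdpcap = 63206                    # the rounded US value quoted above
log2_gdpcap = math.log2(gdpcap)   # the value that goes on the x axis

print(round(log2_gdpcap, 3))
print(round(2 ** log2_gdpcap))    # raising 2 to the log recovers the original value
```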

4. The figure: interactive figures, graphical excellence

Let's run the code and check out the figure. Then we can go back and look at the code.

# The x data are in log base two. They are numbers between 9 and 16. Most people would not find these labels
# very intuitive, so we are going to replace the labels with ones that are easy to read.

xv = [500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000]
xticklabs = ['500', '1000', '2000', '4000', '8000', '16K', '32K', '64K', '128K']
xtickvals = [np.log2(x) for x in xv]
fig = px.scatter(data, x='gdpcaplog2', y='exp', size='pop', color='regions', 
                 hover_name='country name', 
                 hover_data={'gdpcaplog2':False, 'exp':False, 'pop':False, 'regions':False}, 
                 size_max=70,
                 color_discrete_map={'Asia':'#ff798e', 'Europe':'#ffec33', 'Africa':'#33dded', 'Americas':'#99ef33'}
                )

fig.update_layout(plot_bgcolor='white',
                  xaxis={'title':'Income', 
                         'tickmode':'array', 'tickvals':xtickvals, 'ticktext':xticklabs,
                         'gridcolor':'lightgray'},
                  yaxis={'title':'Life expectancy','gridcolor':'lightgray'},
                  yaxis_range=[10,91],
                  hoverlabel={'bgcolor':'white'}
                  )

fig.update_traces(marker={'line':{'width':0.75, 'color':'black'}})

# Style the axes
fig.update_xaxes(showline=True, linewidth=1, linecolor='black', 
                 showspikes=True, spikemode='across', spikethickness=1.5, spikecolor='lightgray')  

fig.update_yaxes(showline=True, linewidth=1, linecolor='black', 
                 showspikes=True, spikemode='across', spikethickness=1.5, spikecolor='lightgray')

# The annotations for the x label and the big "2020"
fig.add_annotation(x=15.5, y=12, text='per person (GDP/capita, PPP$ inflation-adjusted)', showarrow=False,
                  font={'size':12, 'color':'black'})

fig.add_annotation(x=13, y=50, text='2020', showarrow=False,
                  font={'size':100, 'color':'lightgray'}, opacity=0.75)

import plotly.io as pio

pio.write_html(fig, 
               file='gap_minder.html', 
               full_html = True, 
               auto_open=False, 
               config={'displayModeBar': False, 'showTips': False, 'responsive': True}
              )

fig.show()

The x axis

Check out the x axis. Each grid line is a doubling of income. The x data are transformed into their log base-2 values. For example, Afghanistan's GDP per capita is 2078. The log2 of 2078 is 11.02:

$$2^{11.02} = 2078$$

So the real units on the x axis run from about 9.5 to 16. Comment out this part of the code to see it.

'tickmode':'array', 'tickvals':xtickvals, 'ticktext':xticklabs,
data.head(2)

I do not want the log values on the axis. So I will set the values manually.

xv = [500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000]
xtickvals = [np.log2(x) for x in xv]

The xv list holds the values (in ordinary dollars) that I want as tick marks. They could be any values I wanted as ticks. The xtickvals are the log base-2 values of these numbers. These are the positions of the tick marks on the axis.
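Because each entry in xv is double the one before it, the tick positions end up exactly one unit apart on the log base-2 axis. A quick check, written with the math module so it runs on its own (gaps is a name I am adding for illustration):

```python
import math

xv = [500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000]
xtickvals = [math.log2(x) for x in xv]

# Distance between neighboring ticks on the log base-2 axis.
gaps = [b - a for a, b in zip(xtickvals, xtickvals[1:])]

print([round(t, 2) for t in xtickvals])
print([round(g, 2) for g in gaps])
```

This is why the grid lines in the figure are evenly spaced even though the labels double each time.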

The labels can be any arbitrary text. I could have just used xv as the labels, but I decided to go with the 'K' notation from the website.

xticklabs = ['500', '1000', '2000', '4000', '8000', '16K', '32K', '64K', '128K']

Spikes

When I hover on a dot, I get two lines that point out the values on the axes. These are 'spikes' and we style them with .update_xaxes() and .update_yaxes().

fig.update_xaxes(showline=True, linewidth=1, linecolor='black', 
                 showspikes=True, spikemode='across', spikethickness=1.5, spikecolor='lightgray')  

Colors

We have always used named colors in our work: 'black', 'red', 'skyblue'. matplotlib, plotly, etc. allow for very fine control of color. There are several ways to specify a specific color. The two most popular are probably by its hex number or its rgb values.

I am terrible at graphic design. As far as I know, I am not color blind but I needed to use a color picker and a screen shot from gapminder to get the colors right.

color_discrete_map={'Asia':'#ff798e', 'Europe':'#ffec33', 'Africa':'#33dded', 'Americas':'#99ef33'}
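A hex color is just the red, green, and blue values written as three two-digit base-16 numbers. Here is a short sketch that converts the four hex codes above to rgb. The function hex_to_rgb is a hypothetical helper I am writing for illustration, not something from plotly.

```python
def hex_to_rgb(hex_color):
    '''Convert a '#rrggbb' string to an (r, g, b) tuple of ints from 0 to 255.'''
    h = hex_color.lstrip('#')
    # Each pair of hex digits is one channel: red, green, blue.
    return tuple(int(h[i:i+2], 16) for i in (0, 2, 4))

colors = {'Asia': '#ff798e', 'Europe': '#ffec33', 'Africa': '#33dded', 'Americas': '#99ef33'}

for region, hex_color in colors.items():
    r, g, b = hex_to_rgb(hex_color)
    print(region, hex_color, f'rgb({r}, {g}, {b})')
```

plotly accepts strings in the 'rgb(255, 121, 142)' form anywhere it accepts a hex code, so either notation would work in color_discrete_map.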

Bubble charts

We discussed bubble charts in our matplotlib/seaborn notebooks. A bubble chart scales the size of the data point marker by some third variable. In our case, it is the population of the country.

fig = px.scatter(data, x='gdpcaplog2', y='exp', size='pop', size_max=70)

size_max sets the size of the largest bubble in the figure and the rest are scaled down from there.
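How are the rest scaled down? As far as I understand it, plotly express scales the marker area (not the diameter) in proportion to the size variable, so a country with one quarter of the largest population gets a bubble with half the diameter. The sketch below works through that arithmetic under that assumption. The populations are round placeholder numbers, not the World Bank data, and diameters is a name I am adding for illustration.

```python
import math

size_max = 70   # diameter of the largest bubble, as in the figure above

# Placeholder populations chosen to make the arithmetic easy to follow.
pops = {'A': 1_400_000_000, 'B': 350_000_000, 'C': 14_000_000}
largest = max(pops.values())

# Assumption: area is proportional to population, so diameter goes with the square root.
diameters = {name: size_max * math.sqrt(p / largest) for name, p in pops.items()}

for name, d in diameters.items():
    print(name, round(d, 1))
```

Scaling by area keeps the big countries from visually swamping the small ones as much as they would if the diameter were proportional to population.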

Not bad for a semester's work!