
K-means clustering

files needed = ('iris.csv', 'wi_mn_counties.csv')

In the last notebook we looked at the classifier problem: we knew which species a flower could be, and we built a model to predict an observation's species from its characteristics.

In this notebook we look at the clustering problem. In the clustering problem, we do not know how many species there are! Instead we try to group observations together in clusters based on how similar the observations are.

You might use clustering to analyze a market (which customers are similar?), to sniff out bots on Twitter, to find blocks of voters that might be on the fence, or to group incoming freshmen into dorms.

import pandas as pd
import sklearn.cluster
import matplotlib.pyplot as plt
import seaborn as sns

Let's return to the iris data. We can cluster on many characteristics, but if we use only two, we can visualize them in a scatter plot. Let's use the petal measurements.

We are going to pretend that we do not know the species of each observation. We will use the species information later to see how well the cluster method worked.

iris = pd.read_csv('iris.csv')
iris.head(1)
fig, ax = plt.subplots(figsize=(10,6))

ax.scatter(iris['petal length'], iris['petal width'], color='blue')

ax.set_xlabel('petal length (cm)')
ax.set_ylabel('petal width (cm)')

sns.despine()

plt.show()

Notice that I made all the dots the same color. I am pretending that I do not know which species each dot belongs to. In fact, I am pretending that I do not even know how many species there are.

I see two distinct blobs. Let's guess that there are two clusters (k=2) in this data and estimate the model.

As we saw in the classifier problem, sklearn works with numpy arrays. Below, I convert the two characteristics to a numpy array.

X = iris[['petal width', 'petal length']].to_numpy()

Now we estimate the model by finding the centers of the clusters. It works (roughly) by

  1. Choosing a center for each cluster.
  2. Assigning each observation to its nearest center.
  3. Computing, for each cluster, the distances from the center to the points in the cluster.

The algorithm keeps adjusting the locations of the cluster centers until it has minimized the sum of squared distances from the centers of the clusters to their members.
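
If you want to see the mechanics, here is a minimal sketch of that loop in plain numpy. It is only an illustration; the scikit-learn implementation we use below is more careful about initialization and convergence.

import numpy as np

def simple_kmeans(X, k, n_iter=20, seed=0):
    """A bare-bones illustration of the k-means loop (not what sklearn does internally)."""
    rng = np.random.default_rng(seed)

    # 1. Start from k randomly chosen observations as the cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iter):
        # 2. Assign each observation to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # 3. Move each center to the mean of the observations assigned to it.
        #    (Assumes no cluster ends up empty, which is good enough for an illustration.)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    return centers, labels

Calling simple_kmeans(X, 2) should put its two centers in roughly the same places as the sklearn estimates below.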

# The random_state option ensures we all get the same answer.
kmeans = sklearn.cluster.KMeans(n_clusters=2, random_state=0).fit(X)

The object returned from the model contains (among other things) the coordinates of the centers. They are numpy arrays. I will convert them to a dataframe.

kmeans.cluster_centers_
# X's columns are (petal width, petal length), so the centers come back in (y, x) order for our plot
centers = pd.DataFrame(kmeans.cluster_centers_, columns=['y', 'x'])
centers
fig, ax = plt.subplots(figsize=(10,6))

ax.scatter(iris['petal length'], iris['petal width'], color='blue')
ax.scatter(centers['x'], centers['y'], color='red', s=300, marker='*')

ax.set_xlabel('petal length (cm)')
ax.set_ylabel('petal width (cm)')

sns.despine()

plt.show()

This is about what I expected from 'eyeballing' it.

We know that there are three types of irises in the data set. Let's try 3 clusters.

kmeans = sklearn.cluster.KMeans(n_clusters=3, random_state=0).fit(X)
centers = pd.DataFrame(kmeans.cluster_centers_, columns=['y', 'x'])
centers
fig, ax = plt.subplots(figsize=(10,6))

ax.scatter(iris['petal length'], iris['petal width'], color='blue')
ax.scatter(centers['x'], centers['y'], color='red', s=300, marker='*')

ax.set_xlabel('petal length (cm)')
ax.set_ylabel('petal width (cm)')

sns.despine()

plt.show()

Do these three centers make sense? Let's grab the predicted values (the labels) for each data point, merge them onto the original DataFrame, and then plot them.

kmeans.labels_
pred = pd.DataFrame(kmeans.labels_, columns=['predicted'])
iris = pd.merge(left=pred, right=iris, left_index=True, right_index=True)
iris.head(10)

The algorithm labels the clusters 0, 1, and 2. Let's give them reasonable names.

iris['predicted'] = iris['predicted'].replace({0:'setosa', 1:'versicolor', 2:'virginica'})
iris.head()
fig, ax = plt.subplots(1,2,figsize=(15,6))


g = sns.scatterplot(ax=ax[0], x='petal length', y='petal width', hue='predicted', data=iris)
g.legend_.remove()

g = sns.scatterplot(ax=ax[1], x='petal length', y='petal width', hue='species', data=iris)

ax[0].scatter(centers['x'], centers['y'], color='red', s=300, marker='*')

ax[0].set_title('predicted')
ax[1].set_title('actual')


sns.despine(ax=ax[0])
sns.despine(ax=ax[1])

plt.show()

The algorithm does a good job. The setosa cluster is perfect — which is not surprising. The split between the other two types is decent, but not perfect.


How many clusters? The elbow graph.

When I looked at the data and pretended that I did not know how many species there were, I thought there were two clusters. When I added a third, it did a pretty good job matching the data.

But the whole point of these models is that we do not know how many clusters there are. So how do we pick the number of clusters?

The short answer: Try a bunch and see what happens.

We can loop over the number of clusters and keep track of the sum of squared distances between the cluster centers and the members of each cluster. sklearn stores this in the fitted model's inertia_ attribute.

kmeans.inertia_
y=[]
for n in range(1,8):
    kmeans = sklearn.cluster.KMeans(n_clusters=n, random_state=0).fit(X)
    y.append(kmeans.inertia_)

y
fig, ax = plt.subplots(figsize=(15,6))

ax.plot(range(1,8), y, color='blue', marker='.')

ax.set_xlabel('number of clusters (k)')

ax.set_ylabel('sum of squared distances to nearest cluster')

sns.despine(ax=ax)

We see that moving from 1 to 2 clusters made a big improvement. ML people call this the "elbow" of the graph because the figure looks like an arm, I guess. Moving from 2 to 3 clusters made less of an improvement.

We can always lower the sum of squared distance (ssd) by increasing k. When k is the same as the number of observations, then each cluster has exactly one observation in it and the ssd is zero. But that isn't very helpful. As economists, we would say that there are diminishing returns to k.
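
You can check this claim directly. If we give sklearn one cluster for every distinct observation, each point sits on its own center and the ssd is zero. (This is just a sanity check, not something you would do in practice.)

import numpy as np

# One cluster per distinct observation: every point becomes its own center.
# np.unique drops repeated petal measurements so we do not ask for more
# clusters than there are distinct points.
X_unique = np.unique(X, axis=0)
kmeans_max = sklearn.cluster.KMeans(n_clusters=len(X_unique), random_state=0).fit(X_unique)
print(kmeans_max.inertia_)    # 0.0 (up to floating-point error)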

From this figure, I would choose either k=2 or k=3. After that, we do not get much improvement in the ssd.

Practice

Which Minnesota counties are similar to which Wisconsin counties? Perhaps your company has spent a lot of money researching Wisconsin as a marketplace. Now it wants to expand into Minnesota. Can we find counties in Minnesota that are similar to counties in Wisconsin? Let's use our clustering model to see.

  1. Load the file 'wi_mn_counties.csv'.

The variables are

  • income: in thousands of USD, for 2018
  • population: number of persons, for 2018
  • ALAND: area of the county in square meters

  2. Create an income per capita variable: income_cap = income/pop

  3. Create a density variable: density = pop/(ALAND/1000000)

Dividing ALAND by 1,000,000 makes the units people per square kilometer. (If you get stuck, there is a sketch of these two steps after the question list.)

  4. Make a scatter plot of income_cap and density. Do you see a relationship? Would a linear model be helpful here?

  5. Compute the elbow graph using density and income_cap as the variables. Try 1 through 10 possible clusters.

Remember, you need to convert your DataFrame data to numpy arrays.

  6. Based on your elbow graph, how many clusters seem appropriate?

Let's explore our results some more.

  7. Run the kmeans model with k=3.
  8. Retrieve the labels_ from the results.
  9. Merge the labels onto your original DataFrame, using the index as the merge key.

  10. Plot income_cap and density again. Use seaborn's scatterplot and set the hue to the labels_ you retrieved. What patterns do you see?

You might add labels to the points so you can tell which counties are which.

  11. Redo question 10, but with k=4. How do the results change?
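
If you get stuck on the variable construction in questions 2 and 3, here is a minimal sketch. The column names income, pop, and ALAND are assumptions; check the actual names with .head() after loading the file.

counties = pd.read_csv('wi_mn_counties.csv')

# Column names are assumed here; check counties.head() for the actual names.
counties['income_cap'] = counties['income'] / counties['pop']             # thousands of USD per person
counties['density'] = counties['pop'] / (counties['ALAND'] / 1000000)     # people per square kilometer

counties.head()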