course: (ML Series) K Means Clustering

By Julien Hernandez Lallement, 2020-09-29, in category Course

KMeans, machine learning, python

Introduction

K Means Clustering (KMC) is an unsupervised learning approach. It can be used when you have a series of data points, each represented by a vector of features, that are not classified or labelled beforehand.

The idea here is to group these data points into meaningful clusters.

Use cases

KMC can be used in different scenarios:

  • grouping data points in a meaningful way based on similarities between their features
  • customer classification based on chosen features
  • grouping images together

The good thing with KMC is that it is fairly easy to implement, quite intuitive to grasp, and it adds value and insights to your projects very fast by identifying relevant groups in your dataset. In some cases, even if the data visualization does not suggest clear groups, it might still be useful to differentiate between some in order to draw up a strategy and an action plan.

Theoretical Background

As always, we have $N$ data points, each represented by a vector of $M$ features.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (14, 8)
In [2]:
np.random.seed(40)
first_cluster_x = np.random.random_sample(10)
first_cluster_y = np.random.random_sample(10)
second_cluster_x = np.random.random_sample(10)+10
second_cluster_y = np.random.random_sample(10)+10
third_cluster_x = np.random.random_sample(10)*10
third_cluster_y = np.random.random_sample(10)*10
In [3]:
plt.scatter(first_cluster_x, first_cluster_y);
plt.scatter(second_cluster_x, second_cluster_y);
plt.scatter(third_cluster_x, third_cluster_y);
plt.legend(["Cluster 1", "Cluster 2", "Cluster 3"]);

Here, I generated three distributions that I named Cluster 1, 2 & 3.

As you can see, while Cluster 1 (blue) seems relatively isolated, some points from Cluster 3 (green) might rather be associated with Cluster 2 (orange).

The goal being to group these data points together, we will choose the number of clusters $K$ that we want the algorithm to form. We will see later how we can choose the best value of $K$.

For now, let's say we use $K$ = 3 because we believe that there are 3 natural clusters in the way the data is distributed.

The algorithm will optimize the placement of the cluster's centers (also called centroids) so that each of the $N$ data points is closest to a cluster center.

Now imagine you are working not in a two-dimensional space, but in an $M$-dimensional space, $M$ being the number of features in your data set. While it might be relatively easy for you to eyeball the centroids in the plot above, it would be much harder to do so when $M$ > 2.

Mathematically, the algorithm minimizes the intra-cluster variance, measured as the squared distance of the data points from the center of their cluster. The centers are chosen randomly at first, and optimized over several iterations.
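Concretely, if the clusters are $S_1, \dots, S_K$ and $\mu_j$ denotes the centroid (mean) of cluster $S_j$, the quantity being minimized is the within-cluster sum of squares:

\begin{equation*} \min_{S_1,\dots,S_K} \sum \limits _{j=1} ^{K} \sum \limits _{x_i \in S_j} \lVert x_i - \mu_j \rVert^2 \end{equation*}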

Note that since we are measuring distances, we should scale the data so that all features have comparable ranges; otherwise features with large values will outweigh features with lower values.
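As a minimal sketch of that scaling step on the toy data above (X_demo and X_scaled are just illustrative names, and scikit-learn's StandardScaler is one option among others):

from sklearn.preprocessing import StandardScaler

# Stack the two toy coordinates into an (N, 2) feature matrix
X_demo = np.column_stack([np.concatenate([first_cluster_x, second_cluster_x, third_cluster_x]),
                          np.concatenate([first_cluster_y, second_cluster_y, third_cluster_y])])

# Standardize each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_demo)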

Once that is done, the algorithm will randomly choose two centroids, depicted in red in the graph below.
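The centroid coordinates used in the plotting cell below are not defined in the cells shown here; here is a minimal sketch, assuming two centroids drawn uniformly at random within the range of the toy data, and keeping the variable names centroid_x and centroid_y that the plotting cell expects:

np.random.seed(1)
all_x = np.concatenate([first_cluster_x, second_cluster_x, third_cluster_x])
all_y = np.concatenate([first_cluster_y, second_cluster_y, third_cluster_y])
# Draw two random centroid positions within the range of the observed data
centroid_x = np.random.uniform(all_x.min(), all_x.max(), 2)
centroid_y = np.random.uniform(all_y.min(), all_y.max(), 2)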

In [5]:
plt.scatter(first_cluster_x, first_cluster_y);
plt.scatter(second_cluster_x, second_cluster_y);
plt.scatter(third_cluster_x, third_cluster_y);
plt.scatter(centroid_x, centroid_y, s=400,marker=r'$\clubsuit$');
plt.legend(["Cluster 1", "Cluster 2", "Centroids"]);

OK, so we have two leprechauns thrown on the canvas, and the algorithm can use their location to calculate the distance to each data point in the space. As a reminder, here are the equations for the most typical distance functions:

\begin{equation*} \left (Euclidean \right) = \left (\sqrt{\sum \limits _{i=1} ^{k }{(x_i - y_i)}^2}\right) \end{equation*}
\begin{equation*} \left (Manhattan \right) = \left (\sum \limits _{i=1} ^{k }{|x_i - y_i|}\right) \end{equation*}
\begin{equation*} \left (Minkowski \right) = \left (\sum \limits _{i=1} ^{k }{(|x_i - y_i|)^q}\right)^{1/q} \end{equation*}
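As a quick illustration of these formulas with NumPy, for two small example vectors (the values and $q$ = 3 are arbitrary):

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 5.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))           # 3.0
manhattan = np.sum(np.abs(x - y))                   # 5.0
q = 3
minkowski = np.sum(np.abs(x - y) ** q) ** (1 / q)   # ~2.57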

Each data point is then associated with its nearest cluster center.

That marks the end of the first iteration. Now all the dots are "colored" if you will, meaning that they all belong to a cluster.

Once that is done, the algorithm takes all the data points associated with each cluster and recalculates that cluster's center of mass. This means that the centroids will slightly move based on the data points that have been associated with them.

Now that you have new centroids, you can recalculate the distances to all $N$ data points and see whether some data points switched clusters. You repeat the operation until the cluster assignments are relatively stable (that is, until you reach convergence), and the result will be your final clusters.
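To make those two alternating steps concrete, here is a minimal from-scratch sketch on the toy 2D data above, assuming $K$ = 3, simple random initialization and a fixed number of iterations (the variable names are illustrative; in practice you would use scikit-learn's KMeans, shown later):

# Stack the toy coordinates into an (N, 2) matrix (same as in the scaling sketch above)
X_demo = np.column_stack([np.concatenate([first_cluster_x, second_cluster_x, third_cluster_x]),
                          np.concatenate([first_cluster_y, second_cluster_y, third_cluster_y])])

K = 3
rng = np.random.default_rng(0)
centroids = X_demo[rng.choice(len(X_demo), size=K, replace=False)]  # random initial centroids

for _ in range(10):  # fixed number of iterations, for illustration only
    # Assignment step: associate each point with its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X_demo[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each centroid to the center of mass of its assigned points
    # (empty-cluster handling omitted for brevity)
    centroids = np.array([X_demo[labels == k].mean(axis=0) for k in range(K)])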

Model optimization

Given the number of clusters $K$ defined by the user, the algorithm minimizes the error until convergence.

However, while our initial choice of $K$ made sense when looking at the data visually, there might be a better solution involving more clusters. To visualize these possibilities, one can use a scree plot.

Summing up all the squared distances from each data point to its nearest cluster center provides a measure of error (the inertia).

For $K$ = 1, the error is at its highest, since there is a single center from which all distances are measured.

The error decreases as $K$ increases, and reaches 0 when $K$ = $N$ (the number of data points), since you would then have one cluster per data point.

The scree plot displays this error plotted against the number of clusters $K$.

In [21]:
# x_value_* and y_value_* are illustrative arrays (defined in a cell not shown here)
# that sketch the two typical shapes of a scree plot
plt.plot(x_value_elbow, y_value_elbow, marker='x')
plt.plot(x_value_gradual, y_value_gradual, marker='o')
plt.legend(["Elbow example", "Gradual example"])
plt.xlabel("Number of clusters K")
plt.ylabel("Error")
plt.yticks([]);  # hide the y-axis values, since the error values are only illustrative

As can be seen in the plot above, you can have different versions of the scree plot.

In the elbow example, you see that the error decreases sharply with increasing $K$, but the slope attenuates after $K$ = 3. In such cases, you observe a so-called elbow (this approach is sometimes referred to as the elbow method) and the optimal number of clusters here should be $K$ = 3.

In the gradual example, the error steadily decreases as you increase the number of clusters, but there is no clear breaking point. Hence, it will be more difficult to find an optimal number of clusters. In those cases, I recommend using common sense (but do so in the other example as well :D) and visualizing your data thoroughly. Often, that will already give you a good idea of the possible clusters.

There are other methods that could help you determine the optimal cluster number, one of which I will explore below.

Practical Demonstration

I downloaded a dataset containing information about Twitter users. Let's see whether we can cluster the data based on a few data features using K Means. You can find the dataset in my repo, in the file "twitter_users.csv".

In [100]:
import os
os.chdir("/home/julien/website/datasets/")
In [101]:
import pandas as pd
In [102]:
df = pd.read_csv("twitter_users.csv")
In [103]:
df
Out[103]:
user op co ex ag ne wordcount category
0 3gerardpique 34.297953 28.148819 41.948819 29.370315 9.841575 37.0945 7
1 aguerosergiokun 44.986842 20.525865 37.938947 24.279098 10.362406 78.7970 7
2 albertochicote 41.733854 13.745417 38.999896 34.645521 8.836979 49.2604 4
3 AlejandroSanz 40.377154 15.377462 52.337538 31.082154 5.032231 80.4538 2
4 alfredocasero1 36.664677 19.642258 48.530806 31.138871 7.305968 47.0645 4
... ... ... ... ... ... ... ... ...
135 XabiAlonso 35.569389 22.133740 38.904885 31.624351 12.201221 47.5420 7
136 XaviRodriguez 31.960417 15.416979 48.533125 40.096458 11.764583 47.5625 4
137 xoellopez 71.696129 12.489274 27.571210 19.093548 3.241935 74.3065 2
138 ZacEfron 51.537405 26.009695 36.465344 23.008168 7.284962 118.6107 1
139 _josecoronado_ 36.258913 18.769348 45.225652 39.427283 9.252065 113.7391 1

140 rows × 8 columns

The file contains 8 columns per Twitter user, summing up behavioral tendencies and a few other features. For instance, the category feature describes the type of job.

Let's try to use KMeans to group individuals based on their jobs

In [133]:
pd.plotting.scatter_matrix(df);

Let's estimate the optimal number of clusters

In [168]:
from sklearn.cluster import KMeans
In [263]:
# Drop the user name and the category feature
X = df.drop(columns={"wordcount","category", "user"}).values
In [264]:
# Define number of clusters to test
nClusters = range(1, 10)
# Load KMeans
kmeans = [KMeans(n_clusters=i) for i in nClusters]
# Compute score for each K (KMeans.score returns the negative inertia, so values rise towards 0 as K grows)
score = [kmeans[i].fit(X).score(X) for i in range(len(kmeans))]
# Plot curve
plt.plot(nClusters,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()

Here is a case where the elbow is not very salient, and it is a bit more complicated to detect the optimal number of clusters. Nonetheless, I would eyeball the optimal number of clusters to be $K$ = 4.

Note that there are other ways one could use to select the optimal number of clusters. For example, a silhouette analysis can be used to measure the separation (using a distance function) between the clusters formed for the chosen number of clusters $K$.

This measure has a range of [-1, 1].

Silhouette coefficients close to +1 represent data that is far away from the neighboring clusters.

Values around 0 represent data that is very close to the decision boundary between two neighboring clusters.

Negative values indicate that those samples might have been assigned to the wrong cluster.

I will be using the silhouette analysis below while keeping the number of clusters at 4, as suggested by the elbow analysis.

In [268]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

n_clusters = 4

X = df.drop(columns={"wordcount","category", "user"}).values
y = df.category
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.7)

pipe = Pipeline([('scaler', StandardScaler()),
                  ('kmean', KMeans(n_clusters=n_clusters, init='k-means++'))
                 ]
                )
      

cluster_labels = pipe.fit_predict(X_train)

from sklearn.metrics import silhouette_score, silhouette_samples


# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed clusters.
# Note: the cluster labels come from the scaled data (via the pipeline), while the
# silhouette is computed here on the unscaled features.
silhouette_avg = silhouette_score(X_train, cluster_labels)
print("For n_clusters =", n_clusters,
      "The average silhouette_score is :", silhouette_avg.round(decimals=2))

# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(X_train, cluster_labels)
For n_clusters = 4 The average silhouette_score is : 0.26

It is particularly difficult to assess the accuracy of unsupervised clustering algorithms, since they work by definition on unlabelled data. If you are not sure what you are looking for, it's difficult to measure the success of the results. In these cases (but also in all other situations!), it makes real sense to talk to stakeholders (including non-technical ones) and try to gather as much information as possible about what could be expected. Knowledge from more experienced users will be valuable to identify whether your approach makes sense.

On top of gleaning information beforehand, it also makes sense to plot the data to observe your clusters.

If you are working with low-dimensional data, one good idea might be to plot it to visualize the result of the algorithm. That won't be possible if you work in a high-dimensional space (since the visualization will become messy), but it could help you notice aspects of the data unknown to you so far.

Below is an example where I plot three data features together with the cluster centers.

In [165]:
from mpl_toolkits.mplot3d import Axes3D
plt.rcParams['figure.figsize'] = (16, 9)
In [270]:
# Fit model
kmeans = KMeans(n_clusters=n_clusters).fit(X_train)
# Predicting the clusters
labels = kmeans.predict(X_train)
# Getting the cluster centers
clust = kmeans.cluster_centers_
In [280]:
# Define colors
colors=['red','green','blue','cyan']
assign=[]
for row in labels:
    assign.append(colors[row])

## Plot
# Define the columns you want to visualize
x_axis = 0
y_axis = 1
z_axis = 3
# Make plot
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X_train[:, x_axis], X_train[:, y_axis], X_train[:, z_axis], c=assign, s=60)
ax.scatter(clust[:, x_axis], clust[:, y_axis], clust[:, z_axis], marker='*', c=colors, s=1000)
# Get labels
ax.set_xlabel(df.columns[x_axis+1])
ax.set_ylabel(df.columns[y_axis+1])
ax.set_zlabel(df.columns[z_axis+1])
Out[280]:
Text(0.5, 0, 'ag')

In that case, one might argue that the two middle clusters are not very informative since they overlap quite a bit. I typically use both the elbow method and the silhouette analysis to determine the optimal number of clusters, together with prior business knowledge, and of course, common sense ;)
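To wrap up, here is a small sketch of how one could look at both criteria side by side, looping over candidate values of $K$ on the scaled training data (the loop range and random_state are arbitrary choices for illustration):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Scale the training features, as done in the pipeline above
X_scaled = StandardScaler().fit_transform(X_train)

for k in range(2, 8):  # silhouette is undefined for K = 1
    km = KMeans(n_clusters=k, init='k-means++', random_state=42).fit(X_scaled)
    sil = silhouette_score(X_scaled, km.labels_)
    print(f"K = {k}: inertia = {km.inertia_:.1f}, silhouette = {sil:.2f}")

A low inertia together with a relatively high silhouette score, combined with business knowledge, usually points towards a sensible choice of $K$.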