K-Means Clustering in Python: A Practical Guide (original) (raw)

The _k_-means clustering method is an unsupervised machine learning technique used to identify clusters of data objects in a dataset. There are many different types of clustering methods, but _k_-means is one of the oldest and most approachable. These traits make implementing _k_-means clustering in Python reasonably straightforward, even for novice programmers and data scientists.

If you’re interested in learning how and when to implement _k_-means clustering in Python, then this is the right place. You’ll walk through an end-to-end example of _k_-means clustering using Python, from preprocessing the data to evaluating results.

In this tutorial, you’ll learn:

What _k_-means clustering is
When to use _k_-means clustering to analyze your data
How to implement _k_-means clustering in Python with scikit-learn
How to select a meaningful number of clusters

Click the link below to download the code you’ll use to follow along with the examples in this tutorial and implement your own _k_-means clustering pipeline:

What Is Clustering?

Clustering is a set of techniques used to partition data into groups, or clusters. Clusters are loosely defined as groups of data objects that are more similar to other objects in their cluster than they are to data objects in other clusters. In practice, clustering helps identify two qualities of data:

Meaningfulness
Usefulness

Meaningful clusters expand domain knowledge. For example, in the medical field, researchers applied clustering to gene expression experiments. The clustering results identified groups of patients who respond differently to medical treatments.

Useful clusters, on the other hand, serve as an intermediate step in a data pipeline. For example, businesses use clustering for customer segmentation. The clustering results segment customers into groups with similar purchase histories, which businesses can then use to create targeted advertising campaigns.

There are many other applications of clustering, such as document clustering and social network analysis. These applications are relevant in nearly every industry, making clustering a valuable skill for professionals working with data in any field.

Overview of Clustering Techniques

You can perform clustering using many different approaches—so many, in fact, that there are entire categories of clustering algorithms. Each of these categories has its own unique strengths and weaknesses. This means that certain clustering algorithms will result in more natural cluster assignments depending on the input data.

Selecting an appropriate clustering algorithm for your dataset is often difficult due to the number of choices available. Some important factors that affect this decision include the characteristics of the clusters, the features of the dataset, the number of outliers, and the number of data objects.

You’ll explore how these factors help determine which approach is most appropriate by looking at three popular categories of clustering algorithms:

Partitional clustering
Hierarchical clustering
Density-based clustering

It’s worth reviewing these categories at a high level before jumping right into _k_-means. You’ll learn the strengths and weaknesses of each category to provide context for how _k_-means fits into the landscape of clustering algorithms.

Partitional Clustering

Partitional clustering divides data objects into nonoverlapping groups. In other words, no object can be a member of more than one cluster, and every cluster must have at least one object.

These techniques require the user to specify the number of clusters, indicated by the variable k. Many partitional clustering algorithms work through an iterative process to assign subsets of data points into k clusters. Two examples of partitional clustering algorithms are _k_-means and _k_-medoids.

These algorithms are both nondeterministic, meaning they could produce different results from two separate runs even if the runs were based on the same input.

Partitional clustering methods have several strengths:

They work well when clusters have a spherical shape.
They’re scalable with respect to algorithm complexity.

They also have several weaknesses:

They’re not well suited for clusters with complex shapes and different sizes.
They break down when used with clusters of different densities.

Hierarchical Clustering

Hierarchical clustering determines cluster assignments by building a hierarchy. This is implemented by either a bottom-up or a top-down approach:

Agglomerative clustering is the bottom-up approach. It merges the two points that are the most similar until all points have been merged into a single cluster.
Divisive clustering is the top-down approach. It starts with all points as one cluster and splits the least similar clusters at each step until only single data points remain.

These methods produce a tree-based hierarchy of points called a dendrogram. Similar to partitional clustering, in hierarchical clustering the number of clusters (k) is often predetermined by the user. Clusters are assigned by cutting the dendrogram at a specified depth that results in k groups of smaller dendrograms.

Unlike many partitional clustering techniques, hierarchical clustering is a deterministic process, meaning cluster assignments won’t change when you run an algorithm twice on the same input data.

The strengths of hierarchical clustering methods include the following:

They often reveal the finer details about the relationships between data objects.
They provide an interpretable dendrogram.

The weaknesses of hierarchical clustering methods include the following:

They’re computationally expensive with respect to algorithm complexity.
They’re sensitive to noise and outliers.

Density-Based Clustering

Density-based clustering determines cluster assignments based on the density of data points in a region. Clusters are assigned where there are high densities of data points separated by low-density regions.

Unlike the other clustering categories, this approach doesn’t require the user to specify the number of clusters. Instead, there is a distance-based parameter that acts as a tunable threshold. This threshold determines how close points must be to be considered a cluster member.

Examples of density-based clustering algorithms include Density-Based Spatial Clustering of Applications with Noise, or DBSCAN, and Ordering Points To Identify the Clustering Structure, or OPTICS.

The strengths of density-based clustering methods include the following:

They excel at identifying clusters of nonspherical shapes.
They’re resistant to outliers.

The weaknesses of density-based clustering methods include the following:

They aren’t well suited for clustering in high-dimensional spaces.
They have trouble identifying clusters of varying densities.

How to Perform K-Means Clustering in Python

In this section, you’ll take a step-by-step tour of the conventional version of the _k_-means algorithm. Understanding the details of the algorithm is a fundamental step in the process of writing your _k_-means clustering pipeline in Python. What you learn in this section will help you decide if _k_-means is the right choice to solve your clustering problem.

Understanding the K-Means Algorithm

Conventional _k_-means requires only a few steps. The first step is to randomly select k centroids, where k is equal to the number of clusters you choose. Centroids are data points representing the center of a cluster.

The main element of the algorithm works by a two-step process called expectation-maximization. The expectation step assigns each data point to its nearest centroid. Then, the maximization step computes the mean of all the points for each cluster and sets the new centroid. Here’s what the conventional version of the _k_-means algorithm looks like:

The quality of the cluster assignments is determined by computing the sum of the squared error (SSE) after the centroids converge, or match the previous iteration’s assignment. The SSE is defined as the sum of the squared Euclidean distances of each point to its closest centroid. Since this is a measure of error, the objective of _k_-means is to try to minimize this value.

The figure below shows the centroids and SSE updating through the first five iterations from two different runs of the _k_-means algorithm on the same dataset:

The purpose of this figure is to show that the initialization of the centroids is an important step. It also highlights the use of SSE as a measure of clustering performance. After choosing a number of clusters and the initial centroids, the expectation-maximization step is repeated until the centroid positions reach convergence and are unchanged.

The random initialization step causes the _k_-means algorithm to be nondeterministic, meaning that cluster assignments will vary if you run the same algorithm twice on the same dataset. Researchers commonly run several initializations of the entire _k_-means algorithm and choose the cluster assignments from the initialization with the lowest SSE.

Writing Your First K-Means Clustering Code in Python

Thankfully, there’s a robust implementation of _k_-means clustering in Python from the popular machine learning package scikit-learn. You’ll learn how to write a practical implementation of the _k_-means algorithm using the scikit-learn version of the algorithm.

The code in this tutorial requires some popular external Python packages and assumes that you’ve installed Python with Anaconda. For more information on setting up your Python environment for machine learning in Windows, read through Setting Up Python for Machine Learning on Windows.

Otherwise, you can begin by installing the required packages:

The code is presented so that you can follow along in an ipython console or Jupyter Notebook. Click the prompt (>>>) at the top right of each code block to see the code formatted for copy-paste. You can also download the source code used in this article by clicking on the link below:

This step will import the modules needed for all the code in this section:

You can generate the data from the above GIF using make_blobs(), a convenience function in scikit-learn used to generate synthetic clusters. make_blobs() uses these parameters:

n_samples is the total number of samples to generate.
centers is the number of centers to generate.
cluster_std is the standard deviation.

make_blobs() returns a tuple of two values:

A two-dimensional NumPy array with the x- and y-values for each of the samples
A one-dimensional NumPy array containing the cluster labels for each sample

Generate the synthetic data and labels:

Nondeterministic machine learning algorithms like _k_-means are difficult to reproduce. The random_state parameter is set to an integer value so you can follow the data presented in the tutorial. In practice, it’s best to leave random_state as the default value, None.

Here’s a look at the first five elements for each of the variables returned by make_blobs():

Data sets usually contain numerical features that have been measured in different units, such as height (in inches) and weight (in pounds). A machine learning algorithm would consider weight more important than height only because the values for weight are larger and have higher variability from person to person.

Machine learning algorithms need to consider all features on an even playing field. That means the values for all features must be transformed to the same scale.

The process of transforming numerical features to use the same scale is known as feature scaling. It’s an important data preprocessing step for most distance-based machine learning algorithms because it can have a significant impact on the performance of your algorithm.

There are several approaches to implementing feature scaling. A great way to determine which technique is appropriate for your dataset is to read scikit-learn’s preprocessing documentation.

In this example, you’ll use the StandardScaler class. This class implements a type of feature scaling called standardization. Standardization scales, or shifts, the values for each numerical feature in your dataset so that the features have a mean of 0 and standard deviation of 1:

Take a look at how the values have been scaled in scaled_features:

Now the data are ready to be clustered. The KMeans estimator class in scikit-learn is where you set the algorithm parameters before fitting the estimator to the data. The scikit-learn implementation is flexible, providing several parameters that can be tuned.

Here are the parameters used in this example:

init controls the initialization technique. The standard version of the _k_-means algorithm is implemented by setting init to "random". Setting this to "k-means++" employs an advanced trick to speed up convergence, which you’ll use later.
n_clusters sets k for the clustering step. This is the most important parameter for _k_-means.
n_init sets the number of initializations to perform. This is important because two runs can converge on different cluster assignments. The default behavior for the scikit-learn algorithm is to perform ten _k_-means runs and return the results of the one with the lowest SSE.
max_iter sets the number of maximum iterations for each initialization of the _k_-means algorithm.

Instantiate the KMeans class with the following arguments:

The parameter names match the language that was used to describe the _k_-means algorithm earlier in the tutorial. Now that the _k_-means class is ready, the next step is to fit it to the data in scaled_features. This will perform ten runs of the _k_-means algorithm on your data with a maximum of 300 iterations per run:

Statistics from the initialization run with the lowest SSE are available as attributes of kmeans after calling .fit():

Finally, the cluster assignments are stored as a one-dimensional NumPy array in kmeans.labels_. Here’s a look at the first five predicted labels:

Note that the order of the cluster labels for the first two data objects was flipped. The order was [1, 0] in true_labels but [0, 1] in kmeans.labels_ even though those data objects are still members of their original clusters in kmeans.lables_.

This behavior is normal, as the ordering of cluster labels is dependent on the initialization. Cluster 0 from the first run could be labeled cluster 1 in the second run and vice versa. This doesn’t affect clustering evaluation metrics.

Choosing the Appropriate Number of Clusters

In this section, you’ll look at two methods that are commonly used to evaluate the appropriate number of clusters:

The elbow method
The silhouette coefficient

These are often used as complementary evaluation techniques rather than one being preferred over the other. To perform the elbow method, run several _k_-means, increment k with each iteration, and record the SSE:

The previous code block made use of Python’s dictionary unpacking operator (**). To learn more about this powerful Python operator, check out How to Iterate Through a Dictionary in Python.

When you plot SSE as a function of the number of clusters, notice that SSE continues to decrease as you increase k. As more centroids are added, the distance from each point to its closest centroid will decrease.

There’s a sweet spot where the SSE curve starts to bend known as the elbow point. The x-value of this point is thought to be a reasonable trade-off between error and number of clusters. In this example, the elbow is located at x=3:

The above code produces the following plot:

Determining the elbow point in the SSE curve isn’t always straightforward. If you’re having trouble choosing the elbow point of the curve, then you could use a Python package, kneed, to identify the elbow point programmatically:

The silhouette coefficient is a measure of cluster cohesion and separation. It quantifies how well a data point fits into its assigned cluster based on two factors:

How close the data point is to other points in the cluster
How far away the data point is from points in other clusters

Silhouette coefficient values range between -1 and 1. Larger numbers indicate that samples are closer to their clusters than they are to other clusters.

In the scikit-learn implementation of the silhouette coefficient, the average silhouette coefficient of all the samples is summarized into one score. The silhouette score() function needs a minimum of two clusters, or it will raise an exception.

Loop through values of k again. This time, instead of computing SSE, compute the silhouette coefficient:

Plotting the average silhouette scores for each k shows that the best choice for k is 3 since it has the maximum score:

The above code produces the following plot:

Ultimately, your decision on the number of clusters to use should be guided by a combination of domain knowledge and clustering evaluation metrics.

Evaluating Clustering Performance Using Advanced Techniques

The elbow method and silhouette coefficient evaluate clustering performance without the use of ground truth labels. Ground truth labels categorize data points into groups based on assignment by a human or an existing algorithm. These types of metrics do their best to suggest the correct number of clusters but can be deceiving when used without context.

When comparing _k_-means against a density-based approach on nonspherical clusters, the results from the elbow method and silhouette coefficient rarely match human intuition. This scenario highlights why advanced clustering evaluation techniques are necessary. To visualize an example, import these additional modules:

This time, use make_moons() to generate synthetic data in the shape of crescents:

Fit both a _k_-means and a DBSCAN algorithm to the new data and visually assess the performance by plotting the cluster assignments with Matplotlib:

Print the silhouette coefficient for each of the two algorithms and compare them. A higher silhouette coefficient suggests better clusters, which is misleading in this scenario:

The silhouette coefficient is higher for the _k_-means algorithm. The DBSCAN algorithm appears to find more natural clusters according to the shape of the data:

This suggests that you need a better method to compare the performance of these two clustering algorithms.

If you’re interested, you can find the code for the above plot by expanding the box below.

To learn more about plotting with Matplotlib and Python, check out Python Plotting with Matplotlib (Guide). Here’s how you can plot the comparison of the two algorithms in the crescent moons example:

Since the ground truth labels are known, it’s possible to use a clustering metric that considers labels in its evaluation. You can use the scikit-learn implementation of a common metric called the adjusted rand index (ARI). Unlike the silhouette coefficient, the ARI uses true cluster assignments to measure the similarity between true and predicted labels.

Compare the clustering results of DBSCAN and _k_-means using ARI as the performance metric:

The ARI output values range between -1 and 1. A score close to 0.0 indicates random assignments, and a score close to 1 indicates perfectly labeled clusters.

Based on the above output, you can see that the silhouette coefficient was misleading. ARI shows that DBSCAN is the best choice for the synthetic crescents example as compared to _k_-means.

There are several metrics that evaluate the quality of clustering algorithms. Reading through the implementations in scikit-learn will help you select an appropriate clustering evaluation metric.

How to Build a K-Means Clustering Pipeline in Python

Now that you have a basic understanding of _k_-means clustering in Python, it’s time to perform _k_-means clustering on a real-world dataset. These data contain gene expression values from a manuscript authored by The Cancer Genome Atlas (TCGA) Pan-Cancer analysis project investigators.

There are 881 samples (rows) representing five distinct cancer subtypes. Each sample has gene expression values for 20,531 genes (columns). The dataset is available from the UC Irvine Machine Learning Repository, but you can use the Python code below to obtain the data programmatically.

To follow along with the examples below, you can download the source code by clicking on the following link:

In this section, you’ll build a robust _k_-means clustering pipeline. Since you’ll perform multiple transformations of the original input data, your pipeline will also serve as a practical clustering framework.

Building a K-Means Clustering Pipeline

Assuming you want to start with a fresh namespace, import all the modules needed to build and evaluate the pipeline, including pandas and seaborn for more advanced visualizations:

Download and extract the TCGA dataset from UCI:

After the download and extraction is completed, you should have a directory that looks like this:

TCGA-PANCAN-HiSeq-801x20531/ | ├── data.csv └── labels.csv

The KMeans class in scikit-learn requires a NumPy array as an argument. The NumPy package has a helper function to load the data from the text file into memory as NumPy arrays:

Check out the first three columns of data for the first five samples as well as the labels for the first five samples:

The data variable contains all the gene expression values from 20,531 genes. The true_label_names are the cancer types for each of the 881 samples. The first record in data corresponds with the first label in true_labels.

The labels are strings containing abbreviations of cancer types:

BRCA: Breast invasive carcinoma
COAD: Colon adenocarcinoma
KIRC: Kidney renal clear cell carcinoma
LUAD: Lung adenocarcinoma
PRAD: Prostate adenocarcinoma

To use these labels in the evaluation methods, you first need to convert the abbreviations to integers with LabelEncoder:

Since the label_encoder has been fitted to the data, you can see the unique classes represented using .classes_. Store the length of the array to the variable n_clusters for later use:

In practical machine learning pipelines, it’s common for the data to undergo multiple sequences of transformations before it feeds into a clustering algorithm. You learned about the importance of one of these transformation steps, feature scaling, earlier in this tutorial. An equally important data transformation technique is dimensionality reduction, which reduces the number of features in the dataset by either removing or combining them.

Dimensionality reduction techniques help to address a problem with machine learning algorithms known as the curse of dimensionality. In short, as the number of features increases, the feature space becomes sparse. This sparsity makes it difficult for algorithms to find data objects near one another in higher-dimensional space. Since the gene expression dataset has over 20,000 features, it qualifies as a great candidate for dimensionality reduction.

Principal Component Analysis (PCA) is one of many dimensionality reduction techniques. PCA transforms the input data by projecting it into a lower number of dimensions called components. The components capture the variability of the input data through a linear combination of the input data’s features.

The next code block introduces you to the concept of scikit-learn pipelines. The scikit-learn Pipeline class is a concrete implementation of the abstract idea of a machine learning pipeline.

Your gene expression data aren’t in the optimal format for the KMeans class, so you’ll need to build a preprocessing pipeline. The pipeline will implement an alternative to the StandardScaler class called MinMaxScaler for feature scaling. You use MinMaxScaler when you do not assume that the shape of all your features follows a normal distribution.

The next step in your preprocessing pipeline will implement the PCA class to perform dimensionality reduction:

Now that you’ve built a pipeline to process the data, you’ll build a separate pipeline to perform _k_-means clustering. You’ll override the following default arguments of the KMeans class:

init: You’ll use "k-means++" instead of "random" to ensure centroids are initialized with some distance between them. In most cases, this will be an improvement over "random".
n_init: You’ll increase the number of initializations to ensure you find a stable solution.
max_iter: You’ll increase the number of iterations per initialization to ensure that _k_-means will converge.

Build the _k_-means clustering pipeline with user-defined arguments in the KMeans constructor:

The Pipeline class can be chained to form a larger pipeline. Build an end-to-end _k_-means clustering pipeline by passing the "preprocessor" and "clusterer" pipelines to Pipeline:

Calling .fit() with data as the argument performs all the pipeline steps on the data:

The pipeline performs all the necessary steps to execute _k_-means clustering on the gene expression data! Depending on your Python REPL, .fit() may print a summary of the pipeline. Objects defined inside pipelines are accessible using their step name.

Evaluate the performance by calculating the silhouette coefficient:

Calculate ARI, too, since the ground truth cluster labels are available:

As mentioned earlier, the scale for each of these clustering performance metrics ranges from -1 to 1. A silhouette coefficient of 0 indicates that clusters are significantly overlapping one another, and a silhouette coefficient of 1 indicates clusters are well-separated. An ARI score of 0 indicates that cluster labels are randomly assigned, and an ARI score of 1 means that the true labels and predicted labels form identical clusters.

Since you specified n_components=2 in the PCA step of the _k_-means clustering pipeline, you can also visualize the data in the context of the true labels and predicted labels. Plot the results using a pandas DataFrame and the seaborn plotting library:

Here’s what the plot looks like:

The visual representation of the clusters confirms the results of the two clustering evaluation metrics. The performance of your pipeline was pretty good. The clusters only slightly overlapped, and cluster assignments were much better than random.

Tuning a K-Means Clustering Pipeline

Your first _k_-means clustering pipeline performed well, but there’s still room to improve. That’s why you went through the trouble of building the pipeline: you can tune the parameters to get the most desirable clustering results.

The process of parameter tuning consists of sequentially altering one of the input values of the algorithm’s parameters and recording the results. At the end of the parameter tuning process, you’ll have a set of performance scores, one for each new value of a given parameter. Parameter tuning is a powerful method to maximize performance from your clustering pipeline.

By setting the PCA parameter n_components=2, you squished all the features into two components, or dimensions. This value was convenient for visualization on a two-dimensional plot. But using only two components means that the PCA step won’t capture all of the explained variance of the input data.

Explained variance measures the discrepancy between the PCA-transformed data and the actual input data. The relationship between n_components and explained variance can be visualized in a plot to show you how many components you need in your PCA to capture a certain percentage of the variance in the input data. You can also use clustering performance metrics to evaluate how many components are necessary to achieve satisfactory clustering results.

In this example, you’ll use clustering performance metrics to identify the appropriate number of components in the PCA step. The Pipeline class is powerful in this situation. It allows you to perform basic parameter tuning using a for loop.

Iterate over a range of n_components and record evaluation metrics for each iteration:

Plot the evaluation metrics as a function of n_components to visualize the relationship between adding components and the performance of the _k_-means clustering results:

The above code generates the a plot showing performance metrics as a function of n_components:

There are two takeaways from this figure:

The silhouette coefficient decreases linearly. The silhouette coefficient depends on the distance between points, so as the number of dimensions increases, the sparsity increases.
The ARI improves significantly as you add components. It appears to start tapering off after n_components=7, so that would be the value to use for presenting the best clustering results from this pipeline.

Like most machine learning decisions, you must balance optimizing clustering evaluation metrics with the goal of the clustering task. In situations when cluster labels are available, as is the case with the cancer dataset used in this tutorial, ARI is a reasonable choice. ARI quantifies how accurately your pipeline was able to reassign the cluster labels.

The silhouette coefficient, on the other hand, is a good choice for exploratory clustering because it helps to identify subclusters. These subclusters warrant additional investigation, which can lead to new and important insights.

Conclusion

You now know how to perform _k_-means clustering in Python. Your final _k_-means clustering pipeline was able to cluster patients with different cancer types using real-world gene expression data. You can use the techniques you learned here to cluster your own data, understand how to get the best clustering results, and share insights with others.

In this tutorial, you learned:

What the popular clustering techniques are and when to use them
What the _k_-means algorithm is
How to implement _k_-means clustering in Python
How to evaluate the performance of clustering algorithms
How to build and tune a robust _k_-means clustering pipeline in Python
How to analyze and present clustering results from the _k_-means algorithm

You also took a whirlwind tour of scikit-learn, an accessible and extensible tool for implementing _k_-means clustering in Python. If you’d like to reproduce the examples you saw above, then be sure to download the source code by clicking on the following link:

You’re now ready to perform _k_-means clustering on datasets you find interesting. Be sure to share your results in the comments below!