Gap Statistics for the Optimal Number of Clusters
To determine the optimal number of clusters in a dataset, we can use the Gap Statistic. It compares the performance of a clustering algorithm against a null reference distribution of the data, allowing a more objective decision about the number of clusters.
Let's explore the Gap Statistic in more detail and see how it can help us determine the optimal number of clusters for our data.
What is the Gap Statistic?
In K-means, choosing the right number of clusters is important because it can greatly impact the results. The Gap Statistic helps solve this problem by comparing how much the data within each cluster varies for different numbers of clusters (k).
The Gap Statistic compares how well the clusters formed from your actual data stand out against what you would expect if the data were randomly distributed. In other words, it looks at how tightly packed your real clusters are compared to clusters created from random data. This helps us understand whether the patterns we see in our data are meaningful or just due to chance.
Why Do We Use the Gap Statistic?
The Gap Statistic provides a more objective basis for determining the optimal number of clusters compared to subjective methods like trial-and-error or guesswork. By statistically assessing how well-defined your clusters are against random distributions, you can make informed decisions that enhance model performance.
How Does the Gap Statistic Work?
- Clustering the Data: Perform clustering on the actual data using a clustering algorithm, such as k-means, for a range of potential cluster numbers (k).
- Create Reference Data: Generate a set of reference datasets by simulating random data with the same distribution and dimensionality as the original dataset. These datasets are assumed to contain no inherent structure (i.e., they are uniformly distributed).
- Compute Clustering Performance: For both the actual data and the reference datasets, compute the clustering performance using a cost function, typically the within-cluster sum of squared errors (WSS), also known as inertia in k-means clustering. This measures the compactness of clusters, where lower values indicate better clustering.
- Gap Statistic Calculation: The Gap Statistic for each number of clusters k is computed as:
Gap(k) = \frac{1}{B} \sum_{b=1}^{B} \log(W_k^b) - \log(W_k)
Where: W_k is the WSS for the actual data with k clusters, W_k^b is the WSS for the b-th reference dataset with k clusters, and B is the number of reference datasets used.
- Selecting the Optimal Number of Clusters: The optimal number of clusters is chosen as the smallest value of k such that:
Gap(k) \geq Gap(k+1) - \text{SE}(k+1)
Where \text{SE}(k+1) is the standard error of the Gap Statistic for k+1 clusters.
This condition ensures that the chosen k provides a significant improvement over the random configuration of data.
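To make the selection rule concrete, here is a minimal sketch of how it could be applied in Python. The gaps and se arrays below are hypothetical placeholders for Gap(k) and SE(k) values computed elsewhere; they are not taken from a real run.
# Sketch: apply the standard-error selection rule (hypothetical inputs)
import numpy as np
def select_optimal_k(gaps, se):
    # gaps[i] and se[i] hold Gap(k) and SE(k) for k = i + 1
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - se[i + 1]:
            return i + 1  # convert the 0-based index to a cluster count
    return len(gaps)  # fall back to the largest k evaluated
# Hypothetical values for illustration only
gaps = np.array([0.10, 0.45, 0.62, 0.63, 0.64])
se = np.array([0.05, 0.05, 0.05, 0.05, 0.05])
print(select_optimal_k(gaps, se))  # -> 3, since Gap(3) >= Gap(4) - SE(4)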
Optimizing the Number of Clusters in K-Means Clustering Using the Gap Statistic
Step 1: Generate Sample Data
The first step is to generate sample data for clustering. We'll use the make_blobs function from sklearn.datasets to create a synthetic dataset with 300 samples and 4 centers.
# Generate Sample Data
from sklearn.datasets import make_blobs
# Generate a synthetic dataset with 4 centers
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
Step 2: Define the Gap Statistic Calculation Function
Next, we define a function to compute the Gap Statistic for a range of cluster numbers. The Gap Statistic compares the clustering result to random data (uniform distribution).
# Define the Gap Statistic Calculation Function
import numpy as np
from sklearn.cluster import KMeans

def compute_gap_statistic(X, k_max, n_replicates=10):
    """
    Compute the Gap Statistic for a range of cluster numbers.

    Parameters:
    X: array-like, shape (n_samples, n_features)
        The input data.
    k_max: int
        The maximum number of clusters to evaluate.
    n_replicates: int
        The number of reference datasets to generate.

    Returns:
    gap_values: list
        The calculated gap values for each k.
    """
    # Generate reference data from a uniform distribution over
    # the bounding box of the original data
    def generate_reference_data(X):
        return np.random.uniform(low=X.min(axis=0), high=X.max(axis=0), size=X.shape)

    gap_values = []
    # Loop over a range of k values (1 to k_max)
    for k in range(1, k_max + 1):
        # Fit KMeans to the original data
        kmeans = KMeans(n_clusters=k)
        kmeans.fit(X)
        original_inertia = kmeans.inertia_
        # Compute the average inertia for the reference datasets
        reference_inertia = []
        for _ in range(n_replicates):
            random_data = generate_reference_data(X)
            kmeans.fit(random_data)
            reference_inertia.append(kmeans.inertia_)
        # Gap = log(mean reference inertia) - log(actual inertia)
        gap = np.log(np.mean(reference_inertia)) - np.log(original_inertia)
        gap_values.append(gap)
    return gap_values
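The function above returns only the gap values. The selection rule described earlier also needs the standard error SE(k), which Tibshirani et al. define as the standard deviation of \log(W_k^b) across the reference datasets scaled by \sqrt{1 + 1/B}. Below is a minimal sketch of one possible extension; compute_gap_with_se is a name introduced here for illustration, not a scikit-learn API.
# Sketch: gap values together with their standard errors
import numpy as np
from sklearn.cluster import KMeans
def compute_gap_with_se(X, k_max, n_replicates=10):
    rng = np.random.default_rng(0)
    gaps, ses = [], []
    for k in range(1, k_max + 1):
        kmeans = KMeans(n_clusters=k)
        original_log_inertia = np.log(kmeans.fit(X).inertia_)
        # Log-inertia of each uniformly distributed reference dataset
        ref_log_inertias = []
        for _ in range(n_replicates):
            ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
            ref_log_inertias.append(np.log(kmeans.fit(ref).inertia_))
        ref_log_inertias = np.array(ref_log_inertias)
        gaps.append(ref_log_inertias.mean() - original_log_inertia)
        ses.append(ref_log_inertias.std() * np.sqrt(1 + 1 / n_replicates))
    return gaps, ses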
Step 3: Calculate Gap Statistic for Different k Values
Now that the function is defined, we call it to calculate the Gap Statistic for different values of k (number of clusters). We'll evaluate up to a maximum of 10 clusters (k_max = 10).
# Calculate Gap Statistic for Different k Values
k_max = 10 # Maximum number of clusters to evaluate
gap_values = compute_gap_statistic(X, k_max)
# Plotting the Gap Statistic
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
plt.plot(range(1, k_max + 1), gap_values, marker='o')
plt.title('Gap Statistic vs Number of Clusters')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Gap Statistic')
plt.grid()
plt.show()
Output:
[Plot: Gap Statistic vs. Number of Clusters]
From the plot, we can see that the optimal number of clusters is 3.
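If you also computed standard errors (for example with the compute_gap_with_se sketch from Step 2), you could draw them as error bars, which makes the standard-error criterion easier to read off the plot. This is an optional addition, not part of the original example.
# Optional: Gap Statistic with error bars (uses the sketch above)
gap_with_se, se_values = compute_gap_with_se(X, k_max)
plt.figure(figsize=(8, 5))
plt.errorbar(range(1, k_max + 1), gap_with_se, yerr=se_values, marker='o', capsize=4)
plt.title('Gap Statistic with Standard Errors')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Gap Statistic')
plt.grid()
plt.show()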
Step 4: Determine Optimal k
Once we have the Gap Statistic values, we pick a number of clusters using a simple heuristic: we identify where the increase in the Gap Statistic between consecutive values of k is largest. Note that this is a rough shortcut; it can disagree with the formal standard-error criterion described earlier and with a visual reading of the plot.
# Determine Optimal k
optimal_k = np.argmax(np.diff(gap_values)) + 1  # +1 converts the 0-based diff index into a cluster count
print(f"Optimal number of clusters: {optimal_k}")
Output:
Optimal number of clusters: 2
Step 5: Apply K-means with the Optimal k
Once the optimal k is identified, we apply the K-means clustering algorithm with the selected number of clusters and fit it to the data.
# Apply K-means with the Optimal k
from sklearn.cluster import KMeans
kmeans_optimal = KMeans(n_clusters=optimal_k)
kmeans_optimal.fit(X)
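Once fitted, the model can also assign clusters to unseen points via predict. The coordinates below are made up purely for illustration.
# Assign hypothetical new points to the learned clusters
import numpy as np
new_points = np.array([[0.0, 4.0], [-1.5, 3.0]])  # made-up 2-D points
print(kmeans_optimal.predict(new_points))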
Step 6: Visualize the K-means Clustering Result
Finally, we visualize the clustering results by plotting the data points, coloring them according to their cluster labels. The cluster centroids are also marked.
# Visualize the K-means Clustering Result
plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], c=kmeans_optimal.labels_, cmap='viridis', marker='o')
plt.scatter(kmeans_optimal.cluster_centers_[:, 0], kmeans_optimal.cluster_centers_[:, 1], s=200, c='red', marker='X', label='Centroids')
plt.title(f'K-means Clustering with {optimal_k} Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid()
plt.show()
Output:
[Plot: K-means clustering with the optimal number of clusters and centroids marked]
Complete Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
def compute_gap_statistic(X, k_max, n_replicates=10):
    """
    Compute the Gap Statistic for a range of cluster numbers.

    Parameters:
    X: array-like, shape (n_samples, n_features)
        The input data.
    k_max: int
        The maximum number of clusters to evaluate.
    n_replicates: int
        The number of reference datasets to generate.

    Returns:
    gap_values: list
        The calculated gap values for each k.
    """
    # Generate reference data from a uniform distribution over
    # the bounding box of the original data
    def generate_reference_data(X):
        return np.random.uniform(low=X.min(axis=0), high=X.max(axis=0), size=X.shape)

    gap_values = []
    for k in range(1, k_max + 1):
        # Fit KMeans to the original data
        kmeans = KMeans(n_clusters=k)
        kmeans.fit(X)
        original_inertia = kmeans.inertia_
        # Compute the average inertia for the reference datasets
        reference_inertia = []
        for _ in range(n_replicates):
            random_data = generate_reference_data(X)
            kmeans.fit(random_data)
            reference_inertia.append(kmeans.inertia_)
        # Gap = log(mean reference inertia) - log(actual inertia)
        gap = np.log(np.mean(reference_inertia)) - np.log(original_inertia)
        gap_values.append(gap)
    return gap_values
# Example usage
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
k_max = 10
gap_values = compute_gap_statistic(X, k_max)
# Plotting the Gap Statistic
plt.figure(figsize=(8, 5))
plt.plot(range(1, k_max + 1), gap_values, marker='o')
plt.title('Gap Statistic vs Number of Clusters')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Gap Statistic')
plt.grid()
plt.show()
# Find the optimal number of clusters (k) based on the Gap Statistic
optimal_k = np.argmax(np.diff(gap_values)) + 1  # +1 converts the 0-based diff index into a cluster count
print(f"Optimal number of clusters: {optimal_k}")
# Perform K-means with the optimal number of clusters
kmeans_optimal = KMeans(n_clusters=optimal_k)
kmeans_optimal.fit(X)
# Visualizing the K-means clusters
plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], c=kmeans_optimal.labels_, cmap='viridis', marker='o')
plt.scatter(kmeans_optimal.cluster_centers_[:, 0], kmeans_optimal.cluster_centers_[:, 1], s=200, c='red', marker='X', label='Centroids')
plt.title(f'K-means Clustering with {optimal_k} Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid()
plt.show()
Advantages of Gap Statistic
- The Gap Statistic has a strong statistical basis because it compares clustering results with a null reference distribution, making it less prone to the subjective interpretation inherent in methods like the Elbow method.
- The method offers a clear and objective criterion for determining the optimal number of clusters, based on the gap between the WSS for the actual and reference data.
- The Gap Statistic can be applied to various clustering algorithms, not just k-means, and works well with different types of data distributions.
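For example, since inertia_ is specific to scikit-learn's KMeans, using the Gap Statistic with another algorithm means computing the WSS directly from the cluster labels. A minimal sketch, paired here with AgglomerativeClustering as an illustrative (not prescribed) choice:
# Sketch: WSS for labels produced by any clustering algorithm
import numpy as np
from sklearn.cluster import AgglomerativeClustering
def within_cluster_ss(X, labels):
    # Sum of squared distances of each point to its cluster mean
    wss = 0.0
    for label in np.unique(labels):
        cluster = X[labels == label]
        wss += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return wss
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)  # X from the example above
print(within_cluster_ss(X, labels))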
Limitations of Gap Statistic
- Generating reference datasets and calculating WSS for multiple values of k can be computationally intensive, especially for large datasets.
- The accuracy of the Gap Statistic depends on the number and quality of the reference datasets; too few replicates can make the estimate noisy and affect the results.
- The method assumes that the clustering algorithm chosen is appropriate for the data. The performance of the Gap Statistic can vary with different clustering methods.