Unsupervised Clustering with Unknown Number of Clusters

Clustering is a fundamental technique in unsupervised machine learning used to group similar data points into clusters. Unlike supervised learning, clustering does not rely on predefined labels, making it particularly useful for exploratory data analysis. One of the key challenges in clustering is determining the optimal number of clusters, especially when this number is unknown. This article delves into various clustering algorithms and methods to estimate the number of clusters, providing a comprehensive guide for tackling this problem.

Challenges in Determining Number of Clusters

Determining the number of clusters is crucial for the success of a clustering algorithm. Some of the challenges include:

  • Domain Knowledge: Often, domain-specific knowledge is required to estimate the number of clusters. Without this, it can be difficult to decide how many clusters are appropriate for the data.
  • Cluster Shape and Size: Clusters can vary in shape and size, making it difficult to define a clear-cut number of clusters. Some clusters may be elongated, while others may be spherical.
  • Noise and Outliers: The presence of noise and outliers can distort the clustering process and lead to an incorrect estimate of the cluster count. Identifying and handling noise appropriately is crucial.
  • Overlapping Clusters: When clusters overlap, it becomes challenging to distinguish between them and accurately determine the number of clusters.

Methods for Determining the Number of Clusters

When the number of clusters is unknown, several methods can be employed to estimate it:

  1. The Elbow Method: The Elbow Method involves plotting the within-cluster sum of squares (WCSS) against the number of clusters. 
  2. Silhouette Coefficient: The Silhouette Coefficient measures how similar a data point is to its own cluster compared to other clusters. 
  3. Gap Statistic: The Gap Statistic compares the total within-cluster variation for different numbers of clusters with its expected value under a null reference distribution of the data (a minimal implementation sketch follows this list).
  4. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): This density-based clustering algorithm does not require specifying the number of clusters. Instead, it relies on the density of data points to form clusters, making it robust to noise and capable of finding arbitrarily shaped clusters.
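
Scikit-learn does not provide a Gap Statistic implementation, so the code below is a minimal, hedged sketch assuming uniform reference datasets drawn over the data's bounding box; the helper name `gap_statistic` and its parameter defaults are illustrative choices, not from the original method description.

Python
from sklearn.cluster import KMeans
import numpy as np

def gap_statistic(data, k_max=10, n_refs=10, random_state=0):
    # Compare log(WCSS) on the data with the average log(WCSS) on
    # uniform reference datasets drawn over the data's bounding box
    rng = np.random.default_rng(random_state)
    mins, maxs = data.min(axis=0), data.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(data)
        log_wk = np.log(km.inertia_)
        ref_log_wks = []
        for _ in range(n_refs):
            ref = rng.uniform(mins, maxs, size=data.shape)
            ref_km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(ref)
            ref_log_wks.append(np.log(ref_km.inertia_))
        gaps.append(np.mean(ref_log_wks) - log_wk)
    return gaps

data = np.random.rand(100, 2)  # Example data
gaps = gap_statistic(data)
# Simplified max-gap rule; Tibshirani's original criterion also uses
# standard errors of the reference WCSS values
print(f'Gap Statistic suggests k = {int(np.argmax(gaps)) + 1}')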

1. Implementing the Elbow Method

The WCSS measures the compactness of clusters. As the number of clusters increases, WCSS decreases. The optimal number of clusters is identified at the “elbow point,” where the rate of decrease sharply slows down.

Python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Assuming data is a 2D numpy array or a pandas DataFrame
data = np.random.rand(100, 2)  # Example data

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Output:

[Plot: Elbow Method, WCSS decreasing with the number of clusters]
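
Reading the elbow off the plot is subjective. As a rough heuristic (an assumption, not part of the original method), the bend can be approximated programmatically as the point of maximum curvature, estimated by the largest discrete second difference of the WCSS values computed above:

Python
import numpy as np

# np.diff(..., n=2) gives the discrete second difference; entry j
# corresponds to k = j + 2, since the curvature estimate needs one
# point on each side and wcss starts at k = 1
second_diff = np.diff(wcss, n=2)
elbow_k = int(np.argmax(second_diff)) + 2
print(f'Elbow estimated at k = {elbow_k}')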

2. Using the Silhouette Coefficient

The Silhouette Coefficient ranges from -1 to 1: values near 1 mean a point sits well inside its own cluster, values near 0 indicate overlapping clusters, and negative values suggest a likely misassignment. A common strategy is to compute the average score for a range of cluster counts and pick the k that maximizes it:

Python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np

# Assuming data is a 2D numpy array or a pandas DataFrame
data = np.random.rand(100, 2)  # Example data

silhouette_scores = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(data)
    score = silhouette_score(data, kmeans.labels_)
    silhouette_scores.append(score)

plt.plot(range(2, 11), silhouette_scores)
plt.title('Silhouette Method')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()

Output:

[Plot: Silhouette Coefficient, average silhouette score vs. number of clusters]
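
Because the loop stores scores for k = 2 through 10, the cluster count that maximizes the average silhouette score can be read off directly (the offset of 2 matches the loop's starting value):

Python
import numpy as np

# silhouette_scores[0] corresponds to k = 2, hence the offset
best_k = int(np.argmax(silhouette_scores)) + 2
print(f'Best number of clusters by silhouette score: {best_k}')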

3. Using the DBSCAN Algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm that can identify clusters based on the density of points. It requires two parameters: `eps`, the maximum distance between two samples for them to be considered part of the same neighborhood, and `min_samples`, the minimum number of points required to form a dense region (cluster).

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)

# Standardize features
X = StandardScaler().fit_transform(X)

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_

# Number of clusters in labels, ignoring noise if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print(f'Estimated number of clusters: {n_clusters_}')
print(f'Estimated number of noise points: {n_noise_}')

# Plotting the result
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]

# Core samples only need to be identified once, outside the loop
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    # Core points: larger markers
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=14)

    # Border (non-core) points: smaller markers
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)

plt.title(f'Estimated number of clusters: {n_clusters_}')
plt.show()

Output:

Estimated number of clusters: 3
Estimated number of noise points: 18
[Plot: DBSCAN clustering result, core points drawn with larger markers, noise in black]

The plot generated by the code visually shows the clusters identified by the DBSCAN algorithm. Each cluster is represented by a different color, and noise points (outliers) are shown in black.
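
The `eps` value of 0.3 above was picked by hand. A common heuristic for choosing it, sketched here under the assumption that a visible knee exists in the curve, is the k-distance plot: sort every point's distance to its `min_samples`-th nearest neighbor, and read `eps` off at the knee. This continues from the code above and reuses `X`:

Python
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt

k = 10  # match min_samples used above
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)  # per-row distances to the k nearest neighbors, ascending

# Sort the k-th neighbor distances across all points; the knee of this
# curve is a common choice for eps
k_dist = np.sort(distances[:, -1])
plt.plot(k_dist)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {k}-th nearest neighbor')
plt.title('k-distance plot for choosing eps')
plt.show()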

Challenges and Considerations

Determining the optimal number of clusters is not always straightforward. The following challenges and considerations should be kept in mind:

  • High-Dimensional Data: High-dimensional data can complicate the clustering process due to the curse of dimensionality. Dimensionality reduction techniques like PCA or t-SNE can help (a brief sketch follows this list).
  • Cluster Shape and Size: Algorithms like K-means assume spherical clusters of similar sizes, which may not be suitable for all datasets. Density-based methods like DBSCAN are more flexible in handling clusters of various shapes and sizes.
  • Initialization Sensitivity: Some algorithms, particularly K-means, are sensitive to the initial placement of centroids. Multiple runs with different initializations can mitigate this issue.
  • Scalability: The choice of algorithm can impact computational efficiency, especially for large datasets. K-means is computationally efficient but may not always yield the best results.
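
As a hedged illustration of the first and third points above, the sketch below combines PCA with multiple K-means initializations; the data, component count, and cluster count are arbitrary placeholders, not from the article:

Python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical high-dimensional data (placeholder)
data_hd = np.random.rand(200, 50)

# Reduce dimensionality before clustering to soften the curse of
# dimensionality; 2 components is an arbitrary illustrative choice
reduced = PCA(n_components=2, random_state=0).fit_transform(data_hd)

# n_init=10 reruns K-means with different centroid seeds and keeps the
# best run, mitigating initialization sensitivity
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(reduced)
print(kmeans.labels_[:10])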

Conclusion

Unsupervised clustering with an unknown number of clusters is a complex yet essential task in data analysis. Various clustering algorithms and methods to estimate the number of clusters offer different strengths and weaknesses. Understanding these methods and their applicability to different types of data is crucial for effective clustering. Employing techniques like the Elbow Method, Silhouette Coefficient, and Gap Statistic can guide the determination of the optimal number of clusters, enhancing the interpretability and utility of clustering results.





