How do I predict new data's cluster after clustering training data in R?

To predict the cluster of new data after training a clustering model in R, you generally use the centroids (for k-means) or the hierarchical structure (for hierarchical clustering) obtained from the trained model. Below are steps and worked examples for both k-means and hierarchical clustering.

What is Clustering?

Clustering is a method of partitioning a set of data points into subsets (clusters), such that points in the same cluster are more similar to each other than to those in other clusters. It’s widely used in various fields, such as customer segmentation, image recognition, and biological data analysis.

Types of Clustering

Here we discuss the two main types of clustering used in R.

  • K-means Clustering: partitions the data into k clusters, where each data point belongs to the cluster with the nearest mean (centroid).
  • Hierarchical Clustering: builds a hierarchy of clusters, either agglomeratively (bottom-up) or divisively (top-down).

Predicting New Data Clusters with K-means Clustering

After fitting a k-means model to your training data, you can predict the cluster of new data points by finding the nearest centroid. Here’s how to do it in R.

Step 1: Train the K-means Model

First, we create and train the k-means model.

R
# kmeans() lives in the stats package, which base R loads by default,
# so no library() call is needed

# Generate example training data
set.seed(123)
training_data <- matrix(rnorm(100), ncol = 2)
colnames(training_data) <- c("x", "y")

# Perform k-means clustering on the training data
kmeans_model <- kmeans(training_data, centers = 3)

# Print the cluster centroids
print(kmeans_model$centers)

Output:

           x          y
1  1.0586072  0.2993084
2 -0.2215562 -0.8066471
3 -0.7196001  0.7962155

Step 2: Predict Clusters for New Data

Next, we predict cluster assignments for the new data by finding the nearest centroid for each point.

R
# Generate example new data
new_data <- matrix(rnorm(20), ncol = 2)
colnames(new_data) <- c("x", "y")

# Function to predict clusters for new data points
predict_kmeans <- function(new_data, kmeans_model) {
  # Squared Euclidean distance from every point to each centroid.
  # sweep() subtracts the centroid from each row; a plain
  # `new_data - kmeans_model$centers[i, ]` would recycle the centroid
  # down the columns and silently compute the wrong distances.
  distances <- sapply(seq_len(nrow(kmeans_model$centers)), function(i) {
    rowSums(sweep(new_data, 2, kmeans_model$centers[i, ])^2)
  })
  # Assign each point to the nearest centroid
  apply(distances, 1, which.min)
}

# Get cluster assignments for new data
new_clusters <- predict_kmeans(new_data, kmeans_model)
print(new_clusters)

Output (the exact labels depend on the random draws; each of the 10 new points gets a label from 1 to 3):

[1] 3 1 2 1 2 2 3 1 2 1

After fitting the k-means model, its centroids are all you need: predict_kmeans computes the squared Euclidean distance from each new data point to every centroid and assigns the point to the nearest one.
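The same nearest-centroid rule can also be written without an explicit helper loop. The sketch below (base R only; the variable names are our own, not from the article) uses the identity that minimizing ||x − c||² over centroids c is the same as maximizing x·c − ||c||²/2, since ||x||² is constant for each point:

```r
# Vectorized nearest-centroid assignment in base R.
# argmin_c ||x - c||^2  ==  argmax_c (x . c - ||c||^2 / 2)
set.seed(123)
training_data <- matrix(rnorm(100), ncol = 2)
kmeans_model <- kmeans(training_data, centers = 3)
new_data <- matrix(rnorm(20), ncol = 2)

centers <- kmeans_model$centers
# 10 x 3 score matrix: one column of scores per centroid
scores <- new_data %*% t(centers) -
  matrix(rowSums(centers^2) / 2, nrow(new_data), nrow(centers), byrow = TRUE)
new_clusters <- max.col(scores)  # row-wise argmax = nearest centroid
print(new_clusters)
```

This avoids the per-centroid loop entirely and scales better when there are many centroids.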

Predicting New Data Clusters with Hierarchical Clustering

Hierarchical clustering does not naturally extend to new data, because the dendrogram is built only from the training points. A common workaround is to cut the dendrogram into clusters and then train a classification model on those cluster labels.

Step 1: Train the Hierarchical Clustering Model

First, we fit the hierarchical clustering model.

R
# dist() and hclust() live in the stats package, which base R loads
# by default, so no library() call is needed

# Generate example training data
set.seed(123)
training_data <- matrix(rnorm(100), ncol = 2)
colnames(training_data) <- c("x", "y")

# Perform hierarchical clustering on the training data
dist_matrix <- dist(training_data)
hc <- hclust(dist_matrix, method = "complete")

# Plot the dendrogram
plot(hc, main = "Dendrogram of Hierarchical Clustering")

Output:

[Figure: dendrogram of the hierarchical clustering of the training data]

Step 2: Cut the Dendrogram to Form Clusters

Next, we cut the dendrogram into three clusters.

R
# Cut tree into clusters
cutree_clusters <- cutree(hc, k = 3)

# Print the cluster assignments
print(cutree_clusters)

Output:

 [1] 1 1 2 1 2 2 3 1 1 1 2 2 2 3 1 2 2 1 2 1 1 3 1 1 1 1 2 3 1 2 2 1 2 2 2 2 2 1 1 1 1 1
[43] 1 2 2 1 1 1 2 3
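As a quick sanity check (our addition, not one of the original steps), table() shows how the 50 training points are distributed across the three clusters:

```r
# Inspect cluster sizes after cutting the dendrogram
set.seed(123)
training_data <- matrix(rnorm(100), ncol = 2)
hc <- hclust(dist(training_data), method = "complete")
cutree_clusters <- cutree(hc, k = 3)

print(table(cutree_clusters))  # counts per cluster label 1..3
```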

Step 3: Train a Classification Model

In this step, we train a classification model on the training data and its cluster labels.

R
# Load necessary library
library(class)

# k-NN is a lazy learner: knn() classifies its test set immediately
# rather than returning a reusable model object. This call just checks
# that 1-NN reproduces the training labels; the real prediction is in Step 4.
knn_check <- knn(train = training_data, test = training_data,
                 cl = cutree_clusters, k = 1)

Step 4: Predict Clusters for New Data

Finally, we predict cluster assignments for the new data.

R
# Generate example new data
new_data <- matrix(rnorm(20), ncol = 2)
colnames(new_data) <- c("x", "y")

# Predict clusters for the new data with 1-NN on the training data and labels
new_clusters <- knn(train = training_data, test = new_data,
                    cl = cutree_clusters, k = 1)
print(new_clusters)

Output:

 [1] 2 1 1 1 1 1 1 1 1 2
Levels: 1 2 3
  • Hierarchical clustering produces a dendrogram, which is cut to form clusters.
  • A k-NN classifier is trained on the original data and their clusters.
  • New data points are assigned to clusters using the k-NN classifier.
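If pulling in a classifier feels heavy, one alternative sketch (ours, not from the article) is to summarize each hierarchical cluster by its mean and reuse the nearest-centroid rule from the k-means section:

```r
# Alternative workaround: per-cluster means + nearest-centroid assignment
set.seed(123)
training_data <- matrix(rnorm(100), ncol = 2)
hc <- hclust(dist(training_data), method = "complete")
cutree_clusters <- cutree(hc, k = 3)

# One row of means per cluster (a 3 x 2 matrix)
centers <- apply(training_data, 2,
                 function(col) tapply(col, cutree_clusters, mean))

new_data <- matrix(rnorm(20), ncol = 2)
distances <- sapply(seq_len(nrow(centers)), function(i) {
  rowSums(sweep(new_data, 2, centers[i, ])^2)
})
new_clusters <- apply(distances, 1, which.min)
print(new_clusters)
```

Note that assigning by distance to cluster means can disagree with the k-NN assignment near cluster boundaries, especially for elongated or irregularly shaped clusters.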

Conclusion

Predicting the cluster of new data points is crucial in practical applications of clustering. With k-means, the centroids can be used directly for prediction; with hierarchical clustering, a classifier trained on the cluster labels extends the clustering to new data points. This lets clustering models keep working as new data arrives.




Referred: https://www.geeksforgeeks.org

