How to Change the Value of k in KNN Using R

The k-Nearest Neighbors (KNN) algorithm is a simple, yet powerful, non-parametric method used for classification and regression. One of the critical parameters in KNN is the value of k, which represents the number of nearest neighbors to consider when making a prediction. In this article, we’ll explore how to change the value of k in KNN using R.

Why change the value of k?

Changing the value of k in KNN shifts the model along the bias-variance trade-off. A low k (e.g., k = 1) gives low bias and high variance, so the model is sensitive to noise and outliers and may overfit. Conversely, a high k reduces variance but increases bias, which can cause underfitting. Choosing k with a technique such as cross-validation helps the model generalize to unseen data. The value of k has only a minor effect on computational cost: a brute-force search computes the distance to every training point regardless of k, and a larger k adds only the work of selecting and tallying more neighbors.
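As a rough illustration of this trade-off, the sketch below compares accuracy on the training rows with accuracy on held-out rows for several values of k. The base R `sample()` split, the variable names, and the k grid are assumptions made for this sketch and are not taken from the steps that follow. A very small k tends to reproduce the training labels almost perfectly, while a very large k smooths predictions across class boundaries.

```r
library(class)  # provides knn()

data(iris)
set.seed(123)

# Assumed 80/20 split using base R sampling (not stratified by species)
train_idx <- sample(nrow(iris), size = round(0.8 * nrow(iris)))
train_x <- iris[train_idx, 1:4]
train_y <- iris$Species[train_idx]
test_x  <- iris[-train_idx, 1:4]
test_y  <- iris$Species[-train_idx]

# Illustrative grid: from a single neighbor to a large neighborhood
k_grid <- c(1, 5, 15, 45)

# For each k, accuracy on the training rows and on the held-out rows
acc <- t(sapply(k_grid, function(k) {
  c(k     = k,
    train = mean(knn(train_x, train_x, cl = train_y, k = k) == train_y),
    test  = mean(knn(train_x, test_x,  cl = train_y, k = k) == test_y))
}))
print(acc)
```

A large gap between the train and test columns at small k is a sign of overfitting. Both columns falling together at large k points to underfitting.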

Before we dive into changing the value of k, let’s set up our environment and load the necessary libraries.

# Load necessary libraries
library(class)
library(caret)

The following steps show how to change the value of k in KNN using the R programming language.

Step 1: Creating a Sample Dataset

For demonstration purposes, we’ll use the famous Iris dataset. This dataset contains measurements of iris flowers from three different species.

# Load the Iris dataset
data(iris)

# Split the dataset into training and test sets
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]

Step 2: Implementing KNN with Different Values of k

The knn function from the class package allows us to implement KNN easily. We can change the value of k by simply modifying the k parameter.

# Extract features and labels
trainFeatures <- trainData[, -5]
trainLabels <- trainData[, 5]
testFeatures <- testData[, -5]
testLabels <- testData[, 5]

# KNN with k = 3
knn3 <- knn(train = trainFeatures, test = testFeatures, cl = trainLabels, k = 3)
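To see what k does to an individual prediction, `knn()` can also return the share of neighbor votes won by the predicted class through its `prob = TRUE` argument. The sketch below is a minimal, self-contained illustration. The base R split and the variable names are assumptions for this example, not the partition created in Step 1.

```r
library(class)  # provides knn()

data(iris)
set.seed(123)

# Assumed split: 120 training rows chosen with base R sampling
train_idx <- sample(nrow(iris), size = 120)

pred <- knn(train = iris[train_idx, 1:4],
            test  = iris[-train_idx, 1:4],
            cl    = iris$Species[train_idx],
            k     = 3,
            prob  = TRUE)

# Proportion of neighbor votes won by the predicted class, one value per test row
vote_share <- attr(pred, "prob")

# Show the first few predictions next to their vote share
head(data.frame(predicted = as.character(pred), vote_share = vote_share))
```

A vote share of 1 means all of the neighbors considered agreed on the class. Values below 1 mark test points that sit near a class boundary, which are the predictions most likely to change when k changes.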

Step 3: Evaluating the Model

To evaluate the performance of our KNN model, we can use a confusion matrix.

# Confusion matrix for k = 3
confusionMatrix(knn3, testLabels)

Output:

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         1
  virginica       0          0         9

Overall Statistics
                                          
               Accuracy : 0.9667          
                 95% CI : (0.8278, 0.9992)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 2.963e-13       
                                          
                  Kappa : 0.95            
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           0.9000
Specificity                 1.0000            0.9500           1.0000
Pos Pred Value              1.0000            0.9091           1.0000
Neg Pred Value              1.0000            1.0000           0.9524
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3000
Detection Prevalence        0.3333            0.3667           0.3000
Balanced Accuracy           1.0000            0.9750           0.9500
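The headline figures in this output can be reproduced by hand from the confusion matrix. The sketch below re-enters the k = 3 matrix shown above and applies the standard one-vs-rest definitions. The variable names are illustrative only.

```r
classes <- c("setosa", "versicolor", "virginica")

# Confusion matrix for k = 3, copied from the output above
# (rows = predicted class, columns = reference class)
cm <- matrix(c(10,  0, 0,
                0, 10, 1,
                0,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

tp <- diag(cm)               # correctly predicted cases per class
fp <- rowSums(cm) - tp       # predicted as the class, but actually another
fn <- colSums(cm) - tp       # actually the class, but predicted as another
tn <- sum(cm) - tp - fp - fn # everything else

accuracy <- sum(tp) / sum(cm)  # share of all test rows predicted correctly
sens     <- tp / (tp + fn)     # Sensitivity
spec     <- tn / (tn + fp)     # Specificity
ppv      <- tp / (tp + fp)     # Pos Pred Value

round(accuracy, 4)
round(rbind(Sensitivity = sens, Specificity = spec, PosPredValue = ppv), 4)
```

The single misclassified flower (a virginica predicted as versicolor) accounts for every value below 1 in the report.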

Step 4: Changing the Value of k

We can change the value of k to see how it affects the model’s performance. Let’s try k = 5 and k = 7.

# KNN with k = 5
knn5 <- knn(train = trainFeatures, test = testFeatures, cl = trainLabels, k = 5)
confusionMatrix(knn5, testLabels)

# KNN with k = 7
knn7 <- knn(train = trainFeatures, test = testFeatures, cl = trainLabels, k = 7)
confusionMatrix(knn7, testLabels)

Output:

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          0        10

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.8843, 1)
    No Information Rate : 0.3333     
    P-Value [Acc > NIR] : 4.857e-15  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3333
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            1.0000           1.0000

Here accuracy rises from 0.9667 at k = 3 to 1 in the output shown above. With only 30 test observations, that difference amounts to a single prediction.
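Because the test set holds only 30 flowers, it helps to look at the uncertainty around these accuracy values. The sketch below recomputes exact binomial confidence intervals with base R's `binom.test()`, which is the method caret's documentation describes for the 95% CI line. The counts 29 and 30 are read from the two confusion matrices above.

```r
n_test <- 30  # number of test observations in the outputs above

# Exact (Clopper-Pearson) 95% intervals for 29/30 and 30/30 correct
ci_29 <- binom.test(29, n_test)$conf.int
ci_30 <- binom.test(30, n_test)$conf.int

round(ci_29, 4)
round(ci_30, 4)

# TRUE when the two intervals share some range of accuracy values
intervals_overlap <- ci_29[2] >= ci_30[1]
intervals_overlap
```

The two intervals should match the 95% CI lines printed above. If they overlap, this test set alone is not strong evidence that one value of k is better than another.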

Advanced Approach to Change the Value of k in KNN Using R

We can automate the process of evaluating KNN with different values of k by creating a loop.

# Define a range of k values
k_values <- c(1, 3, 5, 7, 9)

# Initialize a list to store confusion matrices
results <- list()

# Loop through k values and store the results
for (k in k_values) {
  knn_pred <- knn(train = trainFeatures, test = testFeatures, cl = trainLabels, k = k)
  results[[paste0("k = ", k)]] <- confusionMatrix(knn_pred, testLabels)
}

# Print the results
results

Output:

$`k = 1`
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0          9         2
  virginica       0          1         8

Overall Statistics
                                          
               Accuracy : 0.9             
                 95% CI : (0.7347, 0.9789)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 1.665e-10       
                                          
                  Kappa : 0.85            
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9000           0.8000
Specificity                 1.0000            0.9000           0.9500
Pos Pred Value              1.0000            0.8182           0.8889
Neg Pred Value              1.0000            0.9474           0.9048
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3000           0.2667
Detection Prevalence        0.3333            0.3667           0.3000
Balanced Accuracy           1.0000            0.9000           0.8750

... (output for the remaining k values omitted)
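The loop above scores every k on the same test set, so choosing the winner from those scores can look better than it really is. One alternative is to pick k by cross-validation on the training rows only. The sketch below uses `knn.cv()` from the class package, which runs leave-one-out cross-validation. The base R split and the variable names are assumptions for this example. caret's `train()` with `method = "knn"` is another route if k-fold resampling is preferred.

```r
library(class)  # knn.cv() performs leave-one-out cross-validation

data(iris)
set.seed(123)

# Assumed split: 120 training rows chosen with base R sampling
train_idx <- sample(nrow(iris), size = 120)
train_x <- iris[train_idx, 1:4]
train_y <- iris$Species[train_idx]

k_values <- c(1, 3, 5, 7, 9)

# Leave-one-out accuracy on the training rows for each candidate k
cv_accuracy <- sapply(k_values, function(k) {
  mean(knn.cv(train = train_x, cl = train_y, k = k) == train_y)
})
names(cv_accuracy) <- paste0("k = ", k_values)
print(round(cv_accuracy, 4))

# which.max() returns the first maximum, so ties go to the smallest k
best_k <- k_values[which.max(cv_accuracy)]
best_k
```

The test set is then used once, with the chosen `best_k`, to report final performance.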

Visualizing the Results

To better understand the impact of different k values on model performance, we can visualize the accuracy.

# Extract the overall accuracy for each k as a plain numeric vector
accuracy <- sapply(results, function(x) x$overall[["Accuracy"]])

# Create a data frame for plotting (one row per k value)
accuracy_df <- data.frame(k = k_values, Accuracy = accuracy)

# Plot the accuracy
library(ggplot2)
ggplot(accuracy_df, aes(x = k, y = Accuracy)) +
  geom_line() +
  geom_point() +
  labs(title = "KNN Accuracy vs. k Value", x = "k Value", y = "Accuracy") +
  theme_minimal()

Output:

[Figure: Change the Value of k in KNN Using R. Line plot of accuracy against k value.]

Conclusion

Changing the value of k in the KNN algorithm can significantly impact the model’s performance. By experimenting with different k values, we can identify the optimal k that provides the best accuracy for our specific dataset. Using the class and caret packages in R, it’s straightforward to implement and evaluate KNN models with various k values.

Referred: https://www.geeksforgeeks.org
