knn Impute Using Categorical Variables with caret Package

In data science and machine learning, missing data is a common issue that can significantly impact the performance of predictive models. One effective way to handle missing values is through imputation, which involves replacing missing data with substituted values. The caret package in R provides several methods for imputation, one of which is K-Nearest Neighbors (KNN) imputation. This article will focus on using KNN imputation with categorical variables in the caret package.

What is KNN Imputation?

K-Nearest Neighbors (KNN) imputation is a method that replaces missing values with the mean (for numeric data) or the most frequent (for categorical data) value from the ‘k’ nearest neighbors. The nearest neighbors are determined based on a distance metric, typically Euclidean distance for numerical variables and Hamming distance for categorical variables.
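To make the idea concrete, here is a hedged, manual sketch of imputing a single missing categorical value in base R. The data frame and choice of k are purely illustrative: neighbors are ranked by Euclidean distance on the complete numeric columns, and the most frequent category among the k nearest rows fills the gap.

R
# Illustrative mini-dataset: one row has a missing Gender
df <- data.frame(
  Age    = c(25, 30, 35, 40),
  Income = c(50, 60, 55, 80),
  Gender = c("Male", "Female", "Female", NA)
)

target   <- df[4, c("Age", "Income")]   # the row to impute
complete <- df[1:3, ]                   # rows with no missing Gender

# Euclidean distance on the numeric columns
dists <- sqrt((complete$Age - target$Age)^2 +
              (complete$Income - target$Income)^2)

k <- 2
neighbors <- complete[order(dists)[1:k], ]

# Mode of Gender among the k nearest neighbors
imputed <- names(which.max(table(neighbors$Gender)))
imputed  # "Female"

In practice you would not code this by hand; packages such as VIM implement a robust version of this logic (handling scaling, ties, and mixed-type distances) as shown below.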

Why Use KNN Imputation?

KNN imputation is advantageous because it considers the relationships between observations, leading to more accurate imputations than simpler methods like mean or mode imputation. This method is particularly useful when dealing with mixed-type data (both numerical and categorical variables).

Prerequisites

Before diving into KNN imputation, ensure you have the following:

  • R and RStudio installed
  • Basic understanding of R programming
  • The caret package installed

You can install the caret package using the following command:

install.packages("caret")

Now we will walk through the steps to perform KNN imputation with categorical variables in R Programming Language.

Step 1. Load Required Libraries

First, load the required libraries.

R
library(caret)
library(dplyr)
library(VIM)

Step 2. Load and Explore the Data

For demonstration, we’ll use a sample dataset. You can replace this with your dataset.

R
data <- data.frame(
  Age = c(25, 30, NA, 35, 40, 45, NA, 55),
  Gender = as.factor(c("Male", "Female", "Female", "Male", NA, "Male",
                       "Female", "Female")),
  Income = c(50000, 60000, 55000, NA, 80000, 75000, 70000, NA)
)

# Display the data
print(data)

Output:

  Age Gender Income
1  25   Male  50000
2  30 Female  60000
3  NA Female  55000
4  35   Male     NA
5  40   <NA>  80000
6  45   Male  75000
7  NA Female  70000
8  55 Female     NA

Step 3. Visualize Missing Data

Before imputation, it’s helpful to visualize the missing data pattern.

R
aggr(data, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, 
     labels=names(data), cex.axis=.7, gap=3, ylab=c("Missing data","Pattern"))

Output:

[Plot: aggr() output showing the proportion of missing values per variable and the missing-data pattern]

Step 4. Perform KNN imputation using the VIM package

The kNN() function from VIM handles mixed data directly, using a Gower-based distance that accommodates both numeric and categorical columns. The resulting imputed data includes additional logical columns indicating which values were imputed.

R
# Perform KNN imputation using the VIM package
imputedData <- kNN(data, k = 3)

# kNN() appends logical indicator columns (Age_imp, Gender_imp, Income_imp)
# flagging which values were imputed; keep only the original three columns
imputedData <- imputedData[, 1:3]

# Display the imputed data
print(imputedData)

Output:

  Age Gender Income
1  25   Male  50000
2  30 Female  60000
3  45 Female  55000
4  35   Male  75000
5  40   Male  80000
6  45   Male  75000
7  30 Female  70000
8  55 Female  60000

Step 5. Verify the Imputation

It’s crucial to check if the missing values have been correctly imputed.

R
sum(is.na(imputedData))

Output:

[1] 0
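On larger datasets it can also help to confirm, column by column, that nothing was missed. A quick check using base R:

R
# Missing counts per column (all zeros after a successful imputation)
colSums(is.na(imputedData))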

By using the kNN function from the VIM package, we can successfully impute missing values for both numeric and factor variables, ensuring the dataset is complete and ready for further analysis.
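Although this tutorial imputes with VIM, the caret package named in the title provides its own KNN imputation through preProcess(). One caveat worth knowing: caret's "knnImpute" method operates on numeric columns only and always centers and scales them, so factor columns such as Gender must be handled separately (for example, dummy-encoded first or imputed with VIM as above). A minimal sketch, assuming the data frame from Step 2:

R
library(caret)

# Keep only the numeric columns; caret's knnImpute cannot impute factors
numData <- data[, c("Age", "Income")]

# method = "knnImpute" implies centering and scaling of the result
pp <- preProcess(numData, method = "knnImpute", k = 3)
imputedNum <- predict(pp, numData)

sum(is.na(imputedNum))  # 0: numeric gaps filled, on the standardized scale

This is why VIM's kNN() is often the more convenient choice for mixed-type data like ours, while caret's knnImpute fits naturally into a numeric modeling pipeline.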

Conclusion

KNN imputation is a powerful method for handling missing data, especially when dealing with both numerical and categorical variables. In R, the VIM package's kNN() function makes this straightforward for mixed-type data, while caret offers a numeric-only knnImpute pre-processing method for modeling pipelines. By carefully pre-processing the data and choosing appropriate methods, you can significantly improve the quality of your datasets, leading to more accurate and reliable predictive models.




Reference: https://www.geeksforgeeks.org

