How to Use SMOTE for Imbalanced Data in R

In machine learning, dealing with imbalanced datasets is a common challenge. When one class significantly outnumbers another, models tend to be biased towards the majority class, resulting in poor predictive performance for the minority class. The Synthetic Minority Over-sampling Technique (SMOTE) addresses this issue by generating synthetic samples for the minority class, thereby balancing the dataset. In this article, we will walk through balancing an imbalanced classification problem using SMOTE in the R programming language.

Understanding Imbalanced Classification

Imbalanced classification occurs when one class has far more instances than another. Models trained on such data can achieve high overall accuracy while performing poorly on the minority class. In fraud detection, for example, fraudulent transactions (the minority class) are far less frequent than legitimate ones (the majority class), so a traditional classifier may fail to flag fraud at all.
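
As a toy illustration (hypothetical numbers, not a real dataset), a classifier that always predicts the majority class on a 99:1 split still scores 99% accuracy while catching zero fraud:

R
# Illustrative only: 990 legitimate transactions, 10 fraudulent ones
labels <- c(rep("legit", 990), rep("fraud", 10))

# A "model" that always predicts the majority class
preds <- rep("legit", 1000)

mean(preds == labels)  # 0.99 -- high accuracy, yet every fraud case is missed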

Introduction to SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is a widely used technique for handling imbalanced datasets. It creates synthetic examples for the minority class by interpolating between existing minority instances and their nearest minority-class neighbours. This moves the class distribution towards balance without simply duplicating minority instances.
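
To make the interpolation concrete, here is a minimal sketch of how one synthetic point is formed (this illustrates the core idea only; it is not the smotefamily implementation):

R
# A synthetic point lies on the line segment between a minority instance
# and one of its K nearest minority-class neighbours
x_i        <- c(5.0, 3.4, 1.5, 0.2)  # a minority-class instance (4 features)
x_neighbor <- c(4.8, 3.1, 1.6, 0.2)  # one of its nearest minority neighbours

set.seed(1)
gap <- runif(1)                       # random weight in [0, 1]
synthetic <- x_i + gap * (x_neighbor - x_i)
synthetic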

Step 1: Install and Load Necessary Packages

First, install and load the necessary packages.

R
# Install the smotefamily package
install.packages("smotefamily")

# Load the smotefamily package
library(smotefamily)

# Install and load other necessary packages
install.packages("caret")
library(caret)
install.packages("nnet")
library(nnet)
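
If you rerun the script, a small guard (a common convenience pattern, not something smotefamily requires) avoids reinstalling packages that are already present:

R
# Install each package only if it is not already available, then load it
for (pkg in c("smotefamily", "caret", "nnet")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  library(pkg, character.only = TRUE)
}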

Step 2: Load and Explore the Iris Dataset

The built-in iris dataset is perfectly balanced (50 observations per species), so we'll first remove most of one class to create an imbalanced version.

R
# Load the iris dataset
data(iris)

# Make the dataset imbalanced by removing a portion of one class
setosa <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]
virginica <- iris[iris$Species == "virginica", ]

# Remove 80% of the 'virginica' class to create imbalance
set.seed(123)
virginica <- virginica[sample(1:nrow(virginica), size = 0.2 * nrow(virginica)), ]

# Combine the classes to create a new imbalanced dataset
iris_imbalanced <- rbind(setosa, versicolor, virginica)

# Check the distribution of the target variable
table(iris_imbalanced$Species)

Output:

setosa versicolor  virginica 
        50         50         10 
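
A quick bar plot (optional) makes the imbalance easy to see:

R
# Visualize the class distribution of the imbalanced dataset
barplot(table(iris_imbalanced$Species),
        main = "Class distribution before SMOTE",
        ylab = "Count")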

Step 3: Apply SMOTE

Apply SMOTE to the imbalanced dataset using the SMOTE() function from the smotefamily package. K is the number of nearest neighbours used for interpolation, and dup_size controls how many synthetic points are generated per original minority instance.

R
# Convert Species to a factor
iris_imbalanced$Species <- as.factor(iris_imbalanced$Species)

# Apply SMOTE
set.seed(123) # For reproducibility
smote_result <- SMOTE(X = iris_imbalanced[, -5], target = iris_imbalanced$Species, 
                      K = 5, dup_size = 2)

# Combine the SMOTE result into a new data frame
iris_smote <- data.frame(smote_result$data)
names(iris_smote)[ncol(iris_smote)] <- "Species"
iris_smote$Species <- as.factor(iris_smote$Species)

# Check the distribution of the target variable after SMOTE
table(iris_smote$Species)

Output:

 setosa versicolor  virginica 
        50         50         30 
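
Here dup_size = 2 adds two synthetic points per original virginica instance (10 + 20 = 30). Assuming dup_size scales the synthetic count the same way, dup_size = 4 would balance the classes fully (10 + 40 = 50); a quick sketch:

R
# Fully balance virginica against the 50-instance majority classes
smote_full <- SMOTE(X = iris_imbalanced[, -5], target = iris_imbalanced$Species,
                    K = 5, dup_size = 4)
table(smote_full$data$class)  # smotefamily names the label column "class"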

Step 4: Train a Model on the Balanced Dataset

Now that you have a balanced dataset, you can train a machine learning model on it. For this example, we'll use a multinomial logistic regression model (caret's "multinom" method, backed by the nnet package).

R
# Split the dataset into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(iris_smote$Species, p = .8, list = FALSE, times = 1)
train_data <- iris_smote[trainIndex,]
test_data <- iris_smote[-trainIndex,]

# Train a logistic regression model
model <- train(Species ~ ., data = train_data, method = "multinom")

# Make predictions on the test set
predictions <- predict(model, newdata = test_data)

# Ensure predictions and test_data$Species are factors with the same levels
predictions <- factor(predictions, levels = levels(test_data$Species))

# Evaluate the model
confusionMatrix(predictions, test_data$Species)

Output:

stopped after 100 iterations
# weights:  18 (10 variable)
initial  value 114.255678 
iter  10 value 11.828471
iter  20 value 0.265536
iter  30 value 0.023720
iter  40 value 0.009716
iter  50 value 0.008655
iter  60 value 0.007052
iter  70 value 0.005485
iter  80 value 0.004491
iter  90 value 0.003137
iter 100 value 0.002813
final  value 0.002813 
stopped after 100 iterations
# weights:  18 (10 variable)
initial  value 114.255678 
iter  10 value 18.002449
iter  20 value 16.511306
final  value 16.511023 
converged
# weights:  18 (10 variable)
initial  value 114.255678 
iter  10 value 12.128063
iter  20 value 0.661270
iter  30 value 0.479209
iter  40 value 0.423484
iter  50 value 0.389637
iter  60 value 0.347469
iter  70 value 0.336718
iter  80 value 0.328449
iter  90 value 0.326521
iter 100 value 0.324678
final  value 0.324678 
stopped after 100 iterations
# weights:  18 (10 variable)
initial  value 114.255678 
iter  10 value 14.002492
iter  20 value 0.902828
iter  30 value 0.739173
iter  40 value 0.613520
iter  50 value 0.531325
iter  60 value 0.468994
iter  70 value 0.436254
iter  80 value 0.432734
iter  90 value 0.432322
iter 100 value 0.431562
final  value 0.431562 
stopped after 100 iterations
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          0         6

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.8677, 1)
    No Information Rate : 0.3846     
    P-Value [Acc > NIR] : 1.624e-11  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3846            0.3846           0.2308
Detection Rate              0.3846            0.3846           0.2308
Detection Prevalence        0.3846            0.3846           0.2308
Balanced Accuracy           1.0000            1.0000           1.0000
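
One caveat: because we split the data after applying SMOTE, the test set contains synthetic points, which flatters the metrics. A more realistic protocol, sketched below using the objects defined earlier, is to split the original imbalanced data first and apply SMOTE to the training portion only:

R
# Sketch: oversample the training split only, then evaluate on real data
set.seed(123)
idx <- createDataPartition(iris_imbalanced$Species, p = .8, list = FALSE)
train_raw <- iris_imbalanced[idx, ]
test_raw  <- iris_imbalanced[-idx, ]

# Apply SMOTE to the training split only
sm <- SMOTE(X = train_raw[, -5], target = train_raw$Species, K = 5, dup_size = 2)
train_bal <- sm$data
names(train_bal)[ncol(train_bal)] <- "Species"
train_bal$Species <- as.factor(train_bal$Species)

# Train on the balanced training set; trace = FALSE suppresses nnet's
# iteration log
fit <- train(Species ~ ., data = train_bal, method = "multinom", trace = FALSE)

# Evaluate on the untouched, real test data
pred <- predict(fit, newdata = test_raw)
confusionMatrix(pred, factor(test_raw$Species, levels = levels(train_bal$Species)))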

Conclusion

Balancing imbalanced datasets is crucial for developing accurate and reliable machine learning models, and SMOTE is an effective way to generate synthetic samples for the minority class. In this article, we demonstrated how to use SMOTE in R to balance an imbalanced classification problem. Keep in mind that SMOTE should normally be applied to the training split only, as in the sketch above, so that evaluation reflects performance on real data.



