In machine learning, dealing with unbalanced datasets is a common challenge. When one class significantly outnumbers the other, models tend to be biased towards the majority class, resulting in poor predictive performance for the minority class. The Synthetic Minority Over-sampling Technique (SMOTE) addresses this issue by generating synthetic samples for the minority class, thereby balancing the dataset. In this article, we will explore how to balance an unbalanced classification problem using SMOTE in the R programming language.

Understanding Unbalanced Classification
Unbalanced classification occurs when the number of instances in one class is significantly higher than in the other. This imbalance can lead to biased models that perform well on the majority class but poorly on the minority class. For instance, in fraud detection, fraudulent transactions (minority class) are far less frequent than legitimate transactions (majority class), and traditional classification algorithms may fail to identify them because of the imbalance.

Introduction to SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) handles unbalanced datasets by creating synthetic examples for the minority class rather than simply duplicating existing ones. For each minority instance x, SMOTE picks one of its k nearest minority neighbours x_nn and creates a new point x_new = x + lambda * (x_nn - x), where lambda is drawn uniformly from [0, 1]. This interpolation produces a balanced class distribution without exact copies of the minority instances.

Step 1: Install and Load Necessary Packages
First, we install and load the necessary packages.
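The exact package list is not reproduced in the article; based on the SMOTE call described in Step 3 and the output shown in Step 4, a reasonable setup (a sketch, not the article's verbatim code) uses smotefamily for SMOTE, nnet for multinomial logistic regression, and caret for data splitting and the confusion matrix:

# Install the required packages (only needed once)
install.packages(c("smotefamily", "caret", "nnet"))

# Load them into the session
library(smotefamily)  # SMOTE() implementation
library(caret)        # createDataPartition(), confusionMatrix()
library(nnet)         # multinom() for multinomial logistic regression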
Step 2: Load and Explore the Iris Dataset
The iris dataset is balanced (50 observations per species). We will modify it to create an imbalanced dataset.
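The subsetting code is not shown in the article; one way to obtain the 50/50/10 split seen in the output below is to keep all setosa and versicolor rows and only the first 10 virginica rows (the particular row selection is an assumption):

data(iris)

# Keep all 50 setosa and 50 versicolor rows but only the first 10 virginica rows,
# giving the 50/50/10 imbalance shown in the output below
iris_imbalanced <- rbind(
  subset(iris, Species == "setosa"),
  subset(iris, Species == "versicolor"),
  head(subset(iris, Species == "virginica"), 10)
)

# Inspect the class distribution
table(iris_imbalanced$Species)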
Output:
setosa versicolor virginica
    50         50         10

Step 3: Apply SMOTE
Apply the SMOTE() function from the smotefamily package to the imbalanced iris dataset.
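The parameter values are not given in the article, so this is a sketch: dup_size = 2 (two synthetic points per original minority instance) is an assumption chosen to match the 30 virginica rows in the output below, and K = 5 is the smotefamily default. Note that smotefamily::SMOTE() over-samples only the smallest class, which here is virginica.

# smotefamily::SMOTE() takes the numeric features (X) and the class vector (target)
smote_result <- SMOTE(X = iris_imbalanced[, 1:4],
                      target = iris_imbalanced$Species,
                      K = 5,         # nearest neighbours used for interpolation
                      dup_size = 2)  # 2 synthetic samples per minority instance (assumed)

# The balanced dataset (original + synthetic rows); the class column is named "class"
balanced_data <- smote_result$data
balanced_data$class <- factor(balanced_data$class)

# Check the new class distribution
table(balanced_data$class)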
Output:
setosa versicolor virginica
    50         50         30

Step 4: Train a Model on the Balanced Dataset
Now that we have a balanced dataset, we can train a machine learning model on it. Because the target has three classes, this example uses multinomial logistic regression (nnet::multinom, whose iteration log appears in the output below), evaluated on a held-out test set.
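The training code itself is not reproduced in the article, and the repeated multinom iteration logs below suggest the model was refit several times (for example through caret resampling). A minimal sketch, assuming a simple 80/20 train/test split (which is consistent with the 26 test observations in the confusion matrix) and a single multinom() fit, is:

set.seed(123)  # assumption: the article does not state a seed

# 80/20 train/test split on the balanced data
train_idx  <- createDataPartition(balanced_data$class, p = 0.8, list = FALSE)
train_data <- balanced_data[train_idx, ]
test_data  <- balanced_data[-train_idx, ]

# Multinomial logistic regression on the four numeric features
model <- multinom(class ~ ., data = train_data, maxit = 100)

# Predict on the held-out test set and evaluate
predictions <- predict(model, newdata = test_data)
confusionMatrix(predictions, test_data$class)

A single multinom() call prints one "# weights ... iter ... final value" block; the multiple blocks below indicate the model was fitted more than once in the original run.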
Output:
stopped after 100 iterations
# weights: 18 (10 variable)
initial value 114.255678
iter 10 value 11.828471
iter 20 value 0.265536
iter 30 value 0.023720
iter 40 value 0.009716
iter 50 value 0.008655
iter 60 value 0.007052
iter 70 value 0.005485
iter 80 value 0.004491
iter 90 value 0.003137
iter 100 value 0.002813
final value 0.002813
stopped after 100 iterations
# weights: 18 (10 variable)
initial value 114.255678
iter 10 value 18.002449
iter 20 value 16.511306
final value 16.511023
converged
# weights: 18 (10 variable)
initial value 114.255678
iter 10 value 12.128063
iter 20 value 0.661270
iter 30 value 0.479209
iter 40 value 0.423484
iter 50 value 0.389637
iter 60 value 0.347469
iter 70 value 0.336718
iter 80 value 0.328449
iter 90 value 0.326521
iter 100 value 0.324678
final value 0.324678
stopped after 100 iterations
# weights: 18 (10 variable)
initial value 114.255678
iter 10 value 14.002492
iter 20 value 0.902828
iter 30 value 0.739173
iter 40 value 0.613520
iter 50 value 0.531325
iter 60 value 0.468994
iter 70 value 0.436254
iter 80 value 0.432734
iter 90 value 0.432322
iter 100 value 0.431562
final value 0.431562
stopped after 100 iterations
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 10 0 0
versicolor 0 10 0
virginica 0 0 6
Overall Statistics
Accuracy : 1
95% CI : (0.8677, 1)
No Information Rate : 0.3846
P-Value [Acc > NIR] : 1.624e-11
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 1.0000 1.0000
Specificity 1.0000 1.0000 1.0000
Pos Pred Value 1.0000 1.0000 1.0000
Neg Pred Value 1.0000 1.0000 1.0000
Prevalence 0.3846 0.3846 0.2308
Detection Rate 0.3846 0.3846 0.2308
Detection Prevalence 0.3846 0.3846 0.2308
Balanced Accuracy 1.0000 1.0000 1.0000

Conclusion
Balancing unbalanced datasets is crucial for developing accurate and reliable machine learning models. SMOTE generates synthetic samples for the minority class, leading to a balanced dataset. In this article, we demonstrated how to use SMOTE in R to balance an unbalanced classification problem. By following these steps, you can improve your model's performance on unbalanced datasets and obtain more reliable results.