Easy Ensemble Classifier in Machine Learning

The Easy Ensemble Classifier (EEC) is an ensemble learning algorithm designed to address class imbalance in classification tasks. It combines repeated under-sampling of the majority class with ensembling to improve recognition of the minority class, which is often the class that matters most in applications such as fraud detection and medical diagnosis.

Handling Imbalanced Datasets using Easy Ensemble Classifier

Class imbalance is a common challenge in machine learning, where traditional algorithms may exhibit a bias towards the majority class, leading to suboptimal performance on the minority class. EEC tackles this issue through the following strategies:

Balancing Training Data: EEC creates balanced subsets by under-sampling the majority class, ensuring that each subset has an almost equal representation of both classes.
Ensembling for Robustness: By combining predictions from classifiers trained on these balanced subsets, EEC improves generalization and robustness and reduces model bias towards the majority class.
Boosting Technique: EEC is typically paired with a boosting method such as AdaBoost, which is the default base learner in the imbalanced-learn implementation. Each boosted learner concentrates on the instances it misclassifies within its balanced subset.

Operational Mechanics of Easy Ensemble Classifier

The Easy Ensemble Classifier follows a systematic process to mitigate class imbalance:

1. Iterative Under-Sampling

Subset Creation: The majority class is repeatedly under-sampled to form several balanced subsets. For example, in a dataset with 1000 majority and 100 minority instances, 10 subsets can be drawn, each pairing 100 randomly sampled majority instances with all 100 minority instances.
Random Sampling: Each subset is generated through random sampling of majority class instances, ensuring diversity among subsets.
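
The subset creation described above can be sketched with the standard library alone. This is a minimal illustration, not the imbalanced-learn internals: the helper name `make_balanced_subsets` and the toy `(sample_id, label)` pairs are hypothetical, and the numbers mirror the 1000/100 example.

```python
import random

def make_balanced_subsets(majority, minority, n_subsets, seed=0):
    """Pair a fresh random draw of the majority class with all minority samples."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        # Sample without replacement within a subset; subsets may overlap each other.
        sampled_majority = rng.sample(majority, k=len(minority))
        subsets.append(sampled_majority + minority)
    return subsets

# Hypothetical toy data: (sample_id, label) pairs, 1000 majority and 100 minority.
majority = [(i, 0) for i in range(1000)]
minority = [(i, 1) for i in range(1000, 1100)]

subsets = make_balanced_subsets(majority, minority, n_subsets=10)
print(len(subsets), len(subsets[0]))  # 10 200
```

Because each subset is drawn independently, a given majority instance can appear in several subsets or in none, which is where the diversity among the base classifiers comes from.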

2. Training Base Classifiers

Independent Training: Each balanced subset is used to train a separate base classifier, ensuring a balanced perspective on the data.
Diverse Classifiers: Simple classifiers such as decision trees are commonly used and combined to form the ensemble.

3. Boosting for Performance

Sequential Training: Boosting algorithms train classifiers sequentially, with each subsequent classifier focusing on errors made by the previous ones.
Weight Adjustment: Boosting adjusts the weights of misclassified instances to improve classification accuracy.
Combining Weak Learners: Many weak classifiers, each only slightly better than chance, are aggregated into a single stronger classifier.
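
To make the weight adjustment concrete, here is a rough sketch of one re-weighting round in the style of discrete AdaBoost. The function name `adaboost_reweight` is hypothetical, and the sketch assumes the weighted error is strictly between 0 and 1; library implementations may differ in details.

```python
import math

def adaboost_reweight(weights, is_correct):
    """One round of discrete AdaBoost re-weighting for a single weak learner."""
    # Weighted error: total weight of the misclassified samples.
    err = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # The learner's say in the final vote; larger when the error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Raise the weights of mistakes, lower the weights of correct predictions.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Five samples with uniform weights; the third one is misclassified.
weights = [0.2] * 5
is_correct = [True, True, False, True, True]
alpha, new_weights = adaboost_reweight(weights, is_correct)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
# 0.693 [0.125, 0.125, 0.5, 0.125, 0.125]
```

After normalization the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.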

4. Aggregation of Classifier Predictions

Voting Mechanism: Predictions from base classifiers are combined using majority voting, where the class with the most votes is chosen.
Weighted Voting: Some methods weight classifier predictions based on individual performance or confidence.
Boosted Predictions: When boosting is applied, the aggregation takes into account the accuracy of classifiers within the boosting sequence.
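
The difference between these aggregation schemes can be shown in a few lines. This is an illustrative sketch only: `majority_vote` and `soft_vote` are hypothetical helpers, the votes and probabilities are made up, and the aggregation actually used by a given library version may differ.

```python
from collections import Counter

def majority_vote(labels):
    """Hard voting: return the label predicted by the most classifiers."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities, weights=None):
    """Soft voting: (optionally weighted) mean of P(class = 1), thresholded at 0.5."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    avg = sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)
    return int(avg >= 0.5), avg

# Made-up outputs of five base classifiers for one sample.
hard = majority_vote([1, 0, 1, 1, 0])
soft_label, soft_score = soft_vote([0.9, 0.4, 0.6, 0.7, 0.3])
weighted_label, weighted_score = soft_vote(
    [0.9, 0.4, 0.6, 0.7, 0.3], weights=[1, 3, 1, 1, 3]
)
print(hard, soft_label, round(soft_score, 2), weighted_label, round(weighted_score, 2))
# 1 1 0.58 0 0.48
```

Giving more weight to the two classifiers that lean towards class 0 flips the final decision, which is why the weighting scheme matters when base classifiers disagree.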

Advantages of Easy Ensemble Classifier

Improved Performance on Imbalanced Datasets: Enhances model accuracy for the minority class.
Boosting: Increases model accuracy by focusing on difficult cases.
Robustness: The ensemble approach mitigates the weaknesses of individual classifiers.
Versatility: Compatible with various base classifiers and data types.
Lower Bias: Reduces bias towards the majority class.

Disadvantages of Easy Ensemble Classifier

Higher Complexity: Training multiple classifiers and combining their predictions is computationally intensive.
Resource-Intensive: Requires significant computational resources and memory.
Overfitting: Risk of overfitting due to boosting, affecting performance on unseen data.
Model Interpretability: The ensemble and boosting techniques reduce model interpretability compared to simpler models.
Under-Sampling Limitations: Loss of information from under-sampling the majority class may affect predictive performance.

Implementing Easy Ensemble Classifier for Heart Failure Prediction

Step 1: Import Necessary Libraries

Import necessary libraries for data manipulation, visualization, and machine learning.

Python

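import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Resampling, model, and evaluation utilities used in the later steps
from imblearn.combine import SMOTETomek
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
warnings.filterwarnings('ignore')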

Step 2: Load and Explore Data

Load the heart failure clinical records dataset (heart_failure_clinical_records_dataset.csv) and check the first few rows, data types, summary statistics, and missing values.

Python

df = pd.read_csv('/content/heart_failure_clinical_records_dataset.csv')
df.head(10)
df.info()
df.describe()
df.isnull().sum()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB
age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

Step 3: Explore Class Distribution

Check the distribution of the target variable to understand class imbalance.

Python

df.DEATH_EVENT.value_counts()

Output:

DEATH_EVENT
0    203
1     96
Name: count, dtype: int64
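
As a quick check on how skewed these counts are, the imbalance ratio and minority share can be computed directly. The variable names below are illustrative; the counts are copied from the output above.

```python
counts = {0: 203, 1: 96}  # copied from the value_counts() output above

total = sum(counts.values())
minority_label = min(counts, key=counts.get)
majority_label = max(counts, key=counts.get)

imbalance_ratio = counts[majority_label] / counts[minority_label]
minority_share = counts[minority_label] / total

print(f"{imbalance_ratio:.2f}:1, minority share {minority_share:.1%}")
# 2.11:1, minority share 32.1%
```

A ratio of roughly 2:1 is a moderate imbalance; the effect of balancing is typically larger on more heavily skewed data.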

Step 4: Prepare Data for Resampling

Separate features and target variable.

Python

X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']

Step 5: Apply SMOTETomek for Resampling

Use SMOTETomek, which combines SMOTE over-sampling of the minority class with removal of Tomek links (pairs of nearest neighbors from opposite classes), to balance the classes before training.

Python

smk = SMOTETomek(random_state=42)
X_res, y_res = smk.fit_resample(X, y)

Step 6: Split Data into Training and Testing Sets

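Divide the resampled data into training and test sets. Because resampling happens before the split, the test set contains synthetic SMOTE samples, so the scores reported below are likely optimistic compared with an evaluation on untouched data.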

Python

X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.32, random_state=42)
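
An alternative ordering is to hold out the test set first and balance only the training part. The sketch below shows that ordering with the standard library, under stated assumptions: `stratified_split` is a hypothetical helper, the labels only reproduce the class counts of this dataset, and plain random under-sampling stands in for SMOTETomek.

```python
import random

def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx) with roughly the same class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

# Labels with the same class counts as the dataset: 203 of class 0, 96 of class 1.
labels = [0] * 203 + [1] * 96
train_idx, test_idx = stratified_split(labels, test_size=0.32)

# Balance the training part only, here by random under-sampling of class 0.
rng = random.Random(0)
train_min = [i for i in train_idx if labels[i] == 1]
train_maj = [i for i in train_idx if labels[i] == 0]
balanced_train = rng.sample(train_maj, k=len(train_min)) + train_min

print(len(test_idx), len(train_min), len(balanced_train))  # 96 65 130
```

With scikit-learn and imbalanced-learn, the same idea would amount to calling train_test_split with stratify=y on the original X and y, then calling fit_resample on the training portion only; the test rows then remain real, unmodified records.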

Step 7: Initialize and Train EasyEnsembleClassifier

Create an instance of EasyEnsembleClassifier and train it on the training data.

Python

eec = EasyEnsembleClassifier(random_state=42)
eec.fit(X_train, y_train)

Step 8: Predict and Evaluate Model

Generate predictions on the test data and evaluate the model’s performance.

Python

y_pred = eec.predict(X_test)

# Classification Report
print('Classification Report')
print(classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix")
print(confusion_matrix(y_test, y_pred))

# Accuracy Score
print("Accuracy Score")
print(accuracy_score(y_test, y_pred))

Output:

Classification Report
              precision    recall  f1-score   support

           0       0.88      0.82      0.85        56
           1       0.81      0.88      0.84        48

    accuracy                           0.85       104
   macro avg       0.85      0.85      0.85       104
weighted avg       0.85      0.85      0.85       104

Confusion Matrix
[[46 10]
 [ 6 42]]
Accuracy Score
0.8461538461538461
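
The figures in the classification report follow directly from the confusion matrix. As a sanity check, the class 1 metrics and the overall accuracy can be recomputed by hand; the variable names are illustrative and the matrix is copied from the output above.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions.
cm = [[46, 10],
      [6, 42]]

tn, fp = cm[0]
fn, tp = cm[1]

precision_1 = tp / (tp + fp)   # of the predicted deaths, how many were correct
recall_1 = tp / (tp + fn)      # of the actual deaths, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"{precision_1:.3f} {recall_1:.3f} {f1_1:.3f} {accuracy:.3f}")
# 0.808 0.875 0.840 0.846
```

In a medical setting the recall for class 1 is usually the number to watch, since the 6 false negatives are patients whose death event the model missed.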

Step 9: Predict Using New Data

Prepare new data for prediction, reshape it as required, and use the trained model to make predictions.

Python

input_data = (75, 0, 582, 0, 20, 1, 265000, 1.9, 130, 1, 0, 4)
input_data_as_numpy_array = np.asarray(input_data)
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)
prediction = eec.predict(input_data_reshaped)
print("Prediction Class: ", prediction)

Output:

Prediction Class:  [1]
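
Since the input tuple is positional, it is easy to misplace a value. The following sketch pairs each value with its column name, using the column order from the df.info() listing in Step 2; the names `feature_names` and `record` are introduced here for illustration.

```python
# Column order taken from the df.info() listing in Step 2 (target column excluded).
feature_names = [
    "age", "anaemia", "creatinine_phosphokinase", "diabetes",
    "ejection_fraction", "high_blood_pressure", "platelets",
    "serum_creatinine", "serum_sodium", "sex", "smoking", "time",
]
input_data = (75, 0, 582, 0, 20, 1, 265000, 1.9, 130, 1, 0, 4)

# Fail early if a value is missing or extra, instead of silently shifting columns.
if len(input_data) != len(feature_names):
    raise ValueError("expected one value per feature")

record = dict(zip(feature_names, input_data))
for name, value in record.items():
    print(f"{name}: {value}")
```

One option, assuming the model was fitted on a DataFrame as in Step 7, is to wrap the record as pd.DataFrame([record]) before calling eec.predict so that the feature names match those seen during training.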

Conclusion

The Easy Ensemble Classifier (EEC) effectively addresses the challenge of class imbalance, a common issue in classification tasks where the distribution of classes is skewed. By employing a combination of under-sampling, ensembling, and boosting techniques, EEC enhances the performance of machine learning models on imbalanced datasets, making it particularly useful for applications like fraud detection and medical diagnosis.

Referred: https://www.geeksforgeeks.org
