Handling Class Imbalance in PyTorch

Class imbalance is a common challenge in machine learning, where certain classes are underrepresented compared to others. This can lead to biased models that perform poorly on minority classes. In this article, we will explore several techniques for handling class imbalance in PyTorch so that your models generalize better to minority classes.

Table of Contents

Understanding Class Imbalance Problem
Techniques to Handle Class Imbalance in PyTorch

1. Resampling Techniques
2. Class Weighting
3. Weighted Random Sampler
4. Synthetic Data Generation

Step-by-Step Practical Implementation in PyTorch

Understanding Class Imbalance Problem

Class imbalance occurs when the distribution of classes in a dataset is uneven. For instance, in a binary classification problem, if 90% of the samples belong to class A and only 10% belong to class B, the model may become biased towards class A. This bias can result in poor performance on class B, which is often more critical in real-world applications.

When dealing with imbalanced datasets, standard machine learning models tend to favor the majority class. This happens because the loss function is dominated by the majority class’s errors, leading to suboptimal performance on the minority class.
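
To make this concrete, here is a minimal sketch. It is an illustration only: it assumes the 90/10 split from the example above and a hypothetical uninformed model that outputs equal logits for both classes. It measures what share of the total cross-entropy loss comes from each class.

```python
import torch
import torch.nn.functional as F

# Assumed toy setup: 900 majority (class 0) and 100 minority (class 1) samples
targets = torch.cat((torch.zeros(900), torch.ones(100))).long()

# Hypothetical uninformed model: identical logits for both classes on every sample
logits = torch.zeros(1000, 2)

# Per-sample losses (no averaging) so they can be grouped by class
per_sample_loss = F.cross_entropy(logits, targets, reduction="none")
total_loss = per_sample_loss.sum()

majority_share = (per_sample_loss[targets == 0].sum() / total_loss).item()
minority_share = (per_sample_loss[targets == 1].sum() / total_loss).item()

print(f"Majority share of loss: {majority_share:.2f}")
print(f"Minority share of loss: {minority_share:.2f}")
```

Every sample has the same loss in this setup, so the majority class accounts for about 90% of the total. Gradient updates are therefore driven mostly by majority-class samples.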

Techniques to Handle Class Imbalance in PyTorch

There are several techniques to address class imbalance in PyTorch, including:

Resampling Techniques
- Oversampling
- Undersampling
Class Weighting
- Weighted Loss Functions
- Weighted Random Sampler
Synthetic Data Generation
- SMOTE (Synthetic Minority Over-sampling Technique)
- GANs (Generative Adversarial Networks)

1. Resampling Techniques

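Oversampling increases the number of minority-class samples, either by duplicating existing samples or by generating new ones through data augmentation. Plain duplication can encourage overfitting to the repeated samples.
Undersampling reduces the number of majority-class samples to balance the dataset, at the cost of discarding potentially useful data.
In both cases, resample only the training split so that validation and test sets keep the original class distribution.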

Example of Oversampling in PyTorch (drawing minority samples more often with WeightedRandomSampler):

Python

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler, TensorDataset
import numpy as np

data = torch.randn(1000, 10)
targets = torch.cat((torch.zeros(900), torch.ones(100)))  # Imbalanced targets

# Create a dataset
train_dataset = TensorDataset(data, targets)

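# Count the samples in each class, then weight each class by its inverse frequency
class_sample_count = np.array([len(np.where(targets.numpy() == t)[0]) for t in np.unique(targets.numpy())])
weight = 1. / class_sample_count
# Give every sample the weight of its class
samples_weight = np.array([weight[int(t)] for t in targets.numpy()])

samples_weight = torch.from_numpy(samples_weight)
# replacement=True (the default) lets minority samples be drawn repeatedly, which is what oversamples them
sampler = WeightedRandomSampler(samples_weight, len(samples_weight), replacement=True)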

train_loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)

for batch_data, batch_target in train_loader:
    print(batch_data, batch_target)

Output:

tensor([[-1.3654e+00,  2.0988e+00, -1.0405e+00,  4.9436e-01, -6.7986e-01,
         -5.5502e-01,  7.9303e-01,  1.5505e+00,  4.6447e-01, -4.0292e-01],
        [-1.6516e+00,  2.2920e+00, -6.0501e-03,  7.3922e-01,  5.6008e-01,
         -1.3300e+00, -1.0784e+00,  8.0359e-02,  1.0341e-01,  1.4301e+00],
        [-3.1976e-01,  1.3244e+00,  5.3613e-01, -4.8656e-02,  7.4445e-02,
         -2.5417e-01, -2.4022e-01,  8.8676e-01,  7.2845e-01, -1.5441e+00],
        [-5.4181e-01,  7.0553e-01,  4.2019e-01,  7.4735e-01,  1.8736e+00,
          2.1299e+00,  1.4738e+00, -5.1831e-01, -9.4831e-01,  4.6648e-01],
.
.
.
        [ 1.6522, -0.6508, -0.7066, -1.0904,  0.5138,  0.4304,  0.8378,  0.6380,
         -0.0063, -0.8115],
        [ 0.5680,  1.3122, -1.1694, -0.1602,  0.6708,  0.3561, -0.2780, -0.2240,
          0.0845,  0.7573],
        [-0.6904, -3.1126, -0.4480, -1.7536, -0.2844, -0.9535,  0.1079,  1.0787,
          0.9399, -0.1004],
        [ 0.0784, -0.6072, -0.6378, -0.2630,  0.1182,  0.7324,  0.4181, -0.4501,
          0.1779, -0.9345]]) tensor([1., 1., 0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0.,
        0., 1., 1., 1., 1., 0., 0., 0., 1., 0., 0., 1., 1., 0., 1., 1., 0., 0.,
        1., 0., 1., 0.])

Example of Undersampling in PyTorch:

Python

from imblearn.under_sampling import RandomUnderSampler
import numpy as np
import torch

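# Example data and targets: 900 majority samples (class 0), 100 minority samples (class 1)
X = np.random.randn(1000, 10)
y = np.array([0]*900 + [1]*100)

# Randomly drop majority samples until both classes have 100 samples each
# (random_state is fixed only to make the result repeatable)
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)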

# Convert back to PyTorch tensors
X_res = torch.tensor(X_res, dtype=torch.float32)
y_res = torch.tensor(y_res, dtype=torch.long)

print(X_res, y_res)

Output:

tensor([[-0.0529,  0.7972, -0.5212,  ..., -0.5436,  0.6600, -0.2462],
        [ 0.3518,  1.4803, -0.5319,  ...,  2.0695, -0.4088,  0.8578],
        [ 1.0514,  0.0408, -0.3043,  ...,  1.1470,  0.9427,  0.7008],
        ...,
        [ 1.1087,  0.3033,  0.8691,  ..., -0.3177,  0.2189,  1.6276],
        [ 1.4176, -0.2956,  1.7604,  ...,  1.7049, -1.1794, -0.3242],
        [ 0.3839, -0.4644, -0.1465,  ..., -0.6247,  1.1085, -1.2942]]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1])
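
If you prefer not to depend on imbalanced-learn, random undersampling can also be written directly with tensor indexing. The following is a minimal sketch under the same assumed 900/100 toy data. The variable names are illustrative and are not part of any library API.

```python
import torch

torch.manual_seed(0)  # fixed seed only so the sketch is repeatable

# Assumed toy data: 900 majority (class 0) and 100 minority (class 1) samples
X = torch.randn(1000, 10)
y = torch.cat((torch.zeros(900), torch.ones(100))).long()

majority_idx = torch.nonzero(y == 0).squeeze(1)
minority_idx = torch.nonzero(y == 1).squeeze(1)

# Keep a random subset of the majority class that is as large as the minority class
keep = torch.randperm(len(majority_idx))[: len(minority_idx)]
balanced_idx = torch.cat((majority_idx[keep], minority_idx))

X_bal, y_bal = X[balanced_idx], y[balanced_idx]
print(X_bal.shape, torch.bincount(y_bal))
```

The result holds 200 samples, 100 per class. As with the imbalanced-learn version, the 800 discarded majority samples are simply never seen by the model.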

2. Class Weighting

Class weighting adjusts the loss function to penalize the model more for misclassifying minority classes. This can be done by setting the weight parameter in loss functions like CrossEntropyLoss.

Example of Weighted Loss Function:

Python

import torch.nn as nn
import torch

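# One weight per class, indexed by class label:
# errors on class 1 (minority) count 9x as much as errors on class 0
class_weights = torch.tensor([0.1, 0.9])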

# Use the weights in CrossEntropyLoss
criterion = nn.CrossEntropyLoss(weight=class_weights)
outputs = torch.randn(10, 2)
labels = torch.randint(0, 2, (10,))

loss = criterion(outputs, labels)
print(loss.item())

Output:

0.8986063599586487
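
The weights above are hard-coded. In practice they are often derived from the label counts. The sketch below assumes the 900/100 toy labels used elsewhere in this article and applies one common heuristic, total / (num_classes * count). Other weighting schemes are equally valid, so treat this as one option rather than the required formula.

```python
import torch
import torch.nn as nn

# Assumed toy labels: 900 samples of class 0 and 100 samples of class 1
targets = torch.cat((torch.zeros(900), torch.ones(100))).long()

# Number of samples per class, indexed by class label
class_counts = torch.bincount(targets, minlength=2).float()
num_classes = len(class_counts)

# Inverse-frequency heuristic: rarer classes receive proportionally larger weights
class_weights = class_counts.sum() / (num_classes * class_counts)

criterion = nn.CrossEntropyLoss(weight=class_weights)

# Usage on random logits, just to show that the criterion is callable
logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = criterion(logits, labels)
print(class_weights, loss.item())
```

For this split the weights come out at roughly 0.56 for the majority class and 5.0 for the minority class, so a minority-class error costs about nine times as much as a majority-class error.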

3. Weighted Random Sampler

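The WeightedRandomSampler in PyTorch draws samples with probability proportional to a per-sample weight. With inverse-frequency weights, batches are roughly class-balanced on average, although an individual batch is not guaranteed to be exactly balanced.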

Example of Weighted Random Sampler:

Python

from torch.utils.data import DataLoader, WeightedRandomSampler, TensorDataset
import numpy as np
import torch

# Example data and targets
data = torch.randn(1000, 10)
targets = torch.cat((torch.zeros(900), torch.ones(100)))  # Imbalanced targets

# Create a dataset
train_dataset = TensorDataset(data, targets)

# Calculate weights for each class
class_sample_count = np.array([len(np.where(targets.numpy() == t)[0]) for t in np.unique(targets.numpy())])
weight = 1. / class_sample_count
samples_weight = np.array([weight[int(t)] for t in targets.numpy()])

samples_weight = torch.from_numpy(samples_weight)
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

train_loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)
for batch_data, batch_target in train_loader:
    print(batch_data, batch_target)

Output:

tensor([[-4.2689e-01,  3.3830e-01,  6.0396e-03, -1.4052e-01,  1.0600e+00,
         -1.4388e+00,  6.6914e-02, -3.2933e-02,  1.3498e+00,  1.3142e+00],
        [ 8.4668e-01, -1.4698e-01, -3.5705e-01,  1.0168e+00,  6.5028e-01,
          4.1976e-01, -9.7244e-01, -5.4900e-01, -8.7519e-01, -7.5931e-01],
        [ 1.6669e-01, -3.6750e-01,  2.7809e+00, -1.7411e+00, -1.1054e+00,
          1.2962e+00,  6.3433e-01, -3.2507e-02, -2.5889e-01,  1.4207e+00],
        [ 1.8596e-01, -1.6354e-01,  6.7141e-01, -4.7348e-02,  6.6376e-01,
         -1.4234e+00,  6.0774e-01, -2.2348e-01, -2.2053e+00, -1.1837e+00],
        [-3.4800e-01,  8.8325e-01, -1.9079e+00, -4.4495e-01, -4.3775e-01,
         -4.5938e-01,  3.7062e-01, -1.1976e+00,  1.2333e+00,  1.4009e+00],
        [ 2.0557e+00, -8.8572e-01, -5.3733e-01, -3.8578e-01, -1.6796e+00,
.
.
.
  [-0.2795,  0.3005, -0.4412,  0.8036, -1.8333, -0.8897,  0.0272,  0.8428,
          1.2359, -0.4372],
        [ 1.6522, -0.6508, -0.7066, -1.0904,  0.5138,  0.4304,  0.8378,  0.6380,
         -0.0063, -0.8115],
        [ 0.5680,  1.3122, -1.1694, -0.1602,  0.6708,  0.3561, -0.2780, -0.2240,
          0.0845,  0.7573],
        [-0.6904, -3.1126, -0.4480, -1.7536, -0.2844, -0.9535,  0.1079,  1.0787,
          0.9399, -0.1004],
        [ 0.0784, -0.6072, -0.6378, -0.2630,  0.1182,  0.7324,  0.4181, -0.4501,
          0.1779, -0.9345]]) tensor([1., 1., 0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0.,
        0., 1., 1., 1., 1., 0., 0., 0., 1., 0., 0., 1., 1., 0., 1., 1., 0., 0.,
        1., 0., 1., 0.])
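
Printing whole batches makes it hard to see whether the sampler is doing its job. A quick check, sketched below under the same assumed 900/100 labels, is to draw one epoch of indices and measure the fraction that belong to the minority class. The seeded generator is there only to make the check repeatable.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Assumed toy labels: 900 samples of class 0 and 100 samples of class 1
targets = torch.cat((torch.zeros(900), torch.ones(100))).long()

# Inverse-frequency weight per class, expanded to one weight per sample
class_counts = torch.bincount(targets).float()
samples_weight = (1.0 / class_counts)[targets]

generator = torch.Generator().manual_seed(0)
sampler = WeightedRandomSampler(
    samples_weight,
    num_samples=len(targets),
    replacement=True,
    generator=generator,
)

# Iterating the sampler yields dataset indices, not the samples themselves
drawn_idx = torch.tensor(list(sampler))
minority_fraction = (targets[drawn_idx] == 1).float().mean().item()
print(f"Minority fraction in one epoch of draws: {minority_fraction:.3f}")
```

The fraction should land close to 0.5 instead of the original 0.1. Because sampling is with replacement, many minority samples appear several times per epoch while some majority samples are not drawn at all.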

4. Synthetic Data Generation

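SMOTE generates synthetic samples for the minority class by interpolating between an existing minority sample and one of its nearest minority-class neighbors.
GANs can also be used to generate new, realistic samples for the minority class, although they require training a separate generative model.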

Example of SMOTE:

Python

from imblearn.over_sampling import SMOTE
import numpy as np
import torch

X = np.random.randn(1000, 10)
y = np.array([0]*900 + [1]*100)

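# Synthesize minority samples until both classes have 900 samples each
# (random_state is fixed only to make the result repeatable)
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)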

# Convert back to PyTorch tensors
X_res = torch.tensor(X_res, dtype=torch.float32)
y_res = torch.tensor(y_res, dtype=torch.long)

print(X_res, y_res)

Output:

tensor([[ 0.2406, -0.7238, -2.0000,  ...,  0.4000,  0.8167,  0.5230],
        [-0.8474, -0.4665,  0.7510,  ...,  0.1358,  1.3370,  1.5177],
        [-0.5717, -0.4534,  0.7563,  ...,  0.6926,  1.4012,  1.4177],
        ...,
        [-1.0626,  0.0230,  2.3072,  ...,  1.0812,  1.4202,  0.0867],
        [ 0.9111, -0.6970,  0.4518,  ..., -0.6681,  0.4710,  0.9381],
        [-0.7216,  0.1150,  0.6139,  ...,  0.6164, -0.7479,  2.1608]]) tensor([0, 0, 0,  ..., 1, 1, 1])
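
To show what "interpolating between existing samples" means, here is a simplified sketch of the core SMOTE step in plain PyTorch. It is not the imbalanced-learn implementation: the toy data, the neighbor count k, the number of synthetic points, and all variable names are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)  # fixed seed only so the sketch is repeatable

# Assumed toy minority class: 100 samples with 10 features
X_min = torch.randn(100, 10)
k = 5               # nearest neighbors considered per sample (assumed value)
n_synthetic = 200   # number of synthetic samples to create (assumed value)

# Pairwise distances among minority samples; mask the diagonal to exclude self-matches
dist = torch.cdist(X_min, X_min)
dist.fill_diagonal_(float("inf"))
neighbors = dist.topk(k, largest=False).indices  # shape (100, k)

# For each synthetic point, pick a random base sample and one of its k neighbors
base = torch.randint(0, len(X_min), (n_synthetic,))
pick = torch.randint(0, k, (n_synthetic,))
neighbor_idx = neighbors[base, pick]

# Interpolate: a random point on the segment between the base sample and its neighbor
lam = torch.rand(n_synthetic, 1)
X_syn = X_min[base] + lam * (X_min[neighbor_idx] - X_min[base])
print(X_syn.shape)
```

Each synthetic sample lies on the line segment between two real minority samples, so the new points stay inside the region the minority class already occupies instead of being exact duplicates.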

Step-by-Step Practical Implementation in PyTorch

Let’s walk through a practical implementation of handling class imbalance in a PyTorch project. We’ll use a simple neural network for a classification task.

Step 1: Prepare the Dataset

Python

import torch
from torch.utils.data import Dataset, DataLoader

class ImbalancedDataset(Dataset):
    def __init__(self, data, targets):
        self.data = data
        self.targets = targets

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]

# Example data
data = torch.randn(1000, 10)
targets = torch.cat((torch.zeros(900), torch.ones(100)))  # Imbalanced targets

dataset = ImbalancedDataset(data, targets)

Step 2: Apply Weighted Random Sampler

Python

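from torch.utils.data import WeightedRandomSampler  # not imported in Step 1

# Number of samples in each class, ordered by class label
class_sample_count = torch.tensor([(targets == t).sum() for t in torch.unique(targets)])
# Inverse-frequency weight per class, then one weight per sample
weight = 1. / class_sample_count.float()
samples_weight = torch.tensor([weight[int(t)] for t in targets])

sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

# Pass the sampler instead of shuffle=True; DataLoader does not accept both
train_loader = DataLoader(dataset, batch_size=64, sampler=sampler)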

for batch_data, batch_target in train_loader:
    print(batch_data, batch_target)

Output:

tensor([[-9.2573e-01,  1.3661e+00,  1.8957e+00, -6.0163e-01, -1.0795e+00,
         -2.9709e-01,  6.4180e-01, -6.0223e-01, -1.0173e+00, -6.7902e-01],
        [-1.3580e+00, -7.5121e-01,  6.0977e-01,  2.7208e-01,  2.8799e-01,
         -1.1380e+00,  3.5168e-01, -5.4055e-01,  1.4824e+00, -7.8375e-03],
        [-2.2738e-01,  7.7970e-01,  3.2662e-01,  1.1474e+00, -2.3966e+00,
          7.3966e-01, -7.9589e-01, -5.1916e-01,  6.8310e-01, -1.0050e+00],
.
.
.
        [ 9.2717e-01,  9.3561e-02,  5.3306e-01, -3.3107e-01, -5.6605e-01,
          2.9753e-01,  9.1074e-01,  1.0241e+00, -8.9280e-01,  1.1524e+00],
        [-7.1160e-01,  8.4537e-01, -2.8062e-01, -4.1471e-01, -1.7021e+00,
          8.1715e-01,  7.1224e-01,  1.6675e-01,  2.4430e-01, -1.5401e+00],
        [ 2.0947e+00,  7.5216e-01, -6.6363e-01,  1.4187e-01, -9.8227e-01,
         -2.0121e-01,  3.1274e-01,  7.8528e-01, -1.1350e+00, -2.8751e-01]]) tensor([0., 0., 1., 0., 0., 1., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 0., 1., 0., 0., 1., 1.,
        1., 0., 0., 0.])

Step 3: Define the Model and Loss Function

Python

import torch.nn as nn
import torch.optim as optim

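class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)  # 10 input features -> 50 hidden units
        self.fc2 = nn.Linear(50, 2)   # 50 hidden units -> 2 class logits

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)  # raw logits; CrossEntropyLoss applies log-softmax itself
        return x

model = SimpleNN()
# Unweighted loss: the sampler from Step 2 already rebalances the batches
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)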

Step 4: Train the Model

Python

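num_epochs = 10

for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        optimizer.zero_grad()  # clear gradients from the previous step
        outputs = model(inputs)
        # CrossEntropyLoss expects integer class indices, so cast the float targets
        loss = criterion(outputs, labels.long())
        loss.backward()
        optimizer.step()

    # This reports the loss of the last batch only, so it fluctuates between epochs
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}')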

Output:

Epoch 1/10, Loss: 0.6750123500823975
Epoch 2/10, Loss: 0.6318653225898743
Epoch 3/10, Loss: 0.6800233721733093
Epoch 4/10, Loss: 0.6590889096260071
Epoch 5/10, Loss: 0.6862348318099976
Epoch 6/10, Loss: 0.6653190851211548
Epoch 7/10, Loss: 0.5675309896469116
Epoch 8/10, Loss: 0.6686651706695557
Epoch 9/10, Loss: 0.6834089756011963
Epoch 10/10, Loss: 0.7011194825172424
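
The loss on a single batch says little about how the minority class is handled. A per-class recall check is usually more informative. The sketch below uses a hypothetical helper, per_class_recall, and made-up predictions for illustration. In a real project the predictions would come from something like model(inputs).argmax(dim=1) on a held-out set that keeps the original class distribution.

```python
import torch

def per_class_recall(preds, labels, num_classes):
    """Hypothetical helper: fraction of each class's samples that were predicted correctly."""
    recalls = []
    for c in range(num_classes):
        mask = labels == c
        if mask.sum() == 0:
            # No samples of this class: report NaN instead of dividing by zero
            recalls.append(float("nan"))
        else:
            recalls.append((preds[mask] == c).float().mean().item())
    return recalls

# Assumed toy labels: 900 samples of class 0 followed by 100 samples of class 1
labels = torch.cat((torch.zeros(900), torch.ones(100))).long()

# Made-up predictions: every majority sample correct, half of the minority samples found
preds = torch.zeros(1000).long()
preds[900:950] = 1

accuracy = (preds == labels).float().mean().item()
recalls = per_class_recall(preds, labels, num_classes=2)
print(f"Accuracy: {accuracy:.2f}, recall per class: {recalls}")
```

Here the overall accuracy is 0.95 even though only half of the minority samples are recognized, which is why accuracy alone can hide the effect of class imbalance.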

Conclusion

Handling class imbalance is crucial for building robust machine learning models. In PyTorch, resampling, class weighting, and synthetic data generation each help counteract the bias towards the majority class. Which one works best depends on the dataset, so compare them using per-class metrics rather than overall accuracy alone.

Referred: https://www.geeksforgeeks.org
