Handling Class Imbalance
Methods for handling class imbalance in data sets
import pandas as pd

DATA_URL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data'
Handling Imbalanced Classes
From this article
Class imbalance is a common occurrence and can easily skew a model, yielding accuracy scores that overstate its real performance
An example of a class imbalance can be seen in the balance scale dataset
from here:
df = pd.read_csv(
DATA_URL,
names=['balance', 'var1', 'var2', 'var3', 'var4']
)
# transform into binary classification
df['balance'] = [1 if b=='B' else 0 for b in df.balance]
df.head()
| | balance | var1 | var2 | var3 | var4 |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 | 1 | 2 |
| 2 | 0 | 1 | 1 | 1 | 3 |
| 3 | 0 | 1 | 1 | 1 | 4 |
| 4 | 0 | 1 | 1 | 1 | 5 |
df.balance.count()
625
In the above dataset the class distribution of balance
is imbalanced: we have 576 instances of the False
class but only 49 of the True
class. We can see this below:
df.balance.value_counts()
0 576
1 49
Name: balance, dtype: int64
This means that even a model that only ever returns False
will be correct 576/625 (about 92%) of the time, which isn't very meaningful to us. Since most ML algorithms try to optimize accuracy, they will tend toward something close to the above ratio
A model like that would have good accuracy overall, but be very bad at predicting the True
values
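We can demonstrate this baseline directly with sklearn's DummyClassifier. The sketch below rebuilds the 576/49 class ratio synthetically so it runs without downloading the dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Mimic the class ratio above: 576 negatives, 49 positives
X = np.zeros((625, 1))  # features are irrelevant to this baseline
y = np.array([0] * 576 + [1] * 49)

# A "model" that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)

# Accuracy looks impressive, but the minority class is never predicted
print(round(baseline.score(X, y), 3))  # 576/625 -> 0.922
```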
In order to counter this effect we need to balance our data in some way; there are a few options:
- Down-sample the majority class - potential data loss
- Up-sample the minority class - potential overfitting on minority class
- Penalize imbalanced predictions (if your model supports it)
- Use a tree-based algorithm (claimed to work in theory, though I've had mixed results in practice)
- Use Synthetic Samples for minority data (kind of like up-sampling)
- Use an Anomaly Detection algorithm to identify your minority classes
We can do the first two of these using the sklearn.utils.resample
function
Try both for your model and see which yields better results
from sklearn.utils import resample
1. Down-sample the majority class
df_minority = df[df.balance == 1]
df_majority = df[df.balance == 0]
df_majority_downsampled = resample(
df_majority,
replace=False,
n_samples=len(df_minority),
random_state=0
)
df_balanced = pd.concat([df_minority, df_majority_downsampled])
df_balanced.balance.value_counts()
1 49
0 49
Name: balance, dtype: int64
2. Up-sample the minority class
df_minority = df[df.balance == 1]
df_majority = df[df.balance == 0]
df_minority_upsampled = resample(
df_minority,
replace=True,
n_samples=len(df_majority),
random_state=0
)
df_balanced = pd.concat([df_minority_upsampled, df_majority])
df_balanced.balance.value_counts()
1 576
0 576
Name: balance, dtype: int64
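One caveat worth noting (standard practice, though not shown above): resample only the training split, otherwise duplicated minority rows can leak into the test set and inflate scores. A sketch using a small synthetic frame in place of df:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Small synthetic frame standing in for df above (5 minority, 45 majority)
df_demo = pd.DataFrame({
    'balance': [1] * 5 + [0] * 45,
    'var1': range(50),
})

# Split FIRST, so duplicated minority rows can't leak into the test set
train, test = train_test_split(
    df_demo, test_size=0.2, stratify=df_demo.balance, random_state=0
)

minority = train[train.balance == 1]
majority = train[train.balance == 0]

# Up-sample only within the training split
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
train_balanced = pd.concat([minority_up, majority])

print(train_balanced.balance.value_counts())
```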
3. Penalize mistakes on minority classes
If we're using a classifier that supports penalizing incorrect predictions on the minority class, such as sklearn.svm.SVC
, we can set class_weight='balanced'
to make this happen
from sklearn.svm import SVC
X = df.drop('balance', axis=1)
y = df.balance
model = SVC(
kernel='linear',
class_weight='balanced',
probability=True
)
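The model can then be fit as usual; with class_weight='balanced' each class's errors are weighted inversely to its frequency, so the minority class is no longer ignored. A quick sketch on synthetic imbalanced data standing in for X and y above:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Synthetic ~90/10 imbalanced data standing in for X and y above
X_demo, y_demo = make_classification(
    n_samples=300, weights=[0.9, 0.1], random_state=0
)

model = SVC(kernel='linear', class_weight='balanced', probability=True)
model.fit(X_demo, y_demo)

# Unlike the majority-class baseline, this model does predict the minority class
preds = model.predict(X_demo)
print((preds == 1).sum())
```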
Other Methods
The methods above can be helpful at a general level, however some other approaches worth looking into are:
Synthetic Sampling
Synthetic sampling is a method of up-sampling that slightly perturbs the copied samples so they are not identical to the originals
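As a rough illustration of the idea (not a production implementation - the imbalanced-learn library provides SMOTE for real use), a new synthetic point can be made by interpolating between a minority sample and its nearest neighbour:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class feature matrix
minority = np.array([[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]])

def synthetic_sample(X, rng):
    """Interpolate between a random sample and its nearest neighbour."""
    i = rng.integers(len(X))
    # find the nearest neighbour of X[i], excluding itself
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf
    j = dists.argmin()
    # place the new point a random fraction of the way towards the neighbour
    return X[i] + rng.random() * (X[j] - X[i])

new_points = np.array([synthetic_sample(minority, rng) for _ in range(5)])
print(new_points.shape)  # (5, 2)
```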
Anomaly Detection
If trying to detect a specific class that doesn't occur very often, it may be useful to use an anomaly detection algorithm to identify those class instances. This can work well in cases where the class you're trying to identify has some 'abnormal' characteristics
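A sketch of this framing using sklearn's IsolationForest on synthetic data, treating the rare class as anomalies. The contamination parameter is set to the expected minority fraction, which is an assumption you'd estimate from your own data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Majority class clustered near the origin, rare class far away
normal = rng.normal(0, 1, size=(200, 2))
rare = rng.normal(8, 1, size=(10, 2))
X = np.vstack([normal, rare])

# contamination = assumed fraction of anomalies in the data
detector = IsolationForest(contamination=10 / 210, random_state=0)
labels = detector.fit_predict(X)  # -1 = anomaly, 1 = normal

# The points flagged as anomalies should mostly be the rare cluster
print((labels == -1).sum())
```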