Class imbalance problem
Prabhant Singh & Junaid Ahmed
Slide 2
Class imbalance problem
● Majority class
○ Far more examples
● Minority class
○ Far fewer examples
● Level of imbalance
○ Number of majority-class examples / number of minority-class examples (see the sketch below)
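For concreteness, a minimal Python sketch of how this level of imbalance could be computed; the labels and counts are purely illustrative:

from collections import Counter

# Toy labels: 980 majority ("neg") examples and 20 minority ("pos") examples.
y = ["neg"] * 980 + ["pos"] * 20
counts = Counter(y)
n_majority = max(counts.values())
n_minority = min(counts.values())
# Level of imbalance = number of majority examples / number of minority examples.
imbalance_level = n_majority / n_minority
print(imbalance_level)  # 49.0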
Slide 3
Class imbalance problem
● Hard to achieve a high level of accuracy on the minority class when there is class imbalance
○ The majority class dominates the dataset
○ Even a small level of imbalance makes a difference
● The learning process suffers from an imbalanced dataset when
○ the minority classes are more important or cannot be sacrificed
Slide 4
Class imbalance problem
The class imbalance problem refers to the scenario in which the data are imbalanced and
● the minority classes are important
● the minority class has a higher misclassification cost than the majority class
Slide 5
Class imbalance problem
Strategies to counter the class imbalance problem (both sketched below):
● Undersampling
○ Decreases the number of majority-class examples
■ May lose useful information
● Oversampling
○ Increases the number of minority-class examples
■ May lead to overfitting
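A minimal Python sketch of both strategies, assuming NumPy arrays X (features) and y (binary labels, 1 = minority class); the function names are illustrative, not a library API:

import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X, y, minority_label=1):
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    # Keep only as many majority examples as there are minority examples.
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([min_idx, keep])
    return X[idx], y[idx]

def random_oversample(X, y, minority_label=1):
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    # Duplicate minority examples (with replacement) until the classes are balanced.
    extra = rng.choice(min_idx, size=len(maj_idx), replace=True)
    idx = np.concatenate([maj_idx, extra])
    return X[idx], y[idx]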
Slide 6
Class imbalance problem
To improve undersampling:
selectively remove majority-class examples so that the more informative examples are kept.
● One-sided selection (see the sketch after this list)
○ The initial subset contains all minority examples plus one randomly selected majority example
○ A 1-NN classifier built on this subset classifies the remaining majority examples
○ Misclassified (informative) majority examples are kept; correctly classified (redundant) ones are removed
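A rough Python sketch of the 1-NN step of one-sided selection, under the same assumptions as before (NumPy arrays X, y with 1 = minority); the Tomek-link cleaning step of the published method is omitted and all names are illustrative:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def one_sided_selection(X, y, minority_label=1, seed=0):
    rng = np.random.default_rng(seed)
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    # Subset C: all minority examples plus one randomly chosen majority example.
    subset = np.concatenate([min_idx, rng.choice(maj_idx, size=1)])
    knn = KNeighborsClassifier(n_neighbors=1).fit(X[subset], y[subset])
    # Classify the remaining majority examples with the 1-NN rule and keep only
    # the misclassified (informative) ones; correctly classified ones are dropped.
    rest = np.setdiff1d(maj_idx, subset)
    misclassified = rest[knn.predict(X[rest]) != y[rest]]
    keep = np.concatenate([subset, misclassified])
    return X[keep], y[keep]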
Slide 7
Class imbalance problem
To improve oversampling:
some methods use synthetic examples instead of exact copies to reduce the risk of overfitting.
● SMOTE (see the sketch after this list)
○ Select a random minority example
○ Select one of its nearest neighbors
○ Create a new synthetic example by interpolating between the two
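A minimal SMOTE-style sketch in Python, assuming X_min is a NumPy array of minority-class examples; the helper name and parameters are illustrative:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    # k + 1 because each example is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                        # select a random minority example
        j = rng.choice(neigh[i][1:])                        # select one of its k nearest neighbors
        gap = rng.random()
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))  # interpolate to create a new example
    return np.array(new)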
Performance Evaluation with Class Imbalance
ROC (see the example after this list)
● Receiver Operating Characteristic
● True positive rate (TPR) on the y-axis
● False positive rate (FPR) on the x-axis
AUC
● Area under the ROC curve
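A small Python example with scikit-learn's roc_curve and roc_auc_score; the labels and scores below are made-up toy values:

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 0, 0, 0, 1, 0, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.4, 0.8, 0.5, 0.9]  # classifier scores for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_score)    # FPR on the x-axis, TPR on the y-axis
auc = roc_auc_score(y_true, y_score)                 # area under the ROC curve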
Slide 10
Performance Evaluation with Class Imbalance
G-mean or Geometric mean
● The geometric mean of the accuracy on each class (see the example after this list)
● A good candidate for evaluating class-imbalance learning performance
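A sketch of the G-mean for the binary case, computed from a scikit-learn confusion matrix on made-up labels:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # accuracy on the positive (minority) class
specificity = tn / (tn + fp)   # accuracy on the negative (majority) class
g_mean = np.sqrt(sensitivity * specificity)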
Slide 11
Performance Evaluation with Class Imbalance
Precision
The fraction of examples classified as positive that are really positive
Recall
The fraction of positive examples that are correctly classified as positive (see the example below)
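In confusion-matrix terms, precision = TP / (TP + FP) and recall = TP / (TP + FN); a small check with scikit-learn on made-up labels:

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision = precision_score(y_true, y_pred)  # of examples predicted positive, how many are truly positive (0.75)
recall = recall_score(y_true, y_pred)        # of truly positive examples, how many were predicted positive (0.75)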
Slide 12
Performance Evaluation with Class Imbalance
F-measure
The F-measure is defined as the harmonic mean of precision and recall (see the example below)
● Precision does not contain any information about FN (false negatives)
● Recall does not contain any information about FP (false positives)
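The formula is F = 2 · precision · recall / (precision + recall); a minimal Python check with illustrative values, using scikit-learn's f1_score as the reference:

from sklearn.metrics import f1_score

precision, recall = 0.75, 0.75
f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean = 0.75

# Equivalently, computed directly from the labels used in the previous example:
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(f1_score(y_true, y_pred))  # 0.75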
Slide 13
Ensemble Methods for Class-Imbalance Learning
Slide 14
Data preprocessing methods
● Random under- and oversampling
● SMOTE: Synthetic minority oversampling technique
● MSMOTE
● SPIDER: Selective preprocessing of imbalanced data
Slide 16
Addressing the Class Imbalance Problem with Classifier Ensembles
● Cost-sensitive boosting ensembles
● Boosting-based ensembles
● Bagging-based ensembles
● Hybrid ensembles
Slide 17
Cost-sensitive ensembles
● AdaCost
● CSB
● RareBoost
● AdaC1, AdaC2, AdaC3
(covered in previous lectures)
Slide 18
Boosting-based ensembles
● SMOTEBoost
● RUSBoost
Slide 19
SMOTEBoost
● Works just like AdaBoost
● Uses SMOTE for preprocessing in each boosting round
● Algorithm (sketched after this list)
○ Generate synthetic minority examples in each iteration
○ Train a classifier on the augmented data
○ Compute the loss of the new classifier
○ Update the instance weights (as in AdaBoost)
● Advantage: adds more diversity to the data
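A very rough Python sketch of this loop, using a simplified AdaBoost-style weight update rather than the exact published AdaBoost.M2 procedure; it reuses the illustrative smote_sample helper from the SMOTE sketch above, and all other names are assumptions as well:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def smoteboost(X, y, n_rounds=10, n_synth=100):
    w = np.ones(len(y)) / len(y)                       # instance weights on the original data
    models, alphas = [], []
    for _ in range(n_rounds):
        X_syn = smote_sample(X[y == 1], n_synth)       # generate synthetic minority examples each round
        X_t = np.vstack([X, X_syn])
        y_t = np.concatenate([y, np.ones(len(X_syn))])
        w_t = np.concatenate([w, np.full(len(X_syn), w[y == 1].mean())])
        clf = DecisionTreeClassifier(max_depth=1).fit(X_t, y_t, sample_weight=w_t)
        pred = clf.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)      # weighted loss of the new classifier
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(alpha * (pred != y))            # AdaBoost-style weight update on the original data
        w = w / w.sum()
        models.append(clf)
        alphas.append(alpha)
    return models, alphas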
Slide 20
RUSBoost
● Performs similarly to SMOTEBoost
● Removes instances from the majority class by randomly undersampling the data in each iteration (see the sketch after this list)
● Does not necessarily assign new weights to the instances
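A sketch of the per-round resampling step that would replace the SMOTE step in the boosting loop above: majority examples are randomly dropped each iteration and the existing weights are simply carried over (names are illustrative):

import numpy as np

def rus_round(X, y, w, seed=0):
    rng = np.random.default_rng(seed)
    min_idx = np.where(y == 1)[0]
    maj_idx = np.where(y == 0)[0]
    # Randomly undersample the majority class down to the minority-class size.
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([min_idx, keep])
    return X[idx], y[idx], w[idx]   # weights are reused, not reassigned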
Slide 21
Bagging-based ensembles (UnderBagging sketched below)
● UnderBagging
● OverBagging
● UnderOverBagging
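A minimal UnderBagging sketch in Python, assuming NumPy arrays X and y with 1 = minority; each ensemble member is trained on a balanced sample obtained by random undersampling, and the members vote at prediction time. OverBagging would oversample the minority class instead. All names are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def underbagging(X, y, n_estimators=10, seed=0):
    rng = np.random.default_rng(seed)
    min_idx = np.where(y == 1)[0]
    maj_idx = np.where(y == 0)[0]
    models = []
    for _ in range(n_estimators):
        # Each member sees all minority examples and a fresh random majority subset.
        keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, keep])
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def predict(models, X):
    votes = np.mean([m.predict(X) for m in models], axis=0)  # fraction of members voting "minority"
    return (votes >= 0.5).astype(int)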