Slide 1

Slide 1 text

Class imbalance problem
Prabhant Singh & Junaid Ahmed

Slide 2

Slide 2 text

Class imbalance problem
● Majority class
  ○ Far more examples
● Minority class
  ○ Far fewer examples
● Level of imbalance
  ○ Number of majority-class examples / number of minority-class examples
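The level of imbalance can be computed directly from the label counts; a minimal sketch on a hypothetical 95-vs-5 label list:

```python
from collections import Counter

# Hypothetical label list: 95 majority-class vs 5 minority-class examples.
labels = [0] * 95 + [1] * 5

counts = Counter(labels)
majority = max(counts.values())  # 95
minority = min(counts.values())  # 5

# Level of imbalance = majority count / minority count.
imbalance_ratio = majority / minority
print(imbalance_ratio)  # 19.0
```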

Slide 3

Slide 3 text

Class imbalance problem
● High accuracy is hard to achieve when classes are imbalanced
  ○ The majority class dominates the dataset
  ○ Even a small level of imbalance makes a difference
● The learning process suffers from an imbalanced dataset when
  ○ Minority classes are more important or can't be sacrificed

Slide 4

Slide 4 text

Class imbalance problem
The class imbalance problem arises when the data is imbalanced and
● Minority classes are important
● Errors on the minority class cost more than errors on the majority class

Slide 5

Slide 5 text

Class imbalance problem
Strategies to counter the class imbalance problem:
● Undersampling
  ○ Decreases the number of majority-class examples
    ■ May lose useful information
● Oversampling
  ○ Increases the number of minority-class examples
    ■ Risks overfitting
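Both strategies can be sketched with the standard library alone; a minimal example on a hypothetical 90-vs-10 dataset (the tuples stand in for feature vectors):

```python
import random

random.seed(0)
majority = [("maj", i) for i in range(90)]
minority = [("min", i) for i in range(10)]

# Random undersampling: drop majority examples down to the minority count
# (risks discarding useful information).
under = random.sample(majority, len(minority)) + minority

# Random oversampling: duplicate minority examples up to the majority count
# (exact copies can encourage overfitting).
over = majority + random.choices(minority, k=len(majority))

print(len(under), len(over))  # 20 180
```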

Slide 6

Slide 6 text

Class imbalance problem
To improve undersampling, selectively remove majority-class examples so that the more informative examples are kept.
● One-sided selection
  ○ Start from a subset containing all minority examples plus one random majority example
  ○ Classify the remaining examples with a 1-NN classifier built on this subset
  ○ Add the misclassified (boundary) examples to the subset
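A minimal sketch of this selective undersampling idea on hypothetical 1-D data (keeping all minority points and only the majority points that a 1-NN classifier on the current subset misclassifies, i.e. the ones near the boundary):

```python
import random

random.seed(1)
# Hypothetical 1-D dataset: (value, label); 1 = minority, 0 = majority.
majority = [(random.gauss(0.0, 1.0), 0) for _ in range(50)]
minority = [(random.gauss(3.0, 1.0), 1) for _ in range(5)]

def nn_label(x, subset):
    # 1-NN: label of the closest example in the current subset.
    return min(subset, key=lambda p: abs(p[0] - x))[1]

# Start with all minority examples plus one random majority example.
subset = minority + [random.choice(majority)]

# Keep only the majority examples the 1-NN misclassifies: those lie
# near the class boundary and are the informative ones.
for x, y in majority:
    if nn_label(x, subset) != y:
        subset.append((x, y))

print(len(subset))
```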

Slide 7

Slide 7 text

Class imbalance problem
To improve oversampling, some methods use synthetic examples instead of exact copies to reduce the risk of overfitting.
● SMOTE
  ○ Select a random minority example
  ○ Select one of its nearest minority neighbors
  ○ Create a new example by interpolating between the two
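The three SMOTE steps can be sketched in a few lines on hypothetical 2-D minority points (`k` is the neighborhood size, here 2 for illustration):

```python
import random

random.seed(2)
# Hypothetical 2-D minority-class points.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (3.0, 2.5)]

def smote_one(points, k=2):
    # 1. Select a random minority example.
    a = random.choice(points)
    # 2. Select one of its k nearest minority neighbors.
    others = sorted((p for p in points if p != a),
                    key=lambda p: (p[0] - a[0])**2 + (p[1] - a[1])**2)
    b = random.choice(others[:k])
    # 3. Create a new example on the segment between a and b.
    t = random.random()
    return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))

synthetic = [smote_one(minority) for _ in range(3)]
print(synthetic)
```

Because each new point is an interpolation, it always lies inside the region spanned by the minority class rather than duplicating an existing point.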

Slide 8

Slide 8 text

Performance Evaluation with Class Imbalance
● AUC
● ROC
● F-measure
● G-mean
● Precision-recall curve
  ○ Precision
  ○ Recall

Slide 9

Slide 9 text

Performance Evaluation with Class Imbalance
ROC
● Receiver Operating Characteristic curve
● True positive rate (TPR) on the y-axis
● False positive rate (FPR) on the x-axis
AUC
● Area under the ROC curve
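AUC can be computed without tracing the curve: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counting one half). A minimal sketch on hypothetical scores:

```python
# Hypothetical labels and classifier scores (higher = more positive).
y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]

# Fraction of positive/negative pairs ranked correctly = area under
# the ROC curve of TPR (y-axis) vs FPR (x-axis).
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(auc)  # 0.888... (8 of 9 pairs ranked correctly)
```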

Slide 10

Slide 10 text

Performance Evaluation with Class Imbalance
G-mean (geometric mean)
● Geometric mean of the accuracy on each class
● A good candidate for evaluating class-imbalance learning performance, since a high score requires doing well on both classes

Slide 11

Slide 11 text

Performance Evaluation with Class Imbalance
Precision
● What fraction of the examples classified as positive are really positive?
Recall
● What fraction of the positive examples are correctly classified as positive?
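Both quantities follow directly from the confusion counts; a minimal sketch on hypothetical counts (TP=8, FP=4, FN=2):

```python
# Hypothetical confusion counts for the positive (minority) class.
tp, fp, fn = 8, 4, 2

# Precision: of everything classified positive, how much is really positive?
precision = tp / (tp + fp)   # 8/12
# Recall: of all real positives, how many were classified positive?
recall = tp / (tp + fn)      # 8/10
print(precision, recall)
```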

Slide 12

Slide 12 text

Performance Evaluation with Class Imbalance
F-measure
The F-measure is defined as the harmonic mean of precision and recall.
● Precision does not contain any information about FN
● Recall does not contain any information about FP
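Since precision ignores FN and recall ignores FP, the harmonic mean penalizes a model that trades one for the other. A minimal sketch on the same hypothetical counts (TP=8, FP=4, FN=2):

```python
# Hypothetical confusion counts.
tp, fp, fn = 8, 4, 2
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F-measure: harmonic mean of precision and recall, so it is high
# only when BOTH are high.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7273
```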

Slide 13

Slide 13 text

Ensemble Methods for Class-Imbalance Learning

Slide 14

Slide 14 text

Data preprocessing methods
● Random undersampling and oversampling
● SMOTE: Synthetic Minority Oversampling TEchnique
● MSMOTE: Modified SMOTE
● SPIDER: Selective Preprocessing of Imbalanced Data

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

Addressing Class Imbalance Problem With Classifier Ensembles
● Cost-sensitive boosting ensembles
● Boosting-based ensembles
● Bagging-based ensembles
● Hybrid ensembles

Slide 17

Slide 17 text

Cost-sensitive ensembles
● AdaCost
● CSB
● RareBoost
● AdaC1, AdaC2, AdaC3 (covered in previous lectures)

Slide 18

Slide 18 text

Boosting-based ensembles
● SMOTEBoost
● RUSBoost

Slide 19

Slide 19 text

SMOTEBoost
● Works like AdaBoost
● Uses SMOTE for preprocessing
● Algorithm (each boosting iteration)
  ○ Generate synthetic minority examples
  ○ Train a classifier on the augmented data
  ○ Compute the loss of the new classifier
  ○ Update the weights (as in AdaBoost)
● Advantage: adds more diversity to the data
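The per-iteration steps above can be sketched as an AdaBoost loop with a SMOTE step at the top. Everything here is a simplified illustration on hypothetical 1-D data: the weak learner is a brute-force decision stump, SMOTE is reduced to 1-D interpolation, and weight bookkeeping for the synthetic points is a crude assumption (they receive the current minimum weight), not the paper's exact scheme:

```python
import math
import random

random.seed(3)
# Hypothetical 1-D data: majority class near 0.0, minority near 2.0;
# labels in {-1, +1} as usual for AdaBoost.
X = [random.gauss(0, 1) for _ in range(40)] + [random.gauss(2, 1) for _ in range(8)]
y = [-1] * 40 + [1] * 8

def smote_1d(xs, n):
    # 1-D SMOTE: interpolate between random pairs of minority values.
    return [a + random.random() * (b - a)
            for a, b in (random.sample(xs, 2) for _ in range(n))]

def stump(x, thr, sign):
    # Weak learner: predicts +1 on one side of a threshold.
    return sign if x > thr else -sign

w = [1.0 / len(X)] * len(X)
ensemble = []
for _ in range(5):
    # 1. Generate synthetic minority examples for this iteration.
    syn = smote_1d([xi for xi, yi in zip(X, y) if yi == 1], 10)
    Xt = X + syn
    yt = y + [1] * len(syn)
    wt = w + [min(w)] * len(syn)          # assumed weight for synthetics
    # 2. Train the weak classifier on the augmented data.
    err, thr, sign = min(
        ((sum(wi for xi, yi, wi in zip(Xt, yt, wt) if stump(xi, t, s) != yi), t, s)
         for t in Xt for s in (1, -1)),
        key=lambda e: e[0])
    # 3. Compute the classifier's (normalized) weighted loss.
    err = max(err / sum(wt), 1e-10)
    alpha = 0.5 * math.log((1 - err) / err)
    ensemble.append((alpha, thr, sign))
    # 4. AdaBoost weight update on the original examples only.
    w = [wi * math.exp(-alpha * yi * stump(xi, thr, sign))
         for xi, yi, wi in zip(X, y, w)]
    total = sum(w)
    w = [wi / total for wi in w]

def predict(x):
    # Weighted vote of the boosted stumps.
    return 1 if sum(a * stump(x, t, s) for a, t, s in ensemble) > 0 else -1

acc = sum(predict(xi) == yi for xi, yi in zip(X, y)) / len(X)
print(round(acc, 3))
```

Because fresh synthetic examples are drawn each round, every weak learner sees a slightly different training set, which is the extra diversity the slide refers to.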

Slide 20

Slide 20 text

RUSBoost
● Performs similarly to SMOTEBoost
● Removes instances from the majority class by randomly undersampling the data in each iteration
● Does not necessarily assign new weights to the instances

Slide 21

Slide 21 text

Bagging-based ensembles
● UnderBagging
● OverBagging
● UnderOverBagging
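The UnderBagging idea is the simplest of the three: each base learner gets all minority examples plus a fresh random undersample of the majority class, so no single learner throws away the same majority information. A minimal sketch on hypothetical 1-D data (`underbagged_sets` is an illustrative helper name, not a library function):

```python
import random

random.seed(4)
# Hypothetical dataset: 60 majority (label 0) vs 12 minority (label 1) points.
data = ([(random.gauss(0, 1), 0) for _ in range(60)]
        + [(random.gauss(2, 1), 1) for _ in range(12)])
majority = [d for d in data if d[1] == 0]
minority = [d for d in data if d[1] == 1]

def underbagged_sets(n_models):
    # UnderBagging: every bag keeps ALL minority examples and draws a
    # new balanced random undersample of the majority class.
    return [minority + random.sample(majority, len(minority))
            for _ in range(n_models)]

bags = underbagged_sets(5)
print([len(b) for b in bags])  # [24, 24, 24, 24, 24]
```

OverBagging is the mirror image (each bag oversamples the minority class up to the majority size), and UnderOverBagging varies the resampling rate across bags.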

Slide 22

Slide 22 text

Hybrid Ensembles
● EasyEnsemble
● BalanceCascade

Slide 23

Slide 23 text

Thank You