
Class Imbalance Problem

prabhant
November 30, 2017


Class imbalance problem, presented in the ML seminar on ensemble machine learning @UT, winter 2017

Transcript

  1. Class imbalance problem • Majority class ◦ Far more examples

    • Minority class ◦ Far fewer examples • Level of imbalance ◦ Number of majority-class examples / number of minority-class examples
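The level of imbalance defined above can be computed with a short sketch (the function name is my own):

```python
from collections import Counter

def imbalance_level(labels):
    """Level of imbalance: majority-class count / minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

print(imbalance_level([0] * 90 + [1] * 10))  # 9.0
```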
  2. Class imbalance problem • Can’t achieve high accuracy when

    there is class imbalance ◦ The majority class dominates the dataset ◦ Even a small level of imbalance makes a difference • The learning process suffers from an imbalanced dataset when ◦ Minority classes are more important or can’t be sacrificed
  3. Class imbalance problem Class imbalance problem refers to the scenario

    where the data is imbalanced and • Minority classes are important • The minority class has a higher misclassification cost than the majority class
  4. Class imbalance problem Strategies to counter the class imbalance problem •

    Undersampling ◦ Decreases the number of majority-class examples ▪ May lose useful information • Oversampling ◦ Increases the number of minority-class examples ▪ Risk of overfitting
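A minimal sketch of the two random strategies, with hypothetical helper names (plain random sampling, no selection heuristics):

```python
import random

def random_undersample(majority, minority, seed=0):
    """Drop majority examples until the classes are balanced."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))  # may lose useful information
    return kept + minority

def random_oversample(majority, minority, seed=0):
    """Duplicate minority examples until the classes are balanced."""
    rng = random.Random(seed)
    copies = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + copies  # exact copies -> overfitting risk
```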
  5. Class imbalance problem To improve undersampling: selectively remove the

    majority class examples such that the more informative examples are kept. • One-sided selection ◦ Initial subset contains all minority examples plus one randomly chosen majority example ◦ A 1-NN classifier built on the subset classifies the remaining majority examples ◦ Misclassified (informative) examples are kept
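A 1-D toy sketch of the condensing step of one-sided selection, assuming the variant where misclassified majority examples are added to the subset (Tomek-link cleaning, which the full method also applies, is omitted):

```python
import random

def one_sided_selection(majority, minority, seed=0):
    """Keep all minority points plus the majority points that a 1-NN
    classifier built on the subset misclassifies (the informative,
    borderline ones). 1-D toy version."""
    rng = random.Random(seed)
    subset = [(x, 1) for x in minority]          # label 1 = minority
    rest = majority[:]
    rng.shuffle(rest)
    subset.append((rest.pop(), 0))               # one random majority example
    nearest = lambda x: min(subset, key=lambda p: abs(p[0] - x))[1]
    for x in rest:
        if nearest(x) != 0:                      # misclassified -> keep it
            subset.append((x, 0))
    return subset
```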
  6. Class imbalance problem To improve oversampling: some methods use

    synthetic examples instead of exact copies to reduce the risk of overfitting • SMOTE ◦ Select a random minority example ◦ Select one of its nearest minority neighbours ◦ Create a new example by interpolating between the two
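The three SMOTE steps can be sketched on 1-D data (a toy version; real SMOTE works per-feature on vectors):

```python
import random

def smote_one(minority, k=2, seed=0):
    """One synthetic example: pick a random minority point, pick one of its
    k nearest minority neighbours, interpolate between them."""
    rng = random.Random(seed)
    x = rng.choice(minority)
    neighbours = sorted((p for p in minority if p != x),
                        key=lambda p: abs(p - x))[:k]
    n = rng.choice(neighbours)
    return x + rng.random() * (n - x)   # new point on the segment x..n
```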
  7. Performance Evaluation with Class Imbalance • AUC • ROC •

    F-measure • G-mean • Precision-recall curve ◦ Precision ◦ Recall
  8. Performance Evaluation with Class Imbalance ROC • Receiver operating characteristic

    • True positive rate (TPR) on the y-axis • False positive rate (FPR) on the x-axis AUC • Area under the ROC curve
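AUC can also be read as the probability that a random positive example scores above a random negative one (ties count half), which gives a short dependency-free sketch:

```python
def auc(y_true, scores):
    """AUC = P(score of random positive > score of random negative),
    counting ties as 1/2 -- equal to the area under the ROC curve."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```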
  9. Performance Evaluation with Class Imbalance G-mean or Geometric mean •

    Geometric mean of the per-class accuracies • A good candidate for evaluating class-imbalance learning performance
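The G-mean amounts to the geometric mean of per-class recalls, so a single poorly-recognised class drags the score down — a sketch:

```python
def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (accuracy on each class)."""
    classes = set(y_true)
    prod = 1.0
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recall = sum(y_pred[i] == c for i in idx) / len(idx)
        prod *= recall
    return prod ** (1 / len(classes))
```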
  10. Performance Evaluation with Class Imbalance Precision ◦ How many examples classified as positive

    are really positive Recall ◦ How many positive examples are correctly classified as positive
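Both definitions reduce to counts of true/false positives and false negatives — a minimal sketch:

```python
def precision_recall(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)   # of predicted positives, how many are real
    recall = tp / (tp + fn)      # of real positives, how many were found
    return precision, recall
```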
  11. Performance Evaluation with Class Imbalance F-measure F-measure is defined as

    the harmonic mean of precision and recall • Precision does not contain any information about FN • Recall does not contain any information about FP
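Since precision ignores FN and recall ignores FP, combining them with a harmonic mean penalises whichever is low — a one-line sketch:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; low if either one is low."""
    return 2 * precision * recall / (precision + recall)
```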
  12. Data preprocessing methods • Random under- and oversampling •

    SMOTE: Synthetic Minority Oversampling Technique • MSMOTE • SPIDER: Selective Preprocessing of Imbalanced Data
  13. Addressing Class Imbalance Problem With Classifier Ensembles • Cost-sensitive

    boosting ensembles • Boosting-based ensembles • Bagging-based ensembles • Hybrid ensembles
  14. Cost-sensitive ensembles • AdaCost • CSB • RareBoost •

    AdaC1, AdaC2, AdaC3 (covered in previous lectures)
  15. SMOTEBoost • Just like AdaBoost • Uses SMOTE for preprocessing

    • Algorithm ◦ Generate synthetic examples in each iteration ◦ Train a classifier on the new data ◦ Compute the loss of the new classifier ◦ Update the weights (as in AdaBoost) • Advantage: adds more diversity to the data
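The per-iteration steps can be sketched as a toy SMOTEBoost-style loop on 1-D data. This is a simplified illustration, not the published algorithm: the weak learner is an unweighted threshold stump, and the SMOTE step is plain interpolation between consecutive minority points.

```python
import math, random

def fit_stump(X, y):
    """Best single-threshold classifier on 1-D data (unweighted, for brevity)."""
    best = None
    for t in set(X):
        for sign in (1, -1):
            errs = sum((1 if sign * (x - t) >= 0 else 0) != c
                       for x, c in zip(X, y))
            if best is None or errs < best[0]:
                best = (errs, t, sign)
    _, t, sign = best
    return lambda x: 1 if sign * (x - t) >= 0 else 0

def smoteboost(X, y, rounds=3, seed=0):
    """Each round: add synthetic minority points (SMOTE-style interpolation),
    train a stump on the augmented data, compute its weighted loss, and
    reweight the ORIGINAL examples as in AdaBoost."""
    rng = random.Random(seed)
    w = [1 / len(X)] * len(X)
    minority = [x for x, c in zip(X, y) if c == 1]
    ensemble = []
    for _ in range(rounds):
        synth = [a + rng.random() * (b - a)               # SMOTE step
                 for a, b in zip(minority, minority[1:])]
        stump = fit_stump(X + synth, y + [1] * len(synth))
        err = sum(wi for wi, x, c in zip(w, X, y) if stump(x) != c)
        err = min(max(err, 1e-10), 1 - 1e-10)             # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)           # AdaBoost weight
        ensemble.append((alpha, stump))
        w = [wi * math.exp(-alpha if stump(x) == c else alpha)
             for wi, x, c in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda x: int(sum(a * (1 if h(x) == 1 else -1)
                             for a, h in ensemble) >= 0)
```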
  16. RUSBoost • Performs similarly to SMOTEBoost • Removes instances from

    the majority class by randomly undersampling the data in each iteration • Does not necessarily assign new weights to the instances