Learning from imbalanced data

Toni Pizà
January 25, 2017

Lightning Talk / PyData Mallorca, January 2017 at Parc Bit


Transcript

  1. Class-imbalance problem

    One of the classes is strongly underrepresented.

    Examples:
    · Quality control
    · Fraud / network intrusion detection
    · Detection of oil spills
    · Medical diagnosis
    · Customer churn
  2. Class-imbalance problem

    · Most classification algorithms assume balanced distributions
    · Difficulties learning the concepts related to the minority class
    · Different cost of misclassification
    · Accuracy paradox
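
The accuracy paradox on the slide above can be illustrated with a minimal sketch. The data here is a hypothetical toy set (990 negatives, 10 positives) and the "classifier" is an assumed baseline that always predicts the majority class; neither comes from the talk.

```python
# Hypothetical toy labels: 990 majority-class (0) and 10 minority-class (1) examples.
y_true = [0] * 990 + [1] * 10

# Assumed baseline "classifier": always predict the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


print(accuracy(y_true, y_pred))  # 0.99
print(recall(y_true, y_pred))    # 0.0
```

The baseline scores 99% accuracy while detecting none of the minority examples, which is why metrics such as recall on the minority class tend to be more informative than accuracy here.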
  3. Approaches to the problem

    · Methods at algorithm level (cost function based)
    · Methods at data level (sampling based)
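
As one possible reading of "cost function based", the sketch below derives per-class weights that are inversely proportional to class frequency, so that errors on the rare class cost more. The weighting formula `n_samples / (n_classes * class_count)` is an assumption chosen for illustration (it is a common "balanced" heuristic), not something stated in the talk, and the labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Sum of class weights over misclassified examples, normalised by total weight."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Hypothetical labels: 90 majority-class (0) and 10 minority-class (1) examples.
y_true = [0] * 90 + [1] * 10
weights = balanced_class_weights(y_true)

# A predictor that always answers 0 misses every minority example.
always_majority = [0] * len(y_true)
print(weights)
print(weighted_error(y_true, always_majority, weights))
```

With these weights each class contributes the same total weight, so the always-majority predictor is penalised as if it got half of the data wrong rather than 10% of it.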
  4. Approaches to the problem

    · Methods at algorithm level (cost function based)
    · Methods at data level (sampling based)
        · Under sampling
        · Over sampling
        · Ensemble methods (Sampling + Boosting)

    Change the distribution of the imbalanced data sets to provide the learner with balanced data and improve the detection rate of the minority class.
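
The two basic data-level methods named above, under sampling and over sampling, can be sketched in plain Python as follows. This is a minimal illustration of the random variants only; the function names, the fixed seed, and the toy data are assumptions made for the example and do not come from the talk.

```python
import random
from collections import Counter


def split_by_class(samples, labels):
    """Group samples into a dict keyed by class label."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return groups


def random_undersample(samples, labels, seed=0):
    """Randomly drop examples until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = split_by_class(samples, labels)
    target = min(len(xs) for xs in groups.values())
    pairs = [(x, y) for y, xs in groups.items() for x in rng.sample(xs, target)]
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


def random_oversample(samples, labels, seed=0):
    """Randomly duplicate examples until every class matches the largest class."""
    rng = random.Random(seed)
    groups = split_by_class(samples, labels)
    target = max(len(xs) for xs in groups.values())
    pairs = []
    for y, xs in groups.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        pairs.extend((x, y) for x in xs + extra)
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


# Hypothetical data set: 90 majority-class and 10 minority-class examples.
X = list(range(100))
y = [0] * 90 + [1] * 10

X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
print(Counter(y_under))  # 10 examples per class
print(Counter(y_over))   # 90 examples per class
```

Under sampling discards majority examples and so loses information, while over sampling repeats minority examples and can encourage overfitting; synthetic approaches such as SMOTE aim to soften the latter by generating new minority points instead of exact copies.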
  5. References

    · Learning from Imbalanced Data
      http://www.cs.utah.edu/~piyush/teaching/ImbalancedLearning.pdf
    · SMOTE: Synthetic Minority Over-sampling Technique
      https://www.jair.org/media/953/live-953-2037-jair.pdf
    · On the Class Imbalance Problem
      http://sci2s.ugr.es/keel/pdf/specific/congreso/guo_on_2008.pdf