
Handling Imbalanced Datasets

Allen Akinkunle

July 05, 2017



Transcript

  1. Speaker Bio • Data Scientist at Ernst & Young • Former Software Developer • MSc Data Science from Lancaster University • Hobbyist Singer
  2. Session Objectives At the end of this session, we should understand: • What imbalanced datasets are and how to handle them • How to pick the right performance evaluation metric for your predictive models • The different methods of dealing with imbalanced datasets ◦ How SMOTE works in particular • How to deploy a built model behind a REST API
  3. Session Overview • What are imbalanced datasets? • Domains with class imbalance problem • The problem with imbalanced datasets • Accuracy Paradox • Model performance metrics • Methods of dealing with imbalanced data • What is SMOTE? • How does SMOTE work? • Demo
  4. What are Imbalanced Datasets? A dataset is imbalanced if the classes are not approximately equally represented* • Class with more cases (Majority Class) • Class with fewer cases (Minority Class) * We’re exploring class imbalance in a binary classification problem. Balance scale image courtesy of winnifredxoxo on Flickr
  5. Domains with Class Imbalance Problem • Fraud detection in financial transactions • Detection of malignant tumors in medical image scans • Email spam detection • Andela’s fellowship admission dataset
  6. The Problem with Imbalanced Datasets • The aim is to detect/predict the rare but very important cases of the minority class. • We require a high rate of correct prediction in the minority class (the critical class). • When ML algorithms are trained on imbalanced datasets, they give biased predictions and misleading accuracy scores.
  7. Accuracy Paradox “...states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy.” [1] • The performance of machine learning algorithms is typically evaluated using predictive accuracy. • This is not appropriate when the data is imbalanced and/or the costs of different errors vary markedly. We favour other performance metrics instead.
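The paradox on the slide can be illustrated with a toy sketch (the 95:5 class split and the always-majority "classifier" are made-up for illustration, not from the talk):

```python
# Illustrative sketch: a "classifier" that always predicts the majority
# class scores high accuracy on imbalanced data while never detecting
# the minority (critical) class.
y_true = [0] * 95 + [1] * 5          # 95 negative cases, 5 positive cases
y_pred = [0] * 100                   # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy)  # 0.95 - looks impressive
print(recall)    # 0.0  - but no minority case is ever caught
```

Despite 95% accuracy, this model has zero predictive power on the class we actually care about.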
  8. Model Performance Metrics • Accuracy • Error Rate = 1 - Accuracy • Precision: a measure of the correctness achieved in positive predictions. Precision = TP / (TP + FP) • Recall: also called Sensitivity or the True Positive Rate. It measures how many of the actual positives are predicted correctly. Recall = TP / (TP + FN) • F-Score • ROC Curve and Area Under the Curve • Precision-Recall Curve
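The precision, recall, and F-score formulas above can be computed directly from confusion-matrix counts; a minimal sketch (the helper name and the example counts are hypothetical):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the slide's metrics from confusion-matrix counts:
    Precision = TP / (TP + FP), Recall = TP / (TP + FN),
    F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 4 false negatives
p, r, f = precision_recall_f1(8, 2, 4)
print(p, r, f)  # 0.8, 0.666..., 0.727...
```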
  9. Methods of dealing with Imbalanced Data • Sampling Methods ◦ Under-sampling ◦ Over-sampling e.g. SMOTE • Cost Sensitive Learning
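The simplest of these, random under-sampling, can be sketched in a few lines (`random_undersample` is a hypothetical helper for illustration, not code from the talk):

```python
import random

def random_undersample(majority, minority, seed=0):
    """Randomly drop majority-class samples until both classes are the
    same size - the simplest form of under-sampling on the slide."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))
    return kept + minority

maj = list(range(95))        # 95 majority-class samples (toy data)
mino = list(range(95, 100))  # 5 minority-class samples (toy data)
balanced = random_undersample(maj, mino)
# balanced now holds 5 samples from each class
```

The trade-off: under-sampling discards potentially useful majority-class information, which is one motivation for over-sampling approaches like SMOTE instead.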
  10. What is SMOTE? SMOTE (Synthetic Minority Over-sampling Technique) is a technique in which the minority class is over-sampled by creating “synthetic examples” using original data.
  11. How does SMOTE work? STEP 1: For every data point in the minority class, the algorithm finds the k samples closest in distance (standard Euclidean distance) to the selected minority data point. STEP 2: New synthetic samples (x_n) are generated by taking the difference between the minority sample x_i and one of its nearest neighbours x_j, multiplying it by a random number between 0 and 1, and adding the result to x_i: x_n = x_i + (x_j - x_i) * rand(0, 1)
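The two steps above can be sketched in plain Python (a simplified toy implementation, not the canonical SMOTE algorithm or any library's API; the function name and the four-point minority set are made up):

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Toy sketch of the two SMOTE steps: pick a minority point x_i,
    find its k nearest minority neighbours by Euclidean distance,
    choose one (x_j), and interpolate between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x_i = rng.choice(minority)
        # STEP 1: k nearest minority-class neighbours of x_i (excluding itself)
        others = [p for p in minority if p is not x_i]
        neighbours = sorted(others, key=lambda p: math.dist(p, x_i))[:k]
        x_j = rng.choice(neighbours)
        # STEP 2: x_n = x_i + (x_j - x_i) * rand(0, 1)
        gap = rng.random()
        synthetic.append(tuple(a + (b - a) * gap for a, b in zip(x_i, x_j)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_synthetic=2)
# each synthetic point lies on the segment between a minority
# point and one of its neighbours
```

Because each synthetic point is a convex combination of two existing minority points, SMOTE fills in the region the minority class already occupies rather than duplicating samples verbatim.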