
Handling Imbalanced Datasets


Allen Akinkunle

July 05, 2017
Transcript

  1. Speaker Bio
     • Data Scientist at Ernst & Young
     • Former Software Developer
     • MSc Data Science from Lancaster University
     • Hobbyist Singer
  2. Session Objectives
     At the end of this session, we should understand:
     • What imbalanced datasets are and how to handle them
     • How to pick the right performance evaluation metric for your predictive models
     • The different methods of dealing with imbalanced datasets
       ◦ How SMOTE works in particular
     • How to deploy a built model behind a REST API
  3. Session Overview
     • What are imbalanced datasets?
     • Domains with class imbalance problem
     • The problem with imbalanced datasets
     • Accuracy Paradox
     • Model performance metrics
     • Methods of dealing with imbalanced data
     • What is SMOTE?
     • How does SMOTE work?
     • Demo
  4. What are Imbalanced Datasets?
     A dataset is imbalanced if the classes are not approximately equally represented.*
     • Class with more cases (Majority Class)
     • Class with fewer cases (Minority Class)
     * We’re exploring class imbalance in a binary classification problem.
     Balance Scale image courtesy of winnifredxoxo on Flickr
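As a quick sketch, "not approximately equally represented" can be checked with a simple class count (the 95:5 label vector below is a made-up example, not data from the talk):

```python
from collections import Counter

# Hypothetical binary labels: 0 = majority class, 1 = minority class
labels = [0] * 95 + [1] * 5

counts = Counter(labels)
imbalance_ratio = max(counts.values()) / min(counts.values())

print(counts)           # Counter({0: 95, 1: 5})
print(imbalance_ratio)  # 19.0 -- one minority case per 19 majority cases
```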
  5. Domains with Class Imbalance Problem
     • Fraud detection in financial transactions
     • Detection of malignant tumors in medical image scans
     • Email spam detection
     • Andela’s fellowship admission dataset
  6. The Problem with Imbalanced Datasets
     • The aim is to detect/predict the rare but very important cases of the minority class.
     • We require a high rate of correct prediction in the minority class (the critical class).
     • When ML algorithms are trained on imbalanced datasets, they give biased predictions and misleading accuracy scores.
  7. Accuracy Paradox
     “…states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy.” [1]
     • The performance of machine learning algorithms is typically evaluated using predictive accuracy.
     • Accuracy is not appropriate when the data is imbalanced and/or the costs of different errors vary markedly. We favour other performance metrics instead.
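The paradox can be made concrete with a toy example (the 95:5 class split and the "always predict the majority class" baseline below are illustrative assumptions):

```python
# 95 negative (majority) cases, 5 positive (minority) cases
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(accuracy)  # 0.95 -- looks strong...
print(recall)    # 0.0  -- ...but not one critical minority case is detected
```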
  8. Model Performance Metrics
     • Accuracy
     • Error Rate = 1 − Accuracy
     • Precision: a measure of the correctness achieved in positive predictions. Precision = TP / (TP + FP)
     • Recall: also called Sensitivity or the True Positive Rate. It measures how many of the actual positives are predicted correctly. Recall = TP / (TP + FN)
     • F-Score
     • ROC Curve and Area Under Curve
     • Precision-Recall Curve
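The precision and recall formulas on this slide can be written out directly (the confusion-matrix counts below are made up for illustration; F-Score is taken here as the balanced F1 = 2PR / (P + R)):

```python
def precision(tp, fp):
    """Correctness of positive predictions: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives found: TP / (TP + FN)."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical confusion-matrix counts:
tp, fp, fn = 30, 10, 20
print(precision(tp, fp))               # 0.75
print(recall(tp, fn))                  # 0.6
print(round(f1_score(tp, fp, fn), 3))  # 0.667
```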
  9. Methods of dealing with Imbalanced Data
     • Sampling Methods
       ◦ Under-sampling
       ◦ Over-sampling, e.g. SMOTE
     • Cost-Sensitive Learning
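The two sampling methods differ only in which class they resize; a minimal sketch with plain random sampling (the 95:5 dataset below is made up, and SMOTE refines over-sampling by synthesising new points instead of duplicating existing ones):

```python
import random

# A made-up imbalanced dataset: 95 majority rows, 5 minority rows
majority = [("majority", i) for i in range(95)]
minority = [("minority", i) for i in range(5)]

rng = random.Random(0)

# Under-sampling: randomly discard majority cases until the classes match
under_sampled = rng.sample(majority, len(minority)) + minority

# Random over-sampling: duplicate minority cases until the classes match
over_sampled = majority + rng.choices(minority, k=len(majority))

print(len(under_sampled))  # 10  (5 majority + 5 minority)
print(len(over_sampled))   # 190 (95 majority + 95 minority)
```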
  10. What is SMOTE?
      SMOTE (Synthetic Minority Over-sampling Technique) is a technique in which the minority class is over-sampled by creating “synthetic examples” using original data.
  11. How does SMOTE work?
      STEP 1: For every data point in the minority class, the algorithm finds the k samples closest in distance (standard Euclidean distance) to the selected minority data point.
      STEP 2: Synthetic new samples (x_n) are generated by taking the difference between the minority sample x_i and one of its nearest neighbours x_j, multiplying it by a random number between 0 and 1, and adding it to x_i:
      x_n = x_i + (x_j − x_i) × rand(0, 1)
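The two steps above can be sketched in plain Python. This is an illustrative re-implementation under simplifying assumptions (2-D tuples, brute-force neighbour search); in practice a tested library implementation such as imbalanced-learn's SMOTE would be used:

```python
import math
import random

def smote(minority, k=3, n_synthetic=10, seed=42):
    """Generate synthetic minority samples: x_n = x_i + (x_j - x_i) * rand(0, 1)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x_i = rng.choice(minority)           # pick a minority point
        neighbours = sorted(                 # STEP 1: its k nearest neighbours
            (p for p in minority if p is not x_i),
            key=lambda p: math.dist(x_i, p),
        )[:k]
        x_j = rng.choice(neighbours)         # STEP 2: pick one neighbour
        gap = rng.random()                   # random number in [0, 1)
        synthetic.append(tuple(a + (b - a) * gap for a, b in zip(x_i, x_j)))
    return synthetic

# Hypothetical 2-D minority samples for illustration:
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, k=2, n_synthetic=5)
print(len(new_points))  # 5 synthetic samples, each on a segment between neighbours
```

Because each synthetic point is interpolated between two existing minority points, it always lies inside the region the minority class already occupies.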