
Handling Imbalanced Datasets

Allen Akinkunle

July 05, 2017



Transcript

  1. Speaker Bio • Data Scientist at Ernst & Young • Former Software Developer • MSc Data Science from Lancaster University • Hobbyist Singer
  2. Session Objectives At the end of this session, we should understand: • What imbalanced datasets are and how to handle them • How to pick the right performance evaluation metric for your predictive models • The different methods of dealing with imbalanced datasets ◦ How SMOTE works in particular • How to deploy a built model behind a REST API
  3. Session Overview • What are imbalanced datasets? • Domains with class imbalance problem • The problem with imbalanced datasets • Accuracy Paradox • Model performance metrics • Methods of dealing with imbalanced data • What is SMOTE? • How does SMOTE work? • Demo
  4. What are Imbalanced Datasets? A dataset is imbalanced if the classes are not approximately equally represented* • Class with more cases (Majority Class) • Class with fewer cases (Minority Class) * We’re exploring class imbalance in a binary classification problem. Balance scale image courtesy of winnifredxoxo on Flickr
  5. Domains with Class Imbalance Problem • Fraud detection in financial transactions • Detection of malignant tumors in medical image scans • Email spam detection • Andela’s fellowship admission dataset
  6. The Problem with Imbalanced Datasets • The aim is to detect/predict the rare but very important cases of the minority class. • We require a high rate of correct prediction in the minority class (the critical class). • When ML algorithms are trained on imbalanced datasets, they give biased predictions and misleading accuracy scores.
  7. Accuracy Paradox “...states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy.” [1] • The performance of machine learning algorithms is typically evaluated using predictive accuracy. • This is not appropriate when the data is imbalanced and/or the costs of different errors vary markedly. We favour other performance metrics instead.
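The paradox on the slide can be illustrated with a toy sketch (the 95:5 class split and the always-majority "classifier" are made-up for illustration, not from the talk):

```python
# Illustrative sketch: a "classifier" that always predicts the majority
# class scores high accuracy on imbalanced data while never detecting
# the minority (critical) class.
y_true = [0] * 95 + [1] * 5          # 95 negative cases, 5 positive cases
y_pred = [0] * 100                   # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy)  # 0.95 - looks impressive
print(recall)    # 0.0  - but no minority case is ever caught
```

Despite 95% accuracy, this model has zero predictive power on the class we actually care about.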
  8. Model Performance Metrics • Accuracy • Error Rate = 1 - Accuracy • Precision: a measure of the correctness achieved in positive predictions. Precision = TP / (TP + FP) • Recall: also called Sensitivity or the True Positive Rate. It measures how many of the actual positives are predicted correctly. Recall = TP / (TP + FN) • F-Score • ROC Curve and Area Under the Curve • Precision-Recall Curve
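The precision, recall, and F-score formulas above can be computed directly from confusion-matrix counts; a minimal sketch (the helper name and the example counts are hypothetical):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the slide's metrics from confusion-matrix counts:
    Precision = TP / (TP + FP), Recall = TP / (TP + FN),
    F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 4 false negatives
p, r, f = precision_recall_f1(8, 2, 4)
print(p, r, f)  # 0.8, 0.666..., 0.727...
```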
  9. Methods of dealing with Imbalanced Data • Sampling Methods ◦ Under-sampling ◦ Over-sampling e.g. SMOTE • Cost Sensitive Learning
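The simplest of these, random under-sampling, can be sketched in a few lines (`random_undersample` is a hypothetical helper for illustration, not code from the talk):

```python
import random

def random_undersample(majority, minority, seed=0):
    """Randomly drop majority-class samples until both classes are the
    same size - the simplest form of under-sampling on the slide."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))
    return kept + minority

maj = list(range(95))        # 95 majority-class samples (toy data)
mino = list(range(95, 100))  # 5 minority-class samples (toy data)
balanced = random_undersample(maj, mino)
# balanced now holds 5 samples from each class
```

The trade-off: under-sampling discards potentially useful majority-class information, which is one motivation for over-sampling approaches like SMOTE instead.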
  10. What is SMOTE? SMOTE (Synthetic Minority Over-sampling Technique) is a technique in which the minority class is over-sampled by creating “synthetic examples” using original data.
  11. How does SMOTE work? STEP 1: For every data point in the minority class, the algorithm finds the k samples closest in distance (standard Euclidean distance) to the selected minority data point. STEP 2: New synthetic samples (x_n) are generated by taking the difference between the minority sample x_i and one of its nearest neighbours x_j, multiplying it by a random number between 0 and 1, and adding the result to x_i: x_n = x_i + (x_j - x_i) * rand(0, 1)
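The two steps above can be sketched in plain Python (a simplified toy implementation, not the canonical SMOTE algorithm or any library's API; the function name and the four-point minority set are made up):

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Toy sketch of the two SMOTE steps: pick a minority point x_i,
    find its k nearest minority neighbours by Euclidean distance,
    choose one (x_j), and interpolate between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x_i = rng.choice(minority)
        # STEP 1: k nearest minority-class neighbours of x_i (excluding itself)
        others = [p for p in minority if p is not x_i]
        neighbours = sorted(others, key=lambda p: math.dist(p, x_i))[:k]
        x_j = rng.choice(neighbours)
        # STEP 2: x_n = x_i + (x_j - x_i) * rand(0, 1)
        gap = rng.random()
        synthetic.append(tuple(a + (b - a) * gap for a, b in zip(x_i, x_j)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_synthetic=2)
# each synthetic point lies on the segment between a minority
# point and one of its neighbours
```

Because each synthetic point is a convex combination of two existing minority points, SMOTE fills in the region the minority class already occupies rather than duplicating samples verbatim.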