problem in machine learning where the total number of instances of one class (positive) is far smaller than the total number of instances of another class (negative)
• Very common in practice (fraud detection, anomaly detection, medical diagnosis, oil spillage detection, etc.)
and works best when the number of instances of each class is roughly equal
• Suppose an algorithm must choose between two models:
  a) Model 1: wrong on 7 out of 10 minority instances and 10 out of 10000 majority instances
  b) Model 2: wrong on 2 out of 10 minority instances and 100 out of 10000 majority instances
• The algorithm will prefer Model 1 (17 mistakes) over Model 2 (102 mistakes), but we are more interested in having more of the minority class predicted correctly
• A detailed example by the numbers follows later
out which transactions are fraudulent
– Real-life data usually contain only 1~5% fraudulent transactions; the rest are genuine
– Missing a fraudulent transaction is very costly (lost money and angry customers), so we want to catch all of them
• But the class imbalance problem will cause the algorithms to miss many of the fraudulent transactions
we need to know how to compare two models
• From the numbers example earlier, we know that the conventional error rate is not helpful here
• We need better metrics
positive and is correctly classified as positive
• True Negative (TN) – An example that is negative and is correctly classified as negative
• False Positive (FP) – An example that is negative but is wrongly classified as positive
• False Negative (FN) – An example that is positive but is wrongly classified as negative
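These four counts can be read off a confusion matrix. Below is a minimal sketch using scikit-learn on a toy label vector; the library call and the toy data are illustrative additions, not part of the original slides:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = positive (minority), 0 = negative (majority)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# For binary labels ordered [0, 1], the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tp, tn, fp, fn)  # 2 4 1 1
```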
Model 1

              Predicted +   Predicted -
  Actual +    3 (TP)        7 (FN)
  Actual -    10 (FP)       9990 (TN)

Error(Model 1) = (FN + FP) / total dataset size = (7 + 10) / 10010 ≈ 0.0017 (≈ 0.17% error)

Model 2

              Predicted +   Predicted -
  Actual +    8 (TP)        2 (FN)
  Actual -    100 (FP)      9900 (TN)

Error(Model 2) = (FN + FP) / total dataset size = (2 + 100) / 10010 ≈ 0.0102 (≈ 1% error)

Model 1 looks like the better model by error rate. But we actually want Model 2, because we want to maximize TP as much as possible, not maximize TN.
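The same arithmetic as a short Python sketch; the counts are taken from the two confusion matrices above, and the code itself is an illustrative addition:

```python
# Confusion-matrix counts from the example above
model_1 = dict(tp=3, fn=7, fp=10, tn=9990)
model_2 = dict(tp=8, fn=2, fp=100, tn=9900)

def error_rate(m):
    """(FN + FP) / total dataset size."""
    total = m["tp"] + m["fn"] + m["fp"] + m["tn"]
    return (m["fn"] + m["fp"]) / total

print(error_rate(model_1))  # 17 / 10010  ~= 0.0017
print(error_rate(model_2))  # 102 / 10010 ~= 0.0102
```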
Model 1

              Predicted +   Predicted -
  Actual +    3 (TP)        7 (FN)
  Actual -    10 (FP)       9990 (TN)

TP_rate(M1) = 3 / (3 + 7)        = 0.3
TN_rate(M1) = 9990 / (10 + 9990) = 0.999
FP_rate(M1) = 10 / (10 + 9990)   = 0.001
FN_rate(M1) = 7 / (7 + 3)        = 0.7

Model 2

              Predicted +   Predicted -
  Actual +    8 (TP)        2 (FN)
  Actual -    100 (FP)      9900 (TN)

TP_rate(M2) = 8 / (8 + 2)         = 0.8
TN_rate(M2) = 9900 / (100 + 9900) = 0.990
FP_rate(M2) = 100 / (100 + 9900)  = 0.01
FN_rate(M2) = 2 / (2 + 8)         = 0.2

Since we want to predict positives correctly as much as possible, we need to minimize False Negatives (fewer misses when classifying the positive class). The FN rate of M1 is 0.7 and the FN rate of M2 is 0.2, so we pick M2 over M1.
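A small helper that computes the four per-class rates from the same counts (again an illustrative sketch, not part of the slides):

```python
# Per-class rates from confusion-matrix counts
def rates(tp, fn, fp, tn):
    return {
        "TP_rate": tp / (tp + fn),  # recall / sensitivity
        "TN_rate": tn / (tn + fp),  # specificity
        "FP_rate": fp / (fp + tn),
        "FN_rate": fn / (fn + tp),  # the quantity we want to minimize
    }

print(rates(tp=3, fn=7, fp=10, tn=9990))   # FN_rate = 0.7 (Model 1)
print(rates(tp=8, fn=2, fp=100, tn=9900))  # FN_rate = 0.2 (Model 2, preferred)
```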
Cost-sensitive learning: assign a different cost (penalty) to the classifier for each TP, TN, FP, and FN.
– Intuition: if one FN is 100 times costlier than one FP, the algorithm will minimize FN at the expense of more FP (and still obtain a better overall score)
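In practice, one common way to express such costs is a class-weight parameter. Below is a minimal sketch using scikit-learn's class_weight option with the 1:100 intuition above; the synthetic dataset and the choice of LogisticRegression are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced data: roughly 1% positive (minority) class
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

# Treat a miss on the positive class (FN) as 100x costlier than an FP
clf = LogisticRegression(class_weight={0: 1, 1: 100}, max_iter=1000)
clf.fit(X, y)
```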
it has more weight (and affects the metrics more)
• Under-sampling – Remove some of the majority class instances so the majority class has less weight (and affects the metrics less)
• Hybrid – Apply both of the above (see the sketch after this list)
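A minimal sketch of random over- and under-sampling using sklearn.utils.resample; the data and the 20/980 class split are synthetic and only for illustration:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)              # illustrative feature matrix
y = np.array([1] * 20 + [0] * 980)  # 1 = minority, 0 = majority

X_min, X_maj = X[y == 1], X[y == 0]

# Over-sampling: replicate minority instances (with replacement) up to the majority size
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

# Under-sampling: keep only a random subset of the majority class
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)
```

A hybrid approach would apply a milder version of both steps instead of pushing either one to the full majority or minority size.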
Simply removing majority class instances at random discards useful information
• Simply replicating minority class instances at random can cause overfitting
– The classifier becomes very specific about the minority class instances it has seen
• SMOTE tries to address both drawbacks
• SMOTE (Synthetic Minority Over-sampling Technique):
– Remove majority class instances randomly
– Add new minority class instances as follows (a code sketch follows this list):
  • For each minority class instance c:
    – neighbours = the 5 nearest minority class neighbours of c (KNN with k = 5)
    – n = one instance picked at random from neighbours
    – Create a new minority class instance r from c's feature vector plus the difference between n's and c's feature vectors, multiplied by a random number in [0, 1]
      » i.e. r.feats = c.feats + (n.feats – c.feats) * rand(0, 1)
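A minimal NumPy sketch of the interpolation step described above (the neighbour search uses scikit-learn's NearestNeighbors; the function name and parameters are illustrative, not the reference SMOTE implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority instances by interpolating between
    a picked instance c and one of its k nearest minority neighbours n."""
    rng = np.random.RandomState(seed)
    # k + 1 neighbours because the nearest neighbour of a point is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)

    synthetic = []
    for _ in range(n_new):
        i = rng.randint(len(X_minority))    # pick a minority instance c
        j = idx[i][rng.randint(1, k + 1)]   # pick one of its k neighbours n
        c, n = X_minority[i], X_minority[j]
        synthetic.append(c + (n - c) * rng.rand())  # r = c + (n - c) * rand(0, 1)
    return np.array(synthetic)
```

Because each new point lies on the line segment between two existing minority instances, it is similar to, but not a copy of, the originals.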
overfitting is the main problem: the classifier becomes too specific
– The classifier sees the exact same instance many times and learns to treat it as very important
• SMOTE generates similar but not identical minority instances
– The classifier no longer overfits to a single example and can generalize better
machine learning caused by a disproportionate number of instances per class
• Traditional metrics (such as the overall error rate) do not work well
• Mitigations fall into two general approaches: sampling-based and cost-sensitive-based