UNBALANCED DATA: SAME ALGORITHMS, DIFFERENT TECHNIQUES
Eric Martin
UNBALANCED DATA
• Fraud
• Illness detection
• Anomalies
Y = 0   Y = 1
ALGORITHMS
POINT OF VIEW
▪ Accuracy
▪ 1,000,000 total TRX
▪ 10 fraud TRX
▪ Predicting "no fraud" for every TRX: accuracy = 999,990 / 1,000,000 = 99.999%
▪ Better metrics: recall, F1 score, detection probability
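A minimal sketch of the arithmetic behind this slide, assuming a hypothetical "classifier" that labels every transaction as not fraud. The variable names are illustrative only. It shows why accuracy looks excellent while recall is zero.

```python
# Hypothetical baseline: always predict "not fraud" (class 0).
total_trx = 1_000_000
fraud_trx = 10

true_negatives = total_trx - fraud_trx   # every legitimate TRX is labelled correctly
false_negatives = fraud_trx              # every fraud TRX is missed
true_positives = 0
false_positives = 0

accuracy = (true_positives + true_negatives) / total_trx
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.4%}")
print(f"recall   = {recall:.0%}")
```

The model never detects a single fraud, yet its accuracy is almost perfect, which is why recall or F1 score is the more useful figure here.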
UNDERSTANDING THE
PROBLEM
▪ Confusion matrix:
          Pred. 0   Pred. 1
Real 0
Real 1
LESS ACCURACY!
Trading / Illness detection
          Pred. 0   Pred. 1
Real 0
Real 1
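As a rough illustration of how the four cells of the matrix turn into metrics, here is a small sketch in plain Python. The function names and the toy labels are assumptions made for this example, not part of the talk.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1."""
    tn = fp = fn = tp = 0
    for real, pred in zip(y_true, y_pred):
        if real == 0 and pred == 0:
            tn += 1
        elif real == 0 and pred == 1:
            fp += 1
        elif real == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp


def precision_recall_f1(tn, fp, fn, tp):
    """Compute precision, recall and F1, returning 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tn, fp, fn, tp)
print(tn, fp, fn, tp, precision, recall, f1)
```

Which cell matters most depends on the problem: a missed illness (Real 1, Pred. 0) and a false trading signal (Real 0, Pred. 1) carry very different costs, so giving up some accuracy can be the right trade.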
IT DEPENDS
ON THE
PROBLEM!!
MOST COMMON PRACTICES
▪ Dimensionality reduction:
▫ SMOTE
▫ Synthetic sample creation
Y = 0 Y = 1 Y = 0 Y = 1
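The idea behind SMOTE-style synthetic samples can be sketched with the standard library alone. This is a simplified illustration and not the reference SMOTE implementation: the function name `smote_like`, its parameters, and the toy points are all assumptions. Each new point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class (Y = 1), invented for illustration.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies.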
SAME
ALGORITHMS
DIFFERENT
TECHNIQUES
▪ If you expect different results, you have to do different things
▪ Exploit all the data you have
▪ Bagging algorithm: first step, Random Forest
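For readers unfamiliar with bagging, the following is a generic sketch of the technique, not the EM Forest method presented in this talk. It trains many one-feature threshold classifiers ("stumps") on bootstrap samples and combines them by majority vote. Random Forest applies the same idea to decision trees and adds random feature selection. All names and the toy data are assumptions.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a threshold classifier that predicts 1 when x > threshold,
    choosing the threshold with the fewest training errors."""
    best_threshold, best_errors = None, None
    for threshold in sorted(set(xs)):
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        thresholds.append(
            fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        )
    return thresholds


def bagging_predict(thresholds, x):
    """Majority vote over all stumps."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Toy unbalanced data: 7 samples of class 0, 3 of class 1.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ys = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

model = bagging_fit(xs, ys)
print(bagging_predict(model, 0.05), bagging_predict(model, 0.95))
```

An odd number of models avoids ties in the vote for a binary problem.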
EM FOREST: The new results
▪ Result improvement: better score than Random Forest (or at least the same)
▪ Result flexibility: better on both balanced and unbalanced data (trading and illness detection)
EM FOREST: Advantages
▪ Open Source
▪ Scalability
▪ More possibilities
EM FOREST: Use cases
▪ Real projects:
Credit card usage trends
▪ Demo projects:
Bank fraud
Alcohol consumption in students dataset
THANKS!
Any questions?
You can find me at:
Eric Martin
[email protected]