Unbalanced data: Same algorithms different techniques by Eric Martín at Big Data Spain 2017

UNBALANCED DATA: SAME ALGORITHMS DIFFERENT TECHNIQUES Eric Martin

UNBALANCED DATA
• Fraud
• Illness detection
• Anomalies
[Figure: class distribution, Y = 0 vs. Y = 1]

ALGORITHMS POINT OF VIEW
▪ Accuracy
▪ 1,000,000 total TRX
▪ 10 fraud TRX: a model that never predicts fraud still reaches 99.999% accuracy
▪ Alternatives: recall, F1 score, detection probability
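The arithmetic behind this slide can be checked with a short sketch. It assumes a hypothetical classifier that answers "no fraud" for every transaction; with 10 frauds in 1,000,000 transactions it is right 999,990 times, which is 99.999% accuracy, while its recall is zero.

```python
# Figures from the slide: 1,000,000 transactions, 10 of them fraud.
total_trx = 1_000_000
fraud_trx = 10

# Hypothetical classifier that always answers "no fraud" (Y = 0).
true_negatives = total_trx - fraud_trx   # every legitimate TRX is classified correctly
true_positives = 0                       # no fraud is ever caught
false_negatives = fraud_trx              # every fraud is missed

accuracy = (true_positives + true_negatives) / total_trx
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.3%}")  # accuracy = 99.999%
print(f"recall   = {recall:.0%}")    # recall   = 0%
```

Accuracy rewards the majority class, so recall or the F1 score are more informative when the positive class is this rare.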

UNDERSTANDING THE PROBLEM
▪ Confusion matrix: Real 0 / Real 1 against Pred. 0 / Pred. 1
▪ LESS ACCURACY!
[Figure: two confusion matrices (Real 0, Real 1 vs. Pred. 0, Pred. 1), one for Trading and one for Illness Detection]
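A minimal sketch of how such a real-versus-predicted matrix is counted, and how recall and precision are read from its cells. The labels below are made up for illustration; they are not data from the talk.

```python
def confusion_matrix(y_real, y_pred):
    """Return counts keyed by (real, predicted) for binary labels 0/1."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for real, pred in zip(y_real, y_pred):
        counts[(real, pred)] += 1
    return counts

def recall(counts):
    """Share of real positives that were predicted as positive."""
    positives = counts[(1, 1)] + counts[(1, 0)]
    return counts[(1, 1)] / positives if positives else 0.0

def precision(counts):
    """Share of predicted positives that are real positives."""
    flagged = counts[(1, 1)] + counts[(0, 1)]
    return counts[(1, 1)] / flagged if flagged else 0.0

# Made-up labels for illustration only.
y_real = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_real, y_pred)
print(cm, recall(cm), precision(cm))
```

Which cell matters most (missed positives or false alarms) depends on the problem, so the metric to optimise should be chosen per use case.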

IT DEPENDS ON THE PROBLEM!!

MOST COMMON PRACTICES
▪ Dimensionality reduction:
▫ SMOTE
▫ Synthetic samples creation
[Figure: two class-distribution plots, each showing Y = 0 vs. Y = 1]
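Synthetic sample creation in the style of SMOTE can be sketched as interpolation between a minority-class point and one of its nearest minority neighbours. The function below is a simplified illustration under that assumption, not the reference SMOTE implementation; `smote_like`, `k` and the sample points are hypothetical names and values.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority samples by squared distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Made-up minority-class points (two features each).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Every synthetic point lies on a segment between two real minority points, so the new samples stay inside the region the minority class already occupies.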

SAME ALGORITHMS DIFFERENT TECHNIQUES
▪ If you expect different results, you have to do different things
▪ Exploit all the data you have
▪ Bagging algorithm: first step, Random Forest
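Bagging (bootstrap aggregating) trains each model on a sample drawn with replacement from the training rows. A minimal sketch of that sampling step follows; the row ids and the number of bags are illustrative, and the per-split feature sampling that Random Forest adds on top is left out.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: one 'bag' for one tree."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(42)
rows = list(range(1, 17))   # stand-ins for the row ids of a training table
bags = [bootstrap_sample(rows, rng) for _ in range(3)]  # one bag per tree

for i, bag in enumerate(bags, start=1):
    print(f"Tree{i}: {sorted(bag)}")
```

Because rows are drawn with replacement, each bag repeats some rows and omits others, which is what makes the trees differ from one another.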

RANDOM FOREST
 #   F1    F2   F3      …   FN      Y
 1   1.2   25   True    …   0.185   1
 2   3.4   55   False   …   0.211   1
 3   2.2   58   True    …   0.171   0
 4   4.0   34   True    …   0.132   1
 5   1.1   63   True    …   0.652   0
 6   0.7   61   False   …   0.153   0
 7   3.3   12   False   …   0.477   1
 8   3.1   23   True    …   0.311   1
 9   1.2   29   False   …   0.171   1
10   3.4   45   True    …   0.132   0
11   2.1   55   True    …   0.652   1
12   1.7   19   False   …   0.189   0
13   3.3   12   False   …   0.477   1
14   3.1   23   True    …   0.311   1
15   1.2   29   False   …   0.171   1
16   2.2   58   True    …   0.171   0

RANDOM FOREST
F1    F2   F3      …   FN      Y
1.5   25   False   …   0.185   ???
▪ Tree votes: 1, 1, 0
▪ MAJORITY VOTE: 1
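The majority vote on this slide can be written in a few lines. The sketch assumes the three values 1, 1, 0 are the predictions of three trees for the new row; `majority_vote` is an illustrative helper name.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label among the per-tree votes."""
    return Counter(votes).most_common(1)[0][0]

tree_votes = [1, 1, 0]            # the three votes shown on the slide
prediction = majority_vote(tree_votes)
print(prediction)  # 1
```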

EM FOREST
[Table: the same 16-row training table as on the first RANDOM FOREST slide (columns F1, F2, F3, …, FN, Y)]

EM FOREST: Transforming the problem
[Figure: the training table (columns F1, F2, F3, …, FN, Y) next to a new table with columns Tree1, Tree2, Tree3, Y; row 1 of the new table is filled in as 1, 1, 0, 1 and rows 2 to 18 are still empty]
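One way to read this transformation is that each feature row is replaced by the vector of predictions the individual trees give for it, with the label Y kept as the last column. The sketch below illustrates that reading; the three "trees" are hypothetical hand-written rules standing in for fitted trees, so their outputs are illustrative and will not match the slide's table.

```python
# Hypothetical stand-ins for three fitted trees: each maps a feature row
# (F1, F2, F3) to a 0/1 prediction. Real trees would be learned from data.
def tree1(row):
    return 1 if row[0] < 2.5 else 0

def tree2(row):
    return 1 if row[1] < 40 else 0

def tree3(row):
    return 1 if row[2] else 0

trees = [tree1, tree2, tree3]

def transform(rows, labels, trees):
    """Replace each feature row by the vector of per-tree predictions,
    keeping the original label Y as the last column."""
    return [[t(row) for t in trees] + [y] for row, y in zip(rows, labels)]

# First three rows of the training table (F1, F2, F3) and their labels Y.
rows = [(1.2, 25, True), (3.4, 55, False), (2.2, 58, True)]
labels = [1, 1, 0]

new_problem = transform(rows, labels, trees)
for line in new_problem:
    print(line)   # [Tree1, Tree2, Tree3, Y]
```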

EM FOREST: The new problem
 #   Tree1  Tree2  Tree3  Y
 1   1      1      0      1
 2   1      0      1      1
 3   1      1      1      0
 4   0      1      0      1
 5   0      0      0      0
 6   1      0      1      0
 7   0      1      0      1
 8   0      1      0      1
 9   1      0      1      1
10   1      1      0      0
11   0      1      0      1
12   0      0      1      0
13   1      0      1      1
14   1      1      0      1
15   1      1      0      1
16   0      0      1      0
17   0      1      0      1
18   1      0      0      0
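Once the tree outputs are treated as features, a second model can learn from them instead of applying a fixed majority vote. The slides do not say which second-stage model EM Forest uses, so the sketch below assumes the simplest possible one: a lookup that predicts, for each vote vector, the most frequent Y seen with it. The rows are copied from the table above, and the scores are measured on those same training rows, so they say nothing about generalisation.

```python
from collections import Counter, defaultdict

# (Tree1, Tree2, Tree3, Y) rows copied from the table above.
table = [
    (1, 1, 0, 1), (1, 0, 1, 1), (1, 1, 1, 0), (0, 1, 0, 1), (0, 0, 0, 0),
    (1, 0, 1, 0), (0, 1, 0, 1), (0, 1, 0, 1), (1, 0, 1, 1), (1, 1, 0, 0),
    (0, 1, 0, 1), (0, 0, 1, 0), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 0, 1),
    (0, 0, 1, 0), (0, 1, 0, 1), (1, 0, 0, 0),
]

def majority_vote(votes):
    """Plain Random Forest rule: predict 1 if more than half the trees say 1."""
    return 1 if sum(votes) * 2 > len(votes) else 0

def fit_lookup(table):
    """Assumed second-stage model: most frequent Y for each vote vector."""
    by_vector = defaultdict(Counter)
    for *votes, y in table:
        by_vector[tuple(votes)][y] += 1
    return {vec: c.most_common(1)[0][0] for vec, c in by_vector.items()}

lookup = fit_lookup(table)
vote_hits = sum(majority_vote(v) == y for *v, y in table)
lookup_hits = sum(lookup[tuple(v)] == y for *v, y in table)
print(vote_hits, lookup_hits, len(table))  # 10 16 18
```

On these rows the vote vector (0, 1, 0) always goes with Y = 1, which a majority vote gets wrong every time but a model trained on the tree outputs can pick up.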

EM FOREST: The new possibilities
 #   Tree1  Tree2  Tree3  Y
 1   1      1      0      1
 2   1      0      1      1
 3   1      1      1      0
 4   0      1      0      1
 5   0      0      0      0
 6   1      0      1      0
 7   0      1      0      1
 8   0      1      0      1
▪ Vector vs. Aggregated
 #   Agg  Y
 1   2    1
 2   2    1
 3   3    0
 4   0    1
 5   1    0
 6   2    0
 7   1    1
 8   1    1
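The difference between the two representations can be shown on the first three rows. The sketch assumes that Agg is the number of trees voting 1, which is consistent with those rows of the slide's table.

```python
# First three rows of the table above: (Tree1, Tree2, Tree3).
vectors = [(1, 1, 0), (1, 0, 1), (1, 1, 1)]

# Aggregated view: one number per row, here the count of trees voting 1.
aggregated = [sum(v) for v in vectors]
print(aggregated)  # [2, 2, 3]

# The vector view keeps which tree voted what; the aggregated view does not:
# (1, 1, 0) and (1, 0, 1) are different vectors but share the value 2.
print(len(set(vectors)), len(set(aggregated)))  # 3 2
```

The vector form gives a second-stage model more information per row, while the aggregated form is a single feature whose size does not grow with the number of trees.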

EM FOREST: The new results
▪ Result improvement: a better score than Random Forest (or at least the same)
▪ Result flexibility: better on both balanced and unbalanced data (trading and illness detection)

EM FOREST: Advantages
▪ Open source
▪ Scalability
▪ More possibilities

EM FOREST: Use cases
▪ Real projects: credit card usage trends
▪ Demo projects: bank fraud, alcohol in students dataset

THANKS! Any questions?
You can find me at: Eric Martin [email protected]
