Unbalanced data: Same algorithms different techniques by Eric Martín at Big Data Spain 2017

Unbalanced data is a specific data configuration that appears commonly in nature. Applying machine learning techniques to this kind of data is difficult, and the problem is usually addressed with imbalance-reduction techniques.

https://www.bigdataspain.org/2017/talk/unbalanced-data-same-algorithms-different-techniques

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Big Data Spain

December 04, 2017

Transcript

  1. None
  2. UNBALANCED DATA: SAME ALGORITHMS DIFFERENT TECHNIQUES Eric Martin

  3. UNBALANCED DATA

    • Fraud
    • Illness detection
    • Anomalies
    [Figure: two classes, Y = 0 and Y = 1]
  4. ALGORITHMS' POINT OF VIEW

    ▪ Accuracy
    ▪ 1,000,000 total TRX
    ▪ 10 fraud TRX = 99.999% accuracy (when no fraud is predicted at all)
    Recall, F1 score, detection probability
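The arithmetic behind this slide can be checked with a short sketch. It assumes a trivial classifier that labels every transaction as non-fraud; the variable names are illustrative and not taken from the talk.

```python
# Slide figures: 1,000,000 transactions, 10 of them fraudulent.
total = 1_000_000
fraud = 10

# Assumed classifier: predicts "non-fraud" for every transaction.
tp, fn = 0, fraud            # every fraud case is missed
tn, fp = total - fraud, 0    # every legitimate case is correct

accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(f"accuracy={accuracy:.5%} recall={recall:.0%} f1={f1:.0%}")
```

Accuracy is close to 100% even though no fraud is detected, which is why the slide lists recall, F1 score and detection probability as alternatives.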
  5. UNDERSTANDING THE PROBLEM

    ▪ Confusion matrix (axes: Real 0 / Real 1 and Pred. 0 / Pred. 1)
    ▪ LESS ACCURACY!
    ▪ Trading, Illness Detection (second matrix, same axes: Real 0 / Real 1 and Pred. 0 / Pred. 1)
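A real-versus-predicted matrix like the one on this slide can be built in a few lines. This is a minimal sketch with made-up labels; the function name and the sample data are assumptions for illustration only.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0/1.

    Rows are the real class, columns are the predicted class.
    """
    matrix = [[0, 0], [0, 0]]
    for real, pred in zip(y_true, y_pred):
        matrix[real][pred] += 1
    return matrix

# Made-up labels, not data from the talk.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1]

(tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Which of the four cells matters most depends on the application, for example false positives in one problem and false negatives in another.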
  6. IT DEPENDS ON THE PROBLEM!!

  7. MOST COMMON PRACTICES

    ▪ Dimensionality reduction:
      ▫ SMOTE
      ▫ Synthetic sample creation
    [Figures: two plots, each labeled Y = 0 and Y = 1]
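SMOTE creates synthetic minority-class samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified pure-Python illustration of that idea, not the implementation of any library; the function name `smote_like`, the parameter `k` and the sample points are assumptions.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

# Made-up minority-class points (two features each).
minority = [(1.0, 2.0), (1.5, 2.5), (3.0, 1.0), (2.0, 2.0)]
synthetic = smote_like(minority, n_new=5)
```

Every synthetic point lies on a segment between two existing minority points, so the minority class grows without exact duplicates.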
  8. SAME ALGORITHMS DIFFERENT TECHNIQUES

    ▪ If you expect different results, you have to do different things
    ▪ Exploit all the data you have
    ▪ Bagging algorithm: first step, Random Forest
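Bagging trains each model on a bootstrap sample, that is, rows drawn with replacement from the training set. A minimal sketch of that sampling step follows; the seed, the number of bags and the use of row indices as stand-ins for rows are assumptions for illustration.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bag for one tree)."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(42)        # fixed seed, chosen arbitrarily
rows = list(range(1, 17))      # stand-ins for 16 training rows
bags = [bootstrap_sample(rows, rng) for _ in range(3)]  # one bag per tree

for i, bag in enumerate(bags, start=1):
    print(f"bag {i}: {sorted(bag)}")
```

Because sampling is with replacement, some rows appear several times in a bag and others not at all, which is what makes the trees differ from each other.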
  9. RANDOM FOREST

          F1   F2  F3     …  FN     Y
       1  1.2  25  True   …  0.185  1
       2  3.4  55  False  …  0.211  1
       3  2.2  58  True   …  0.171  0
       4  4.0  34  True   …  0.132  1
       5  1.1  63  True   …  0.652  0
       6  0.7  61  False  …  0.153  0
       7  3.3  12  False  …  0.477  1
       8  3.1  23  True   …  0.311  1
       9  1.2  29  False  …  0.171  1
      10  3.4  45  True   …  0.132  0
      11  2.1  55  True   …  0.652  1
      12  1.7  19  False  …  0.189  0
      13  3.3  12  False  …  0.477  1
      14  3.1  23  True   …  0.311  1
      15  1.2  29  False  …  0.171  1
      16  2.2  58  True   …  0.171  0
       …
  10. RANDOM FOREST

      F1   F2  F3     …  FN     Y
      1.5  25  False  …  0.185  ???

    Tree votes: 1, 1, 0
    MAJORITY VOTE: 1
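The majority vote on this slide can be sketched as follows. The function name is illustrative; the tie-breaking behaviour noted in the comment only matters with an even number of trees.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label predicted by most trees.

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    return Counter(votes).most_common(1)[0][0]

tree_votes = [1, 1, 0]   # the three votes shown on the slide
prediction = majority_vote(tree_votes)
print(prediction)
```

With the slide's votes 1, 1, 0 the predicted class is 1.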
  11. EM FOREST

          F1   F2  F3     …  FN     Y
       1  1.2  25  True   …  0.185  1
       2  3.4  55  False  …  0.211  1
       3  2.2  58  True   …  0.171  0
       4  4.0  34  True   …  0.132  1
       5  1.1  63  True   …  0.652  0
       6  0.7  61  False  …  0.153  0
       7  3.3  12  False  …  0.477  1
       8  3.1  23  True   …  0.311  1
       9  1.2  29  False  …  0.171  1
      10  3.4  45  True   …  0.132  0
      11  2.1  55  True   …  0.652  1
      12  1.7  19  False  …  0.189  0
      13  3.3  12  False  …  0.477  1
      14  3.1  23  True   …  0.311  1
      15  1.2  29  False  …  0.171  1
      16  2.2  58  True   …  0.171  0
       …
  12. EM FOREST: Transforming the problem

          F1   F2  F3     …  FN     Y
       1  1.2  25  True   …  0.185  1
       2  3.4  55  False  …  0.211  1
       3  2.2  58  True   …  0.171  0
       4  4.0  34  True   …  0.132  1
       5  1.1  63  True   …  0.652  0
       6  0.7  61  False  …  0.153  0
       7  3.3  12  False  …  0.477  1
       8  3.1  23  True   …  0.311  1
       9  1.2  29  False  …  0.171  1
      10  3.4  45  True   …  0.132  0
      11  2.1  55  True   …  0.652  1
      12  1.7  19  False  …  0.189  0
      13  3.3  12  False  …  0.477  1
      14  3.1  23  True   …  0.311  1
      15  1.2  29  False  …  0.171  1
      16  2.2  58  True   …  0.171  0
       …

      0 1 0 1

          Tree1  Tree2  Tree3  Y
       1  1      1      0      1
       2
       …
      18
  13. EM FOREST: The new problem

          Tree1  Tree2  Tree3  Y
       1  1      1      0      1
       2  1      0      1      1
       3  1      1      1      0
       4  0      1      0      1
       5  0      0      0      0
       6  1      0      1      0
       7  0      1      0      1
       8  0      1      0      1
       9  1      0      1      1
      10  1      1      0      0
      11  0      1      0      1
      12  0      0      1      0
      13  1      0      1      1
      14  1      1      0      1
      15  1      1      0      1
      16  0      0      1      0
      17  0      1      0      1
      18  1      0      0      0
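Treating the per-tree predictions as features of a new supervised problem can be sketched as follows. The second-stage learner here is a hypothetical lookup table that remembers the most frequent Y for each vote pattern; the talk does not say which learner is used, so this is an assumption made only to show the mechanics on the 18 rows above.

```python
from collections import Counter, defaultdict

# (Tree1, Tree2, Tree3, Y) for the 18 rows on the slide.
rows = [
    (1, 1, 0, 1), (1, 0, 1, 1), (1, 1, 1, 0), (0, 1, 0, 1), (0, 0, 0, 0),
    (1, 0, 1, 0), (0, 1, 0, 1), (0, 1, 0, 1), (1, 0, 1, 1), (1, 1, 0, 0),
    (0, 1, 0, 1), (0, 0, 1, 0), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 0, 1),
    (0, 0, 1, 0), (0, 1, 0, 1), (1, 0, 0, 0),
]

def majority_vote(votes):
    """Plain Random Forest style decision: 1 if more than half the trees say 1."""
    return int(sum(votes) * 2 > len(votes))

# Hypothetical second-stage learner: most frequent Y per vote pattern.
labels_by_pattern = defaultdict(Counter)
for *votes, y in rows:
    labels_by_pattern[tuple(votes)][y] += 1
pattern_to_label = {
    pattern: counts.most_common(1)[0][0]
    for pattern, counts in labels_by_pattern.items()
}

majority_correct = sum(majority_vote(v) == y for *v, y in rows)
lookup_correct = sum(pattern_to_label[tuple(v)] == y for *v, y in rows)
print(majority_correct, lookup_correct)
```

On these 18 rows the majority vote matches Y in 10 cases and the lookup in 16. The lookup is scored on the same rows it was built from, so the comparison only illustrates the extra information in the vote vector and says nothing about generalization.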
  14. EM FOREST: The new possibilities

    ▪ Vector vs. Aggregated

    Vector:
          Tree1  Tree2  Tree3  Y
       1  1      1      0      1
       2  1      0      1      1
       3  1      1      1      0
       4  0      1      0      1
       5  0      0      0      0
       6  1      0      1      0
       7  0      1      0      1
       8  0      1      0      1

    Aggregated:
          Agg  Y
       1  2    1
       2  2    1
       3  3    0
       4  0    1
       5  1    0
       6  2    0
       7  1    1
       8  1    1
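The difference between the two representations can be sketched under one assumption: that the aggregated feature is the number of trees voting 1. The talk does not define Agg explicitly, and this reading reproduces the slide's Agg column for six of the eight rows shown, so treat it as a guess.

```python
# Vote vectors (Tree1, Tree2, Tree3) for the eight rows on the slide.
votes = [
    (1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 0),
    (0, 0, 0), (1, 0, 1), (0, 1, 0), (0, 1, 0),
]

def aggregate(vote_vector):
    """Assumed aggregation: count of trees that voted 1."""
    return sum(vote_vector)

agg = [aggregate(v) for v in votes]

# The vector form keeps which tree voted what; the aggregate does not.
distinct_vectors = len(set(votes))
distinct_aggregates = len(set(agg))
print(agg, distinct_vectors, distinct_aggregates)
```

The vector form distinguishes (1, 1, 0) from (1, 0, 1), while the aggregate maps both to 2. The aggregate gives a single feature whatever the number of trees, at the cost of that detail.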
  15. EM FOREST: The new results

    ▪ Result improvement: better score (at least the same) than Random Forest
    ▪ Result flexibility: better in balanced and unbalanced data (trading and illness detection)
  16. EM FOREST: Advantages

    ▪ Open Source
    ▪ Scalability
    ▪ More possibilities
  17. EM FOREST: Use cases

    ▪ Real projects: credit card usage trends
    ▪ Demo projects: bank fraud, alcohol in students dataset
  18. THANKS! Any questions?

    You can find me at: Eric Martin, ericmartinct@gmail.com