Jeong-Yoon Lee - Winning Data Science Competitions - Data Science Meetup - Oct 2015

Data Science LA
October 28, 2015


Transcript

  1. WINNING DATA SCIENCE COMPETITIONS Jeong-Yoon Lee @Conversion Logic

  2. None
  3. DATA SCIENCE COMPETITIONS

  4. DATA SCIENCE COMPETITIONS Since 1997 2006 - 2009 Since 2010

  5. KAGGLE COMPETITIONS • 227 competitions since 2010 • 397,870 competitors • $3MM+ prize paid out
  6. KAGGLE COMPETITIONS

  7. A Ph.D. or CS degree is NOT required to win!

  8. KAGGLERS IN TOWN: Sang Su Lee @Retention Science, Hang Li @Hulu, Feng Qi @Quora (ex-Hulu)
  9. COMPETITION STRUCTURE: training data comes with both features and labels; test data provides features only, and the labels you predict in your submission are scored on the public and private leaderboards (LB).
  10. BEST PRACTICES

  11. BEST PRACTICES • Feature Engineering • Machine Learning • Cross Validation • Ensemble
  12. FEATURE ENGINEERING • Numerical: Log, Log(1 + x), Normalization, Binarization • Categorical: One-hot-encode, TF-IDF (text), Weight-of-Evidence • Timeseries: Stats, FFT, MFCC (audio), ERP (EEG) • Numerical/Timeseries to Categorical: RF/GBM* (* http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf)
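A minimal sketch of the numerical and categorical transforms listed above, using numpy only; the feature names (`spend`, `city`) and the binarization threshold are hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical toy features.
spend = np.array([0.0, 9.0, 99.0])           # a skewed numerical feature
city = np.array(["LA", "SF", "LA"])          # a categorical feature

# Numerical: log(1 + x) compresses heavy-tailed values and handles zeros.
spend_log = np.log1p(spend)

# Numerical: min-max normalization to [0, 1].
spend_norm = (spend - spend.min()) / (spend.max() - spend.min())

# Numerical: binarization against a (made-up) threshold.
spend_bin = (spend > 10).astype(int)

# Categorical: one-hot encoding, one column per level.
levels = np.unique(city)                     # sorted unique levels
one_hot = (city[:, None] == levels[None, :]).astype(int)
```

In practice pandas `get_dummies` or scikit-learn's preprocessing module does the categorical part, but the mechanics are the same.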
  13. MACHINE LEARNING
      Algorithm | Tool | Note
      Gradient Boosting Machine | XGBoost | The best out-of-the-box solution
      Random Forests | Scikit-Learn, randomForest |
      Extra Trees | Scikit-Learn |
      Regularized Greedy Forest | Tong Zhang's |
      Neural Networks | Keras, Lasagne, MXNet | Blends well with GBM. Best at image recognition competitions.
      Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble.
      Support Vector Machine | Scikit-Learn |
      FTRL | Vowpal Wabbit, tinrtgu's | Competitive solution for CTR estimation competitions
      Factorization Machine | libFM | Winning solution for KDD Cup 2012
      Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions
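The idea behind the table's top recommendation (gradient boosting, as in XGBoost) can be sketched from scratch. This toy regressor boosts decision stumps on squared loss; it illustrates the mechanism only and is not the library's implementation (all names here, `fit_stump` and `gbm_fit`, are made up):

```python
import numpy as np

def fit_stump(x, residual):
    """Pick the split on 1-D x that best fits the residual with two constants."""
    best = None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = ((residual - pred) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda z, t=t, lv=lv, rv=rv: np.where(z <= t, lv, rv)

def gbm_fit(x, y, n_rounds=50, lr=0.1):
    """Gradient boosting on squared loss: each stump fits the current residual."""
    base = y.mean()
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)   # residual = negative gradient
        pred = pred + lr * stump(x)      # shrink each stump's contribution
        stumps.append(stump)
    return lambda z: base + lr * sum(s(z) for s in stumps)

x = np.linspace(0.0, 1.0, 40)
y = (x > 0.5).astype(float)              # a step function to learn
model = gbm_fit(x, y)
```

Real GBM libraries add deeper trees, regularization, subsampling, and arbitrary differentiable losses, but the fit-the-residual loop is the same.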
  14. CROSS VALIDATION Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
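The stratified split described above can be hand-rolled in a few lines; this is an illustration only (in practice scikit-learn's `StratifiedKFold` does the job), with a made-up 20% dropout rate:

```python
import numpy as np

def stratified_folds(y, n_folds=5, seed=0):
    """Deal each class's shuffled indices round-robin across folds,
    so every fold keeps the overall label ratio."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for label in np.unique(y):
        idx = rng.permutation(np.where(y == label)[0])
        for i, j in enumerate(idx):
            folds[i % n_folds].append(j)
    return [np.array(sorted(f)) for f in folds]

# Toy labels: 100 students, hypothetical 20% dropout rate.
y = np.array([1] * 20 + [0] * 80)
folds = stratified_folds(y)
# Each fold has 20 samples and exactly a 0.2 dropout rate.
```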
  15. ENSEMBLE - STACKING (* for other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/)
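The stacking mechanics in miniature: out-of-fold (OOF) predictions from stage-1 base models become the training features of a stage-2 model. The base models (simple threshold predictors) and the averaging meta-model below are trivial stand-ins, not the models actually used in any competition:

```python
import numpy as np

def oof_predictions(trainer, X, y, n_folds=5):
    """For each fold, fit on the other folds and predict the held-out fold,
    yielding one leakage-free prediction per training sample."""
    oof = np.zeros(len(y))
    all_idx = np.arange(len(y))
    for fold in np.array_split(all_idx, n_folds):
        train = np.setdiff1d(all_idx, fold)
        predict = trainer(X[train], y[train])
        oof[fold] = predict(X[fold])
    return oof

def mean_by_threshold(t):
    """A stand-in base model: predicts the training-label mean on each side of t."""
    def fit(X, y):
        lo, hi = y[X <= t].mean(), y[X > t].mean()
        return lambda Z: np.where(Z <= t, lo, hi)
    return fit

X = np.linspace(0.0, 1.0, 50)
y = (X > 0.6).astype(float)

# Stage 1: OOF predictions from two base models become meta-features.
meta_features = np.column_stack([
    oof_predictions(mean_by_threshold(0.5), X, y),
    oof_predictions(mean_by_threshold(0.7), X, y),
])
# Stage 2: a trivial averaging meta-model (a real stack would train one here).
stacked = meta_features.mean(axis=1)
```

The essential point is that stage-2 inputs are out-of-fold: each base prediction comes from a model that never saw that row, so the meta-model is trained without leakage.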
  16. KDD CUP 2015 WINNING SOLUTION* InterContinental Ensemble (* originally presented by Kohei Ozaki, Mert Bay, and Tam T. Nguyen at KDD 2015)
  17. KDD CUP 2015 • Goal: predict dropouts in MOOCs. • Student activity logs and metadata are provided. • 821 teams. • $20K total prize.
  18. KDD CUP 2015 - DATA • 39 courses, 27K objects, 112K students. • 200K enrollments, 13MM activities.
  19. None
  20. INTERCONTINENTAL ENSEMBLE: Jeong-Yoon Lee, Mert Bay (Conversion Logic); Song Chen (AIG); Andreas Toescher, Michael Jahrer (Opera Solutions); Peng Yang, Xiacong Zhou (NetEase, Tsinghua University); Kohei Ozaki (AIG Japan); Tam T. Nguyen (I2R A*STAR)
  21. COLLABORATION

  22. STORY ABOUT LAST 28 HOURS

  23. Story about last 28 hours (1 of 3). 28 hours before the deadline: The ensemble framework and feature engineering worked great, but we were still in 3rd place.
  24. Story about last 28 hours (2 of 3). 28 hours before the deadline: We kept working on feature engineering, and single models with a new feature made a great improvement.
  25. Story about last 28 hours (3 of 3). 27 hours before the deadline: Ensemble models were retrained with the new single models, and we jumped up to 1st!
  26. FEATURE ENGINEERING

  27. Sequential Data Cube • Time: hour, day, week, month • Event: navigate, access, problem, page close, video, discussion, wiki • Object: user, course, source, user:course, …
  28. Data Slicing and Dicing

  29. None
  30. None
  31. SINGLE MODEL TRAINING

  32. ALGORITHMS
      Algorithm | # of Single Models
      Gradient Boosting Machine | 26
      Neural Network | 14
      Factorization Machine | 12
      Logistic Regression | 6
      Kernel Ridge Regression | 2
      Extra Trees | 2
      Random Forest | 2
      K-Nearest Neighbor | 1
      • A total of 64 single models were used in the final solution.
  33. Single Model Training (diagram): Training Data and Test Data pass through Feature Selection to produce Transformed Training Data, CV Transformed Data, and Transformed Test Data; Single Model Training then yields a CV Prediction and a Test Prediction.
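The pipeline in this diagram can be sketched with standardization standing in for the real feature transforms (an assumption for illustration); the key point is that transform parameters are fit on the training data only and then applied to both the CV and test data:

```python
import numpy as np

# Toy data standing in for the competition features.
rng = np.random.default_rng(0)
train = rng.normal(5.0, 2.0, (80, 3))
test = rng.normal(5.0, 2.0, (20, 3))

# Fit the transform on training data ONLY (no peeking at test statistics).
mu, sigma = train.mean(axis=0), train.std(axis=0)

train_t = (train - mu) / sigma   # transformed training data
test_t = (test - mu) / sigma     # transformed test data, same parameters

# A single model trained on train_t would now emit a CV prediction
# (via folds of train_t) and a test prediction on test_t.
```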
  34. ENSEMBLE MODEL TRAINING

  35. Ensemble Model Training

  36. ENSEMBLE FRAMEWORK

  37. ENSEMBLE FRAMEWORK

  38. CV VS. LB SCORES LB AUC = 1.03 x CV AUC - 0.03
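As a one-line helper, the relation above; note it is an empirical fit the team observed for this competition's leaderboard, not a general rule:

```python
def expected_lb(cv_auc):
    """Empirical CV-to-public-LB mapping from the slide: LB = 1.03 * CV - 0.03."""
    return 1.03 * cv_auc - 0.03
```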
  39. IMPROVEMENTS BY ENSEMBLE
      Model | 5-Fold CV | Public Leaderboard
      Single Best | 0.906721 | 0.907765
      Stage-I Best | 0.907688 | 0.908796
      Stage-II Best | 0.907968 | N/A
      Stage-III Best | 0.908194 | 0.909181
      • From single best to Stage-III ensemble best is a 0.0014 improvement!
  40. SUMMARY The following items helped tremendously in winning KDD Cup 2015: • Good team dynamics and collaboration. • Hand-crafted features. • Multi-stage ensemble.
  41. None
  42. Thank you!

  43. jeongyoon.lee1@gmail.com @jeongyoonlee linkedin.com/in/jeongyoonlee kaggle.com/jeongyoonlee github.com/jeongyoonlee/