Jeong-Yoon Lee - Winning Data Science Competitions - Data Science Meetup - Oct 2015

Data Science LA
October 28, 2015

Transcript

  1. FEATURE ENGINEERING
    • Numerical - Log, Log(1 + x), Normalization, Binarization
    • Categorical - One-hot-encode, TF-IDF (text), Weight-of-Evidence
    • Timeseries - Stats, FFT, MFCC (audio), ERP (EEG)
    • Numerical/Timeseries to Categorical - RF/GBM*
    * http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
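
A few of the numerical and categorical transforms listed on the feature engineering slide can be sketched in plain Python. This is a minimal illustration, not code from the talk: the helper names (`log1p_transform`, `standardize`, `binarize`, `one_hot`) and the sample `clicks` and `devices` columns are assumptions made up for the example, and "Normalization" is interpreted here as a z-score.

```python
import math


def log1p_transform(values):
    """Log(1 + x): compresses heavy right tails and is defined at x = 0."""
    return [math.log1p(v) for v in values]


def standardize(values):
    """Normalization as a z-score (zero mean, unit population std)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # avoid dividing by zero on constant columns
    return [(v - mean) / std for v in values]


def binarize(values, threshold=0.0):
    """Binarization: 1 if the value exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def one_hot(categories):
    """One-hot-encode: one 0/1 column per distinct level (sorted order)."""
    levels = sorted(set(categories))
    index = {level: i for i, level in enumerate(levels)}
    rows = [[1 if index[c] == j else 0 for j in range(len(levels))]
            for c in categories]
    return levels, rows


# Hypothetical toy columns, for illustration only.
clicks = [0, 1, 9, 99]
devices = ["web", "app", "web", "tv"]

logged = log1p_transform(clicks)
scaled = standardize(logged)
flags = binarize(clicks)
levels, encoded = one_hot(devices)
```

In practice these transforms are usually taken from a library such as Scikit-Learn; the hand-written versions are only meant to make each operation explicit.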
  2. MACHINE LEARNING
    Algorithm | Tool | Note
    Gradient Boosting Machine | XGBoost | The best out-of-the-box solution
    Random Forests | Scikit-Learn, randomForest |
    Extra Trees | Scikit-Learn |
    Regularized Greedy Forest | Tong Zhang’s |
    Neural Networks | Keras, Lasagne, MXNet | Blends well with GBM. Best at image recognition competitions.
    Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble.
    Support Vector Machine | Scikit-Learn |
    FTRL | Vowpal Wabbit, tinrtgu’s | Competitive solution for CTR estimation competitions
    Factorization Machine | libFM | Winning solution for KDD Cup 2012
    Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions
  3. CROSS VALIDATION
    Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
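
The stratified split described on the cross validation slide can be sketched as follows. This is a minimal pure-Python illustration under assumptions, not the team's code: the function name `stratified_folds`, the seed, and the toy label vector (1 = dropout, 20% positive) are made up for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Split row indices into folds that keep fold size and class rate equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    offset = 0  # continue the round-robin across classes to balance fold sizes
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % n_folds].append(idx)
        offset += len(indices)
    return folds


# Hypothetical labels: 20 dropouts (1) and 80 non-dropouts (0).
labels = [1] * 20 + [0] * 80
folds = stratified_folds(labels)
rates = [sum(labels[i] for i in fold) / len(fold) for fold in folds]
```

Each index lands in exactly one fold, so fold k can serve as the validation set while the other four are used for training. Scikit-Learn's `StratifiedKFold` provides the same kind of split.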
  4. ENSEMBLE - STACKING
    * For other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
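
The stacking slide carries no detail in the transcript, so the following is only a generic sketch of two-stage stacking with Scikit-Learn, not the configuration used in the talk. The synthetic data, the choice of base models, and the logistic-regression meta model are all assumptions. The key idea it shows is that the meta model is trained on out-of-fold predictions, so it never sees a base-model prediction made on a row that base model was fitted on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, y_train = X[:300], y[:300]
X_test = X[300:]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
base_models = [
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

oof_columns, test_columns = [], []
for model in base_models:
    # Out-of-fold predictions: each training row is scored by a model
    # that was fitted on the other four folds.
    oof = cross_val_predict(model, X_train, y_train, cv=cv,
                            method="predict_proba")[:, 1]
    oof_columns.append(oof)
    # Refit on all training rows to score the test set.
    model.fit(X_train, y_train)
    test_columns.append(model.predict_proba(X_test)[:, 1])

stage1_train = np.column_stack(oof_columns)
stage1_test = np.column_stack(test_columns)

# Meta model: learns how to combine the base-model predictions.
meta_model = LogisticRegression()
meta_model.fit(stage1_train, y_train)
final_pred = meta_model.predict_proba(stage1_test)[:, 1]
```

Further stages repeat the same pattern, using the previous stage's out-of-fold predictions as input features.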
  5. KDD CUP 2015 WINNING SOLUTION*
    InterContinental Ensemble
    * Originally presented by Kohei Ozaki, Mert Bay, and Tam T. Nguyen at KDD 2015
  6. KDD CUP 2015
    • To predict dropouts in MOOCs.
    • Student activity logs and metadata are provided.
    • 821 teams.
    • $20K total prize.
  7. KDD CUP 2015 - DATA
    • 39 courses, 27K objects, 112K students.
    • 200K enrollments, 13MM activities.
  8. INTERCONTINENTAL ENSEMBLE
    • Jeong-Yoon Lee, Mert Bay - Conversion Logic
    • Song Chen - AIG
    • Andreas Toescher, Michael Jahrer - Opera Solutions
    • Peng Yang, Xiacong Zhou - NetEase, Tsinghua University
    • Kohei Ozaki - AIG Japan
    • Tam T. Nguyen - I2R A*STAR
  9. Story about last 28 hours (1 of 3)
    28 hours before the deadline: The ensemble framework and feature engineering worked great, but we were still in 3rd place.
  10. Story about last 28 hours (2 of 3)
    28 hours before the deadline: We continued working on feature engineering, and single models with a new feature made a great improvement.
  11. Story about last 28 hours (3 of 3)
    27 hours before the deadline: Ensemble models were trained with the new single models, and we jumped up to 1st place!
  12. Sequential Data Cube
    • time - hour, day, week, month
    • event - navigate, access, problem, page close, video, discussion, wiki
    • object - user, course, source, user:course, …
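
One possible reading of the data cube slide is that features are event counts aggregated along the three axes (time, event, object). The sketch below follows that reading and is an assumption, not the team's implementation: the log row layout, the helper names `time_bucket` and `cube_counts`, and the sample rows are all hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical activity log rows: (user, course, event, ISO timestamp).
logs = [
    ("u1", "c1", "video", "2014-06-01T10:15:00"),
    ("u1", "c1", "video", "2014-06-01T10:45:00"),
    ("u1", "c1", "problem", "2014-06-02T09:00:00"),
    ("u2", "c1", "wiki", "2014-06-01T23:30:00"),
]


def time_bucket(ts, unit):
    """Map a timestamp to a bucket label on the time axis."""
    if unit == "hour":
        return ts.strftime("%Y-%m-%d %H")
    if unit == "day":
        return ts.strftime("%Y-%m-%d")
    if unit == "week":
        iso = ts.isocalendar()
        return f"{iso[0]}-W{iso[1]:02d}"
    if unit == "month":
        return ts.strftime("%Y-%m")
    raise ValueError(f"unknown time unit: {unit}")


def cube_counts(rows, obj, unit):
    """Count events per (object key, event type, time bucket) cell."""
    counts = Counter()
    for user, course, event, stamp in rows:
        ts = datetime.fromisoformat(stamp)
        key = {"user": user, "course": course,
               "user:course": f"{user}:{course}"}[obj]
        counts[(key, event, time_bucket(ts, unit))] += 1
    return counts


daily = cube_counts(logs, "user:course", "day")
monthly = cube_counts(logs, "course", "month")
```

Each choice of object key and time unit yields a different slice of counts, which can then be joined back to the enrollment rows as features.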
  13. ALGORITHMS
    Algorithms | # of Single Models
    Gradient Boosting Machine | 26
    Neural Network | 14
    Factorization Machine | 12
    Logistic Regression | 6
    Kernel Ridge Regression | 2
    Extra Trees | 2
    Random Forest | 2
    K-Nearest Neighbor | 1
    • A total of 64 single models were used in the final solution.
  14. IMPROVEMENTS BY ENSEMBLE
    Model | 5-Fold CV | Public Leaderboard
    Single Best | 0.906721 | 0.907765
    Stage-I Best | 0.907688 | 0.908796
    Stage-II Best | 0.907968 | N/A
    Stage-III Best | 0.908194 | 0.909181
    • From the single best model to the Stage-III ensemble best, the score improved by 0.0014!
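
The 0.0014 figure can be checked directly from the scores on the slide. The snippet below is just that arithmetic; the dictionary layout and variable names are chosen for the example.

```python
# Scores copied from the slide: (5-fold CV, public leaderboard).
scores = {
    "Single Best": (0.906721, 0.907765),
    "Stage-I Best": (0.907688, 0.908796),
    "Stage-II Best": (0.907968, None),  # no leaderboard score reported
    "Stage-III Best": (0.908194, 0.909181),
}

base_cv, base_lb = scores["Single Best"]
final_cv, final_lb = scores["Stage-III Best"]

cv_gain = final_cv - base_cv
lb_gain = final_lb - base_lb

print(f"5-fold CV gain: {cv_gain:.6f}")
print(f"Public leaderboard gain: {lb_gain:.6f}")
```

Both differences round to 0.0014, and each stage improves on the previous one in 5-fold CV.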
  15. SUMMARY
    • The following items helped tremendously in winning KDD Cup 2015:
    • Good team dynamics and collaboration.
    • Hand-crafted features.
    • Multi-stage ensemble.