Jeong-Yoon Lee - Winning Data Science Competitions - Data Science Meetup - Oct 2015

WINNING DATA SCIENCE COMPETITIONS Jeong-Yoon Lee @Conversion Logic

DATA SCIENCE COMPETITIONS

DATA SCIENCE COMPETITIONS
• Since 1997
• 2006 - 2009
• Since 2010

KAGGLE COMPETITIONS
• 227 competitions since 2010
• 397,870 competitors
• $3MM+ prize paid out

KAGGLE COMPETITIONS

Ph.D or CS degree is NOT required to win!

KAGGLER IN TOWN
• Sang Su Lee @Retention Science
• Hang Li @Hulu
• Feng Qi @Quora (x-Hulu)

COMPETITION STRUCTURE
• Training Data - Feature and Label provided
• Test Data - Feature provided; Submission scored as Public LB Score and Private LB Score

BEST PRACTICES

BEST PRACTICES
• Feature Engineering
• Machine Learning
• Cross Validation
• Ensemble

FEATURE ENGINEERING
• Numerical - Log, Log(1 + x), Normalization, Binarization
• Categorical - One-hot-encode, TF-IDF (text), Weight-of-Evidence
• Timeseries - Stats, FFT, MFCC (audio), ERP (EEG)
• Numerical/Timeseries to Categorical - RF/GBM*
* http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
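A few of the transforms listed above can be sketched in plain Python. This is a minimal illustration, not the speaker's code: the function names, the additive smoothing in the Weight-of-Evidence helper, and the sign convention (log of positive share over negative share) are all assumptions made for the example.

```python
import math
from collections import Counter

def log1p_transform(values):
    """Log(1 + x): compresses heavy right tails and is defined at x = 0."""
    return [math.log1p(v) for v in values]

def standardize(values):
    """Normalization to zero mean and unit (population) variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # constant columns: avoid dividing by zero
    return [(v - mean) / std for v in values]

def binarize(values, threshold=0.0):
    """Binarization: 1 if the value exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

def one_hot(categories):
    """One-hot encoding: one 0/1 column per distinct level (sorted order)."""
    levels = sorted(set(categories))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for c in categories:
        row = [0] * len(levels)
        row[index[c]] = 1
        rows.append(row)
    return levels, rows

def weight_of_evidence(categories, labels, smoothing=0.5):
    """WoE per level: log(share of positives / share of negatives).

    The additive smoothing is an assumption of this sketch; it keeps the
    ratio finite when a level has no positives or no negatives.
    """
    pos_total = sum(labels)
    neg_total = len(labels) - pos_total
    pos = Counter(c for c, y in zip(categories, labels) if y == 1)
    neg = Counter(c for c, y in zip(categories, labels) if y == 0)
    levels = set(categories)
    k = len(levels)
    woe = {}
    for level in levels:
        p = (pos[level] + smoothing) / (pos_total + smoothing * k)
        n = (neg[level] + smoothing) / (neg_total + smoothing * k)
        woe[level] = math.log(p / n)
    return woe
```

In a competition setting, statistics such as the mean, standard deviation, and WoE table would normally be fitted on training folds only and then applied to held-out data, to avoid leaking labels.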

MACHINE LEARNING
Algorithm | Tool | Note
Gradient Boosting Machine | XGBoost | The best out-of-the-box solution
Random Forests | Scikit-Learn, randomForest |
Extra Trees | Scikit-Learn |
Regularized Greedy Forest | Tong Zhang’s |
Neural Networks | Keras, Lasagne, MXNet | Blends well with GBM. Best at image recognition competitions.
Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble.
Support Vector Machine | Scikit-Learn |
FTRL | Vowpal Wabbit, tinrtgu’s | Competitive solution for CTR estimation competitions
Factorization Machine | libFM | Winning solution for KDD Cup 2012
Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions

CROSS VALIDATION
Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
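A stratified five-fold split like the one described above could look roughly like this. It is a sketch under assumptions: labels are taken to be 0/1 dropout flags, and the helper names `stratified_kfold` and `split` are made up for illustration (scikit-learn's `StratifiedKFold` is the usual off-the-shelf choice).

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_folds=5, seed=42):
    """Assign each sample to a fold so every fold keeps roughly the same label ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    fold_of = [None] * len(labels)
    for y in sorted(by_label):
        indices = by_label[y]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, i in enumerate(indices):
            fold_of[i] = position % n_folds
    return fold_of

def split(fold_of, fold):
    """Return (train_indices, validation_indices) for one held-out fold."""
    train = [i for i, f in enumerate(fold_of) if f != fold]
    valid = [i for i, f in enumerate(fold_of) if f == fold]
    return train, valid

# Hypothetical labels: 20 dropouts (1) and 80 non-dropouts (0).
labels = [1] * 20 + [0] * 80
fold_of = stratified_kfold(labels)
train_idx, valid_idx = split(fold_of, fold=0)
```

Reusing the same fold assignment for every single model keeps their cross-validation predictions comparable, which matters once those predictions feed an ensemble.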

ENSEMBLE - STACKING
* For other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
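The general idea of stacking can be sketched as follows: each base model produces out-of-fold predictions for the training set, and a second-stage model is trained on those predictions. Everything here is illustrative and assumed rather than taken from the talk: the toy base learner `fit_minmax_scorer`, the hand-written logistic regression, the simple unstratified folds, and the synthetic data.

```python
import math
import random

def fit_minmax_scorer(feature):
    """Hypothetical toy base learner: rescales one feature to [0, 1] (ignores labels)."""
    def fit(rows, labels):
        column = [r[feature] for r in rows]
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1.0
        return lambda row: min(1.0, max(0.0, (row[feature] - lo) / span))
    return fit

def fit_logistic(rows, labels, epochs=200, lr=0.5):
    """Plain gradient-descent logistic regression, used here as the stage-2 model."""
    w = [0.0] * len(rows[0])
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for row, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_b += err
            for j, xi in enumerate(row):
                grad_w[j] += err * xi
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    def predict(row):
        z = b + sum(wi * xi for wi, xi in zip(w, row))
        return 1.0 / (1.0 + math.exp(-z))
    return predict

def out_of_fold(fit, rows, labels, fold_of, n_folds):
    """Predict each training row with a model that never saw that row's fold."""
    oof = [None] * len(rows)
    for k in range(n_folds):
        train = [i for i, f in enumerate(fold_of) if f != k]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        for i, f in enumerate(fold_of):
            if f == k:
                oof[i] = model(rows[i])
    return oof

def stack(base_fits, rows, labels, test_rows, n_folds=5):
    fold_of = [i % n_folds for i in range(len(rows))]  # unstratified, for brevity
    # Stage 1: out-of-fold predictions become the stage-2 training features.
    oof_columns = [out_of_fold(fit, rows, labels, fold_of, n_folds) for fit in base_fits]
    stage2_train = [list(p) for p in zip(*oof_columns)]
    # Refit each base model on all training data to score the test set.
    full_models = [fit(rows, labels) for fit in base_fits]
    stage2_test = [[m(r) for m in full_models] for r in test_rows]
    # Stage 2: a model trained on the base-model predictions.
    meta = fit_logistic(stage2_train, labels)
    return [meta(r) for r in stage2_test]

# Synthetic data: two noisy features that both increase with the label.
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
rows = [[y + rng.gauss(0, 0.5), y + rng.gauss(0, 0.5)] for y in labels]
test_rows = [[1.2, 0.9], [-0.1, 0.2]]
preds = stack([fit_minmax_scorer(0), fit_minmax_scorer(1)], rows, labels, test_rows)
```

Training the second stage on out-of-fold predictions, rather than on predictions for rows the base models were fitted on, is what keeps the ensemble from simply rewarding the most overfit base model.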

KDD CUP 2015 WINNING SOLUTION*
InterContinental Ensemble
* Originally presented by Kohei Ozaki, Mert Bay, and Tam T. Nguyen at KDD 2015

KDD CUP 2015
• To predict dropouts in MOOCs.
• Student activity logs and meta data are provided.
• 821 teams.
• $20K total prize.

KDD CUP 2015 - DATA
• 39 courses, 27K objects, 112K students.
• 200K enrollments, 13MM activities.

INTERCONTINENTAL ENSEMBLE
• Jeong-Yoon Lee, Mert Bay - Conversion Logic
• Song Chen - AIG
• Andreas Toescher, Michael Jahrer - Opera Solutions
• Peng Yang, Xiacong Zhou - NetEase, Tsinghua University
• Kohei Ozaki - AIG Japan
• Tam T. Nguyen - I2R A*STAR

COLLABORATION

STORY ABOUT LAST 28 HOURS

Story about last 28 hours (1 of 3)
28 hours before the deadline: Ensemble framework and feature engineering worked great. But we were still in 3rd place.

before the deadline: We continued working on feature engineering, and single models with a new feature made a great improvement.

before the deadline: Ensemble models were trained with the new single models, and we jumped up to 1st place!

FEATURE ENGINEERING

Sequential Data Cube
• time - hour, day, week, month
• event - navigate, access, problem, page close, video, discussion, wiki
• object - user, course, source, user:course, …
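One way to read the data cube above is as a recipe for count features: pick a time unit, optionally an event type, and count activities per enrollment. The sketch below assumes a hypothetical log layout of (enrollment_id, timestamp, event, object_id); the records and the function names are invented for illustration and do not reflect the actual KDD Cup 2015 schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log records: (enrollment_id, timestamp, event, object_id).
logs = [
    ("e1", "2014-06-01T09:15:00", "video", "obj_a"),
    ("e1", "2014-06-01T09:40:00", "problem", "obj_b"),
    ("e1", "2014-06-08T21:05:00", "video", "obj_a"),
    ("e2", "2014-06-02T10:00:00", "navigate", "obj_c"),
]

def time_keys(ts):
    """Map a timestamp to its hour, day, ISO week, and month buckets."""
    t = datetime.fromisoformat(ts)
    iso = t.isocalendar()
    return {
        "hour": t.hour,
        "day": t.date().isoformat(),
        "week": f"{iso[0]}-W{iso[1]:02d}",
        "month": f"{t.year}-{t.month:02d}",
    }

def count_features(logs, time_unit, by_event=True):
    """Count activities per enrollment, sliced by one time unit and optionally by event."""
    counts = Counter()
    for enrollment, ts, event, _obj in logs:
        key = (enrollment, time_keys(ts)[time_unit])
        if by_event:
            key += (event,)
        counts[key] += 1
    return counts

monthly_by_event = count_features(logs, "month")
hourly_totals = count_features(logs, "hour", by_event=False)
```

Swapping the grouping key (user, course, source, or user:course instead of enrollment) gives the other "object" slices of the cube with the same counting logic.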

Data Slicing and Dicing

SINGLE MODEL TRAINING

ALGORITHMS
Algorithms | # of Single Models
Gradient Boosting Machine | 26
Neural Network | 14
Factorization Machine | 12
Logistic Regression | 6
Kernel Ridge Regression | 2
Extra Trees | 2
Random Forest | 2
K-Nearest Neighbor | 1
• A total of 64 single models were used in the final solution.

Single Model Training
[Diagram labels: Training Data, CV Transformed Data, Test Data, Transformed Test Data, CV Prediction, Test Prediction, Transformed Training Data, Feature Selection, Single Model Training]

ENSEMBLE MODEL TRAINING

Ensemble Model Training

ENSEMBLE FRAMEWORK

CV VS. LB SCORES
LB AUC = 1.03 × CV AUC - 0.03
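The scores related here are AUC values. As a reminder of what that metric measures, the sketch below computes AUC as the fraction of (positive, negative) pairs that a model ranks correctly, and wraps the linear relationship from this slide in a helper. The function names are assumptions for illustration; in practice scikit-learn's `roc_auc_score` would be the usual choice.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def expected_lb_auc(cv_auc):
    """Linear CV-to-LB relationship as stated on the slide."""
    return 1.03 * cv_auc - 0.03

example = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of samples, so it is only meant for small illustrations; rank-based implementations scale much better.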

IMPROVEMENTS BY ENSEMBLE
Model | 5-Fold CV | Public Leaderboard
Single Best | 0.906721 | 0.907765
Stage-I Best | 0.907688 | 0.908796
Stage-II Best | 0.907968 | N/A
Stage-III Best | 0.908194 | 0.909181
• The Stage-III ensemble best improves on the single best score by about 0.0014!
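The size of the ensemble gain can be checked directly from the figures on this slide. The short sketch below only does that arithmetic; the dictionary layout and the `gain` helper are made up for the example.

```python
# Scores as listed on the slide (5-fold CV AUC and public leaderboard AUC).
scores = {
    "Single Best": {"cv": 0.906721, "lb": 0.907765},
    "Stage-I Best": {"cv": 0.907688, "lb": 0.908796},
    "Stage-II Best": {"cv": 0.907968, "lb": None},  # no leaderboard score reported
    "Stage-III Best": {"cv": 0.908194, "lb": 0.909181},
}

def gain(metric, start="Single Best", end="Stage-III Best"):
    """Difference between two rows for one metric, rounded to six decimals."""
    return round(scores[end][metric] - scores[start][metric], 6)

cv_gain = gain("cv")  # 0.908194 - 0.906721
lb_gain = gain("lb")  # 0.909181 - 0.907765
```

Both differences come out at roughly 0.0014 to 0.0015, which is small in absolute terms but can separate leaderboard positions when top scores are this close.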

SUMMARY
• The following items helped tremendously in winning KDD Cup 2015:
  • Good team dynamics and collaboration.
  • Hand-crafted features.
  • Multi-stage ensemble.

Thank you!

@jeongyoonlee
linkedin.com/in/jeongyoonlee
kaggle.com/jeongyoonlee
github.com/jeongyoonlee/
