Slide 1

WINNING DATA SCIENCE COMPETITIONS
Jeong-Yoon Lee @Conversion Logic

Slide 2

No content

Slide 3

DATA SCIENCE COMPETITIONS

Slide 4

DATA SCIENCE COMPETITIONS
• Since 1997: KDD Cup
• 2006-2009: Netflix Prize
• Since 2010: Kaggle

Slide 5

KAGGLE COMPETITIONS
• 227 competitions since 2010
• 397,870 competitors
• $3MM+ in prizes paid out

Slide 6

KAGGLE COMPETITIONS

Slide 7

A Ph.D. or CS degree is NOT required to win!

Slide 8

KAGGLERS IN TOWN
• Sang Su Lee @Retention Science
• Hang Li @Hulu
• Feng Qi @Quora (ex-Hulu)

Slide 9

COMPETITION STRUCTURE
[Diagram: training data come with features and labels; test data provide features only. Each submission receives a public LB score during the competition, and the private LB score determines the final standing.]

Slide 10

BEST PRACTICES

Slide 11

BEST PRACTICES
• Feature Engineering
• Machine Learning
• Cross Validation
• Ensemble

Slide 12

FEATURE ENGINEERING
• Numerical: log, log(1 + x), normalization, binarization
• Categorical: one-hot encoding, TF-IDF (text), weight-of-evidence
• Time series: stats, FFT, MFCC (audio), ERP (EEG)
• Numerical/time series to categorical: RF/GBM*

* http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
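
A minimal sketch of a few of these transforms with pandas and scikit-learn; the toy DataFrame and column names are illustrative assumptions, not the competition schema:

```python
# Minimal sketches of the numerical/categorical transforms listed above.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data; column names are illustrative assumptions.
df = pd.DataFrame({"visits": [0, 3, 10, 250],
                   "course": ["math", "cs", "math", "bio"]})

df["log_visits"] = np.log1p(df["visits"])                                   # log(1 + x)
df["visits_norm"] = StandardScaler().fit_transform(df[["visits"]]).ravel()  # normalization
df["any_visit"] = (df["visits"] > 0).astype(int)                            # binarization
df = pd.get_dummies(df, columns=["course"])                                 # one-hot encoding
print(df)
```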

Slide 13

MACHINE LEARNING

Algorithm | Tool | Note
Gradient Boosting Machine | XGBoost | The best out-of-the-box solution
Random Forests | Scikit-Learn, randomForest |
Extra Trees | Scikit-Learn |
Regularized Greedy Forest | Tong Zhang's |
Neural Networks | Keras, Lasagne, MXNet | Blends well with GBM. Best at image recognition competitions.
Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble.
Support Vector Machine | Scikit-Learn |
FTRL | Vowpal Wabbit, tinrtgu's | Competitive solution for CTR estimation competitions
Factorization Machine | libFM | Winning solution for KDD Cup 2012
Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions
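
A minimal sketch of the "best out-of-the-box" entry, using XGBoost's scikit-learn wrapper; the dataset is synthetic and the hyperparameter values are illustrative assumptions, not the deck's:

```python
# A GBM baseline with XGBoost's scikit-learn wrapper.
# Hyperparameters are illustrative starting points, not tuned values.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=4,
                      subsample=0.8, colsample_bytree=0.8)
model.fit(X_trn, y_trn)
print("Validation AUC: %.4f" % roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```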

Slide 14

CROSS VALIDATION
Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
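
A minimal sketch of such a stratified split with scikit-learn's StratifiedKFold; the toy data and the ~30% dropout rate are assumptions:

```python
# Stratified 5-fold split: each fold preserves the overall dropout (label) rate.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(1000, 10)                  # toy features
y = (np.random.rand(1000) < 0.3).astype(int)  # toy dropout labels, ~30% positive

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (trn_idx, val_idx) in enumerate(skf.split(X, y)):
    print("fold %d: %d train / %d val, dropout rate %.3f"
          % (fold, len(trn_idx), len(val_idx), y[val_idx].mean()))
```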

Slide 15

ENSEMBLE - STACKING
* For other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
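
A minimal stacking sketch under the usual out-of-fold scheme; the stage-1 models and data are illustrative, and other ensemble types are covered by the link above:

```python
# Two-stage stacking: out-of-fold (OOF) predictions from stage-1 models
# become the input features of a stage-2 (meta) model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
stage1 = [RandomForestClassifier(n_estimators=100, random_state=0),
          GradientBoostingClassifier(random_state=0)]

oof = np.zeros((len(y), len(stage1)))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for trn_idx, val_idx in skf.split(X, y):
    for j, model in enumerate(stage1):
        model.fit(X[trn_idx], y[trn_idx])
        oof[val_idx, j] = model.predict_proba(X[val_idx])[:, 1]

meta = LogisticRegression().fit(oof, y)  # stage-2 model trained on OOF predictions
```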

Slide 16

KDD CUP 2015 WINNING SOLUTION*
InterContinental Ensemble
* Originally presented by Kohei Ozaki, Mert Bay, and Tam T. Nguyen at KDD 2015

Slide 17

KDD CUP 2015
• Predict dropouts in MOOCs.
• Student activity logs and metadata are provided.
• 821 teams.
• $20K total prize.

Slide 18

KDD CUP 2015 - DATA
• 39 courses, 27K objects, 112K students.
• 200K enrollments, 13MM activities.

Slide 19

No content

Slide 20

INTERCONTINENTAL ENSEMBLE
• Jeong-Yoon Lee, Mert Bay (Conversion Logic)
• Song Chen (AIG)
• Andreas Toescher, Michael Jahrer (Opera Solutions)
• Peng Yang, Xiacong Zhou (NetEase, Tsinghua University)
• Kohei Ozaki (AIG Japan)
• Tam T. Nguyen (I2R A*STAR)

Slide 21

COLLABORATION

Slide 22

STORY ABOUT THE LAST 28 HOURS

Slide 23

Story about the last 28 hours (1 of 3)
28 hours before the deadline: the ensemble framework and feature engineering worked great, but we were still in 3rd place.

Slide 24

Story about the last 28 hours (2 of 3)
28 hours before the deadline: we continued working on feature engineering, and single models with a new feature made a great improvement.

Slide 25

Story about the last 28 hours (3 of 3)
27 hours before the deadline: ensemble models were trained with the new single models, and we jumped up to 1st place!

Slide 26

FEATURE ENGINEERING

Slide 27

Sequential Data Cube
• Time: hour, day, week, month
• Event: navigate, access, problem, page close, video, discussion, wiki
• Object: user, course, source, user:course, …

Slide 28

Data Slicing and Dicing
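
A minimal pandas sketch of slicing such a cube into count features; the column names and rows are illustrative assumptions, not the actual log schema:

```python
# Slice a (time, event, object) activity cube into per-enrollment count
# features. Column names and values are illustrative assumptions.
import pandas as pd

log = pd.DataFrame({"enrollment_id": [1, 1, 1, 2, 2],
                    "event": ["video", "problem", "video", "wiki", "video"],
                    "week": [0, 0, 1, 0, 1]})

# One slice of the cube: per-enrollment event counts, by week and event type.
features = (log.groupby(["enrollment_id", "week", "event"])
               .size()
               .unstack(["week", "event"], fill_value=0))
print(features)
```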

Slide 29

No content

Slide 30

No content

Slide 31

SINGLE MODEL TRAINING

Slide 32

ALGORITHMS

Algorithm | # of Single Models
Gradient Boosting Machine | 26
Neural Network | 14
Factorization Machine | 12
Logistic Regression | 6
Kernel Ridge Regression | 2
Extra Trees | 2
Random Forest | 2
K-Nearest Neighbor | 1

• A total of 64 single models were used in the final solution.

Slide 33

Single Model Training
[Diagram: Training Data → Transformed Training Data and CV Transformed Data; Test Data → Transformed Test Data; after Feature Selection, Single Model Training produces a CV Prediction and a Test Prediction.]
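
A minimal sketch of this single-model unit, assuming numpy array inputs; the helper name train_single_model and its signature are hypothetical, not the team's actual code:

```python
# The single-model training unit: produce OOF (CV) predictions on the
# training set and fold-averaged predictions on the test set.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def train_single_model(model, X_trn, y_trn, X_tst, n_splits=5):
    """Inputs are numpy arrays; returns (cv_pred, tst_pred)."""
    cv_pred = np.zeros(len(y_trn))
    tst_pred = np.zeros(len(X_tst))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for trn_idx, val_idx in skf.split(X_trn, y_trn):
        model.fit(X_trn[trn_idx], y_trn[trn_idx])
        cv_pred[val_idx] = model.predict_proba(X_trn[val_idx])[:, 1]
        tst_pred += model.predict_proba(X_tst)[:, 1] / n_splits
    return cv_pred, tst_pred
```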

Slide 34

ENSEMBLE MODEL TRAINING

Slide 35

Ensemble Model Training

Slide 36

ENSEMBLE FRAMEWORK

Slide 37

ENSEMBLE FRAMEWORK

Slide 38

CV VS. LB SCORES
LB AUC = 1.03 x CV AUC - 0.03
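
A relation like this can be estimated by a least-squares fit over past submissions' (CV AUC, LB AUC) pairs; a sketch with made-up numbers:

```python
# Least-squares fit of LB AUC as a linear function of CV AUC.
# The score pairs below are fabricated for illustration only.
import numpy as np

cv_auc = np.array([0.9051, 0.9060, 0.9067, 0.9072, 0.9077])
lb_auc = np.array([0.9064, 0.9073, 0.9078, 0.9084, 0.9088])

slope, intercept = np.polyfit(cv_auc, lb_auc, 1)
print("LB AUC = %.2f x CV AUC + %.2f" % (slope, intercept))
```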

Slide 39

IMPROVEMENTS BY ENSEMBLE

Model | 5-Fold CV | Public Leaderboard
Single Best | 0.906721 | 0.907765
Stage-I Best | 0.907688 | 0.908796
Stage-II Best | 0.907968 | N/A
Stage-III Best | 0.908194 | 0.909181

• From the single best to the Stage-III ensemble best is a 0.0014 improvement!

Slide 40

SUMMARY
The following helped tremendously in winning KDD Cup 2015:
• Good team dynamics and collaboration.
• Hand-crafted features.
• A multi-stage ensemble.

Slide 41

No content

Slide 42

Thank you!

Slide 43

[email protected] @jeongyoonlee linkedin.com/in/jeongyoonlee kaggle.com/jeongyoonlee github.com/jeongyoonlee/