Jeong-Yoon Lee - Winning Data Science Competitions - Data Science Meetup - Oct 2015

Data Science LA
October 28, 2015

Transcript

  1. WINNING DATA SCIENCE
    COMPETITIONS
    Jeong-Yoon Lee
    @Conversion Logic

  2. DATA SCIENCE
    COMPETITIONS

  3. DATA SCIENCE COMPETITIONS
    Since 1997
    2006 - 2009
    Since 2010

  4. KAGGLE COMPETITIONS
    • 227 competitions since 2010
    • 397,870 competitors
    • $3MM+ in prizes paid out

  5. KAGGLE COMPETITIONS

  6. A Ph.D. or CS degree is NOT required to win!

  7. KAGGLER IN TOWN
    • Sang Su Lee - @Retention Science
    • Hang Li - @Hulu
    • Feng Qi - @Quora (x-Hulu)

  8. COMPETITION STRUCTURE
    [Diagram: Training Data (Feature and Label provided); Test Data (Feature provided, Label withheld); Submission scored as Public LB Score and Private LB Score]

  9. BEST PRACTICES

  10. BEST PRACTICES
    • Feature Engineering
    • Machine Learning
    • Cross Validation
    • Ensemble

  11. FEATURE ENGINEERING
    • Numerical - Log, Log(1 + x), Normalization, Binarization
    • Categorical - One-hot-encode, TF-IDF (text), Weight-of-Evidence
    • Timeseries - Stats, FFT, MFCC (audio), ERP (EEG)
    • Numerical/Timeseries to Categorical - RF/GBM* (tree leaf indices used as categorical features)
    * http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
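A minimal sketch of a few of the transforms listed above, written in plain Python so that it runs without extra libraries. The helper names (`log1p_transform`, `min_max_normalize`, `binarize`, `one_hot_encode`) and the sample values are illustrative assumptions, not code from the talk.

```python
import math

def log1p_transform(values):
    """Log(1 + x): compresses long right tails and is defined at x = 0."""
    return [math.log1p(v) for v in values]

def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant column maps to all 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def binarize(values, threshold=0.0):
    """1 if the value exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

def one_hot_encode(categories):
    """Return (sorted vocabulary, one 0/1 row per input value)."""
    vocab = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocab)}
    rows = []
    for c in categories:
        row = [0] * len(vocab)
        row[index[c]] = 1
        rows.append(row)
    return vocab, rows

counts = [0, 1, 9, 99]                     # hypothetical click counts
browsers = ["chrome", "safari", "chrome"]  # hypothetical categorical column

logged = log1p_transform(counts)
scaled = min_max_normalize(counts)
flags = binarize(counts)
vocab, encoded = one_hot_encode(browsers)
```

In practice a library such as Scikit-Learn would usually handle these steps; the plain functions are only meant to show what each transform does to a column.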

  12. MACHINE LEARNING
    Algorithm | Tool | Note
    Gradient Boosting Machine | XGBoost | The best out-of-the-box solution
    Random Forests | Scikit-Learn, randomForest |
    Extra Trees | Scikit-Learn |
    Regularized Greedy Forest | Tong Zhang’s |
    Neural Networks | Keras, Lasagne, MXNet | Blends well with GBM. Best at image recognition competitions.
    Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble.
    Support Vector Machine | Scikit-Learn |
    FTRL | Vowpal Wabbit, tinrtgu’s | Competitive solution for CTR estimation competitions
    Factorization Machine | libFM | Winning solution for KDD Cup 2012
    Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions

  13. CROSS VALIDATION
    Training data are split into five folds where the sample size and
    dropout rate are preserved (stratified).
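One way to build such folds, as a sketch: deal the shuffled rows of each class round-robin across the folds. The function name `stratified_kfold`, the seed, and the sample labels are assumptions made for illustration; the talk does not say how its folds were generated.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_folds=5, seed=42):
    """Assign each sample a fold id so that every fold keeps roughly
    the same size and the same positive (e.g. dropout) rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_of = [None] * len(labels)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of

# Hypothetical labels: 1 = dropout, 0 = retained (80% dropout rate).
labels = [1] * 80 + [0] * 20
fold_of = stratified_kfold(labels)

for fold in range(5):
    members = [i for i, f in enumerate(fold_of) if f == fold]
    rate = sum(labels[i] for i in members) / len(members)
    print(fold, len(members), rate)
```

Scikit-Learn's `StratifiedKFold` offers the same kind of split; the hand-written version only makes the "preserve size and rate" idea explicit.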

  14. ENSEMBLE - STACKING
    * for other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
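Stacking, in its simplest form, trains a second-stage model on the out-of-fold predictions of the first-stage models. The sketch below assumes those predictions already exist and uses a small hand-written logistic regression as the second-stage learner; all names and numbers are hypothetical, and the stacking diagram on the slide may differ in detail.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Batch gradient descent for logistic regression (bias + weights)."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights

def predict(model, x):
    bias, weights = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Hypothetical out-of-fold predictions from two base models (columns)
# for six training rows, together with the true labels.
oof_preds = [[0.9, 0.7], [0.8, 0.9], [0.7, 0.6],
             [0.3, 0.4], [0.2, 0.1], [0.4, 0.2]]
labels = [1, 1, 1, 0, 0, 0]

# Stage 2: the meta-model learns how to combine the base models.
meta_model = fit_logistic(oof_preds, labels)

# Hypothetical base-model predictions for two test rows.
test_preds = [[0.85, 0.8], [0.25, 0.3]]
stacked = [predict(meta_model, x) for x in test_preds]
```

Using out-of-fold rather than in-sample predictions as the second-stage inputs matters: it keeps the meta-model from learning to trust base models that merely memorized their training rows.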

  15. KDD CUP 2015
    WINNING SOLUTION*
    InterContinental Ensemble
    * originally presented by Kohei Ozaki, Mert Bay, and Tam T. Nguyen at KDD 2015

  16. KDD CUP 2015
    • Goal: predict dropouts in MOOCs.
    • Student activity logs and metadata are provided.
    • 821 teams.
    • $20K total prize.

  17. KDD CUP 2015 - DATA
    • 39 courses, 27K objects, 112K students.
    • 200K enrollments, 13MM activities.

  18. INTERCONTINENTAL ENSEMBLE
    • Jeong-Yoon Lee, Mert Bay - Conversion Logic
    • Song Chen - AIG
    • Andreas Toescher, Michael Jahrer - Opera Solutions
    • Peng Yang, Xiacong Zhou - NetEase, Tsinghua University
    • Kohei Ozaki - AIG Japan
    • Tam T. Nguyen - I2R A*STAR

  19. COLLABORATION

  20. STORY ABOUT LAST
    28 HOURS

  21. Story about last 28 hours (1 of 3)
    28 hours before the deadline:
    Ensemble framework and feature engineering worked great.
    But we were still in 3rd place.

  22. Story about last 28 hours (2 of 3)
    28 hours before the deadline:
    We continued working on feature engineering, and single
    models with a new feature made a great improvement.

  23. Story about last 28 hours (3 of 3)
    27 hours before the deadline:
    Ensemble models were trained with the new single models and
    we jumped up to 1st place!

  24. FEATURE ENGINEERING

  25. Sequential Data Cube
    time
    • hour
    • day
    • week
    • month
    event
    • navigate
    • access
    • problem
    • page close
    • video
    • discussion
    • wiki
    object
    • user
    • course
    • source
    • user:course
    • …
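The cube suggests generating features by aggregating the activity log along combinations of the time, event, and object dimensions. A rough sketch of that idea follows; the log layout, the key format, and the function name `cube_counts` are assumptions for illustration, not the team's actual feature code.

```python
from collections import Counter
from datetime import datetime

# Hypothetical activity log rows: (enrollment_id, ISO timestamp, event).
log = [
    (1, "2014-06-01T09:15:00", "video"),
    (1, "2014-06-01T09:40:00", "problem"),
    (1, "2014-06-09T20:05:00", "video"),
    (2, "2014-06-02T11:00:00", "navigate"),
]

def cube_counts(rows):
    """Count activities in cells of an (object, time, event) cube."""
    counts = Counter()
    for enrollment_id, timestamp, event in rows:
        ts = datetime.fromisoformat(timestamp)
        week = ts.isocalendar()[1]  # ISO week number
        counts[(enrollment_id, "all", "all")] += 1             # total activity
        counts[(enrollment_id, "all", event)] += 1             # slice by event
        counts[(enrollment_id, f"week{week}", event)] += 1     # week x event
        counts[(enrollment_id, f"hour{ts.hour}", "all")] += 1  # hour of day
    return counts

features = cube_counts(log)
print(features[(1, "all", "video")])
```

Each key is one cell of the cube, so adding a new slice (for example by course or by source) only needs one more `counts[...] += 1` line.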

  26. Data Slicing and Dicing

  27. SINGLE MODEL TRAINING

  28. ALGORITHMS
    Algorithm | # of Single Models
    Gradient Boosting Machine | 26
    Neural Network | 14
    Factorization Machine | 12
    Logistic Regression | 6
    Kernel Ridge Regression | 2
    Extra Trees | 2
    Random Forest | 2
    K-Nearest Neighbor | 1
    • A total of 64 single models were used in the final solution.

  29. Single Model Training
    [Diagram labels: Training Data, Test Data; Feature Selection; Transformed Training Data, CV Transformed Data, Transformed Test Data; Single Model Training; CV Prediction, Test Prediction]
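The flow in this diagram can be read as: hold out each fold in turn to get a CV prediction for every training row, then refit on all training rows to predict the test rows. The sketch below follows that reading; it omits the feature selection step, and the toy rate model, function names, and data are hypothetical stand-ins.

```python
def cv_and_test_predictions(train_x, train_y, test_x, fold_of, fit, predict):
    """Return out-of-fold predictions for the training rows and test
    predictions from a model refit on all training rows."""
    cv_pred = [None] * len(train_x)
    for fold in sorted(set(fold_of)):
        fit_idx = [i for i, f in enumerate(fold_of) if f != fold]
        model = fit([train_x[i] for i in fit_idx],
                    [train_y[i] for i in fit_idx])
        for i, f in enumerate(fold_of):
            if f == fold:
                cv_pred[i] = predict(model, train_x[i])  # row i was held out
    full_model = fit(train_x, train_y)
    test_pred = [predict(full_model, x) for x in test_x]
    return cv_pred, test_pred

# Toy stand-in for a real learner: the mean label per category value,
# falling back to the global mean for unseen values.
def fit_rate_model(xs, ys):
    totals, counts = {}, {}
    for x, y in zip(xs, ys):
        totals[x] = totals.get(x, 0) + y
        counts[x] = counts.get(x, 0) + 1
    rates = {x: totals[x] / counts[x] for x in totals}
    return rates, sum(ys) / len(ys)

def predict_rate_model(model, x):
    rates, global_rate = model
    return rates.get(x, global_rate)

train_x = ["a", "a", "b", "b", "a", "b"]  # hypothetical feature
train_y = [1, 1, 0, 0, 1, 0]              # hypothetical labels
fold_of = [0, 1, 2, 0, 1, 2]              # hypothetical fold ids
test_x = ["a", "b", "c"]

cv_pred, test_pred = cv_and_test_predictions(
    train_x, train_y, test_x, fold_of, fit_rate_model, predict_rate_model)
```

Saving both outputs per single model is what makes a later ensemble stage possible: the CV predictions become its training inputs and the test predictions become its test inputs.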

  30. ENSEMBLE MODEL TRAINING

  31. Ensemble Model Training

  32. ENSEMBLE FRAMEWORK

  33. ENSEMBLE FRAMEWORK

  34. CV VS. LB SCORES
    LB AUC = 1.03 x CV AUC - 0.03
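A line of this form can be obtained with an ordinary least-squares fit over paired CV and LB scores. The sketch below fits one on the three CV/LB pairs reported later in this deck, so its coefficients will not match the slide's, which was presumably fit on a different set of models (an assumption). The function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# (5-fold CV AUC, public LB AUC) pairs reported later in this deck.
cv_auc = [0.906721, 0.907688, 0.908194]
lb_auc = [0.907765, 0.908796, 0.909181]

slope, intercept = fit_line(cv_auc, lb_auc)
print(f"LB AUC = {slope:.2f} x CV AUC + {intercept:.2f}")
```

A slope close to 1 with small residuals is the useful signal here: it indicates that improvements measured in cross validation tend to carry over to the leaderboard.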

  35. IMPROVEMENTS BY ENSEMBLE
    Model | 5-Fold CV | Public Leaderboard
    Single Best | 0.906721 | 0.907765
    Stage-I Best | 0.907688 | 0.908796
    Stage-II Best | 0.907968 | N/A
    Stage-III Best | 0.908194 | 0.909181
    • From the single best model to the Stage-III ensemble best, the score improves by about 0.0014!
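The gain can be checked directly from the scores on this slide. This short snippet recomputes it for both columns; the dictionary layout is just one convenient way to hold the numbers.

```python
# Scores copied from the slide; None marks the unreported LB score.
scores = {
    "Single Best":    {"cv": 0.906721, "lb": 0.907765},
    "Stage-I Best":   {"cv": 0.907688, "lb": 0.908796},
    "Stage-II Best":  {"cv": 0.907968, "lb": None},
    "Stage-III Best": {"cv": 0.908194, "lb": 0.909181},
}

cv_gain = scores["Stage-III Best"]["cv"] - scores["Single Best"]["cv"]
lb_gain = scores["Stage-III Best"]["lb"] - scores["Single Best"]["lb"]

print(f"CV gain: {cv_gain:.6f}")
print(f"LB gain: {lb_gain:.6f}")
```

Both differences come out between 0.0014 and 0.0015 AUC, in line with the figure quoted on the slide.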

  36. SUMMARY
    • The following items helped tremendously in winning KDD Cup 2015:
      • Good team dynamics and collaboration.
      • Hand-crafted features.
      • Multi-stage ensemble.

  37. [email protected]
    @jeongyoonlee
    linkedin.com/in/jeongyoonlee
    kaggle.com/jeongyoonlee
    github.com/jeongyoonlee/
