Jeong-Yoon Lee - Winning Data Science Competitions - Data Science Meetup - Oct 2015

Data Science LA
October 28, 2015

Transcript

  1. WINNING DATA SCIENCE COMPETITIONS
    Jeong-Yoon Lee
    @Conversion Logic

  3. DATA SCIENCE COMPETITIONS

  4. DATA SCIENCE COMPETITIONS
    • KDD Cup: since 1997
    • Netflix Prize: 2006 - 2009
    • Kaggle: since 2010

  5. KAGGLE COMPETITIONS
    • 227 competitions since 2010
    • 397,870 competitors
    • $3MM+ in prizes paid out

  6. KAGGLE COMPETITIONS

  7. A Ph.D. or CS degree is NOT required to win!

  8. KAGGLERS IN TOWN
    • Sang Su Lee @Retention Science
    • Hang Li @Hulu
    • Feng Qi @Quora (ex-Hulu)

  9. COMPETITION STRUCTURE
    • Training Data: features and labels are provided
    • Test Data: features are provided; predicted labels are the submission
    • Submission: scored on the Public LB and on the Private LB

  10. BEST PRACTICES

  11. BEST PRACTICES
    • Feature Engineering
    • Machine Learning
    • Cross Validation
    • Ensemble

  12. FEATURE ENGINEERING
    • Numerical - Log, Log(1 + x), Normalization, Binarization
    • Categorical - One-hot-encode, TF-IDF (text), Weight-of-Evidence
    • Timeseries - Stats, FFT, MFCC (audio), ERP (EEG)
    • Numerical/Timeseries to Categorical - RF/GBM*
    * http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
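
A minimal sketch of a few of the transformations listed above, using NumPy and pandas. The toy data and the column names (`spend`, `channel`) are invented for illustration and do not come from the talk.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame({
    "spend": [0.0, 9.0, 99.0, 999.0],
    "channel": ["web", "app", "web", "email"],
})

# Numerical: log(1 + x) compresses a long right tail and is defined at x = 0.
df["spend_log1p"] = np.log1p(df["spend"])

# Numerical: normalization to zero mean and unit variance.
df["spend_norm"] = (df["spend"] - df["spend"].mean()) / df["spend"].std(ddof=0)

# Numerical: binarization against a threshold (here, any spend at all).
df["spend_any"] = (df["spend"] > 0).astype(int)

# Categorical: one-hot encoding, one indicator column per category.
one_hot = pd.get_dummies(df["channel"], prefix="channel")
features = pd.concat([df.drop(columns="channel"), one_hot], axis=1)
```

Tree-based models are largely insensitive to monotone transforms such as the log, so these numerical transforms tend to matter most for linear models and neural networks.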

  13. MACHINE LEARNING
    Algorithm | Tool | Note
    Gradient Boosting Machine | XGBoost | The best out-of-the-box solution
    Random Forests | Scikit-Learn, randomForest |
    Extra Trees | Scikit-Learn |
    Regularized Greedy Forest | Tong Zhang’s |
    Neural Networks | Keras, Lasagne, MXNet | Blends well with GBM. Best at image recognition competitions.
    Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble.
    Support Vector Machine | Scikit-Learn |
    FTRL | Vowpal Wabbit, tinrtgu’s | Competitive solution for CTR estimation competitions
    Factorization Machine | libFM | Winning solution for KDD Cup 2012
    Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions

  14. CROSS VALIDATION
    Training data are split into five folds where the sample size and
    dropout rate are preserved (stratified).
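
A stratified five-fold split of this kind can be sketched with scikit-learn's `StratifiedKFold`. The synthetic data below is an assumption for illustration: `y` stands in for a binary dropout label with roughly 80% positives, which is not a figure from the talk.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data; sizes and the positive rate are made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.8).astype(int)

# Stratification keeps the label rate nearly identical in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes, fold_rates = [], []
for train_idx, valid_idx in skf.split(X, y):
    fold_sizes.append(len(valid_idx))
    fold_rates.append(y[valid_idx].mean())

print(fold_sizes)
print([round(rate, 3) for rate in fold_rates])
```

Fixing `random_state` keeps the folds identical across runs, which is useful when several models need to be compared or combined on the same split.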

  15. ENSEMBLE - STACKING
    * for other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
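
As a rough illustration of stacking, the sketch below uses scikit-learn's `StackingClassifier`, which fits the base models with internal cross-validation and trains a second-stage model on their out-of-fold predictions. This is a convenience wrapper swapped in for illustration, not the framework used in the talk; the data set, the choice of base models, and all parameter values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stage 1: base models whose predictions become second-stage features.
base_models = [
    ("gbm", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Stage 2: a simple model trained on the 5-fold out-of-fold probabilities.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)
stack_auc = roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1])
print(f"stacked AUC: {stack_auc:.4f}")
```

Training the second stage on out-of-fold predictions, rather than on predictions for rows the base models have already seen, is what keeps the stacker from simply rewarding the most overfit base model.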

  16. KDD CUP 2015 WINNING SOLUTION*
    InterContinental Ensemble
    * originally presented by Kohei Ozaki, Mert Bay, and Tam T. Nguyen at KDD 2015

  17. KDD CUP 2015
    • Task: predict dropouts in MOOCs.
    • Student activity logs and metadata are provided.
    • 821 teams.
    • $20K total prize.

  18. KDD CUP 2015 - DATA
    • 39 courses, 27K objects, 112K students.
    • 200K enrollments, 13MM activities.

  20. INTERCONTINENTAL ENSEMBLE
    • Jeong-Yoon Lee, Mert Bay (Conversion Logic)
    • Song Chen (AIG)
    • Andreas Toescher, Michael Jahrer (Opera Solutions)
    • Peng Yang (NetEase)
    • Xiacong Zhou (Tsinghua University)
    • Kohei Ozaki (AIG Japan)
    • Tam T. Nguyen (I2R A*STAR)

  21. COLLABORATION

  22. STORY ABOUT LAST 28 HOURS

  23. Story about last 28 hours (1 of 3)
    28 hours before the deadline:
    The ensemble framework and feature engineering worked great,
    but we were still in 3rd place.

  24. Story about last 28 hours (2 of 3)
    28 hours before the deadline:
    We continued working on feature engineering, and single
    models with a new feature made a great improvement.

  25. Story about last 28 hours (3 of 3)
    27 hours before the deadline:
    Ensemble models were trained with the new single models, and
    we jumped up to 1st place!

  26. FEATURE ENGINEERING

  27. Sequential Data Cube
    time
    • hour
    • day
    • week
    • month
    event
    • navigate
    • access
    • problem
    • page close
    • video
    • discussion
    • wiki
    object
    • user
    • course
    • source
    • user:course
    • …
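
One way to read the cube above is as a set of group-by aggregations over an activity log, where each combination of time, event, and object dimensions yields a family of features. The pandas sketch below is an illustration under assumptions: the toy log, its column names, and the chosen aggregations are hypothetical and are not the team's feature code.

```python
import pandas as pd

# Toy activity log; column names and values are hypothetical.
log = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2", "u2"],
    "course": ["c1", "c1", "c1", "c1", "c2"],
    "event": ["video", "problem", "video", "access", "video"],
    "time": pd.to_datetime([
        "2014-06-01 09:15", "2014-06-01 09:40", "2014-06-08 20:05",
        "2014-06-02 11:00", "2014-06-03 11:30",
    ]),
})

# Time dimension: derive coarser buckets from the timestamp.
log["day"] = log["time"].dt.floor("D")
log["week"] = log["time"].dt.to_period("W").dt.start_time

# Slice along object (user:course) x event: one count column per event type.
event_counts = (
    log.groupby(["user", "course", "event"]).size()
       .unstack("event", fill_value=0)
       .add_prefix("n_")
)

# Slice along object x time: number of distinct active days and weeks.
activity_span = log.groupby(["user", "course"]).agg(
    active_days=("day", "nunique"),
    active_weeks=("week", "nunique"),
)

# One feature row per user:course enrollment.
features = event_counts.join(activity_span)
print(features)
```

Swapping the grouping keys (for example, `course` alone, or `user` alone) or the time bucket gives further slices of the same log without changing the pattern.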

  28. Data Slicing and Dicing

  31. SINGLE MODEL TRAINING

  32. ALGORITHMS
    Algorithm | # of Single Models
    Gradient Boosting Machine | 26
    Neural Network | 14
    Factorization Machine | 12
    Logistic Regression | 6
    Kernel Ridge Regression | 2
    Extra Trees | 2
    Random Forest | 2
    K-Nearest Neighbor | 1
    • A total of 64 single models were used in the final solution.

  33. Single Model Training
    • Inputs: Training Data, Test Data
    • Feature Selection: Transformed Training Data, CV Transformed Data, Transformed Test Data
    • Single Model Training: CV Prediction, Test Prediction
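
The flow above, in which each single model yields both a CV prediction and a test prediction, can be sketched as follows. This is a sketch under assumptions rather than the team's pipeline: the helper `cv_and_test_predictions` is hypothetical, the feature selector and model are arbitrary choices, and refitting on all training rows for the test prediction is one common option (averaging the fold models is another).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline


def cv_and_test_predictions(model_factory, X_train, y_train, X_test, n_splits=5, seed=0):
    """Hypothetical helper: out-of-fold train predictions plus test predictions."""
    cv_pred = np.zeros(len(X_train))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, holdout_idx in skf.split(X_train, y_train):
        # Each training row is predicted by a model that never saw it.
        model = model_factory().fit(X_train[fit_idx], y_train[fit_idx])
        cv_pred[holdout_idx] = model.predict_proba(X_train[holdout_idx])[:, 1]
    # Assumption: the test prediction comes from a refit on all training rows.
    final_model = model_factory().fit(X_train, y_train)
    test_pred = final_model.predict_proba(X_test)[:, 1]
    return cv_pred, test_pred


# Synthetic data, illustrative only.
X, y = make_classification(n_samples=400, n_features=20, n_informative=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=1)


def make_model():
    # Feature selection sits inside the pipeline so it is refit in every fold.
    return make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression(max_iter=1000))


cv_pred, test_pred = cv_and_test_predictions(make_model, X_tr, y_tr, X_te)
cv_auc = roc_auc_score(y_tr, cv_pred)
print(f"CV AUC: {cv_auc:.4f}")
```

Saving `cv_pred` and `test_pred` for every single model, always with the same folds, gives a later ensemble stage matching training and test inputs.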

  34. ENSEMBLE MODEL TRAINING

  35. Ensemble Model Training

  36. ENSEMBLE FRAMEWORK

  37. ENSEMBLE FRAMEWORK

  38. CV VS. LB SCORES
    LB AUC = 1.03 x CV AUC - 0.03

  39. IMPROVEMENTS BY ENSEMBLE
    Model | 5-Fold CV | Public Leaderboard
    Single Best | 0.906721 | 0.907765
    Stage-I Best | 0.907688 | 0.908796
    Stage-II Best | 0.907968 | N/A
    Stage-III Best | 0.908194 | 0.909181
    • From the single best model to the best Stage-III ensemble, the score improves by about 0.0014!
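
The size of the improvement can be checked directly from the scores listed above. A small sketch; the dictionary layout and variable names are illustrative.

```python
# (5-fold CV score, public leaderboard score) as listed on the slide.
scores = {
    "Single Best": (0.906721, 0.907765),
    "Stage-I Best": (0.907688, 0.908796),
    "Stage-II Best": (0.907968, None),  # no leaderboard score listed
    "Stage-III Best": (0.908194, 0.909181),
}

base_cv, base_lb = scores["Single Best"]
gains = {}
for name, (cv, lb) in scores.items():
    cv_gain = round(cv - base_cv, 6)
    lb_gain = None if lb is None else round(lb - base_lb, 6)
    gains[name] = (cv_gain, lb_gain)

for name, (cv_gain, lb_gain) in gains.items():
    print(name, cv_gain, lb_gain)
```

For the Stage-III ensemble this gives a gain of 0.001473 on cross-validation and 0.001416 on the public leaderboard.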

  40. SUMMARY
    The following items helped tremendously in winning KDD Cup 2015:
    • Good team dynamics and collaboration.
    • Hand-crafted features.
    • Multi-stage ensemble.

  42. Thank you!

  43. [email protected]
    @jeongyoonlee
    linkedin.com/in/jeongyoonlee
    kaggle.com/jeongyoonlee
    github.com/jeongyoonlee/
