Techniques (Tricks) for Data Mining Competitions

@smly
October 16, 2015


Transcript

  1. 2.

    Kohei Ozaki, Kaggle Enthusiast. Work Experience: •  Insurance Fraud Detection

    •  Predictive Modeling for Online Advertising •  Recommendation System for SNS •  etc. (screenshot at https://www.kaggle.com/confirm) 2
  2. 3.

    Agenda Data Mining Competitions Techniques (Tricks) for competitions Learning from

    Winning Solutions Trend on Kaggle 3 4 - 12 13 - 28 29 - 72 73 - 78 
  3. 4.

    Agenda Data Mining Competitions Techniques (Tricks) for competitions Learning from

    Winning Solutions Trend on Kaggle 4 4 - 12 13 - 28 29 - 72 73 - 78 
  4. 5.

    Data Mining Competitions Participants compete on the scores of their predictive models.

    A competition normally runs for 2 or 3 months. There are many kinds of tasks/datasets from the real world: insurance, credit scoring, loan default, medical, EEG, MEG, image classification, healthcare, high-energy physics, social good, marketing, advertising, trajectory, telematics, etc… 5
  5. 8.

    Step3: Check Your Rank After you make the submission, your

    models are evaluated immediately and ranked on the Public Leaderboard. 8
  6. 9.

    Huge Amount of Prize Pool Netflix Prize 2009 ($1M) Recommend

    movies. Heritage Health Prize 2011 ($3M): Predict days in hospital. GE Flight Quest Challenge: Part 1: Predict gate/arrival time, 2012 ($250k); Part 2: Optimize flight plan, 2014 ($220k). ↑ Predictive Modeling World / Yet Another World ↓ DARPA Grand Challenge ($2M): Autonomous vehicle. Google Lunar XPRIZE ($30M): Autonomous robotic spacecraft. 9
  7. 10.

    Who Hosts Competitions? Kaggle is a platform for data prediction

    competitions (a crowdsourcing community of 360k+ data scientists). In addition to prize money, many data scientists use Kaggle to learn and collaborate with experts. 10
  8. 11.

    Great Place to Try out Your Ideas (1/2) Many researchers/developers

    also use Kaggle. (XGBoost, LibFM, LibFFM, Lasagne, Keras, cxxnet, etc…) "Our original motivation for entering the contest was to try out our new tree ensemble regularized greedy forest (RGF) in a competitive setting." — Rie Johnson (RJ Research Consulting), a prize winner in the Heritage Health Prize (quote from http://www.heritagecaliforniaaco.com/?p=hpn-today&article=45)
  9. 12.

    Great Place to Try out Your Ideas (2/2) Many researchers/developers

    also use Kaggle. (XGBoost, LibFM, LibFFM, Lasagne, Keras, cxxnet, etc…) "My intention of participating in this competition is to evaluate the performance of recurrent convolutional neural network (RCNN) in processing time series data." — Ming Liang (Tsinghua University), a prize winner in Grasp-and-Lift EEG Detection (quote from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/forums/t/16617/team-daheimao-solution)
  10. 13.

    Agenda Data Mining Competitions Techniques (Tricks) for competitions Learning from

    Winning Solutions Trend on Kaggle 13 4 - 12 13 - 28 29 - 72 73 - 78 
  11. 14.

    Two Main Factors The quality of the individual model &

    the ensemble idea. No sophisticated individual models, no victory. Both the individual models and the ensemble idea are key. 14 Individual Model / Ensemble Model
  12. 15.

    Hyper Parameter Tuning & Feature Engineering Read Owen Zhang’s slide

    (textbook) carefully :-) http://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions 15
  13. 16.

    Greedy Forward Selection (GFS) Greedy Forward Selection (GFS) is simple

    and works well for feature selection and for model selection in an ensemble.
    1: Initialize the feature set F_k = ∅ at k = 0.
    2: Iterate: k = k + 1.
    3: Find the best feature j ∉ F_{k-1} to add, i.e. the one with the most significant cost reduction.
    4: Set F_k = F_{k-1} ∪ {j}.
    16
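The loop above can be sketched in Python. This is a minimal sketch: `evaluate` is a hypothetical callback that scores a feature set (e.g. returning 1 − CV AUC, so that lower is better), not anything from the talk.

```python
def greedy_forward_selection(candidates, evaluate, tol=0.0):
    """Greedy Forward Selection over features (or models, for ensembles).

    candidates: list of feature names
    evaluate:   function(feature_set) -> cost, lower is better
    """
    selected = set()
    best_cost = evaluate(selected)            # F_0 = empty set
    while True:
        # Score every candidate not yet selected.
        trials = [(evaluate(selected | {j}), j)
                  for j in candidates if j not in selected]
        if not trials:
            break
        cost, j = min(trials)                 # best feature to add
        if best_cost - cost <= tol:           # no significant reduction
            break
        selected.add(j)                       # F_k = F_{k-1} + {j}
        best_cost = cost
    return selected
```

The same loop works for ensemble selection by treating each model's predictions as a "feature".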
  14. 17.

    GBDT: RGF-L2 and XGBoost have L2-Regularization for Leaf Coefficient L2

    regularization works great on noisy datasets & ensemble models.
    A regression tree (CART) maps each input to a leaf weight (e.g. +8.0 or −0.2), and the prediction is the sum over K trees:
    ŷ_i = f_1(x_i) + f_2(x_i) + · · · + f_K(x_i), with parameters Θ = {f_1, f_2, · · · , f_K}.
    Objective: Obj(Θ) = L(Θ) + Ω(Θ), where L(Θ) is the loss term and Ω(Θ) is the regularization term (heuristics including L0 (# of leaves) and L2 on the leaf coefficients).
    17
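To see why the L2 term helps: in XGBoost's second-order derivation (from the XGBoost paper, not shown on the slide), the optimal weight of a leaf with gradient sum G and hessian sum H is w* = −G / (H + λ), so a larger λ shrinks leaf outputs toward zero on noisy data. A tiny sketch:

```python
def optimal_leaf_weight(grads, hessians, lam):
    """XGBoost-style optimal leaf weight: w* = -G / (H + lam).

    G = sum of gradients, H = sum of hessians over the examples
    falling in the leaf; for squared-error loss g_i = y_hat_i - y_i
    and h_i = 1.  A larger lam (L2 on the leaf coefficient)
    shrinks the leaf output toward 0.
    """
    return -sum(grads) / (sum(hessians) + lam)
```

With one example of residual −4 in the leaf, λ = 0 gives weight 4.0 while λ = 1 gives 2.0: the regularized tree fits the residual less aggressively.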
  15. 19.

    Ensemble Techniques: Stacking (1/2) Stacking uses different methods’ predictions as

    “meta-features”. To obtain meta-features for training the ensemble model, split the training data into K folds: use K−1 parts for training and the held-out part for making the meta-feature. 1-D meta-feature.
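The K-fold procedure above can be sketched as follows. A minimal sketch: `fit_predict` is a hypothetical callback that trains a base model on one part of the data and predicts on another.

```python
import numpy as np

def out_of_fold_meta_feature(fit_predict, X, y, k=5, seed=0):
    """Build a 1-D meta-feature: every row of the training set is
    predicted by a base model trained only on the other K-1 folds,
    so the meta-feature never sees its own target."""
    n = len(y)
    fold = np.random.RandomState(seed).permutation(n) % k
    meta = np.empty(n)
    for f in range(k):
        train, valid = fold != f, fold == f
        meta[valid] = fit_predict(X[train], y[train], X[valid])
    return meta
```

The resulting column is then used as an input feature for the second-stage (ensemble) model, alongside meta-features from the other base methods.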
  16. 21.

    Netflix Blending (Quiz Blending) [1] Andreas Töscher and Michael Jahrer,

    “The BigChaos Solution to the Netflix Grand Prize”. Assume the task is regression and the prediction is evaluated by RMSE. Given several models’ predictions, what can we do to improve our score? 21
  17. 24.

    Utilize Quiz feedback for blending 1/4 Our goal is to

    find the linear combination of predicted results that best predicts y (the target variable). Let X be the N-by-p matrix of predictions combined from p individual models, and let y be the unobserved vector of true target values. 24
  18. 25.

    Utilize Quiz feedback for blending 2/4 If y is known,

    then the best estimation by linear combination is β = (X^T X)^{-1} X^T y, where X is the N-by-p matrix of predictions combined from p individual models and y is the unobserved vector of true target values. 25
  19. 26.

    Utilize Quiz feedback for blending 3/4 If y

    is known, then the best estimation by linear combination is β = (X^T X)^{-1} X^T y.
    X^T X can be computed exactly from our own predictions.
    The j-th element of X^T y is x_j^T y = (x_j^T x_j + y^T y − ||x_j − y||^2) / 2: here ||x_j − y||^2 can be approximated by using the quiz feedback (N times the MSE of model j), and y^T y by the quiz feedback of an all-zero submission.
    26
  20. 27.

    Utilize Quiz feedback for blending 4/4 Our goal is to

    find the linear combination of predicted results that best predicts y (the target variable). Prediction by linear combination using quiz feedback: ŷ = X · β, where X is the N-by-p matrix of predictions combined from p individual models and β is the (p × 1) vector of weight parameters. 27
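The whole quiz-blending trick can be sketched with NumPy. This is a sketch under the stated assumptions: we hold quiz MSE feedback for each model plus the MSE of an all-zero submission, and never see y itself.

```python
import numpy as np

def blend_from_quiz_feedback(X, mse, mse_zero):
    """Estimate least-squares blending weights without ever seeing y.

    X        : (N, p) matrix of quiz-set predictions from p models
    mse      : length-p array, quiz MSE feedback for each model
    mse_zero : quiz MSE of an all-zero submission, so y.y = N * mse_zero
    """
    N = X.shape[0]
    yty = N * mse_zero
    # From ||x_j - y||^2 = N * mse_j:
    #   x_j . y = (x_j . x_j + y . y - N * mse_j) / 2
    Xty = 0.5 * ((X * X).sum(axis=0) + yty - N * np.asarray(mse))
    # beta = (X^T X)^{-1} X^T y; X^T X is computable exactly from X alone
    return np.linalg.solve(X.T @ X, Xty)
```

Because the quiz MSEs determine each x_j^T y exactly (up to rounding of the reported scores), the recovered β matches the ordinary least-squares fit against the hidden targets.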
  21. 28.

    OT: Amazon AWS for Modeling c4.8xlarge (36 CPU cores with

    60 GB RAM / $0.3 per hour) My bagging GBDT model for the KDD Cup takes 6 hrs (= $1.8). * The above price is for a spot instance in us-west-1c in Oct 2015; the price changes dynamically. ≈ 280 yen (= $2.3) 28
  22. 29.

    Agenda Data Mining Competitions Techniques (Tricks) for competitions Learning from

    Winning Solutions Trend on Kaggle 29 4 - 12 13 - 28 29 - 72 73 - 78 
  23. 30.

    Learn from Winning Solutions Today’s talk describes the following competitions:

    Competition Name | Description
    KDD Cup 2015 | Binary Classification, Access log
    GE Flight Quest 2 | Optimization
    Grasp-and-Lift EEG Detection | Multi-class Classification, BCI, EEG recordings
    30
  24. 31.

    About KDD Cup 2015 The annual and most prestigious competition in

    data mining. 821 teams joined. Task: predict the probability that a student will drop out of a course within 10 days. The dataset is provided by XuetangX, one of the largest MOOC platforms in China. (Fig.: # of access records over dates, labeled drop-out or not.) 31
  25. 33.

    Dataset (1 of 3) Pair of <username, course_id> for each

    enrollment_id. (1) Enrollment data (2) Access logs (3) Object attributes 33
  26. 34.

    Dataset (2 of 3) Application logs. Source, Event and Object

    ID are provided. (1) Enrollment data (2) Access logs (3) Object attributes 34
  27. 35.

    Dataset (3 of 3) Detailed information of Object ID. (1)

    Enrollment data (2) Access logs (3) Object attributes 35
  28. 36.

    Analyze User Activities Users who don’t access the course many

    times tend to drop out of the course. (Histogram: # of access logs for each enrollment_id vs. # of enrollment_ids.) 36
  29. 38.

    Initial Analysis Base Features User activity and last access make

    a big impact on the AUC score.
    Features | Model | 5-Fold CV (AUC)
    One Hot Encoding (course_id) | GBDT | 0.6118
    + num_records (User Activity) | GBDT | 0.8485
    + num_unique_object | GBDT | 0.8507
    + num_unique_active_days | GBDT | 0.8595
    + num_unique_active_hours | GBDT | 0.8601
    + num_unique_problem_event | GBDT | 0.8621
    + first and last timestamp (Last Access) | GBDT | 0.8821
    38
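Counting features like those in the table can be computed with one pass over the access logs. A sketch: the `(enrollment_id, timestamp, object_id)` row layout is an assumption for illustration, not the actual competition schema.

```python
from collections import defaultdict

def base_activity_features(log_rows):
    """log_rows: iterable of (enrollment_id, timestamp, object_id).
    Returns per-enrollment counts in the spirit of the table above."""
    acc = defaultdict(lambda: {"num_records": 0, "objects": set(),
                               "first": None, "last": None})
    for eid, ts, obj in log_rows:
        f = acc[eid]
        f["num_records"] += 1
        f["objects"].add(obj)                 # for num_unique_object
        f["first"] = ts if f["first"] is None else min(f["first"], ts)
        f["last"] = ts if f["last"] is None else max(f["last"], ts)
    return {eid: {"num_records": f["num_records"],
                  "num_unique_object": len(f["objects"]),
                  "first_ts": f["first"], "last_ts": f["last"]}
            for eid, f in acc.items()}
```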
  30. 39.

    Feature Engineering (MC) Multiple Courses Features Concept: some users enrolled

    in multiple courses. (Fig.: # of access records over dates, by course.)
    Features | Model | 5-Fold CV (AUC)
    Base | GBDT | 0.8821
    + (MC) first and last timestamp for each user | GBDT | 0.8936
    + (MC) num_unique_active_days for each user | GBDT | 0.8946
    + (MC) num_enrollment_courses for each user | GBDT | 0.8953
    39
  31. 40.

    Feature Engineering (EP) Evaluation Period Features (a bit leaky) Concept:

    the activities after the end date of the course.
    Features | Model | 5-Fold CV (AUC)
    Base + MC | GBDT | 0.8953
    Base + MC + EP | GBDT | 0.9027
    (Fig.: # of access records over dates, by course.) 40
  32. 41.

    Feature Engineering (PXJ) Features from Teammates (Peng, Xiaocong and Jeong):

    •  Max absent days
    •  Min days from first visit to next course begin
    •  Min days from 10 days after last visit to next course begin
    •  Min days from last visit to next course end
    •  Min days from next course to last visit
    •  Min days from 10 days after course end to next course begin
    •  Min days from 10 days after course end to next course end
    •  Min days from course end to next visit
    •  Active days from last visit to course end
    •  Active days in 10 days from course end
    •  Average hour per day
    •  Course drop rate
    •  Time span
    Features | Model | 5-Fold CV (AUC)
    Base + MC + EP | GBDT | 0.9027
    Base + MC + EP + PXJ | GBDT | 0.9052
    41
  33. 43.

    Feature Engineering (LD) Label Dependent Features (a bit leaky) Count

    the number of dropped-out courses for each day of the evaluation period by using the target variables in the training set.
    Features | Model | 5-Fold CV (AUC)
    Base + MC + EP + PX1 + PX2 | GBDT | 0.9052
    Base + MC + EP + PX1 + PX2 + LD | GBDT | 0.9062
    Base + MC + EP + PX1 + PX2 + LD | Bagging GBDT | 0.9067
    43
  34. 45.

    Feature Engineering (TAM) Sliding window & various aggregations + GFS

    (Tam’s work). Use a sliding window to generate many features automatically.
    Features | Model | 5-Fold CV (AUC)
    Base + MC + EP + PXJ + LD | GBDT | 0.9062
    Base + MC + EP + PXJ + LD + TAM | GBDT | 0.9067
    (Fig.: # of access records over dates, by course; sliding window & various aggregations by objects, events, etc.) 45
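A sliding-window feature generator in this spirit can be sketched as follows. This is a hypothetical simplification: timestamps are integer day numbers and the trailing windows are counted back from the course end date.

```python
def sliding_window_counts(timestamps, end, windows=(1, 3, 7, 14, 30)):
    """Count access records in trailing windows of w days before `end`.

    timestamps: iterable of day numbers for one enrollment
    end:        last day of the course
    Returns one count feature per window size.
    """
    ts = sorted(timestamps)
    return {f"count_last_{w}d": sum(1 for t in ts if end - w < t <= end)
            for w in windows}
```

Crossing such windows with different aggregations (by object, by event type, sums, means, …) yields a large candidate pool, which is then pruned by GFS.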
  35. 46.

    Add Tam’s Model Into the Ensemble Last 8 hours:

    added Tam’s model into the ensemble. Last 4 hours: added Tam’s single best into the ensemble. 46
  36. 47.

    Three-Stage Ensemble 64 single + 15 ensemble + 2 ensemble

    + 1 blending
    Models | 5-Fold CV (AUC)
    Single Best | 0.9067
    Final model (Three-Stage Ensemble) | 0.9082
    47
  37. 48.

    To Avoid Over-fitting Comparing the LB and local CV is important

    to avoid over-fitting. Warning: over-fitting! 48
  38. 49.

    Team Framework/Guideline (1) We shared the index file of the 5-fold

    CV at first. (2) Using it, we uploaded the CV predictions and the predicted results for the test data to Dropbox. (3) We updated the wiki to record the CV score and LB score. Then we could all contribute to the ensemble/blending part. (If we hadn’t used the same 5-fold CV index, our ensemble model would have over-fit.) 49
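Sharing one fold index is easy to make deterministic. A sketch (not the team's actual script; the seed value is an arbitrary placeholder): shuffle the ids with a fixed seed and deal them round-robin, so every teammate regenerates the identical assignment.

```python
import random

def make_shared_fold_index(ids, k=5, seed=20150616):
    """Deterministic fold assignment that all teammates can reproduce:
    sort the ids, shuffle with a fixed seed, deal round-robin into k
    folds. Returns {id: fold_number}."""
    ids = sorted(ids)
    random.Random(seed).shuffle(ids)
    return {i: n % k for n, i in enumerate(ids)}
```

Writing this mapping to a file once and distributing it (as the team did via Dropbox) avoids any dependence on library versions or platform RNGs.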
  39. 50.

    Summary Feature engineering is one of the key points for

    winning. (Don’t give up a chance to improve your feature set.) People can work together internationally. (A well-designed guideline is important for working as a team.) 50
  40. 51.

    Grasp-and-Lift EEG Detection Task: Identify hand motions (multi-class) from time-series

    EEG records. 51 (Pic. is from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/data)
  41. 52.

    Dataset: EEG records 32-channel EEG data; 6 events to

    detect (HandStart, FirstDigitTouch, LiftOff, …) 52 (Fig. is from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/data and https://www.kaggle.com/acshock/grasp-and-lift-eeg-detection/how-noisy-are-these-eegs)
  42. 53.

    Winners’ Approaches 1st place: Alexandre Barachant & Rafał Cycoń (experts

    in EEG & signal processing) •  Feature Extraction: filter bank, neural oscillation, ERP •  Single Models: LR, LDA, RNN, CNN. 2nd place: Ming Liang (expert in image processing) •  Feature Extraction: nothing •  Single Models: CNN, recurrent CNN •  Model Selection: Greedy Forward Selection. It seems the single best model in this contest was the recurrent CNN. CNNs can perform as well as the traditional paradigm. 53
  43. 54.

    Classifying EEG Signals with a Convolutional Neural Network An input

    sample is treated as a height-1 image. The input sample at time t is composed of the n-dimensional data at times t − n + 1, t − n + 2, ..., t. 54
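Reshaping the raw series into such samples might look like this (a sketch assuming a `(T, C)` float array of C-channel EEG; the NCHW axis order is an assumption for illustration):

```python
import numpy as np

def to_height1_images(signal, n):
    """signal: (T, C) array of C-channel EEG samples.
    Returns (T - n + 1, 1, n, C): one 1 x n x C "image" per time t,
    built from the rows at times t - n + 1, ..., t."""
    T, C = signal.shape
    windows = np.stack([signal[t - n + 1: t + 1]
                        for t in range(n - 1, T)])
    return windows[:, None, :, :]   # insert the height-1 axis
```

Each window is then classified by the CNN as an ordinary (tiny) image.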
  44. 55.

    Recurrent CNN A state-of-the-art algorithm for the image

    classification task. 55 [4] Ming Liang and Xiaolin Hu, “Recurrent Convolutional Neural Network for Object Recognition”, CVPR’15. The RCL (Recurrent Convolution Layer) is a natural integration of RNN and CNN. The feed-forward (blue line) and recurrent (red line) computations both take the form of convolution. (Fig. is from http://blog.kaggle.com/2015/09/29/grasp-and-lift-eeg-detection-winners-interview-2nd-place-daheimao/)
  45. 56.

    Summary Convolutional neural networks work great on time-series signal records

    (EEG). Don’t fear the experts! •  A non-expert ML researcher might beat an expert researcher. •  Google Scholar is your friend. 56
  46. 57.

    GE FQ2: Flight Route Optimization Objective: produce a flight plan

    for each flight that reduces the average cost of the planes as much as possible. 57 (Pic. is from http://www.gequest.com/)
  47. 58.

    Format of a Flight Plan A list of 4D points (latitude, longitude, altitude

    and speed) for each flight plan, starting from the cut-off time 2013-10-02 12:00:00. 58 (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
  48. 59.

    Evaluation Metric (1 of 2) Objective: produce a flight plan

    for each flight that reduces the average cost of the planes as much as possible. C_total = C_fuel + C_delay + C_oscillation + C_turbulence. Oscillation: a penalty for changing altitude. Turbulence: a linear function of the elapsed time in turbulent zones. 59
  49. 60.

    Evaluation Metric (2 of 2) C_total = C_fuel

    + C_delay + C_oscillation + C_turbulence. Evaluated by a flight simulator. A flight can take 3 kinds of steps: “ascending”, “descending” and “cruising”. Fuel consumption depends on the flight instruction. *Airspeed (IAS): the speed of an aircraft relative to the air. *Ground speed (GS): the speed relative to the ground. 60
  50. 61.

    Dataset (1 of 3) Flight Information List of test flights

    to optimize. •  Arrival Airport •  Current Location •  Parameters of Cost Model 61
  51. 62.

    Dataset (2 of 3) Airport Locations. 62
  52. 63.

    Dataset (3 of 3) Restricted Zones: airspace which is reserved

    for special use (restricted from civilian aircraft). Turbulent Zones: airspace where flights experience turbulence (flights accrue a USD cost for the time spent within these zones). Weather (wind data): vectors in a 4-axis representation (time, altitude, easting, northing). 63 (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
  53. 64.

    Analyze the Cost Model C_total = C_fuel + C_delay

    + C_oscillation + C_turbulence. The burned fuel is a function of the airspeed, but the ground speed is the sum of the velocity relative to the air and the wind vector. → Taking advantage of the wind can significantly reduce the fuel cost and the delay cost. 64
  54. 65.

    Example of a Wind-Optimal Path The blue line is the wind-optimal path

    (it reduces the total cost by 15% compared with the red line). 65 (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
  55. 66.

    5th Solution (1 of 5) [3] Christian Kiss-Toth, Gabor Takacs,

    “A Dynamic Programming Approach for 4D Flight Route Optimization”. Procedure: (1) Create an initial route. (2) 2D optimization process (latitude and longitude). (3) Set the altitude and the airspeed of the flight. * The winning solution was not published for this competition. The 4D parameters are optimized separately. 66
  56. 67.

    5th Solution (2 of 5) Solve a shortest-path problem

    (Dijkstra’s algorithm). Vertices: the current position, the destination airport and the vertices of the restricted zones. Procedure: (1) Create an initial route. (2) 2D optimization process (latitude and longitude). (3) Set the altitude and the airspeed of the flight. 67 (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
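The Dijkstra step itself is standard. A generic sketch on a hypothetical adjacency list (the actual solution builds edges among the current position, the destination airport and the restricted-zone vertices, with geographic distances as costs):

```python
import heapq

def dijkstra(edges, src, dst):
    """edges: {node: [(neighbor, cost), ...]}.
    Returns the minimum cost from src to dst (inf if unreachable)."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue                 # stale queue entry
        for v, w in edges.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")
```

Edges that would cross a restricted zone are simply omitted from the graph, so the shortest path routes around the zone corners.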
  57. 68.

    5th Solution (3 of 5) How to find the wind-optimal

    path? Procedure: (1) Create an initial route. (2) 2D optimization process (latitude and longitude). (3) Set the altitude and the airspeed of the flight. 68 (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
  58. 69.

    5th Solution (4 of 5) Create a grid in the

    airspace and divide the initial path into N parts. → Optimize with dynamic programming; perform it recursively. (2) 2D optimization process (latitude and longitude). 69 (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
  59. 70.

    5th Solution (5 of 5) Procedure: (1) Create an initial

    route. (2) 2D optimization process (latitude and longitude). (3) Set the altitude and the airspeed of the flight. Optimize two variables: the descending distance and the cruise speed. For this 1D optimization, the solution used an exhaustive search. 70
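The exhaustive search over the two remaining variables can be sketched as follows (a sketch; `cost` stands in for a call to the flight simulator, and the grids are hypothetical discretizations):

```python
def exhaustive_search(cost, descend_grid, speed_grid):
    """Try every (descending distance, cruise speed) pair and keep
    the cheapest.  `cost(d, s)` is a callback returning the simulated
    total cost of the flight plan with those settings."""
    return min((cost(d, s), d, s)
               for d in descend_grid for s in speed_grid)
```

For coarse grids this brute force is cheap, and it cannot get stuck in a local minimum the way a gradient step could.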
  60. 72.

    Summary A deep understanding of the objective and the evaluation metric is

    important for solving the problem (i.e. taking advantage of the wind is the key point). Basic knowledge of computer science (the DP algorithm) and engineering effort are also helpful in this kind of competition. 72
  61. 73.

    Agenda Data Mining Competitions Techniques (Tricks) for competitions Learning from

    Winning Solutions Trend on Kaggle 73 4 - 12 13 - 28 29 - 72 73 - 78 
  62. 74.

    Improved Kaggle Rankings (1/3) The old ranking system 74 Kaggle

    users receive points for their performance in competitions. In May 2015, Kaggle rolled out an updated version of the ranking system. Both the old and the new formula include: a penalty for being part of a team, the popularity of the contest, and a decay term.
  63. 75.

    Improved Kaggle Rankings (2/3) The new formula imposes a smaller

    penalty on being part of a team. 75 (Fig.: the new vs. old penalty term for being part of a team.)
  64. 76.

    Improved Kaggle Rankings (3/3) The new point system counts your achievements

    in past contests. 76 (Fig.: the new vs. old decay term.)
  65. 77.

    Forming a Team Seems Active In the CAT competition, ranks

    #1–#7 are teams, with no solo players. Teaming up is active when ensemble models work well. 77
  66. 78.

    Take-away Messages Join Kaggle competitions for fun and learn techniques

    from expert data scientists around the world! RGF and XGBoost have L2 regularization, and it might work well for noisy datasets (and ensemble models). Ensemble/blending techniques are tricky; some techniques are impractical in real-world settings. “I don’t think ‘I hardly ever use Deep Learning.’ I do use it.” 78
  67. 79.

    References [1] Andreas Töscher and Michael Jahrer,

    “The BigChaos Solution to the Netflix Grand Prize”. [2] Rie Johnson and Tong Zhang, “Learning Nonlinear Functions Using Regularized Greedy Forest”, TPAMI’14. [3] Christian Kiss-Toth and Gábor Takács, “A Dynamic Programming Approach for 4D Flight Route Optimization”, Big Data’14. [4] Ming Liang and Xiaolin Hu, “Recurrent Convolutional Neural Network for Object Recognition”, CVPR’15. 79