Slide 1

Techniques (Tricks) for Data Mining Competitions Kohei Ozaki 2015-10-15 @ Kyoto University

Slide 2

Kohei Ozaki Kaggle Enthusiast Work Experience: •  Insurance Fraud Detection •  Predictive Modeling for Online Advertising •  Recommendation System for SNS •  etc. (screenshot on https://www.kaggle.com/confirm) 2

Slide 3

Agenda: Data Mining Competitions (slides 4-12), Techniques (Tricks) for Competitions (slides 13-28), Learning from Winning Solutions (slides 29-72), Trend on Kaggle (slides 73-78)

Slide 4

Agenda: Data Mining Competitions (slides 4-12), Techniques (Tricks) for Competitions (slides 13-28), Learning from Winning Solutions (slides 29-72), Trend on Kaggle (slides 73-78)

Slide 5

Data Mining Competitions: Participants compete on the score of their predictive models. A competition normally runs for 2 or 3 months. There are many kinds of real-world tasks/datasets: Insurance, Credit Scoring, Loan Default, Medical, EEG, MEG, Image Classification, Healthcare, High-Energy Physics, Social Good, Marketing, Advertising, Trajectory, Telematics, etc.

Slide 6

Step 1: Get the Data. Download the datasets and understand the competition task.

Slide 7

Step 2: Make a Submission. Create your model and make a submission.

Slide 8

Step 3: Check Your Rank. After you make a submission, your model is evaluated immediately and ranked on the Public Leaderboard.

Slide 9

Huge Prize Pools: Netflix Prize 2009 ($1M): recommend movies. Heritage Health Prize 2011 ($3M): predict days in hospital. GE Flight Quest Challenge, Part 1 (2012, $250k): predict gate/arrival time; Part 2 (2014, $220k): optimize the flight plan. ↑ Predictive Modeling World / Yet Another World ↓: DARPA Grand Challenge ($2M): autonomous vehicle. Google Lunar XPRIZE ($30M): autonomous robotic spacecraft.

Slide 10

Who Hosts Competitions? Kaggle is a platform for data prediction competitions (a crowdsourcing community of 360k+ data scientists). In addition to prize money, many data scientists use Kaggle to learn and collaborate with experts.

Slide 11

Great Place to Try out Your Ideas (1/2) Many researchers/developers also use Kaggle (XGBoost, LibFM, LibFFM, Lasagne, Keras, cxxnet, etc.). "Our original motivation for entering the contest was to try out our new tree ensemble, regularized greedy forest (RGF), in a competitive setting." Rie Johnson (RJ Research Consulting), a prize winner in the Heritage Health Prize (quote from http://www.heritagecaliforniaaco.com/?p=hpn-today&article=45)

Slide 12

Great Place to Try out Your Ideas (2/2) Many researchers/developers also use Kaggle (XGBoost, LibFM, LibFFM, Lasagne, Keras, cxxnet, etc.). "My intention of participating in this competition is to evaluate the performance of recurrent convolutional neural network (RCNN) in processing time series data." Ming Liang (Tsinghua University), a prize winner in Grasp-and-Lift EEG Detection (quote from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/forums/t/16617/team-daheimao-solution)

Slide 13

Agenda: Data Mining Competitions (slides 4-12), Techniques (Tricks) for Competitions (slides 13-28), Learning from Winning Solutions (slides 29-72), Trend on Kaggle (slides 73-78)

Slide 14

Two Main Factors: the quality of the individual models & the ensemble idea. No sophisticated individual models, no victory. Both the individual models & the ensemble idea are key. (Individual Model / Ensemble Model)

Slide 15

Hyper Parameter Tuning & Feature Engineering Read Owen Zhang’s slide (textbook) carefully :-) http://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions 15

Slide 16

Greedy Forward Selection (GFS): Greedy Forward Selection is simple and works well for feature selection and for model selection in ensembles.
1: Initialize the feature set F_0 = ∅ at k = 0.
2: Iterate: k = k + 1.
3: Find the best feature j ∉ F_{k-1} to add, i.e. the one with the most significant cost reduction.
4: Set F_k = F_{k-1} ∪ {j}.
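A minimal Python sketch of GFS for feature selection, assuming a scikit-learn style estimator and cross-validated scoring; the function name, stopping rule, and scoring metric are illustrative choices, not taken from the slides:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def greedy_forward_selection(model, X, y, scoring="roc_auc", cv=5):
    """Greedily add the feature whose addition improves the CV score the most."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # score every candidate feature added to the current set
        trials = [(cross_val_score(model, X[:, selected + [j]], y,
                                   scoring=scoring, cv=cv).mean(), j)
                  for j in remaining]
        score, j = max(trials)
        if score <= best_score:      # stop once no candidate reduces the cost
            break
        best_score, selected = score, selected + [j]
        remaining.remove(j)
    return selected, best_score
```

For model selection in an ensemble, the same loop can be run over columns of out-of-fold predictions instead of raw features.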

Slide 17

GBDT: RGF-L2 and XGBoost have L2 regularization on the leaf coefficients. L2 regularization works great on noisy datasets & ensemble models. The model is an additive ensemble of regression trees (CART): ŷ_i = f_1(x_i) + f_2(x_i) + ... + f_K(x_i), with parameters Θ = {f_1, f_2, ..., f_K}. The objective is Obj(Θ) = l(Θ) + Ω(Θ), where l(Θ) is the loss term and Ω(Θ) is the regularization term (heuristics including L0 (# of leaves) and L2 on the leaf coefficients).
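As a concrete illustration, XGBoost exposes the L2 penalty on leaf weights through the lambda parameter (reg_lambda in its scikit-learn wrapper); the values below are illustrative, not a recommended setting:

```python
import numpy as np
import xgboost as xgb

# Tiny synthetic binary-classification example (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

params = {
    "objective": "binary:logistic",
    "eta": 0.1,
    "max_depth": 4,
    "lambda": 10.0,        # L2 penalty on the leaf coefficients
    "eval_metric": "auc",
}
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=100)
```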

Slide 18

Reminder: 5-Fold Cross Validation. Use K-1 parts for training and 1 part for testing.
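A minimal scikit-learn sketch of 5-fold CV; the toy data and the shuffle/seed choices are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(40).reshape(20, 2)   # toy data: 20 samples, 2 features
y = np.arange(20) % 2

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # K-1 parts for training, 1 part for testing
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]
    print(f"fold {fold}: {len(train_idx)} train rows, {len(test_idx)} test rows")
```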

Slide 19

Ensemble Techniques: Stacking (1/2) Stacking uses different methods' predictions as "meta-features". To obtain the meta-features used to train the ensemble model, use K-1 parts for training and 1 part for making the meta-feature (one 1-D meta-feature per model).
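A hedged sketch of building one 1-D meta-feature with out-of-fold predictions, assuming a scikit-learn style classifier with predict_proba; the helper name and the averaging of test-set predictions across folds are illustrative choices:

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def oof_meta_feature(model, X, y, X_test, n_splits=5, seed=0):
    """Out-of-fold predictions on train plus fold-averaged predictions on test."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    for train_idx, valid_idx in kf.split(X):
        m = clone(model).fit(X[train_idx], y[train_idx])
        oof[valid_idx] = m.predict_proba(X[valid_idx])[:, 1]   # meta-feature for train
        test_pred += m.predict_proba(X_test)[:, 1] / n_splits  # meta-feature for test
    return oof, test_pred
```

Stacking several models' out-of-fold columns side by side gives the training matrix for the second-stage (ensemble) model.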

Slide 20

Ensemble Techniques: Stacking (2/2) You can stack more stages :-)

Slide 21

Netflix Blending (Quiz Blending) [1] Andreas Töscher and Michael Jahrer, "The BigChaos Solution to the Netflix Grand Prize". Assume that the task is regression and the prediction is evaluated by RMSE. What can we do to improve our score? (Figure: several individual models' predictions to be blended.)

Slide 22

zoom-in 22

Slide 23

Actual setting: We have RMSE feedback on the quiz data (30% of the test data)!

Slide 24

Utilize Quiz Feedback for Blending (1/4) Our goal is to find the linear combination of predicted results that best predicts y (the target variable). Let X be the N-by-p matrix formed by combining the predictions of the p individual models, and let y be the unobserved vector of true target values. (Figure: the p prediction columns stacked into the matrix X.)

Slide 25

Utilize Quiz Feedback for Blending (2/4) If y were known, the best estimation by linear combination would be the least-squares solution β = (XᵀX)⁻¹Xᵀy, where X is the N-by-p matrix of the p individual models' predictions and y is the unobserved vector of true target values.

Slide 26

Utilize Quiz Feedback for Blending (3/4) If y is known, the best estimation by linear combination is β = (XᵀX)⁻¹Xᵀy. XᵀX can be computed exactly from our own predictions. The j-th element of Xᵀy is yᵀx_j = (yᵀy + x_jᵀx_j - ||y - x_j||²) / 2: x_jᵀx_j can be computed exactly, ||y - x_j||² can be approximated by using quiz feedback (it is N times the MSE of model j), and yᵀy can be approximated from the quiz feedback of an all-zero submission (the all-zero case).

Slide 27

Utilize Quiz Feedback for Blending (4/4) Our goal is to find the linear combination of predicted results that best predicts y (the target variable). The blended prediction that uses quiz feedback is X · β, where X is the N-by-p matrix of the p individual models' predictions and β is the (p x 1) vector of weight parameters estimated above.
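A small sketch of this trick under stated assumptions: we know each model's quiz RMSE and the quiz RMSE of an all-zero submission, and we approximate Xᵀy from them; all names are illustrative:

```python
import numpy as np

def quiz_blend_weights(X, quiz_rmse, zero_rmse):
    """Estimate linear blending weights from quiz (leaderboard) RMSE feedback.

    X          : (N, p) matrix of the p models' predictions on the quiz set
    quiz_rmse  : length-p array of each model's quiz RMSE
    zero_rmse  : quiz RMSE of an all-zero submission (gives y'y ~ N * RMSE_0^2)
    """
    n = X.shape[0]
    yty = n * zero_rmse ** 2                       # approximated (all-zero case)
    xtx = X.T @ X                                  # computed exactly
    xjtxj = np.diag(xtx)
    # x_j'y = (x_j'x_j + y'y - ||y - x_j||^2) / 2, with ||y - x_j||^2 ~ N * RMSE_j^2
    xty = 0.5 * (xjtxj + yty - n * np.asarray(quiz_rmse) ** 2)
    return np.linalg.solve(xtx, xty)               # beta = (X'X)^{-1} X'y

# blended = X @ quiz_blend_weights(X, rmses, rmse_zero)
```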

Slide 28

OT: Amazon AWS for Modeling. c4.8xlarge (36 CPU cores with 64 GB RAM, $0.3 per hour). My bagging GBDT model for the KDD Cup takes 6 hrs (= $1.8; roughly 280 yen = $2.3). * The above price is for a spot instance in us-west-1c in Oct 2015; the price changes dynamically.

Slide 29

Agenda: Data Mining Competitions (slides 4-12), Techniques (Tricks) for Competitions (slides 13-28), Learning from Winning Solutions (slides 29-72), Trend on Kaggle (slides 73-78)

Slide 30

Learn from Winning Solutions. Today's talk describes the following competitions: KDD Cup 2015 (binary classification, access logs); GE Flight Quest 2 (optimization); Grasp-and-Lift EEG Detection (multi-class classification, BCI, EEG recordings).

Slide 31

About KDD Cup 2015: the annual and most prestigious competition in data mining; 821 teams joined. Task: predict the probability that a student will drop out of a course within 10 days. The dataset is provided by XuetangX, one of the largest MOOC platforms in China. (Figure: access records per date, labeled by whether the course was dropped or not.)

Slide 32

Winner: InterContinental Ensemble. Team: Jeong, Mert, Andreas, Michael, Xiaocong, Peng, Kohei, Tam, Song.

Slide 33

Dataset (1 of 3): Enrollment data; a pair of (username, course_id) for each enrollment_id. (1) Enrollment data (2) Access logs (3) Object attributes

Slide 34

Dataset (2 of 3): Access logs. Source, Event, and Object ID are provided. (1) Enrollment data (2) Access logs (3) Object attributes

Slide 35

Dataset (3 of 3): Object attributes; detailed information for each Object ID. (1) Enrollment data (2) Access logs (3) Object attributes

Slide 36

Analyze User Activities: Users who don't access the course many times drop out of the course. (Histogram: # of access logs per enrollment_id vs. # of enrollment_ids.)

Slide 37

Analyze Last Access: Obviously, users who accessed the course recently continue the course.

Slide 38

Initial Analysis, Base Features: User activity and last access make a big impact on the AUC score.
Features | Model | 5-Fold CV (AUC)
One-hot encoding (course_id) | GBDT | 0.6118
+ num_records (user activity) | GBDT | 0.8485
+ num_unique_object | GBDT | 0.8507
+ num_unique_active_days | GBDT | 0.8595
+ num_unique_active_hours | GBDT | 0.8601
+ num_unique_problem_event | GBDT | 0.8621
+ first and last timestamp (last access) | GBDT | 0.8821
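A hedged pandas sketch of how such base aggregates could be computed from the access log; the file name and column names follow the KDD Cup 2015 log format as I recall it and should be treated as assumptions:

```python
import pandas as pd

# Assumed log columns: enrollment_id, time, source, event, object
log = pd.read_csv("log_train.csv", parse_dates=["time"])

base_feats = log.groupby("enrollment_id").agg(
    num_records=("event", "size"),                                   # user activity
    num_unique_object=("object", "nunique"),
    num_unique_active_days=("time", lambda s: s.dt.date.nunique()),
    first_timestamp=("time", "min"),                                 # last-access features
    last_timestamp=("time", "max"),
).reset_index()
```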

Slide 39

Feature Engineering (MC), Multiple Courses Features. Concept: some users enrolled in multiple courses. (Figure: access records per date, by course.)
Features | Model | 5-Fold CV (AUC)
Base | GBDT | 0.8821
+ (MC) first and last timestamp for each user | GBDT | 0.8936
+ (MC) num_unique_active_days for each user | GBDT | 0.8946
+ (MC) num_enrollment_courses for each user | GBDT | 0.8953

Slide 40

Feature Engineering (EP), Evaluation Period Features (a bit leaky). Concept: the activities after the end date of the course. (Figure: access records per date, by course.)
Features | Model | 5-Fold CV (AUC)
Base + MC | GBDT | 0.8953
Base + MC + EP | GBDT | 0.9027

Slide 41

Feature Engineering (PXJ), Features from Teammates (Peng, Xiaocong and Jeong):
•  Max absent days
•  Min days from first visit to next course begin
•  Min days from 10 days after last visit to next course begin
•  Min days from last visit to next course end
•  Min days from next course to last visit
•  Min days from 10 days after course end to next course begin
•  Min days from 10 days after course end to next course end
•  Min days from course end to next visit
•  Active days from last visit to course end
•  Active days in 10 days from course end
•  Average hour per day
•  Course drop rate
•  Time span
Features | Model | 5-Fold CV (AUC)
Base + MC + EP | GBDT | 0.9027
Base + MC + EP + PXJ | GBDT | 0.9052

Slide 42

Last 48 hours: We had been in 3rd place for a long time.

Slide 43

Feature Engineering (LD), Label Dependent Features (a bit leaky). Count the number of dropped-out courses for each day of the evaluation period by using the target variables in the training set.
Features | Model | 5-Fold CV (AUC)
Base + MC + EP + PX1 + PX2 | GBDT | 0.9052
Base + MC + EP + PX1 + PX2 + LD | GBDT | 0.9062
Base + MC + EP + PX1 + PX2 + LD | Bagging GBDT | 0.9067

Slide 44

Last 27 hours: Add the LD feature into the ensemble model.

Slide 45

Feature Engineering (TAM), Sliding Window & Various Aggregations + GFS (Tam's work). Use a sliding window to generate many features automatically. (Figure: access records per date, by course, with sliding-window aggregations by objects, events, etc.)
Features | Model | 5-Fold CV (AUC)
Base + MC + EP + PXJ + LD | GBDT | 0.9062
Base + MC + EP + PXJ + LD + TAM | GBDT | 0.9067
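A hedged sketch of sliding-window aggregation, reusing the log DataFrame from the earlier sketch; the window lengths, the choice of aggregates, and anchoring the windows at the global last date are illustrative assumptions:

```python
import pandas as pd

# Daily activity counts per enrollment (log columns assumed: enrollment_id, time)
daily = (log.assign(date=log["time"].dt.floor("D"))
            .groupby(["enrollment_id", "date"]).size()
            .rename("n_records").reset_index())

windows = {}
for days in [3, 7, 14, 30]:                                   # sliding windows in days
    cutoff = daily["date"].max() - pd.Timedelta(days=days)
    agg = (daily[daily["date"] > cutoff]
           .groupby("enrollment_id")["n_records"].agg(["sum", "mean", "max"]))
    agg.columns = [f"last{days}d_{c}" for c in agg.columns]
    windows[days] = agg

sliding_feats = pd.concat(list(windows.values()), axis=1)     # one row per enrollment_id
```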

Slide 46

Last 8 hours: Add TAM's model into the ensemble model. Last 4 hours: Add Tam's single best model into the ensemble model.

Slide 47

Three-Stage Ensemble: 64 single models + 15 ensemble models + 2 ensemble models + 1 blending.
Models | 5-Fold CV (AUC)
Single Best | 0.9067
Final model (Three-Stage Ensemble) | 0.9082

Slide 48

To Avoid Over-fitting: Comparing the LB score and the local CV score is important to avoid over-fitting. (Figure annotation: Warning! over-fitting.)

Slide 49

Team Framework/Guideline: (1) We shared the index file for the 5-fold CV first. (2) Using it, we uploaded the CV predictions and the predicted results for the test data to Dropbox. (3) We updated the wiki to record the CV score and the LB score. Then we could all contribute to the ensemble/blending part. (If we had not used the same 5-fold CV indices, our ensemble model would over-fit.)
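A minimal sketch of sharing the fold index file, assuming scikit-learn KFold; the file name, seed, and placeholder row count are illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold

n_train = 100_000            # placeholder: number of training rows in the real data

# One teammate creates the shared fold assignment once and distributes the file.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_id = np.zeros(n_train, dtype=int)
for k, (_, valid_idx) in enumerate(kf.split(np.arange(n_train))):
    fold_id[valid_idx] = k
np.savetxt("cv_fold_index.csv", fold_id, fmt="%d")

# Everyone else loads the same file instead of re-splitting the data.
fold_id = np.loadtxt("cv_fold_index.csv", dtype=int)
```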

Slide 50

Summary: Feature engineering is one of the key points for winning. (Don't give up a chance to improve your feature set.) People can work together internationally. (A well-designed guideline is important for working as a team.)

Slide 51

Grasp-and-Lift EEG Detection Task: Identify hand motions (multi-class) from time-series EEG records. 51 (Pic. is from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/data)

Slide 52

Dataset: EEG recordings. 32-channel EEG data; 6 events to detect (HandStart, FirstDigitTouch, LiftOff, ...). (Fig. is from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/data and https://www.kaggle.com/acshock/grasp-and-lift-eeg-detection/how-noisy-are-these-eegs)

Slide 53

Winners' approaches. 1st place: Alexandre Barachant & Rafał Cycoń (experts in EEG & signal processing) •  Feature Extraction: filter bank, neural oscillation, ERP •  Single Models: LR, LDA, RNN, CNN. 2nd place: Ming Liang (expert in image processing) •  Feature Extraction: nothing •  Single Models: CNN, Recurrent CNN •  Model Selection: Greedy Forward Selection. It seems the single best model in this contest is the Recurrent CNN. CNNs can perform as well as the traditional paradigm.

Slide 54

Classifying EEG signals with a Convolutional Neural Network. An input sample is treated as a height-1 image. The input sample at time t is composed of the n-dimensional data at times t - n + 1, t - n + 2, ..., t. (Figure: the n-dimensional data up to time t.)
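A minimal Keras sketch of this idea (Keras appears in the library list earlier in the deck), treating a window of the multichannel signal as a height-1 image; the window length, layer sizes, and output activation are illustrative assumptions, not the winners' architecture:

```python
from tensorflow import keras
from tensorflow.keras import layers

window = 512      # number of past time steps fed to the network (assumption)
channels = 32     # EEG channels
n_events = 6      # events to detect

# The input is shaped as a height-1 "image": (height=1, width=window, channels).
model = keras.Sequential([
    keras.Input(shape=(1, window, channels)),
    layers.Conv2D(32, kernel_size=(1, 7), padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=(1, 4)),
    layers.Conv2D(64, kernel_size=(1, 7), padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(n_events, activation="sigmoid"),   # one probability per event
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```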

Slide 55

Recurrent CNN: a current state-of-the-art algorithm for image classification tasks. [4] Ming Liang and Xiaolin Hu, "Recurrent Convolutional Neural Network for Object Recognition", CVPR'15. The RCL (Recurrent Convolution Layer) is a natural integration of an RNN and a CNN. The feed-forward (blue line) and recurrent (red line) computations both take the form of convolution. (Fig. is from http://blog.kaggle.com/2015/09/29/grasp-and-lift-eeg-detection-winners-interview-2nd-place-daheimao/)

Slide 56

Summary: Convolutional Neural Networks work great on time-series signal recordings (EEG). Don't fear the experts! •  A non-expert ML researcher might beat an expert researcher. •  Google Scholar is your friend.

Slide 57

GE FQ2: Flight Route Optimization. Objective: produce a flight plan for each flight to make the average cost of the planes as low as possible. (Pic. is from http://www.gequest.com/)

Slide 58

Format of a Flight Plan: a list of 4D points (1: Latitude, 2: Longitude, 3: Altitude, 4: Speed) for each flight plan, starting from the cut-off time 2013-10-02 12:00:00. (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, "A Dynamic Programming Approach for 4D Flight Route Optimization")

Slide 59

Evaluation Metric (1 of 2). Objective: produce a flight plan for each flight to make the average cost of the planes as low as possible. C_total = C_fuel + C_delay + C_oscillation + C_turbulence. Oscillation: penalty for changing altitude. Turbulence: a linear function of the elapsed time in turbulent zones.

Slide 60

Evaluation Metric (2 of 2). C_total = C_fuel + C_delay + C_oscillation + C_turbulence, evaluated by a flight simulator. A flight can take 3 kinds of steps: "ascending", "descending" and "cruising". Fuel consumption depends on the flight instruction. *Airspeed (IAS): the speed of an aircraft relative to the air. *Ground speed (GS): the speed of an aircraft relative to the ground.

Slide 61

Dataset (1 of 3) Flight Information List of test flights to optimize. •  Arrival Airport •  Current Location •  Parameters of Cost Model 61

Slide 62

Dataset (2 of 3): Airport Locations. (Objective: produce a flight plan for each flight to make the average cost of the planes as low as possible.)

Slide 63

Dataset (3 of 3): Restricted Zones: airspace which is reserved for special use (restricted from civilian aircraft). Turbulent Zones: airspace where flights experience turbulence (they accrue a USD cost for the time spent within these zones). Weather (wind data): vectors in a 4-axis representation (time, altitude, easting, northing). (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, "A Dynamic Programming Approach for 4D Flight Route Optimization")

Slide 64

Analyze the Cost Model. C_total = C_fuel + C_delay + C_oscillation + C_turbulence. The burned fuel is a function of the airspeed, but the ground speed is the sum of the velocity relative to the air and the wind vector. → Taking advantage of the wind can significantly reduce the fuel cost and the delay cost. *Airspeed (IAS): the speed of an aircraft relative to the air. *Ground speed (GS): the speed of an aircraft relative to the ground.

Slide 65

Example of a Wind-Optimal Path. The blue line is the wind-optimal path (it reduces the total cost by 15% compared with the red line). (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, "A Dynamic Programming Approach for 4D Flight Route Optimization")

Slide 66

5th Solution (1 of 5) [3] Christian Kiss-Toth, Gabor Takacs, "A Dynamic Programming Approach for 4D Flight Route Optimization". Procedure: (1) Create an initial route (2) 2D optimization process (latitude and longitude) (3) Set the altitudes and the airspeed of the flight. * The winning solution was not published for this competition. The 4D parameters are optimized separately.

Slide 67

5th Solution (2 of 5): Solve a shortest-path problem (Dijkstra's algorithm). Vertices: the current position, the destination airport, and the vertices of the restricted zones. Procedure: (1) Create an initial route (2) 2D optimization process (latitude and longitude) (3) Set the altitudes and the airspeed of the flight. (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, "A Dynamic Programming Approach for 4D Flight Route Optimization")
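A compact Dijkstra sketch over such a sparse vertex set; the graph construction (which vertices are mutually visible, and their distances) is left as an assumption and is not the authors' code:

```python
import heapq

def dijkstra(n_vertices, edges, source):
    """edges: dict mapping a vertex to a list of (neighbor, cost) pairs."""
    dist = [float("inf")] * n_vertices
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in edges.get(u, []):
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

# Vertices would be {current position, destination airport, restricted-zone corners};
# edge costs would be distances between pairs of mutually visible vertices.
```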

Slide 68

5th Solution (3 of 5): How to find the wind-optimal path? Procedure: (1) Create an initial route (2) 2D optimization process (latitude and longitude) (3) Set the altitudes and the airspeed of the flight. (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, "A Dynamic Programming Approach for 4D Flight Route Optimization")

Slide 69

5th Solution (4 of 5): Create a grid in the airspace and divide the initial path into N parts. → Optimize by Dynamic Programming, and perform it recursively. (2) 2D optimization process (latitude and longitude). (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, "A Dynamic Programming Approach for 4D Flight Route Optimization")

Slide 70

5th Solution (5 of 5): Procedure: (1) Create an initial route (2) 2D optimization process (latitude and longitude) (3) Set the altitudes and the airspeed of the flight. Optimize two variables: the descending distance and the cruise speed. For this 1D optimization, the solution used an exhaustive search.

Slide 71

Report from GE (quote from http://www.gereports.com/post/93139010005/underdog-scientist-cracks-code-to-reduce-flight/)

Slide 72

Summary: A deep understanding of the objective and the evaluation metric is important for solving the problem (e.g. taking advantage of the wind is the key point). Basic knowledge of computer science (DP algorithms) and engineering effort are also helpful for this kind of competition.

Slide 73

Agenda: Data Mining Competitions (slides 4-12), Techniques (Tricks) for Competitions (slides 13-28), Learning from Winning Solutions (slides 29-72), Trend on Kaggle (slides 73-78)

Slide 74

Improved Kaggle Rankings (1/3). Kaggle users receive points for their performance in competitions. In May 2015, Kaggle rolled out an updated version of the ranking system. (Figures: the old and the new ranking formulas, each annotated with a penalty for being part of a team, the popularity of the contest, and a decay term.)

Slide 75

Improved Kaggle Rankings (2/3). The new formula imposes a smaller penalty on being part of a team. (Figure: the old vs. new team-penalty terms.)

Slide 76

Improved Kaggle Rankings (3/3). The new point system counts your achievements in past contests. (Figure: the old vs. new decay terms.)

Slide 77

Forming a Team Seems Active. In the CAT competition, ranks #1 through #7 are all teams, with no solo players. Teaming up is active when ensemble models work well.

Slide 78

Take-away Messages: Join Kaggle competitions for fun and learn techniques from expert data scientists around the world! RGF and XGBoost have L2 regularization, and it can work well on noisy datasets (and in ensemble models). Ensemble/blending techniques are tricky; some techniques are impractical in real-world settings. I am not saying that I hardly use Deep Learning; I do use it.

Slide 79

References [1] Andreas Töscher and Michael Jahrer, "The BigChaos Solution to the Netflix Grand Prize". [2] Rie Johnson and Tong Zhang, "Learning Nonlinear Functions Using Regularized Greedy Forest", TPAMI'14. [3] Christian Kiss-Toth and Gábor Takács, "A Dynamic Programming Approach for 4D Flight Route Optimization", Big Data'14. [4] Ming Liang and Xiaolin Hu, "Recurrent Convolutional Neural Network for Object Recognition", CVPR'15.