Techniques (Tricks) for Data Mining Competitions

@smly
October 16, 2015


Transcript

  1. 2.

    Kohei Ozaki, Kaggle Enthusiast. Work Experience: •  Insurance Fraud Detection

    •  Predictive Modeling for Online Advertising •  Recommendation System for SNS •  etc. (screenshot at https://www.kaggle.com/confirm) 2
  2. 3.

    Agenda Data Mining Competitions Techniques (Tricks) for competitions Learning from

    Winning Solutions Trend on Kaggle 3 4 - 12 13 - 28 29 - 72 73 - 78 
  3. 4.

    Agenda Data Mining Competitions Techniques (Tricks) for competitions Learning from

    Winning Solutions Trend on Kaggle 4 4 - 12 13 - 28 29 - 72 73 - 78 
  4. 5.

    Data Mining Competitions Participants compete on the scores of their predictive models.

    A competition normally runs for 2 or 3 months. There are many kinds of tasks/datasets from the real world: insurance, credit scoring, loan default, medical, EEG, MEG, image classification, healthcare, high-energy physics, social good, marketing, advertising, trajectory, telematics, etc… 5
  5. 8.

    Step3: Check Your Rank After you make the submission, your

    models are evaluated immediately and ranked on the Public Leaderboard. 8
  6. 9.

    Huge Amount of Prize Pool Netflix Prize 2009 ($1M) Recommend

    movies. Heritage Health Prize 2011 ($3M): Predict days in hospital. GE Flight Quest Challenge: Part 1: Predict gate/arrival time, 2012 ($250k); Part 2: Optimize flight plan, 2014 ($220k). ↑ Predictive Modeling World / Yet Another World ↓ DARPA Grand Challenge ($2M): Autonomous vehicle. Google Lunar XPRIZE ($30M): Autonomous robotic spacecraft. 9
  7. 10.

    Who Hosts Competitions? Kaggle is a platform for data prediction

    competitions (a crowdsourcing community of 360k+ data scientists). In addition to prize money, many data scientists use Kaggle to learn and collaborate with experts. 10
  8. 11.

    Great Place to Try out Your Ideas (1/2) Many researchers/developers

    also use Kaggle. (XGBoost, LibFM, LibFFM, Lasagne, Keras, cxxnet, etc…) "Our original motivation for entering the contest was to try out our new tree ensemble regularized greedy forest (RGF) in a competitive setting." — Rie Johnson (RJ Research Consulting), a prize winner in the Heritage Health Prize (quote from http://www.heritagecaliforniaaco.com/?p=hpn-today&article=45)
  9. 12.

    Great Place to Try out Your Ideas (2/2) Many researchers/developers

    also use Kaggle. (XGBoost, LibFM, LibFFM, Lasagne, Keras, cxxnet, etc…) "My intention of participating in this competition is to evaluate the performance of recurrent convolutional neural network (RCNN) in processing time series data." — Ming Liang (Tsinghua University), a prize winner in Grasp-and-Lift EEG Detection (quote from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/forums/t/16617/team-daheimao-solution)
  10. 13.

    Agenda Data Mining Competitions Techniques (Tricks) for competitions Learning from

    Winning Solutions Trend on Kaggle 13 4 - 12 13 - 28 29 - 72 73 - 78 
  11. 14.

    Two Main Factors The quality of the individual model &

    the ensemble idea. No sophisticated individual models, no victory. Both the individual models and the ensemble idea are key. 14 Individual Model / Ensemble Model
  12. 15.

    Hyper Parameter Tuning & Feature Engineering Read Owen Zhang’s slide

    (textbook) carefully :-) http://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions 15
  13. 16.

    Greedy Forward Selection (GFS) Greedy Forward Selection (GFS) is simple

    and works well for feature selection and for model selection in an ensemble.
    1: Initialize the feature set F_k = ∅ at k = 0.
    2: Iterate: k = k + 1.
    3: Find the best feature j ∉ F_{k-1} to add, i.e. the one with the most significant cost reduction.
    4: Set F_k = F_{k-1} ∪ {j}.
    16
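The loop above can be sketched in Python. This is a minimal sketch: `evaluate` is a hypothetical callback that scores a feature set (e.g. returning 1 − CV AUC, so that lower is better), not anything from the talk.

```python
def greedy_forward_selection(candidates, evaluate, tol=0.0):
    """Greedy Forward Selection over features (or models, for ensembles).

    candidates: list of feature names
    evaluate:   function(feature_set) -> cost, lower is better
    """
    selected = set()
    best_cost = evaluate(selected)            # F_0 = empty set
    while True:
        # Score every candidate not yet selected.
        trials = [(evaluate(selected | {j}), j)
                  for j in candidates if j not in selected]
        if not trials:
            break
        cost, j = min(trials)                 # best feature to add
        if best_cost - cost <= tol:           # no significant reduction
            break
        selected.add(j)                       # F_k = F_{k-1} + {j}
        best_cost = cost
    return selected
```

The same loop works for ensemble selection by treating each model's predictions as a "feature".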
  14. 17.

    GBDT: RGF-L2 and XGBoost have L2-Regularization for Leaf Coefficient L2

    regularization works great on noisy datasets & ensemble models.
    A regression tree (CART) maps each input to a leaf weight (e.g. +8.0 or −0.2), and the prediction is the sum over K trees:
    ŷ_i = f_1(x_i) + f_2(x_i) + · · · + f_K(x_i), with parameters Θ = {f_1, f_2, · · · , f_K}.
    Objective: Obj(Θ) = L(Θ) + Ω(Θ), where L(Θ) is the loss term and Ω(Θ) is the regularization term (heuristics including L0 (# of leaves) and L2 on the leaf coefficients).
    17
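To see why the L2 term helps: in XGBoost's second-order derivation (from the XGBoost paper, not shown on the slide), the optimal weight of a leaf with gradient sum G and hessian sum H is w* = −G / (H + λ), so a larger λ shrinks leaf outputs toward zero on noisy data. A tiny sketch:

```python
def optimal_leaf_weight(grads, hessians, lam):
    """XGBoost-style optimal leaf weight: w* = -G / (H + lam).

    G = sum of gradients, H = sum of hessians over the examples
    falling in the leaf; for squared-error loss g_i = y_hat_i - y_i
    and h_i = 1.  A larger lam (L2 on the leaf coefficient)
    shrinks the leaf output toward 0.
    """
    return -sum(grads) / (sum(hessians) + lam)
```

With one example of residual −4 in the leaf, λ = 0 gives weight 4.0 while λ = 1 gives 2.0: the regularized tree fits the residual less aggressively.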
  15. 19.

    Ensemble Techniques: Stacking (1/2) Stacking uses different methods’ predictions as

    “meta-features”. To obtain meta-features for training the ensemble model, split the training data into K folds: use K−1 parts for training and the held-out part for making the meta-feature. 1-D meta-feature.
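The K-fold procedure above can be sketched as follows. A minimal sketch: `fit_predict` is a hypothetical callback that trains a base model on one part of the data and predicts on another.

```python
import numpy as np

def out_of_fold_meta_feature(fit_predict, X, y, k=5, seed=0):
    """Build a 1-D meta-feature: every row of the training set is
    predicted by a base model trained only on the other K-1 folds,
    so the meta-feature never sees its own target."""
    n = len(y)
    fold = np.random.RandomState(seed).permutation(n) % k
    meta = np.empty(n)
    for f in range(k):
        train, valid = fold != f, fold == f
        meta[valid] = fit_predict(X[train], y[train], X[valid])
    return meta
```

The resulting column is then used as an input feature for the second-stage (ensemble) model, alongside meta-features from the other base methods.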
  16. 21.

    Netflix Blending (Quiz Blending) [1] Andreas Töscher and Michael Jahrer,

    “The BigChaos Solution to the Netflix Grand Prize”. Assume the task is regression and the prediction is evaluated by RMSE. Given several models’ predictions, what can we do to improve our score? 21
  17. 24.

    Utilize Quiz feedback for blending 1/4 Our goal is to

    find the linear combination of predicted results that best predicts y (the target variable). Let X be the N-by-p matrix of predictions combined from p individual models, and let y be the unobserved vector of true target values. 24
  18. 25.

    Utilize Quiz feedback for blending 2/4 If y is known,

    then the best estimation by linear combination is β = (X^T X)^{-1} X^T y, where X is the N-by-p matrix of predictions combined from p individual models and y is the unobserved vector of true target values. 25
  19. 26.

    Utilize Quiz feedback for blending 3/4 If y

    is known, then the best estimation by linear combination is β = (X^T X)^{-1} X^T y.
    X^T X can be computed exactly from our own predictions.
    The j-th element of X^T y is x_j^T y = (x_j^T x_j + y^T y − ||x_j − y||^2) / 2: here ||x_j − y||^2 can be approximated by using the quiz feedback (N times the MSE of model j), and y^T y by the quiz feedback of an all-zero submission.
    26
  20. 27.

    Utilize Quiz feedback for blending 4/4 Our goal is to

    find the linear combination of predicted results that best predicts y (the target variable). Prediction by linear combination using quiz feedback: ŷ = X · β, where X is the N-by-p matrix of predictions combined from p individual models and β is the (p × 1) vector of weight parameters. 27
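The whole quiz-blending trick can be sketched with NumPy. This is a sketch under the stated assumptions: we hold quiz MSE feedback for each model plus the MSE of an all-zero submission, and never see y itself.

```python
import numpy as np

def blend_from_quiz_feedback(X, mse, mse_zero):
    """Estimate least-squares blending weights without ever seeing y.

    X        : (N, p) matrix of quiz-set predictions from p models
    mse      : length-p array, quiz MSE feedback for each model
    mse_zero : quiz MSE of an all-zero submission, so y.y = N * mse_zero
    """
    N = X.shape[0]
    yty = N * mse_zero
    # From ||x_j - y||^2 = N * mse_j:
    #   x_j . y = (x_j . x_j + y . y - N * mse_j) / 2
    Xty = 0.5 * ((X * X).sum(axis=0) + yty - N * np.asarray(mse))
    # beta = (X^T X)^{-1} X^T y; X^T X is computable exactly from X alone
    return np.linalg.solve(X.T @ X, Xty)
```

Because the quiz MSEs determine each x_j^T y exactly (up to rounding of the reported scores), the recovered β matches the ordinary least-squares fit against the hidden targets.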
  21. 28.

    OT: Amazon AWS for Modeling c4.8xlarge (36 CPU cores with

    60 GB RAM / $0.3 per hour) My bagging GBDT model for the KDD Cup takes 6 hrs (= $1.8). * The above price is for a spot instance in us-west-1c in Oct 2015; the price changes dynamically. ≈ 280 yen (= $2.3) 28
  22. 29.

    Agenda Data Mining Competitions Techniques (Tricks) for competitions Learning from

    Winning Solutions Trend on Kaggle 29 4 - 12 13 - 28 29 - 72 73 - 78 
  23. 30.

    Learn from Winning Solutions Today’s talk describes the following competitions:

    Competition Name | Description
    KDD Cup 2015 | Binary Classification, Access log
    GE Flight Quest 2 | Optimization
    Grasp-and-Lift EEG Detection | Multi-class Classification, BCI, EEG recordings
    30
  24. 31.

    About KDD Cup 2015 The annual and most prestigious competition in

    data mining. 821 teams joined. Task: predict the probability that a student will drop out of a course within 10 days. The dataset is provided by XuetangX, one of the largest MOOC platforms in China. (Fig.: # of access records over dates, labeled drop-out or not.) 31
  25. 33.

    Dataset (1 of 3) Pair of <username, course_id> for each

    enrollment_id. (1) Enrollment data (2) Access logs (3) Object attributes 33
  26. 34.

    Dataset (2 of 3) Application logs. Source, Event and Object

    ID are provided. (1) Enrollment data (2) Access logs (3) Object attributes 34
  27. 35.

    Dataset (3 of 3) Detailed information of Object ID. (1)

    Enrollment data (2) Access logs (3) Object attributes 35
  28. 36.

    Analyze User Activities Users who don’t access the course many

    times tend to drop out of the course. (Histogram: # of access logs for each enrollment_id vs. # of enrollment_ids.) 36
  29. 38.

    Initial Analysis Base Features User activity and last access make

    a big impact on the AUC score.
    Features | Model | 5-Fold CV (AUC)
    One Hot Encoding (course_id) | GBDT | 0.6118
    + num_records (User Activity) | GBDT | 0.8485
    + num_unique_object | GBDT | 0.8507
    + num_unique_active_days | GBDT | 0.8595
    + num_unique_active_hours | GBDT | 0.8601
    + num_unique_problem_event | GBDT | 0.8621
    + first and last timestamp (Last Access) | GBDT | 0.8821
    38
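Counting features like those in the table can be computed with one pass over the access logs. A sketch: the `(enrollment_id, timestamp, object_id)` row layout is an assumption for illustration, not the actual competition schema.

```python
from collections import defaultdict

def base_activity_features(log_rows):
    """log_rows: iterable of (enrollment_id, timestamp, object_id).
    Returns per-enrollment counts in the spirit of the table above."""
    acc = defaultdict(lambda: {"num_records": 0, "objects": set(),
                               "first": None, "last": None})
    for eid, ts, obj in log_rows:
        f = acc[eid]
        f["num_records"] += 1
        f["objects"].add(obj)                 # for num_unique_object
        f["first"] = ts if f["first"] is None else min(f["first"], ts)
        f["last"] = ts if f["last"] is None else max(f["last"], ts)
    return {eid: {"num_records": f["num_records"],
                  "num_unique_object": len(f["objects"]),
                  "first_ts": f["first"], "last_ts": f["last"]}
            for eid, f in acc.items()}
```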
  30. 39.

    Feature Engineering (MC) Multiple Courses Features Concept: some users enrolled

    in multiple courses. (Fig.: # of access records over dates, by course.)
    Features | Model | 5-Fold CV (AUC)
    Base | GBDT | 0.8821
    + (MC) first and last timestamp for each user | GBDT | 0.8936
    + (MC) num_unique_active_days for each user | GBDT | 0.8946
    + (MC) num_enrollment_courses for each user | GBDT | 0.8953
    39
  31. 40.

    Feature Engineering (EP) Evaluation Period Features (a bit leaky) Concept:

    the activities after the end date of the course.
    Features | Model | 5-Fold CV (AUC)
    Base + MC | GBDT | 0.8953
    Base + MC + EP | GBDT | 0.9027
    (Fig.: # of access records over dates, by course.) 40
  32. 41.

    Feature Engineering (PXJ) Features from Teammates (Peng, Xiaocong and Jeong):

    •  Max absent days
    •  Min days from first visit to next course begin
    •  Min days from 10 days after last visit to next course begin
    •  Min days from last visit to next course end
    •  Min days from next course to last visit
    •  Min days from 10 days after course end to next course begin
    •  Min days from 10 days after course end to next course end
    •  Min days from course end to next visit
    •  Active days from last visit to course end
    •  Active days in 10 days from course end
    •  Average hour per day
    •  Course drop rate
    •  Time span
    Features | Model | 5-Fold CV (AUC)
    Base + MC + EP | GBDT | 0.9027
    Base + MC + EP + PXJ | GBDT | 0.9052
    41
  33. 43.

    Feature Engineering (LD) Label Dependent Features (a bit leaky) Count

    the number of dropped-out courses for each day of the evaluation period by using the target variables in the training set.
    Features | Model | 5-Fold CV (AUC)
    Base + MC + EP + PX1 + PX2 | GBDT | 0.9052
    Base + MC + EP + PX1 + PX2 + LD | GBDT | 0.9062
    Base + MC + EP + PX1 + PX2 + LD | Bagging GBDT | 0.9067
    43
  34. 45.

    Feature Engineering (TAM) Sliding window & various aggregations + GFS

    (Tam’s work). Use a sliding window to generate many features automatically.
    Features | Model | 5-Fold CV (AUC)
    Base + MC + EP + PXJ + LD | GBDT | 0.9062
    Base + MC + EP + PXJ + LD + TAM | GBDT | 0.9067
    (Fig.: # of access records over dates, by course; sliding window & various aggregations by objects, events, etc.) 45
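A sliding-window feature generator in this spirit can be sketched as follows. This is a hypothetical simplification: timestamps are integer day numbers and the trailing windows are counted back from the course end date.

```python
def sliding_window_counts(timestamps, end, windows=(1, 3, 7, 14, 30)):
    """Count access records in trailing windows of w days before `end`.

    timestamps: iterable of day numbers for one enrollment
    end:        last day of the course
    Returns one count feature per window size.
    """
    ts = sorted(timestamps)
    return {f"count_last_{w}d": sum(1 for t in ts if end - w < t <= end)
            for w in windows}
```

Crossing such windows with different aggregations (by object, by event type, sums, means, …) yields a large candidate pool, which is then pruned by GFS.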
  35. 46.

    Add Tam’s Model Into the Ensemble Last 8 hours:

    added Tam’s model into the ensemble. Last 4 hours: added Tam’s single best into the ensemble. 46
  36. 47.

    Three-Stage Ensemble 64 single + 15 ensemble + 2 ensemble

    + 1 blending
    Models | 5-Fold CV (AUC)
    Single Best | 0.9067
    Final model (Three-Stage Ensemble) | 0.9082
    47
  37. 48.

    To Avoid Over-fitting Comparing the LB and local CV is important

    to avoid over-fitting. Warning: over-fitting! 48
  38. 49.

    Team Framework/Guideline (1) We shared the index file of the 5-fold

    CV at first. (2) Using it, we uploaded the CV predictions and the predicted results for the test data to Dropbox. (3) We updated the wiki to record the CV score and LB score. Then we could all contribute to the ensemble/blending part. (If we hadn’t used the same 5-fold CV index, our ensemble model would have over-fit.) 49
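Sharing one fold index is easy to make deterministic. A sketch (not the team's actual script; the seed value is an arbitrary placeholder): shuffle the ids with a fixed seed and deal them round-robin, so every teammate regenerates the identical assignment.

```python
import random

def make_shared_fold_index(ids, k=5, seed=20150616):
    """Deterministic fold assignment that all teammates can reproduce:
    sort the ids, shuffle with a fixed seed, deal round-robin into k
    folds. Returns {id: fold_number}."""
    ids = sorted(ids)
    random.Random(seed).shuffle(ids)
    return {i: n % k for n, i in enumerate(ids)}
```

Writing this mapping to a file once and distributing it (as the team did via Dropbox) avoids any dependence on library versions or platform RNGs.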
  39. 50.

    Summary Feature engineering is one of the key points for

    winning. (Don’t give up a chance to improve your feature set.) People can work together internationally. (A well-designed guideline is important for working as a team.) 50
  40. 51.

    Grasp-and-Lift EEG Detection Task: Identify hand motions (multi-class) from time-series

    EEG records. 51 (Pic. is from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/data)
  41. 52.

    Dataset: EEG records 32-channel EEG data; 6 events to

    detect (HandStart, FirstDigitTouch, LiftOff, …) 52 (Fig. is from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/data and https://www.kaggle.com/acshock/grasp-and-lift-eeg-detection/how-noisy-are-these-eegs)
  42. 53.

    Winners’ Approaches 1st place: Alexandre Barachant & Rafał Cycoń (experts

    in EEG & signal processing) •  Feature Extraction: filter bank, neural oscillation, ERP •  Single Models: LR, LDA, RNN, CNN. 2nd place: Ming Liang (expert in image processing) •  Feature Extraction: nothing •  Single Models: CNN, recurrent CNN •  Model Selection: Greedy Forward Selection. It seems the single best model in this contest was the recurrent CNN. CNNs can perform as well as the traditional paradigm. 53
  43. 54.

    Classifying EEG Signals with a Convolutional Neural Network An input

    sample is treated as a height-1 image. The input sample at time t is composed of the n-dimensional data at times t − n + 1, t − n + 2, ..., t. 54
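Reshaping the raw series into such samples might look like this (a sketch assuming a `(T, C)` float array of C-channel EEG; the NCHW axis order is an assumption for illustration):

```python
import numpy as np

def to_height1_images(signal, n):
    """signal: (T, C) array of C-channel EEG samples.
    Returns (T - n + 1, 1, n, C): one 1 x n x C "image" per time t,
    built from the rows at times t - n + 1, ..., t."""
    T, C = signal.shape
    windows = np.stack([signal[t - n + 1: t + 1]
                        for t in range(n - 1, T)])
    return windows[:, None, :, :]   # insert the height-1 axis
```

Each window is then classified by the CNN as an ordinary (tiny) image.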
  44. 55.

    Recurrent CNN A state-of-the-art algorithm for the image

    classification task. 55 [4] Ming Liang and Xiaolin Hu, “Recurrent Convolutional Neural Network for Object Recognition”, CVPR’15. The RCL (Recurrent Convolution Layer) is a natural integration of RNN and CNN. The feed-forward (blue line) and recurrent (red line) computations both take the form of convolution. (Fig. is from http://blog.kaggle.com/2015/09/29/grasp-and-lift-eeg-detection-winners-interview-2nd-place-daheimao/)
  45. 56.

    Summary Convolutional neural networks work great on time-series signal records

    (EEG). Don’t fear the experts! •  A non-expert ML researcher might beat an expert researcher. •  Google Scholar is your friend. 56
  46. 57.

    GE FQ2: Flight Route Optimization Objective: produce a flight plan

    for each flight that reduces the average cost of the planes as much as possible. 57 (Pic. is from http://www.gequest.com/)
  47. 58.

    Format of a Flight Plan A list of 4D points (latitude, longitude, altitude

    and speed) for each flight plan, starting from the cut-off time 2013-10-02 12:00:00. 58 (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
  48. 59.

    Evaluation Metric (1 of 2) Objective: produce a flight plan

    for each flight that reduces the average cost of the planes as much as possible. C_total = C_fuel + C_delay + C_oscillation + C_turbulence. Oscillation: a penalty for changing altitude. Turbulence: a linear function of the elapsed time in turbulent zones. 59
  49. 60.

    Evaluation Metric (2 of 2) C_total = C_fuel

    + C_delay + C_oscillation + C_turbulence. Evaluated by a flight simulator. A flight can take 3 kinds of steps: “ascending”, “descending” and “cruising”. Fuel consumption depends on the flight instruction. *Airspeed (IAS): the speed of an aircraft relative to the air. *Ground speed (GS): the speed relative to the ground. 60
  50. 61.

    Dataset (1 of 3) Flight Information List of test flights

    to optimize. •  Arrival Airport •  Current Location •  Parameters of Cost Model 61
  51. 62.

    Dataset (2 of 3) Airport Locations. 62
  52. 63.

    Dataset (3 of 3) Restricted Zones: airspace which is reserved

    for special use (restricted from civilian aircraft). Turbulent Zones: airspace where flights experience turbulence (flights accrue a USD cost for the time spent within these zones). Weather (wind data): vectors in a 4-axis representation (time, altitude, easting, northing). 63 (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
  53. 64.

    Analyze the Cost Model C_total = C_fuel + C_delay

    + C_oscillation + C_turbulence. The burned fuel is a function of the airspeed, but the ground speed is the sum of the velocity relative to the air and the wind vector. → Taking advantage of the wind can significantly reduce the fuel cost and the delay cost. 64
  54. 65.

    Example of a Wind-Optimal Path The blue line is the wind-optimal path

    (it reduces the total cost by 15% compared with the red line). 65 (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
  55. 66.

    5th Solution (1 of 5) [3] Christian Kiss-Toth, Gabor Takacs,

    “A Dynamic Programming Approach for 4D Flight Route Optimization”. Procedure: (1) Create an initial route. (2) 2D optimization process (latitude and longitude). (3) Set the altitude and the airspeed of the flight. * The winning solution was not published for this competition. The 4D parameters are optimized separately. 66
  56. 67.

    5th Solution (2 of 5) Solve a shortest-path problem

    (Dijkstra’s algorithm). Vertices: the current position, the destination airport and the vertices of the restricted zones. Procedure: (1) Create an initial route. (2) 2D optimization process (latitude and longitude). (3) Set the altitude and the airspeed of the flight. 67 (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
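The Dijkstra step itself is standard. A generic sketch on a hypothetical adjacency list (the actual solution builds edges among the current position, the destination airport and the restricted-zone vertices, with geographic distances as costs):

```python
import heapq

def dijkstra(edges, src, dst):
    """edges: {node: [(neighbor, cost), ...]}.
    Returns the minimum cost from src to dst (inf if unreachable)."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue                 # stale queue entry
        for v, w in edges.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")
```

Edges that would cross a restricted zone are simply omitted from the graph, so the shortest path routes around the zone corners.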
  57. 68.

    5th Solution (3 of 5) How to find the wind-optimal

    path? Procedure: (1) Create an initial route. (2) 2D optimization process (latitude and longitude). (3) Set the altitude and the airspeed of the flight. 68 (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
  58. 69.

    5th Solution (4 of 5) Create a grid in the

    airspace and divide the initial path into N parts. → Optimize with dynamic programming; perform it recursively. (2) 2D optimization process (latitude and longitude). 69 (Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
  59. 70.

    5th Solution (5 of 5) Procedure: (1) Create an initial

    route. (2) 2D optimization process (latitude and longitude). (3) Set the altitude and the airspeed of the flight. Optimize two variables: the descending distance and the cruise speed. For this 1D optimization, the solution used an exhaustive search. 70
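The exhaustive search over the two remaining variables can be sketched as follows (a sketch; `cost` stands in for a call to the flight simulator, and the grids are hypothetical discretizations):

```python
def exhaustive_search(cost, descend_grid, speed_grid):
    """Try every (descending distance, cruise speed) pair and keep
    the cheapest.  `cost(d, s)` is a callback returning the simulated
    total cost of the flight plan with those settings."""
    return min((cost(d, s), d, s)
               for d in descend_grid for s in speed_grid)
```

For coarse grids this brute force is cheap, and it cannot get stuck in a local minimum the way a gradient step could.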
  60. 72.

    Summary A deep understanding of the objective and the evaluation metric is

    important for solving the problem (i.e. taking advantage of the wind is the key point). Basic knowledge of computer science (the DP algorithm) and engineering effort are also helpful in this kind of competition. 72
  61. 73.

    Agenda Data Mining Competitions Techniques (Tricks) for competitions Learning from

    Winning Solutions Trend on Kaggle 73 4 - 12 13 - 28 29 - 72 73 - 78 
  62. 74.

    Improved Kaggle Rankings (1/3) The old ranking system 74 Kaggle

    users receive points for their performance in competitions. In May 2015, Kaggle rolled out an updated version of the ranking system. Both the old and the new formula include: a penalty for being part of a team, the popularity of the contest, and a decay term.
  63. 75.

    Improved Kaggle Rankings (2/3) The new formula imposes a smaller

    penalty on being part of a team. 75 (Fig.: the new vs. old penalty term for being part of a team.)
  64. 76.

    Improved Kaggle Rankings (3/3) The new point system counts your achievements

    in past contests. 76 (Fig.: the new vs. old decay term.)
  65. 77.

    Forming a Team Seems Active In the CAT competition, ranks

    #1–#7 are teams, with no solo players. Teaming up is active when ensemble models work well. 77
  66. 78.

    Take-away Messages Join Kaggle competitions for fun and learn techniques

    from expert data scientists around the world! RGF and XGBoost have L2 regularization, and it might work well for noisy datasets (and ensemble models). Ensemble/blending techniques are tricky; some techniques are impractical in real-world settings. “I don’t think ‘I hardly ever use Deep Learning.’ I do use it.” 78
  67. 79.

    References [1] Andreas Töscher and Michael Jahrer,

    “The BigChaos Solution to the Netflix Grand Prize”. [2] Rie Johnson and Tong Zhang, “Learning Nonlinear Functions Using Regularized Greedy Forest”, TPAMI’14. [3] Christian Kiss-Toth and Gábor Takács, “A Dynamic Programming Approach for 4D Flight Route Optimization”, Big Data’14. [4] Ming Liang and Xiaolin Hu, “Recurrent Convolutional Neural Network for Object Recognition”, CVPR’15. 79