A competition normally runs for 2 or 3 months. There are many kinds of tasks/datasets from the real world: Insurance, Credit Scoring, Loan Default, Medical, EEG, MEG, Image Classification, Healthcare, High-Energy Physics, Social Good, Marketing, Advertising, Trajectory, Telematics, etc.
Netflix Prize ($1M): Predict user ratings for movies
Heritage Health Prize, 2011 ($3M): Predict days in hospital
GE Flight Quest Challenge: Part 1, 2012 ($250k): Predict gate/arrival time; Part 2, 2014 ($220k): Optimize flight plan
↑ Predictive Modeling World / Yet Another World ↓
DARPA Grand Challenge ($2M): Autonomous vehicle
Google Lunar XPRIZE ($30M): Autonomous robotic spacecraft
competition (crowdsourcing community of 360k+ data scientists). In addition to prize money, many data scientists use Kaggle to learn and collaborate with experts.
also use Kaggle (XGBoost, LibFM, LibFFM, Lasagne, Keras, cxxnet, etc.). "Our original motivation for entering the contest was to try out our new tree ensemble regularized greedy forest (RGF) in a competitive setting." — Rie Johnson (RJ Research Consulting), a prize winner in the Heritage Health Prize (quote from http://www.heritagecaliforniaaco.com/?p=hpn-today&article=45)
also use Kaggle (XGBoost, LibFM, LibFFM, Lasagne, Keras, cxxnet, etc.). "My intention of participating in this competition is to evaluate the performance of recurrent convolutional neural network (RCNN) in processing time series data." — Ming Liang (Tsinghua University), a prize winner in Grasp-and-Lift EEG Detection (quote from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/forums/t/16617/team-daheimao-solution)
the ensemble idea. No sophisticated individual models, no victory. Both the individual models and the ensemble idea are key. (Fig.: individual models vs. the ensemble model)
and works well for feature selection and model selection on ensembles. Greedy forward selection:
1: Initialize the feature set F_0 = ∅ at k = 0.
2: Iterate: k = k + 1.
3: Find the best feature j ∉ F_{k-1} to add, i.e. the one with the most significant cost reduction.
4: F_k = F_{k-1} ∪ {j}.
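The greedy forward selection above can be sketched in a few lines of Python. Here it is applied to model selection for an ensemble; the candidate predictions, the averaging combiner, and the MSE cost are illustrative assumptions, not the talk's actual models:

```python
def greedy_forward_selection(candidates, cost):
    """Repeatedly add the candidate that most reduces cost(selected);
    stop when no remaining candidate lowers the cost."""
    selected = []                       # F_0 = empty set
    best_cost = float("inf")
    while True:
        best_j = None
        for j in candidates:
            if j in selected:
                continue
            c = cost(selected + [j])
            if c < best_cost:
                best_cost, best_j = c, j
        if best_j is None:              # no candidate reduces the cost
            return selected, best_cost
        selected.append(best_j)         # F_k = F_{k-1} ∪ {j}

# Toy example: pick the models whose averaged prediction best matches y.
target = [1.0, 0.0, 1.0, 1.0]
preds = {
    "A": [0.9, 0.2, 0.8, 0.7],
    "B": [0.1, 0.9, 0.3, 0.2],
    "C": [0.8, 0.1, 0.9, 0.9],
}

def mse_of_average(names):
    n = len(target)
    avg = [sum(preds[m][i] for m in names) / len(names) for i in range(n)]
    return sum((a - t) ** 2 for a, t in zip(avg, target)) / n

chosen, c = greedy_forward_selection(list(preds), mse_of_average)
```

With this toy data the search adds model "C" and then stops, because averaging in "A" or "B" would raise the MSE.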
"meta-features". To obtain meta-features for training the ensemble model, split the training data into K parts; use K-1 parts for training and the remaining part for making the meta-feature. (Fig.: 1D meta-feature)
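The out-of-fold scheme above can be sketched as follows. The `fit`/`predict` interface and the mean-predicting toy model are assumptions for illustration; in practice each base model would be a GBDT, NN, etc.:

```python
def make_meta_feature(train_x, train_y, k, fit, predict):
    """Out-of-fold predictions: for each of K parts, train on the other
    K-1 parts and predict the held-out part. The result is a meta-feature
    aligned with the training set, usable as an input to the ensemble."""
    n = len(train_x)
    meta = [None] * n
    folds = [list(range(i, n, k)) for i in range(k)]  # simple interleaved split
    for held_out in folds:
        tr = [i for i in range(n) if i not in held_out]
        model = fit([train_x[i] for i in tr], [train_y[i] for i in tr])
        for i in held_out:
            meta[i] = predict(model, train_x[i])
    return meta

# Toy base model: always predict the mean of its training targets.
fit = lambda xs, ys: sum(ys) / len(ys)
predict = lambda model, x: model

meta = make_meta_feature([0, 1, 2, 3], [1.0, 2.0, 3.0, 4.0], 2, fit, predict)
```

Note that every meta-feature value comes from a model that never saw that row's target, which is what keeps the ensemble model honest.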
"The BigChaos Solution to the Netflix Grand Prize". Assume that the task is regression and the prediction is evaluated by RMSE. What can we do to improve our score? (Fig.: predictions from several individual models)
find the linear combination of the predicted results that best predicts y (the target variable). Let X be the N-by-p matrix of predictions combined from the p individual models, and let y be the unobserved vector of true target values.
If y were known, the best estimation by linear combination would be:
β = (XᵀX)⁻¹ Xᵀy
XᵀX can be computed exactly from the predictions alone. The j-th element of Xᵀy is
xⱼᵀy = (xⱼᵀxⱼ + yᵀy − ‖xⱼ − y‖²) / 2
which can be approximated using quiz feedback: yᵀy from the all-zero submission (N times its MSE) and ‖xⱼ − y‖² from model j's submission (N times its MSE).
Prediction by linear combination using quiz feedback: ŷ = Xβ, where β is the (p × 1) vector of weight parameters.
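Under the slides' assumptions (regression, RMSE metric, quiz feedback returning the score of each submission), the trick can be sketched in pure Python for p = 2 models using the closed-form 2x2 inverse; the function name and toy data are illustrative:

```python
def blend_weights(X, mse_models, mse_zero):
    """Least-squares blend beta = (X^T X)^{-1} X^T y without observing y.
    X: N x p prediction matrix (p = 2 in this sketch);
    mse_models[j]: quiz MSE of model j's submission;
    mse_zero: quiz MSE of the all-zero submission (= y.y / N)."""
    n = len(X)
    col = lambda j: [row[j] for row in X]
    dot = lambda u, v: sum(p * q for p, q in zip(u, v))
    # X^T X: computable exactly from the predictions alone.
    a, b = dot(col(0), col(0)), dot(col(0), col(1))
    d = dot(col(1), col(1))
    # j-th element of X^T y = (x_j.x_j + y.y - |x_j - y|^2) / 2,
    # with y.y ~ N*mse_zero and |x_j - y|^2 ~ N*mse_models[j].
    xty = [(dot(col(j), col(j)) + n * mse_zero - n * mse_models[j]) / 2
           for j in range(2)]
    det = a * d - b * b                 # 2x2 inverse of X^T X
    return [(d * xty[0] - b * xty[1]) / det,
            (a * xty[1] - b * xty[0]) / det]

# Sanity check: with the true y in hand, the quiz-feedback weights must
# match the exact least-squares solution.
y = [1.0, 2.0]
X = [[1.0, 0.0], [0.0, 1.0]]
mse = lambda pred: sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)
beta = blend_weights(X, [mse([r[0] for r in X]), mse([r[1] for r in X])],
                     mse([0.0, 0.0]))
```

Here X is the identity, so the recovered weights are exactly y itself, confirming the identity used for Xᵀy.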
64 GB RAM / $0.3 per hour). My bagging GBDT model for the KDD Cup takes 6 hrs (= $1.8). ≈280 yen (= $2.3). * The above price is for a spot instance in us-west-1c as of Oct 2015; spot prices change dynamically.
data mining. 821 teams joined. Task: predict the probability that a student will drop out of a course within the next 10 days. The dataset is provided by XuetangX, one of the largest MOOC platforms in China. (Fig.: number of access records by date; label: dropped out of the course or not)
a big impact on the AUC score.

Features | Model | 5-Fold CV (AUC)
One Hot Encoding (course_id) | GBDT | 0.6118
+ num_records (User Activity) | GBDT | 0.8485
+ num_unique_object | GBDT | 0.8507
+ num_unique_active_days | GBDT | 0.8595
+ num_unique_active_hours | GBDT | 0.8601
+ num_unique_problem_event | GBDT | 0.8621
+ first and last timestamp (Last Access) | GBDT | 0.8821
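Computing user-activity features like the ones in the table is a simple aggregation over the raw access log. A minimal sketch in pure Python; the `(enrollment_id, timestamp, object)` row layout is an assumption about the KDD Cup 2015 log format, not its exact schema:

```python
from collections import defaultdict
from datetime import datetime

def activity_features(log_rows):
    """Aggregate raw access-log rows into per-enrollment features:
    num_records, num_unique_object, num_unique_active_days, and the
    first/last access timestamps."""
    acc = defaultdict(lambda: {"num_records": 0, "objects": set(),
                               "days": set(), "first": None, "last": None})
    for eid, ts, obj in log_rows:
        f = acc[eid]
        f["num_records"] += 1
        f["objects"].add(obj)
        f["days"].add(ts.date())
        f["first"] = ts if f["first"] is None or ts < f["first"] else f["first"]
        f["last"] = ts if f["last"] is None or ts > f["last"] else f["last"]
    return {eid: {"num_records": f["num_records"],
                  "num_unique_object": len(f["objects"]),
                  "num_unique_active_days": len(f["days"]),
                  "first": f["first"], "last": f["last"]}
            for eid, f in acc.items()}

logs = [(1, datetime(2014, 5, 1, 9), "video_a"),
        (1, datetime(2014, 5, 1, 10), "video_b"),
        (1, datetime(2014, 5, 3, 9), "video_a")]
feat = activity_features(logs)[1]
```

The resulting dictionary per enrollment is what would be fed (together with the one-hot course_id) to the GBDT.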
multiple courses. (Fig.: number of access records by date, per course)

Features | Model | 5-Fold CV (AUC)
Base | GBDT | 0.8821
+ (MC) first and last timestamp for each user | GBDT | 0.8936
+ (MC) num_unique_active_days for each user | GBDT | 0.8946
+ (MC) num_enrollment_courses for each user | GBDT | 0.8953
The activities after the end date of courses. (Fig.: number of access records by date, per course)

Features | Model | 5-Fold CV (AUC)
Base + MC | GBDT | 0.8953
Base + MC + EP | GBDT | 0.9027
• Max absent days
• Min days from first visit to next course begin
• Min days from 10 days after last visit to next course begin
• Min days from last visit to next course end
• Min days from next course to last visit
• Min days from 10 days after course end to next course begin
• Min days from 10 days after course end to next course end
• Min days from course end to next visit
• Active days from last visit to course end
• Active days in 10 days from course end
• Average hours per day
• Course drop rate
• Time span

Features | Model | 5-Fold CV (AUC)
Base + MC + EP | GBDT | 0.9027
Base + MC + EP + PXJ | GBDT | 0.9052
number of dropped-out courses for each day in the evaluation period, using the target variables in the training set.

Features | Model | 5-Fold CV (AUC)
Base + MC + EP + PX1 + PX2 | GBDT | 0.9052
Base + MC + EP + PX1 + PX2 + LD | GBDT | 0.9062
Base + MC + EP + PX1 + PX2 + LD | Bagging GBDT | 0.9067
(Tam's work) Use a sliding window to generate many features automatically. (Fig.: number of access records by date, per course; sliding window and various aggregations by objects, events, etc.)

Features | Model | 5-Fold CV (AUC)
Base + MC + EP + PXJ + LD | GBDT | 0.9062
Base + MC + EP + PXJ + LD + TAM | GBDT | 0.9067
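Automatic sliding-window feature generation can be sketched as follows. The weekly window/step sizes and the particular aggregations (sum, max, active days) are illustrative assumptions, not the exact ones used in the competition:

```python
def sliding_window_features(daily_counts, window=7, step=7):
    """Slide a fixed-size window over a daily activity-count series and
    emit one aggregate feature per (window position, aggregation)."""
    wf = {}
    n = len(daily_counts)
    for start in range(0, n - window + 1, step):
        w = daily_counts[start:start + window]
        wf[f"w{start}_sum"] = sum(w)
        wf[f"w{start}_max"] = max(w)
        wf[f"w{start}_active_days"] = sum(1 for v in w if v > 0)
    return wf

counts = [3, 0, 0, 5, 1, 0, 0,   # week 1
          0, 0, 0, 0, 0, 0, 2]   # week 2
wf = sliding_window_features(counts)
```

Running the same generator per object type or event type multiplies the feature count cheaply, which is the point of the automatic approach.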
CV at first. (2) Using it, we uploaded the CV predictions and the predicted results for the test data to Dropbox. (3) We updated the wiki to describe the CV score and LB score. Then we could all contribute to the ensemble/blending part. (If we hadn't used the same index of the 5-fold CV, our ensemble model would have overfit.)
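Fixing the fold index for the whole team just means making the fold assignment deterministic, so every member's out-of-fold predictions line up row by row. A minimal sketch; the seed value is an arbitrary example:

```python
import random

def make_shared_folds(ids, k=5, seed=20150601):
    """Deterministically assign each training id to one of k folds.
    Every team member runs this with the same seed, so all out-of-fold
    predictions are aligned and the ensemble does not overfit."""
    rng = random.Random(seed)
    shuffled = sorted(ids)       # canonical order first, then seeded shuffle
    rng.shuffle(shuffled)
    return {i: idx % k for idx, i in enumerate(shuffled)}

folds_a = make_shared_folds(range(100))
folds_b = make_shared_folds(range(100))   # a teammate's independent run
```

Sharing the resulting id-to-fold mapping as a file (as the team did via Dropbox) works just as well and removes any dependence on library versions.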
winning. (Don't give up a chance to improve your feature set.) People can work together internationally. (A well-designed guideline is important for working as a team.)
detect (HandStart, FirstDigitTouch, LiftOff, …). (Fig. is from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/data and https://www.kaggle.com/acshock/grasp-and-lift-eeg-detection/how-noisy-are-these-eegs)
in EEG & signal processing) • Feature Extraction: Filter bank, Neural Oscillation, ERP • Single Models: LR, LDA, RNN, CNN
2nd place: Ming Liang (expert in image processing) • Feature Extraction: Nothing • Single Models: CNN, Recurrent CNN • Model Selection: Greedy Forward Selection
It seems the single best model in this contest is the Recurrent CNN. CNNs can perform as well as the traditional paradigm.
sample is treated as a height-1 image. The input sample at time t is composed of the n-dimensional data at times t - n + 1, t - n + 2, ..., t. (Fig.: time t, n-dimensional data)
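Building those height-1 "images" is just a windowing operation over the multichannel series. A minimal sketch; the zero-padding before the start of the recording is an assumption for illustration:

```python
def to_window(series, t, n):
    """Build the input at time t: the samples at times t-n+1, ..., t,
    each a list of channel values, zero-padded before the start so the
    CNN always sees a fixed 1 x n x channels 'image'."""
    window = []
    for i in range(t - n + 1, t + 1):
        window.append(series[i] if i >= 0 else [0.0] * len(series[0]))
    return window

# Toy 3-channel EEG recording with 5 time steps.
eeg = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9],
       [1.0, 1.1, 1.2], [1.3, 1.4, 1.5]]
w = to_window(eeg, t=4, n=3)
```

Only past samples enter the window, which keeps the model causal: the prediction at time t never peeks at future EEG data.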
classification task. [4] Ming Liang and Xiaolin Hu, "Recurrent Convolutional Neural Network for Object Recognition", CVPR'15. The RCL (Recurrent Convolution Layer) is a natural integration of RNN and CNN: the feed-forward (blue line) and recurrent (red line) computations both take the form of convolution. (Fig. is from http://blog.kaggle.com/2015/09/29/grasp-and-lift-eeg-detection-winners-interview-2nd-place-daheimao/)
and Speed) points for each flight plan. (Fig.: 4D route points — 1: Latitude, 2: Longitude, 3: Altitude, 4: Speed — with a cut-off time of 2013-10-02 12:00:00.) (Fig. is from [3] Christian Kiss-Toth and Gábor Takács, "A Dynamic Programming Approach for 4D Flight Route Optimization")
for each flight to make the average cost of the planes as low as possible.
C_total = C_fuel + C_delay + C_oscillation + C_turbulence
Oscillation: penalty for changing altitude. Turbulence: a linear function of the elapsed time in turbulent zones.
+ C_delay + C_oscillation + C_turbulence — evaluated by a flight simulator. A flight can take three kinds of steps: "ascending", "descending" and "cruising". Fuel consumption depends on the flight instruction. *Airspeed (IAS): the speed of an aircraft relative to the air. *Ground speed (GS): the speed relative to the ground.
for special use (restricted from civilian aircraft). Turbulent Zones: airspace where flights experience turbulence (they accrue a USD cost for the time spent within these zones). Weather (wind data): vectors on a 4-axis representation (time, altitude, easting, northing). (Fig. is from [3] Christian Kiss-Toth and Gábor Takács, "A Dynamic Programming Approach for 4D Flight Route Optimization")
delay + C_oscillation + C_turbulence — The burned fuel is a function of the airspeed, but the ground speed is the sum of the velocity relative to the air and the wind vector. → Taking advantage of the wind can significantly reduce the fuel cost and the delay cost. *Airspeed (IAS): the speed of an aircraft relative to the air. *Ground speed (GS): the speed relative to the ground.
(reduces the total cost by 15% compared with the red line). (Fig. is from [3] Christian Kiss-Toth and Gábor Takács, "A Dynamic Programming Approach for 4D Flight Route Optimization")
"A Dynamic Programming Approach for 4D Flight Route Optimization". Procedure: (1) Create an initial route. (2) 2D optimization process (latitude and longitude). (3) Set the altitudes and the airspeed of the flight. Optimize the 4D parameters separately. * The winning solution was not published for this competition.
(Dijkstra's algorithm) Vertices: the current position, the destination airport, and the vertices of the restricted zones. Procedure: (1) Create an initial route. (2) 2D optimization process (latitude and longitude). (3) Set the altitudes and the airspeed of the flight. (Fig. is from [3] Christian Kiss-Toth and Gábor Takács, "A Dynamic Programming Approach for 4D Flight Route Optimization")
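Step (1) can be sketched as a plain Dijkstra search. The toy graph below, with restricted-zone corner vertices Z1/Z2, is an illustrative assumption; in the paper the edge weights would be great-circle distances between visible vertices:

```python
import heapq

def dijkstra(graph, src, dst):
    """Shortest path by Dijkstra's algorithm on a weighted digraph given
    as {vertex: [(neighbor, weight), ...]}."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue                     # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, u = [dst], dst                 # reconstruct the route
    while u != src:
        u = prev[u]
        path.append(u)
    return path[::-1], dist[dst]

# Current position -> destination airport, detouring around a restricted
# zone via its corner vertices Z1 and Z2 (hypothetical weights).
graph = {"start": [("Z1", 4.0), ("Z2", 2.0)],
         "Z1": [("dest", 1.0)],
         "Z2": [("dest", 5.0)]}
path, cost = dijkstra(graph, "start", "dest")
```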
path? Procedure: (1) Create an initial route. (2) 2D optimization process (latitude and longitude). (3) Set the altitudes and the airspeed of the flight. (Fig. is from [3] Christian Kiss-Toth and Gábor Takács, "A Dynamic Programming Approach for 4D Flight Route Optimization")
airspace, and divides the initial path into N parts → optimized by dynamic programming. Perform it recursively. (2) 2D optimization process (latitude and longitude). (Fig. is from [3] Christian Kiss-Toth and Gábor Takács, "A Dynamic Programming Approach for 4D Flight Route Optimization")
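One pass of that 2D refinement can be sketched as a dynamic program over lateral offsets at the division points. The `seg_cost(i, a, b)` interface and the toy cost below are illustrative assumptions; in the real problem the segment cost would come from the simulator (fuel, wind, turbulence):

```python
def refine_path(n_points, offsets, seg_cost):
    """Divide the initial path into segments, let each division point
    move to one of a few lateral offsets, and pick the cheapest
    combination by dynamic programming."""
    # best[o] = (cheapest cost to reach this point at offset o, previous offset)
    best = {o: (0.0 if o == 0 else float("inf"), None) for o in offsets}
    history = []
    for i in range(n_points - 1):
        nxt = {b: min((best[a][0] + seg_cost(i, a, b), a) for a in offsets)
               for b in offsets}
        history.append(nxt)
        best = nxt
    # Trace back, forcing the path to end on the original route (offset 0).
    path, o = [0], 0
    for nxt in reversed(history):
        o = nxt[o][1]
        path.append(o)
    return path[::-1]

# Toy segment cost: lateral moves cost 0.5 per offset step; flying
# segment 1 at offset +1 is free (say, a tailwind there), anything else
# costs 2 per segment.
def seg_cost(i, a, b):
    return 0.5 * abs(a - b) + (0.0 if (i == 1 and b == 1) else 2.0)

best_path = refine_path(4, [-1, 0, 1], seg_cost)  # 4 points -> 3 segments
```

The DP detours to offset +1 for the middle segment because the tailwind saving outweighs the two lateral moves; applying the pass recursively on the refined path gives finer and finer routes.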
routes. (2) 2D optimization process (latitude and longitude). (3) Set the altitudes and the airspeed of the flight: optimize two variables, the descending distance and the cruise speed. For this 1D optimization, the solution used an exhaustive search.
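Since only two variables remain, the exhaustive search in step (3) is a plain grid scan. The candidate grids and the stub cost function are illustrative assumptions; in the real solution the cost would come from running the flight simulator:

```python
def exhaustive_search(dists, speeds, total_cost):
    """Step (3) as an exhaustive search over the two remaining variables:
    descending distance and cruise speed."""
    best = min((total_cost(d, s), d, s) for d in dists for s in speeds)
    return best[1], best[2]

# Stub cost with a single optimum at d = 100, s = 230 (hypothetical units).
stub = lambda d, s: (d - 100) ** 2 + (s - 230) ** 2
d, s = exhaustive_search(range(80, 121, 10), range(200, 261, 10), stub)
```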
important for solving the problem (i.e., taking advantage of the wind is the key point). Basic knowledge of computer science (the DP algorithm) and engineering effort are also helpful in this kind of competition.
users receive points for their performance in competitions. In May 2015, Kaggle rolled out an updated version of the ranking system. The new ranking system accounts for a penalty for being part of a team, the popularity of the contest, and decay over time.
from expert data scientists around the world! RGF and XGBoost have L2 regularization, which might work well for noisy datasets (and ensemble models). Ensemble/blending techniques are tricky; some techniques are impractical in real-world settings.
[1] Andreas Töscher, Michael Jahrer, and Robert M. Bell, "The BigChaos Solution to the Netflix Grand Prize", 2009. [2] Rie Johnson and Tong Zhang, "Learning Nonlinear Functions Using Regularized Greedy Forest", TPAMI'14. [3] Christian Kiss-Toth and Gábor Takács, "A Dynamic Programming Approach for 4D Flight Route Optimization", Big Data'14. [4] Ming Liang and Xiaolin Hu, "Recurrent Convolutional Neural Network for Object Recognition", CVPR'15.