

Hands-on solving classification and regression competition on Kaggle: validation, feature engineering, ensembles. Examples from current competitions: “Home Credit Default Risk”, “Santander Value Prediction Challenge”.
https://www.meetup.com/Kaggle-Munich/events/250963570/

Alex Tselikov

July 04, 2018

Transcript

  1. CROSS VALIDATION: BASIC
     This is one of the most important parts of any data mining task. It is essential to create a validation environment that has the same, or very similar, characteristics as the production environment. General goal: build a validation scheme that is robust and tracks the public leaderboard closely (this is not always possible).
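A minimal sketch of such a baseline validation loop in scikit-learn; the data and model here are placeholders, not from the talk:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Placeholder data: substitute your competition's train features and target
rng = np.random.RandomState(0)
X, y = rng.rand(1000, 20), rng.rand(1000)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_rmse = []
for train_idx, valid_idx in kf.split(X):
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[valid_idx])
    fold_rmse.append(mean_squared_error(y[valid_idx], pred) ** 0.5)

print("RMSE per fold:", np.round(fold_rmse, 4))
print("mean=%.4f std=%.4f" % (np.mean(fold_rmse), np.std(fold_rmse)))
```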
  2. CROSS VALIDATION: WHEN IT DOESN'T WORK
     1. CV error is not reflected in the leaderboard score AT ALL.
     Example: Two Sigma Financial Modeling Challenge https://www.kaggle.com/c/two-sigma-financial-modeling/leaderboard
     Current competition example: Santander Value Prediction Challenge (4k train set, 40k test set).
  3. CROSS VALIDATION: WHEN IT DOESN'T WORK
     Outliers. Mercedes-Benz Greener Manufacturing. Mean R2 by fold: [0.37, 0.43, 0.58, 0.54, 0.55].
     Check the standard deviation across folds: if it is too large, there is no point in comparing mean errors across folds.

                  Fold1   Fold2   Fold3   Fold4   Fold5   MeanByFolds
     Splitting1   1.484   1.557   1.517   1.503   1.487   1.50993
     Splitting2   1.473   1.563   1.543   1.508   1.465   1.510306
     Splitting3   1.492   1.500   1.525   1.502   1.521   1.508102
     Splitting4   1.504   1.452   1.497   1.530   1.542   1.505258
     Splitting5   1.487   1.517   1.539   1.494   1.519   1.511306
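One way to produce a table like the one above is to repeat cross-validation under several different split seeds and compare the spread of fold scores; a sketch with placeholder data and model:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge

# Placeholder data; plug in your own features and target
rng = np.random.RandomState(0)
X, y = rng.rand(500, 10), rng.rand(500)

for seed in range(5):  # five different splittings, as in the table above
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    rmse = -cross_val_score(Ridge(), X, y, cv=kf,
                            scoring="neg_root_mean_squared_error")
    print("splitting %d: folds=%s mean=%.4f std=%.4f"
          % (seed, np.round(rmse, 3), rmse.mean(), rmse.std()))
```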
  4. CROSS VALIDATION: ADVERSARIAL VALIDATION
     • The general idea is to check the degree of similarity between the train and test sets in terms of feature distributions.
     • This intuition can be quantified by combining the train and test sets, assigning 0/1 labels (0 = train, 1 = test), and evaluating a binary classification task (see the sketch below).
     ROC AUC ~ 0.87 for the Santander Value Prediction Challenge.
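A sketch of adversarial validation as described above, assuming both frames share the same feature columns with the target already dropped (the function name is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

def adversarial_auc(train_df, test_df):
    """Mean CV ROC AUC of a classifier separating train rows from test rows."""
    X = pd.concat([train_df, test_df], axis=0, ignore_index=True)
    y = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]  # 0 = train, 1 = test
    clf = GradientBoostingClassifier(random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# Synthetic demo: the test frame is shifted, so the AUC should come out high
rng = np.random.RandomState(0)
train_df = pd.DataFrame(rng.rand(300, 5))
test_df = pd.DataFrame(rng.rand(300, 5) + 0.3)
print(adversarial_auc(train_df, test_df))
```

An AUC near 0.5 means train and test look alike; an AUC near 1.0 (as in the Santander example) signals a strong distribution shift, so local CV may not track the leaderboard.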
  5. ENSEMBLES: BASIC
     • Stacking: an ensembling technique that combines information from multiple predictive models to generate a new model (see the sketch below).
     • The stacked model (2nd-level model) will usually outperform each of the individual models due to its smoothing nature and its ability to rely on each base model where it performs best and discount it where it performs poorly.
     • Stacking is most effective when the base models are significantly different.
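A sketch of classic out-of-fold stacking along these lines; the helper name, base models, and data are placeholders, not the speaker's setup:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

def oof_predictions(model, X, y, X_test, n_splits=5):
    """Out-of-fold predictions for train, fold-averaged predictions for test."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    for tr, va in kf.split(X):
        model.fit(X[tr], y[tr])
        oof[va] = model.predict(X[va])
        test_pred += model.predict(X_test) / n_splits
    return oof, test_pred

# Placeholder data; replace with your competition arrays
rng = np.random.RandomState(0)
X, y, X_test = rng.rand(500, 10), rng.rand(500), rng.rand(200, 10)

# Base models should be as different as possible
bases = [RandomForestRegressor(n_estimators=100, random_state=0),
         GradientBoostingRegressor(random_state=0)]
metas = [oof_predictions(m, X, y, X_test) for m in bases]

# The 2nd-level model is trained on out-of-fold predictions only, to avoid leakage
train_meta = np.column_stack([oof for oof, _ in metas])
test_meta = np.column_stack([tp for _, tp in metas])
final_pred = Ridge().fit(train_meta, y).predict(test_meta)
```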
  6. ENSEMBLES: HARD STORY…
     • Hard to build (it is much easier to overfit at the 2nd level than at the 1st)
     • Hard to maintain (the whole ensemble has to be recalculated after adding a feature)
     • Hard to validate, and easy to lose control of
     • Takes a lot of time
     • Sometimes it does not work due to the nature of the data
  7. ENSEMBLES: THE SIMPLEST
     • Bagging (decreases variance): averaging the same algorithm trained with different random seeds (see the sketch below)
     • Simple averaging of different algorithms
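A sketch of the first recipe, bagging by random seed, with placeholder data and model:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder data; replace with your competition arrays
rng = np.random.RandomState(0)
X, y, X_test = rng.rand(500, 10), rng.rand(500), rng.rand(200, 10)

# Bagging: the same algorithm refit with different seeds, predictions averaged
n_bags = 10
bagged = np.zeros(len(X_test))
for seed in range(n_bags):
    # subsample < 1.0 makes the seed actually change each fitted model
    model = GradientBoostingRegressor(subsample=0.8, random_state=seed)
    model.fit(X, y)
    bagged += model.predict(X_test) / n_bags
```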
  8. BAGGING IN SANTANDER VALUE PREDICTION CHALLENGE

     RMSLE    Fold1      Fold2      Fold3      Fold4      Fold5
     Bag: 0   1.490034   1.551832   1.507701   1.498717   1.4574
     Bag: 1   1.477678   1.54417    1.502861   1.499165   1.454479
     Bag: 2   1.471445   1.543386   1.500513   1.493613   1.45419
     Bag: 3   1.469421   1.543851   1.499011   1.495249   1.450472
     Bag: 4   1.46828    1.54117    1.499485   1.494402   1.454471
     Bag: 5   1.468886   1.540295   1.497646   1.493316   1.45233
     Bag: 6   1.467041   1.540077   1.497461   1.490675   1.45191
     Bag: 7   1.4662     1.540261   1.49709    1.490636   1.452969
     Bag: 8   1.466083   1.540779   1.499358   1.490681   1.454017
     Bag: 9   1.465501   1.538435   1.498638   1.49134    1.451611
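If each row above is the running average over the bags added so far (which the steadily improving scores suggest), a table like this can be produced by re-scoring the average after each bag; a sketch with hypothetical names, where `bag_preds` holds one fold's per-bag validation predictions:

```python
import numpy as np

def rmsle(y_true, y_pred):
    # Predictions clipped at 0 so the logs stay defined
    return np.sqrt(np.mean(
        (np.log1p(y_true) - np.log1p(np.clip(y_pred, 0, None))) ** 2))

def cumulative_bag_rmsle(bag_preds, y_valid):
    """RMSLE of the running average after each bag is added, for one fold."""
    running = np.zeros_like(y_valid, dtype=float)
    scores = []
    for i, pred in enumerate(bag_preds, start=1):
        running += pred
        scores.append(rmsle(y_valid, running / i))
    return scores
```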
  9. LINKS
     Some packages to get you started with ensembles:

     Language   Name           Comment
     Python     ML-Ensemble    General ensemble learning
     Python     Scikit-learn   Bagging, majority-voting classifiers; API for stacking in development
     Python     mlxtend        Regression and classification ensembles
     R          SuperLearner   Super Learner ensembles
     R          Subsemble      Subsembles
     R          caretEnsemble  Ensembles of caret estimators
     Multiple   H2O            Distributed stacked ensemble learning; limited to estimators in the H2O library
     Java       StackNet       Empowered by H2O
     Web-based  xcessiv        Web-based ensemble learning