

Hands-on solving classification and regression competition on Kaggle: validation, feature engineering, ensembles. Examples from current competitions: “Home Credit Default Risk”, “Santander Value Prediction Challenge”.
https://www.meetup.com/Kaggle-Munich/events/250963570/

Alex Tselikov

July 04, 2018

Transcript

  1. CROSS VALIDATION: BASIC
     This is one of the most important parts of any data mining task. It is essential to create a validation environment that has the same, or very similar, characteristics as the production environment. General goal: build a validation scheme that is robust and tracks the public leaderboard closely (this is not always possible).
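A minimal sketch of such a baseline validation loop in scikit-learn; the data and model here are placeholders, not from the talk:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Placeholder data: substitute your competition's train features and target
rng = np.random.RandomState(0)
X, y = rng.rand(1000, 20), rng.rand(1000)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_rmse = []
for train_idx, valid_idx in kf.split(X):
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[valid_idx])
    fold_rmse.append(mean_squared_error(y[valid_idx], pred) ** 0.5)

print("RMSE per fold:", np.round(fold_rmse, 4))
print("mean=%.4f std=%.4f" % (np.mean(fold_rmse), np.std(fold_rmse)))
```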
  2. CROSS VALIDATION: WHEN IT DOESN'T WORK
     1. CV error is not reflected in the leaderboard score AT ALL.
     Example: Two Sigma Financial Modeling Challenge https://www.kaggle.com/c/two-sigma-financial-modeling/leaderboard
     Current competition example: Santander Value Prediction Challenge (4k train set, 40k test set).
  3. CROSS VALIDATION: WHEN IT DOESN'T WORK
     Outliers. Mercedes-Benz Greener Manufacturing. Mean R2 by fold: [0.37, 0.43, 0.58, 0.54, 0.55].
     Check the standard deviation across folds: if it is too large, there is no point in comparing mean errors across folds.

                  Fold1   Fold2   Fold3   Fold4   Fold5   MeanByFolds
     Splitting1   1.484   1.557   1.517   1.503   1.487   1.50993
     Splitting2   1.473   1.563   1.543   1.508   1.465   1.510306
     Splitting3   1.492   1.500   1.525   1.502   1.521   1.508102
     Splitting4   1.504   1.452   1.497   1.530   1.542   1.505258
     Splitting5   1.487   1.517   1.539   1.494   1.519   1.511306
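One way to produce a table like the one above is to repeat cross-validation under several different split seeds and compare the spread of fold scores; a sketch with placeholder data and model:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge

# Placeholder data; plug in your own features and target
rng = np.random.RandomState(0)
X, y = rng.rand(500, 10), rng.rand(500)

for seed in range(5):  # five different splittings, as in the table above
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    rmse = -cross_val_score(Ridge(), X, y, cv=kf,
                            scoring="neg_root_mean_squared_error")
    print("splitting %d: folds=%s mean=%.4f std=%.4f"
          % (seed, np.round(rmse, 3), rmse.mean(), rmse.std()))
```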
  4. CROSS VALIDATION: ADVERSARIAL VALIDATION
     • The general idea is to check the degree of similarity between the train and test sets in terms of feature distributions.
     • This intuition can be quantified by combining the train and test sets, assigning 0/1 labels (0 = train, 1 = test), and evaluating a binary classification task (see the sketch below).
     ROC AUC ~ 0.87 for the Santander Value Prediction Challenge.
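A sketch of adversarial validation as described above, assuming both frames share the same feature columns with the target already dropped (the function name is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

def adversarial_auc(train_df, test_df):
    """Mean CV ROC AUC of a classifier separating train rows from test rows."""
    X = pd.concat([train_df, test_df], axis=0, ignore_index=True)
    y = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]  # 0 = train, 1 = test
    clf = GradientBoostingClassifier(random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# Synthetic demo: the test frame is shifted, so the AUC should come out high
rng = np.random.RandomState(0)
train_df = pd.DataFrame(rng.rand(300, 5))
test_df = pd.DataFrame(rng.rand(300, 5) + 0.3)
print(adversarial_auc(train_df, test_df))
```

An AUC near 0.5 means train and test look alike; an AUC near 1.0 (as in the Santander example) signals a strong distribution shift, so local CV may not track the leaderboard.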
  5. ENSEMBLES: BASIC
     • Stacking: an ensembling technique that combines information from multiple predictive models to generate a new model (see the sketch below).
     • The stacked model (2nd-level model) will usually outperform each of the individual models due to its smoothing nature and its ability to rely on each base model where it performs best and discount it where it performs poorly.
     • Stacking is most effective when the base models are significantly different.
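A sketch of classic out-of-fold stacking along these lines; the helper name, base models, and data are placeholders, not the speaker's setup:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

def oof_predictions(model, X, y, X_test, n_splits=5):
    """Out-of-fold predictions for train, fold-averaged predictions for test."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    for tr, va in kf.split(X):
        model.fit(X[tr], y[tr])
        oof[va] = model.predict(X[va])
        test_pred += model.predict(X_test) / n_splits
    return oof, test_pred

# Placeholder data; replace with your competition arrays
rng = np.random.RandomState(0)
X, y, X_test = rng.rand(500, 10), rng.rand(500), rng.rand(200, 10)

# Base models should be as different as possible
bases = [RandomForestRegressor(n_estimators=100, random_state=0),
         GradientBoostingRegressor(random_state=0)]
metas = [oof_predictions(m, X, y, X_test) for m in bases]

# The 2nd-level model is trained on out-of-fold predictions only, to avoid leakage
train_meta = np.column_stack([oof for oof, _ in metas])
test_meta = np.column_stack([tp for _, tp in metas])
final_pred = Ridge().fit(train_meta, y).predict(test_meta)
```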
  6. ENSEMBLES: HARD STORY…
     • Hard to build (it is much easier to overfit at the 2nd level than at the 1st)
     • Hard to maintain (the whole ensemble has to be recalculated after adding a feature)
     • Hard to validate, and easy to lose control of
     • Takes a lot of time
     • Sometimes it does not work due to the nature of the data
  7. ENSEMBLES: THE SIMPLEST
     • Bagging (decreases variance): averaging the same algorithm trained with different random seeds (see the sketch below)
     • Simple averaging of different algorithms
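A sketch of the first recipe, bagging by random seed, with placeholder data and model:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder data; replace with your competition arrays
rng = np.random.RandomState(0)
X, y, X_test = rng.rand(500, 10), rng.rand(500), rng.rand(200, 10)

# Bagging: the same algorithm refit with different seeds, predictions averaged
n_bags = 10
bagged = np.zeros(len(X_test))
for seed in range(n_bags):
    # subsample < 1.0 makes the seed actually change each fitted model
    model = GradientBoostingRegressor(subsample=0.8, random_state=seed)
    model.fit(X, y)
    bagged += model.predict(X_test) / n_bags
```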
  8. BAGGING IN SANTANDER VALUE PREDICTION CHALLENGE

     RMSLE    Fold1      Fold2      Fold3      Fold4      Fold5
     Bag: 0   1.490034   1.551832   1.507701   1.498717   1.4574
     Bag: 1   1.477678   1.54417    1.502861   1.499165   1.454479
     Bag: 2   1.471445   1.543386   1.500513   1.493613   1.45419
     Bag: 3   1.469421   1.543851   1.499011   1.495249   1.450472
     Bag: 4   1.46828    1.54117    1.499485   1.494402   1.454471
     Bag: 5   1.468886   1.540295   1.497646   1.493316   1.45233
     Bag: 6   1.467041   1.540077   1.497461   1.490675   1.45191
     Bag: 7   1.4662     1.540261   1.49709    1.490636   1.452969
     Bag: 8   1.466083   1.540779   1.499358   1.490681   1.454017
     Bag: 9   1.465501   1.538435   1.498638   1.49134    1.451611
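If each row above is the running average over the bags added so far (which the steadily improving scores suggest), a table like this can be produced by re-scoring the average after each bag; a sketch with hypothetical names, where `bag_preds` holds one fold's per-bag validation predictions:

```python
import numpy as np

def rmsle(y_true, y_pred):
    # Predictions clipped at 0 so the logs stay defined
    return np.sqrt(np.mean(
        (np.log1p(y_true) - np.log1p(np.clip(y_pred, 0, None))) ** 2))

def cumulative_bag_rmsle(bag_preds, y_valid):
    """RMSLE of the running average after each bag is added, for one fold."""
    running = np.zeros_like(y_valid, dtype=float)
    scores = []
    for i, pred in enumerate(bag_preds, start=1):
        running += pred
        scores.append(rmsle(y_valid, running / i))
    return scores
```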
  9. LINKS
     Some packages to get you started with ensembles:

     Language   Name           Comment
     Python     ML-Ensemble    General ensemble learning
     Python     Scikit-learn   Bagging, majority-voting classifiers; API for stacking in development
     Python     mlxtend        Regression and classification ensembles
     R          SuperLearner   Super Learner ensembles
     R          Subsemble      Subsembles
     R          caretEnsemble  Ensembles of caret estimators
     Multiple   H2O            Distributed stacked ensemble learning; limited to estimators in the H2O library
     Java       StackNet       Empowered by H2O
     Web-based  xcessiv        Web-based ensemble learning