1. Self-Introduction
- An actuarial data scientist working at a global insurance company
- Seconded to the University of Tokyo as a researcher of data health
- One of the authors of the Japanese Kaggle book `Data Science Skill to Win Kaggle`
- BS and MS degrees in Physics from the University of Tokyo
- Japanese Kaggle Book released Oct. 09, 2019, in stores now: https://amzn.to/2NdxKsj
2. Motivation
In statistics, there are many imputation methods. Multiple Imputation (Rubin, 1987) is unbiased and recommended*1 when the missing pattern is MCAR or MAR (explained later).
To the best of my knowledge, Multiple Imputation does NOT seem to be used in Kaggle competitions. In many top solutions, missing values were left as they are and fed into machine learning models without imputation (especially into gradient boosting trees, e.g. XGBoost, LightGBM, ...).
*1 National Research Council, 2010, The Prevention and Treatment of Missing Data in Clinical Trials
Chapter 3.3.1 "Use missing values as they are"
GBDT packages can handle missing values without any imputation. It is natural to keep missing values as they are, because a missing value itself can carry information about how the data were generated.
[Translated excerpt; the slide shows the Japanese original text.]
3 - 3. Single Imputation
Applicable to MCAR or MAR
- Mean Imputation
  - only in the MCAR case does the estimated parameter remain unbiased
  - scikit-learn SimpleImputer*1
- Regression Imputation, Matching
  - scikit-learn IterativeImputer*1 can use any estimator class (BayesianRidge, DT, ExT, kNN, ...) implemented in scikit-learn itself*2
- Stochastic Regression Imputation
  - add noise to the imputed values
Single imputation underestimates the variance (uncertainty).
*1 https://scikit-learn.org/stable/modules/impute.html#impute
*2 https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py
[Figure: Mean Imputation under MAR — scatter of x1 vs x2 showing not-missing x1, missing x1, imputed x1, and the x1 mean]
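A minimal sketch of the two scikit-learn imputers mentioned above, on made-up toy data (the array X is an assumption, not from the slide):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import BayesianRidge

# Toy data with missing entries (made up for illustration)
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Mean Imputation: point estimates stay unbiased only under MCAR
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Regression-style imputation: each column is modelled from the others
X_reg = IterativeImputer(estimator=BayesianRidge(), random_state=0).fit_transform(X)
```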
3 - 4. Multiple Imputation*1*2
Overcomes the underestimation of variance from single imputation
- Python packages: fancyimpute, Impyute, scikit-learn IterativeImputer
- R packages: norm (MCMC, Data Augmentation), mice (FCS, Chained Equations)*3, Amelia II (EMB)
Procedure (diagram: Missing Data -> Imputed Data 1 ... M -> Model 1 ... M -> Ensemble (Simple Blending); see the sketch below):
1. Build M models with M separately imputed datasets
2. Average the M results to get predictions
*1 Little and Rubin, 2002, Statistical Analysis with Missing Data, 2nd Edition
*2 Rubin, 1987, Multiple Imputation for Nonresponse in Surveys
*3 van Buuren et al., 2011, mice: Multivariate Imputation by Chained Equations in R
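A rough sketch of the procedure in the diagram; the helper name and the estimator interface (fit / predict_proba) are assumptions, not from the slides:

```python
import numpy as np

def multiple_imputation_ensemble(imputed_train_sets, y, imputed_test_sets, make_model):
    """Fit one model per imputed dataset and blend the predictions (simple blending)."""
    preds = []
    for X_tr, X_te in zip(imputed_train_sets, imputed_test_sets):
        model = make_model()                 # 1. build M models, one per imputed dataset
        model.fit(X_tr, y)
        preds.append(model.predict_proba(X_te)[:, 1])
    return np.mean(preds, axis=0)            # 2. average the M results
```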
scikit-learn IterativeImputer
https://scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation
- IterativeImputer returns a single imputation by default
- With different random seeds and sample_posterior=True, we can get multiple imputations
- Still an open problem!
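For example, a sketch of drawing M completed datasets from IterativeImputer; M = 5 and the variable names (X_train, X_test as numeric feature matrices) are assumptions. The resulting lists could feed the ensemble helper above:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

M = 5  # number of imputed datasets (an assumption, not from the slide)
imputed_train_sets, imputed_test_sets = [], []
for seed in range(M):
    # sample_posterior=True draws each imputation from the predictive distribution,
    # so different random seeds yield different completed datasets
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed_train_sets.append(imp.fit_transform(X_train))
    imputed_test_sets.append(imp.transform(X_test))
```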
FCS (Chained Equations) algorithm:
step 1. Specify the conditional imputation models $P(X_j^{\mathrm{mis}} \mid X_j^{\mathrm{obs}}, X_{-j})$, $1 \le j \le p$ ($j$: index of a column with missing values)
step 2. Initialize $X_j^{0}$ ($X_j^{0}$ are sampled from the observations)
step 3. for $1 \le t \le T$, for $1 \le j \le p$, generate $\theta_j^{t}$ and $X_j^{t}$
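A naive Python sketch of the chained-equations loop above, using BayesianRidge as every conditional model; the function and its defaults are illustrative, not the mice implementation:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

def fcs_impute(X, n_iter=10, seed=0):
    """Naive FCS sketch: cycle over columns with missing values and re-draw imputations."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    miss = np.isnan(X)
    # step 2: initialise missing cells with random draws from the observed values
    for j in range(X.shape[1]):
        X[miss[:, j], j] = rng.choice(X[~miss[:, j], j], size=miss[:, j].sum())
    # step 3: for each iteration t and column j, draw new imputations from a
    # conditional model of column j given the other (currently imputed) columns
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)
            model = BayesianRidge().fit(others[~miss[:, j]], X[~miss[:, j], j])
            mu, sd = model.predict(others[miss[:, j]], return_std=True)
            X[miss[:, j], j] = rng.normal(mu, sd)  # stochastic draw, not a point estimate
    return X
```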
4 - 2. Missing Value Flow
Reference: Chen et al., 2016, XGBoost
[Example table: 10 rows of features X1-X5, values in [0, 1], with scattered missing cells]
[Diagram: decision tree at round n with splits 1-3 on x1, x2, x3; NAs follow the learned default direction (Left or Right)]
Case L: NAs go to the Left
- NAs always go to the left
- 1. Find a threshold using the non-missing values (e.g. X1 = 0.64, 0.58, ..., 0.43)
- 2. The loss is computed including the NA values
Case R: NAs go to the Right
- NAs always go to the right
- The same procedure as Case L
3. Select Case L or Case R, whichever reduces the loss the most
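For reference, a minimal sketch of feeding NaNs straight into XGBoost (toy data made up); NaN is the default missing marker, so no imputation step is needed:

```python
import numpy as np
import xgboost as xgb

# Toy data with NaNs left as-is; XGBoost learns a default direction per split
X = np.array([[0.64, 0.23], [0.58, np.nan], [np.nan, 0.57], [0.74, 0.63],
              [0.07, 0.40], [0.20, np.nan]])
y = np.array([0, 1, 1, 0, 1, 0])

dtrain = xgb.DMatrix(X, label=y)    # NaN is treated as missing by default
params = {"objective": "binary:logistic", "max_depth": 2, "eta": 0.1}
bst = xgb.train(params, dtrain, num_boost_round=10)
pred = bst.predict(xgb.DMatrix(X))  # missing values follow the learned default direction
```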
5 - 2. Basic Model Pipeline (see the sketch below)
- Preprocessing: drop useless columns, 131 => 21 (12 numeric, 9 categorical); OrdinalEncoder (category_encoders) is fit_transformed on the numeric + categorical features of train and applied with transform to test
- Hyperparameter tuning: HyperOpt, 5-fold CV with early stopping on mean log-loss, using seed 2019 for CV and XGBoost
- Training: 5-fold CV with 5 seeds for the fold patterns (2020, 2021, 2022, 2023, 2024), giving one model per seed
- Predictions from the 5 models are averaged; OOF predictions give the CV score, test predictions give the Public / Private score
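A condensed sketch of this pipeline under stated assumptions: the HyperOpt tuning step is omitted, the column lists, tuned parameters and pandas inputs are placeholders, and `iteration_range` needs a reasonably recent xgboost:

```python
import numpy as np
import xgboost as xgb
import category_encoders as ce
from sklearn.model_selection import StratifiedKFold

def train_and_predict(X_train, y, X_test, numeric_cols, categorical_cols, params):
    """Ordinal-encode categoricals, run 5-fold CV with early stopping for each of the
    5 fold-pattern seeds, and blend the resulting test predictions."""
    cols = numeric_cols + categorical_cols
    enc = ce.OrdinalEncoder(cols=categorical_cols)
    X_tr = enc.fit_transform(X_train[cols])            # fit_transform on train
    X_te = enc.transform(X_test[cols])                 # transform only on test

    seed_preds = []
    for fold_seed in [2020, 2021, 2022, 2023, 2024]:   # 5 seeds for the fold patterns
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=fold_seed)
        fold_preds = []
        for tr_idx, va_idx in skf.split(X_tr, y):
            dtr = xgb.DMatrix(X_tr.iloc[tr_idx], label=y.iloc[tr_idx])
            dva = xgb.DMatrix(X_tr.iloc[va_idx], label=y.iloc[va_idx])
            bst = xgb.train(dict(params, objective="binary:logistic",
                                 eval_metric="logloss", seed=2019),  # seed 2019 for XGBoost
                            dtr, num_boost_round=10000,
                            evals=[(dva, "valid")],
                            early_stopping_rounds=100, verbose_eval=False)
            fold_preds.append(bst.predict(xgb.DMatrix(X_te),
                                          iteration_range=(0, bst.best_iteration + 1)))
        seed_preds.append(np.mean(fold_preds, axis=0))
    return np.mean(seed_preds, axis=0)                 # blend the 5 seed models
```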
Missing combinations on selected columns (R package `VIM`)
- The most frequent pattern is missingness of a single feature, v30
- About 45% of the train + test examples are complete (NOT missing)
- Little's test*1 rejects the null hypothesis that the data are MCAR (p < 1e-32)
*1 Little, R. J. A., 1988, A test of missing completely at random for multivariate data with missing values
- XGBoost with NAs left as-is is the best in both public and private scores
- Mean Imputation does not work (MAR?)
- More than 5 MICE datasets will be necessary to check the performance
Excellent Features of XGBoost
1. XGBoost can take the relation between NAs and the target values into account.
2. At each round, NAs are allocated to whichever subset decreases the loss function the most. For example, imagine X1 takes the values [NA, 1, 2, 3, 4, 5] and is used in the trees at rounds k and l. The NAs of X1 may be allocated to the subset [1, 2] in round k, while being allocated to [4, 5] in round l. In other words, X1 is effectively imputed with 1 or 2 in round k and with 4 or 5 in round l. This is similar to Multiple Imputation.
[Diagram: the tree at round k splits x1 into [1, 2] vs [3, 4, 5]; the tree at round l splits x1 into [1, 2, 3] vs [4, 5]. Round k fits $y - \hat{y}^{(k-1)}$ and round l fits $y - \hat{y}^{(l-1)}$, where $\hat{y}^{(k-1)} = \sum_{i=1}^{k-1} \hat{y}_i$.]