1. Self-Introduction
- An Actuarial Data Scientist at Global Insurance Company
- A seconded employee at the University of Tokyo, working as a researcher of data health
- One of the authors of the Japanese Kaggle book "Data Science Skill to Win Kaggle" (https://amzn.to/2NdxKsj)
- BS and MS degrees in Physics from the University of Tokyo
2. Motivation
Multiple Imputation (Rubin, 1987) is unbiased and recommended*1 when the missing pattern is MCAR or MAR (explained later). To the best of my knowledge, however, Multiple Imputation seems NOT to be used in Kaggle competitions: in many top solutions, missing values were left as-is and fed into machine learning models without imputation (especially into boosting trees, e.g. XGBoost, LightGBM, ...).
*1 National Research Council, 2010, The Prevention and Treatment of Missing Data in Clinical Trials
Gradient boosting trees (e.g. XGBoost, LightGBM) can treat missing values without any imputation. Treating a missing value as-is is natural, because the missingness itself can carry information about how the data were generated.
[Japanese original text omitted]
Single imputation methods:
- Mean Imputation: the estimated parameter remains unbiased only when used in the MCAR case; scikit-learn SimpleImputer*1
- Regression Imputation / Matching: scikit-learn IterativeImputer*1 can use any estimator class (BayesianRidge, DT, ExT, kNN, ...) implemented in scikit-learn itself*2
- Stochastic Regression Imputation: adds noise to the imputed values
Single imputation underestimates the variance (uncertainty).
[Figure: x1 vs x2 scatter contrasting not-missing X1, missing X1, and mean-imputed x1 — "Mean Imputation in MAR"]
*1 https://scikit-learn.org/stable/modules/impute.html#impute
*2 https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py
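As a concrete reference, here is a minimal Python sketch of the three single-imputation variants above, using scikit-learn's SimpleImputer and IterativeImputer; the toy matrix and the ExtraTrees estimator choice are illustrative assumptions, not taken from the slides:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Mean imputation: unbiased for the mean only under MCAR
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Regression imputation: each feature is modeled from the others;
# any scikit-learn estimator can be plugged in (here ExtraTrees)
X_reg = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=50, random_state=0),
    random_state=0,
).fit_transform(X)

# Stochastic regression imputation: sample_posterior=True draws each
# imputed value from the predictive distribution, i.e. adds noise
# (the default BayesianRidge estimator supports this)
X_stoch = IterativeImputer(sample_posterior=True, random_state=0).fit_transform(X)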
Multiple Imputation overcomes the variance underestimation of single imputation*1*2.
Python packages: fancyimpute, Impyute, scikit-learn IterativeImputer
R packages: norm (MCMC, Data Augmentation), mice (FCS, Chained Equations)*3, Amelia II (EMB)
Procedure:
1. Build M models with M separately imputed datasets
2. Average the M results to get predictions (ensemble / simple blending)
[Figure: Missing Data -> Imputed Data 1 ... Imputed Data M -> Model 1 ... Model M -> Ensemble (Simple Blending)]
*1 Little and Rubin, 2002, Statistical Analysis with Missing Data, 2nd Edition
*2 Rubin, 1987, Multiple Imputation for Nonresponse in Surveys
*3 van Buuren et al., 2011, mice: Multivariate Imputation by Chained Equations in R
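A minimal sketch of the M-imputations / M-models blending above, assuming scikit-learn's IterativeImputer with sample_posterior=True as the stochastic imputer and a plain LogisticRegression as the model; both are placeholder choices, not the slides' actual setup:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

def multiple_imputation_predict(X_train, y_train, X_test, M=5):
    preds = []
    for m in range(M):
        # sample_posterior=True makes each imputation a random draw,
        # so the M datasets differ (as in Rubin-style MI)
        imp = IterativeImputer(sample_posterior=True, random_state=m)
        Xtr = imp.fit_transform(X_train)
        Xte = imp.transform(X_test)
        model = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
        preds.append(model.predict_proba(Xte)[:, 1])
    # Average the M results (simple blending)
    return np.mean(preds, axis=0)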
[Table: toy dataset, IDs 1-10 with features X1-X5; several cells are NA (e.g. ID 3 has only three observed values)]
[Figure: value flow vs NA flow (default direction) through splits 1-3 of the decision tree at round n]
How a split treats NAs (e.g. on X1):
1. Find a threshold using only the not-missing values (e.g. X1 = 0.64, 0.58, ..., 0.43)
2. Case L: NAs always go to the left child; the loss is computed including the NA values. Case R: NAs always go to the right child; the same procedure as Case L.
3. Select Case L or Case R, whichever reduces the loss the most.
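A minimal sketch showing that XGBoost accepts NaNs directly and learns this default direction during training; the synthetic data below is an assumption for illustration:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((100, 5))
X[rng.random((100, 5)) < 0.2] = np.nan  # inject ~20% missing values
y = (np.nan_to_num(X[:, 0], nan=0.5) > 0.5).astype(int)

# No imputation needed: at each split, NaNs follow the default
# direction (Case L or Case R) chosen to minimize the loss
model = xgb.XGBClassifier(n_estimators=10, max_depth=3)
model.fit(X, y)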
Preprocessing:
- Drop useless columns: 131 => 21 (12 numeric, 9 categorical)
- OrdinalEncoder (category_encoders): fit_transform on train, transform on test
Hyperparameter tuning:
- HyperOpt, 5-fold CV with early stopping, minimizing mean log-loss
- Seed 2019 for both the CV folds and XGBoost training
Training:
- 5-fold CV with 5 seeds for the fold patterns (2020, 2021, 2022, 2023, 2024)
- One model per seed; OOF predictions give the CV score, test predictions give the Public/Private score
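A minimal sketch of this pipeline, assuming xgboost >= 1.6 (constructor-style early stopping) and placeholder column/target names; the hyperparameters found by HyperOpt are omitted:

import numpy as np
import xgboost as xgb
import category_encoders as ce
from sklearn.model_selection import StratifiedKFold

def train_predict(train, test, cat_cols, target="target"):
    # Fit the encoder on train only, then apply it to test
    enc = ce.OrdinalEncoder(cols=cat_cols)
    X = enc.fit_transform(train.drop(columns=[target]))
    y = train[target].values
    X_test = enc.transform(test)

    test_preds = []
    for seed in [2020, 2021, 2022, 2023, 2024]:  # 5 seeds for fold patterns
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        for tr_idx, va_idx in skf.split(X, y):
            model = xgb.XGBClassifier(
                n_estimators=1000, early_stopping_rounds=50,
                eval_metric="logloss", random_state=2019,
            )
            model.fit(X.iloc[tr_idx], y[tr_idx],
                      eval_set=[(X.iloc[va_idx], y[va_idx])], verbose=False)
            test_preds.append(model.predict_proba(X_test)[:, 1])
    return np.mean(test_preds, axis=0)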
About 45% of the train + test examples are complete (NOT missing any value).
Little's test*1 rejects the null hypothesis that the data are MCAR (p < 1e-32), i.e. the missing pattern is not MCAR.
[Figure: missing-value combinations on selected columns, drawn with the R package `VIM`]
*1 Little, R. J. A., 1988, A Test of Missing Completely at Random for Multivariate Data with Missing Values
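For a quick pandas analogue of the VIM missing-combination plot, one can count how many rows share each missingness pattern; `df` is assumed to hold the concatenated train + test features:

import pandas as pd

def missing_patterns(df: pd.DataFrame) -> pd.Series:
    # True = missing; each unique row of the mask is one combination,
    # reported as a share of all rows
    return df.isna().value_counts(normalize=True)

# Share of fully complete rows (the ~45% mentioned above):
# complete_share = (~df.isna().any(axis=1)).mean()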
Excellent Features of XGBoost
2. At each round, NAs are allocated to the side that decreases the loss function the most. For example, imagine that X1's values are [NA, 1, 2, 3, 4, 5] and that X1 is used in the trees at rounds k and l. The NAs of X1 may be allocated to the subset [1, 2] at round k, while allocated to [4, 5] at round l. In other words, X1 is effectively imputed with 1 or 2 at round k and with 4 or 5 at round l. This is similar to Multiple Imputation.
The tree at round k is fit against the prediction so far, $\hat{y}_i^{(k-1)} = \sum_{t=1}^{k-1} f_t(x_i)$, and likewise the tree at round l against $\hat{y}_i^{(l-1)} = \sum_{t=1}^{l-1} f_t(x_i)$, so the optimal NA direction can differ between rounds.
[Figure: the split on x1 at round k separates {1, 2} from {3, 4, 5}; at round l it separates {4, 5} from {1, 2, 3}]
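One way to observe this behaviour is xgboost's Booster.trees_to_dataframe(), whose 'Missing' column records which child NAs are routed to at each split; the toy data below is an assumption for illustration:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((200, 3))
X[rng.random((200, 3)) < 0.3] = np.nan
y = (rng.random(200) > 0.5).astype(int)

model = xgb.XGBClassifier(n_estimators=5, max_depth=2).fit(X, y)
trees = model.get_booster().trees_to_dataframe()
# 'Missing' holds the child node NAs are routed to; it can differ
# from tree to tree (round to round) even for the same feature
print(trees.loc[trees["Feature"] == "f0",
                ["Tree", "Node", "Split", "Yes", "No", "Missing"]])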