Imputation Strategy @ Kaggle Days Tokyo (Maxwell)

by Maxwell

Slide 1

Slide 1 text

Imputation Strategy
Yuji Hiramatsu
Sr. Data Scientist and Actuary, Kaggle Master

Slide 2

Slide 2 text

Agenda
1. Self-Introduction
2. Motivation
3. Missing Value Treatment in Statistics
4. Missing Value Treatment in XGBoost
5. Naïve Experiment

Slide 3

Slide 3 text

1. Self-Introduction
- An actuarial data scientist working at a global insurance company
- A seconded employee at the University of Tokyo as a researcher of data health
- One of the authors of the Japanese Kaggle book `Data Science Skill to Win Kaggle`
- BS and MS degrees in Physics from the University of Tokyo

Japanese Kaggle Book: released Oct. 9, 2019, in stores now! https://amzn.to/2NdxKsj

Slide 4

Slide 4 text

2. Motivation
- In statistics, there are many imputation methods.
- Multiple Imputation (Rubin, 1987) is unbiased and recommended*1 when the missing pattern is MCAR or MAR (explained later).
- To the best of my knowledge, Multiple Imputation seems NOT to be used in Kaggle competitions. In many top solutions, missing values were left as they are and fed into machine learning models without imputation (especially into boosted tree models, e.g. XGBoost, LightGBM, ...).
*1 National Research Council, 2010, The Prevention and Treatment of Missing Data in Clinical Trials

Slide 5

Slide 5 text

Chapter 3.3.1 Use missing values as they are (translated from the Japanese original text)
GBDT packages can handle missing values without any imputation. Leaving missing values as they are is a natural choice, because a missing value can itself carry information about how the data were generated. ... ... ...

Slide 6

Slide 6 text

3. Missing Value Treatment in Statistics
3 - 1. MCAR, MAR and NMAR
3 - 2. LD, PD
3 - 3. Single Imputation
3 - 4. Multiple Imputation

Slide 7

Slide 7 text

3 - 1. MCAR, MAR and NMAR
- MCAR (Missing Completely At Random): whether X1 is missing does not depend on any variable
- MAR (Missing At Random): whether X1 is missing depends on X2
- NMAR (Not Missing At Random): whether X1 is missing depends not only on X2 but also on X1 itself; hard to impute without knowing the missing mechanism

[Figure: three scatter plots of x1 against x2, one each for MCAR, MAR and NMAR; legend: Not missing, X1 Missing]

ID  X1    X2
1   0.06  0.28
2   0.51  0.51
3   NA    0.14
4   0.05  0.01
5   0.52  0.61
6   NA    0.45
7   0.94  0.98
8   0.30  0.73
9   0.75  0.84
10  NA    0.47
11  NA    0.49
12  0.72  0.80
13  0.13  0.83
14  0.77  0.11
15  0.43  0.90
16  NA    0.91
17  0.18  0.37
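The three mechanisms can be simulated to see why they matter. The following is a minimal sketch on synthetic data, not data from the talk; the drop probabilities (0.3, `0.6 * x2`, `x1 * x2`) are arbitrary assumptions chosen only to make the effect visible. It compares the mean of the observed X1 values with the true mean under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x2 = rng.uniform(size=n)
x1 = 0.7 * x2 + 0.3 * rng.uniform(size=n)   # x1 is correlated with x2

# MCAR: every x1 value is dropped with the same probability.
mcar_mask = rng.uniform(size=n) < 0.3
# MAR: the drop probability depends only on the always-observed x2.
mar_mask = rng.uniform(size=n) < 0.6 * x2
# NMAR: the drop probability also depends on x1 itself.
nmar_mask = rng.uniform(size=n) < x1 * x2


def observed_mean(mask):
    """Mean of x1 over the rows where x1 is still observed."""
    return x1[~mask].mean()


full_mean = x1.mean()
mcar_mean = observed_mean(mcar_mask)
mar_mean = observed_mean(mar_mask)
nmar_mean = observed_mean(nmar_mask)
print(full_mean, mcar_mean, mar_mean, nmar_mean)
```

Under MCAR the observed mean stays close to the true mean, while under MAR and NMAR it is pulled downward, because rows with large values are dropped more often.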

Slide 8

Slide 8 text

3 - 2. LD, PD
- LD (Listwise Deletion)
  - Basically applicable only to MCAR
  - Remove any example that has at least one missing value
  - Decreases the sample size
- PD (Pairwise Deletion)
  - Basically applicable only to MCAR
  - Decreases the sample size
  - When modeling with all features, this is identical to LD

ID  X1    X2    X3
1   0.06  0.28  0.42
2   0.51  0.51  0.82
3   NA    0.14  NA
4   0.05  0.01  0.73
5   0.52  0.61  0.85
6   NA    0.45  0.36
7   0.94  0.98  0.39
8   0.30  0.73  NA
9   0.75  0.84  0.54
10  NA    0.47  0.00
11  NA    0.49  NA
12  0.72  0.80  0.30
13  0.13  0.83  0.08
14  0.77  0.11  0.29
15  0.43  0.90  0.18
16  NA    0.91  NA
17  0.18  0.37  0.23

Listwise Deletion sample size: 11
Pairwise Deletion on X2 and X3 sample size: 13
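Both deletion schemes map onto `DataFrame.dropna` in pandas. The sketch below re-enters the slide's table, assuming the blank cells fall in X1 for IDs 3, 6, 10, 11, 16 and in X3 for IDs 3, 8, 11, 16. That placement is inferred from the stated sample sizes and is not given explicitly in the slide text.

```python
import numpy as np
import pandas as pd

NA = np.nan
df = pd.DataFrame(
    {
        "X1": [0.06, 0.51, NA, 0.05, 0.52, NA, 0.94, 0.30, 0.75,
               NA, NA, 0.72, 0.13, 0.77, 0.43, NA, 0.18],
        "X2": [0.28, 0.51, 0.14, 0.01, 0.61, 0.45, 0.98, 0.73, 0.84,
               0.47, 0.49, 0.80, 0.83, 0.11, 0.90, 0.91, 0.37],
        "X3": [0.42, 0.82, NA, 0.73, 0.85, 0.36, 0.39, NA, 0.54,
               0.00, NA, 0.30, 0.08, 0.29, 0.18, NA, 0.23],
    },
    index=pd.RangeIndex(1, 18, name="ID"),
)

# Listwise deletion: keep only rows with no missing value in any column.
listwise = df.dropna()
# Pairwise deletion for an analysis of X2 and X3: only those two columns must be present.
pairwise_x2_x3 = df.dropna(subset=["X2", "X3"])

print(len(listwise), len(pairwise_x2_x3))
```

The two lengths reproduce the sample sizes on the slide (11 and 13).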

Slide 9

Slide 9 text

3 - 3. Single Imputation
- Applicable to MCAR or MAR
- Mean Imputation
  - The estimated parameter remains unbiased only when used in the MCAR case
  - scikit-learn SimpleImputer*1
- Regression Imputation, Matching
  - scikit-learn IterativeImputer*1 can use any estimator class (BayesianRidge, DT, ExT, kNN, ...) implemented in scikit-learn itself*2
- Stochastic Regression Imputation
  - Adds noise to the imputed values
- Single imputation underestimates the variance (uncertainty)
*1 https://scikit-learn.org/stable/modules/impute.html#impute
*2 https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py
[Figure: Mean Imputation in MAR; scatter plot of x1 against x2 with the x1 mean marked; legend: Not missing, X1 Missing, X1 Imputed]
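The variance underestimation is easy to reproduce with `SimpleImputer`. This is a small sketch on synthetic standard-normal data with an assumed 40% MCAR mask; the numbers are illustrative only and unrelated to the experiment later in the talk.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=(5_000, 1))
x_missing = x.copy()
x_missing[rng.uniform(size=len(x)) < 0.4, 0] = np.nan   # MCAR, about 40% missing

# Every missing cell receives the same value: the mean of the observed cells.
imputed = SimpleImputer(strategy="mean").fit_transform(x_missing)

var_observed = np.nanvar(x_missing)   # variance of the observed values
var_imputed = imputed.var()           # variance after mean imputation
print(var_observed, var_imputed)
```

The mean is preserved, but the imputed column has a clearly smaller variance, since the filled cells contribute no spread at all.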

Slide 10

Slide 10 text

3 - 4. Multiple Imputation*1*2
- Overcomes the underestimation of variance in single imputation
- Python packages
  - fancyimpute
  - Impyute
  - scikit-learn IterativeImputer
- R packages
  - Norm (MCMC, Data Augmentation)
  - MICE (FCS, Chained Equations)*3
  - Amelia II (EMB)
*1 Little and Rubin, 2002, Statistical Analysis with Missing Data, 2nd Edition
*2 Rubin, 1987, Multiple Imputation for Nonresponse in Surveys
*3 Buuren et al., 2011, mice: Multivariate Imputation by Chained Equations in R
[Diagram: Missing Data -> Imputed Data 1, 2, ..., M-1, M -> Model 1, 2, ..., M-1, M -> Ensemble (Simple Blending)]
1. Build M models, one on each of the M imputed datasets
2. Average the M results to get predictions
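The two ensemble steps could be sketched as follows. This is an illustration under assumptions, not the pipeline used in the talk: the data are synthetic, `IterativeImputer(sample_posterior=True)` stands in for a multiple-imputation package, `LogisticRegression` stands in for the model, and M = 5 is arbitrary.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_missing = X.copy()
X_missing[rng.uniform(size=X.shape) < 0.2] = np.nan   # toy MCAR mask

M = 5
probas = []
for m in range(M):
    # Step 1: one imputed dataset and one model per seed.
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    X_imputed = imputer.fit_transform(X_missing)
    model = LogisticRegression().fit(X_imputed, y)
    probas.append(model.predict_proba(X_imputed)[:, 1])

# Step 2: simple blending, i.e. the plain average of the M predictions.
blended = np.mean(probas, axis=0)
print(blended[:5])
```

For a held-out test set, each fitted imputer would be applied with `transform` before calling the matching model, and the M test predictions averaged in the same way.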

Slide 11

Slide 11 text

scikit-learn IterativeImputer
https://scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation
- IterativeImputer returns a single imputation
- By setting sample_posterior=True and using different random seeds, we can get multiple imputations
- How useful single vs. multiple imputation is for prediction is, according to that page, still an open problem!
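A short check of this behaviour, as a sketch on synthetic data and assuming the default `IterativeImputer` settings (BayesianRidge estimator, ascending imputation order): without posterior sampling the seed has no effect, while with `sample_posterior=True` each seed yields a different completed dataset.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
X_missing = X.copy()
X_missing[rng.uniform(size=X.shape) < 0.2] = np.nan


def impute(seed, sample_posterior):
    """Return one completed copy of X_missing for the given seed."""
    imputer = IterativeImputer(sample_posterior=sample_posterior, random_state=seed)
    return imputer.fit_transform(X_missing)


single_a, single_b = impute(0, False), impute(1, False)
multi_a, multi_b = impute(0, True), impute(1, True)

same_single = np.allclose(single_a, single_b)   # deterministic point estimates
same_multi = np.allclose(multi_a, multi_b)      # posterior draws differ per seed
print(same_single, same_multi)
```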

Slide 12

Slide 12 text

MICE (Multiple Imputation by Chained Equations)

- MICE algorithm (Fully Conditional Specification)
  step 1. Set the conditional distributions $P(X_j \mid X_{-j}, \theta_j)$, $1 \le j \le J$ ($j$: index of a column with missing values)
  step 2. Initialize $X_j = X_j^{0}$ ($X_j^{0}$ are sampled from the observations)
  step 3. For $1 \le d \le D$ and $1 \le j \le J$, generate $\theta_j^{d}$ and $X_j^{d}$:
    $\theta_j^{d} \Leftarrow P(\theta_j \mid X_1^{d}, \dots, X_{j-1}^{d}, X_j^{\mathrm{obs}}, X_{j+1}^{d-1}, \dots, X_J^{d-1})$
    $X_j^{d} \sim P(X_j \mid X_1^{d}, \dots, X_{j-1}^{d}, X_{j+1}^{d-1}, \dots, X_J^{d-1}, \theta_j^{d})$
    ($X_j^{\mathrm{obs}}$: the observed entries of $X_j$)
  step 4. Remove the `Burn-in` phase and sample M (< D) imputed datasets (pseudo-complete data)
- Imputation models: $P(X_j \mid X_{-j}, \theta_j)$

Example (original data, missing cells shown as NA)

ID  X1    X2    X3    X4    X5
1   0.64  0.23  0.41  0.37  0.37
2   0.58  0.26  0.77  0.54  0.68
3   NA    0.10  NA    0.57  0.81
4   0.74  0.63  0.14  0.44  0.80
5   0.07  0.40  0.00  0.13  NA
6   NA    0.20  0.43  0.47  0.63
7   0.15  0.99  0.33  0.42  0.54
8   0.30  0.55  NA    0.94  0.18
9   0.43  0.66  0.46  0.32  NA
10  NA    0.12  0.22  0.73  0.60

d = 1, j = 1: fitting $X_1^{1}$ (X3 and X5 hold their initial values; X1 is still missing)

ID  X1    X2    X3    X4    X5
1   0.64  0.23  0.41  0.37  0.37
2   0.58  0.26  0.77  0.54  0.68
3   NA    0.10  0.77  0.57  0.81
4   0.74  0.63  0.14  0.44  0.80
5   0.07  0.40  0.00  0.13  0.80
6   NA    0.20  0.43  0.47  0.63
7   0.15  0.99  0.33  0.42  0.54
8   0.30  0.55  0.33  0.94  0.18
9   0.43  0.66  0.46  0.32  0.18
10  NA    0.12  0.22  0.73  0.60

d = 1, j = 1: sampling $X_1^{1}$ (X1 is filled in)

ID  X1    X2    X3    X4    X5
1   0.64  0.23  0.41  0.37  0.37
2   0.58  0.26  0.77  0.54  0.68
3   0.30  0.10  0.77  0.57  0.81
4   0.74  0.63  0.14  0.44  0.80
5   0.07  0.40  0.00  0.13  0.80
6   0.60  0.20  0.43  0.47  0.63
7   0.15  0.99  0.33  0.42  0.54
8   0.30  0.55  0.33  0.94  0.18
9   0.43  0.66  0.46  0.32  0.18
10  0.70  0.12  0.22  0.73  0.60

d = 1, j = 3: fitting $X_3^{1}$ (X3 is set back to missing; X1 keeps its sampled values)

ID  X1    X2    X3    X4    X5
1   0.64  0.23  0.41  0.37  0.37
2   0.58  0.26  0.77  0.54  0.68
3   0.30  0.10  NA    0.57  0.81
4   0.74  0.63  0.14  0.44  0.80
5   0.07  0.40  0.00  0.13  0.80
6   0.60  0.20  0.43  0.47  0.63
7   0.15  0.99  0.33  0.42  0.54
8   0.30  0.55  NA    0.94  0.18
9   0.43  0.66  0.46  0.32  0.18
10  0.70  0.12  0.22  0.73  0.60

Slide 13

Slide 13 text

4. Missing Value Treatment in XGBoost
4 - 1. Review of XGBoost
4 - 2. Missing Value Flow

Slide 14

Slide 14 text

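4 - 1. Review of XGBoost
[Diagram: a sequence of N trees, each fitted to the residual left by the trees before it]
- Tree 1: target $y$, predict $y_1$
- Tree 2: target $y - y_1$, predict $y_2$
- Tree 3: target $y - y_1 - y_2$
- ... (predict $y_{N-1}$)
- Tree N: target $y - \sum_{i=1}^{N-1} y_i$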
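The residual-fitting loop in the diagram can be sketched with plain regression trees. This is a deliberate simplification: actual XGBoost fits gradient and Hessian statistics with shrinkage and regularization, none of which appear here, and the data and the tree depth are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

n_rounds = 20
prediction = np.zeros_like(y)   # running sum of y_1, ..., y_{n-1}
errors = []
for n in range(n_rounds):
    residual = y - prediction                    # target of tree n
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction += tree.predict(X)                # add y_n to the running sum
    errors.append(np.mean((y - prediction) ** 2))

print(errors[0], errors[-1])
```

Each tree only has to explain what the earlier trees left over, so the training error shrinks round by round.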

Slide 15

Slide 15 text

4 - 2. Missing Value Flow
Chen, et al., 2016, XGBoost

Example

ID  X1    X2    X3    X4    X5
1   0.64  0.23  0.41  0.37  0.37
2   0.58  0.26  0.77  0.54  0.68
3   NA    0.10  NA    0.57  0.81
4   0.74  0.63  0.14  0.44  0.80
5   0.07  0.40  0.00  0.13  NA
6   NA    0.20  0.43  0.47  0.63
7   0.15  0.99  0.33  0.42  0.54
8   0.30  0.55  NA    0.94  0.18
9   0.43  0.66  0.46  0.32  NA
10  NA    0.12  0.22  0.73  0.60

[Diagram: decision tree at round n with split 1 on x1, split 2 on x2 and split 3 on x3; at each split the NA flow (default direction) goes Left or Right]

1. Case L: NA goes to Left
   - NAs always go to the left
   - Find a threshold using the non-missing values (e.g. X1 = 0.64, 0.58, ..., 0.43)
   - The loss is computed including the NA values
2. Case R: NA goes to Right
   - NAs always go to the right
   - The same procedure as Case L
3. Select Case L or Case R, whichever reduces the loss the most
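The Case L / Case R search can be sketched for a single feature. This is a minimal illustration that assumes a squared-error loss around each side's mean; XGBoost itself scores splits with gradient statistics and a regularized gain. The function name `best_split_with_default` and the toy arrays are hypothetical.

```python
import numpy as np


def sse(values):
    """Sum of squared errors around the mean (0 for an empty group)."""
    return float(((values - values.mean()) ** 2).sum()) if len(values) else 0.0


def best_split_with_default(x, target):
    """Return (loss, threshold, default_direction) for one feature."""
    is_na = np.isnan(x)
    best = (np.inf, None, None)
    # Candidate thresholds come from the non-missing values only.
    for threshold in np.unique(x[~is_na]):
        goes_left = np.zeros(len(x), dtype=bool)
        goes_left[~is_na] = x[~is_na] < threshold
        for direction in ("left", "right"):
            # Case L sends every NA row left, Case R sends every NA row right.
            left = goes_left | is_na if direction == "left" else goes_left
            # The loss includes the NA rows on whichever side they were sent to.
            loss = sse(target[left]) + sse(target[~left])
            if loss < best[0]:
                best = (loss, float(threshold), direction)
    return best


x = np.array([np.nan, 1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
target = np.array([0.1, 0.0, 0.1, 1.0, 0.9, 1.1, 0.0])
loss, threshold, direction = best_split_with_default(x, target)
print(loss, threshold, direction)
```

In this toy case the NA rows have targets close to those of the small x values, so the search sends them left together with x < 3.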

Slide 16

Slide 16 text

5. Naïve Experiment
5 - 1. Task and Dataset
5 - 2. Basic Model Pipeline
5 - 3. Model Pipeline with MICE
5 - 4. Result 1
5 - 5. Result 2

Slide 17

Slide 17 text

5 - 1. Task and Dataset
Specification
- Binary classification: 27,300 examples of class 0 and 87,021 examples of class 1
- Metric: Logloss
- train: 114,321 x 133, test: 114,393 x 132
- v1 ~ v131: anonymous features
- Private score differences between adjacent leaderboard positions are on the order of 0.000X

Slide 18

Slide 18 text

Missing Map
[Figure: missing map over features v1 to v131, with counts of missing values per feature]
- Many NaNs
- 0 ~ 50% missing in each feature
- Some missing patterns observed (explored later)

Slide 19

Slide 19 text

5 - 2. Basic Model Pipeline
- Hyperparameter tuning
  - HyperOpt
  - 5 fold CV with ES (early stopping)
  - log-loss-mean
  - Using seed 2019 for CV and XGBoost train
- Preprocessing
  - Drop useless columns: 131 => 21 (12 numeric, 9 categorical)
  - Ordinal Encoder (category_encoders) for the categorical features: fit_transform on train, transform on test
  - Numeric features are passed through
- Training
  - 5 fold CV
  - 5 seeds for fold patterns: 2020, 2021, 2022, 2023, 2024
  - One model per seed (seed 2020 to seed 2024)
- Outputs
  - OOF prediction -> CV score
  - Test prediction -> Public / Private score

Slide 20

Slide 20 text

- The most frequent pattern is a single missing feature, v30
- About 45% of train + test examples are complete (NOT missing)
- Little's test*1: the null hypothesis of MCAR is rejected (p < 1e-32)
[Figure: missing combinations on selected columns, R package `VIM`]
*1 Little, R. J. A. 1988. A test of missing completely at random for multivariate data with missing values
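The slide used the R package `VIM` for the pattern plot. A comparable tabulation of missing combinations can be sketched in pandas; the frame below is synthetic and the column names `a`, `b`, `c` are placeholders, not the competition features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1_000, 3)), columns=["a", "b", "c"])
df.loc[rng.uniform(size=len(df)) < 0.3, "a"] = np.nan          # "a" missing on its own
df.loc[rng.uniform(size=len(df)) < 0.1, ["b", "c"]] = np.nan   # "b" and "c" missing together

# One entry per distinct missingness pattern (True = missing), most frequent first.
patterns = df.isna().value_counts()
# Share of rows with no missing value at all.
complete_share = df.notna().all(axis=1).mean()

print(patterns)
print(f"complete rows: {complete_share:.1%}")
```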

Slide 21

Slide 21 text

5 - 3. Model Pipeline with MICE
MICE (R package)
- Imputed Datasets: M = 30*1
- Max Iteration: m = 20
- Imputation Models
  - v10: pmm
  - v12: pmm
  - v14: pmm
  - v21: pmm
  - v30: polyreg
  - v31: polyreg
  - v34: pmm
  - v40: pmm
  - v50: pmm
  - v52: polyreg
  - v91: polyreg
  - v114: pmm
*1 Carpenter, et al., 2013, Multiple Imputation and Its Application

Slide 22

Slide 22 text

5 - 4. Result 1

Slide 23

Slide 23 text

- XGBoost with NAs is the best on both the public and private scores
- Mean Imputation does not work (MAR?)
- More than 5 MICE datasets will be necessary to check the performance

Slide 24

Slide 24 text

Excellent Features of XGBoost
1. XGBoost can take into account the relation between NAs and target values.
2. At each round, NAs are allocated to whichever side of the split decreases the loss function the most.

For example, imagine that the X1 values are [NA, 1, 2, 3, 4, 5] and that X1 is used in the trees at rounds k and l. The NAs of X1 are allocated to the subset [1, 2] in round k, but to the subset [4, 5] in round l. In other words, X1 is imputed with 1 or 2 in round k and with 4 or 5 in round l. This is similar to Multiple Imputation.

[Diagram: round k, target $y - \sum_{i=1}^{k-1} y_i$, splits x1 into [1, 2] and [3, 4, 5]; round l, target $y - \sum_{i=1}^{l-1} y_i$, splits x1 into [1, 2, 3] and [4, 5]]

Slide 25

Slide 25 text

5 - 5. Result 2
XGBoost can take into account the relation between NAs and target values. What if we add NA features (indicating missingness) to the MICE datasets?

na_v10  na_v12  na_v14  ...
0       0       0       ...
0       0       1       ...
1       0       0       ...
0       0       0       ...
0       0       0       ...
0       0       1       ...
1       0       0       ...
1       0       0       ...
0       1       0       ...
0       0       0       ...
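Building such indicator columns takes one line in pandas. In this sketch the three columns and their values are made up, and a column-mean fill stands in for one MICE-imputed dataset; the point is only that the flags must be computed from the data before imputation.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame(
    {"v10": [1.2, np.nan, 0.7], "v12": [np.nan, 3.1, 2.2], "v14": [0.5, 0.4, np.nan]}
)

# Build the indicators from the data *before* imputation: 1 = the value was missing.
na_flags = raw.isna().astype(int).add_prefix("na_")
# Stand-in for one imputed dataset (column means replace the real MICE output here).
imputed = raw.fillna(raw.mean())
# Append the NA features to the imputed dataset.
with_flags = pd.concat([imputed, na_flags], axis=1)

print(with_flags)
```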

Slide 27

Slide 27 text

- Adding NA features improves the MICE scores
- XGBoost with NAs already uses the missingness information, so its improvement is smaller than that of MICE

Slide 28

Slide 28 text

Thank you!