## Slide 1

### Slide 1 text

Imputation Strategy

Yuji Hiramatsu, Sr. Data Scientist and Actuary, Kaggle Master

## Slide 2

### Slide 2 text

Agenda

1. Self-Introduction
2. Motivation
3. Missing Value Treatment in Statistics
4. Missing Value Treatment in XGBoost
5. Naïve Experiment

## Slide 3

### Slide 3 text

1. Self-Introduction

- An actuarial data scientist working at a global insurance company
- A seconded employee at the University of Tokyo as a researcher of data health
- One of the authors of the Japanese Kaggle book *Data Science Skill to Win Kaggle* (released Oct. 9, 2019, in stores now: https://amzn.to/2NdxKsj)
- BS and MS degrees in Physics from the University of Tokyo

## Slide 4

### Slide 4 text

2. Motivation

- In statistics, there are many imputation methods. Multiple Imputation (Rubin, 1987) is unbiased and recommended*1 when the missing pattern is MCAR or MAR (explained later).
- To the best of my knowledge, Multiple Imputation seems NOT to be used in Kaggle competitions. In many top solutions, missing values were left as-is and fed into machine learning models without imputation (especially into boosting trees, e.g. XGBoost, LightGBM, ...).

*1 National Research Council, 2010, The Prevention and Treatment of Missing Data in Clinical Trials

## Slide 5

### Slide 5 text

Chapter 3.3.1: Use missing values as they are (translated from the Japanese original)

GBDT packages can treat missing values without any imputation. It is natural to leave missing values as they are, because a missing value itself can carry information about the data-generation background.

## Slide 6

### Slide 6 text

3. Missing Value Treatment in Statistics

3 - 1. MCAR, MAR and NMAR
3 - 2. LD, PD
3 - 3. Single Imputation
3 - 4. Multiple Imputation

## Slide 7

### Slide 7 text

3 - 1. MCAR, MAR and NMAR

- MCAR (Missing Completely At Random): whether X1 is missing does not depend on any variable
- MAR (Missing At Random): whether X1 is missing depends only on X2
- NMAR (Not Missing At Random): whether X1 is missing depends on not only X2 but also X1 itself; hard to impute without knowing the missing mechanism

(The slide illustrates each mechanism with scatter plots of X1 vs. X2 and an example table in which some X1 values are missing.)
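The three mechanisms above can be simulated directly; this minimal sketch (variable names are illustrative) shows why NMAR is the hard case: the observed values are no longer representative of the full distribution.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x1 = rng.random(n)
x2 = rng.random(n)

# MCAR: missingness of x1 depends on nothing.
mcar = rng.random(n) < 0.3
# MAR: missingness of x1 depends only on the observed x2.
mar = x2 < 0.3
# NMAR: missingness of x1 depends on x1 itself (the unobserved value).
nmar = x1 < 0.3

x1_mcar = np.where(mcar, np.nan, x1)
x1_mar = np.where(mar, np.nan, x1)
x1_nmar = np.where(nmar, np.nan, x1)

# Under NMAR the observed mean is biased upward; under MCAR it is not.
print(np.nanmean(x1_mcar), np.nanmean(x1_nmar))
```

Since NMAR deletes exactly the small values, the surviving sample mean drifts well above the true mean, which no imputation model can detect from the data alone.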

## Slide 8

### Slide 8 text

3 - 2. LD, PD

- LD (Listwise Deletion)
  - Basically applicable only to MCAR
  - Removes any example that has at least one missing value
  - Decreases sample size
- PD (Pairwise Deletion)
  - Basically applicable only to MCAR
  - Decreases sample size
  - When modeling with all features, this is identical to LD

(The slide shows an example table of 17 rows with features X1-X3: listwise deletion leaves a sample size of 11; pairwise deletion on X2 and X3 leaves 13.)
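Both deletion schemes map onto `pandas.DataFrame.dropna`; this small sketch (with a made-up 4-row table, not the slide's 17-row one) shows why pairwise deletion keeps at least as many rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "X1": [0.06, 0.51, np.nan, 0.05],
    "X2": [0.28, 0.51, 0.14, np.nan],
    "X3": [0.42, np.nan, 0.73, 0.85],
})

# Listwise deletion: drop every row with at least one NaN anywhere.
listwise = df.dropna()

# Pairwise deletion on X2 and X3: keep rows complete for just that pair.
pairwise = df.dropna(subset=["X2", "X3"])

print(len(listwise), len(pairwise))
```

When a model uses all features, the `subset` covers every column and the two calls coincide, matching the slide's remark.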

## Slide 9

### Slide 9 text

3 - 3. Single Imputation

- Applicable to MCAR or MAR
- Mean Imputation
  - Only in the MCAR case does the estimated parameter remain unbiased
  - scikit-learn SimpleImputer*1
  - (The slide's scatter plot of X1 vs. X2 illustrates mean imputation under MAR: imputed points pile up at the observed mean of X1)
- Regression Imputation, Matching
  - scikit-learn IterativeImputer*1 can use any estimator class (BayesianRidge, DT, ExT, kNN, ...) implemented in scikit-learn itself*2
- Stochastic Regression Imputation
  - Adds noise to the imputed values
- All single-imputation methods underestimate the variance (uncertainty)

*1 https://scikit-learn.org/stable/modules/impute.html#impute
*2 https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py
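The two scikit-learn imputers named above can be compared on a toy matrix (the data values here are invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

# Mean imputation: each NaN is replaced by its column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Regression imputation: each feature is modeled from the others
# (BayesianRidge by default; any scikit-learn estimator can be passed).
X_reg = IterativeImputer(random_state=0).fit_transform(X)

print(X_mean[1, 1])  # column mean of the observed [2, 6, 8]
```

Both calls return a single completed matrix, which is exactly the variance-underestimation problem the next slide addresses.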

## Slide 10

### Slide 10 text

3 - 4. Multiple Imputation*1*2

- Overcomes the variance underestimation of single imputation
- Python packages: fancyimpute, Impyute, scikit-learn IterativeImputer
- R packages: Norm (MCMC, Data Augmentation), MICE (FCS, Chained Equations)*3, Amelia II (EMB)

Procedure (ensemble / simple blending):
1. Build M models with M separately imputed datasets
2. Average the M results to get predictions

*1 Little and Rubin, 2002, Statistical Analysis with Missing Data, 2nd Edition
*2 Rubin, 1987, Multiple Imputation for Nonresponse in Surveys
*3 van Buuren et al., 2011, mice: Multivariate Imputation by Chained Equations in R

## Slide 11

### Slide 11 text

scikit-learn IterativeImputer
https://scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation

- IterativeImputer returns a single imputation
- With different random seeds and sample_posterior=True, we can get multiple imputations
- This is still an open problem!
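The seed-variation trick described above can be sketched as follows (the data and M = 5 are illustrative; the scikit-learn docs note this usage is not yet a turnkey multiple-imputation API):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.random((100, 3))
X[rng.random(100) < 0.2, 0] = np.nan

# One IterativeImputer run is a single imputation; varying the seed with
# sample_posterior=True draws from the posterior, giving M distinct datasets.
M = 5
imputed = [
    IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for m in range(M)
]
```

The downstream step from the previous slide then applies unchanged: fit one model per completed dataset in `imputed` and average the M predictions.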

## Slide 12

### Slide 12 text

MICE (Multiple Imputation by Chained Equations)

MICE algorithm (Fully Conditional Specification):
- Step 1. Specify the conditional distributions P(X_j | X_{-j}), 1 ≤ j ≤ p (j: index of a column with missing values)
- Step 2. Initialize X^(0) (the initial missing entries are sampled from the observations)
- Step 3. For 1 ≤ d ≤ D and 1 ≤ j ≤ p, fit the imputation model and draw
  X_j^(d) ~ P(X_j | X_1^(d), ..., X_{j-1}^(d), X_{j+1}^(d-1), ..., X_p^(d-1))
- Step 4. Discard the burn-in phase and keep M (< D) imputed, pseudo-complete datasets

Imputation models: P(X_j | X_{-j})

(The slide walks through an example: at d = 1, j = 1 the model for X1 is fitted and sampled; at d = 1, j = 3 the model for X3 is fitted; accompanying tables show a dataset with missing entries being progressively completed.)
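The chained-equations loop in Steps 2-3 can be sketched in plain Python. This is a simplified, deterministic variant (plain linear regression instead of drawing from a posterior, no burn-in handling), so it illustrates the cycling structure rather than full MICE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.random((200, 3))
X[rng.random(200) < 0.2, 0] = np.nan
X[rng.random(200) < 0.2, 1] = np.nan

# Columns that contain missing values, with their missingness masks.
miss = {j: np.isnan(X[:, j]) for j in range(X.shape[1]) if np.isnan(X[:, j]).any()}

# Step 2: initialize missing entries with draws from the observed values.
Xf = X.copy()
for j, m in miss.items():
    Xf[m, j] = rng.choice(X[~m, j], size=m.sum())

# Step 3: cycle over incomplete columns, refitting X_j | X_{-j} each pass.
for d in range(10):                      # D iterations
    for j, m in miss.items():
        others = [k for k in range(X.shape[1]) if k != j]
        model = LinearRegression().fit(Xf[~m][:, others], Xf[~m, j])
        Xf[m, j] = model.predict(Xf[m][:, others])
```

Real MICE (the R `mice` package, or `IterativeImputer` with `sample_posterior=True`) replaces the `predict` line with a stochastic draw, which is what makes repeated runs yield genuinely different imputed datasets.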

## Slide 13

### Slide 13 text

4. Missing Value Treatment in XGBoost

4 - 1. Review of XGBoost
4 - 2. Missing Value Flow

## Slide 14

### Slide 14 text

4 - 1. Review of XGBoost

Gradient boosting fits trees to residuals, round by round:
- Round 1: predict y1 for target y; residual y - y1
- Round 2: predict y2 for residual y - y1; residual y - y1 - y2
- Round N: predict yN for residual y - Σ_{i=1}^{N-1} yi
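The residual-fitting loop above can be written out in a few lines; this sketch uses shallow scikit-learn trees and an illustrative learning rate of 0.5 in place of a real GBDT package:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((300, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1]

# Each round fits a shallow tree to the residual y - (sum of predictions so far).
pred = np.zeros_like(y)
for n in range(20):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)
    pred += 0.5 * tree.predict(X)   # 0.5 = learning rate

print(np.mean((y - pred) ** 2))  # training MSE shrinks round by round
```

XGBoost follows the same skeleton but fits each tree to gradients/Hessians of the loss rather than raw residuals, and adds regularization on the tree structure.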

## Slide 15

### Slide 15 text

4 - 2. Missing Value Flow

Chen, et al., 2016, XGBoost

At each split, NA values follow a learned default direction:
1. Case L: NA goes to Left. NAs always go left; a threshold is found using the non-missing values (e.g. X1 = 0.64, 0.58, ..., 0.43); the loss is computed including the NA values.
2. Case R: NA goes to Right. NAs always go right; the same procedure as Case L.
3. Select Case L or Case R, whichever reduces the loss the most.

(The slide illustrates this with an example table and a decision tree at round n, with splits on x1, x2, x3 and the NA flow along the default direction.)
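The default-direction search can be demonstrated numerically. This toy sketch (invented data, squared loss instead of XGBoost's gradient statistics) tries every threshold and both NA assignments, keeping the lowest-loss combination:

```python
import numpy as np

# One feature with NaNs and a regression target.
x = np.array([0.1, 0.2, np.nan, 0.4, np.nan, 0.8])
y = np.array([1.0, 1.1, 1.0, 3.0, 3.1, 3.2])

def sse(v):
    # Sum of squared errors around the mean; 0 for an empty child.
    return ((v - v.mean()) ** 2).sum() if len(v) else 0.0

obs = ~np.isnan(x)
best = None
for t in np.unique(x[obs]):
    left, right = obs & (x <= t), obs & (x > t)
    for na_goes_left in (True, False):
        l = left | (~obs if na_goes_left else np.zeros_like(obs))
        r = right | (np.zeros_like(obs) if na_goes_left else ~obs)
        loss = sse(y[l]) + sse(y[r])
        if best is None or loss < best[0]:
            best = (loss, t, na_goes_left)

print(best)  # (loss, threshold, whether NA defaults left)
```

Only the observed values generate candidate thresholds, but the NA rows' targets enter the loss of whichever child they default to, which is exactly the Case L / Case R comparison in step 3.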

## Slide 16

### Slide 16 text

5. Naïve Experiment

5 - 1. Task and Dataset
5 - 2. Basic Model Pipeline
5 - 3. Model Pipeline with MICE
5 - 4. Result 1
5 - 5. Result 2

## Slide 17

### Slide 17 text

5 - 1. Task and Dataset

Specification:
- Binary classification (0: 27,300 examples; 1: 87,021 examples)
- Metric: logloss
- train: 114,321 x 133; test: 114,393 x 132
- v1 ~ v131: anonymous features
- A private-score difference of ~0.000X moves the leaderboard position

## Slide 18

### Slide 18 text

Missing Map

- Many NaNs
- 0 ~ 50% missing in each feature (v1 ~ v131)
- Some missing patterns observed (explored later)

(The slide shows a missing map and a bar chart of missing counts per feature.)

## Slide 19

### Slide 19 text

5 - 2. Basic Model Pipeline

Preprocessing:
- Drop useless columns: 131 => 21 (12 numeric, 9 categorical)
- Ordinal Encoder (category_encoders): fit_transform on train, transform on test

Hyperparameter tuning:
- HyperOpt
- 5-fold CV with early stopping, minimizing mean log loss
- Seed 2019 for CV and XGBoost

Training:
- 5-fold CV
- 5 seeds for fold patterns: 2020, 2021, 2022, 2023, 2024
- One model per seed; their predictions are averaged for the public/private score, CV score, and OOF prediction

## Slide 20

### Slide 20 text

Missing Combinations on selected columns (R package `VIM`)

- The most frequent pattern is missingness of a single feature, v30
- About 45% of train + test examples are complete (NOT missing)
- Little's test*1 rejects the null hypothesis that the data are MCAR (p < 1e-32), so the data are not MCAR

*1 Little, R. J. A., 1988, A test of missing completely at random for multivariate data with missing values

## Slide 21

### Slide 21 text

5 - 3. Model Pipeline with MICE

MICE (R package):
- Imputed datasets: M = 30*1
- Max iterations: m = 20
- Imputation models: pmm for v10, v12, v14, v21, v34, v40, v50, v114; polyreg for v30, v31, v52, v91

*1 Carpenter, et al., 2013, Multiple Imputation and Its Application

## Slide 22

### Slide 22 text

5 - 4. Result 1

## Slide 23

### Slide 23 text

- XGBoost with NAs is the best in both public and private scores
- Mean Imputation does not work (MAR?)
- More than 5 MICE datasets would be necessary to check the performance

## Slide 24

### Slide 24 text

Excellent Features of XGBoost

1. XGBoost can consider the relation between NAs and target values.
2. At each round, NAs are allocated to the optimal side of the split to decrease the loss function. For example, imagine the elements of X1 are [NA, 1, 2, 3, 4, 5] and X1 is used in the trees at rounds k and l. The NAs of X1 may be allocated to the subset [1, 2] (against [3, 4, 5]) in round k, but to the subset [4, 5] (against [1, 2, 3]) in round l. In other words, X1 is effectively imputed with 1 or 2 in round k and with 4 or 5 in round l. This is similar to Multiple Imputation.

## Slide 25

### Slide 25 text

5 - 5. Result 2

XGBoost can consider the relation between NAs and target values. What if we add NA features (indicating missingness) to the MICE datasets?

| na_v10 | na_v12 | na_v14 | ... |
|--------|--------|--------|-----|
| 0 | 0 | 0 | ... |
| 0 | 0 | 1 | ... |
| 1 | 0 | 0 | ... |
| 0 | 0 | 0 | ... |
| 0 | 0 | 0 | ... |
| 0 | 0 | 1 | ... |
| 1 | 0 | 0 | ... |
| 1 | 0 | 0 | ... |
| 0 | 1 | 0 | ... |
| 0 | 0 | 0 | ... |
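Building such indicator columns is a one-liner in pandas; this sketch uses a made-up two-column frame and mean imputation as a stand-in for one MICE-completed dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "v10": [0.2, np.nan, 0.5, 0.1],
    "v12": [np.nan, 0.3, 0.8, np.nan],
})

# Build the indicator columns BEFORE imputing, then append them to each
# completed dataset so the model keeps access to the missingness signal.
na_flags = df.isna().astype(int).add_prefix("na_")
imputed = df.fillna(df.mean())          # stand-in for one MICE dataset
augmented = pd.concat([imputed, na_flags], axis=1)

print(list(augmented.columns))
```

With M MICE datasets, the same `na_flags` frame is concatenated onto each of them, since the missingness pattern does not change across imputations.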


## Slide 27

### Slide 27 text

- Adding NA features improves the MICE scores
- XGBoost with NAs already uses the missingness information, so its improvement is not as large as that of MICE

Thank you!