Imputation Strategy @ Kaggle Days Tokyo (Maxwell)

Maxwell
December 11, 2019


This presentation was given at Kaggle Days Tokyo, organized by Kaggle and Google Cloud at Roppongi Hills, Tokyo, on December 11th.


Transcript

  1. 2.

    Agenda
    1. Self-Introduction
    2. Motivation
    3. Missing Value Treatment in Statistics
    4. Missing Value Treatment in XGBoost
    5. Naïve Experiment
  2. 3.

    1. Self-Introduction
     An Actuarial Data Scientist working at a global insurance company
     A seconded employee at the University of Tokyo, as a researcher of data health
     One of the authors of the Japanese Kaggle book `Data Science Skill to Win Kaggle`
     BS and MS degrees in Physics from the University of Tokyo

    Japanese Kaggle Book, released Oct. 9, 2019. In stores now! https://amzn.to/2NdxKsj
  3. 4.

    2. Motivation
     In statistics, there are many imputation methods. Multiple Imputation (Rubin, 1987) is unbiased and recommended*1 when the missing pattern is MCAR or MAR (explained later).
     To the best of my knowledge, Multiple Imputation seems NOT to be used in Kaggle competitions. In many top solutions, missing values were kept as-is and fed into machine learning models without imputation (especially into boosting trees, e.g. XGBoost, LightGBM, ...).
    *1 National Research Council, 2010, The Prevention and Treatment of Missing Data in Clinical Trials
  4. 5.

    Chapter 3.3.1, "Use missing values as they are": GBDT packages can treat missing values without any imputation. It is natural to treat missing values as they are, because a missing value itself can carry information about how the data were generated. (Translated from the Japanese original text.)
  5. 6.

    3. Missing Value Treatment in Statistics
    3 - 1. MCAR, MAR and NMAR
    3 - 2. LD, PD
    3 - 3. Single Imputation
    3 - 4. Multiple Imputation
  6. 7.

    3 - 1. MCAR, MAR and NMAR
     MCAR (Missing Completely At Random): X1 missing does not depend on any variable
     MAR (Missing At Random): X1 missing depends on X2
     NMAR (Not Missing At Random): X1 missing depends not only on X2 but also on X1 itself; hard to impute without knowing the missing mechanism
    [Figure: (x1, x2) scatter plots under each mechanism, distinguishing "Not missing" vs "X1 missing" points, alongside an example table of 17 rows where some X1 values are missing]
  7. 8.

    3 - 2. LD, PD  LD(Listwise Deletion) - Basically

    applicable to only MCAR - Remove any examples which has at least one missing value - Decrease sample size  PD(Pairwise Deletion) - Basically applicable to only MCAR - Decrease sample size - When modeling with all features, this is identical to LD ID X1 X2 X3 1 0.06 0.28 0.42 2 0.51 0.51 0.82 3 0.14 4 0.05 0.01 0.73 5 0.52 0.61 0.85 6 0.45 0.36 7 0.94 0.98 0.39 8 0.30 0.73 9 0.75 0.84 0.54 10 0.47 0.00 11 0.49 12 0.72 0.80 0.30 13 0.13 0.83 0.08 14 0.77 0.11 0.29 15 0.43 0.90 0.18 16 0.91 17 0.18 0.37 0.23 Listwise Deletion sample size: 11 Pairwise Deletion on X2 and X3 sample size: 13
  8. 9.

    3 - 3. Single Imputation  Applicable to MCAR or

    MAR  Mean Imputation - when only used in the MCAR case, the estimated parameter remains unbiased - scikit-learn SimpleImputer*1  Regression Imputation, Matching - scikit-learn IterativeImputer*1 can use any estimator class ( BayesianRidge, DT, ExT, kNN, ... ) implemented in scikit-learn itself *2  Stochastic Regression Imputation - add the noise to imputation values  Underestimate the variance (uncertainty) *1 https://scikit-learn.org/stable/modules/impute.html#impute *2 https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py x1 x2 Not missing X1 Missing X1 Imputed x1 mean Mean Imputation in MAR
  9. 10.

    3 - 4. Multiple Imputation*1*2
     Overcomes the variance underestimation of single imputation
     Python packages: fancyimpute, Impyute, scikit-learn IterativeImputer
     R packages: Norm (MCMC, Data Augmentation), MICE (FCS, Chained Equations)*3, Amelia II (EMB)
     Procedure:
    1. Build M models with M separately imputed datasets
    2. Average the M results to get predictions (simple blending)
    *1 Little and Rubin, 2002, Statistical Analysis with Missing Data, 2nd Edition
    *2 Rubin, 1987, Multiple Imputation for Nonresponse in Surveys
    *3 Buuren et al., 2011, mice: Multivariate Imputation by Chained Equations in R
    [Diagram: Missing Data → Imputed Data 1..M → Model 1..M → Ensemble (Simple Blending)]
  10. 11.

    scikit-learn IterativeImputer
    https://scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation
     IterativeImputer returns a single imputation
     With sample_posterior=True and different random seeds, we can get multiple imputations
     How best to use multiple imputation in predictive modeling is still an open problem!
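    A sketch of the seed-variation trick on synthetic data (the dataset, model choice, and M = 5 are assumptions for illustration): each seed yields a different posterior-sampled completion of the data, one model is fit per completion, and the predictions are blended.

    ```python
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_full = rng.random((200, 4))
    y = (X_full[:, 0] + X_full[:, 1] > 1).astype(int)
    X = X_full.copy()
    X[rng.random(X.shape) < 0.2] = np.nan  # inject ~20% MCAR missingness

    M = 5  # number of imputed datasets
    preds = []
    for seed in range(M):
        # sample_posterior=True draws imputations from the predictive
        # distribution, so different seeds give different completed datasets
        imp = IterativeImputer(sample_posterior=True, random_state=seed)
        X_m = imp.fit_transform(X)
        model = LogisticRegression().fit(X_m, y)
        preds.append(model.predict_proba(X_m)[:, 1])

    # Simple blending: average the M models' predicted probabilities
    pred = np.mean(preds, axis=0)
    ```

    This mirrors the M-datasets-then-average procedure on the previous slide, with IterativeImputer standing in for a full MICE implementation.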
  11. 12.

    MICE (Multiple Imputation by Chained Equations)
     MICE algorithm (Fully Conditional Specification):
    step 1. Specify the conditional distributions P(X_j | X_{-j}, θ_j), 1 ≤ j ≤ p (j: index of a column with missing values)
    step 2. Initialize the imputations X_j^(0) (sampled from the observed values)
    step 3. For 1 ≤ d ≤ D and 1 ≤ j ≤ p, generate
      θ_j^(d) ⇐ P(θ_j | X_1^(d), ..., X_{j-1}^(d), X_{j+1}^(d-1), ..., X_p^(d-1))
      X_j^(d) ∼ P(X_j | X_1^(d), ..., X_{j-1}^(d), X_{j+1}^(d-1), ..., X_p^(d-1), θ_j^(d))
    step 4. Discard the burn-in phase and sample M (< D) imputed (pseudo-complete) datasets
    [Tables: an example 10-row dataset with missing values, followed by the successively completed datasets produced as each column's imputation model is fitted and sampled (d = 1, j = 1 fitting; d = 1, j = 1 sampling; d = 1, j = 3 fitting)]
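    The chained-equations loop above can be sketched in a few lines. This is a simplified illustration: it uses stochastic linear regression for every column and skips the posterior draw of θ_j (step 3's first line), so it is a chained stochastic-regression sketch rather than full MICE.

    ```python
    import numpy as np

    def mice_sketch(X, n_iter=10, rng=None):
        """Minimal chained-equations imputation (Gaussian models only)."""
        rng = np.random.default_rng(rng)
        X = X.copy()
        miss = np.isnan(X)
        # step 2: initialize each missing cell by sampling observed values
        for j in range(X.shape[1]):
            obs = X[~miss[:, j], j]
            X[miss[:, j], j] = rng.choice(obs, miss[:, j].sum())
        # step 3: cycle through columns, re-imputing each from the others
        for _ in range(n_iter):
            for j in range(X.shape[1]):
                if not miss[:, j].any():
                    continue
                others = np.delete(X, j, axis=1)
                A = np.c_[np.ones(len(X)), others]
                # fit least squares on rows where column j is observed
                beta, *_ = np.linalg.lstsq(A[~miss[:, j]],
                                           X[~miss[:, j], j], rcond=None)
                resid = X[~miss[:, j], j] - A[~miss[:, j]] @ beta
                # stochastic regression: add noise to preserve variance
                X[miss[:, j], j] = (A[miss[:, j]] @ beta
                                    + rng.normal(0, resid.std(),
                                                 miss[:, j].sum()))
        return X

    rng = np.random.default_rng(1)
    X_demo = rng.random((50, 3))
    X_demo[rng.random(X_demo.shape) < 0.2] = np.nan
    X_completed = mice_sketch(X_demo, n_iter=5, rng=0)
    ```

    Running it D times with different seeds and keeping the post-burn-in draws would give the M pseudo-complete datasets of step 4.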
  12. 13.

    4. Missing Value Treatment in XGBoost
    4 - 1. Review of XGBoost
    4 - 2. Missing Value Flow
  13. 14.

    4 - 1. Review of XGBoost
    Tree 1: fits the target y, predicts ŷ1, leaving residual y − ŷ1
    Tree 2: fits y − ŷ1, predicts ŷ2, leaving residual y − ŷ1 − ŷ2
    ...
    Tree N: fits the remaining residual y − Σ_{i=1}^{N−1} ŷi, predicts ŷN
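    The residual-fitting scheme above can be shown with plain decision trees (the data, tree depth, and learning rate 0.3 are made up for illustration; real XGBoost fits gradients with regularization rather than raw residuals):

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.random((200, 2))
    y = np.sin(3 * X[:, 0]) + X[:, 1]

    # Each new tree is fit to the residual left by the previous trees
    pred = np.zeros_like(y)
    trees = []
    for _ in range(20):
        tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)
        pred += 0.3 * tree.predict(X)  # 0.3 = learning rate (shrinkage)
        trees.append(tree)

    mse = float(np.mean((y - pred) ** 2))
    ```

    The final prediction is the (shrunken) sum of all trees' outputs, so the training error drops well below the variance of y.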
  14. 15.

    4 - 2. Missing Value Flow
    Chen, et al., 2016, XGBoost
    At each split, NA values follow a learned default direction:
    1. Case L: NA goes Left — all NAs go to the left child; a threshold is found using the non-missing values (e.g. X1 = 0.64, 0.58, ..., 0.43); the loss is computed including the NA values
    2. Case R: NA goes Right — all NAs go to the right child; the same procedure as Case L
    3. Select Case L or Case R, whichever reduces the loss the most
    [Figure: example 10-row table with missing values and a decision tree at round n with splits on x1, x2, x3, showing the NA flow (default direction)]
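    The effect of a learned default direction can be demonstrated with scikit-learn's HistGradientBoostingClassifier, used here as a stand-in for XGBoost since it implements the same native NaN routing (the synthetic data and missingness pattern are assumptions for illustration):

    ```python
    import numpy as np
    from sklearn.ensemble import HistGradientBoostingClassifier

    rng = np.random.default_rng(0)
    X = rng.random((500, 3))
    y = (X[:, 0] > 0.5).astype(int)
    # Make X1 missing preferentially when y == 1: informative missingness
    X[(y == 1) & (rng.random(500) < 0.5), 0] = np.nan

    # At every split the model sends NaNs left or right, whichever
    # direction reduces the loss more -- no imputation is needed
    clf = HistGradientBoostingClassifier(random_state=0).fit(X, y)
    acc = clf.score(X, y)
    ```

    Because missingness itself is predictive here, routing the NaNs as their own group lets the model recover the label even for rows where X1 is unobserved.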
  15. 16.

    5. Naïve Experiment
    5 - 1. Task and Dataset
    5 - 2. Basic Model Pipeline
    5 - 3. Model Pipeline with MICE
    5 - 4. Result 1
    5 - 5. Result 2
  16. 17.

    5 - 1. Task and Dataset  Binary Classification 0:

    27,300 examples 1: 87,021 examples  Logloss  train : 114,321 x 133 test : 114,393 x 132  v1 ~ v131 anonymous features  Digits of Private Score Difference ~ 0.000X / position Specification
  17. 18.

    Missing Map v1 v131  Many NaNs  0 ~

    50% missing in each feature  Some missing patterns observed ( Explore this later ) Counts Features v1 v131
  18. 19.

    5 - 2. Basic Model Pipeline
    Preprocessing:
     Drop useless columns: 131 => 21 (12 numeric, 9 categorical)
     Ordinal Encoder (category_encoders): fit_transform on train, transform on test
    Hyperparameter Tuning:
     HyperOpt
     5-fold CV with early stopping, minimizing mean log loss
     Seed 2019 for CV and XGBoost training
    Training:
     5-fold CV, with 5 seeds for the fold patterns: 2020, 2021, 2022, 2023, 2024
     The 5 models' predictions are averaged for the Public / Private score; the CV score comes from OOF predictions
  19. 20.

    Missing Combinations on selected columns (R package `VIM`)
     The most frequent pattern is missingness of a single feature, v30
     About 45% of train + test examples are complete (nothing missing)
     Little's test*1 rejects the null hypothesis that the data are MCAR (p < 1e-32)
    *1 Little, R. J. A., 1988, A test of missing completely at random for multivariate data with missing values
  20. 21.

    5 - 3. Model Pipeline with MICE
    MICE (R package)
     Imputed datasets: M = 30 *1
     Max iterations: m = 20
     Imputation models: v10: pmm, v12: pmm, v14: pmm, v21: pmm, v30: polyreg, v31: polyreg, v34: pmm, v40: pmm, v50: pmm, v52: polyreg, v91: polyreg, v114: pmm
    *1 Carpenter, et al., 2013, Multiple Imputation and Its Application
  21. 23.

    5 - 4. Result 1
     XGBoost with NAs is best on both the public and private leaderboards
     Mean Imputation does not work (because the data are MAR?)
     More than 5 MICE datasets would be necessary to properly check the performance
  22. 24.

    Excellent Features of XGBoost
    1. XGBoost can learn the relation between NAs and the target values.
    2. At each round, NAs are allocated to whichever subset decreases the loss function most. For example, imagine X1 takes the values [NA, 1, 2, 3, 4, 5] and is used in trees at rounds k and l. The NAs of X1 might be allocated to the subset [1, 2] (against [3, 4, 5]) at round k, but to the subset [4, 5] (against [1, 2, 3]) at round l. In other words, X1 is effectively imputed with 1 or 2 at round k and with 4 or 5 at round l. This is similar to Multiple Imputation.
    [Diagram: the trees at rounds k and l, each fitting the residual y − Σ ŷi and splitting on x1 with a different NA allocation]
  23. 25.

    5 - 5. Result 2
    XGBoost can learn the relation between NAs and the target values. What if we add NA features (0/1 indicators of missingness) to the MICE datasets?
    [Table: indicator columns na_v10, na_v12, na_v14, ... with one 0/1 entry per example]
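    A minimal sketch of building such NA indicator features with pandas (the column names echo the slide; the values and the mean-fill step are placeholders — in the experiment the filled values would come from the MICE datasets):

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"v10": [0.2, np.nan, 0.5],
                       "v12": [np.nan, 0.1, 0.3]})

    # Record missingness as 0/1 flags BEFORE any imputation,
    # then append them so the model can still see the missing pattern
    na_flags = df.isna().astype(int).add_prefix("na_")
    df_aug = pd.concat([df.fillna(df.mean()), na_flags], axis=1)
    ```

    The model then receives both the imputed values and the original missingness pattern, which is exactly the information XGBoost-with-NAs exploits natively.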
  24. 26.
  25. 27.

     Adding NA features improves the MICE scores
     XGBoost with NAs already uses the missing information, so its improvement from added NA features is smaller than MICE's