Imputation Strategy @ Kaggle Days Tokyo (Maxwell)

Maxwell
December 11, 2019


This presentation was given at Kaggle Days Tokyo, organized by Kaggle and Google Cloud at Roppongi Hills, Tokyo, on December 11th.


Transcript

  1. 2.

    Agenda
    1. Self-Introduction
    2. Motivation
    3. Missing Value Treatment in Statistics
    4. Missing Value Treatment in XGBoost
    5. Naïve Experiment
  2. 3.

    1. Self-Introduction
     An Actuarial Data Scientist working at a global insurance company
     A seconded employee at the University of Tokyo, as a researcher of data health
     One of the authors of the Japanese Kaggle book `Data Science Skill to Win Kaggle`
     BS and MS degrees in Physics from the University of Tokyo

    Japanese Kaggle Book, released Oct. 9, 2019. In stores now! https://amzn.to/2NdxKsj
  3. 4.

    2. Motivation
     In statistics, there are many imputation methods. Multiple Imputation (Rubin, 1987) is unbiased and recommended*1 when the missing pattern is MCAR or MAR (explained later).
     To the best of my knowledge, Multiple Imputation seems NOT to be used in Kaggle competitions. In many top solutions, missing values were kept as-is and fed into machine learning models without imputation (especially into boosting trees, e.g. XGBoost, LightGBM, ...).
    *1 National Research Council, 2010, The Prevention and Treatment of Missing Data in Clinical Trials
  4. 5.

    Chapter 3.3.1, "Use missing values as they are": GBDT packages can treat missing values without any imputation. It is natural to treat missing values as they are, because a missing value itself can carry information about how the data were generated. (Translated from the Japanese original text.)
  5. 6.

    3. Missing Value Treatment in Statistics
    3 - 1. MCAR, MAR and NMAR
    3 - 2. LD, PD
    3 - 3. Single Imputation
    3 - 4. Multiple Imputation
  6. 7.

    3 - 1. MCAR, MAR and NMAR
     MCAR (Missing Completely At Random): X1 missing does not depend on any variable
     MAR (Missing At Random): X1 missing depends on X2
     NMAR (Not Missing At Random): X1 missing depends not only on X2 but also on X1 itself; hard to impute without knowing the missing mechanism
    [Figure: (x1, x2) scatter plots under each mechanism, distinguishing "Not missing" vs "X1 missing" points, alongside an example table of 17 rows where some X1 values are missing]
  7. 8.

    3 - 2. LD, PD  LD(Listwise Deletion) - Basically

    applicable to only MCAR - Remove any examples which has at least one missing value - Decrease sample size  PD(Pairwise Deletion) - Basically applicable to only MCAR - Decrease sample size - When modeling with all features, this is identical to LD ID X1 X2 X3 1 0.06 0.28 0.42 2 0.51 0.51 0.82 3 0.14 4 0.05 0.01 0.73 5 0.52 0.61 0.85 6 0.45 0.36 7 0.94 0.98 0.39 8 0.30 0.73 9 0.75 0.84 0.54 10 0.47 0.00 11 0.49 12 0.72 0.80 0.30 13 0.13 0.83 0.08 14 0.77 0.11 0.29 15 0.43 0.90 0.18 16 0.91 17 0.18 0.37 0.23 Listwise Deletion sample size: 11 Pairwise Deletion on X2 and X3 sample size: 13
  8. 9.

    3 - 3. Single Imputation  Applicable to MCAR or

    MAR  Mean Imputation - when only used in the MCAR case, the estimated parameter remains unbiased - scikit-learn SimpleImputer*1  Regression Imputation, Matching - scikit-learn IterativeImputer*1 can use any estimator class ( BayesianRidge, DT, ExT, kNN, ... ) implemented in scikit-learn itself *2  Stochastic Regression Imputation - add the noise to imputation values  Underestimate the variance (uncertainty) *1 https://scikit-learn.org/stable/modules/impute.html#impute *2 https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py x1 x2 Not missing X1 Missing X1 Imputed x1 mean Mean Imputation in MAR
  9. 10.

    3 - 4. Multiple Imputation*1*2
     Overcomes the variance underestimation of single imputation
     Python packages: fancyimpute, Impyute, scikit-learn IterativeImputer
     R packages: Norm (MCMC, Data Augmentation), MICE (FCS, Chained Equations)*3, Amelia II (EMB)
     Procedure:
    1. Build M models with M separately imputed datasets
    2. Average the M results to get predictions (simple blending)
    *1 Little and Rubin, 2002, Statistical Analysis with Missing Data, 2nd Edition
    *2 Rubin, 1987, Multiple Imputation for Nonresponse in Surveys
    *3 Buuren et al., 2011, mice: Multivariate Imputation by Chained Equations in R
    [Diagram: Missing Data → Imputed Data 1..M → Model 1..M → Ensemble (Simple Blending)]
  10. 11.

    scikit-learn IterativeImputer
    https://scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation
     IterativeImputer returns a single imputation
     With sample_posterior=True and different random seeds, we can get multiple imputations
     How best to use multiple imputation in predictive modeling is still an open problem!
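    A sketch of the seed-variation trick on synthetic data (the dataset, model choice, and M = 5 are assumptions for illustration): each seed yields a different posterior-sampled completion of the data, one model is fit per completion, and the predictions are blended.

    ```python
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_full = rng.random((200, 4))
    y = (X_full[:, 0] + X_full[:, 1] > 1).astype(int)
    X = X_full.copy()
    X[rng.random(X.shape) < 0.2] = np.nan  # inject ~20% MCAR missingness

    M = 5  # number of imputed datasets
    preds = []
    for seed in range(M):
        # sample_posterior=True draws imputations from the predictive
        # distribution, so different seeds give different completed datasets
        imp = IterativeImputer(sample_posterior=True, random_state=seed)
        X_m = imp.fit_transform(X)
        model = LogisticRegression().fit(X_m, y)
        preds.append(model.predict_proba(X_m)[:, 1])

    # Simple blending: average the M models' predicted probabilities
    pred = np.mean(preds, axis=0)
    ```

    This mirrors the M-datasets-then-average procedure on the previous slide, with IterativeImputer standing in for a full MICE implementation.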
  11. 12.

    MICE (Multiple Imputation by Chained Equations)
     MICE algorithm (Fully Conditional Specification):
    step 1. Specify the conditional distributions P(X_j | X_{-j}, θ_j), 1 ≤ j ≤ p (j: index of a column with missing values)
    step 2. Initialize the imputations X_j^(0) (sampled from the observed values)
    step 3. For 1 ≤ d ≤ D and 1 ≤ j ≤ p, generate
      θ_j^(d) ⇐ P(θ_j | X_1^(d), ..., X_{j-1}^(d), X_{j+1}^(d-1), ..., X_p^(d-1))
      X_j^(d) ∼ P(X_j | X_1^(d), ..., X_{j-1}^(d), X_{j+1}^(d-1), ..., X_p^(d-1), θ_j^(d))
    step 4. Discard the burn-in phase and sample M (< D) imputed (pseudo-complete) datasets
    [Tables: an example 10-row dataset with missing values, followed by the successively completed datasets produced as each column's imputation model is fitted and sampled (d = 1, j = 1 fitting; d = 1, j = 1 sampling; d = 1, j = 3 fitting)]
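    The chained-equations loop above can be sketched in a few lines. This is a simplified illustration: it uses stochastic linear regression for every column and skips the posterior draw of θ_j (step 3's first line), so it is a chained stochastic-regression sketch rather than full MICE.

    ```python
    import numpy as np

    def mice_sketch(X, n_iter=10, rng=None):
        """Minimal chained-equations imputation (Gaussian models only)."""
        rng = np.random.default_rng(rng)
        X = X.copy()
        miss = np.isnan(X)
        # step 2: initialize each missing cell by sampling observed values
        for j in range(X.shape[1]):
            obs = X[~miss[:, j], j]
            X[miss[:, j], j] = rng.choice(obs, miss[:, j].sum())
        # step 3: cycle through columns, re-imputing each from the others
        for _ in range(n_iter):
            for j in range(X.shape[1]):
                if not miss[:, j].any():
                    continue
                others = np.delete(X, j, axis=1)
                A = np.c_[np.ones(len(X)), others]
                # fit least squares on rows where column j is observed
                beta, *_ = np.linalg.lstsq(A[~miss[:, j]],
                                           X[~miss[:, j], j], rcond=None)
                resid = X[~miss[:, j], j] - A[~miss[:, j]] @ beta
                # stochastic regression: add noise to preserve variance
                X[miss[:, j], j] = (A[miss[:, j]] @ beta
                                    + rng.normal(0, resid.std(),
                                                 miss[:, j].sum()))
        return X

    rng = np.random.default_rng(1)
    X_demo = rng.random((50, 3))
    X_demo[rng.random(X_demo.shape) < 0.2] = np.nan
    X_completed = mice_sketch(X_demo, n_iter=5, rng=0)
    ```

    Running it D times with different seeds and keeping the post-burn-in draws would give the M pseudo-complete datasets of step 4.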
  12. 13.

    4. Missing Value Treatment in XGBoost
    4 - 1. Review of XGBoost
    4 - 2. Missing Value Flow
  13. 14.

    4 - 1. Review of XGBoost
    Tree 1: fits the target y, predicts ŷ1, leaving residual y − ŷ1
    Tree 2: fits y − ŷ1, predicts ŷ2, leaving residual y − ŷ1 − ŷ2
    ...
    Tree N: fits the remaining residual y − Σ_{i=1}^{N−1} ŷi, predicts ŷN
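    The residual-fitting scheme above can be shown with plain decision trees (the data, tree depth, and learning rate 0.3 are made up for illustration; real XGBoost fits gradients with regularization rather than raw residuals):

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.random((200, 2))
    y = np.sin(3 * X[:, 0]) + X[:, 1]

    # Each new tree is fit to the residual left by the previous trees
    pred = np.zeros_like(y)
    trees = []
    for _ in range(20):
        tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)
        pred += 0.3 * tree.predict(X)  # 0.3 = learning rate (shrinkage)
        trees.append(tree)

    mse = float(np.mean((y - pred) ** 2))
    ```

    The final prediction is the (shrunken) sum of all trees' outputs, so the training error drops well below the variance of y.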
  14. 15.

    4 - 2. Missing Value Flow
    Chen, et al., 2016, XGBoost
    At each split, NA values follow a learned default direction:
    1. Case L: NA goes Left — all NAs go to the left child; a threshold is found using the non-missing values (e.g. X1 = 0.64, 0.58, ..., 0.43); the loss is computed including the NA values
    2. Case R: NA goes Right — all NAs go to the right child; the same procedure as Case L
    3. Select Case L or Case R, whichever reduces the loss the most
    [Figure: example 10-row table with missing values and a decision tree at round n with splits on x1, x2, x3, showing the NA flow (default direction)]
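    The effect of a learned default direction can be demonstrated with scikit-learn's HistGradientBoostingClassifier, used here as a stand-in for XGBoost since it implements the same native NaN routing (the synthetic data and missingness pattern are assumptions for illustration):

    ```python
    import numpy as np
    from sklearn.ensemble import HistGradientBoostingClassifier

    rng = np.random.default_rng(0)
    X = rng.random((500, 3))
    y = (X[:, 0] > 0.5).astype(int)
    # Make X1 missing preferentially when y == 1: informative missingness
    X[(y == 1) & (rng.random(500) < 0.5), 0] = np.nan

    # At every split the model sends NaNs left or right, whichever
    # direction reduces the loss more -- no imputation is needed
    clf = HistGradientBoostingClassifier(random_state=0).fit(X, y)
    acc = clf.score(X, y)
    ```

    Because missingness itself is predictive here, routing the NaNs as their own group lets the model recover the label even for rows where X1 is unobserved.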
  15. 16.

    5. Naïve Experiment
    5 - 1. Task and Dataset
    5 - 2. Basic Model Pipeline
    5 - 3. Model Pipeline with MICE
    5 - 4. Result 1
    5 - 5. Result 2
  16. 17.

    5 - 1. Task and Dataset  Binary Classification 0:

    27,300 examples 1: 87,021 examples  Logloss  train : 114,321 x 133 test : 114,393 x 132  v1 ~ v131 anonymous features  Digits of Private Score Difference ~ 0.000X / position Specification
  17. 18.

    Missing Map v1 v131  Many NaNs  0 ~

    50% missing in each feature  Some missing patterns observed ( Explore this later ) Counts Features v1 v131
  18. 19.

    5 - 2. Basic Model Pipeline
    Preprocessing:
     Drop useless columns: 131 => 21 (12 numeric, 9 categorical)
     Ordinal Encoder (category_encoders): fit_transform on train, transform on test
    Hyperparameter Tuning:
     HyperOpt
     5-fold CV with early stopping, minimizing mean log loss
     Seed 2019 for CV and XGBoost training
    Training:
     5-fold CV, with 5 seeds for the fold patterns: 2020, 2021, 2022, 2023, 2024
     The 5 models' predictions are averaged for the Public / Private score; the CV score comes from OOF predictions
  19. 20.

    Missing Combinations on selected columns (R package `VIM`)
     The most frequent pattern is missingness of a single feature, v30
     About 45% of train + test examples are complete (nothing missing)
     Little's test*1 rejects the null hypothesis that the data are MCAR (p < 1e-32)
    *1 Little, R. J. A., 1988, A test of missing completely at random for multivariate data with missing values
  20. 21.

    5 - 3. Model Pipeline with MICE
    MICE (R package)
     Imputed datasets: M = 30 *1
     Max iterations: m = 20
     Imputation models: v10: pmm, v12: pmm, v14: pmm, v21: pmm, v30: polyreg, v31: polyreg, v34: pmm, v40: pmm, v50: pmm, v52: polyreg, v91: polyreg, v114: pmm
    *1 Carpenter, et al., 2013, Multiple Imputation and Its Application
  21. 23.

    5 - 4. Result 1
     XGBoost with NAs is best on both the public and private leaderboards
     Mean Imputation does not work (because the data are MAR?)
     More than 5 MICE datasets would be necessary to properly check the performance
  22. 24.

    Excellent Features of XGBoost
    1. XGBoost can learn the relation between NAs and the target values.
    2. At each round, NAs are allocated to whichever subset decreases the loss function most. For example, imagine X1 takes the values [NA, 1, 2, 3, 4, 5] and is used in trees at rounds k and l. The NAs of X1 might be allocated to the subset [1, 2] (against [3, 4, 5]) at round k, but to the subset [4, 5] (against [1, 2, 3]) at round l. In other words, X1 is effectively imputed with 1 or 2 at round k and with 4 or 5 at round l. This is similar to Multiple Imputation.
    [Diagram: the trees at rounds k and l, each fitting the residual y − Σ ŷi and splitting on x1 with a different NA allocation]
  23. 25.

    5 - 5. Result 2
    XGBoost can learn the relation between NAs and the target values. What if we add NA features (0/1 indicators of missingness) to the MICE datasets?
    [Table: indicator columns na_v10, na_v12, na_v14, ... with one 0/1 entry per example]
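    A minimal sketch of building such NA indicator features with pandas (the column names echo the slide; the values and the mean-fill step are placeholders — in the experiment the filled values would come from the MICE datasets):

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"v10": [0.2, np.nan, 0.5],
                       "v12": [np.nan, 0.1, 0.3]})

    # Record missingness as 0/1 flags BEFORE any imputation,
    # then append them so the model can still see the missing pattern
    na_flags = df.isna().astype(int).add_prefix("na_")
    df_aug = pd.concat([df.fillna(df.mean()), na_flags], axis=1)
    ```

    The model then receives both the imputed values and the original missingness pattern, which is exactly the information XGBoost-with-NAs exploits natively.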
  24. 26.
  25. 27.

     Adding NA features improves the MICE scores
     XGBoost with NAs already uses the missing information, so its improvement from added NA features is smaller than MICE's