
Imputation Strategy @ Kaggle Days Tokyo (Maxwell)

Maxwell
December 11, 2019


This presentation was given at Kaggle Days Tokyo, organized by Kaggle and Google Cloud at Roppongi Hills, Tokyo, on December 11th.


Transcript

  1. Yuji Hiramatsu
    Sr. Data Scientist and Actuary
    Kaggle Master
    Imputation Strategy


  2. 1. Self-Introduction
    2. Motivation
    3. Missing Value Treatment in Statistics
    4. Missing Value Treatment in XGBoost
    5. Naïve Experiment
    Agenda


  3. Japanese Kaggle Book
    Released Oct. 9, 2019; in stores now!
    https://amzn.to/2NdxKsj
    • An actuarial data scientist working at a global insurance company
    • Seconded to the University of Tokyo as a researcher in data health
    • One of the authors of the Japanese Kaggle book
      `Data Science Skill to Win Kaggle`
    • BS and MS degrees in Physics from the University of Tokyo
    1. Self-Introduction


  4. • In statistics, there are many imputation methods.
    • Multiple Imputation (Rubin, 1987) is unbiased and recommended*1
      when the missing pattern is MCAR or MAR (explained later).
    • To the best of my knowledge, Multiple Imputation does not seem to be used in Kaggle competitions.
      In many top solutions, missing values were left as they are and
      fed into machine learning models without imputation
      (especially into boosting trees, e.g. XGBoost, LightGBM, ...).
    *1 National Research Council, 2010, The Prevention and Treatment of Missing Data in Clinical Trials
    2. Motivation


  5. chapter 3.3.1
    Use missing values as they are
    GBDT packages can handle missing values without any imputation.
    Treating missing values as they are is natural,
    because a missing value itself can carry information
    about how the data were generated.
    ...
    ...
    ...
    Japanese original text


  6. 3 - 1. MCAR, MAR and NMAR
    3 - 2. LD, PD
    3 - 3. Single Imputation
    3 - 4. Multiple Imputation
    3. Missing Value Treatment
    in Statistics


  7. 3 - 1. MCAR, MAR and NMAR
    ( Scatter plots of x1 vs. x2 for the three mechanisms, showing not-missing and missing X1 points )
    • MCAR (Missing Completely At Random): missingness of X1 does not depend on any variable
    • MAR (Missing At Random): missingness of X1 depends on X2
    • NMAR (Not Missing At Random): missingness of X1 depends not only on X2 but also on X1 itself;
      hard to impute without knowing the missing mechanism
    Example data ( NA = missing ):
    ID   X1     X2
    1    0.06   0.28
    2    0.51   0.51
    3    NA     0.14
    4    0.05   0.01
    5    0.52   0.61
    6    NA     0.45
    7    0.94   0.98
    8    0.30   0.73
    9    0.75   0.84
    10   NA     0.47
    11   NA     0.49
    12   0.72   0.80
    13   0.13   0.83
    14   0.77   0.11
    15   0.43   0.90
    16   NA     0.91
    17   0.18   0.37


  8. 3 - 2. LD, PD
    • LD (Listwise Deletion)
      - Basically applicable only to MCAR
      - Removes every example that has at least one missing value
      - Decreases the sample size
    • PD (Pairwise Deletion)
      - Basically applicable only to MCAR
      - Decreases the sample size
      - When modeling with all features, this is identical to LD
    Example data ( NA = missing ):
    ID   X1     X2     X3
    1    0.06   0.28   0.42
    2    0.51   0.51   0.82
    3    NA     0.14   NA
    4    0.05   0.01   0.73
    5    0.52   0.61   0.85
    6    NA     0.45   0.36
    7    0.94   0.98   0.39
    8    0.30   0.73   NA
    9    0.75   0.84   0.54
    10   NA     0.47   0.00
    11   NA     0.49   NA
    12   0.72   0.80   0.30
    13   0.13   0.83   0.08
    14   0.77   0.11   0.29
    15   0.43   0.90   0.18
    16   NA     0.91   NA
    17   0.18   0.37   0.23
    Listwise Deletion: sample size 11
    Pairwise Deletion on X2 and X3: sample size 13
    ( A pandas sketch of both deletions follows )
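    A minimal pandas sketch of the two deletion strategies above ( the toy DataFrame `df` is illustrative, not the slide's full table ):

    import numpy as np
    import pandas as pd

    # Toy data with missing values (NaN), mirroring the ID / X1 / X2 / X3 layout above.
    df = pd.DataFrame({
        "X1": [0.06, 0.51, np.nan, np.nan],
        "X2": [0.28, 0.51, 0.14, 0.01],
        "X3": [0.42, 0.82, np.nan, 0.73],
    })

    # Listwise Deletion: drop every row that has at least one missing value.
    ld = df.dropna()

    # Pairwise Deletion on X2 and X3: keep rows where both X2 and X3 are observed.
    pd_x2_x3 = df.dropna(subset=["X2", "X3"])

    print(len(df), len(ld), len(pd_x2_x3))   # 4, 2, 3 for this toy data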


  9. 3 - 3. Single Imputation
    • Applicable to MCAR or MAR
    • Mean Imputation
      - Only in the MCAR case does the estimated parameter remain unbiased
      - scikit-learn SimpleImputer*1
    • Regression Imputation, Matching
      - scikit-learn IterativeImputer*1 can use any estimator class ( BayesianRidge, DT, ExT, kNN, ... )
        implemented in scikit-learn itself*2
    • Stochastic Regression Imputation
      - adds noise to the imputed values
    • Single imputation underestimates the variance (uncertainty)
    ( Figure: mean imputation of X1 under MAR, showing not-missing, missing and imputed points around the x1 mean )
    ( A scikit-learn sketch follows )
    *1 https://scikit-learn.org/stable/modules/impute.html#impute
    *2 https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py
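    A minimal scikit-learn sketch of the single-imputation options above ( toy array; IterativeImputer is experimental and must be enabled explicitly ):

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import SimpleImputer, IterativeImputer
    from sklearn.ensemble import ExtraTreesRegressor

    X = np.array([[0.06, 0.28], [0.51, 0.51], [np.nan, 0.14], [0.05, 0.01]])

    # Mean imputation (SimpleImputer): replace each NaN with the column mean.
    X_mean = SimpleImputer(strategy="mean").fit_transform(X)

    # Regression-style imputation (IterativeImputer): each feature is modeled from the others;
    # any scikit-learn estimator can be plugged in (BayesianRidge by default, ExtraTrees here).
    X_iter = IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=100, random_state=0),
        random_state=0,
    ).fit_transform(X)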


  10. 3 - 4. Multiple Imputation*1*2
    • Overcomes the underestimation of variance (of single imputation)
    • Python packages
      - fancyimpute
      - Impyute
      - scikit-learn IterativeImputer
    • R packages
      - Norm (MCMC, Data Augmentation)
      - MICE (FCS, Chained Equations)*3
      - Amelia II (EMB)
    ( Diagram: the missing data are imputed M times into Imputed Data 1 ... Imputed Data M,
      one model is built per dataset, and the M models are ensembled by simple blending )
    1. Build M models with M separately imputed datasets
    2. Average the M results to get the predictions ( see the sketch below )
    *1 Little and Rubin, 2002, Statistical Analysis with Missing Data, 2nd Edition
    *2 Rubin, 1987, Multiple Imputation for Nonresponse in Surveys
    *3 Buuren et al., 2011, mice: Multivariate Imputation by Chained Equations in R
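    A minimal sketch of steps 1 and 2, assuming `imputed_train_sets` and `imputed_test_sets` are lists of M already-imputed feature matrices and `y` is the target ( these names and the LogisticRegression base model are placeholders, not the talk's setup ):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def multiple_imputation_ensemble(imputed_train_sets, y, imputed_test_sets):
        preds = []
        for X_tr, X_te in zip(imputed_train_sets, imputed_test_sets):
            # 1. Build one model per imputed dataset.
            model = LogisticRegression(max_iter=1000).fit(X_tr, y)
            preds.append(model.predict_proba(X_te)[:, 1])
        # 2. Average the M predictions (simple blending).
        return np.mean(preds, axis=0)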


  11. scikit-learn
    IterativeImputer
    https://scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation
    • IterativeImputer returns a single imputation
    • By setting sample_posterior=True and using different random seeds,
      we can get multiple imputations ( see the sketch below )
    • How useful this is for multiple imputation is still an open problem!
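    A minimal sketch of this idea: run IterativeImputer with sample_posterior=True under several random seeds to obtain M different imputed datasets ( the toy data and M = 5 are illustrative ):

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    rng = np.random.default_rng(0)
    X = rng.random((100, 5))
    X[rng.random(X.shape) < 0.2] = np.nan   # knock out ~20% of the values

    M = 5
    imputed_datasets = [
        # sample_posterior=True draws each imputation from the predictive distribution,
        # so different random_state values give genuinely different imputed datasets.
        IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
        for m in range(M)
    ]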


  12. MICE ( Multiple Imputation by Chained Equations )
    • MICE algorithm (Fully Conditional Specification)
    step 1. Specify the conditional distributions P( X_j | X_{-j}, θ_j ), 1 ≤ j ≤ p
            ( j: index of a column with missing values )
    step 2. Initialize X_j = X_j^(0) ( the X_j^(0) are sampled from the observed values )
    step 3. for 1 ≤ d ≤ D, for 1 ≤ j ≤ p, generate θ_j^(d) and X_j^(d):
            θ_j^(d) ~ P( θ_j | X_1^(d), ..., X_{j-1}^(d), X_{j+1}^(d-1), ..., X_p^(d-1) )
            X_j^(d) ~ P( X_j | X_1^(d), ..., X_{j-1}^(d), X_{j+1}^(d-1), ..., X_p^(d-1), θ_j^(d) )
    step 4. Discard the `burn-in` phase and sample M (< D) imputed, pseudo-complete datasets
    • Imputation models: P( X_j | X_{-j}, θ_j )
    Example ( NA = missing ):
    ID   X1     X2     X3     X4     X5
    1    0.64   0.23   0.41   0.37   0.37
    2    0.58   0.26   0.77   0.54   0.68
    3    NA     0.10   NA     0.57   0.81
    4    0.74   0.63   0.14   0.44   0.80
    5    0.07   0.40   0.00   0.13   NA
    6    NA     0.20   0.43   0.47   0.63
    7    0.15   0.99   0.33   0.42   0.54
    8    0.30   0.55   NA     0.94   0.18
    9    0.43   0.66   0.46   0.32   NA
    10   NA     0.12   0.22   0.73   0.60
    d = 1, j = 1, fitting θ_1^(1) ( X1 left missing; the other missing cells hold their initial values ):
    ID   X1     X2     X3     X4     X5
    1    0.64   0.23   0.41   0.37   0.37
    2    0.58   0.26   0.77   0.54   0.68
    3    NA     0.10   0.77   0.57   0.81
    4    0.74   0.63   0.14   0.44   0.80
    5    0.07   0.40   0.00   0.13   0.80
    6    NA     0.20   0.43   0.47   0.63
    7    0.15   0.99   0.33   0.42   0.54
    8    0.30   0.55   0.33   0.94   0.18
    9    0.43   0.66   0.46   0.32   0.18
    10   NA     0.12   0.22   0.73   0.60
    d = 1, j = 1, sampling X_1^(1) ( the missing X1 cells are drawn from the fitted model ):
    ID   X1     X2     X3     X4     X5
    1    0.64   0.23   0.41   0.37   0.37
    2    0.58   0.26   0.77   0.54   0.68
    3    0.30   0.10   0.77   0.57   0.81
    4    0.74   0.63   0.14   0.44   0.80
    5    0.07   0.40   0.00   0.13   0.80
    6    0.60   0.20   0.43   0.47   0.63
    7    0.15   0.99   0.33   0.42   0.54
    8    0.30   0.55   0.33   0.94   0.18
    9    0.43   0.66   0.46   0.32   0.18
    10   0.70   0.12   0.22   0.73   0.60
    d = 1, j = 3, fitting θ_3^(1) ( now X3 is the column being imputed ):
    ID   X1     X2     X3     X4     X5
    1    0.64   0.23   0.41   0.37   0.37
    2    0.58   0.26   0.77   0.54   0.68
    3    0.30   0.10   NA     0.57   0.81
    4    0.74   0.63   0.14   0.44   0.80
    5    0.07   0.40   0.00   0.13   0.80
    6    0.60   0.20   0.43   0.47   0.63
    7    0.15   0.99   0.33   0.42   0.54
    8    0.30   0.55   NA     0.94   0.18
    9    0.43   0.66   0.46   0.32   0.18
    10   0.70   0.12   0.22   0.73   0.60
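    An illustrative Python sketch of one chained-equations pass ( steps 2 and 3 above ), using BayesianRidge as every conditional model; this only mirrors the algorithm and is not the R mice setup ( pmm / polyreg ) used later:

    import numpy as np
    from sklearn.linear_model import BayesianRidge

    def chained_equations(X, n_iter=20, seed=0):
        """X: 2-D float array with np.nan marking missing cells."""
        X = X.copy()
        missing = np.isnan(X)
        rng = np.random.default_rng(seed)

        # step 2: initialize every missing cell with a value sampled from the observed ones
        for j in range(X.shape[1]):
            obs = X[~missing[:, j], j]
            X[missing[:, j], j] = rng.choice(obs, size=missing[:, j].sum())

        # step 3: cycle over the columns, refitting each conditional model and redrawing
        for _ in range(n_iter):
            for j in range(X.shape[1]):
                if not missing[:, j].any():
                    continue
                others = np.delete(X, j, axis=1)
                model = BayesianRidge().fit(others[~missing[:, j]], X[~missing[:, j], j])
                mean, std = model.predict(others[missing[:, j]], return_std=True)
                X[missing[:, j], j] = rng.normal(mean, std)   # a draw, not just the mean

        return X   # one pseudo-complete dataset; rerun with other seeds to get M of them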


  13. 4. Missing Value Treatment
    in XGBoost
    4 - 1. Review of XGBoost
    4 - 2. Missing Value Flow


  14. 4 - 1. Review of XGBoost
    ( Diagram: boosting builds N trees sequentially on residuals )
    Tree 1 is fit to the target y and predicts ŷ1.
    Tree 2 is fit to the residual y - ŷ1 and predicts ŷ2.
    Tree 3 is fit to y - ŷ1 - ŷ2, and so on;
    tree N is fit to the residual y - Σ_{i=1}^{N-1} ŷ_i.
    ( A residual-fitting sketch follows )
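    A minimal sketch of the residual-fitting idea above ( squared-error case, no shrinkage or regularization; XGBoost itself uses gradients and second-order terms ):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_boosted_trees(X, y, n_rounds=100, max_depth=3):
        trees, current_pred = [], np.zeros(len(y))
        for _ in range(n_rounds):
            # each new tree is fit to the residual of the ensemble built so far
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, y - current_pred)
            current_pred += tree.predict(X)
            trees.append(tree)
        return trees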


  15. 4 - 2. Missing Value Flow
    Chen, et al., 2016, XGBoost
    Example ( NA = missing ):
    ID   X1     X2     X3     X4     X5
    1    0.64   0.23   0.41   0.37   0.37
    2    0.58   0.26   0.77   0.54   0.68
    3    NA     0.10   NA     0.57   0.81
    4    0.74   0.63   0.14   0.44   0.80
    5    0.07   0.40   0.00   0.13   NA
    6    NA     0.20   0.43   0.47   0.63
    7    0.15   0.99   0.33   0.42   0.54
    8    0.30   0.55   NA     0.94   0.18
    9    0.43   0.66   0.46   0.32   NA
    10   NA     0.12   0.22   0.73   0.60
    ( Diagram: decision tree at round n with splits 1-3 on x1, x2, x3; the NA flow is the default direction,
      left or right, chosen at each split )
    1. Case L: NA goes to Left
       • NAs always go to the left child
       • Find a threshold using only the not-missing values ( e.g. X1 = 0.64, 0.58, ..., 0.43 )
       • The loss is computed including the NA examples
    2. Case R: NA goes to Right
       • NAs always go to the right child
       • The same procedure as Case L
    3. Select Case L or Case R, whichever reduces the loss the most ( see the sketch below )
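    A minimal sketch showing that XGBoost accepts NaN directly and learns the default (NA) direction at every split; the synthetic data and hyperparameters are placeholders:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.random((1000, 5))
    X[rng.random(X.shape) < 0.2] = np.nan           # keep the NaNs, no imputation
    y = (np.nansum(X, axis=1) > 2.0).astype(int)    # arbitrary synthetic target

    dtrain = xgb.DMatrix(X, label=y, missing=np.nan)  # NaN is treated as missing
    booster = xgb.train({"objective": "binary:logistic", "max_depth": 3},
                        dtrain, num_boost_round=50)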


  16. 5. Naïve Experiment
    5 - 1. Task and Dataset
    5 - 2. Basic Model Pipeline
    5 - 3. Model Pipeline with MICE
    5 - 4. Result 1
    5 - 5. Result 2


  17. 5 - 1. Task and Dataset
    Specification
    • Binary classification
      0: 27,300 examples
      1: 87,021 examples
    • Metric: logloss
    • train: 114,321 x 133, test: 114,393 x 132
    • v1 ~ v131: anonymous features
    • The private score difference per leaderboard position is on the order of 0.000X


  18. Missing Map
    ( Figure: missing-value counts per feature, v1 ... v131 )
    • Many NaNs
    • 0 ~ 50% missing in each feature
    • Some missing patterns observed ( explored later )


  19. 5 - 2. Basic Model Pipeline
    ( Diagram: train / test preprocessing and the training loop )
    Preprocessing
    • Drop useless columns: 131 => 21 ( 12 numeric, 9 categorical )
    • OrdinalEncoder ( category_encoders ): fit_transform on the train features, transform on the test features
    Hyperparameter tuning
    • HyperOpt
    • 5-fold CV with early stopping
    • Objective: mean CV log loss
    • Seed 2019 for both the CV folds and XGBoost
    Training
    • 5-fold CV
    • 5 seeds for the fold patterns: 2020, 2021, 2022, 2023, 2024
    • One model per seed (2020 ... 2024); their averaged test predictions give the Public / Private score,
      and the OOF predictions give the CV score
    ( A sketch of this pipeline follows )
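    A minimal sketch of the pipeline above ( file names, column lists and hyperparameters are placeholders; constructor-level early stopping assumes a recent xgboost ):

    import numpy as np
    import pandas as pd
    import category_encoders as ce
    import xgboost as xgb
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import log_loss

    train, test = pd.read_csv("train.csv"), pd.read_csv("test.csv")
    num_cols, cat_cols = ["v10", "v12"], ["v30", "v31"]          # placeholder column lists
    X, y = train[num_cols + cat_cols], train["target"]
    X_test = test[num_cols + cat_cols]

    # Ordinal-encode the categorical features: fit_transform on train, transform on test.
    enc = ce.OrdinalEncoder(cols=cat_cols)
    X, X_test = enc.fit_transform(X), enc.transform(X_test)

    test_preds = []
    for seed in [2020, 2021, 2022, 2023, 2024]:                  # 5 seeds for the fold patterns
        oof = np.zeros(len(X))
        for tr_idx, va_idx in StratifiedKFold(5, shuffle=True, random_state=seed).split(X, y):
            model = xgb.XGBClassifier(n_estimators=10000, learning_rate=0.05,
                                      eval_metric="logloss", early_stopping_rounds=100)
            model.fit(X.iloc[tr_idx], y.iloc[tr_idx],
                      eval_set=[(X.iloc[va_idx], y.iloc[va_idx])], verbose=False)
            oof[va_idx] = model.predict_proba(X.iloc[va_idx])[:, 1]
            test_preds.append(model.predict_proba(X_test)[:, 1])
        print(f"seed {seed} CV logloss: {log_loss(y, oof):.5f}")

    prediction = np.mean(test_preds, axis=0)   # averaged prediction for the public / private score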


  20. ( Figure: missing combinations on selected columns, R package `VIM` )
    • The most frequent pattern is missingness of a single feature, v30
    • About 45% of train + test examples are complete ( NOT missing )
    • Little's test*1 rejects the null hypothesis that the data are MCAR ( p < 1e-32 )
    *1 Little, R. J. A., 1988, A test of missing completely at random for multivariate data with missing values


  21. 5 - 3. Model Pipeline with MICE
    MICE ( R package )
    Imputed Datasets: M = 30 *1
    Max Iteration: m = 20
    Imputation Models
    v10: pmm
    v12: pmm
    v14: pmm
    v21: pmm
    v30: polyreg
    v31: polyreg
    v34: pmm
    v40: pmm
    v50: pmm
    v52: polyreg
    v91: polyreg
    v114: pmm
    *1 Carpenter, et al., 2013, Multiple Imputation and Its Application


  22. 5 - 4. Result 1


  23. • XGBoost with NAs is the best in both public and private
    • Mean Imputation does not work ( MAR? )
    • More than 5 MICE datasets will be necessary to check the performance


  24. Excellent Features of XGBoost
    1. XGBoost can consider the relation between NAs and the target values.
    2. At each round, NAs are allocated to the optimal side of a split so as to decrease the loss function.
       For example, imagine that the elements of X1 are [NA, 1, 2, 3, 4, 5] and X1 is used in the trees at rounds k and l.
       The NAs of X1 may be allocated to the subset [1, 2] in round k, while allocated to the subset [4, 5] in round l.
       In other words, X1 is effectively imputed with a value from [1, 2] in round k and from [4, 5] in round l.
       This is similar to Multiple Imputation.
    ( Diagram: the round-k tree splits x1 into [1, 2] | [3, 4, 5] with the NAs on the [1, 2] side,
      and the round-l tree splits x1 into [1, 2, 3] | [4, 5] with the NAs on the [4, 5] side;
      each tree is fit to the residual of the trees before it )


  25. 5 - 5. Result 2
    XGBoost can consider the relation between NAs and the target values.
    What if we add NA features ( indicating missingness ) to the MICE datasets? ( see the sketch below )
    na_v10  na_v12  na_v14  ...
    0       0       0       ...
    0       0       1       ...
    1       0       0       ...
    0       0       0       ...
    0       0       0       ...
    0       0       1       ...
    1       0       0       ...
    1       0       0       ...
    0       1       0       ...
    0       0       0       ...
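    A minimal pandas sketch for building these NA indicator columns from the raw ( pre-imputation ) data so they can be joined to each MICE-imputed dataset ( the column names are placeholders ):

    import pandas as pd

    def add_na_indicators(raw: pd.DataFrame, imputed: pd.DataFrame, cols) -> pd.DataFrame:
        out = imputed.copy()
        for c in cols:
            # the indicator has to come from the raw data, i.e. before imputation
            out[f"na_{c}"] = raw[c].isna().astype(int)
        return out

    # e.g. for each of the M MICE datasets:
    # train_m = add_na_indicators(train_raw, train_imputed_m, ["v10", "v12", "v14"])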


  26. ( Chart slide; no transcript text )

  27. • Adding NA features improves the MICE scores
    • XGBoost with NAs already uses the missing information, so its improvement is not as large as that of MICE


  28. Thank you!
