A missing values tour with principal component methods

Missing values and principal components methods
Julie Josse
Stanford Stat 300, July 2015

Outline
1. Introduction
2. Point estimates of the PCA axes and components
3. Uncertainty
4. MCA/MFA
5. Single imputation for mixed variables
6. Multiple imputation
7. Practice
8. Appendix

Missing values
Gertrude Mary Cox: "The best thing to do with missing values is not to have any"
Missing values are ubiquitous:
• no answer in a questionnaire
• data that are lost or destroyed
• machines that fail
• plants damaged
• ...
Still an issue in the "big data" era

Some references
• Schafer (1997) [Joseph L. Schafer]
• Little & Rubin (1987, 2002) [Roderick Little, Donald Rubin]
Suggested reading: chap. 25 of Gelman & Hill (2006) [Andrew Gelman, Jennifer L. Hill]


Missing values problematic
A very simple way: deletion (the default of the lm function in R)
Dealing with missing values depends on:
• the pattern of missing values
• the mechanism leading to missing values
  • MCAR: the probability of being missing does not depend on any values
  • MAR: the probability may depend on values of other variables
  • MNAR: the probability depends on the value itself (Ex: Income - Age)
⇒ Inspect and visualize the missing data
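The three mechanisms can be made concrete with a small simulation. This is only an illustrative sketch: the variables `age` and `income`, the logistic missingness probabilities and all numeric constants are assumptions chosen for the example, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: income depends on age.
age = rng.normal(40, 10, size=n)
income = 1000 + 50 * age + rng.normal(0, 300, size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each mask is True where income would be missing.
mcar = rng.random(n) < 0.3                              # constant probability
mar = rng.random(n) < sigmoid((age - 40) / 5)           # depends on age only
mnar = rng.random(n) < sigmoid((income - income.mean()) / income.std())  # depends on income itself

def observed_mean(mask):
    """Mean of income over the cases that remain after deletion."""
    return income[~mask].mean()

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{name}: full mean {income.mean():.0f}, observed mean {observed_mean(mask):.0f}")
```

Under MCAR the complete-case mean stays close to the full-data mean, while under the MAR and MNAR masks above deletion shifts it, which is why the mechanism matters before choosing a method.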


Single imputation methods

[Figure: three scatter plots of Y against X, one each for mean imputation, regression imputation and stochastic regression imputation]

                                   µy     σy     ρ      CI µy 95%
True value                         0      1      0.6
Mean imputation                    0.01   0.5    0.30   39.4
Regression imputation              0.01   0.72   0.78   61.6
Stochastic regression imputation   0.01   0.99   0.59   70.8

⇒ Standard errors of the parameters ($\hat\sigma_{\hat\mu_y}$) calculated from the imputed data set are underestimated
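The comparison of the three single imputation methods can be sketched in a few lines. The sample size, the MCAR mechanism and the 70% missing rate are assumptions made for this illustration (the slide does not state them), so the printed numbers will not reproduce the table exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 1000, 0.6

# Bivariate normal data with mu_y = 0, sigma_y = 1, correlation rho.
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

miss = rng.random(n) < 0.7          # assumed: 70% of y missing completely at random
obs = ~miss

# Regression of y on x fitted on the complete cases.
slope, intercept = np.polyfit(x[obs], y[obs], 1)
resid_sd = np.std(y[obs] - (intercept + slope * x[obs]), ddof=2)

y_mean = y.copy()
y_mean[miss] = y[obs].mean()                         # mean imputation

y_reg = y.copy()
y_reg[miss] = intercept + slope * x[miss]            # regression imputation

y_sto = y_reg.copy()
y_sto[miss] += rng.normal(0, resid_sd, miss.sum())   # stochastic regression imputation

for name, yi in [("mean", y_mean), ("regression", y_reg), ("stochastic", y_sto)]:
    print(f"{name:>10}: sd = {yi.std(ddof=1):.2f}, corr = {np.corrcoef(x, yi)[0, 1]:.2f}")
```

Mean imputation shrinks both the standard deviation and the correlation, regression imputation inflates the correlation, and adding residual noise restores both to roughly their true values.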

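Recommended methods
⇒ Multiple imputation (Rubin, 1987)
• Generate M plausible values for each missing value
  [Figure: the fitted value $(\hat F \hat u')_{ij}$ is replaced by several draws $(\hat F \hat u')^{1}_{ij} + \varepsilon^{1}_{ij}$, $(\hat F \hat u')^{2}_{ij} + \varepsilon^{2}_{ij}$, $(\hat F \hat u')^{3}_{ij} + \varepsilon^{3}_{ij}$, ..., $(\hat F \hat u')^{B}_{ij} + \varepsilon^{B}_{ij}$]
• Perform the analysis on each imputed data set: $\hat\theta_m$, $\widehat{\mathrm{Var}}(\hat\theta_m)$
• Combine the results:
  $\hat\theta = \frac{1}{M} \sum_{m=1}^{M} \hat\theta_m$
  $T = \frac{1}{M} \sum_{m=1}^{M} \widehat{\mathrm{Var}}(\hat\theta_m) + \left(1 + \frac{1}{M}\right) \frac{1}{M-1} \sum_{m=1}^{M} \left(\hat\theta_m - \hat\theta\right)^2$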
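The combining step can be written as a short function. This is a minimal sketch for a scalar parameter; the function name `pool_rubin` is hypothetical and chosen for this example.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M point estimates and their variances with Rubin's rules.

    Returns the pooled estimate and the total variance
    T = within + (1 + 1/M) * between.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    theta_bar = estimates.mean()
    within = variances.mean()            # average within-imputation variance
    between = estimates.var(ddof=1)      # between-imputation variance, divisor M - 1
    total = within + (1 + 1 / m) * between
    return theta_bar, total

theta_bar, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(theta_bar, total)
```

With the three estimates above, the within variance is 0.5 and the between variance is 1, so the total variance is 0.5 + (4/3) × 1.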

Recommended methods
⇒ Multiple imputation (Rubin, 1987)
⇒ Maximum likelihood: EM algorithm (Dempster et al., 1977) to obtain point estimates + other algorithms for their variability
One specific algorithm for each statistical method
⇒ Common aim: provide estimates of the parameters and of their variability (taking into account the variability due to missing values)

PCA reconstruction

[Figure: scatter plot of x2 against x1 (both axes from -3 to 3) showing the observations and their projections on the first principal axis]

Data X:                Reconstruction $\hat X$:
  x1      x2             x1      x2
 -2.00   -2.74          -2.16   -2.58
 -1.56   -0.77          -0.96   -1.35
 -1.11   -1.59          -1.15   -1.55
 -0.67   -1.13          -0.70   -1.09
 -0.22   -1.22          -0.53   -0.92
  0.22   -0.52           0.04   -0.34
  0.67    1.46           1.24    0.89
  1.11    0.63           1.05    0.69
  1.56    1.10           1.50    1.15
  2.00    1.00           1.67    1.33

$\hat X = F V'$

Minimizes the reconstruction error
⇒ Minimize the distance between observations and their projection
⇒ Approximation of X with a low-rank matrix, S < p: $\left\| X_{n \times p} - \hat X_{n \times p} \right\|^2$
SVD: $\hat X^{PCA} = U_{n \times S} \Lambda^{1/2}_{S \times S} V'_{p \times S} = F_{n \times S} V'_{p \times S}$
• $F = U \Lambda^{1/2}$: principal components (scores)
• $V$: principal axes (loadings)
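The rank-S reconstruction can be checked numerically on the ten points of the PCA reconstruction slide. This is a sketch assuming a centered, unscaled PCA computed through NumPy's SVD; the function name `pca_reconstruct` is hypothetical.

```python
import numpy as np

def pca_reconstruct(X, S):
    """Rank-S PCA reconstruction of X (columns centered, not scaled)."""
    mean = X.mean(axis=0)
    U, d, Vt = np.linalg.svd(X - mean, full_matrices=False)
    F = U[:, :S] * d[:S]        # scores, F = U Lambda^{1/2}
    V = Vt[:S].T                # loadings
    return mean + F @ V.T, F, V, d

# The ten (x1, x2) observations from the PCA reconstruction slide.
X = np.array([
    [-2.00, -2.74], [-1.56, -0.77], [-1.11, -1.59], [-0.67, -1.13],
    [-0.22, -1.22], [0.22, -0.52], [0.67, 1.46], [1.11, 0.63],
    [1.56, 1.10], [2.00, 1.00],
])

X_hat, F, V, d = pca_reconstruct(X, S=1)
error = np.sum((X - X_hat) ** 2)
print(np.round(X_hat, 2))
print(error, np.sum(d[1:] ** 2))
```

The reconstruction error equals the sum of the squared singular values that were discarded, and the reconstructed rows should land close to the values listed on the slide.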

Missing values in PCA
⇒ PCA: least squares
  $\left\| X_{n \times p} - U_{n \times S} \Lambda^{1/2}_{S \times S} V'_{p \times S} \right\|^2$
⇒ PCA with missing values: weighted least squares
  $\left\| W_{n \times p} * \left( X_{n \times p} - U_{n \times S} \Lambda^{1/2}_{S \times S} V'_{p \times S} \right) \right\|^2$
  with $w_{ij} = 0$ if $x_{ij}$ is missing, $w_{ij} = 1$ otherwise
Many algorithms: weighted alternating least squares (Gabriel & Zamir, 1979); iterative PCA (Kiers, 1997)

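Weighted least squares
⇒ Rank 1: $\sum_{i=1}^{n} \sum_{j=1}^{p} (x_{ij} - F_{i1} V_{j1})^2$
  2 simple regressions:
  $V_{j1} = \frac{\sum_i x_{ij} F_{i1}}{\sum_i F_{i1}^2}$, $F_{i1} = \frac{\sum_j x_{ij} V_{j1}}{\sum_j V_{j1}^2}$
  Power method. Deflation: $(F_2, V_2)$ are obtained from the residual matrix $\hat\varepsilon_1 = X - F_1 V_1'$
  NIPALS (Nonlinear Iterative PArtial Least Squares; Wold, Christofferson, 1966, 1969).
  With missing values, 2 simple weighted regressions:
  $V_{j1} = \frac{\sum_i w_{ij} x_{ij} F_{i1}}{\sum_i w_{ij} F_{i1}^2}$; $F_{i1} = \frac{\sum_j w_{ij} x_{ij} V_{j1}}{\sum_j w_{ij} V_{j1}^2}$
⇒ Subspace S > 1: 2 multiple regressions: $V = X' F (F' F)^{-1}$; $F = X V (V' V)^{-1}$
  With missing values: 2 multiple weighted regressions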
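The two weighted simple regressions of the rank-1 case can be alternated as follows. This is a sketch only: the function name `rank1_wals`, the all-ones starting vector and the stopping rule are assumptions, and the code expects every row and every column to contain at least one observed value.

```python
import numpy as np

def rank1_wals(X, n_iter=1000, tol=1e-12):
    """Rank-1 weighted alternating least squares; NaN marks a missing cell."""
    W = (~np.isnan(X)).astype(float)     # w_ij = 0 if missing, 1 otherwise
    X0 = np.nan_to_num(X)                # missing cells get weight 0 anyway
    F = np.ones(X.shape[0])              # assumed start, must not be orthogonal to the solution
    for _ in range(n_iter):
        V = (W * X0).T @ F / (W.T @ F**2)        # regress each column on F
        F_new = (W * X0) @ V / (W @ V**2)        # regress each row on V
        converged = np.linalg.norm(F_new - F) <= tol * np.linalg.norm(F_new)
        F = F_new
        if converged:
            break
    return F, V

# Exact rank-1 matrix with three cells removed.
full = np.outer([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0])
X = full.copy()
X[0, 1] = X[2, 2] = X[4, 0] = np.nan

F, V = rank1_wals(X)
print(np.round(np.outer(F, V), 3))
```

Because the example matrix is exactly of rank 1, the product of the fitted scores and loadings also recovers the three removed cells.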

Iterative PCA

[Figure: scatter plot of x2 against x1 (both axes from -2 to 3) showing the observed points and the successive imputed values]

Incomplete data:
  x1     x2
 -2.0   -2.01
 -1.5   -1.48
  0.0   -0.01
  1.5    NA
  2.0    1.98

Initialization $\ell = 0$: $X^0$ (mean imputation); the NA is set to 0.00.

PCA on the completed data set → $(U^\ell, \Lambda^\ell, V^\ell)$. Model matrix $\hat X^\ell = U^\ell (\Lambda^\ell)^{1/2} V^{\ell\prime}$:
  x1     x2
 -1.98  -2.04
 -1.44  -1.56
  0.15  -0.18
  1.00   0.57
  2.27   1.67

Missing values are imputed with the model matrix. The new imputed data set is $X^\ell = W * X + (1 - W) * \hat X^\ell$, so the NA becomes 0.57.

Next iteration, model matrix:
  x1     x2
 -2.00  -2.01
 -1.47  -1.52
  0.09  -0.11
  1.20   0.90
  2.18   1.78
The NA becomes 0.90.

Steps are repeated until convergence; the imputed value is then 1.46.


Iterative PCA
1. initialization $\ell = 0$: $X^0$ (mean imputation)
2. step $\ell$:
   (a) PCA on the completed data set → $(U^\ell, \Lambda^\ell, V^\ell)$; S dimensions are kept
   (b) missing values imputed with $\hat X^\ell = U^\ell (\Lambda^\ell)^{1/2} V^{\ell\prime}$; the new imputed data set is $X^\ell = W * X + (1 - W) * \hat X^\ell$
   (c) means (and standard deviations) are updated
3. steps of estimation and imputation are repeated
⇒ EM algorithm of the fixed effect model (Caussinus, 1986):
  $x_{ij} = \sum_{s=1}^{S} \sqrt{\lambda_s} U_{is} V_{js} + \varepsilon_{ij}$, $\varepsilon_{ij} \sim N(0, \sigma^2)$
⇒ Imputation (matrix completion framework, Netflix)
⇒ Reduction of the variability (imputation by $U \Lambda^{1/2} V'$)
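The algorithm can be sketched directly from the three steps, here on the five-point example of the earlier slides. The function name `iterative_pca`, the stopping rule and the choice to update means only (no rescaling by standard deviations) are assumptions of this illustration.

```python
import numpy as np

def iterative_pca(X, S, n_iter=1000, tol=1e-10):
    """Impute NaN cells of X with a rank-S PCA fit (means updated, no scaling)."""
    W = ~np.isnan(X)                                   # True where observed
    X_imp = np.where(W, X, np.nanmean(X, axis=0))      # 1. mean imputation
    for _ in range(n_iter):
        mean = X_imp.mean(axis=0)                      # (c) updated means
        U, d, Vt = np.linalg.svd(X_imp - mean, full_matrices=False)  # (a) PCA
        X_hat = mean + (U[:, :S] * d[:S]) @ Vt[:S]     # (b) model matrix
        X_new = np.where(W, X, X_hat)                  # X = W*X + (1-W)*X_hat
        done = np.sum((X_new - X_imp) ** 2) < tol
        X_imp = X_new
        if done:                                       # 3. repeat until stable
            break
    return X_imp

X = np.array([
    [-2.0, -2.01],
    [-1.5, -1.48],
    [0.0, -0.01],
    [1.5, np.nan],
    [2.0, 1.98],
])
X_completed = iterative_pca(X, S=1)
print(np.round(X_completed, 2))
```

Observed cells are never modified; only the missing cell moves from its starting value towards the first principal axis, ending in the neighbourhood of the converged value shown on the slide.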


Overfitting
$X_{41 \times 6} = F_{41 \times 2} V_{2 \times 6} + N(0, 0.5)$ ⇒ 50% of NA

[Figure: two individuals maps with points labelled by name (SEBRLE, CLAY, KARPOV, ...). Left: PCA on the complete data, Dim 1 (55.09%), Dim 2 (27.91%). Right: iterative PCA, Dim 1 (63.97%), Dim 2 (31.9%).]

⇒ fitting error is low: $\| W * (X - \hat X) \|^2 = 0.48$
⇒ prediction error is high: $\| (1 - W) * (X - \hat X) \|^2 = 5.58$
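In a simulation where the full matrix is known, the two errors are computed on complementary sets of cells. A minimal sketch follows; the function name `fit_and_prediction_error` is hypothetical, and plain sums of squares are used because the slide does not say whether its norms are normalised.

```python
import numpy as np

def fit_and_prediction_error(X_true, X_hat, W):
    """Sums of squared errors on observed (W True) and missing (W False) cells."""
    W = np.asarray(W, dtype=bool)
    resid2 = (np.asarray(X_true) - np.asarray(X_hat)) ** 2
    return resid2[W].sum(), resid2[~W].sum()

X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_hat = np.array([[1.0, 2.5], [2.0, 4.0]])
W = np.array([[True, True], [False, True]])    # cell (1, 0) was hidden

fit_err, pred_err = fit_and_prediction_error(X_true, X_hat, W)
print(fit_err, pred_err)
```

A low first value together with a high second value is the overfitting signature described on the slide.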

Overfitting
Overfitting occurs when:
• there are many parameters relative to the number of observed values (the number of dimensions S and the number of missing values are large)
• data are very noisy
⇒ The relationship between variables is trusted too much
Remarks:
• missing values: special case of small data set
• iterative PCA: prediction method
Solution:
⇒ Shrinkage methods


Regularized iterative PCA (Josse et al., 2009)
⇒ Initialization - estimation step - imputation step
The imputation step:
  $\hat x^{PCA}_{ij} = \sum_{s=1}^{S} \sqrt{\lambda_s} U_{is} V_{js}$
is replaced by a "shrunk" imputation step (Efron & Morris 1972):
  $\hat x^{rPCA}_{ij} = \sum_{s=1}^{S} \left( \frac{\lambda_s - \hat\sigma^2}{\lambda_s} \right) \sqrt{\lambda_s} U_{is} V_{js} = \sum_{s=1}^{S} \left( \sqrt{\lambda_s} - \frac{\hat\sigma^2}{\sqrt{\lambda_s}} \right) U_{is} V_{js}$
  $\hat\sigma^2 = \frac{RSS}{df} = \frac{n \sum_{s=S+1}^{q} \lambda_s}{np - p - nS - pS + S^2 + S}$  $(X_{n \times p};\ U_{n \times S};\ V_{p \times S})$
Between hard/soft thresholding (Mazumder, Hastie & Tibshirani, 2010)
• $\sigma^2$ small → regularized PCA ≈ PCA
• $\sigma^2$ large → mean imputation
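The shrunk reconstruction can be sketched on a complete, centered matrix. Several points are assumptions of this illustration: the eigenvalues are taken as $\lambda_s = d_s^2 / n$ with $d_s$ the singular values (so that $RSS = n \sum_{s>S} \lambda_s$), the shrinkage factor is clipped at zero, and the name `regularized_fit` is hypothetical. The simulated data mimic the 41 × 6 rank-2 setting, reading 0.5 as a standard deviation.

```python
import numpy as np

def regularized_fit(Xc, S):
    """Rank-S PCA fit and its shrunk version for a centered n x p matrix Xc."""
    n, p = Xc.shape
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    lam = d**2 / n                                   # assumed eigenvalue scaling
    df = n * p - p - n * S - p * S + S**2 + S        # denominator from the slide
    sigma2 = n * lam[S:].sum() / df                  # noise variance estimate
    shrink = np.clip((lam[:S] - sigma2) / lam[:S], 0.0, None)
    X_pca = (U[:, :S] * d[:S]) @ Vt[:S]              # plain rank-S fit
    X_rpca = (U[:, :S] * (d[:S] * shrink)) @ Vt[:S]  # shrunk fit
    return X_pca, X_rpca, sigma2

rng = np.random.default_rng(0)
X = rng.standard_normal((41, 2)) @ rng.standard_normal((2, 6)) + rng.normal(0, 0.5, (41, 6))
X_pca, X_rpca, sigma2 = regularized_fit(X - X.mean(axis=0), S=2)
print(sigma2, np.linalg.norm(X_pca), np.linalg.norm(X_rpca))
```

Inside the iterative algorithm, the shrunk fit would replace the plain fit in the imputation step; with a small noise estimate the two fits nearly coincide, and with a large one the shrunk fit collapses towards the column means.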

Regularized iterative PCA
$X_{41 \times 6} = F_{41 \times 2} V_{2 \times 6} + N(0, 0.5)$ ⇒ 50% of NA

[Figure: two individuals maps with points labelled by name (SEBRLE, CLAY, KARPOV, ...). Left: PCA on the complete data, Dim 1 (55.09%), Dim 2 (27.91%). Right: regularized PCA, Dim 1 (64.27%), Dim 2 (30.72%).]

⇒ fitting error: $\| W * (X - \hat X) \|^2 = 0.52$ (EM = 0.48)
⇒ prediction error: $\| (1 - W) * (X - \hat X) \|^2 = 0.67$ (EM = 5.58)

Properties
⇒ Quality of estimation of the parameters: simulation $X = F V' + \varepsilon$, RV coefficient between complete/incomplete
• performance decreases with the amount of missing values and the level of noise
• difficult settings: regularized PCA equals mean imputation
• the choice of the number of dimensions is less crucial
⇒ Quality of imputation:
• good when the structure is strong (imputation uses similarities between individuals and relationships between variables)
• competitive with random forests

A real dataset

       O3    T9    T12   T15   Ne9  Ne12  Ne15  Vx9      Vx12     Vx15     O3v
0601   NA    15.6  18.5  18.4  4    4     8     NA       -1.7101  -0.6946  84
0602   82    17    18.4  17.7  5    5     7     NA       NA       NA       87
0603   92    NA    17.6  19.5  2    5     4     2.9544   1.8794   0.5209   82
0604   114   16.2  NA    NA    1    1     0     NA       NA       NA       92
0605   94    17.4  20.5  NA    8    8     7     -0.5     NA       -4.3301  114
0606   80    17.7  NA    18.3  NA   NA    NA    -5.6382  -5       -6       94
0607   NA    16.8  15.6  14.9  7    8     8     -4.3301  -1.8794  -3.7588  80
0610   79    14.9  17.5  18.9  5    5     4     0        -1.0419  -1.3892  NA
0611   101   NA    19.6  21.4  2    4     4     -0.766   NA       -2.2981  79
0612   NA    18.3  21.9  22.9  5    6     8     1.2856   -2.2981  -3.9392  101
0613   101   17.3  19.3  20.2  NA   NA    NA    -1.5     -1.5     -0.8682  NA
...
0919   NA    14.8  16.3  15.9  7    7     7     -4.3301  -6.0622  -5.1962  42
0920   71    15.5  18    17.4  7    7     6     -3.9392  -3.0642  0        NA
0921   96    NA    NA    NA    3    3     3     NA       NA       NA       71
0922   98    NA    NA    NA    2    2     2     4        5        4.3301   96
0923   92    14.7  17.6  18.2  1    4     6     5.1962   5.1423   3.5      98
0924   NA    13.3  17.7  17.7  NA   NA    NA    -0.9397  -0.766   -0.5     92
0925   84    13.3  17.7  17.8  3    5     6     0        -1       -1.2856  NA
0927   NA    16.2  20.8  22.1  6    5     5     -0.6946  -2       -1.3681  71
0928   99    16.9  23    22.6  NA   4     7     1.5      0.8682   0.8682   NA
0929   NA    16.9  19.8  22.1  6    5     3     -4       -3.7588  -4       99
0930   70    15.7  18.6  20.7  NA   NA    NA    0        -1.0419  -4       NA

PCA on the incomplete data

[Figure: individuals factor map (PCA), Dim 1 (57.47%), Dim 2 (21.34%), with legend East, North, West, South; variables factor map (PCA), Dim 1 (55.85%), Dim 2 (21.73%), with variables T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15, maxO3v, maxO3]


Uncertainty with incomplete case
⇒ A new source of variability to take into account
• less data: more uncertainty
• iterative PCA: single imputation → residual bootstrap on the completed data leads to underestimating the variability
⇒ Multiple imputation
1. Generating B imputed data sets: for b = 1, ..., B, missing values $x^b_{ij}$ are drawn from the predictive distribution $N\left((FV')_{ij}, \hat\sigma^2\right)$ ⇒ "improper" imputation
2. Performing the analysis on each imputed data set
3. Combining: variance = within + between imputation variance
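The "improper" generation step only adds noise around one fixed fit. A minimal sketch, assuming the fitted matrix `X_hat` and the noise variance `sigma2` have already been obtained (for instance from an iterative PCA); the function name and the toy values are hypothetical.

```python
import numpy as np

def improper_imputations(X, X_hat, sigma2, B, seed=0):
    """Return B completed copies of X; NaN cells are drawn from N(X_hat_ij, sigma2)."""
    rng = np.random.default_rng(seed)
    W = ~np.isnan(X)                                 # True where observed
    completed = []
    for _ in range(B):
        noise = rng.normal(0.0, np.sqrt(sigma2), size=X.shape)
        completed.append(np.where(W, X, X_hat + noise))
    return completed

# Toy values, assumed for the example only.
X = np.array([[1.0, np.nan], [2.0, 2.1], [np.nan, 3.2]])
X_hat = np.array([[1.0, 1.1], [2.0, 2.0], [3.1, 3.2]])
tables = improper_imputations(X, X_hat, sigma2=0.04, B=5)
print(np.round(tables[0], 2))
```

Every table keeps the observed cells and shares the same fitted values, so the uncertainty about F and V themselves is not reflected, which is why this scheme is called improper.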

"Proper" multiple imputation
1. Variability of the parameters: obtaining B plausible sets of parameters $(F, V)^1, \dots, (F, V)^B$ ⇒ bootstrap/Bayesian
2. Noise: for b = 1, ..., B, missing values $x^b_{ij}$ are imputed by drawing from the predictive distribution $N\left((FV')^b_{ij}, \hat\sigma^2\right)$
[Figure: the fitted value $(\hat F \hat U')_{ik}$ is replaced by the draws $(\hat F \hat U')^{1}_{ik} + \varepsilon^{1}_{ik}$, $(\hat F \hat U')^{2}_{ik} + \varepsilon^{2}_{ik}$, $(\hat F \hat U')^{3}_{ik} + \varepsilon^{3}_{ik}$, ..., $(\hat F \hat U')^{B}_{ik} + \varepsilon^{B}_{ik}$]

Supplementary projection
⇒ Individuals position (and variables) with other predictions
[Figure: supplementary projection; PCA; regularized iterative PCA ⇒ reference configuration]

Multiple imputation in practice

[Figure: supplementary projection on the individuals map (individuals 1 to 112), Dim 1 (57.20%), Dim 2 (20.27%); variable representation, Dim 1 (57.20%), Dim 2 (20.27%), with variables maxO3, T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15, maxO3v]


Between imputation variability
⇒ Influence of the different predictions on the parameters (PCA on each table)
[Figure: a PCA is performed on each imputed table $(\tilde F \tilde U')^{1}$, $(\tilde F \tilde U')^{2}$, $(\tilde F \tilde U')^{3}$, ..., $(\tilde F \tilde U')^{B}$, followed by a Procrustean rotation]
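Because PCA axes are only defined up to rotation and reflection, the configurations from the B tables have to be aligned before they are compared. The sketch below uses the standard orthogonal Procrustes solution (no scaling or translation); the function name `procrustes_rotate` is hypothetical, and whether the slides use exactly this variant is an assumption.

```python
import numpy as np

def procrustes_rotate(F_ref, F_b):
    """Rotate the n x S configuration F_b onto F_ref.

    Finds the orthogonal R minimising ||F_b R - F_ref||_F, given by
    R = U V' where F_b' F_ref = U D V'.
    """
    U, _, Vt = np.linalg.svd(F_b.T @ F_ref)
    return F_b @ (U @ Vt)

rng = np.random.default_rng(0)
F_ref = rng.standard_normal((12, 2))             # reference configuration

theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
F_b = F_ref @ Q                                  # same configuration, rotated

F_aligned = procrustes_rotate(F_ref, F_b)
print(np.abs(F_aligned - F_ref).max())
```

After alignment, the remaining spread of each individual across the B configurations reflects the between-imputation variability rather than arbitrary axis orientation.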

Between imputation variability

[Figure: "Multiple imputation using Procrustes", individuals 1 to 12, Dim 1 (71.33%), Dim 2 (17.17%)]

MCA for categorical data
MCA can be seen as the PCA of (data, metric, row masses)
  $\left( I X D_\Sigma^{-1},\ \frac{1}{IJ} D_\Sigma,\ \frac{1}{I} \mathbb{I}_I \right)$
with X the indicator matrix and $D_\Sigma$ the diagonal matrix of the column margins of X.

[Figure: indicator matrix X with general term $x_{ik}$, 0/1 entries and NA for missing answers; each row sums to J, the column margins are $I_1, \dots, I_k, \dots, I_K$ and the grand total is IJ; $D_\Sigma = \mathrm{diag}(I_1, \dots, I_k, \dots, I_K)$]

Regularized iterative MCA (Josse et al., 2012)
• Initialization: imputation of the indicator matrix (proportion)
• Iterate until convergence
  1. Estimation of F, V: MCA on the completed indicator matrix
  2. Imputation of the missing values with the model matrix
  3. Column margins are updated

Categorical data:
           V1   V2   V3   …   V14
ind 1      a    NA   g    …   u
ind 2      NA   f    g    …   u
ind 3      a    e    h    …   v
ind 4      a    e    h    …   v
ind 5      b    f    h    …   u
ind 6      c    f    h    …   u
ind 7      c    f    NA   …   v
…
ind 1232   c    f    h    …   v

Imputed indicator matrix:
           V1_a  V1_b  V1_c  V2_e  V2_f  V3_g  V3_h  …
ind 1      1     0     0     0.71  0.29  1     0     …
ind 2      0.12  0.29  0.59  0     1     1     0     …
ind 3      1     0     0     1     0     0     1     …
ind 4      1     0     0     1     0     0     1     …
ind 5      0     1     0     0     1     0     1     …
ind 6      0     0     1     0     1     0     1     …
ind 7      0     0     1     0     1     0.37  0.63  …
…
ind 1232   0     0     1     0     1     0     1     …

⇒ Imputed values can be seen as degrees of membership
⇒ Missing values mask an underlying value
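The initialization step (proportions in place of missing answers) can be sketched for a single categorical variable. The function name `indicator_with_proportions` and the use of `None` to mark a missing answer are assumptions of this example; the later iterations, which replace these starting values with the MCA model matrix, are not shown.

```python
import numpy as np

def indicator_with_proportions(column, categories):
    """Indicator (dummy) coding of one variable; missing rows get the observed proportions."""
    Z = np.array([
        [np.nan if value is None else float(value == c) for c in categories]
        for value in column
    ])
    props = np.nanmean(Z, axis=0)              # category proportions among respondents
    Z[np.isnan(Z).any(axis=1)] = props         # fill missing rows with the proportions
    return Z

# Variable V1 for the first seven individuals of the example table.
v1 = ["a", None, "a", "a", "b", "c", "c"]
Z = indicator_with_proportions(v1, categories=["a", "b", "c"])
print(np.round(Z, 2))
```

Each row still sums to 1 over the categories of the variable, which is what allows the imputed values to be read as degrees of membership.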


A real example
• 1232 respondents, 14 questions, 35 categories, 9% of missing values concerning 42% of respondents

[Figure: "Missing single" coding, categories map (including the categories Q1.NA to Q14.NA) and subjects map, Dim 1 (11.74%), Dim 2 (8.618%); regularized iterative MCA, categories map (Q1.1 to Q14.3) and subjects map, Dim 1 (14.58%), Dim 2 (11.21%)]
q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 33 / 92

Multi-blocks data set
• Biology: 10 samples without expression data
• Sensory analysis: each judge cannot evaluate more than a certain number of products (saturation). Planned missing products per judge, experimental design: BIB (balanced incomplete block)
⇒ Missing rows per subtable
⇒ Regularized iterative MFA (Husson & Josse, 2013)

Journal impact factors
journalmetrics.com provides 15 years of metrics for 27000 journals.
443 journals (Computer Science, Statistics, Probability and Mathematics).
45 metrics, some of which may be NA: 15 years by 3 types of measures:
• IPP - Impact Per Publication: like the ISI impact factor, but for 3 (rather than 2) years.
• SNIP - Source Normalized Impact Per Paper: tries to weight by the number of citations per subject field to adjust for different citation cultures.
• SJR - SCImago Journal Rank: tries to capture average prestige per publication.

MFA with missing values
[Figure: MFA map of the journals, Dim 1 (74.03%), Dim 2 (8.29%). Labelled journals include Annals of Statistics, Bioinformatics, Biometrika, IEEE Transactions on Pattern Analysis and Machine Intelligence, Journal of Machine Learning Research, Journal of the American Statistical Association, Journal of the Royal Statistical Society. Series B: Statistical Methodology, Journal of Computational and Graphical Statistics, R Journal, Annals of Applied Statistics and Journal of Statistical Software.]

MFA with missing values
[Figure: two correlation circles, Dim 1 (74.03%), Dim 2 (8.29%): one for the yearly IPP variables (IPP_1999 to IPP_2013) and one for the yearly SNIP variables (SNIP_1999 to SNIP_2013).]

MFA with missing values
[Figure: individual factor map, Dim 1 (74.03%), Dim 2 (8.29%), showing the trajectory of IEEE/ACM Transactions on Networking from year_1999 to year_2013.]

After performing principal component methods despite missing entries (getting the graphical outputs and the principal components and axes), we use these methods as tools for single and multiple imputation and compare them to state-of-the-art methods.
PC methods are powerful for imputation, since they use similarities between rows and relationships between columns, and require a small number of parameters (dimensionality reduction).
With single imputation, the aim is to complete a dataset as well as possible (prediction).
With multiple imputation, the aim is to perform other statistical methods afterwards and to estimate parameters and their variability, taking into account the uncertainty due to missing values.

Principal component method for mixed data (complete)
Factorial Analysis on Mixed Data (Escofier, 1979), PCAMIX (Kiers, 1991)
[Figure: coding of a mixed data table, with continuous variables (e.g. rows 51 100 190; 70 96 196; 38 69 166) next to the indicator matrix of the categorical variables. Continuous variables: centring & scaling. Indicator matrix: division by a weight based on the category counts $I_1, I_2, ..., I_k$, and centring. The result is a matrix which balances the influence of each variable.]
A PCA is performed on the weighted matrix: SVD $\left(X, D_\Sigma^{-1}, \frac{1}{I}\mathbb{I}_I\right)$, with $X$ the matrix with the continuous variables and the indicator matrix, and $D_\Sigma$ the diagonal matrix with the standard deviations and the weights $(I_k/I)$.
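The coding described above can be sketched in base R. This is a minimal illustration on made-up data, assuming the usual FAMD convention in which each indicator column is divided by $\sqrt{I_k/I}$ before centring; the data frame `don` and the helper `pop_sd` are hypothetical names introduced for the example.

```r
# Toy mixed data (hypothetical values, for illustration only)
don <- data.frame(
  age    = c(51, 70, 48, 62, 38, 45),
  weight = c(100, 96, 104, 68, 69, 80),
  sex    = factor(c("M", "M", "W", "M", "W", "W")),
  snore  = factor(c("yes", "no", "no", "no", "yes", "no"))
)
I <- nrow(don)
is_num <- sapply(don, is.numeric)

# Continuous part: centre and scale with the population standard deviation
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))
X_cont <- scale(as.matrix(don[is_num]), center = TRUE,
                scale = sapply(don[is_num], pop_sd))

# Categorical part: indicator matrix, one column per category
X_ind <- do.call(cbind, lapply(names(don)[!is_num], function(v) {
  Z <- sapply(levels(don[[v]]), function(l) as.numeric(don[[v]] == l))
  colnames(Z) <- paste(v, levels(don[[v]]), sep = "_")
  Z
}))
p_k <- colMeans(X_ind)   # proportions I_k / I
# Assumed convention: divide each column by sqrt(I_k / I), then centre
X_cat <- scale(sweep(X_ind, 2, sqrt(p_k), "/"), center = TRUE, scale = FALSE)

# PCA of the balanced matrix through an SVD with uniform row weights 1/I
W <- cbind(X_cont, X_cat)
s <- svd(W / sqrt(I))
eig <- s$d^2                             # eigenvalues
scores <- sqrt(I) * s$u %*% diag(s$d)    # principal components F_s
```

With this coding each continuous variable contributes 1 to the total inertia and each categorical variable contributes its number of categories minus 1, which is what balances their influence: here `sum(eig)` is 2 + 1 + 1 = 4.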

Properties of the method
• The distance between individuals is:
$$d^2(i, l) = \sum_{k=1}^{K_{cont}} \frac{1}{\sigma_k} (x_{ik} - x_{lk})^2 + \sum_{q=1}^{Q} \sum_{k=1}^{K_q} \frac{1}{I_{k_q}} (x_{iq} - x_{lq})^2$$
• The principal component $F_s$ maximises:
$$\sum_{k=1}^{K_{cont}} r^2(F_s, v_k) + \sum_{q=1}^{Q_{cat}} \eta^2(F_s, v_q)$$

Iterative FAMD algorithm
1 Initialization: imputation by the mean (continuous) and by the proportion (dummy)
2 Iterate until convergence
  (a) estimation: FAMD on the completed data ⇒ U, Λ, V
  (b) imputation of the missing values with the fitted matrix $\hat{X} = U_S \Lambda_S^{1/2} V_S^\top$
  (c) means, standard deviations and column margins are updated

Incomplete data:
age  weight  size  alcohol   sex  snore  tobacco
NA   100     190   NA        M    yes    no
70   96      186   1-2 gl/d  M    NA     <=1
NA   104     194   No        W    no     NA
62   68      165   1-2 gl/d  M    no     <=1

Incomplete coded matrix (continuous variables, then indicator columns for alcohol, sex, snore, tobacco):
NA   100  190   NA   NA   NA    1  0   0    1     1    0    0
70   96   186   0    1    0     1  0   NA   NA    0    1    0
NA   104  194   1    0    0     0  1   1    0     NA   NA   NA
62   68   165   0    1    0     1  0   1    0     0    1    0

Completed coded matrix:
51   100  190   0.2  0.7  0.1   1  0   0    1     1    0    0
70   96   186   0    1    0     1  0   0.8  0.2   0    1    0
48   104  194   1    0    0     0  1   1    0     0.1  0.8  0.1
62   68   165   0    1    0     1  0   1    0     0    1    0

Imputed data:
age  weight  size  alcohol   sex  snore  tobacco
51   100     190   1-2 gl/d  M    yes    no
70   96      186   1-2 gl/d  M    no     <=1
48   104     194   No        W    no     <=1
62   68      165   1-2 gl/d  M    no     <=1

imputeAFDM
⇒ Imputed values can be seen as degrees of membership

Iterative Random Forests imputation
1 Initial imputation: mean imputation - random category
  Sort the variables according to the amount of missing values
2 Fit a RF of $X_j^{obs}$ on the variables $X_{-j}^{obs}$ and then predict $X_j^{miss}$
3 Cycling through variables
4 Repeat steps 2 and 3 until convergence
• number of trees: 100
• number of variables randomly selected at each node: $\sqrt{p}$
• number of iterations: 4-5
Implemented in the R package missForest (Daniel J. Stekhoven, Peter Bühlmann, 2011)

Simulation study
Several data sets
• Relationships between variables
• Number of categories
• Percentage of missing values (10%, 20%, 30%)
Criteria:
• for continuous data: RMSE
• for categorical data: proportion of falsely classified entries
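The two criteria can be written as short base R functions. This is a sketch under the assumption that both are computed only over the entries that were missing and then imputed; the names `rmse`, `pfc` and the toy vectors are hypothetical.

```r
# RMSE over the imputed continuous entries
rmse <- function(truth, imputed, miss) {
  sqrt(mean((truth[miss] - imputed[miss])^2))
}

# Proportion of falsely classified entries over the imputed categorical entries
pfc <- function(truth, imputed, miss) {
  mean(as.character(truth[miss]) != as.character(imputed[miss]))
}

# Toy example (hypothetical values)
x_true <- c(10, 12, 14, 16)
x_imp  <- c(10, 13, 14, 18)
miss_x <- c(FALSE, TRUE, FALSE, TRUE)
rmse(x_true, x_imp, miss_x)   # sqrt((1 + 4) / 2)

g_true <- factor(c("a", "b", "b", "a"))
g_imp  <- factor(c("a", "a", "b", "a"))
miss_g <- c(FALSE, TRUE, FALSE, TRUE)
pfc(g_true, g_imp, miss_g)    # 0.5
```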

Comparison on real data sets
Imputations obtained with random forest & FAMD algorithm

Summary
Imputations with PC methods are good:
• for strong linear relationships
• for categorical variables
• especially for rare categories (weights of MCA)
⇒ Number of components S? Cross-validation (GCV)
Imputations with RF are good:
• for strong non-linear relationships between continuous variables (with PC methods: cutting continuous variables into categories)
• when there are interactions (with PC methods: creating interactions)
⇒ No tuning parameters?
Remark: categorical data improve the imputation of continuous data, and continuous data improve the imputation of categorical data

Multiple imputation continuous data: bivariate case
⇒ Proper multiple imputation with $y_i = x_i \beta + \varepsilon_i$
1 Variability of the parameters, M plausible: $(\hat\beta)^1, ..., (\hat\beta)^M$
  ⇒ Bootstrap
  ⇒ Posterior distribution: Data Augmentation (Tanner & Wong, 1987)
2 Noise: for m = 1, ..., M, missing values $y_i^m$ are imputed by drawing from the predictive distribution $N(x_i \hat\beta^m, (\hat\sigma^2)^m)$

                  Improper  Proper
CI $\mu_y$ 95%    0.818     0.935
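The two steps can be sketched in base R on simulated data. This is an illustration of the bootstrap variant only, under assumed values (n = 100, 30% of y missing completely at random, M = 20); none of the object names come from the slides.

```r
set.seed(1)
# Toy bivariate data (simulated for illustration): y depends linearly on x
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[sample(n, 30)] <- NA            # 30% of y missing completely at random
obs <- !is.na(y)

M <- 20
imputed <- replicate(M, {
  # Step 1: variability of the parameters, bootstrap of the observed pairs
  idx <- sample(which(obs), replace = TRUE)
  fit <- lm(y ~ x, data = data.frame(x = x[idx], y = y[idx]))
  sigma_m <- summary(fit)$sigma
  # Step 2: noise, draw the missing y from the predictive distribution
  y_m <- y
  mu <- predict(fit, newdata = data.frame(x = x[!obs]))
  y_m[!obs] <- rnorm(sum(!obs), mean = mu, sd = sigma_m)
  y_m
})
dim(imputed)   # n rows, M columns: one completed y per imputation
```

Dropping step 1 (using the same fitted parameters for every m) gives the improper version, which ignores the uncertainty on the parameters.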

Joint modeling
⇒ Hypothesis: $x_{i.} \sim N(\mu, \Sigma)$
Algorithm Expectation Maximization Bootstrap:
1 Bootstrap rows: $X^1, ..., X^M$
  EM algorithm: $(\hat\mu^1, \hat\Sigma^1), ..., (\hat\mu^M, \hat\Sigma^M)$
2 Imputation: $x_{ij}^m$ drawn from $N(\hat\mu^m, \hat\Sigma^m)$
Easy to parallelize. Implemented in Amelia (website)
[Photos: Amelia Earhart, James Honaker, Gary King, Matt Blackwell]

(Fully) Conditional modeling
⇒ Hypothesis: one model/variable
1 Initial imputation: mean imputation
2 For a variable j
  2.1 $(\beta_{-j}, \sigma_{-j})$ drawn from a Bootstrap or a posterior distribution
  2.2 Imputation: stochastic regression, $x_{ij}$ drawn from $N(X_{-j} \beta_{-j}, \sigma_{-j})$
3 Cycling through variables
4 Repeat M times steps 2 and 3
⇒ Iteratively refine the imputation.
⇒ With continuous variables and a regression/variable: $N(\mu, \Sigma)$
Implemented in mice (website)
“There is no clear-cut method for determining whether the MICE algorithm has converged” Stef van Buuren
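The chained procedure can be sketched in base R for continuous variables. This is one possible reading of the steps above, not the mice implementation: each imputed data set comes from an independent run of a fixed number of cycles (`n_cycles`, an assumed value), and the parameters are drawn with a bootstrap. The function `fcs_once` and the simulated data are hypothetical.

```r
set.seed(2)
# Toy data (simulated for illustration): three correlated continuous variables
n <- 200
z <- rnorm(n)
don <- data.frame(x1 = z + rnorm(n, sd = 0.5),
                  x2 = 2 * z + rnorm(n, sd = 0.5),
                  x3 = -z + rnorm(n, sd = 0.5))
for (v in names(don)) don[sample(n, 20), v] <- NA
miss <- is.na(don)

fcs_once <- function(don, miss, n_cycles = 10) {
  comp <- don
  # Step 1: initial imputation by the mean of each variable
  for (v in names(comp)) comp[miss[, v], v] <- mean(don[[v]], na.rm = TRUE)
  for (it in seq_len(n_cycles)) {
    for (v in names(comp)) {                 # step 3: cycle through variables
      obs_rows <- which(!miss[, v])
      # Step 2.1: parameters drawn through a bootstrap of the rows where v is observed
      boot <- comp[sample(obs_rows, replace = TRUE), ]
      fit <- lm(reformulate(setdiff(names(comp), v), response = v), data = boot)
      # Step 2.2: stochastic regression imputation
      mu <- predict(fit, newdata = comp[miss[, v], , drop = FALSE])
      comp[miss[, v], v] <- rnorm(length(mu), mean = mu, sd = summary(fit)$sigma)
    }
  }
  comp
}

M <- 5
imputations <- lapply(seq_len(M), function(m) fcs_once(don, miss))
```

Replacing `lm` by a logistic or multinomial model for the corresponding variables is what makes the approach flexible: one model per variable.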

Joint / Conditional modeling
⇒ Both: imputed values are seen as drawn from a joint distribution (even if the joint distribution does not exist)
⇒ Conditional modeling takes the lead?
• Flexible: one model/variable. Easy to deal with interactions and variables of different nature (binary, ordinal, categorical...)
• Many statistical models are conditional models!
• Tailor to your data
• Appears to work quite well in practice
⇒ Drawbacks: one model/variable... tedious...
⇒ What to do with high correlation or when n < p?
• JM shrinks the covariance: $\Sigma + kI$ (selection of k?)
• CM: ridge regression or selection of predictors per variable ⇒ a lot of tuning ... not so easy ...

Multiple imputation with Bootstrap/Bayesian PCA
$$x_{ij} = \tilde{x}_{ij} + \varepsilon_{ij} = \sum_{s=1}^{S} \sqrt{\lambda_s}\, u_{is} v_{js} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma^2)$$
1 Variability of the parameters, M plausible: $(\hat{x}_{ij})^1, ..., (\hat{x}_{ij})^M$
  Bootstrap - Iterative PCA
2 Noise: for m = 1, ..., M, missing values $x_{ij}^m$ drawn from $N(\hat{x}_{ij}^m, \hat\sigma^2)$
Implemented in missMDA (website)
[Photo: François Husson]
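The two steps can be sketched in base R. This is a simplified illustration, not the missMDA implementation: the iterative PCA is unregularized, the bootstrap is a parametric bootstrap of the fitted matrix (an assumption of this sketch), and the noise variance is estimated crudely without a degrees-of-freedom correction. The function `iter_pca` and the simulated data are hypothetical.

```r
set.seed(3)
# Toy data (simulated for illustration): rank-2 signal plus noise, 15% missing
n <- 60; p <- 6; S <- 2
signal <- matrix(rnorm(n * S), n, S) %*% matrix(rnorm(S * p), S, p)
X <- signal + matrix(rnorm(n * p, sd = 0.3), n, p)
X[sample(n * p, round(0.15 * n * p))] <- NA

# Iterative PCA: alternate a rank-S SVD fit and the imputation of the missing cells
iter_pca <- function(X, S, n_iter = 200, tol = 1e-8) {
  miss <- is.na(X)
  Xc <- X
  Xc[miss] <- colMeans(X, na.rm = TRUE)[col(X)[miss]]   # mean initialization
  for (it in seq_len(n_iter)) {
    m <- colMeans(Xc)
    s <- svd(sweep(Xc, 2, m))
    fit <- s$u[, 1:S, drop = FALSE] %*% (s$d[1:S] * t(s$v[, 1:S, drop = FALSE]))
    fit <- sweep(fit, 2, m, "+")
    delta <- sum((Xc[miss] - fit[miss])^2)
    Xc[miss] <- fit[miss]
    if (delta < tol) break
  }
  list(completed = Xc, fitted = fit, miss = miss)
}

fit0 <- iter_pca(X, S)
obs <- !fit0$miss
sigma_hat <- sqrt(mean((X[obs] - fit0$fitted[obs])^2))   # crude noise estimate

M <- 20
mi <- lapply(seq_len(M), function(m) {
  # Step 1: variability of the parameters (parametric bootstrap, then iterative PCA)
  Xb <- fit0$fitted + matrix(rnorm(n * p, sd = sigma_hat), n, p)
  Xb[fit0$miss] <- NA
  fit_m <- iter_pca(Xb, S)$fitted
  # Step 2: noise added to the prediction of the missing cells
  Xm <- X
  Xm[fit0$miss] <- fit_m[fit0$miss] + rnorm(sum(fit0$miss), sd = sigma_hat)
  Xm
})
```

Each element of `mi` is one completed data set; the observed cells are identical across the M tables and only the imputed cells vary.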

Simulations
• 1000 simulations
• data set drawn from $N_p(\mu, \Sigma)$ with a two-block structure, varying n (30 or 200), p (6 or 60) and ρ (0.3 or 0.9)
  [Figure: example of a two-block matrix with entries 0.8 and 0]
• 10% or 30% of missing values using a MCAR mechanism
• multiple imputation using M = 20 imputed data sets
• Quantities of interest: $\theta_1 = E[Y]$, $\theta_2 = \beta_1$, $\theta_3 = \rho$
• Criteria
  • bias
  • CI width, coverage
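One replicate of such a design can be generated in base R. The sketch below assumes that "two-block structure" means correlation ρ within each block and 0 between blocks, with $\mu = 0$ and unit variances; these choices, like the object names, are assumptions made for the example.

```r
set.seed(4)
n <- 200; p <- 6; rho <- 0.9; prop_miss <- 0.3

# Two-block correlation matrix: rho within each block, 0 between blocks
half <- p / 2
block <- matrix(rho, half, half)
diag(block) <- 1
Sigma <- rbind(cbind(block, matrix(0, half, half)),
               cbind(matrix(0, half, half), block))

# Draw n rows from N_p(0, Sigma) using the Cholesky factor of Sigma
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

# MCAR: every cell is set to NA with the same probability, whatever its value
X_na <- X
X_na[matrix(runif(n * p) < prop_miss, n, p)] <- NA
mean(is.na(X_na))   # close to prop_miss
```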

Results for the expectation
      parameters            confidence interval width      coverage
      n    p    ρ    %      Amelia  MICE   BayesMIPCA      Amelia  MICE   BayesMIPCA
 1    30   6    0.3  0.1    0.803   0.805  0.781           0.955   0.953  0.950
 2    30   6    0.3  0.3            1.010  0.898                   0.971  0.949
 3    30   6    0.9  0.1    0.763   0.759  0.756           0.952   0.95   0.949
 4    30   6    0.9  0.3            0.818  0.783                   0.965  0.953
 5    30   60   0.3  0.1                   0.775                          0.955
 6    30   60   0.3  0.3                   0.864                          0.952
 7    30   60   0.9  0.1                   0.742                          0.953
 8    30   60   0.9  0.3                   0.759                          0.954
 9    200  6    0.3  0.1    0.291   0.294  0.292           0.947   0.947  0.946
10    200  6    0.3  0.3    0.328   0.334  0.325           0.954   0.959  0.952
11    200  6    0.9  0.1    0.281   0.281  0.281           0.953   0.95   0.952
12    200  6    0.9  0.3    0.288   0.289  0.288           0.948   0.951  0.951
13    200  60   0.3  0.1            0.304  0.289                   0.957  0.945
14    200  60   0.3  0.3            0.384  0.313                   0.981  0.958
15    200  60   0.9  0.1            0.282  0.279                   0.951  0.948
16    200  60   0.9  0.3            0.296  0.283                   0.958  0.952

Joint, conditional and PCA
⇒ Good estimates of the parameters and their variance from incomplete data (coverage close to 0.95)
The variability due to missing values is well taken into account
Amelia & mice have difficulties with large correlations or n < p
missMDA does not, but requires a tuning parameter: the number of dimensions
Amelia & missMDA are based on linear relationships
mice is more flexible (one model per variable)
MI based on PCA works in a large range of configurations: n < p or n > p, strong or weak relationships, low or high percentage of missing values

Remarks
⇒ MI theory: good theory for regression parameters. Others?
⇒ Imputation model as complex as the analysis model (interaction)
⇒ Some practical issues:
• Imputations not in agreement (X and $X^2$): passive imputation
• Imputation out of range? (Predictive mean matching, pmm)
• Problems of logical bounds (> 0) ⇒ truncation?
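Predictive mean matching avoids out-of-range imputations because every imputed value is copied from an observed one. The base R sketch below is a simplified single-imputation version on simulated data: it omits the parameter draw that a proper multiple imputation would add, and the number of donors `k = 5` is an assumed value.

```r
set.seed(5)
# Toy data (simulated): a strictly positive variable y with 25% of values missing
n <- 80
x <- runif(n, 1, 10)
y <- exp(0.3 * x + rnorm(n, sd = 0.2))
y[sample(n, 20)] <- NA
obs <- !is.na(y)

# Linear model fitted on the observed cases, used only to compute predicted means
fit <- lm(y ~ x, subset = obs)
pred_obs  <- predict(fit, newdata = data.frame(x = x[obs]))
pred_miss <- predict(fit, newdata = data.frame(x = x[!obs]))

# For each missing case, pick at random one of the k observed cases (donors)
# whose predicted mean is closest, and copy the donor's observed value
k <- 5
y_obs <- y[obs]
y_imp <- y
y_imp[!obs] <- sapply(pred_miss, function(pm) {
  donors <- order(abs(pred_obs - pm))[1:k]
  y_obs[sample(donors, 1)]
})
```

Even though the linear model could predict negative values, `y_imp` stays positive since only observed values are reused.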

MI for categorical variables
• Loglinear model: R package cat (J.L. Schafer)
• Fully conditional specification: R package mice (van Buuren)
• Imputation with Gaussian distribution
• Latent class variables, mixture models: each sample belongs to a latent class in which variables are independent (D. Vidotto, M. C. Kapteijn, and Vermunt J.K, 2014)
  Non-parametric version: Dirichlet process mixture of products of multinomial distributions model, DPMPM (Y. Si and J.P. Reiter, 2014)

Multiple imputation for categorical data using MCA
A set of parameters:
$$\left(U_{I \times S}, \Lambda^{1/2}_{S \times S}, V_{J \times S}\right)^1, \ldots, \left(U_{I \times S}, \Lambda^{1/2}_{S \times S}, V_{J \times S}\right)^M$$
obtained using a non-parametric Bootstrap approach:
1 Generate M bootstrap replicates
2 Estimate the parameters on each incomplete replicate
3 Add uncertainty on the prediction

Multiple imputation with MCA
1 Variability of the parameters of MCA $(U_{I \times S}, \Lambda^{1/2}_{S \times S}, V_{J \times S})$ using a non-parametric bootstrap:
  → define M weightings $(R^m)_{1 \le m \le M}$ for the individuals
2 Estimate MCA parameters using SVD of $\left(X, \frac{1}{K}(D_\Sigma)^{-1}, R^m\right)$
[Figure: the M imputed indicator matrices $\hat{X}^1, \hat{X}^2, ..., \hat{X}^M$. Observed cells are 0 or 1; imputed cells hold values such as (0.81, 0.19) and (0.25, 0.75) in $\hat{X}^1$, (0.60, 0.40) and (0.26, 0.74) in $\hat{X}^2$, (0.74, 0.16) and (0.20, 0.80) in $\hat{X}^M$.]
[Figure: the M categorical tables obtained by assigning each imputed cell to its majority category; the same categories (A, B, C) are imputed in every table.]
majority ⇒ lack of variability

Multiple imputation with MCA
1 Variability of the parameters of MCA $(U_{I \times S}, \Lambda^{1/2}_{S \times S}, V_{J \times S})$ using a non-parametric bootstrap:
  → define M weightings $(R^m)_{1 \le m \le M}$ for the individuals
2 Estimate MCA parameters using SVD of $\left(X, \frac{1}{K}(D_\Sigma)^{-1}, R^m\right)$
[Figure: the M imputed indicator matrices $\hat{X}^1, \hat{X}^2, ..., \hat{X}^M$. Observed cells are 0 or 1; imputed cells hold values such as (0.81, 0.19) and (0.25, 0.75) in $\hat{X}^1$, (0.60, 0.40) and (0.26, 0.74) in $\hat{X}^2$, (0.74, 0.16) and (0.20, 0.80) in $\hat{X}^M$.]
3 Draw categories from the values of $(\hat{X}^m)_{1 \le m \le M}$
[Figure: the M categorical tables obtained by random draws; the imputed categories now differ from one table to another (e.g. A in one table and B in another for the same cell).]
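The difference between the majority rule and the random draw of step 3 can be illustrated in base R. The matrix `fuzzy` below holds made-up values for a single variable with three categories, and the sketch assumes the imputed values of each row are non-negative and sum to 1 so that they can be used as probabilities.

```r
set.seed(6)
# Hypothetical fuzzy values for one categorical variable with levels A, B, C:
# rows 1 and 3 are observed (0/1), rows 2 and 4 were imputed
fuzzy <- rbind(c(1.00, 0.00, 0.00),
               c(0.81, 0.19, 0.00),
               c(0.00, 1.00, 0.00),
               c(0.25, 0.60, 0.15))
colnames(fuzzy) <- c("A", "B", "C")

# Majority rule: always the same category, so no variability across imputations
majority <- colnames(fuzzy)[max.col(fuzzy, ties.method = "first")]

# Random draw: one category per row, with the fuzzy values used as probabilities
draw_categories <- function(fuzzy) {
  apply(fuzzy, 1, function(pr) sample(colnames(fuzzy), 1, prob = pr))
}
M <- 5
draws <- sapply(seq_len(M), function(m) draw_categories(fuzzy))
```

Each column of `draws` is one imputed version of the variable: observed rows never change, while imputed rows vary according to their degrees of membership.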

Simulations
• Quantities of interest: θ = parameters of a logistic model
• 200 simulations from real data sets
  • the real data set is considered as a population
  • draw one sample from the data set
  • generate 20% of missing values
  • multiple imputation using M = 5 imputed data sets
• Criteria
  • bias
  • CI width, coverage
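The slides do not detail how the M analyses are combined into one estimate and one confidence interval. A common choice is Rubin's rules, sketched below in base R for a scalar parameter; the function `pool_rubin` and the numerical values are hypothetical.

```r
# Rubin's rules for a scalar parameter estimated on M completed data sets
pool_rubin <- function(est, se, conf = 0.95) {
  M <- length(est)
  q_bar <- mean(est)                   # pooled estimate
  u_bar <- mean(se^2)                  # within-imputation variance
  b     <- var(est)                    # between-imputation variance
  t_var <- u_bar + (1 + 1 / M) * b     # total variance
  df    <- (M - 1) * (1 + u_bar / ((1 + 1 / M) * b))^2
  half  <- qt(1 - (1 - conf) / 2, df) * sqrt(t_var)
  c(estimate = q_bar, se = sqrt(t_var),
    lower = q_bar - half, upper = q_bar + half)
}

# Hypothetical estimates and standard errors of one logistic coefficient, M = 5
est <- c(0.52, 0.47, 0.55, 0.50, 0.46)
se  <- c(0.10, 0.11, 0.10, 0.12, 0.11)
res <- pool_rubin(est, se)
res
```

The CI width criterion is then `res["upper"] - res["lower"]`, and the coverage is the proportion of simulations in which the interval contains the population value.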

Results - Inference
[Figure: coverage (axis from 0.80 to 1.00) for three data sets. Titanic: MIMCA 5, Loglinear, Latent class, FCS-log, FCS-rf. Galetas: MIMCA 2, Loglinear, Latent class, FCS-log, FCS-rf. Income: MIMCA 5, Latent class, FCS-log, FCS-rf.]

                      Titanic  Galetas  Income
Number of variables   4        4        14
Number of categories  ≤ 4      ≤ 11     ≤ 9

Results - Time
                       Titanic   Galetas   Income
MIMCA                  2.750     8.972     58.729
Loglinear              0.740     4.597     NA
Latent class model     10.854    17.414    143.652
FCS logistic           4.781     38.016    881.188
FCS forests            265.771   112.987   6329.514
Table: Time in seconds

                       Titanic   Galetas   Income
Number of individuals  2201      1192      6876
Number of variables    4         4         14

Conclusion
Multiple imputation methods for continuous and categorical data using dimensionality reduction methods
Properties:
• requires a small number of parameters
• captures the relationships between variables
• captures the similarities between individuals
From a practical point of view:
• can be applied on data sets of various dimensions
• provides correct inferences for analysis models based on relationships between pairs of variables
• requires choosing the number of dimensions S
Perspective:
• mixed data

Mixed variables
⇒ Joint modeling:
• General location model (Schafer, 1997) ⇒ problem when there are many categories
• Transform the categorical variables into dummy variables and deal with them as continuous variables (Amelia)
• Latent class models (Vermunt); nonparametric Bayesian models (work in progress, Dunson, Reiter, Duke University)
⇒ Conditional modeling: linear, logistic, multinomial logit models (mice), Random forests

To conclude
Take home message:
• “The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all, and it is dangerous because it lumps together situations where the problem is sufficiently minor that it can be legitimately handled in this way and situations where standard estimators applied to the real and imputed data have substantial biases.” (Dempster and Rubin, 1983)
• Advanced methods are available to estimate parameters and their variance (taking into account the variability due to missing values)
• Multiple imputation is an appealing method ... but what can we do with big data?
• Still an active area of research

Resources
⇒ Software:
• van Buuren webpage: http://www.stefvanbuuren.nl/mi/Software.html
• CRAN Task View: Official Statistics & Survey Methodology
⇒ Recent books:
• van Buuren (2012). Flexible Imputation of Missing Data. Chapman & Hall/CRC
• Carpenter & Kenward (2013). Multiple Imputation and its Application. Wiley
• G. Molenberghs, G. Fitzmaurice, M.G. Kenward, A. Tsiatis & G. Verbeke (Nov. 2014). Handbook of Missing Data. Chapman & Hall/CRC
⇒ Little & Rubin (2002). Statistical Analysis with Missing Data. Schafer (1997). Analysis of Incomplete Multivariate Data
⇒ J.L. Schafer & J.W. Graham (2002). Missing Data: Our View of the State of the Art. Psychological Methods, 7, 147-177
⇒ B. Efron (1994). Missing Data, Imputation and the Bootstrap. Journal of the American Statistical Association, 89(426), 463-475

Contributors on the topic of multiple imputation
• J. Honaker - G. King - M. Blackwell (Harvard): Amelia
• S. van Buuren (Utrecht): mice
• F. Husson - J. Josse (Rennes): missMDA
• A. Gelman - J. Hill - Y. Su (Columbia): mi
• J. Reiter (Duke): NPBayesImpute, Non-Parametric Bayesian Multiple Imputation for Categorical Data
• J. Bartlett - J. Carpenter - M. Kenward (UCL): smcfcs, Substantive model compatible FCS multiple imputation
• H. Goldstein (Bristol): realcom for multi-level data
• J.K. Vermunt (Tilburg): poLCA, latent class models
• Shaun Seaman (Medical Research Council Biostatistics Unit, UK), Roderick Little (Michigan)...
• Donald B Rubin (Harvard)

Conference on missing data and matrix completion
http://missdata2015.agrocampus-ouest.fr/

A real dataset
      O3   T9    T12   T15   Ne9  Ne12  Ne15  Vx9      Vx12     Vx15     O3v
0601  NA   15.6  18.5  18.4  4    4     8     NA       -1.7101  -0.6946  84
0602  82   17    18.4  17.7  5    5     7     NA       NA       NA       87
0603  92   NA    17.6  19.5  2    5     4     2.9544   1.8794   0.5209   82
0604  114  16.2  NA    NA    1    1     0     NA       NA       NA       92
0605  94   17.4  20.5  NA    8    8     7     -0.5     NA       -4.3301  114
0606  80   17.7  NA    18.3  NA   NA    NA    -5.6382  -5       -6       94
0607  NA   16.8  15.6  14.9  7    8     8     -4.3301  -1.8794  -3.7588  80
0610  79   14.9  17.5  18.9  5    5     4     0        -1.0419  -1.3892  NA
0611  101  NA    19.6  21.4  2    4     4     -0.766   NA       -2.2981  79
0612  NA   18.3  21.9  22.9  5    6     8     1.2856   -2.2981  -3.9392  101
0613  101  17.3  19.3  20.2  NA   NA    NA    -1.5     -1.5     -0.8682  NA
...
0919  NA   14.8  16.3  15.9  7    7     7     -4.3301  -6.0622  -5.1962  42
0920  71   15.5  18    17.4  7    7     6     -3.9392  -3.0642  0        NA
0921  96   NA    NA    NA    3    3     3     NA       NA       NA       71
0922  98   NA    NA    NA    2    2     2     4        5        4.3301   96
0923  92   14.7  17.6  18.2  1    4     6     5.1962   5.1423   3.5      98
0924  NA   13.3  17.7  17.7  NA   NA    NA    -0.9397  -0.766   -0.5     92
0925  84   13.3  17.7  17.8  3    5     6     0        -1       -1.2856  NA
0927  NA   16.2  20.8  22.1  6    5     5     -0.6946  -2       -1.3681  71
0928  99   16.9  23    22.6  NA   4     7     1.5      0.8682   0.8682   NA
0929  NA   16.9  19.8  22.1  6    5     3     -4       -3.7588  -4       99
0930  70   15.7  18.6  20.7  NA   NA    NA    0        -1.0419  -4       NA

Count missing values
> library(VIM)
> aggr(don,only.miss=TRUE,sortVar=TRUE)
> res<-summary(aggr(don,prop=TRUE,combined=TRUE))$combinations
> res[rev(order(res[,2])),]

Variables sorted by number of missings:
Variable  Count
Ne12      0.37500000
T9        0.33035714
T15       0.33035714
Ne9       0.30357143
T12       0.29464286
Ne15      0.28571429
Vx15      0.18750000
Vx9       0.16071429
maxO3     0.14285714
maxO3v    0.10714286
Vx12      0.08928571

Combinations           Count  Percent
0:0:0:0:0:0:0:0:0:0:0  13     11.6071429
0:1:1:1:0:0:0:0:0:0:0  7      6.2500000
0:0:0:0:0:1:0:0:0:0:0  5      4.4642857
0:1:0:0:0:0:0:0:0:0:0  4      3.5714286
0:1:0:0:1:1:1:0:0:0:0  3      2.6785714
0:0:1:0:0:0:0:0:0:0:0  3      2.6785714
0:0:0:1:0:0:0:0:0:0:0  3      2.6785714
0:0:0:0:1:1:1:0:0:0:0  3      2.6785714
0:0:0:0:0:1:0:0:0:0:1  3      2.6785714
0:1:1:1:1:0:0:0:0:0:0  2      1.7857143
0:0:0:0:1:0:0:0:0:1:0  2      1.7857143
0:0:0:0:0:0:1:1:0:0:0  2      1.7857143
0:0:0:0:0:0:1:0:0:0:0  2      1.7857143
.....................  .      ...
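The same two summaries can be obtained without any package. The sketch below uses a small made-up data frame named `don` (hypothetical values, not the ozone data) and codes each pattern as in the output above, with 1 for a missing cell.

```r
# Toy data frame with missing values (hypothetical values)
don <- data.frame(maxO3 = c(87, NA, 92, 114, 94),
                  T9    = c(15.6, 17.0, NA, 16.2, NA),
                  Ne9   = c(4, 5, 2, NA, NA))

# Proportion of missing values per variable, sorted in decreasing order
prop_miss <- sort(colMeans(is.na(don)), decreasing = TRUE)
prop_miss

# Missingness pattern of each row, coded like "0:1:0" (1 = missing), with counts
pattern <- apply(is.na(don) * 1, 1, paste, collapse = ":")
sort(table(pattern), decreasing = TRUE)
```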

Pattern visualization
[Figure: output of aggr. Left: bar chart of the proportion of missings per variable (0.00 to 0.35), sorted from Ne12 to Vx12. Right: combinations of missing values across the same variables.]
> aggr(don,only.miss=TRUE,sortVar=TRUE)

Visualization
[Figure: left, matrix plot of the data set (variables maxO3 to maxO3v against the observation index, 0 to 100). Right, margin plot of maxO3 against T9.]
> matrixplot(don,sortby=2)
> marginplot(don[,c("T9","maxO3")])

Visualization with Multiple Correspondence Analysis
⇒ Create the missingness matrix
> mis.ind <- matrix("o", nrow = nrow(don), ncol = ncol(don))
> mis.ind[is.na(don)] <- "m"
> dimnames(mis.ind) <- dimnames(don)
> mis.ind
         maxO3 T9  T12 T15 Ne9 Ne12 Ne15 Vx9 Vx12 Vx15 maxO3v
20010601 "o"   "o" "o" "m" "o" "o"  "o"  "o" "o"  "o"  "o"
20010602 "o"   "m" "m" "m" "o" "o"  "o"  "o" "o"  "o"  "o"
20010603 "o"   "o" "o" "o" "o" "m"  "m"  "o" "m"  "o"  "o"
20010604 "o"   "o" "o" "m" "o" "o"  "o"  "m" "o"  "o"  "o"
20010605 "o"   "m" "o" "o" "m" "m"  "m"  "o" "o"  "o"  "o"
20010606 "o"   "o" "o" "o" "o" "m"  "o"  "o" "o"  "o"  "o"
20010607 "o"   "o" "o" "o" "o" "o"  "m"  "o" "o"  "o"  "o"
20010610 "o"   "o" "o" "o" "o" "o"  "m"  "o" "o"  "o"  "o"

Visualization with Multiple Correspondence Analysis

[Figure: MCA graph of the categories, Dim 1 (19.07%) against Dim 2 (17.71%). Each variable appears twice, once as missing (_m) and once as observed (_o): maxO3_m, maxO3_o, T9_m, T9_o, T12_m, T12_o, T15_m, T15_o, Ne9_m, Ne9_o, Ne12_m, Ne12_o, Ne15_m, Ne15_o, Vx9_m, Vx9_o, Vx12_m, Vx12_o, Vx15_m, Vx15_o, maxO3v_m, maxO3v_o.]

> library(FactoMineR)
> resMCA <- MCA(mis.ind)
> plot(resMCA,invis="ind",title="MCA graph of the categories")

74 / 92

Imputation with PCA

⇒ Step 1: Estimation of the number of dimensions

> library(missMDA)
> nb <- estim_ncpPCA(don,method.cv="Kfold")
> nb$ncp #2
> plot(0:5,nb$criterion,xlab="nb dim", ylab="MSEP")

[Figure: MSEP (roughly 4000 to 7000) against the number of dimensions (0 to 5).]

75 / 92

Imputation with PCA

⇒ Step 2: Imputation of the missing values

> res.comp <- imputePCA(don,ncp=2)
> res.comp$completeObs[1:3,]
     maxO3    T9   T12   T15 Ne9 Ne12 Ne15   Vx9  Vx12  Vx15 maxO3v
0601    87 15.60 18.50 20.47   4 4.00 8.00  0.69 -1.71 -0.69     84
0602    82 18.51 20.88 21.81   5 5.00 7.00 -4.33 -4.00 -3.00     87
0603    92 15.30 17.60 19.50   2 3.98 3.81  2.95  1.97  0.52     82

76 / 92
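To make the idea behind PCA imputation concrete, here is a minimal base R sketch of the plain (unregularized) iterative PCA algorithm: fill the missing cells with column means, fit a rank-ncp PCA, replace the missing cells by the fitted values, and repeat until the imputed values stabilize. This is an illustration under simplifying assumptions, not the missMDA code; imputePCA uses a regularized variant by default, and the function name iterative_pca_impute and the simulated data are made up for the example.

```r
# Minimal iterative PCA imputation (unregularized); NOT the missMDA implementation.
# X: numeric matrix with NAs; ncp: number of dimensions kept.
iterative_pca_impute <- function(X, ncp = 2, maxit = 1000, tol = 1e-8) {
  miss <- is.na(X)
  Xhat <- X
  # Initialise the missing cells with the column means of the observed values
  col_means <- colMeans(X, na.rm = TRUE)
  Xhat[miss] <- col_means[col(X)[miss]]
  for (it in seq_len(maxit)) {
    mu <- colMeans(Xhat)
    Xc <- sweep(Xhat, 2, mu)                       # centre the completed data
    s <- svd(Xc, nu = ncp, nv = ncp)
    # Rank-ncp reconstruction, then add the means back
    fit <- s$u %*% diag(s$d[seq_len(ncp)], ncp, ncp) %*% t(s$v)
    fit <- sweep(fit, 2, mu, "+")
    change <- sum((Xhat[miss] - fit[miss])^2)
    Xhat[miss] <- fit[miss]                        # observed cells are never modified
    if (change < tol) break
  }
  Xhat
}

# Simulated example: rank-2 signal plus noise, 15% of the cells removed at random
set.seed(1)
n <- 50; p <- 6
scores <- matrix(rnorm(n * 2), n, 2)
loadings <- matrix(rnorm(2 * p), 2, p)
X_full <- scores %*% loadings + matrix(rnorm(n * p, sd = 0.1), n, p)
X <- X_full
X[sample(n * p, 45)] <- NA
miss <- is.na(X)

imp <- iterative_pca_impute(X, ncp = 2)
# Root mean squared error of the imputed cells against the values that were removed
print(sqrt(mean((imp[miss] - X_full[miss])^2)))
```

The unregularized version can overfit when many values are missing or the noise is strong, which is one reason to prefer the packaged function on real data.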

PCA representation

⇒ Step 3: PCA on the completed data set

[Figure: individuals factor map (PCA), Dim 1 (57.47%) against Dim 2 (21.34%), with points colored by wind direction (East, North, West, South); variables factor map (PCA), Dim 1 (55.85%) against Dim 2 (21.73%), showing T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15, maxO3v and maxO3.]

> imp <- cbind.data.frame(res.comp$completeObs,WindDirection)
> res.pca <- PCA(imp,quanti.sup=1,quali.sup=12)
> plot(res.pca, hab=12, lab="quali"); plot(res.pca, choix="var")
> res.pca$ind$coord #scores (principal components)

77 / 92

Multiple imputation in practice

⇒ Step 1: Generate M imputed data sets

> library(Amelia)
> res.amelia <- amelia(don,m=100) ## in combination with zelig
> library(mice)
> res.mice <- mice(don,m=100,defaultMethod="norm.boot")
> library(missMDA)
> res.MIPCA <- MIPCA(don,ncp=2,nboot=100)
> res.MIPCA$res.MI

78 / 92

Multiple imputation in practice

⇒ Step 2: visualization

[Figure: left, "Observed and Imputed values of T12" (fraction missing: 0.295), relative density of the mean imputations and of the observed values; right, "Observed versus Imputed Values of maxO3", imputed values against observed values, colored by fraction of missingness (0-.2, .2-.4, .4-.6, .6-.8, .8-1).]

> library(Amelia)
> compare.density(res.amelia, var="T12")
> overimpute(res.amelia, var="maxO3")

See also the function stripplot in mice.

79 / 92

Multiple imputation in practice

⇒ Step 2: visualization

> res.MIPCA <- MIPCA(don,ncp=2)
> plot(res.MIPCA,choice= "ind.supp"); plot(res.MIPCA,choice= "var")

[Figure: left, "Supplementary projection", the 112 individuals on Dim 1 (57.20%) and Dim 2 (20.27%); right, "Variable representation" on the same two dimensions, showing maxO3, T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15 and maxO3v.]

80 / 92

Multiple imputation in practice

⇒ Step 3: Regression on each table and pooling of the results

\hat{\beta} = \frac{1}{M} \sum_{m=1}^{M} \hat{\beta}_m

T = \frac{1}{M} \sum_{m} \mathrm{Var}(\hat{\beta}_m) + \left(1 + \frac{1}{M}\right) \frac{1}{M-1} \sum_{m} (\hat{\beta}_m - \hat{\beta})^2

> library(mice)
> imp.mice <- mice(don,m=100,defaultMethod="norm")
> lm.mice.out <- with(imp.mice, lm(maxO3 ~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v))
> pool.mice <- pool(lm.mice.out)
> summary(pool.mice)
              est    se     t    df Pr(>|t|)  lo 95 hi 95 nmis  fmi lambda
(Intercept) 19.31 16.30  1.18 50.48     0.24 -13.43 52.05   NA 0.46   0.44
T9          -0.88  2.25 -0.39 26.43     0.70  -5.50  3.75   37 0.71   0.69
T12          3.29  2.38  1.38 27.54     0.18  -1.59  8.18   33 0.70   0.68
....
Vx15         0.23  1.33  0.17 39.00     0.87  -2.47  2.93   21 0.57   0.55
maxO3v       0.36  0.10  3.65 46.03     0.00   0.16  0.56   12 0.50   0.48

81 / 92
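The pooling step on this slide (the average of the M estimates, and a total variance made of the average within-imputation variance plus the inflated between-imputation variance) is short enough to write by hand. The sketch below applies it to one coefficient; the function name pool_rubin and the five estimates are made up for illustration, and mice's pool additionally computes degrees of freedom and related diagnostics that are not reproduced here.

```r
# Pool M estimates of one coefficient with Rubin's rules.
# est: vector of M point estimates; se: their M standard errors.
pool_rubin <- function(est, se) {
  M <- length(est)
  beta_hat <- mean(est)        # pooled point estimate
  within <- mean(se^2)         # average within-imputation variance
  between <- var(est)          # between-imputation variance (divisor M - 1)
  total <- within + (1 + 1 / M) * between
  c(est = beta_hat, se = sqrt(total), within = within, between = between)
}

# Hypothetical estimates from M = 5 imputed data sets (made-up numbers)
est <- c(3.1, 3.4, 2.9, 3.6, 3.0)
se  <- c(0.5, 0.6, 0.5, 0.55, 0.5)

pooled <- pool_rubin(est, se)
print(round(pooled, 4))
```

The pooled standard error is always at least as large as the root of the average within-imputation variance, which is how the uncertainty due to the missing values enters the inference.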

Mixed imputation in practice

> library(missMDA)
> imputeFAMD(mydata,ncp=2)
> library(missForest)
> missForest(mydata)
> library(mice)
> mice(mydata)
> mice(mydata, defaultMethod = "rf") ## mice with random forests

82 / 92

An ecological data set

Glopnet data: 2494 species described by 6 quantitative variables
• LMA (leaf mass per area)
• LL (leaf lifespan)
• Amass (photosynthetic assimilation)
• Nmass (leaf nitrogen)
• Pmass (leaf phosphorus)
• Rmass (dark respiration rate)
and 1 categorical variable: the biome

Wright IJ, et al. (2004). The worldwide leaf economics spectrum. Nature, 428:821.
www.nature.com/nature/journal/v428/n6985/extref/nature02403-s2.xls

83 / 92

An ecological data set

> sum(is.na(don))/(nrow(don)*ncol(don)) # 53% of missing values
[1] 0.5338145
> dim(na.omit(don)) ## Delete species with missing values
[1] 72 6 ## only 72 remaining species!
> library(VIM)
> aggr(don,numbers=TRUE,sortVar=TRUE)

[Figure: aggr plot. Left panel: proportion of missings per variable (axis from 0.0 to 0.8), sorted as Rmass, LL, Pmass, Amass, Nmass, LMA. Right panel: combinations of missing values with their frequencies, from 0.2326, 0.1985 and 0.1359 for the most frequent patterns down to 0.0004.]

84 / 92

An ecological data set

[Figure: MCA graph of the categories, Dim 1 (33.67%) against Dim 2 (21.07%), showing LL_m, LL_o, LMA_m, LMA_o, Nmass_m, Nmass_o, Pmass_m, Pmass_o, Amass_m, Amass_o, Rmass_m, Rmass_o.]

> mis.ind <- matrix("o",nrow=nrow(don),ncol=ncol(don))
> mis.ind[is.na(don)] <- "m"
> dimnames(mis.ind) <- dimnames(don)
> library(FactoMineR)
> resMCA <- MCA(mis.ind)
> plot(resMCA,invis="ind",title="MCA graph of the categories")

85 / 92

An ecological data set

What about mean imputation?

[Figure: individuals factor map (PCA), Dim 1 (44.79%) against Dim 2 (23.50%), with points colored by biome (alpine, boreal, desert, grass/m, temp_for, temp_rf, trop_for, trop_rf, tundra, wland).]

86 / 92

An ecological data set

[Figure: individuals factor map (PCA), Dim 1 (91.18%) against Dim 2 (4.97%), with points colored by biome (alpine, boreal, desert, grass/m, temp_for, temp_rf, trop_for, trop_rf, tundra, wland); a second individuals factor map on the same dimensions showing only the biome categories; variables factor map (PCA) on the same dimensions, showing LL, LMA, Nmass, Pmass, Amass, Rmass.]

> library(missMDA)
> nb <- estim_ncpPCA(don,method.cv="Kfold",nbsim=100)
> res.comp <- imputePCA(don,ncp=2)
> imp <- cbind.data.frame(res.comp$completeObs,tab.init[,1:4])
> res.pca <- PCA(imp,quanti.sup=1,quali.sup=12)
> plot(res.pca, hab=12, lab="quali"); plot(res.pca, choix="var")
> res.pca$ind$coord #scores (principal components)
> res.MIPCA <- MIPCA(don,ncp=2)
> plot(res.MIPCA,choice= "ind.supp"); plot(res.MIPCA,choice= "var")

87 / 92

Expectation - Maximization (Dempster et al., 1977)

Requires modifying the estimation procedure (not always easy!)

Rationale: obtain the ML estimates from the observed values, i.e. maximize L_{obs}, by maximizing the complete-data likelihood L_{comp} of X = (X_{obs}, X_{miss}). Augment the data to simplify the problem.

E step (conditional expectation):
Q(\theta, \theta^{\ell}) = \int \ln(f(X \mid \theta)) \, f(X_{miss} \mid X_{obs}, \theta^{\ell}) \, dX_{miss}

M step (maximization):
\theta^{\ell+1} = \arg\max_{\theta} Q(\theta, \theta^{\ell})

Result: when \theta^{\ell+1} maximizes Q(\theta, \theta^{\ell}), then L(X_{obs}, \theta^{\ell+1}) \geq L(X_{obs}, \theta^{\ell})

89 / 92
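As an illustration of the E and M steps, here is a base R sketch of EM for the mean and covariance of a multivariate normal with missing values. In the E step each missing block is replaced by its conditional expectation given the observed block, and the conditional covariance is added to the second-moment sum; the M step recomputes the mean and covariance from these expected sufficient statistics. The function name em_mvnorm and the simulated data are assumptions made for the example, and the sketch assumes every row has at least one observed value; it is not the code of any package.

```r
# EM for (mu, Sigma) of a multivariate normal with missing values (sketch).
# Assumes X is a numeric matrix and every row has at least one observed value.
em_mvnorm <- function(X, maxit = 500, tol = 1e-8) {
  stopifnot(is.matrix(X), all(rowSums(!is.na(X)) > 0))
  n <- nrow(X); p <- ncol(X)
  mu <- colMeans(X, na.rm = TRUE)
  Sigma <- diag(apply(X, 2, var, na.rm = TRUE), p, p)
  for (it in seq_len(maxit)) {
    S1 <- numeric(p)         # sum over rows of E[x_i | x_i,obs]
    S2 <- matrix(0, p, p)    # sum over rows of E[x_i x_i' | x_i,obs]
    for (i in seq_len(n)) {
      xi <- X[i, ]
      m <- is.na(xi); o <- !m
      C <- matrix(0, p, p)   # conditional covariance, nonzero on the missing block only
      if (any(m)) {
        B <- Sigma[m, o, drop = FALSE] %*% solve(Sigma[o, o, drop = FALSE])
        xi[m] <- mu[m] + drop(B %*% (xi[o] - mu[o]))
        C[m, m] <- Sigma[m, m, drop = FALSE] - B %*% Sigma[o, m, drop = FALSE]
      }
      S1 <- S1 + xi
      S2 <- S2 + tcrossprod(xi) + C
    }
    # M step: ML estimates from the expected sufficient statistics
    mu_new <- S1 / n
    Sigma_new <- S2 / n - tcrossprod(mu_new)
    delta <- max(abs(mu_new - mu), abs(Sigma_new - Sigma))
    mu <- mu_new; Sigma <- Sigma_new
    if (delta < tol) break
  }
  list(mu = mu, Sigma = Sigma)
}

# Simulated example: x1 fully observed, x2 and x3 partly missing
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- -0.5 * x1 + rnorm(n, sd = 0.9)
X_full <- cbind(x1, x2, x3)
X <- X_full
X[sample(n, 60), 2] <- NA
X[sample(n, 40), 3] <- NA

fit <- em_mvnorm(X)            # estimates from the incomplete data
fit_full <- em_mvnorm(X_full)  # without missing values: the usual ML estimates
print(fit$mu)
print(fit$Sigma)
```

With complete data the loop stops after two passes and returns the column means and the ML covariance (divisor n), which gives a quick sanity check of the update formulas.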

Maximum likelihood approach

Hypothesis: x_{i.} \sim N(\mu, \Sigma)

⇒ Point estimates with EM:

> library(norm)
> pre <- prelim.norm(as.matrix(don))
> thetahat <- em.norm(pre)
> getparam.norm(pre,thetahat)

⇒ Variances:
• Supplemented EM (Meng, 1991)
• Bootstrap approach:
  • Bootstrap rows: X^1, ... , X^B
  • EM algorithm: (\hat{\mu}^1, \hat{\Sigma}^1), ... , (\hat{\mu}^B, \hat{\Sigma}^B)

Issue: a specific method has to be developed for each statistical method

90 / 92

MI using the loglinear model

• Hypothesis X = (x_{ijk})_{i,j,k}: X \mid \theta \sim M(n, \theta) where:
  \log(\theta_{ijk}) = \lambda_0 + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk} + \lambda^{ABC}_{ijk}

1 Variability of the parameters
  • prior on \theta: \theta \mid \theta \in \Theta \sim D(\alpha)
  • posterior: \theta \mid x, \theta \in \Theta \sim D(\alpha')
  • Data Augmentation (M.A. Tanner, W.H. Wong, 1987)
2 Imputation according to the loglinear model using the set of M parameters

• Implemented: R package cat (J.L. Schafer)

91 / 92
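The two numbered steps can be sketched with a small data augmentation loop in base R. The example below is deliberately simplified: it uses only two categorical variables and the saturated model (all interaction terms, so the cell probabilities have an unconstrained Dirichlet posterior), a flat prior, and made-up data in which A is fully observed and B is partly missing. The helper rdirichlet1 and the burn-in and thinning values are assumptions of the sketch, and none of this is the code of the cat package.

```r
set.seed(123)

# One draw from a Dirichlet(alpha) distribution via normalised gamma variables
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

# Toy categorical data (made up): A has 2 levels, B has 3 levels and depends on A
n <- 300
A <- sample(1:2, n, replace = TRUE)
B <- ifelse(A == 1,
            sample(1:3, n, replace = TRUE, prob = c(.6, .3, .1)),
            sample(1:3, n, replace = TRUE, prob = c(.1, .3, .6)))
B_obs <- B
B_obs[sample(n, 90)] <- NA
miss <- is.na(B_obs)

alpha <- matrix(1, 2, 3)       # flat Dirichlet prior on the 2 x 3 cell probabilities
B_cur <- B_obs
B_cur[miss] <- sample(1:3, sum(miss), replace = TRUE)   # crude starting values

M <- 5; burn <- 50; thin <- 10   # arbitrary settings for the illustration
imputations <- vector("list", M)
for (it in seq_len(burn + M * thin)) {
  # P-step: draw theta from its Dirichlet posterior given the completed table
  counts <- table(factor(A, 1:2), factor(B_cur, 1:3))
  theta <- matrix(rdirichlet1(as.vector(alpha + counts)), 2, 3)
  # I-step: draw each missing B from P(B | A, theta)
  for (i in which(miss)) B_cur[i] <- sample(1:3, 1, prob = theta[A[i], ])
  # Keep every thin-th completed data set after the burn-in
  if (it > burn && (it - burn) %% thin == 0) imputations[[(it - burn) / thin]] <- B_cur
}

# Cross-tabulation of A and B in the first imputed data set
print(table(A, imputations[[1]]))
```

Each of the M stored vectors is one imputed version of B, drawn with a different parameter value, so the spread across them reflects parameter uncertainty as well as sampling noise.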

MI using a DPMPM model (Si and Reiter, 2013)

• Hypothesis: P(X = (x_1, \ldots, x_K); \theta) = \sum_{\ell=1}^{L} \theta_\ell \prod_{k=1}^{K} \theta^{(\ell)}_{x_k}

1 Variability of the parameters:
  • a hierarchical prior on \theta: \alpha \sim G(.25, .25), \zeta_\ell \sim B(1, \alpha), \theta_\ell = \zeta_\ell \prod_{g<\ell} (1 - \zeta_g) for \ell in 1, \ldots, \infty
  • posterior on \theta: intractable → Gibbs sampler and Data Augmentation
2 Imputation according to the mixture model using the set of M parameters

• Implemented: R package mi (Gelman et al.)

92 / 92
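The stick-breaking construction of the mixture weights in the prior above is easy to simulate. The base R sketch below draws one set of weights from the prior, truncated at L = 20 classes; the truncation level, the shape/rate reading of G(.25, .25) and the object names are assumptions made for the illustration, and the sketch covers only the prior on the weights, not the Gibbs sampler.

```r
set.seed(2013)

L <- 20   # truncation level (assumption; the prior itself has infinitely many classes)

# alpha ~ G(.25, .25), read here as shape = 0.25 and rate = 0.25 (assumption)
alpha <- rgamma(1, shape = 0.25, rate = 0.25)

# Stick-breaking proportions zeta_l ~ B(1, alpha)
zeta <- rbeta(L, 1, alpha)
zeta[L] <- 1   # last break takes the remaining stick so the truncated weights sum to one

# theta_l = zeta_l * prod_{g < l} (1 - zeta_g)
theta <- zeta * cumprod(c(1, 1 - zeta[-L]))

# Latent class membership for 10 hypothetical individuals
z <- sample(L, 10, replace = TRUE, prob = theta)

print(round(theta, 4))
print(z)
```

Small values of alpha concentrate almost all the weight on the first few classes, which is how the prior lets the data decide how many mixture components are effectively used.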
