How to perform principal components methods despite missing values and how it can help to handle missing values...
Introduction Point estimates Confidence Areas MCA/MFA SI for mixed var. Multiple imputation Practice Appendix
Missing values and principal components methods
Julie Josse
Stanford Stat 300, July 2015
1 / 92
Outline
1 Introduction
2 Point estimates of the PCA axes and components
3 Uncertainty
4 MCA/MFA
5 Single imputation for mixed variables
6 Multiple imputation
7 Practice
8 Appendix
Missing values
Gertrude Mary Cox: “The best thing to do with missing values is not to have any”
Missing values are ubiquitous:
• no answer in a questionnaire
• data that are lost or destroyed
• machines that fail
• plants damaged
• ...
Still an issue in the "big data" era
Some references
Schafer (1997) Little & Rubin (1987, 2002)
Joseph L. Schafer Roderick Little Donald Rubin
Suggested reading: chap 25 of Gelman & Hill (2006)
Andrew Gelman Jennifer L. Hill
The missing values problem
A very simple approach: deletion (the default of the lm function in R)
Dealing with missing values depends on:
• the pattern of missing values
• the mechanism leading to missing values
  • MCAR: the probability of being missing does not depend on any values
  • MAR: the probability may depend on the observed values of other variables
  • MNAR: the probability depends on the missing value itself
  (Ex: Income - Age)
⇒ Inspect and visualize the missing data
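The three mechanisms can be illustrated on the Income - Age example with a small simulation. This is only a sketch: the sample size, the income model, and the logistic missingness probabilities are assumptions chosen for illustration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Hypothetical population: income increases with age (illustrative numbers only).
age = rng.uniform(20, 70, size=n)
income = 20_000 + 600 * (age - 20) + rng.normal(0, 5_000, size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Probability that income is missing under each mechanism.
prob_missing = {
    "MCAR": np.full(n, 0.3),                                   # constant
    "MAR": sigmoid((age - 45) / 8),                            # depends on observed age
    "MNAR": sigmoid((income - income.mean()) / income.std()),  # depends on income itself
}

observed_mean = {}
for name, p in prob_missing.items():
    is_missing = rng.uniform(size=n) < p
    # Complete-case (deletion) estimate of the mean income.
    observed_mean[name] = income[~is_missing].mean()

print("true mean:", round(income.mean()))
for name, value in observed_mean.items():
    print(name, "complete-case mean:", round(value))
```

Under these assumptions, deletion leaves the mean roughly unbiased only in the MCAR case. In the MAR and MNAR cases the higher incomes are missing more often, so the complete-case mean is too low.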
Single imputation methods
[Figure: three scatter plots of Y against X, one per method: Mean imputation, Regression imputation, Stochastic regression imputation.]

True value      Mean imputation   Regression imputation   Stochastic regression imputation
µy = 0          0.01              0.01                    0.01
σy = 1          0.5               0.72                    0.99
ρ = 0.6         0.30              0.78                    0.59
CIµy 95%        39.4              61.6                    70.8

⇒ Standard errors of the parameters ($\hat\sigma_{\hat\mu_y}$) calculated from the imputed data set are underestimated
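The comparison of the three single imputation methods can be reproduced in spirit with a short simulation. This is a sketch under stated assumptions: the sample size, the 70% MCAR missingness on Y, and the random seed are illustrative choices, so the numbers will not match the table exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 1000, 0.6
# Bivariate Gaussian with mu_y = 0, sigma_y = 1, correlation rho.
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)

# Assumption: about 70% of y is missing completely at random.
missing = rng.uniform(size=n) < 0.7
observed = ~missing

# Regression of y on x fitted on the complete cases.
slope, intercept = np.polyfit(x[observed], y[observed], deg=1)
residual_sd = np.std(y[observed] - (intercept + slope * x[observed]), ddof=2)

imputations = {
    # Every missing y replaced by the observed mean.
    "mean": np.where(missing, y[observed].mean(), y),
    # Every missing y replaced by its regression prediction.
    "regression": np.where(missing, intercept + slope * x, y),
    # Regression prediction plus Gaussian noise with the residual standard deviation.
    "stochastic": np.where(
        missing, intercept + slope * x + rng.normal(0, residual_sd, size=n), y
    ),
}

results = {
    name: (y_imp.mean(), y_imp.std(ddof=1), np.corrcoef(x, y_imp)[0, 1])
    for name, y_imp in imputations.items()
}
for name, (mu, sd, corr) in results.items():
    print(f"{name:>10}: mean={mu:.2f} sd={sd:.2f} corr={corr:.2f}")
```

Mean imputation shrinks both the standard deviation and the correlation, regression imputation shrinks the standard deviation but inflates the correlation, and stochastic regression imputation roughly preserves both. None of the three accounts for the uncertainty of the imputed values.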
Recommended methods
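⇒ Multiple imputation (Rubin, 1987)
• Generate M plausible values for each missing value
[Figure: the fitted matrix $(\hat F \hat u')_{ij}$ and the imputed matrices $(\hat F \hat u')^1_{ij} + \varepsilon^1_{ij}$, $(\hat F \hat u')^2_{ij} + \varepsilon^2_{ij}$, $(\hat F \hat u')^3_{ij} + \varepsilon^3_{ij}$, ..., $(\hat F \hat u')^B_{ij} + \varepsilon^B_{ij}$]
• Perform the analysis on each imputed data set: $\hat\theta_m$, $\mathrm{Var}(\hat\theta_m)$
• Combine the results: $\hat\theta = \frac{1}{M} \sum_{m=1}^{M} \hat\theta_m$
$T = \frac{1}{M} \sum_{m=1}^{M} \mathrm{Var}(\hat\theta_m) + \left(1 + \frac{1}{M}\right) \frac{1}{M-1} \sum_{m=1}^{M} \left(\hat\theta_m - \hat\theta\right)^2$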
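The combining step can be written in a few lines. This is a minimal sketch of the pooling formulas for a scalar parameter; the function name `pool_rubin` and the example numbers are illustrative assumptions.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M point estimates and their variances (scalar parameter)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    theta_bar = estimates.mean()        # pooled point estimate
    within = variances.mean()           # within-imputation variance
    between = estimates.var(ddof=1)     # between-imputation variance
    total = within + (1 + 1 / m) * between
    return theta_bar, total

# Hypothetical results from M = 3 imputed data sets.
theta_hat, total_var = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(theta_hat, total_var)
```

The total variance is larger than the average within-imputation variance as soon as the estimates differ across imputed data sets, which is how the uncertainty due to missing values enters.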
⇒ Maximum likelihood: EM algorithm (Dempster et al., 1977) to
obtain point estimates + other algorithms for their variability
One specific algorithm for each statistical method
⇒ Common aim: provide estimates of the parameters and of their
variability (taking into account the variability due to missing values)
2 Point estimates of the PCA axes and components
PCA reconstruction
Data $X$                  Reconstruction $\hat X$
   x1      x2               x1      x2
 -2.00   -2.74            -2.16   -2.58
 -1.56   -0.77            -0.96   -1.35
 -1.11   -1.59            -1.15   -1.55
 -0.67   -1.13            -0.70   -1.09
 -0.22   -1.22            -0.53   -0.92
  0.22   -0.52             0.04   -0.34
  0.67    1.46             1.24    0.89
  1.11    0.63             1.05    0.69
  1.56    1.10             1.50    1.15
  2.00    1.00             1.67    1.33
[Figure: scatter plot of x2 against x1 (both axes from -3 to 3) showing the observations and their reconstructed positions.]
$\hat X = F V'$
Minimizes the reconstruction error
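⇒ Minimize the distance between observations and their projection
⇒ Approximation of X with a low rank matrix, S < p:
$\| X_{n \times p} - \hat X_{n \times p} \|^2$
SVD: $\hat X^{PCA} = U_{n \times S} \Lambda^{1/2}_{S \times S} V'_{p \times S} = F_{n \times S} V'_{p \times S}$
$F = U \Lambda^{1/2}$: PC - scores
$V$: principal axes - loadings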
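The low-rank reconstruction can be sketched with a truncated SVD. The snippet below is an illustration rather than the code behind the slides: it centers the columns before the SVD (an assumption) and uses the ten observations listed under "PCA reconstruction" as example input.

```python
import numpy as np

X = np.array([
    [-2.00, -2.74], [-1.56, -0.77], [-1.11, -1.59], [-0.67, -1.13],
    [-0.22, -1.22], [0.22, -0.52], [0.67, 1.46], [1.11, 0.63],
    [1.56, 1.10], [2.00, 1.00],
])

def pca_reconstruct(X, S):
    """Rank-S approximation of the centered data, with the means added back."""
    mean = X.mean(axis=0)
    U, d, Vt = np.linalg.svd(X - mean, full_matrices=False)
    F = U[:, :S] * d[:S]   # scores F = U Lambda^{1/2}
    V = Vt[:S].T           # loadings (principal axes)
    return F @ V.T + mean, F, V, d

X_hat, F, V, d = pca_reconstruct(X, S=1)
# Reconstruction error equals the sum of the discarded squared singular values.
error = np.sum((X - X_hat) ** 2)
print(np.round(X_hat, 2))
print(error)
```

Keeping S = 1 dimension places every reconstructed point on the first principal axis, which is the projection the criterion above minimizes the distance to.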
Missing values in PCA
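⇒ PCA: least squares
$\| X_{n \times p} - U_{n \times S} \Lambda^{1/2}_{S \times S} V'_{p \times S} \|^2$
⇒ PCA with missing values: weighted least squares
$\| W_{n \times p} * (X_{n \times p} - U_{n \times S} \Lambda^{1/2}_{S \times S} V'_{p \times S}) \|^2$
with $w_{ij} = 0$ if $x_{ij}$ is missing, $w_{ij} = 1$ otherwise ($*$ is the elementwise product)
Many algorithms: weighted alternating least squares (Gabriel &
Zamir, 1979); iterative PCA (Kiers, 1997)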
Weighted least squares
⇒ Rank 1: $\sum_{i=1}^{n} \sum_{j=1}^{p} (x_{ij} - F_{i1} V_{j1})^2$
2 simple regressions: $V_{j1} = \frac{\sum_i x_{ij} F_{i1}}{\sum_i F_{i1}^2}$, $F_{i1} = \frac{\sum_j x_{ij} V_{j1}}{\sum_j V_{j1}^2}$
Power method. Deflation: $(F_2, V_2)$ in $\hat\varepsilon_1 = X - F_1 V_1'$
NIPALS (Non linear Iterative PArtial Least Squares, Wold,
Christofferson, 1966, 1969): $V_{j1} = \frac{\sum_i w_{ij} x_{ij} F_{i1}}{\sum_i w_{ij} F_{i1}^2}$; $F_{i1} = \frac{\sum_j w_{ij} x_{ij} V_{j1}}{\sum_j w_{ij} V_{j1}^2}$
⇒ Subspace S > 1:
2 multiple regressions: $V = X' F (F'F)^{-1}$; $F = X V (V'V)^{-1}$
2 multiple weighted regressions
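The two weighted simple regressions of the rank 1 case can be alternated as follows. This is a sketch in the spirit of NIPALS, not a reference implementation: the starting value, the normalization of V, the stopping rule, and the absence of centering are assumptions, and every row and column is assumed to have at least one observed entry.

```python
import numpy as np

def weighted_rank1(X, n_iter=500, tol=1e-12):
    """Rank 1 fit F V' using only the observed entries of X (NaN = missing)."""
    W = (~np.isnan(X)).astype(float)      # w_ij = 0 if missing, 1 otherwise
    X0 = np.where(W == 1.0, X, 0.0)       # missing entries get zero weight
    F = X0[:, 0].copy()                   # assumed start: first column
    for _ in range(n_iter):
        # V_j = sum_i w_ij x_ij F_i / sum_i w_ij F_i^2
        V = (X0.T @ F) / (W.T @ F**2)
        V /= np.linalg.norm(V)
        # F_i = sum_j w_ij x_ij V_j / sum_j w_ij V_j^2
        F_new = (X0 @ V) / (W @ V**2)
        converged = np.max(np.abs(F_new - F)) < tol
        F = F_new
        if converged:
            break
    return F, V

X = np.array([[-2.0, -2.01], [-1.5, -1.48], [0.0, -0.01],
              [1.5, np.nan], [2.0, 1.98]])
F, V = weighted_rank1(X)
fitted = np.outer(F, V)
print(np.round(fitted, 2))   # the entry at row 4, column 2 is a prediction
```

Because the missing cell has weight zero, it plays no role in the fit. Its fitted value is a by-product that can be used as an imputation.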
Iterative PCA
[Figure: scatter plot of x2 against x1 (both axes from -2 to 3), updated at each step.]
Incomplete data:
   x1     x2
 -2.0   -2.01
 -1.5   -1.48
  0.0   -0.01
  1.5     NA
  2.0    1.98
Initialization $\ell = 0$: $X^0$ (mean imputation); the missing value is set to 0.00
PCA on the completed data set → $(U^\ell, \Lambda^\ell, V^\ell)$; the fitted matrix is
   x1     x2
 -1.98  -2.04
 -1.44  -1.56
  0.15  -0.18
  1.00   0.57
  2.27   1.67
Missing values imputed with the model matrix $\hat X^\ell = U^\ell (\Lambda^\ell)^{1/2} V^{\ell\prime}$: the missing value becomes 0.57
The new imputed dataset is $X^\ell = W * X + (1 - W) * \hat X^\ell$
Next iteration: the fitted matrix is
   x1     x2
 -2.00  -2.01
 -1.47  -1.52
  0.09  -0.11
  1.20   0.90
  2.18   1.78
and the missing value becomes 0.90
Steps are repeated until convergence: the missing value is finally imputed by 1.46
Iterative PCA
1 initialization $\ell = 0$: $X^0$ (mean imputation)
2 step $\ell$:
(a) PCA on the completed data set → $(U^\ell, \Lambda^\ell, V^\ell)$;
S dimensions are kept
(b) missing values imputed with $\hat X^\ell = U^\ell (\Lambda^\ell)^{1/2} V^{\ell\prime}$;
the new imputed dataset is $X^\ell = W * X + (1 - W) * \hat X^\ell$
(c) means (and standard deviations) are updated
3 steps of estimation and imputation are repeated until convergence
⇒ EM algorithm of the fixed effect model (Caussinus, 1986)
$x_{ij} = \sum_{s=1}^{S} \sqrt{\lambda_s} U_{is} V_{js} + \varepsilon_{ij}$, $\varepsilon_{ij} \sim N(0, \sigma^2)$
⇒ Imputation (matrix completion framework, Netflix)
⇒ Reduction of the variability (imputation by $U \Lambda^{1/2} V'$)
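The steps above can be sketched as a short loop. This is a minimal illustration, assuming centering only (no scaling), a fixed number of dimensions S, and a simple stopping rule; it is not the implementation used for the slides, so intermediate values need not match the worked example.

```python
import numpy as np

def iterative_pca(X, S, max_iter=1000, tol=1e-10):
    """Impute the NaN entries of X by iterating PCA fits of rank S."""
    missing = np.isnan(X)
    # Initialization: mean imputation.
    X_imp = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        # (c) means are updated from the current completed data set.
        mean = X_imp.mean(axis=0)
        # (a) PCA (via SVD) on the completed data set, S dimensions kept.
        U, d, Vt = np.linalg.svd(X_imp - mean, full_matrices=False)
        X_hat = (U[:, :S] * d[:S]) @ Vt[:S] + mean
        # (b) observed values are kept, missing values take the fitted values.
        X_new = np.where(missing, X_hat, X)
        converged = np.max(np.abs(X_new - X_imp)) < tol
        X_imp = X_new
        if converged:
            break
    return X_imp

X = np.array([[-2.0, -2.01], [-1.5, -1.48], [0.0, -0.01],
              [1.5, np.nan], [2.0, 1.98]])
X_completed = iterative_pca(X, S=1)
print(np.round(X_completed, 2))
```

On this toy data set the imputed value ends up on the line fitted through the four complete observations, close to the 1.46 reported in the worked example.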
Overfitting
$X_{41 \times 6} = F_{41 \times 2} V_{2 \times 6} + N(0, 0.5)$ ⇒ 50% of NA
[Figure: two individuals maps with points labelled by name (SEBRLE, CLAY, KARPOV, ...). Left: "ACP sur données complètes" (PCA on the complete data), Dim 1 (55.09%), Dim 2 (27.91%). Right: "ACP itérative" (iterative PCA), Dim 1 (63.97%), Dim 2 (31.9%).]
⇒ fitting error is low: $\| W * (X - \hat X) \|^2 = 0.48$
⇒ prediction error is high: $\| (1 - W) * (X - \hat X) \|^2 = 5.58$
Overfitting
Overfitting when:
• there are many parameters relative to the number of observed values (the
number of dimensions S and the number of missing values are large)
• data are very noisy
⇒ The relationships between variables are trusted too much
Remarks:
• missing values: special case of small data set
• iterative PCA: prediction method
Solution:
⇒ Shrinkage methods
Regularized iterative PCA (Josse et al., 2009)
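⇒ Initialization - estimation step - imputation step
The imputation step:
$\hat x^{PCA}_{ij} = \sum_{s=1}^{S} \sqrt{\lambda_s} U_{is} V_{js}$
is replaced by a "shrunk" imputation step (Efron & Morris 1972):
$\hat x^{rPCA}_{ij} = \sum_{s=1}^{S} \left( \frac{\lambda_s - \hat\sigma^2}{\lambda_s} \right) \sqrt{\lambda_s} U_{is} V_{js} = \sum_{s=1}^{S} \left( \sqrt{\lambda_s} - \frac{\hat\sigma^2}{\sqrt{\lambda_s}} \right) U_{is} V_{js}$
$\hat\sigma^2 = \frac{RSS}{df} = \frac{n \sum_{s=S+1}^{q} \lambda_s}{np - p - nS - pS + S^2 + S}$ $(X_{n \times p}; U_{n \times S}; V_{p \times S})$
Between hard/soft thresholding (Mazumder, Hastie & Tibshirani, 2010)
$\sigma^2$ small → regularized PCA ≈ PCA
$\sigma^2$ large → mean imputation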
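The shrunk reconstruction can be sketched on a complete matrix as follows. Several choices here are assumptions made for illustration: the eigenvalues are taken as $\lambda_s = d_s^2 / n$ with $d_s$ the singular values of the centered data, the shrinkage factor is clipped to [0, 1], and the simulated data use fixed, hypothetical loadings.

```python
import numpy as np

def regularized_reconstruction(X, S):
    """Rank-S PCA fit and its shrunk version (complete data, centered)."""
    n, p = X.shape
    mean = X.mean(axis=0)
    U, d, Vt = np.linalg.svd(X - mean, full_matrices=False)
    lam = d**2 / n                                 # assumed eigenvalue convention
    df = n * p - p - n * S - p * S + S**2 + S      # denominator of sigma^2 hat
    sigma2 = n * lam[S:].sum() / df                # residual variance estimate
    shrink = np.clip((lam[:S] - sigma2) / lam[:S], 0.0, 1.0)
    X_pca = (U[:, :S] * d[:S]) @ Vt[:S] + mean
    X_reg = (U[:, :S] * (d[:S] * shrink)) @ Vt[:S] + mean
    return X_pca, X_reg, sigma2, shrink

# Simulated example: 41 x 6 matrix of rank 2 plus noise (hypothetical loadings).
rng = np.random.default_rng(0)
F = rng.normal(size=(41, 2))
V = np.array([[1.0, 1, 1, 1, 1, 1], [1.0, -1, 1, -1, 1, -1]])
X = F @ V + rng.normal(0, 0.5, size=(41, 6))

X_pca, X_reg, sigma2, shrink = regularized_reconstruction(X, S=2)
print("sigma2 estimate:", round(sigma2, 3), "shrinkage factors:", np.round(shrink, 3))
```

Inside the regularized iterative algorithm, `X_reg` would replace the plain PCA fit in the imputation step. The factors are close to 1 when the noise is small and move towards 0 (mean imputation) when the noise dominates.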
Regularized iterative PCA
$X_{41 \times 6} = F_{41 \times 2} V_{2 \times 6} + N(0, 0.5)$ ⇒ 50% of NA
[Figure: two individuals maps with points labelled by name (SEBRLE, CLAY, KARPOV, ...). Left: "ACP sur données complètes" (PCA on the complete data), Dim 1 (55.09%), Dim 2 (27.91%). Right: "ACP régularisée" (regularized PCA), Dim 1 (64.27%), Dim 2 (30.72%).]
⇒ fitting error: $\| W * (X - \hat X) \|^2 = 0.52$ (EM = 0.48)
⇒ prediction error: $\| (1 - W) * (X - \hat X) \|^2 = 0.67$ (EM = 5.58)
Properties
⇒ Quality of estimation of the parameters:
Simulation: $X = F V' + \varepsilon$
RV coefficient between the results obtained from the complete and the incomplete data
• performance decreases as the proportion of missing values and the level of noise increase
• difficult settings: regularized PCA is equivalent to mean imputation
• the choice of the number of dimensions is less crucial
⇒ Quality of imputation:
• Good when the structure is strong (imputation uses similarities
between individuals and relationship between variables)
• Competitive with random forests
A real dataset
O3 T9 T12 T15 Ne9 Ne12 Ne15 Vx9 Vx12 Vx15 O3v
0601 NA 15.6 18.5 18.4 4 4 8 NA -1.7101 -0.6946 84
0602 82 17 18.4 17.7 5 5 7 NA NA NA 87
0603 92 NA 17.6 19.5 2 5 4 2.9544 1.8794 0.5209 82
0604 114 16.2 NA NA 1 1 0 NA NA NA 92
0605 94 17.4 20.5 NA 8 8 7 -0.5 NA -4.3301 114
0606 80 17.7 NA 18.3 NA NA NA -5.6382 -5 -6 94
0607 NA 16.8 15.6 14.9 7 8 8 -4.3301 -1.8794 -3.7588 80
0610 79 14.9 17.5 18.9 5 5 4 0 -1.0419 -1.3892 NA
0611 101 NA 19.6 21.4 2 4 4 -0.766 NA -2.2981 79
0612 NA 18.3 21.9 22.9 5 6 8 1.2856 -2.2981 -3.9392 101
0613 101 17.3 19.3 20.2 NA NA NA -1.5 -1.5 -0.8682 NA
...
0919 NA 14.8 16.3 15.9 7 7 7 -4.3301 -6.0622 -5.1962 42
0920 71 15.5 18 17.4 7 7 6 -3.9392 -3.0642 0 NA
0921 96 NA NA NA 3 3 3 NA NA NA 71
0922 98 NA NA NA 2 2 2 4 5 4.3301 96
0923 92 14.7 17.6 18.2 1 4 6 5.1962 5.1423 3.5 98
0924 NA 13.3 17.7 17.7 NA NA NA -0.9397 -0.766 -0.5 92
0925 84 13.3 17.7 17.8 3 5 6 0 -1 -1.2856 NA
0927 NA 16.2 20.8 22.1 6 5 5 -0.6946 -2 -1.3681 71
0928 99 16.9 23 22.6 NA 4 7 1.5 0.8682 0.8682 NA
0929 NA 16.9 19.8 22.1 6 5 3 -4 -3.7588 -4 99
0930 70 15.7 18.6 20.7 NA NA NA 0 -1.0419 -4 NA
PCA on the incomplete data
[Figure, left panel: "Individuals factor map (PCA)", Dim 1 (57.47%), Dim 2 (21.34%), with the categories East, North, West, South. Right panel: "Variables factor map (PCA)", Dim 1 (55.85%), Dim 2 (21.73%), with the variables T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15, maxO3v, maxO3.]
3 Uncertainty
Uncertainty in the incomplete case
⇒ A new source of variability to take into account
• less data: more uncertainty
• iterative PCA: single imputation → a residual bootstrap on the
completed data underestimates the variability
⇒ Multiple imputation
1 Generating B imputed data sets: for b = 1, ..., B,
missing values $x^b_{ij}$ are drawn from the predictive distribution $N((FV')_{ij}, \hat\sigma^2)$
⇒ "improper" imputation
2 Performing the analysis on each imputed data set
3 Combining: variance = within + between imputation variance
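Step 1 of the "improper" procedure can be sketched as below: the fitted matrix is held fixed and only Gaussian noise is added to the missing cells. The function name, the value of sigma, and the example fitted matrix (which reuses the converged value 1.46 from the iterative PCA example) are illustrative assumptions.

```python
import numpy as np

def improper_multiple_imputation(X, X_hat, sigma, B, rng):
    """Return B completed copies of X; missing cells ~ N(X_hat, sigma^2)."""
    missing = np.isnan(X)
    imputed_sets = []
    for _ in range(B):
        noise = rng.normal(0.0, sigma, size=X.shape)
        # Observed cells are kept; missing cells get fitted value plus noise.
        imputed_sets.append(np.where(missing, X_hat + noise, X))
    return imputed_sets

X = np.array([[-2.0, -2.01], [-1.5, -1.48], [0.0, -0.01],
              [1.5, np.nan], [2.0, 1.98]])
# Hypothetical fitted matrix: only its value in the missing cell is used.
X_hat = X.copy()
X_hat[3, 1] = 1.46

rng = np.random.default_rng(0)
draws = improper_multiple_imputation(X, X_hat, sigma=0.1, B=5, rng=rng)
imputed_values = np.array([d[3, 1] for d in draws])
print(np.round(imputed_values, 2))
```

The draws differ only through the noise term. The uncertainty about the fitted matrix itself is ignored, which is why this version is called improper.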
“proper” multiple imputation
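1 Variability of the parameters: obtaining B plausible sets of
parameters, $(F, V)^1, ..., (F, V)^B$ ⇒ bootstrap/Bayesian
2 Noise: for b = 1, ..., B, missing values $x^b_{ij}$ are imputed by
drawing from the predictive distribution $N((FV')^b_{ij}, \hat\sigma^2)$
[Figure: the fitted matrix $(\hat F \hat U')_{ik}$ and the imputed matrices $(\hat F \hat U')^1_{ik} + \varepsilon^1_{ik}$, $(\hat F \hat U')^2_{ik} + \varepsilon^2_{ik}$, $(\hat F \hat U')^3_{ik} + \varepsilon^3_{ik}$, ..., $(\hat F \hat U')^B_{ik} + \varepsilon^B_{ik}$]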
Supplementary projection
⇒ Positions of the individuals (and variables) with the other predictions
[Figure labels: Regularized iterative PCA, PCA ⇒ reference configuration; supplementary projection]
Multiple imputation in practice
[Figure, left panel: "Supplementary projection", individuals numbered 1 to 112, Dim 1 (57.20%), Dim 2 (20.27%). Right panel: "Variable representation", Dim 1 (57.20%), Dim 2 (20.27%), with the variables maxO3, T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15, maxO3v.]
Between imputation variability
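⇒ Influence of the different predictions on the parameters (PCA
on each table)
[Figure: a PCA is performed on each imputed table, giving $(\tilde F \tilde U')^1$, $(\tilde F \tilde U')^2$, $(\tilde F \tilde U')^3$, ..., $(\tilde F \tilde U')^B$; the configurations are then aligned by Procrustean rotation.]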
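The Procrustean rotation step can be sketched with the classical orthogonal Procrustes solution, which aligns one configuration of scores on a reference configuration. This is an illustration only: rotation without translation or scaling is an assumption, and the data are simulated.

```python
import numpy as np

def procrustes_rotation(A, reference):
    """Orthogonal matrix R minimizing ||A R - reference|| (Frobenius norm)."""
    U, _, Vt = np.linalg.svd(A.T @ reference)
    return U @ Vt

rng = np.random.default_rng(0)
# Hypothetical reference configuration: 12 individuals on 2 dimensions.
reference = rng.normal(size=(12, 2))

# A second configuration: the same points seen after a rotation of 40 degrees.
theta = np.deg2rad(40)
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
A = reference @ Q.T

R = procrustes_rotation(A, reference)
aligned = A @ R
print(np.allclose(aligned, reference))
```

Applied to the PCA scores of each imputed table, this removes differences that are only due to rotations or reflections of the axes, so the remaining spread across tables reflects the between-imputation variability.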
Between imputation variability
[Figure: "Multiple imputation using Procrustes", individuals numbered 1 to 12, Dim 1 (71.33%), Dim 2 (17.17%).]
MCA for categorical data
MCA can be seen as the PCA of the triplet (data, metric, row masses)
$\left( I X D_\Sigma^{-1},\ \frac{1}{IJ} D_\Sigma,\ \frac{1}{I} \mathbb{I}_I \right)$
with X the indicator matrix and $D_\Sigma$ the diagonal matrix of the
column margins of X: $D_\Sigma = \mathrm{diag}(I_1, \ldots, I_k, \ldots, I_K)$.
Each row of X sums to J, the column margins are $I_1, \ldots, I_K$ and the grand total is IJ.
Indicator matrix X (entries $x_{ik}$) with missing values:
1 0 0 1 0 0 1 ... 0 1
1 0 0 1 0 1 0 ... NA NA
NA NA NA 0 1 0 0 ... 0 1
1 0 0 1 0 0 1 ... 0 1
0 0 1 NA NA 0 ... 0 1
1 0 0 1 0 0 1 ... 0 1
Regularized iterative MCA (Josse et al., 2012)
• Initialization: imputation of the indicator matrix (proportion)
• Iterate until convergence
1 Estimation of F , V : MCA on the completed indicator matrix
2 Imputation of the missing values with the model matrix
3 Column margins are updated
Incomplete categorical data:
           V1   V2   V3   …   V14
ind 1      a    NA   g    …   u
ind 2      NA   f    g    …   u
ind 3      a    e    h    …   v
ind 4      a    e    h    …   v
ind 5      b    f    h    …   u
ind 6      c    f    h    …   u
ind 7      c    f    NA   …   v
…          …    …    …    …   …
ind 1232   c    f    h    …   v
Imputed indicator matrix:
           V1_a  V1_b  V1_c  V2_e  V2_f  V3_g  V3_h  …
ind 1      1     0     0     0.71  0.29  1     0     …
ind 2      0.12  0.29  0.59  0     1     1     0     …
ind 3      1     0     0     1     0     0     1     …
ind 4      1     0     0     1     0     0     1     …
ind 5      0     1     0     0     1     0     1     …
ind 6      0     0     1     0     1     0     1     …
ind 7      0     0     1     0     1     0.37  0.63  …
…          …     …     …     …     …     …     …     …
ind 1232   0     0     1     0     1     0     1     …
⇒ Imputed values can be seen as degree of membership
⇒ Missing values mask an underlying value
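The loop above can be sketched as follows, assuming NumPy and a toy indicator matrix. This is a hedged illustration, not the missMDA implementation: it uses a plain truncated SVD where the regularized algorithm would shrink the singular values, and the function name `iterative_mca_impute` and its arguments are hypothetical. The constant factor 1/J of the MCA metric is dropped because it does not change the fitted values.

```python
import numpy as np

def iterative_mca_impute(X, n_comp=2, n_iter=200, tol=1e-8):
    """Impute an indicator matrix X (np.nan in missing cells), unregularized sketch."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    n_rows = X.shape[0]
    # Initialization: missing cells receive the observed column proportion
    filled = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        p = filled.mean(axis=0)           # column margins I_k / I, updated each pass
        scale = np.sqrt(n_rows * p)       # MCA weighting of the columns
        Z = (filled - p) / scale
        U, d, Vt = np.linalg.svd(Z, full_matrices=False)
        Z_hat = (U[:, :n_comp] * d[:n_comp]) @ Vt[:n_comp]
        fitted = Z_hat * scale + p        # model matrix, back on the indicator scale
        new = np.where(miss, fitted, filled)
        change = np.sum((new - filled) ** 2)
        filled = new
        if change < tol:
            break
    return filled

# Toy data: two variables (3 and 2 categories) coded as an indicator matrix
X = np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
], dtype=float)
X[1, 0:3] = np.nan   # second individual: first variable missing
X[4, 3:5] = np.nan   # fifth individual: second variable missing
completed = iterative_mca_impute(X, n_comp=1)
```

The imputed entries of one variable sum to one for each individual, which is what allows them to be read as degrees of membership.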
A real example
• 1232 respondents, 14 questions, 35 categories, 9% of missing
values concerning 42% of respondents
[Figures: "Missing single: categories" and "Missing single: subjects", Dim 1 (11.74%) vs Dim 2 (8.618%), with one extra NA category per question (Q1.NA, ..., Q14.NA); "Regularized iterative MCA: categories" and "Regularized iterative MCA: subjects", Dim 1 (14.58%) vs Dim 2 (11.21%), with categories Q1.1, ..., Q14.3]
Multi-blocks data set
• Biology: 10 samples without expression data
• Sensory analysis: each judge can’t evaluate more than a
certain number of products (saturation)
Planned missing values (products by judges) from the experimental design: balanced incomplete block (BIB)
⇒ Missing rows per subtable
⇒ Regularized iterative MFA (Husson & Josse, 2013)
Journal impact factors
journalmetrics.com provides 27000 journals/ 15 years of metrics.
443 journals (Computer Science, Statistics, Probability and
Mathematics). 45 metrics, some may be NA, 15 years by 3 types
of measures:
• IPP - Impact Per Publication (like the ISI impact factor but
over 3 rather than 2 years).
• SNIP - Source Normalized Impact Per Paper: Tries to weight
by the number of citations per subject field to adjust for
different citation cultures.
• SJR - SCImago Journal Rank: Tries to capture average
prestige per publication.
MFA with missing values
[Figure: Journals (individual factor map), Dim 1 (74.03%) vs Dim 2 (8.29%); points labelled with journal names, e.g. Annals of Statistics, Biometrika, Journal of Machine Learning Research, Journal of Statistical Software, R Journal]
MFA with missing values
[Figures: two correlation circles, Dim 1 (74.03%) vs Dim 2 (8.29%); one for the variables IPP_1999 to IPP_2013, one for the variables SNIP_1999 to SNIP_2013]
MFA with missing values
[Figure: Individual factor map, Dim 1 (74.03%) vs Dim 2 (8.29%); trajectory of IEEE/ACM Transactions on Networking from year_1999 to year_2013]
Having performed principal component methods despite missing
entries (obtaining the graphical outputs, the principal components
and the axes), we now use these methods as tools for single and multiple
imputation and compare them to state-of-the-art methods.
PC methods are powerful for imputation, since they use the similarities
between rows and the relationships between columns, and require a small
number of parameters (dimensionality reduction).
With single imputation, the aim is to complete a data set as well as
possible (prediction). With multiple imputation, the aim is to
apply other statistical methods afterwards and to estimate parameters
and their variability, taking into account the uncertainty due to
missing values.
Principal component method for mixed data (complete)
Factorial Analysis on Mixed Data (Escofier, 1979), PCAMIX (Kiers, 1991)
[Diagram: the continuous variables (rows 51 100 190; 70 96 196; 38 69 166) are centred and scaled. The categorical variables are coded as an indicator matrix with column margins I1, I2, ..., Ik; its columns undergo division by I/Ik and centring. The result is a matrix which balances the influence of each variable.]
A PCA is performed on the weighted matrix: SVD $\left(X, D_\Sigma^{-1}, \frac{1}{I}\mathbb{I}_I\right)$, with X the
matrix with the continuous variables and the indicator matrix, and $D_\Sigma$ the diagonal
matrix with the standard deviations and the weights $(I_k/I)$.
Properties of the method
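• The distance between individuals is:
$d^2(i, l) = \sum_{k=1}^{K_{cont}} \frac{1}{\sigma_k} (x_{ik} - x_{lk})^2 + \sum_{q=1}^{Q} \sum_{k=1}^{K_q} \frac{1}{I_{k_q}} (x_{iq} - x_{lq})^2$
• The principal component $F_s$ maximises:
$\sum_{k=1}^{K_{cont}} r^2(F_s, v_k) + \sum_{q=1}^{Q_{cat}} \eta^2(F_s, v_q)$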
Iterative FAMD algorithm
1 Initialization: imputation by the mean (continuous) and the proportion (dummy)
2 Iterate until convergence
(a) estimation: FAMD on the completed data ⇒ U, Λ, V
(b) imputation of the missing values with the fitted matrix $\hat{X} = U_S \Lambda_S^{1/2} V_S'$
(c) means, standard deviations and column margins are updated
Incomplete data:
age  weight  size  alcohol   sex  snore  tobacco
NA   100     190   NA        M    yes    no
70   96      186   1-2 gl/d  M    NA     <=1
NA   104     194   No        W    no     NA
62   68      165   1-2 gl/d  M    no     <=1
Incomplete data, categorical variables coded as indicators:
NA  100  190  NA  NA  NA  1  0  0   1   1   0   0
70  96   186  0   1   0   1  0  NA  NA  0   1   0
NA  104  194  1   0   0   0  1  1   0   NA  NA  NA
62  68   165  0   1   0   1  0  1   0   0   1   0
Completed coded matrix (imputeAFDM):
51  100  190  0.2  0.7  0.1  1  0  0    1    1    0    0
70  96   186  0    1    0    1  0  0.8  0.2  0    1    0
48  104  194  1    0    0    0  1  1    0    0.1  0.8  0.1
62  68   165  0    1    0    1  0  1    0    0    1    0
Completed data:
age  weight  size  alcohol   sex  snore  tobacco
51   100     190   1-2 gl/d  M    yes    no
70   96      186   1-2 gl/d  M    no     <=1
48   104     194   No        W    no     <=1
62   68      165   1-2 gl/d  M    no     <=1
⇒ Imputed values can be seen as degrees of membership
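Steps 1 and 2(a) to 2(c) can be sketched for the continuous part of the data, assuming NumPy. This is a simplified illustration: a plain PCA stands in for FAMD (no indicator coding, no category weights, no regularization), and `iterative_pca_impute` and the simulated data are hypothetical names and inputs.

```python
import numpy as np

def iterative_pca_impute(X, n_comp=2, n_iter=500, tol=1e-10):
    """Iterative PCA imputation of a numeric table X with np.nan in missing cells."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)     # step 1: mean imputation
    for _ in range(n_iter):
        mean = filled.mean(axis=0)                         # step (c): updated each pass
        sd = filled.std(axis=0)
        Z = (filled - mean) / sd
        U, d, Vt = np.linalg.svd(Z, full_matrices=False)   # step (a): estimation
        Z_hat = (U[:, :n_comp] * d[:n_comp]) @ Vt[:n_comp]
        fitted = Z_hat * sd + mean                         # step (b): fitted matrix
        new = np.where(miss, fitted, filled)
        change = np.sum((new - filled) ** 2)
        filled = new
        if change < tol:
            break
    return filled

# Simulated example: one strong dimension plus a little noise, 10% of cells removed
rng = np.random.default_rng(0)
t = rng.normal(size=100)
truth = np.outer(t, [1.0, 2.0, -1.0, 0.5]) + [10.0, 0.0, 5.0, -3.0]
truth = truth + 0.05 * rng.normal(size=truth.shape)
X_na = truth.copy()
mask = rng.random(truth.shape) < 0.10
X_na[mask] = np.nan

completed = iterative_pca_impute(X_na, n_comp=1)
mean_filled = np.where(mask, np.nanmean(X_na, axis=0), X_na)
err_pca = np.sqrt(np.mean((completed[mask] - truth[mask]) ** 2))
err_mean = np.sqrt(np.mean((mean_filled[mask] - truth[mask]) ** 2))
```

On data with a strong low-dimensional structure such as this one, the error of the PCA-based imputation (`err_pca`) would be expected to be well below that of mean imputation (`err_mean`).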
Iterative Random Forests imputation
1 Initial imputation: mean imputation - random category
Sort the variables according to the amount of missing values
2 Fit a RF of $X_j^{obs}$ on the variables $X_{-j}^{obs}$ and then predict $X_j^{miss}$
3 Cycling through variables
4 Repeat steps 2 and 3 until convergence
• number of trees: 100
• number of variables randomly selected at each node: $\sqrt{p}$
• number of iterations: 4-5
Implemented in the R package missForest (Daniel J. Stekhoven, Peter
Bühlmann, 2011)
Simulation study
Several data sets
• Relationships between variables
• Number of categories
• percentage of missing values (10%,20%,30%)
Criteria:
• for continuous data: RMSE
• for categorical data: proportion of falsely classified entries
Comparison on real data sets
Imputations obtained with random forest & FAMD algorithm
Summary
Imputations with PC methods are good:
• for strong linear relationships
• for categorical variables
• especially for rare categories (weights of MCA)
⇒ Number of components S? Cross-validation (GCV)
Imputations with RF are good:
• for strong non-linear relationships between continuous
variables (cutting continuous variables into categories)
• when there are interactions (creating interactions)
⇒ No tuning parameters?
Remark: categorical data improve the imputation of continuous data
and continuous data improve the imputation of categorical data
Multiple imputation continuous data: bivariate case
⇒ Proper multiple imputation with yi = xi β + εi
1 Variability of the parameters, M plausible: $(\hat{\beta})^1, \ldots, (\hat{\beta})^M$
⇒ Bootstrap
⇒ Posterior distribution: Data Augmentation (Tanner & Wong, 1987)
2 Noise: for m = 1, ..., M, missing values $y_i^m$ are imputed by
drawing from the predictive distribution $N(x_i \hat{\beta}^m, (\hat{\sigma}^2)^m)$
            Improper  Proper
CI µy 95%   0.818     0.935
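The two steps can be sketched with NumPy on simulated data. This is an illustration under stated assumptions: the bootstrap variant of step 1 is used, "improper" is taken here to mean that step 1 is skipped (the same estimate of β is reused for every imputed table), and `impute_regression` is a hypothetical helper name.

```python
import numpy as np

def impute_regression(x, y, M=20, proper=True, rng=None):
    """Return M completed copies of y (np.nan marks missing) using y = x beta + noise."""
    rng = np.random.default_rng() if rng is None else rng
    obs = ~np.isnan(y)
    idx_obs = np.flatnonzero(obs)
    design = np.column_stack([np.ones_like(x), x])
    completed = []
    for _ in range(M):
        # Step 1: variability of the parameters, here by bootstrapping observed rows
        if proper:
            idx = rng.choice(idx_obs, size=idx_obs.size, replace=True)
        else:
            idx = idx_obs
        beta, *_ = np.linalg.lstsq(design[idx], y[idx], rcond=None)
        resid = y[idx] - design[idx] @ beta
        sigma2 = resid @ resid / (idx.size - 2)
        # Step 2: noise, draw the missing values from the predictive distribution
        y_m = y.copy()
        n_miss = int((~obs).sum())
        y_m[~obs] = design[~obs] @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n_miss)
        completed.append(y_m)
    return np.array(completed)

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)
y[rng.random(100) < 0.4] = np.nan      # about 40% of y missing completely at random
imputed = impute_regression(x, y, M=20, proper=True, rng=rng)
estimates = imputed.mean(axis=1)       # one estimate of the mean of y per imputed table
```

The spread of `estimates` across the M tables is the between-imputation variability that a single imputation cannot provide.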
Joint modeling
⇒ Hypothesis $x_{i.} \sim N(\mu, \Sigma)$
Expectation Maximization Bootstrap algorithm:
1 Bootstrap rows: $X^1, \ldots, X^M$
EM algorithm: $(\hat{\mu}^1, \hat{\Sigma}^1), \ldots, (\hat{\mu}^M, \hat{\Sigma}^M)$
2 Imputation: $x_{ij}^m$ drawn from $N(\hat{\mu}^m, \hat{\Sigma}^m)$
Easy to parallelize. Implemented in Amelia (website)
Amelia Earhart
James Honaker Gary King Matt Blackwell
(Fully) Conditional modeling
⇒ Hypothesis: one model/variable
1 Initial imputation: mean imputation
2 For a variable j
2.1 $(\beta_{-j}, \sigma_{-j})$ drawn from a bootstrap or a posterior distribution
2.2 Imputation: stochastic regression, $x_{ij}$ drawn from $N(X_{-j}\beta_{-j}, \sigma_{-j})$
3 Cycling through variables
4 Repeat M times steps 2 and 3
⇒ Iteratively refine the imputation.
⇒ With continuous variables and one regression per variable: equivalent to the joint model $N(\mu, \Sigma)$
Implemented in mice (website)
“There is no clear-cut method for determining
whether the MICE algorithm has converged” Stef van Buuren
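A hedged sketch of the cycle for continuous variables, assuming NumPy and a linear regression for each variable. It is not the mice implementation: the parameters are drawn by bootstrap, the number of cycles is fixed in advance, one run of the cycle yields one imputed table, and `fcs_impute` and its arguments are hypothetical names.

```python
import numpy as np

def fcs_impute(X, n_cycles=10, rng=None):
    """One imputed table by chained linear regressions (np.nan marks missing)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    n, p = X.shape
    filled = np.where(miss, np.nanmean(X, axis=0), X)       # step 1: mean imputation
    for _ in range(n_cycles):                               # step 4: repeat the cycle
        for j in range(p):                                  # step 3: cycle through variables
            if not miss[:, j].any():
                continue
            obs = np.flatnonzero(~miss[:, j])
            design = np.column_stack([np.ones(n), np.delete(filled, j, axis=1)])
            # Step 2.1: parameters drawn via a bootstrap of the rows observed for j
            boot = rng.choice(obs, size=obs.size, replace=True)
            beta, *_ = np.linalg.lstsq(design[boot], filled[boot, j], rcond=None)
            resid = filled[boot, j] - design[boot] @ beta
            sigma = np.sqrt(resid @ resid / max(obs.size - p, 1))
            # Step 2.2: stochastic regression imputation
            mis = miss[:, j]
            filled[mis, j] = design[mis] @ beta + rng.normal(0.0, sigma, size=int(mis.sum()))
    return filled

rng = np.random.default_rng(2)
cov = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]]
data = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=60)
data_na = data.copy()
data_na[rng.random(data.shape) < 0.15] = np.nan
tables = [fcs_impute(data_na, rng=rng) for _ in range(5)]   # M = 5 imputed tables
```

Because each variable has its own model, swapping the linear regression for a logistic or ordinal model for a given column only changes the inner step, which is the flexibility the slide refers to.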
Joint / Conditional modeling
⇒ Both can be seen as drawing imputed values from a joint distribution
(even if the joint distribution does not exist)
⇒ Conditional modeling takes the lead?
• Flexible: one model/variable. Easy to deal with interactions
and variables of different nature (binary, ordinal, categorical...)
• Many statistical models are conditional models!
• Tailor to your data
• Appears to work quite well in practice
⇒ Drawbacks: one model/variable... tedious...
⇒ What to do with high correlations or when n < p?
• JM shrinks the covariance: $\Sigma + kI$ (selection of k?)
• CM: ridge regression or predictor selection for each variable ⇒ a lot
of tuning ... not so easy ...
Multiple imputation with Bootstrap/Bayesian PCA
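$x_{ij} = \tilde{x}_{ij} + \varepsilon_{ij} = \sum_{s=1}^{S} \sqrt{\lambda_s}\, u_{is} v_{js} + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2)$
1 Variability of the parameters, M plausible: $(\hat{x}_{ij})^1, \ldots, (\hat{x}_{ij})^M$
Bootstrap - Iterative PCA
2 Noise: for m = 1, ..., M, missing values $x_{ij}^m$ drawn from $N(\hat{x}_{ij}^m, \hat{\sigma}^2)$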
Implemented in missMDA (website)
François Husson
Simulations
• 1000 simulations
• data set drawn from $N_p(\mu, \Sigma)$ with a two-block structure, varying n
(30 or 200), p (6 or 60) and ρ (0.3 or 0.9)
[Figure: correlation matrix with a two-block structure, entries 0 and 0.8]
• 10% or 30% of missing values using a MCAR mechanism
• multiple imputation using M = 20 imputed data
• Quantities of interest: θ1 = E [Y ] , θ2 = β1, θ3 = ρ
• Criteria
• bias
• CI width, coverage
Results for the expectation
parameters      confidence interval width      coverage
   n p ρ %      Amelia MICE BayesMIPCA         Amelia MICE BayesMIPCA
1 30 6 0.3 0.1 0.803 0.805 0.781 0.955 0.953 0.950
2 30 6 0.3 0.3 1.010 0.898 0.971 0.949
3 30 6 0.9 0.1 0.763 0.759 0.756 0.952 0.95 0.949
4 30 6 0.9 0.3 0.818 0.783 0.965 0.953
5 30 60 0.3 0.1 0.775 0.955
6 30 60 0.3 0.3 0.864 0.952
7 30 60 0.9 0.1 0.742 0.953
8 30 60 0.9 0.3 0.759 0.954
9 200 6 0.3 0.1 0.291 0.294 0.292 0.947 0.947 0.946
10 200 6 0.3 0.3 0.328 0.334 0.325 0.954 0.959 0.952
11 200 6 0.9 0.1 0.281 0.281 0.281 0.953 0.95 0.952
12 200 6 0.9 0.3 0.288 0.289 0.288 0.948 0.951 0.951
13 200 60 0.3 0.1 0.304 0.289 0.957 0.945
14 200 60 0.3 0.3 0.384 0.313 0.981 0.958
15 200 60 0.9 0.1 0.282 0.279 0.951 0.948
16 200 60 0.9 0.3 0.296 0.283 0.958 0.952
Joint, conditional and PCA
⇒ Good estimates of the parameters and their variance from incomplete data (coverage close to 0.95)
The variability due to missing values is well taken into account
Amelia & mice have difficulties with large correlations or n < p
missMDA does not, but requires a tuning parameter: the number of dimensions
Amelia & missMDA are based on linear relationships
mice is more flexible (one model per variable)
MI based on PCA works in a large range of configurations: n < p or n > p, strong or weak relationships, low or high percentage of missing values
Remarks
⇒ MI theory: good theory for regression parameters. Others?
⇒ Imputation model as complex as the analysis model (interaction)
⇒ Some practical issues:
• Imputations not in agreement (X and X²): passive imputation
• Imputation out of range? (Predictive mean matching pmm)
• Problems of logical bounds (> 0) ⇒ truncation?
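Predictive mean matching keeps imputations inside the observed range by borrowing observed values from donors with similar predictions. Below is a minimal base-R sketch for one incomplete variable `y` and one complete predictor `x`. `pmm_impute` is a hypothetical helper, and unlike a proper multiple-imputation version it does not redraw the regression coefficients before matching.

```r
# Hypothetical helper: single predictive mean matching pass.
# y: numeric vector with NAs; x: complete numeric predictor; k: number of donors.
pmm_impute <- function(y, x, k = 5) {
  obs <- !is.na(y)
  fit <- lm(y ~ x, subset = obs)
  pred <- coef(fit)[1] + coef(fit)[2] * x     # predictions for every row
  y_imp <- y
  for (i in which(!obs)) {
    # The k observed rows whose predictions are closest to that of row i
    donors <- which(obs)[order(abs(pred[obs] - pred[i]))[seq_len(k)]]
    # Borrow the observed value of one donor chosen at random
    y_imp[i] <- y[donors[sample.int(k, 1)]]
  }
  y_imp
}
```

Because every imputed value is an observed value of `y`, a strictly positive variable stays strictly positive, which addresses both the out-of-range and the logical-bound issues without truncation.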
MI for categorical variables
• Loglinear model: R package cat (J.L. Schafer)
• Fully conditional specification: R package mice (van Buuren)
• Imputation with Gaussian distribution
• Latent class variables (mixture models): each sample belongs to a latent class in which variables are independent (D. Vidotto, M.C. Kapteijn & J.K. Vermunt, 2014)
Non-parametric version: Dirichlet process mixture of products of multinomial distributions model, DPMPM (Y. Si & J.P. Reiter, 2014)
Multiple imputation for categorical data using MCA
A set of parameters:
$$\left(U_{I\times S}, \Lambda^{1/2}_{S\times S}, V_{J\times S}\right)^1, \ldots, \left(U_{I\times S}, \Lambda^{1/2}_{S\times S}, V_{J\times S}\right)^M$$
obtained using a non-parametric bootstrap approach:
1 Generate M bootstrap replicates
2 Estimate the parameters on each incomplete replicate
3 Add uncertainty on the prediction
Multiple imputation with MCA
1 Variability of the parameters of MCA $(U_{I\times S}, \Lambda^{1/2}_{S\times S}, V_{J\times S})$ using a non-parametric bootstrap:
→ define M weightings $(R_m)_{1\le m\le M}$ for the individuals
2 Estimate MCA parameters using the SVD of $\left(X, \frac{1}{K}(D_\Sigma)^{-1}, R_m\right)$
[Figure: the M fuzzy indicator matrices $\hat{X}^1, \hat{X}^2, \ldots, \hat{X}^M$. Observed cells are 0/1; the two imputed rows hold (0.81, 0.19) and (0.25, 0.75) in $\hat{X}^1$, (0.60, 0.40) and (0.26, 0.74) in $\hat{X}^2$, (0.74, 0.16) and (0.20, 0.80) in $\hat{X}^M$]
Imputing the majority category (A and B in every table) ⇒ lack of variability
3 Draw categories from the values of $(\hat{X}^m)_{1\le m\le M}$
[Figure: the M imputed categorical tables; the drawn categories differ between tables (B, B in the first, A, B in the second, B, B in the last)]
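The difference between the majority rule and step 3 can be sketched in base R for a single categorical variable. The helper names `impute_majority` and `impute_draw` are hypothetical, and the clipping and renormalisation of each row are assumptions made so that the fuzzy values can be used as probabilities.

```r
# P: n x K matrix of fuzzy indicator values for one variable
# (one column per category); levels: the K category labels.

# Majority rule: always the category with the largest value.
impute_majority <- function(P, levels) {
  levels[max.col(P, ties.method = "first")]
}

# Step 3: draw a category with probabilities given by each row.
impute_draw <- function(P, levels) {
  P <- pmax(P, 0)              # guard against slightly negative fitted values
  P <- P / rowSums(P)          # renormalise each row to a probability vector
  apply(P, 1, function(pr) sample(levels, 1, prob = pr))
}

P <- rbind(c(0.81, 0.19), c(0.25, 0.75))   # values taken from the first table
impute_majority(P, c("A", "B"))            # identical for every repetition
impute_draw(P, c("A", "B"))                # varies from one draw to the next
```

With the majority rule the M tables would all receive the same categories, so the between-imputation variance would be understated; drawing restores that variability.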
Simulations
• Quantities of interest: θ = parameters of a logistic model
• 200 simulations from real data sets
• the real data set is considered as a population
• draw one sample from the data set
• generate 20% of missing values
• multiple imputation using M = 5 imputed data sets
• Criteria
• bias
• CI width, coverage
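The three criteria can be computed from the simulation output as in the sketch below. `evaluate_sim` is a hypothetical helper. It assumes one pooled estimate and one pooled standard error per simulated data set, and it uses a normal quantile for simplicity, whereas multiple imputation intervals are usually based on a Student t distribution.

```r
# Hypothetical helper: est and se hold one pooled estimate and standard error
# per simulation; theta is the true value of the quantity of interest.
evaluate_sim <- function(est, se, theta, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)      # normal quantile (simplifying assumption)
  lower <- est - z * se
  upper <- est + z * se
  c(bias = mean(est) - theta,
    ci_width = mean(upper - lower),
    coverage = mean(lower <= theta & theta <= upper))
}
```

A coverage close to the nominal level indicates that the variability due to missing values is properly reflected in the pooled variance; coverage well below it signals overconfident intervals.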
Results - Inference
[Figure: boxplots of coverage (axis from 0.80 to 1.00) per method, one panel per data set. Titanic: MIMCA 5, Loglinear, Latent class, FCS-log, FCS-rf. Galetas: MIMCA 2, Loglinear, Latent class, FCS-log, FCS-rf. Income: MIMCA 5, Latent class, FCS-log, FCS-rf.]
                       Titanic  Galetas  Income
Number of variables          4        4      14
Number of categories       ≤ 4     ≤ 11     ≤ 9
Results - Time
                      Titanic   Galetas    Income
MIMCA                   2.750     8.972    58.729
Loglinear               0.740     4.597        NA
Latent class model     10.854    17.414   143.652
FCS logistic            4.781    38.016   881.188
FCS forests           265.771   112.987  6329.514
Table: Time in seconds

                        Titanic  Galetas  Income
Number of individuals      2201     1192    6876
Number of variables           4        4      14
Conclusion
Multiple imputation methods for continuous and categorical data using a dimensionality reduction method
Properties:
• requires a small number of parameters
• captures the relationships between variables
• captures the similarities between individuals
From a practical point of view:
• can be applied to data sets of various dimensions
• provides correct inferences for analysis models based on relationships between pairs of variables
• requires choosing the number of dimensions S
Perspective:
• mixed data
Mixed variables
⇒ Joint modeling:
• General location model (Schafer, 1997) ⇒ problematic when there are many categories
• Transform the categorical variables into dummy variables and treat them as continuous variables (Amelia)
• Latent class models (Vermunt); nonparametric Bayesian models (work in progress: Dunson, Reiter, Duke University)
⇒ Conditional modeling: linear, logistic, multinomial logit models (mice), random forests
To conclude
Take home message:
• “The idea of imputation is both seductive and dangerous. It is seductive
because it can lull the user into the pleasurable state of believing that the data
are complete after all, and it is dangerous because it lumps together situations
where the problem is sufficiently minor that it can be legitimately handled in
this way and situations where standard estimators applied to the real and
imputed data have substantial biases.” (Dempster and Rubin, 1983)
• Advanced methods are available to estimate parameters and
their variance (taking into account the variability due to
missing values)
• Multiple imputation is an appealing method... but how can we handle big data?
• Still an active area of research
Resources
⇒ Software:
• van Buuren webpage:
http://www.stefvanbuuren.nl/mi/Software.html
• CRAN Task View: Official Statistics & Survey Methodology
⇒ Recent books:
• van Buuren (2012). Flexible Imputation of Missing Data. Chapman & Hall/CRC
• Carpenter & Kenward (2013). Multiple Imputation and its Application. Wiley
• G. Molenberghs, G. Fitzmaurice, M.G. Kenward, A. Tsiatis & G. Verbeke (Nov. 2014). Handbook of Missing Data. Chapman & Hall/CRC
⇒ Little & Rubin (2002). Statistical Analysis with Missing Data. Schafer (1997). Analysis of Incomplete Multivariate Data
⇒ J.L. Schafer & J.W. Graham (2002). Missing Data: Our View of the State of the Art. Psychological Methods, 7, 147-177
⇒ B. Efron (1994). Missing Data, Imputation, and the Bootstrap. Journal of the American Statistical Association, 89(426), 463-475
Contributors on the topic of multiple imputation
• J. Honaker - G. King - M. Blackwell (Harvard): Amelia
• S. van Buuren (Utrecht): mice
• F. Husson - J. Josse (Rennes): missMDA
• A. Gelman - J. Hill - Y. Su (Columbia): mi
• J. Reiter (Duke): NPBayesImpute, Non-Parametric Bayesian Multiple Imputation for Categorical Data
• J. Bartlett - J. Carpenter - M. Kenward (UCL): smcfcs, Substantive model compatible FCS multiple imputation
• H. Goldstein (Bristol): realcom for multi-level data
• J.K. Vermunt (Tilburg): poLCA latent class models
• Shaun Seaman (Medical Research Council Biostatistics Unit, UK), Roderick Little (Michigan)...
• Donald B Rubin (Harvard)
Conference on missing data and matrix completion
http://missdata2015.agrocampus-ouest.fr/
A real dataset
O3 T9 T12 T15 Ne9 Ne12 Ne15 Vx9 Vx12 Vx15 O3v
0601 NA 15.6 18.5 18.4 4 4 8 NA -1.7101 -0.6946 84
0602 82 17 18.4 17.7 5 5 7 NA NA NA 87
0603 92 NA 17.6 19.5 2 5 4 2.9544 1.8794 0.5209 82
0604 114 16.2 NA NA 1 1 0 NA NA NA 92
0605 94 17.4 20.5 NA 8 8 7 -0.5 NA -4.3301 114
0606 80 17.7 NA 18.3 NA NA NA -5.6382 -5 -6 94
0607 NA 16.8 15.6 14.9 7 8 8 -4.3301 -1.8794 -3.7588 80
0610 79 14.9 17.5 18.9 5 5 4 0 -1.0419 -1.3892 NA
0611 101 NA 19.6 21.4 2 4 4 -0.766 NA -2.2981 79
0612 NA 18.3 21.9 22.9 5 6 8 1.2856 -2.2981 -3.9392 101
0613 101 17.3 19.3 20.2 NA NA NA -1.5 -1.5 -0.8682 NA
...
0919 NA 14.8 16.3 15.9 7 7 7 -4.3301 -6.0622 -5.1962 42
0920 71 15.5 18 17.4 7 7 6 -3.9392 -3.0642 0 NA
0921 96 NA NA NA 3 3 3 NA NA NA 71
0922 98 NA NA NA 2 2 2 4 5 4.3301 96
0923 92 14.7 17.6 18.2 1 4 6 5.1962 5.1423 3.5 98
0924 NA 13.3 17.7 17.7 NA NA NA -0.9397 -0.766 -0.5 92
0925 84 13.3 17.7 17.8 3 5 6 0 -1 -1.2856 NA
0927 NA 16.2 20.8 22.1 6 5 5 -0.6946 -2 -1.3681 71
0928 99 16.9 23 22.6 NA 4 7 1.5 0.8682 0.8682 NA
0929 NA 16.9 19.8 22.1 6 5 3 -4 -3.7588 -4 99
0930 70 15.7 18.6 20.7 NA NA NA 0 -1.0419 -4 NA
Count missing values
> library(VIM)
> aggr(don,only.miss=TRUE,sortVar=TRUE)
> res<-summary(aggr(don,prop=TRUE,combined=TRUE))$combinations
> res[rev(order(res[,2])),]
Variables sorted by number of missings:
 Variable      Count
     Ne12 0.37500000
       T9 0.33035714
      T15 0.33035714
      Ne9 0.30357143
      T12 0.29464286
     Ne15 0.28571429
     Vx15 0.18750000
      Vx9 0.16071429
    maxO3 0.14285714
   maxO3v 0.10714286
     Vx12 0.08928571

           Combinations Count    Percent
  0:0:0:0:0:0:0:0:0:0:0    13 11.6071429
  0:1:1:1:0:0:0:0:0:0:0     7  6.2500000
  0:0:0:0:0:1:0:0:0:0:0     5  4.4642857
  0:1:0:0:0:0:0:0:0:0:0     4  3.5714286
  0:1:0:0:1:1:1:0:0:0:0     3  2.6785714
  0:0:1:0:0:0:0:0:0:0:0     3  2.6785714
  0:0:0:1:0:0:0:0:0:0:0     3  2.6785714
  0:0:0:0:1:1:1:0:0:0:0     3  2.6785714
  0:0:0:0:0:1:0:0:0:0:1     3  2.6785714
  0:1:1:1:1:0:0:0:0:0:0     2  1.7857143
  0:0:0:0:1:0:0:0:0:1:0     2  1.7857143
  0:0:0:0:0:0:1:1:0:0:0     2  1.7857143
  0:0:0:0:0:0:1:0:0:0:0     2  1.7857143
  .....................     .        ...
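The same two summaries can be obtained without VIM, which may help to see what is being counted. This is a base-R sketch; `count_missing` is a hypothetical helper, and the 0/1 pattern strings only mimic the layout of the output above.

```r
# Hypothetical helper: proportion of missing values per variable and
# frequency of each missingness pattern (1 = missing, 0 = observed).
count_missing <- function(don) {
  miss <- is.na(don)
  per_variable <- sort(colMeans(miss), decreasing = TRUE)
  patterns <- apply(miss * 1L, 1, paste, collapse = ":")
  combos <- sort(table(patterns), decreasing = TRUE)
  list(per_variable = per_variable,
       combinations = combos,
       percent = 100 * combos / nrow(don))
}
```

The pattern made only of zeros counts the complete rows, i.e. the only rows that a deletion approach would keep.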
Pattern visualization
[Figure: aggr plot. Left panel: proportion of missings per variable (0 to 0.35). Right panel: combinations of missing values. Variables in both panels, in order: Ne12, T9, T15, Ne9, T12, Ne15, Vx15, Vx9, maxO3, maxO3v, Vx12]
> aggr(don,only.miss=TRUE,sortVar=TRUE)
Visualization
[Figure: matrixplot of the 11 variables (maxO3, T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15, maxO3v) against the observation index (0 to 100), and marginplot of maxO3 (40 to 160) against T9 (12 to 24)]
> matrixplot(don,sortby=2)
> marginplot(don[,c("T9","maxO3")])
Visualization with Multiple Correspondence Analysis
⇒ Create the missingness matrix
> mis.ind <- matrix("o",nrow=nrow(don),ncol=ncol(don))
> mis.ind[is.na(don)] <- "m"
> dimnames(mis.ind) <- dimnames(don)
> mis.ind
maxO3 T9 T12 T15 Ne9 Ne12 Ne15 Vx9 Vx12 Vx15 maxO3v
20010601 "o" "o" "o" "m" "o" "o" "o" "o" "o" "o" "o"
20010602 "o" "m" "m" "m" "o" "o" "o" "o" "o" "o" "o"
20010603 "o" "o" "o" "o" "o" "m" "m" "o" "m" "o" "o"
20010604 "o" "o" "o" "m" "o" "o" "o" "m" "o" "o" "o"
20010605 "o" "m" "o" "o" "m" "m" "m" "o" "o" "o" "o"
20010606 "o" "o" "o" "o" "o" "m" "o" "o" "o" "o" "o"
20010607 "o" "o" "o" "o" "o" "o" "m" "o" "o" "o" "o"
20010610 "o" "o" "o" "o" "o" "o" "m" "o" "o" "o" "o"
Visualization with Multiple Correspondence Analysis
[Figure: MCA graph of the categories, Dim 1 (19.07%) against Dim 2 (17.71%). The points are the categories _m (missing) and _o (observed) of the 11 variables maxO3, T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15, maxO3v]
> library(FactoMineR)
> resMCA <- MCA(mis.ind)
> plot(resMCA,invis="ind",title="MCA graph of the categories")
Imputation with PCA
⇒ Step 1: Estimation of the number of dimensions
> library(missMDA)
> nb <- estim_ncpPCA(don,method.cv="Kfold")
> nb$ncp #2
> plot(0:5,nb$criterion,xlab="nb dim", ylab="MSEP")
[Figure: MSEP (about 4000 to 7000) against the number of dimensions (0 to 5)]
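The idea behind this cross-validation can be sketched in base R: hide some observed cells, predict them with S dimensions, and keep the S with the smallest mean squared error of prediction. The sketch below is not the estim_ncpPCA implementation. `complete_pca` and `cv_ncp` are hypothetical helpers, the iterative PCA is unregularized, and the folds are drawn over observed cells.

```r
# Hypothetical helper: complete X by iterative PCA with S dimensions
# (S = 0 amounts to mean imputation).
complete_pca <- function(X, S, maxit = 100, tol = 1e-8) {
  miss <- is.na(X)
  Xc <- X
  for (j in seq_len(ncol(X))) Xc[miss[, j], j] <- mean(X[, j], na.rm = TRUE)
  if (S == 0) return(Xc)
  for (it in seq_len(maxit)) {
    mu <- colMeans(Xc)
    sv <- svd(sweep(Xc, 2, mu))
    fit <- sv$u[, 1:S, drop = FALSE] %*%
      (sv$d[1:S] * t(sv$v[, 1:S, drop = FALSE]))
    fit <- sweep(fit, 2, mu, "+")
    change <- sum((fit[miss] - Xc[miss])^2)
    Xc[miss] <- fit[miss]
    if (change < tol) break
  }
  Xc
}

# Hypothetical helper: K-fold cross-validation over the observed cells.
# Returns the MSEP for S = 0, 1, ..., ncp_max.
# Assumes no column becomes entirely missing within a fold.
cv_ncp <- function(X, ncp_max = 5, nfold = 5) {
  obs <- which(!is.na(X))
  fold <- sample(rep_len(seq_len(nfold), length(obs)))
  sapply(0:ncp_max, function(S) {
    sse <- 0
    for (k in seq_len(nfold)) {
      held <- obs[fold == k]
      Xk <- X
      Xk[held] <- NA                       # hide one fold of observed cells
      sse <- sse + sum((complete_pca(Xk, S)[held] - X[held])^2)
    }
    sse / length(obs)
  })
}
```

With `crit <- cv_ncp(X)`, the retained number of dimensions would be `which.min(crit) - 1`, since the first entry corresponds to S = 0.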
Imputation with PCA
⇒ Step 2: Imputation of the missing values
> res.comp <- imputePCA(don,ncp=2)
> res.comp$completeObs[1:3,]
maxO3 T9 T12 T15 Ne9 Ne12 Ne15 Vx9 Vx12 Vx15 maxO3v
0601 87 15.60 18.50 20.47 4 4.00 8.00 0.69 -1.71 -0.69 84
0602 82 18.51 20.88 21.81 5 5.00 7.00 -4.33 -4.00 -3.00 87
0603 92 15.30 17.60 19.50 2 3.98 3.81 2.95 1.97 0.52 82
PCA representation
⇒ Step 3: PCA on the completed data set
[Figure: individuals factor map (PCA), Dim 1 (57.47%) against Dim 2 (21.34%), with the wind direction categories East, North, West, South; variables factor map (PCA), Dim 1 (55.85%) against Dim 2 (21.73%), with T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15, maxO3v, maxO3]
> imp <- cbind.data.frame(res.comp$completeObs,WindDirection)
> res.pca <- PCA(imp,quanti.sup=1,quali.sup=12)
> plot(res.pca, habillage=12, label="quali"); plot(res.pca, choix="var")
> res.pca$ind$coord #scores (principal components)
Multiple imputation in practice
⇒ Step 1: Generate M imputed data sets
> library(Amelia)
> res.amelia <- amelia(don,m=100) ## in combination with zelig
> library(mice)
> res.mice <- mice(don,m=100,defaultMethod="norm.boot")
> library(missMDA)
> res.MIPCA <- MIPCA(don,ncp=2,nboot=100)
> res.MIPCA$res.MI
Multiple imputation in practice
⇒ Step 2: visualization
[Figure: Amelia diagnostics. Left: observed and imputed values of T12 (fraction missing: 0.295), relative density of the mean imputations and of the observed values. Right: observed versus imputed values of maxO3, coloured by fraction missing (0-.2, .2-.4, .4-.6, .6-.8, .8-1)]
> library(Amelia)
> compare.density(res.amelia, var="T12")
> overimpute(res.amelia, var="maxO3")
See also the stripplot function in mice.
> res.MIPCA <- MIPCA(don,ncp=2)
> plot(res.MIPCA,choice= "ind.supp"); plot(res.MIPCA,choice= "var")
[Figure: MIPCA plots. Supplementary projection of the imputed data sets for the 112 individuals, Dim 1 (57.20%) against Dim 2 (20.27%); variable representation on the same dimensions for maxO3, T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15, maxO3v]
Multiple imputation in practice
⇒ Step 3. Regression on each table and pool the results
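$$\hat{\beta} = \frac{1}{M}\sum_{m=1}^{M} \hat{\beta}_m$$
$$T = \frac{1}{M}\sum_{m} \mathrm{Var}\left(\hat{\beta}_m\right) + \left(1 + \frac{1}{M}\right)\frac{1}{M-1}\sum_{m}\left(\hat{\beta}_m - \hat{\beta}\right)^2$$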
> library(mice)
> imp.mice <- mice(don,m=100,defaultMethod="norm")
> lm.mice.out <- with(imp.mice, lm(maxO3 ~ T9+T12+T15+Ne9+Ne12+
Ne15+Vx9+Vx12+Vx15+maxO3v))
> pool.mice <- pool(lm.mice.out)
> summary(pool.mice)
est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
(Intercept) 19.31 16.30 1.18 50.48 0.24 -13.43 52.05 NA 0.46 0.44
T9 -0.88 2.25 -0.39 26.43 0.70 -5.50 3.75 37 0.71 0.69
T12 3.29 2.38 1.38 27.54 0.18 -1.59 8.18 33 0.70 0.68
....
Vx15 0.23 1.33 0.17 39.00 0.87 -2.47 2.93 21 0.57 0.55
maxO3v 0.36 0.10 3.65 46.03 0.00 0.16 0.56 12 0.50 0.48
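The pooling formulas of step 3 can be written out for a single coefficient in a few lines of base R. `pool_rubin` is a hypothetical helper that takes the M estimates and their M estimated variances; it returns only the pooled estimate and standard error, not the degrees of freedom or the fmi and lambda columns of the mice summary.

```r
# Hypothetical helper: pooling for one coefficient.
# est: the M estimates; var_within: their M estimated variances.
pool_rubin <- function(est, var_within) {
  M <- length(est)
  beta <- mean(est)                    # pooled estimate
  W <- mean(var_within)                # within-imputation variance
  B <- var(est)                        # between-imputation variance, 1/(M-1) sum
  total <- W + (1 + 1 / M) * B         # total variance T
  c(estimate = beta, se = sqrt(total))
}
```

For instance, `pool_rubin(c(1, 2, 3), c(0.5, 0.5, 0.5))` combines a within variance of 0.5 with a between variance of 1, inflated by the factor 1 + 1/3.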
Mixed imputation in practice
> library(missMDA)
> imputeFAMD(mydata,ncp=2)
> library(missForest)
> missForest(mydata)
> library(mice)
> mice(mydata)
> mice(mydata, defaultMethod = "rf") ## mice with random forests
An ecological data set
Glopnet data: 2494 species described by 6 quantitative variables
• LMA (leaf mass per area)
• LL (leaf lifespan)
• Amass (photosynthetic assimilation)
• Nmass (leaf nitrogen)
• Pmass (leaf phosphorus)
• Rmass (dark respiration rate)
and 1 categorical variable: the biome
Wright IJ, et al. (2004). The worldwide leaf economics spectrum.
Nature, 428:821.
www.nature.com/nature/journal/v428/n6985/extref/nature02403-s2.xls
An ecological data set
> sum(is.na(don))/(nrow(don)*ncol(don)) # 53% of missing values
[1] 0.5338145
> dim(na.omit(don)) ## Delete species with missing values
[1] 72 6 ## only 72 remaining species!
> library(VIM)
> aggr(don,numbers=TRUE,sortVar=TRUE)
[Figure: aggr plot. Left panel: proportion of missings per variable (0 to 0.8), in order Rmass, LL, Pmass, Amass, Nmass, LMA. Right panel: combinations of missing values with their proportions, from 0.2326 down to 0.0004]
An ecological data set
[Figure: MCA graph of the categories, Dim 1 (33.67%) against Dim 2 (21.07%). The points are the categories _m (missing) and _o (observed) of LL, LMA, Nmass, Pmass, Amass, Rmass]
> mis.ind <- matrix("o",nrow=nrow(don),ncol=ncol(don))
> mis.ind[is.na(don)] <- "m"
> dimnames(mis.ind) <- dimnames(don)
> library(FactoMineR)
> resMCA <- MCA(mis.ind)
> plot(resMCA,invis="ind",title="MCA graph of the categories")
An ecological data set
What about mean imputation?
[Figure: individuals factor map (PCA) after mean imputation, Dim 1 (44.79%) against Dim 2 (23.50%), with the biome categories alpine, boreal, desert, grass/m, temp_for, temp_rf, trop_for, trop_rf, tundra, wland]
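A small simulated example, unrelated to the Glopnet data, may help to see why mean imputation is problematic before a PCA. The variable names and the simulation settings below are assumptions chosen for illustration: two correlated variables, with half of the second one deleted completely at random.

```r
# Replace every missing value by the mean of its column.
mean_impute <- function(X) {
  for (j in seq_len(ncol(X))) X[is.na(X[, j]), j] <- mean(X[, j], na.rm = TRUE)
  X
}

set.seed(5)
n <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)   # correlation of about 0.9
X <- cbind(x1, x2)
X[sample(n, n / 2), 2] <- NA                  # 50% MCAR in x2
Ximp <- mean_impute(X)

# Spread of x2: observed values versus mean-imputed column
c(sd_observed = sd(X[, 2], na.rm = TRUE), sd_imputed = sd(Ximp[, 2]))
# Correlation: complete cases versus mean-imputed data
c(cor_complete_cases = cor(X, use = "complete.obs")[1, 2],
  cor_imputed = cor(Ximp)[1, 2])
```

The imputed column has a smaller standard deviation and a weaker correlation with `x1`. Since PCA is built on the correlation structure, this kind of distortion carries over to the factor maps.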
An ecological data set
[Figure: individuals factor map (PCA), Dim 1 (91.18%) against Dim 2 (4.97%), with the biome categories alpine, boreal, desert, grass/m, temp_for, temp_rf, trop_for, trop_rf, tundra, wland; a second individuals factor map on a zoomed scale showing the biome categories; variables factor map (PCA) on the same dimensions with LL, LMA, Nmass, Pmass, Amass, Rmass]
> library(missMDA)
> library(FactoMineR)  # provides PCA()
> nb <- estim_ncpPCA(don, method.cv="Kfold", nbsim=100)  # nb$ncp: estimated number of dimensions
> res.comp <- imputePCA(don, ncp=2)  # single imputation with 2 dimensions
> imp <- cbind.data.frame(res.comp$completeObs, tab.init[,1:4])
> res.pca <- PCA(imp, quanti.sup=1, quali.sup=12)
> plot(res.pca, habillage=12, label="quali"); plot(res.pca, choix="var")
> res.pca$ind$coord  # scores (principal components)
> res.MIPCA <- MIPCA(don, ncp=2)  # multiple imputation with PCA
> plot(res.MIPCA, choice="ind.supp"); plot(res.MIPCA, choice="var")
Expectation - Maximization (Dempster et al., 1977)
Requires modifying the estimation process (not always easy!)
Rationale: obtain ML estimates from the observed values (maximize $L_{obs}$) by maximizing the complete-data likelihood $L_{comp}$ of $X = (X_{obs}, X_{miss})$. Augmenting the data simplifies the problem.
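E step (conditional expectation):
$Q(\theta, \theta^{\ell}) = \int \ln(f(X|\theta)) \, f(X_{miss}|X_{obs}, \theta^{\ell}) \, dX_{miss}$
M step (maximization):
$\theta^{\ell+1} = \mathrm{argmax}_{\theta} \, Q(\theta, \theta^{\ell})$
Result: when $\theta^{\ell+1}$ maximizes $Q(\theta, \theta^{\ell})$, then $L(X_{obs}, \theta^{\ell+1}) \geq L(X_{obs}, \theta^{\ell})$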
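The E and M steps can be sketched in base R for a multivariate normal model. This is a minimal illustration on simulated data, not the implementation used by any package: the function names `obs_loglik` and `em_mvnorm`, the simulated matrix `X`, the starting values and the fixed number of iterations are all assumptions made for the example (at least two variables are assumed).

```r
# Observed-data log-likelihood of N(mu, Sigma), using only non-missing entries.
obs_loglik <- function(X, mu, Sigma) {
  total <- 0
  for (i in seq_len(nrow(X))) {
    o <- !is.na(X[i, ])
    if (!any(o)) next
    d <- X[i, o] - mu[o]
    S_oo <- Sigma[o, o, drop = FALSE]
    log_det <- as.numeric(determinant(S_oo, logarithm = TRUE)$modulus)
    total <- total - 0.5 * (sum(o) * log(2 * pi) + log_det + sum(d * solve(S_oo, d)))
  }
  total
}

# EM for the mean and covariance of a multivariate normal with missing values.
em_mvnorm <- function(X, n_iter = 50) {
  n <- nrow(X); p <- ncol(X)
  mu <- colMeans(X, na.rm = TRUE)                   # hypothetical starting values
  Sigma <- diag(apply(X, 2, var, na.rm = TRUE))
  loglik <- numeric(n_iter)
  for (it in seq_len(n_iter)) {
    loglik[it] <- obs_loglik(X, mu, Sigma)          # value at the current parameters
    S1 <- numeric(p); S2 <- matrix(0, p, p)
    for (i in seq_len(n)) {
      # E step: conditional mean and covariance of the missing part of row i
      xi <- X[i, ]; m <- is.na(xi); o <- !m
      C <- matrix(0, p, p)
      if (any(m) && any(o)) {
        B <- Sigma[m, o, drop = FALSE] %*% solve(Sigma[o, o, drop = FALSE])
        xi[m] <- mu[m] + drop(B %*% (xi[o] - mu[o]))
        C[m, m] <- Sigma[m, m, drop = FALSE] - B %*% Sigma[o, m, drop = FALSE]
      } else if (any(m)) {
        xi[m] <- mu[m]; C[m, m] <- Sigma[m, m, drop = FALSE]
      }
      S1 <- S1 + xi
      S2 <- S2 + tcrossprod(xi) + C
    }
    # M step: update the parameters from the expected sufficient statistics
    mu <- S1 / n
    Sigma <- S2 / n - tcrossprod(mu)
  }
  list(mu = mu, Sigma = Sigma, loglik = loglik)
}

# Simulated example: two correlated variables, 30% of the second one missing (MCAR).
set.seed(1)
n <- 200
Sigma_true <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
X <- matrix(rnorm(2 * n), n, 2) %*% chol(Sigma_true)
X[sample(n, 60), 2] <- NA
fit <- em_mvnorm(X)
fit$mu
fit$Sigma
all(diff(fit$loglik) >= -1e-8)   # the observed log-likelihood should never decrease
```

The last line checks the result stated above numerically: each iteration should leave the observed-data log-likelihood unchanged or larger.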
Maximum likelihood approach
Hypothesis: $x_{i.} \sim \mathcal{N}(\mu, \Sigma)$
⇒ Point estimates with EM:
> library(norm)
> pre <- prelim.norm(as.matrix(don))
> thetahat <- em.norm(pre)
> getparam.norm(pre,thetahat)
⇒ Variances:
• Supplemented EM (Meng, 1991)
• Bootstrap approach:
• Bootstrap rows: X1, ... , XB
• EM algorithm: $(\hat{\mu}_1, \hat{\Sigma}_1), \ldots, (\hat{\mu}_B, \hat{\Sigma}_B)$
Issue: a specific method has to be developed for each statistical method
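The bootstrap approach to the variances can be sketched as follows. To keep the example short and free of package dependencies, it assumes a bivariate normal model in which only the second variable has missing values; the function `em_bivariate`, the simulated data and the choice B = 200 are hypothetical and only illustrate the idea of resampling rows and rerunning EM on each resample.

```r
# EM for a bivariate normal where x1 is complete and x2 has missing values.
em_bivariate <- function(x1, x2, n_iter = 100) {
  miss <- is.na(x2)
  mu1 <- mean(x1); s11 <- mean((x1 - mu1)^2)   # x1 is complete: direct ML estimates
  mu2 <- mean(x2, na.rm = TRUE); s22 <- var(x2, na.rm = TRUE); s12 <- 0
  for (it in seq_len(n_iter)) {
    # E step: conditional mean and variance of the missing x2 given x1
    x2_hat <- ifelse(miss, mu2 + s12 / s11 * (x1 - mu1), x2)
    cond_var <- s22 - s12^2 / s11
    # M step: update the parameters involving x2
    mu2 <- mean(x2_hat)
    s12 <- mean(x1 * x2_hat) - mu1 * mu2
    s22 <- mean(x2_hat^2 + miss * cond_var) - mu2^2
  }
  c(mu1 = mu1, mu2 = mu2, s11 = s11, s12 = s12, s22 = s22)
}

# Simulated data with 30% of x2 missing completely at random.
set.seed(2015)
n <- 150
x1 <- rnorm(n)
x2 <- 1 + 0.7 * x1 + rnorm(n, sd = 0.5)
x2[sample(n, 45)] <- NA

theta_hat <- em_bivariate(x1, x2)             # point estimates on the original data

# Bootstrap rows, rerun EM on each resample, and collect the estimates.
B <- 200
boot <- t(replicate(B, {
  idx <- sample(n, replace = TRUE)
  em_bivariate(x1[idx], x2[idx])
}))
se <- apply(boot, 2, sd)                      # bootstrap standard errors
rbind(estimate = theta_hat, std.error = se)
```

The same loop could wrap the `prelim.norm` / `em.norm` / `getparam.norm` calls shown above, under the assumption that each resampled matrix still has observed values in every column.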
MI using the loglinear model
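• Hypothesis $X = (x_{ijk})_{i,j,k}$:
$X|\theta \sim \mathcal{M}(n, \theta)$ where:
$\log(\theta_{ijk}) = \lambda_0 + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk} + \lambda^{ABC}_{ijk}$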
1 Variability of the parameters
• prior on θ: $\theta \,|\, \theta \in \Theta \sim \mathcal{D}(\alpha)$
• posterior: $\theta \,|\, x, \theta \in \Theta \sim \mathcal{D}(\alpha')$
• Data Augmentation (M.A. Tanner, W.H. Wong, 1987)
2 Imputation according to the loglinear model using the set of M parameters
• Implemented: R package cat (J.L. Schafer)
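The two steps above can be sketched with a small data augmentation sampler in base R. This is only an illustration under strong simplifying assumptions: two categorical variables, a saturated model (all interactions kept), a flat Dirichlet prior with α = 1, and one independent chain per imputed data set. The function names `rdirichlet1` and `da_impute` and the simulated data are hypothetical; the package cat is not used here.

```r
# One draw from a Dirichlet distribution via normalized gamma variables.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

# Data augmentation for a two-way table: A is complete, B has missing values.
da_impute <- function(A, B, alpha = 1, n_iter = 100) {
  miss <- is.na(B)
  B_cur <- B
  B_cur[miss] <- sample(levels(B), sum(miss), replace = TRUE)   # random start
  for (it in seq_len(n_iter)) {
    # P step: draw the cell probabilities from their Dirichlet posterior
    counts <- table(A, B_cur)
    theta <- matrix(rdirichlet1(alpha + as.vector(counts)),
                    nrow = nrow(counts), dimnames = dimnames(counts))
    # I step: impute each missing B from P(B | A, theta)
    cond <- theta / rowSums(theta)
    for (i in which(miss)) {
      B_cur[i] <- sample(levels(B), 1, prob = cond[as.integer(A[i]), ])
    }
  }
  list(B = B_cur, theta = theta)
}

# Simulated data: B depends on A, 30% of B missing completely at random.
set.seed(42)
n <- 300
A <- factor(sample(c("a1", "a2"), n, replace = TRUE))
B <- factor(ifelse(runif(n) < ifelse(A == "a1", 0.8, 0.3), "b1", "b2"),
            levels = c("b1", "b2"))
B[sample(n, 90)] <- NA

# M = 5 imputed data sets, each generated with its own parameter draw.
M <- 5
imputations <- replicate(M, da_impute(A, B), simplify = FALSE)
sapply(imputations, function(z) table(z$B))
```

Each imputed data set is generated from a different draw of θ, which is what carries the variability of the parameters into the imputations.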
MI using a DPMPM model (Si and Reiter, 2013)
• Hypothesis: $P(X = (x_1, \ldots, x_K); \theta) = \sum_{\ell=1}^{L} \theta_{\ell} \prod_{k=1}^{K} \theta^{(\ell)}_{x_k}$
1 Variability of the parameters:
• a hierarchical prior on θ:
$\alpha \sim \mathcal{G}(.25, .25)$, $\zeta_{\ell} \sim \mathcal{B}(1, \alpha)$, $\theta_{\ell} = \zeta_{\ell} \prod_{g<\ell} (1 - \zeta_g)$ for $\ell$ in $1, \ldots, \infty$
• posterior on θ: intractable
→ Gibbs sampler and Data Augmentation
2 Imputation according to the mixture model using the set of M parameters
• Implemented: R package mi (Gelman et al.)
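The stick-breaking prior and the mixture probability can be illustrated with a short base R sketch. Several choices here are assumptions made for the example: the infinite sum is truncated at L = 20 classes (the last ζ is set to 1 so the weights sum to one), the Gamma distribution is read as shape/rate, and the within-class probabilities (called `phi` in the code, corresponding to $\theta^{(\ell)}$ in the formula) are drawn at random rather than sampled from a posterior.

```r
set.seed(7)
L <- 20                                        # truncation level (assumption)

# Stick-breaking weights: theta_l = zeta_l * prod_{g < l} (1 - zeta_g)
alpha <- rgamma(1, shape = 0.25, rate = 0.25)
zeta <- c(rbeta(L - 1, 1, alpha), 1)           # last stick takes the remainder
theta <- zeta * c(1, cumprod(1 - zeta)[-L])

# Within-class category probabilities for K = 2 variables with 3 and 2 categories.
n_cat <- c(3, 2)
phi <- lapply(n_cat, function(c_k) {
  g <- matrix(rgamma(L * c_k, shape = 1), L, c_k)
  g / rowSums(g)                               # each row is one class-specific distribution
})

# P(X = x): mixture over classes of products of per-variable probabilities.
cell_prob <- function(x) {
  per_class <- theta
  for (k in seq_along(x)) per_class <- per_class * phi[[k]][, x[k]]
  sum(per_class)
}

cells <- expand.grid(x1 = 1:3, x2 = 1:2)
probs <- apply(cells, 1, cell_prob)
cbind(cells, prob = probs)
sum(theta)   # mixture weights sum to 1 (up to rounding)
sum(probs)   # cell probabilities sum to 1 (up to rounding)
```

Within a class the variables are independent; dependence between variables arises only through the mixing over classes.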