julie josse
June 14, 2016

# Multinomial Logit bilinear model


## Transcript

1. Introduction MCA MultiLogit
Multiple Correspondence Analysis & the MultiLogit Model
Julie Josse - William Fithian - Patrick Groenen
Agrocampus, INRIA - Berkeley Statistics - Econometric Institute, Rotterdam
AgroParisTech, Paris, 13 June 2016
1 / 35

2. Outline
1 Introduction
2 Multiple Correspondence Analysis
3 MultiLogit model for MCA

3. Exploratory multivariate data analysis
Descriptive methods, data visualization:
• Principal Component Analysis ⇒ continuous variables
• Correspondence Analysis ⇒ contingency table
• Multiple Correspondence Analysis ⇒ categorical variables
⇒ Dimensionality reduction (describe the data with a smaller number of variables)
⇒ Geometrical approach: emphasis on graphical displays
⇒ No probabilistic framework, in line with Benzécri (1973)'s idea: "Let the data speak for themselves"

4. Underlying model?
⇒ SVD of certain matrices with specific row and column weights and metrics (used to compute the distances).
"Doing a data analysis, in good mathematics, is simply searching eigenvectors; all the science of it (the art) is just to find the right matrix to diagonalize." (Benzécri, 1973)
⇒ Specific choices of weights and metrics can be viewed as inducing specific models for the data under analysis.
⇒ Understanding the connections between exploratory multivariate methods and their cognate models (selecting the number of PCs, missing values; estimation with SVD, graphics, etc.)


6. The linear-bilinear model & PCA
⇒ The fixed-effects model (Caussinus, 1986) for $X \in \mathbb{R}^{n \times m}$:
$x_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma^2)$, with $\mu_{ij} = \beta_j + \Gamma_{ij} = \beta_j + \sum_{k=1}^{K} d_k u_{ik} v_{jk}$,
with identifiability constraint $U^T U = V^T V = I_K$, $U \in \mathbb{R}^{n \times K}$, $V \in \mathbb{R}^{m \times K}$.
⇒ Population data... (sensory analysis) - (PPCA: random effect)
⇒ MLE of $\Gamma$ amounts to the LS approximation of $Z = (I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T)X$:
SVD (ALS algorithms) $\hat\Gamma = U_K D_K V_K^T$
Fixed factor scores models (De Leeuw, 1997) - Anova: linear-bilinear models (Mandel, 1969; Denis, 1994), AMMI, biadditive models (Gabriel, 1978; Gower, 1995). Useful in Anova without replication.
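The fixed-effects estimation above can be sketched in a few lines of numpy (a minimal illustration, not the speakers' code): center the columns, then truncate the SVD at rank $K$ to get the LS/ML estimate of $\Gamma$.

```python
import numpy as np

# Minimal sketch of the fixed-effects model estimation: under
# x_ij ~ N(beta_j + sum_k d_k u_ik v_jk, sigma^2), the MLE of the bilinear
# part Gamma is the rank-K truncated SVD of the column-centered matrix Z.
rng = np.random.default_rng(0)
n, m, K = 30, 8, 2
X = rng.normal(size=(n, m))

Z = X - X.mean(axis=0)                          # Z = (I - (1/n) 11^T) X
U, d, Vt = np.linalg.svd(Z, full_matrices=False)
Gamma = U[:, :K] @ np.diag(d[:K]) @ Vt[:K, :]   # best rank-K LS approximation

# the Frobenius error is exactly the energy in the discarded singular values
err_K = np.linalg.norm(Z - Gamma)
assert np.isclose(err_K, np.sqrt((d[K:] ** 2).sum()))
```

In practice ALS iterations give the same solution and extend more easily to weights and missing values.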


8. The log-bilinear model & CA
⇒ The saturated log-linear model (Christensen, 1990; Agresti, 2013):
$\log \mu_{ij} = \alpha_i + \beta_j + \Gamma_{ij}$
⇒ The RC association model (Goodman, 1985; Gower, 2011; GAMMI):
$\log \mu_{ij} = \alpha_i + \beta_j + \sum_{k=1}^{K} d_k u_{ik} v_{jk}$
Estimation: iterative weighted least squares, steps of GLM.
⇒ CA (Greenacre, 1984): text corpora, spectral clustering on graphs:
$z_{ij} = \dfrac{x_{ij}/N - r_i c_j}{\sqrt{r_i c_j}}$, i.e. $Z = D_r^{-1/2}(X/N - rc^T)D_c^{-1/2}$
If $X$ is an adjacency matrix, $Z$ is the symmetric normalized graph Laplacian.


10. CA approximates the log-bilinear model
⇒ SVD $Z = U_K D_K V_K^T$. Standard row and column coordinates: $\tilde{U}_K = D_r^{-1/2} U_K$, $\tilde{V}_K = D_c^{-1/2} V_K$. If the low-rank approximation is good:
$\tilde{U}_K D_K \tilde{V}_K^T \approx D_r^{-1/2} Z D_c^{-1/2} = D_r^{-1}(X/N - rc^T)D_c^{-1}$  (1)
By "solving for $X$" in (1), we get the reconstruction formula:
$X/N \approx rc^T + D_r(\tilde{U}_K D_K \tilde{V}_K^T)D_c$, i.e. $\dfrac{\hat{x}_{ij}}{N} = r_i c_j \left(1 + \sum_{k=1}^{K} d_k \tilde{u}_{ik} \tilde{v}_{jk}\right)$  (2)
⇒ Connection (Escofier, 1982): when $\sum_{k=1}^{K} d_k \tilde{u}_{ik} \tilde{v}_{jk} \ll 1$, eq. (2) gives:
$\log(\hat{x}_{ij}) \approx \log(N) + \log(r_i) + \log(c_j) + \sum_{k=1}^{K} d_k \tilde{u}_{ik} \tilde{v}_{jk}$

11. Outline
1 Introduction
2 Multiple Correspondence Analysis
3 MultiLogit model for MCA

12. Data - examples
• large-scale survey datasets in the social sciences
• medical research: understanding the genetic and environmental risk factors of diseases. Ex diabetes: 300 questions (56 pages of questionnaire!) on food consumption habits, previous illnesses in the family, the presence of animals in the household, the kind of paint used in the rooms, etc.
• genetic studies: the relationship between sequences of ACGT nucleotides

13. Alcohol data
INPES (Santé publique France)

```
region sexe age year edu
Ile de France : 8120 F:29776 18_25: 6920 2005:27907 1:12684
Rhône Alpes : 5421 M:23165 26_34: 9401 2010:25034 2:23521
Provence Alpes Cote d'Azur: 4116 35_44:10899 3: 6563
Nord Pas de Calais : 3819 45_54: 9505 4:10173
Pays de Loire : 3152 55_64: 9503
Bretagne : 3038 65_+ : 6713
(Other) :25275

drunk alcohol glasses binge
0 :44237 <1/m :12889 0 : 2812 <2/m:10323
1-2 : 4952 0 : 6133 0-2:37867 0 :34345
10-19: 839 1-2/m: 7583 10+: 590 1/m : 6018
20-29: 212 1-2/w: 9526 3-4: 9486 1/w : 1881
3-5 : 1908 3-4/w: 6815 5-6: 1795 7/w : 374
30+ : 404 5-6/w: 3402 7-9: 391
6-9 : 389 7/w : 6593

  region      sexe   age  year edu drunk alcohol glasses binge
1 Rhône Alpes M 45_54 2005 1   0   0       0-2     0
2 Rhône Alpes M 45_54 2005 2   0   0       0-2     0
3 Rhône Alpes M 55_64 2005 2   0   0       0-2     0
4 Rhône Alpes M 18_25 2005 3   0   0       0-2     0
5 Rhône Alpes M 18_25 2005 2   0   0       0-2     0
6 Rhône Alpes M 26_34 2005 2   0   0       0-2     0
```

14. Coding categorical variables
X: n rows and m categorical variables - the indicator matrix $A = [A^1 | \cdots | A^m]$, $A^j \in \{0,1\}^{n \times C_j}$, with row $i$ corresponding to a dummy coding of $x_{ij}$.

$X = \begin{pmatrix} 1 & 1 \\ 2 & 3 \\ 1 & 2 \\ 2 & 3 \\ 2 & 2 \\ 2 & 2 \end{pmatrix} \Longleftrightarrow A = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 \end{pmatrix} \Longrightarrow B = A^T A = \begin{pmatrix} 2 & 0 & 1 & 1 & 0 \\ 0 & 4 & 0 & 2 & 2 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 2 & 0 & 3 & 0 \\ 0 & 2 & 0 & 0 & 2 \end{pmatrix}$

$p_j(c) = \frac{1}{n} A^j_{\cdot c}$: the $c$th normalized column margin of $A^j$, with $p = (p_1, \ldots, p_m)^T$. All row margins of $A$ are exactly $m$.

```
tab.disjonctif(don)[1:5, 22:47]
Rhône Alpes F M 18_25 26_34 35_44 45_54 55_64 65_+ 2005 2010 1 2 3 4 0 1-2 10-19 20-29 3-5 30+ 6-9 <1/m
1 1 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0
2 1 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0
3 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0
4 1 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0
5 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0
```
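The dummy coding above is mechanical; here is a small numpy sketch (not `tab.disjonctif` itself) that reproduces the slide's toy example and checks that $B = A^T A$ is the Burt matrix shown:

```python
import numpy as np

# Dummy coding of the slide's toy X (m = 2 categorical variables):
# A stacks one indicator block per variable; B = A^T A is the Burt matrix.
X = np.array([[1, 1], [2, 3], [1, 2], [2, 3], [2, 2], [2, 2]])

blocks = []
for j in range(X.shape[1]):
    levels = np.unique(X[:, j])
    blocks.append((X[:, j][:, None] == levels[None, :]).astype(int))
A = np.hstack(blocks)      # n x (C_1 + C_2) indicator matrix
B = A.T @ A                # Burt matrix: diagonal = level counts

# every row of A sums to m = 2: one chosen category per variable
assert np.all(A.sum(axis=1) == 2)
```

The diagonal of `B` gives the level counts (2, 4, 1, 3, 2) and the off-diagonal blocks give the two-way contingency tables.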

15. Multiple Correspondence Analysis
$Z_A = \frac{1}{\sqrt{mn}} (A - \mathbf{1}p^T) D_p^{-1/2}$
$\Gamma_{MCA}$ is defined from the SVD $Z_A = U_K D_K V_K^T$.
Homogeneity Analysis (Gifi, 1990; J. de Leeuw, J. Meulman), Dual Scaling (Nishisato, 1980; Guttman, 1941)
⇒ Interpreting the graphical displays where rows are represented with $F = U_K D_K$ and categories with $\tilde{V}_K = D_p^{-1/2} V_K$
Properties:
• $F_k = \arg\max_{F_k \in \mathbb{R}^n} \sum_{j=1}^{m} \eta^2(F_k, x_j)$ - counterpart of PCA
• the distances between the rows and between the columns coincide with the $\chi^2$ distances.
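The MCA construction above reduces to one SVD; a minimal numpy sketch (a hypothetical helper, not the FactoMineR implementation) under the slide's definitions:

```python
import numpy as np

# MCA via the SVD of Z_A = (1/sqrt(mn)) (A - 1 p^T) D_p^{-1/2}:
# rows are displayed with F = U_K D_K, categories with D_p^{-1/2} V_K.
def mca(A, m, K=2):
    n = A.shape[0]
    p = A.mean(axis=0)                            # p_j(c) = (1/n) A^j_.c
    Z = (A - p) / (np.sqrt(m * n) * np.sqrt(p))   # (A - 1 p^T) D_p^{-1/2} / sqrt(mn)
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    F = U[:, :K] * d[:K]                          # row coordinates
    Vtilde = Vt[:K].T / np.sqrt(p)[:, None]       # category coordinates
    return F, Vtilde, d

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(50, 4))              # m = 4 variables, 3 levels each
A = np.hstack([(X[:, j][:, None] == np.arange(3)).astype(float)
               for j in range(X.shape[1])])
F, Vtilde, d = mca(A, m=4)
```

On real data one would also guard against empty levels (which make $p_j(c) = 0$).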

16. Individuals graph
Distance between individuals:
$d^2_{i,i'} = \frac{n}{m} \sum_{j=1}^{m} \sum_{c=1}^{C_j} \frac{(A^j_{ic} - A^j_{i'c})^2}{p_j(c)}$
• two individuals differ when they do not take the same levels
• be careful, the frequencies of the levels matter!
Individuals are not interesting per se.
(Figure: MCA factor map of the individuals, Dim 1 (10.87%) vs Dim 2 (7.86%).)


18. Individuals graph
Distance between individuals:
$d^2_{i,i'} = \frac{n}{m} \sum_{j=1}^{m} \sum_{c=1}^{C_j} \frac{(A^j_{ic} - A^j_{i'c})^2}{p_j(c)}$
Individuals are not interesting per se. Using the categories for the interpretation: a category is at the barycenter of the individuals taking that category.


21. Categories graph
Distance between categories:
$d^2_{c,c'} = \sum_{i=1}^{n} \left( \frac{A_{ic}}{p(c)} - \frac{A_{ic'}}{p(c')} \right)^2$
• two levels are close if the individuals taking these levels are the same (ex: 65 years & retiree) / if the individuals take the same levels for the other variables (ex: 60 years & 65 years)
• rare levels are far away from the others

22. Supplementary variables
(Figure: supplementary categories projected on Dim 1 (10.87%) and Dim 2 (7.86%): the 22 regions, sexe (F/M), age classes (18_25 to 65_+), year (2005, 2010), and edu.1-edu.4; a second panel highlights the age classes.)

23. Supplementary variables
(Figure: MCA factor map, Dim 1 (10.87%) vs Dim 2 (7.86%), showing the categories of drunk, alcohol, glasses and binge.)

24. Reconstruction formula
Objective: best rank-$K$ approximation of $Z_A = \frac{1}{\sqrt{mn}}(A - \mathbf{1}p^T)D_p^{-1/2}$
Solution: SVD $\hat{Z}_A = U_K D_K V_K^T$.
The standard row and column coordinates are $\tilde{U}_K = U_K$ and $\tilde{V}_K = D_p^{-1/2} V_K$. If the low-rank approximation is good, we have:
$\tilde{U}_K D_K \tilde{V}_K^T \approx (A - \mathbf{1}p^T) D_p^{-1}$,  (3)
in a weighted least-squares sense. By "solving for $A$" in (3), we obtain the reconstruction formula:
$A \approx \mathbf{1}p^T + (\tilde{U}_K D_K \tilde{V}_K^T) D_p$

25. Reconstruction with MCA
$\hat{A}^j_{ic} \approx p_j(c)\left(1 + \sum_{k=1}^{K} d_k u_{ik} \tilde{v}_{jk}(c)\right)$
Category $c$ chosen by person $i$ on variable $j$ is modeled by: main effect + low-rank structure
initial data
```
  <1/m 0 1-2/m 1-2/w 3-4/w 5-6/w 7/w
1    0 1     0     0     0     0   0
2    0 1     0     0     0     0   0
3    0 1     0     0     0     0   0
4    0 1     0     0     0     0   0
5    0 1     0     0     0     0   0
```
reconstruction with 2 dimensions
```
      <1/m     0 1-2/m 1-2/w 3-4/w 5-6/w  7/w
[1,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
[2,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
[3,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
[4,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
[5,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
```
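A numpy sketch of this reconstruction on simulated data (an illustration under the slide's formulas, not the talk's code; the global $1/\sqrt{mn}$ normalization cancels, so it is dropped here):

```python
import numpy as np

# MCA reconstruction: A_hat = 1 p^T + (U_K D_K V_K^T) D_p^{1/2}, where the
# SVD is that of (A - 1 p^T) D_p^{-1/2}; entries = p_j(c) * (1 + low-rank part).
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(20, 4))                   # m = 4 variables, 3 levels
A = np.hstack([(X[:, j][:, None] == np.arange(3)).astype(float)
               for j in range(X.shape[1])])
p = A.mean(axis=0)

Z = (A - p) / np.sqrt(p)                               # (A - 1 p^T) D_p^{-1/2}
U, d, Vt = np.linalg.svd(Z, full_matrices=False)
K = 2
A_hat = p + (U[:, :K] * d[:K]) @ Vt[:K] * np.sqrt(p)   # rank-2 reconstruction

# rows of A_hat still sum to m = 4, like rows of A, but entries are fuzzy
assert np.allclose(A_hat.sum(axis=1), 4.0)
```

As on the slide, the fitted values are no longer 0/1: they behave like (possibly negative) approximate category scores around the margins $p_j(c)$.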

26. Outline
1 Introduction
2 Multiple Correspondence Analysis
3 MultiLogit model for MCA

27. Models for categorical variables
• Log-linear models - the gold standard (Christensen, 1990; Agresti, 2013)
⇒ Problems with high-dimensional data.
• Latent variable models:
• categorical: latent class models (Goodman, 1974) - unsupervised clustering with one latent variable. Nonparametric Bayesian extensions (Dunson, 2009, 2012)
• continuous: latent-trait models (Lazarsfeld, 1968) - item response theory (psychology & education, Van der Linden, 1997)
• fixed: often one latent variable, difficult to estimate.
• random: Gaussian distribution on the latent variables (Moustaki, 2000; Sanchez, 2013)
⇒ Binary data: Collins, Dasgupta & Schapire (2001), Buntine (2002), Hoff (2009), De Leeuw (2006), Li & Tao (2013)

28. Multilogit-bilinear model
$P(x_{ij} = c) = \pi_{ijc} = \dfrac{e^{\theta_{ij}(c)}}{\sum_{c'=1}^{C_j} e^{\theta_{ij}(c')}}$,
$\theta_{ij}(c) = \beta_j(c) + \Gamma^j_i(c) = \beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)$
$\tilde{v}_j(c) = (\sqrt{d_1}\, v_{j1}(c), \sqrt{d_2}\, v_{j2}(c))$: each category $c$ of question $j$ is one point, so question $j$ gets $C_j$ points.
With the latent variables $\tilde{u}_i = D^{1/2} u_i$:
$P(x_{ij} = c) \propto \exp\left(\tilde{\beta}_j(c) - \frac{1}{2}\|\tilde{v}_j(c) - \tilde{u}_i\|^2\right)$
(Figure: latent space, Dimension 1 vs Dimension 2, with the category points $v_j(1), \ldots, v_j(4)$ of one question and two individuals $u_1$, $u_2$.)
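The distance parameterization above can be made concrete in a few lines (names like `btilde` are illustrative, not from the talk): the closer an individual's latent point is to a category point, the higher that category's probability.

```python
import numpy as np

# P(x_ij = c) ∝ exp(btilde_j(c) - 0.5 * ||vtilde_j(c) - utilde_i||^2)
# for one person i and one question j with C_j = 4 categories.
def category_probs(btilde, vtilde, utilde):
    theta = btilde - 0.5 * np.sum((vtilde - utilde) ** 2, axis=1)
    e = np.exp(theta - theta.max())          # numerically stable softmax
    return e / e.sum()

btilde = np.array([0.2, -0.1, 0.0, 0.3])     # one intercept per category
vtilde = np.array([[1.0, 0.0], [0.0, 1.0],   # 4 category points in 2-D
                   [-1.0, 0.0], [0.0, -1.0]])
utilde = np.array([0.8, 0.1])                # person i in the latent space

probs = category_probs(btilde, vtilde, utilde)
assert np.isclose(probs.sum(), 1.0)          # a proper distribution over c
```

Here `utilde` is closest to the first category point, so (intercepts aside) the first category gets the largest probability.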


31. Relationship with MCA
Data: $X_{n \times m}$ - $A = [A^1 | \cdots | A^m]$, $A^j \in \{0,1\}^{n \times C_j}$
Model: $\pi_{ijc} = \dfrac{e^{\beta_j(c) + \Gamma^j_i(c)}}{\sum_{c'=1}^{C_j} e^{\beta_j(c') + \Gamma^j_i(c')}}$
Parameters: $\zeta = (\beta, \mathrm{vec}(\Gamma))$; $\zeta_0 = (\beta_0, 0)$
⇒ Rationale: Taylor expand the log-likelihood $\ell$ around the independence model $\zeta_0$:
$\tilde{\ell}(\beta, \Gamma) = \ell(\zeta_0) + \nabla\ell(\zeta_0)^T(\zeta - \zeta_0) + \frac{1}{2}(\zeta - \zeta_0)^T \nabla^2\ell(\zeta_0)(\zeta - \zeta_0)$
$\tilde{\ell}(\beta, \Gamma)$ is a quadratic function of its arguments, so maximizing it amounts to a generalized SVD ⇒ MCA.
⇒ The joint likelihood is $\prod_{i=1}^{n}\prod_{j=1}^{m}\prod_{c=1}^{C_j} \pi_{ijc}^{A^j_{ic}}$ (independence)
⇒ The log-likelihood for the MultiLogit Bilinear model is:
$\ell = \sum_{i,j,c} A^j_{ic} \log(\pi_{ijc}) = \sum_{i,j,c} A^j_{ic} \log \dfrac{\exp(\beta_j(c) + \Gamma^j_i(c))}{\sum_{c'=1}^{C_j} \exp(\beta_j(c') + \Gamma^j_i(c'))}$

32. Relationship with MCA
$\ell(\beta, \Gamma; A) = \sum_{i,j} \left[ \beta_j(A^j_i) + \Gamma^j_i(A^j_i) - \log \sum_{c=1}^{C_j} e^{\beta_j(c) + \Gamma^j_i(c)} \right]$
$\dfrac{\partial \ell}{\partial \Gamma^j_i(c)} = 1_{x_{ij}=c} - \dfrac{e^{\beta_j(c) + \Gamma^j_i(c)}}{\sum_{c'=1}^{C_j} e^{\beta_j(c') + \Gamma^j_i(c')}} = A^j_{ic} - \pi_{ijc}$  (4)
$\dfrac{\partial^2 \ell}{\partial \Gamma^j_i(c)\, \partial \Gamma^{j'}_{i'}(c')} = \begin{cases} \pi_{ijc}\pi_{ijc'} - \pi_{ijc} 1_{c=c'} & j = j',\ i = i' \\ 0 & \text{otherwise} \end{cases}$  (5)
Evaluating (4) at $\zeta_0 = (\beta_0 = \log(p), 0)$ gives $A^j_{ic} - p_j(c)$ - idem for (5).
$\tilde{\ell}(\beta, \Gamma) \approx \langle \Gamma, A - \mathbf{1}p^T \rangle - \frac{1}{2}\|\Gamma D_p^{1/2}\|_F^2$

33. Relationship with MCA
Lemma
Let $G \in \mathbb{R}^{n \times m}$, $H_1 \in \mathbb{R}^{n \times n}$, $H_2 \in \mathbb{R}^{m \times m}$, with $H_1, H_2 \succ 0$. Then the solution of
$\arg\max_{\Gamma:\ \mathrm{rank}(\Gamma) \le K}\ \langle \Gamma, G \rangle - \frac{1}{2}\|H_1 \Gamma H_2\|_F^2$
is $\Gamma^* = H_1^{-1}\, \mathrm{SVD}_K(H_1^{-1} G H_2^{-1})\, H_2^{-1}$.
Thus, using the Lemma, the maximizer of $\langle \Gamma, A - \mathbf{1}p^T \rangle - \frac{1}{2}\|\Gamma D_p^{1/2}\|_F^2$ is given by the rank-$K$ SVD of $(A - \mathbf{1}p^T)D_p^{-1/2}$, which is precisely the SVD performed in MCA.
Theorem
The one-step likelihood estimate for the MultiLogit Bilinear model with rank constraint $K$, obtained by expanding around the independence model $(\beta_0 = \log p, \Gamma_0 = 0)$, is $(\beta_0, \Gamma_{MCA})$.

34. Maximizing the likelihood
Data: $A = [A^1, \ldots, A^m]$
Model: $\pi_{ijc} = \dfrac{\exp(\theta_{ij}(c))}{\sum_{c'=1}^{C_j} \exp(\theta_{ij}(c'))}$, $\theta_{ij}(c) = \beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)$
Identification constraints: $\beta_j^T \mathbf{1} = 0$, $\mathbf{1}^T U = 0$, $U^T U = I$, $\mathbf{1}^T V_j = 0$
Maximum dimensionality: $K = \min(n - 1, \sum_{j=1}^{m} C_j - m)$
Estimation: MLE
⇒ Problem: overfitting: $\theta_{ij}(c) \to \infty$ or $\theta_{ij}(c) \to -\infty$
⇒ Solution: penalized likelihood.
$L(\beta, U, D, V) = -\sum_{i=1}^{n}\sum_{j=1}^{m}\sum_{c=1}^{C_j} A^j_{ic} \log(\pi_{ijc}) + \lambda \sum_{k=1}^{K} d_k$

35. Majorization
• Majorization (or MM) algorithms (De Leeuw & Heiser, 1977; Lange, 2004) use in each iteration a majorizing function $g(\theta, \theta_0)$.
• The current estimate $\theta_0$ is called the supporting point.
• Requirements:
1 $f(\theta_0) = g(\theta_0, \theta_0)$
2 $f(\theta) \le g(\theta, \theta_0)$
• Sandwich inequality: $f(\theta^+) \le g(\theta^+, \theta_0) \le g(\theta_0, \theta_0) = f(\theta_0)$ with $\theta^+ = \arg\min_\theta g(\theta, \theta_0)$
• Any majorization algorithm is guaranteed to descend.

36. Majorizing Function
$f_{ij}(\theta_i) = -\sum_{c=1}^{C_j} A^j_{ic} \log(\pi_{ijc}) = -\sum_{c=1}^{C_j} A^j_{ic} \log \dfrac{\exp(\theta_{ij}(c))}{\sum_{c'=1}^{C_j} \exp(\theta_{ij}(c'))}$
with $\nabla f_{ij}(\theta_i) = A_{ij} - \pi_{ij}$ and $\nabla^2 f_{ij}(\theta_i) = \mathrm{Diag}(\pi_{ij}) - \pi_{ij}\pi_{ij}^T$
Theorem
$g_{ij}(\theta_i, \theta_i^{(0)}) = f_{ij}(\theta_i^{(0)}) + (\theta_i - \theta_i^{(0)})^T \nabla f_{ij}(\theta_i^{(0)}) + \frac{1}{4}\|\theta_i - \theta_i^{(0)}\|^2$ is a majorizing function of $f_{ij}(\theta_i)$ (using De Leeuw, 2005).
Proof: show $h(\theta_i) = g_{ij}(\theta_i, \theta_i^{(0)}) - f_{ij}(\theta_i) \ge 0$:
• For $\theta_i = \theta_i^{(0)}$ we have $g_{ij}(\theta_i^{(0)}, \theta_i^{(0)}) - f_{ij}(\theta_i^{(0)}) = 0$
• At $\theta_i^{(0)}$ we have $\nabla f_{ij}(\theta_i^{(0)}) = \nabla g_{ij}(\theta_i^{(0)}, \theta_i^{(0)})$
• $\frac{1}{2}I - \nabla^2 f_{ij}(\theta_i)$ is positive semi-definite (the largest eigenvalue of $\nabla^2 f_{ij}(\theta_i)$ is smaller than 1/2)

37. Majorizing Function
The Hessian for one multinomial observation is
$H = \begin{pmatrix} \pi_1(1-\pi_1) & -\pi_1\pi_2 & -\pi_1\pi_3 & \cdots \\ -\pi_1\pi_2 & \pi_2(1-\pi_2) & -\pi_2\pi_3 & \cdots \\ \cdots & \cdots & \pi_3(1-\pi_3) & \cdots \\ \cdots & \cdots & \cdots & \cdots \end{pmatrix}$
Gerschgorin disks: every eigenvalue $\phi$ is smaller than a diagonal element plus the sum of the absolute off-diagonal values of its row (or column):
$\phi \le \pi_{ijc} - \pi_{ijc}^2 + \pi_{ijc} \sum_{c' \ne c} \pi_{ijc'}$
$\phi \le \pi_{ijc} - \pi_{ijc}^2 + \pi_{ijc} \left( \sum_{c'=1}^{C_j} \pi_{ijc'} - \pi_{ijc} \right)$
$\phi \le 2(\pi_{ijc} - \pi_{ijc}^2) = 2\pi_{ijc}(1 - \pi_{ijc})$,
and $2\pi_{ijc}(1 - \pi_{ijc})$ reaches its maximum of 1/2 at $\pi_{ijc} = 1/2$.
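The eigenvalue bound that justifies the $\frac{1}{4}\|\theta - \theta^{(0)}\|^2$ term is easy to check numerically (a sanity-check sketch, not part of the talk):

```python
import numpy as np

# Check: the largest eigenvalue of the multinomial Hessian
# Diag(pi) - pi pi^T never exceeds 1/2, for random probability vectors
# of varying length (C between 2 and 7).
rng = np.random.default_rng(2)
worst = 0.0
for _ in range(1000):
    z = rng.exponential(size=rng.integers(2, 8))
    pi = z / z.sum()                           # a random probability vector
    H = np.diag(pi) - np.outer(pi, pi)
    worst = max(worst, np.linalg.eigvalsh(H).max())

assert worst <= 0.5 + 1e-9                     # the Gerschgorin bound holds
```

The bound is tight: with $C = 2$ and $\pi = (1/2, 1/2)$ the top eigenvalue is exactly $2 \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{2}$.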


39. Updating parameters
$\theta_{ij}(c) = \beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)$
$L(\beta, U, D, V) \le \frac{1}{4} \sum_{i,j,c} (z_{ijc} - \theta_{ij}(c))^2 + \lambda \sum_{k=1}^{K} d_k + c$
$z_{ijc} = \beta_j^{(0)}(c) + \sum_k d_k^{(0)} u_{ik}^{(0)} v_{jk}^{(0)}(c) + 2(A^j_{ic} - \pi_{ijc}(\theta^{(0)}))$
• Update $\beta$: $\beta = n^{-1} Z^T \mathbf{1}$
• Update $U$ and $V$: let $(I - n^{-1}\mathbf{1}\mathbf{1}^T)Z = P\Phi Q^T$ be the SVD. Then $U = P$ and $V = Q$.
• Update $D$: minimize $\sum_{k=1}^{K} [(\phi_k - d_k)^2 + \lambda d_k]$, giving $d_k = \max(0, \phi_k - \lambda)$: a nuclear-norm penalty, so there is automatic dimension selection.
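The update cycle above can be sketched end to end (a simplified illustration under the slide's formulas, not the authors' implementation: a single categorical variable, so $A$ is one $n \times C$ block, and names like `lam` are illustrative):

```python
import numpy as np

# One MM loop: pi from current theta, working response z = theta + 2(A - pi),
# beta = column means of Z, then SVD of the centered Z with soft-thresholded
# singular values d_k = max(0, phi_k - lambda).
rng = np.random.default_rng(3)
n, C, lam = 40, 4, 0.5
A = np.eye(C)[rng.integers(0, C, size=n)]        # indicator matrix, n x C

beta = np.log(A.mean(axis=0))                    # start at the independence model
Theta = np.tile(beta, (n, 1))
for _ in range(50):
    P = np.exp(Theta - Theta.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)            # pi at the supporting point
    Z = Theta + 2.0 * (A - P)                    # working response z_ijc
    beta = Z.mean(axis=0)                        # update beta = n^{-1} Z^T 1
    U, phi, Vt = np.linalg.svd(Z - beta, full_matrices=False)
    d = np.maximum(phi - lam, 0.0)               # soft-threshold: small phi_k -> 0
    Theta = beta + (U * d) @ Vt                  # new theta = beta + U D V^T
```

Components whose singular values fall below $\lambda$ are shrunk to exactly zero, which is the automatic dimension selection mentioned on the slide.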

40. Selecting λ
⇒ 2 steps: QUT to select the rank K - shrinkage with CV
Rationale, by analogy with the Lasso:
• the Lasso is often used for screening (Bühlmann & van de Geer, 2011)
• selecting λ with CV or Stein focuses on predictive properties
• the threshold that is optimal for prediction is not optimal for selecting variables
⇒ Quantile Universal Threshold (Sardy, 2016): select the threshold at the bulk edge of what a threshold should be under the null. Guaranteed variable screening with high probability. Be careful, biased!

41. Quantile Universal Threshold
Ex PCA: $X = \mu + \varepsilon$, with $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$ → $\hat{\mu}_K = \sum_{k=1}^{K} d_k u_k v_k^T$
Soft-threshold: $\arg\min_\mu \|X - \mu\|_2^2 + \lambda \|\mu\|_*$ → $d_k \max\left(1 - \frac{\lambda}{d_k}, 0\right)$
⇒ Selecting λ to get a good estimate of the rank:
1 Generate data under the null hypothesis of no signal, $\mu = 0$
2 Compute the first singular value $d_1$
3 Repeat steps 1 and 2 1000 times
4 Use the $(1 - \alpha)$-quantile of the distribution of $d_1$ as the threshold
(Exact results: Zanella, 2009; asymptotic results from random matrix theory: Shabalin, 2013; Paul, 2007; Baik, 2006...)
⇒ Supposes σ is known!
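The four-step recipe above is a short simulation (a sketch with σ = 1 assumed known, as the slide warns; the sizes are illustrative):

```python
import numpy as np

# QUT for PCA: simulate pure-noise matrices (mu = 0), record the top
# singular value d1, and take a high quantile of its null distribution
# as the threshold lambda.
rng = np.random.default_rng(4)
n, m, reps, alpha = 50, 20, 1000, 0.05
d1 = np.array([np.linalg.svd(rng.normal(size=(n, m)), compute_uv=False)[0]
               for _ in range(reps)])
lam = np.quantile(d1, 1 - alpha)      # (1 - alpha)-quantile under the null

# singular values of the data below lam are indistinguishable from noise
```

Random matrix theory predicts the bulk edge near $\sqrt{n} + \sqrt{m}$, so `lam` lands slightly above that value.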

42. Selecting λ
⇒ 2 steps: QUT to select the rank K - shrinkage with CV
Model: $\pi_{ijc} = \dfrac{\exp(\beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c))}{\sum_{c'=1}^{C_j} \exp(\beta_j(c') + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c'))}$
Likelihood: $L(\beta, U, D, V) = -\sum_{i=1}^{n}\sum_{j=1}^{m}\sum_{c=1}^{C_j} A^j_{ic} \log(\pi_{ijc}) + \lambda \sum_{k=1}^{K} d_k$
1 Generate under the null of no interaction and take for λ the quantile of the distribution of $d_1$: good rank recovery
2 For the rank $K_{QUT}$, estimate λ with cross-validation to determine the amount of shrinkage (Lasso + LS)
k-fold CV: the λ with the best out-of-sample deviance is chosen.

43. Simulations
• n: 50, 100, 300
• m: 20, 100, 300 - 3 categories per variable
• interaction rank K: 2, 6
• ratio $d_1/d_2$: 2, 1
• the strength of the interaction (low, strong).
$\tilde{u}_i \sim \mathcal{N}_K\left(0, \begin{pmatrix} d_1 & 0 \\ 0 & d_K \end{pmatrix}\right)$, $\tilde{v}_j(c) \sim \mathcal{N}_K\left(0, \begin{pmatrix} d_1 & 0 \\ 0 & d_K \end{pmatrix}\right)$
$\theta^c_{ij} = -\frac{1}{2}\|\tilde{u}_i - \tilde{v}_j(c)\|^2$, $P(x_{ij} = c) \propto e^{\theta^c_{ij}}$

44. Simulations

```
     n   p rank ratio strength  model            MCA
 1  50  20    2     1      0.1  0.044            0.035
 2  50  20    2     1      1    0.020            0.045
 3  50  20    2     2      0.1  0.048            0.036
 4  50  20    2     2      1    0.0206           0.042
 5  50  20    6     1      0.1  0.111            0.064
 6  50  20    6     1      1    0.045            0.026
 7  50  20    6     2      0.1  0.115 (0.028)    0.071
 8  50  20    6     2      1    0.032            0.051
 9 300 100    2     1      0.1  0.005            0.006
10 300 100    2     1      1    0.004            0.042
11 300 100    2     2      0.1  0.0047           0.005
12 300 100    2     2      1    0.0037 (0.00369) 0.040
13 300 300    2     1      0.1  0.003            0.004
14 300 300    2     1      1    0.002            0.039
15 300 300    2     2      0.1  0.003            0.004
16 300 300    2     2      1    0.002            0.039
17 300 100    6     1      0.1  0.019            0.015
18 300 100    6     1      1    0.011            0.023
19 300 100    6     2      0.1  0.018 (0.010)    0.017
20 300 100    6     2      1    0.010            0.056
21 300 300    6     1      0.1  0.011            0.008
22 300 300    6     1      1    0.006            0.022
23 300 300    6     2      0.1  0.009            0.012
```

45. Simulation
(Figure: simulation results.)

46. Simulation
(Figure: simulation results.)

47. Overfitting
⇒ n = 50, p = 20, r = 6, strength = 0.1 ⇒ Alcohol data....
(Figure: deviance `essai$dev` plotted against the iteration index; axis ranges roughly 0-15,000 and 300-800.)

48. Overfitting
⇒ n = 50, p = 20, r = 6, strength = 0.1 ⇒ Alcohol data....
(Figure: deviance `essai$dev` plotted against the iteration index; axis ranges roughly 0-600 and 20,000-26,000.)

49. Conclusion
MCA can be seen as a linearized estimate of the parameters of the multinomial logit bilinear model ⇒ MCA is a proxy to estimate the model's parameters (small interaction)
• graphics
• mixed data (quantitative, qualitative) - FAMD / Multiple Factor Analysis for groups of variables
• mixtures of MCA / mixtures of PPCA
• selecting the rank with BIC?? Fixed effects, asymptotics, n - p?
• regularization in MCA to tackle overfitting issues
• missing values