
# Multinomial Logit bilinear model

June 14, 2016

## Transcript

1. ### Multiple Correspondence Analysis & the MultiLogit Model

Julie Josse (Agrocampus, AgroParisTech, INRIA) - William Fithian (Berkeley Statistics) - Patrick Groenen (Econometrics, Rotterdam). Paris, 13 June 2016
2. ### Outline

(1) Introduction; (2) Multiple Correspondence Analysis; (3) MultiLogit model for MCA
3. ### Exploratory multivariate data analysis

Descriptive methods, data visualization:
• Principal Component Analysis ⇒ continuous variables
• Correspondence Analysis ⇒ contingency tables
• Multiple Correspondence Analysis ⇒ categorical variables
⇒ Dimensionality reduction (describe the data with a smaller number of variables)
⇒ Geometrical approach: emphasis on graphical displays
⇒ No probabilistic framework, in line with Benzécri (1973)'s idea: "Let the data speak for themselves"
4. ### Underlying model?

⇒ SVD of certain matrices with specific row and column weights and metrics (used to compute the distances). "Doing a data analysis, in good mathematics, is simply searching eigenvectors; all the science of it (the art) is just to find the right matrix to diagonalize." (Benzécri, 1973)
⇒ Specific choices of weights and metrics can be viewed as inducing specific models for the data under analysis.
⇒ Understanding the connections between exploratory multivariate methods and their cognate models (selecting the number of PCs, missing values; estimation with the SVD, graphics, etc.)
5. ### The linear-bilinear model & PCA

⇒ The fixed-effects model (Caussinus, 1986) for $X \in \mathbb{R}^{n\times m}$:

$$x_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma^2), \qquad \mu_{ij} = \beta_j + \Gamma_{ij} = \beta_j + \sum_{k=1}^{K} d_k u_{ik} v_{jk},$$

with $U \in \mathbb{R}^{n\times K}$, $V \in \mathbb{R}^{m\times K}$ and identifiability constraint $U^T U = V^T V = I_K$.
⇒ Population data... (sensory analysis) - (PPCA: random effect)
6. ### The linear-bilinear model & PCA

⇒ The fixed-effects model (Caussinus, 1986) for $X \in \mathbb{R}^{n\times m}$:

$$x_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma^2), \qquad \mu_{ij} = \beta_j + \sum_{k=1}^{K} d_k u_{ik} v_{jk}$$

⇒ Population data... (sensory analysis) - (PPCA: random effect)
⇒ The MLE of $\Gamma$ amounts to the least-squares approximation of the column-centered matrix $Z = (I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T)X$: an SVD (ALS algorithms), $\hat\Gamma = U_K D_K V_K^T$, with PCA scores $F_K = U_K D_K$ and loadings $V_K$.
Fixed factor scores models (De Leeuw, 1997) - Anova: linear-bilinear models (Mandel, 1969; Denis, 1994), AMMI, biadditive models (Gabriel, 1978; Gower, 1995). Useful in Anova without replication.
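The estimation recipe above is easy to check numerically. The following is my own numpy sketch (not the authors' code): simulate the fixed-effects model, then recover $\Gamma$ as the truncated SVD of the column-centered data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, K = 30, 8, 2

# Simulate the fixed-effects model x_ij ~ N(beta_j + Gamma_ij, sigma^2)
beta = rng.normal(size=m)
G = rng.normal(size=(n, K))
G -= G.mean(axis=0)                      # make scores orthogonal to 1
U = np.linalg.qr(G)[0]
V = np.linalg.qr(rng.normal(size=(m, K)))[0]
d = np.array([5.0, 3.0])
Gamma = U @ np.diag(d) @ V.T
X = beta + Gamma + 0.1 * rng.normal(size=(n, m))

# MLE of Gamma: least-squares rank-K approximation of the
# column-centered matrix Z = (I_n - 11^T/n) X, i.e. a truncated SVD
Z = X - X.mean(axis=0)
u, s, vt = np.linalg.svd(Z, full_matrices=False)
Gamma_hat = (u[:, :K] * s[:K]) @ vt[:K]

F = u[:, :K] * s[:K]        # PCA scores F_K = U_K D_K
loadings = vt[:K].T         # PCA loadings V_K
```

With a small noise level, the rank-2 SVD of the centered data is close to the true interaction $\Gamma$.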
7. ### The log-bilinear model & CA

⇒ The saturated log-linear model (Christensen, 1990; Agresti, 2013):

$$\log \mu_{ij} = \alpha_i + \beta_j + \Gamma_{ij}$$

⇒ The RC association model (Goodman, 1985; Gower, 2011; GAMMI):

$$\log \mu_{ij} = \alpha_i + \beta_j + \sum_{k=1}^{K} d_k u_{ik} v_{jk}$$

Estimation: iterative weighted least squares, steps of GLM.
8. ### The log-bilinear model & CA

⇒ The saturated log-linear model (Christensen, 1990; Agresti, 2013):

$$\log \mu_{ij} = \alpha_i + \beta_j + \Gamma_{ij}$$

⇒ The RC association model (Goodman, 1985; Gower, 2011; GAMMI):

$$\log \mu_{ij} = \alpha_i + \beta_j + \sum_{k=1}^{K} d_k u_{ik} v_{jk}$$

Estimation: iterative weighted least squares, steps of GLM.
⇒ CA (Greenacre, 1984): text corpora, spectral clustering on graphs:

$$z_{ij} = \frac{x_{ij}/N - r_i c_j}{\sqrt{r_i c_j}}, \quad \text{i.e.} \quad Z = D_r^{-1/2}(X/N - rc^T)D_c^{-1/2}$$

If $X$ is an adjacency matrix, $Z$ is the symmetric normalized graph Laplacian.
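To make the $Z$ matrix above concrete, here is a small numpy sketch of my own, on a made-up contingency table: the squared singular values of $Z$ sum to the total inertia, which is Pearson's $\chi^2$ statistic divided by $N$.

```python
import numpy as np

# A made-up 3x3 contingency table of counts
X = np.array([[20.,  5., 10.],
              [ 3., 15.,  6.],
              [ 8.,  7., 25.]])

N = X.sum()
P = X / N                       # correspondence matrix
r = P.sum(axis=1)               # row margins
c = P.sum(axis=0)               # column margins

# Standardized residuals: Z = D_r^{-1/2} (X/N - r c^T) D_c^{-1/2}
Z = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# CA = SVD of Z; squared singular values sum to the total inertia
s = np.linalg.svd(Z, compute_uv=False)
inertia = (s ** 2).sum()
```

Note that one singular value is exactly zero: subtracting $rc^T$ removes the trivial dimension, so a 3x3 table has at most two nontrivial CA dimensions.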
9. ### CA approximates the log-bilinear model

⇒ SVD: $Z = U_K D_K V_K^T$. Standard row and column coordinates $\tilde U_K = D_r^{-1/2} U_K$, $\tilde V_K = D_c^{-1/2} V_K$. If the low-rank approximation is good:

$$\tilde U_K D_K \tilde V_K^T \approx D_r^{-1/2} Z D_c^{-1/2} = D_r^{-1}(X/N - rc^T)D_c^{-1} \quad (1)$$

By "solving for X" in (1), we get the reconstruction formula:

$$X/N \approx rc^T + D_r(\tilde U_K D_K \tilde V_K^T)D_c, \quad \text{i.e.} \quad \frac{\hat x_{ij}}{N} = r_i c_j \left(1 + \sum_{k=1}^{K} d_k \tilde u_{ik} \tilde v_{jk}\right) \quad (2)$$
10. ### CA approximates the log-bilinear model

⇒ SVD: $Z = U_K D_K V_K^T$. Standard row and column coordinates $\tilde U_K = D_r^{-1/2} U_K$, $\tilde V_K = D_c^{-1/2} V_K$. If the low-rank approximation is good:

$$\tilde U_K D_K \tilde V_K^T \approx D_r^{-1/2} Z D_c^{-1/2} = D_r^{-1}(X/N - rc^T)D_c^{-1} \quad (1)$$

By "solving for X" in (1), we get the reconstruction formula:

$$X/N \approx rc^T + D_r(\tilde U_K D_K \tilde V_K^T)D_c, \quad \text{i.e.} \quad \frac{\hat x_{ij}}{N} = r_i c_j \left(1 + \sum_{k=1}^{K} d_k \tilde u_{ik} \tilde v_{jk}\right) \quad (2)$$

⇒ Connection (Escofier, 1982): when $\sum_{k=1}^{K} d_k \tilde u_{ik} \tilde v_{jk} \ll 1$, eq. (2) gives:

$$\log(\hat x_{ij}) \approx \log(N) + \log(r_i) + \log(c_j) + \sum_{k=1}^{K} d_k \tilde u_{ik} \tilde v_{jk}$$
11. ### Outline

(1) Introduction; (2) Multiple Correspondence Analysis; (3) MultiLogit model for MCA
12. ### Data - examples

• Large-scale survey datasets in the social sciences.
• Medical research: understanding the genetic and environmental risk factors of diseases. Example, diabetes: 300 questions (56 pages of questionnaire!) on food consumption habits, previous illnesses in the family, the presence of animals in the household, the kind of paint used in the rooms, etc.
• Genetic studies: the relationships within a sequence of ACGT nucleotides.
13. ### Alcohol data INPES (Santé publique France)

Summary of the categorical variables (counts per level):

| variable | levels (counts) |
|---|---|
| region | Ile de France: 8120; Rhône Alpes: 5421; Provence Alpes Cote d'Azur: 4116; Nord Pas de Calais: 3819; Pays de Loire: 3152; Bretagne: 3038; (Other): 25275 |
| sexe | F: 29776; M: 23165 |
| age | 18_25: 6920; 26_34: 9401; 35_44: 10899; 45_54: 9505; 55_64: 9503; 65_+: 6713 |
| year | 2005: 27907; 2010: 25034 |
| edu | 1: 12684; 2: 23521; 3: 6563; 4: 10173 |
| drunk | 0: 44237; 1-2: 4952; 3-5: 1908; 6-9: 389; 10-19: 839; 20-29: 212; 30+: 404 |
| alcohol | 0: 6133; <1/m: 12889; 1-2/m: 7583; 1-2/w: 9526; 3-4/w: 6815; 5-6/w: 3402; 7/w: 6593 |
| glasses | 0: 2812; 0-2: 37867; 3-4: 9486; 5-6: 1795; 7-9: 391; 10+: 590 |
| binge | 0: 34345; <2/m: 10323; 1/m: 6018; 1/w: 1881; 7/w: 374 |

First rows of the data:

| | region | sexe | age | year | edu | drunk | alcohol | glasses | binge |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Rhône Alpes | M | 45_54 | 2005 | 1 | 0 | 0 | 0-2 | 0 |
| 2 | Rhône Alpes | M | 45_54 | 2005 | 2 | 0 | 0 | 0-2 | 0 |
| 3 | Rhône Alpes | M | 55_64 | 2005 | 2 | 0 | 0 | 0-2 | 0 |
| 4 | Rhône Alpes | M | 18_25 | 2005 | 3 | 0 | 0 | 0-2 | 0 |
| 5 | Rhône Alpes | M | 18_25 | 2005 | 2 | 0 | 0 | 0-2 | 0 |
| 6 | Rhône Alpes | M | 26_34 | 2005 | 2 | 0 | 0 | 0-2 | 0 |
14. ### Coding categorical variables

$X$: $n$ individuals and $m$ variables - the indicator matrix $A = [A_1 | \cdots | A_m]$, $A_j \in \{0,1\}^{n \times C_j}$, with row $i$ corresponding to a dummy coding of $x_{ij}$.

$$X = \begin{pmatrix} 1 & 1 \\ 2 & 3 \\ 1 & 2 \\ 2 & 3 \\ 2 & 2 \\ 2 & 2 \end{pmatrix} \iff A = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 \end{pmatrix} \implies B = A^T A = \begin{pmatrix} 2 & 0 & 1 & 1 & 0 \\ 0 & 4 & 0 & 2 & 2 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 2 & 0 & 3 & 0 \\ 0 & 2 & 0 & 0 & 2 \end{pmatrix}$$

$p_j(c) = \frac{1}{n} A^j_{\cdot c}$: the $c$-th normalized column margin of $A_j$, with $p = (p_1, \ldots, p_m)^T$. All row margins of $A$ are exactly $m$.

First rows of `tab.disjonctif(don)[1:5, 22:47]`:

| | Rhône Alpes | F | M | 18_25 | 26_34 | 35_44 | 45_54 | 55_64 | 65_+ | 2005 | 2010 | 1 | 2 | 3 | 4 | 0 | 1-2 | 10-19 | 20-29 | 3-5 | 30+ | 6-9 | <1/m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
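The dummy coding above is easy to reproduce. This is an illustrative numpy sketch of mine (the talk itself uses `tab.disjonctif` from FactoMineR in R), using the toy $X$ from the slide:

```python
import numpy as np

# The 6x2 example X from the slide (2 categorical variables)
X = np.array([[1, 1],
              [2, 3],
              [1, 2],
              [2, 3],
              [2, 2],
              [2, 2]])

def indicator(X):
    """Stack one dummy block A_j per variable: A = [A_1 | ... | A_m]."""
    blocks = []
    for j in range(X.shape[1]):
        levels = np.unique(X[:, j])
        blocks.append((X[:, j][:, None] == levels[None, :]).astype(int))
    return np.hstack(blocks)

A = indicator(X)
B = A.T @ A   # Burt matrix: all two-way cross-tabulations
```

Each row of `A` sums to $m$ (one chosen level per variable), and `B` collects the pairwise contingency tables of the variables.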
15. ### Multiple Correspondence Analysis

$$Z_A = \frac{1}{\sqrt{mn}} (A - \mathbf{1}p^T) D_p^{-1/2}$$

$\Gamma_{MCA}$ is defined from the SVD $Z_A = U_K D_K V_K^T$. Related: Homogeneity Analysis (Gifi, 1990; J. de Leeuw, J. Meulman), Dual scaling (Nishisato, 1980; Guttman, 1941).
⇒ Interpreting the graphical displays: rows are represented with $F = U_K D_K$ and categories with $\tilde V_K = D_p^{-1/2} V_K$.
Properties:
• $F_k = \arg\max_{F_k \in \mathbb{R}^n} \sum_{j=1}^{m} \eta^2(F_k, X_j)$ - counterpart of PCA
• the distances between the rows and between the columns coincide with the $\chi^2$ distances.
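A minimal numpy sketch of the MCA computation just defined (my illustration on the toy indicator matrix, not the authors' code). A standard sanity check: the total inertia $\|Z_A\|_F^2$ equals $(J - m)/m$, where $J = \sum_j C_j$ is the total number of categories.

```python
import numpy as np

# Toy indicator matrix: n = 6 individuals, m = 2 variables, J = 5 categories
A = np.array([[1, 0, 1, 0, 0],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0]], dtype=float)
n, m = A.shape[0], 2

p = A.mean(axis=0)                       # column margins p_j(c)

# Z_A = (A - 1 p^T) D_p^{-1/2} / sqrt(mn)
Z = (A - p) / (np.sqrt(m * n) * np.sqrt(p))

u, s, vt = np.linalg.svd(Z, full_matrices=False)
K = 2
scores = u[:, :K] * s[:K]                     # row coordinates F = U_K D_K
categories = vt[:K].T / np.sqrt(p)[:, None]   # category coordinates D_p^{-1/2} V_K
```

Centering within each dummy block also caps the rank at $J - m$, so the trailing singular values vanish.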
16. ### Individuals graph

Distance between individuals:

$$d^2_{i,i'} = \frac{n}{m} \sum_{j=1}^{m} \sum_{c=1}^{C_j} \frac{(A^j_{ic} - A^j_{i'c})^2}{p_j(c)}$$

• individuals differ when they don't take the same levels
• be careful, the frequencies of the levels matter!
The individuals alone are not that interesting. [Figure: MCA factor map of the individuals, Dim 1 (10.87%) x Dim 2 (7.86%)]
17. ### Individuals graph

Distance between individuals:

$$d^2_{i,i'} = \frac{n}{m} \sum_{j=1}^{m} \sum_{c=1}^{C_j} \frac{(A^j_{ic} - A^j_{i'c})^2}{p_j(c)}$$

The individuals alone are not that interesting: use the categories for the interpretation.
18. ### Individuals graph

Distance between individuals:

$$d^2_{i,i'} = \frac{n}{m} \sum_{j=1}^{m} \sum_{c=1}^{C_j} \frac{(A^j_{ic} - A^j_{i'c})^2}{p_j(c)}$$

The individuals alone are not that interesting: use the categories for the interpretation. A category is at the barycenter of the individuals taking that category.
21. ### Categories graph

Distance between categories:

$$d^2_{c,c'} = \sum_{i=1}^{n} \left( \frac{A_{ic}}{p(c)} - \frac{A_{ic'}}{p(c')} \right)^2$$

• two levels are close if the individuals taking these levels are the same (e.g. 65 years & retiree), or if those individuals take the same levels for the other variables (e.g. 60 years & 65 years)
• rare levels are far away from the others
22. ### Supplementary variables

[Figure: supplementary categories plotted on Dim 1 (10.87%) x Dim 2 (7.86%): the regions (Alsace, Aquitaine, ..., Rhône Alpes), sexe (F, M), age (18_25 to 65_+), year (2005, 2010) and edu (1-4); a second panel zooms in on the age classes.]
23. ### Supplementary variables

[Figure: MCA factor map, Dim 1 (10.87%) x Dim 2 (7.86%), with the categories of the drinking variables drunk, alcohol, glasses and binge.]
24. ### Reconstruction formula

Objective: best rank-$K$ approximation of $Z_A = \frac{1}{\sqrt{mn}} (A - \mathbf{1}p^T) D_p^{-1/2}$.
Solution: the SVD $\hat Z_A = U_K D_K V_K^T$. The standard row and column coordinates are $\tilde U_K = U_K$ and $\tilde V_K = D_p^{-1/2} V_K$. If the low-rank approximation is good, we have:

$$U_K D_K \tilde V_K^T \approx Z_A D_p^{-1/2} \propto (A - \mathbf{1}p^T) D_p^{-1}, \quad (3)$$

in a weighted least-squares sense. By "solving for A" in (3), we obtain the reconstruction formula:

$$A \approx \mathbf{1}p^T + (U_K D_K \tilde V_K^T) D_p$$
25. ### Reconstruction with MCA

$$\hat A^j_{ic} \approx p_j(c)\left(1 + \sum_{k=1}^{K} d_k u_{ik} \tilde v_{jk}(c)\right)$$

Category $c$ chosen by person $i$ on variable $j$ is modeled by: main effect + low-rank structure.

Initial data (alcohol variable):

| | <1/m | 0 | 1-2/m | 1-2/w | 3-4/w | 5-6/w | 7/w |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

Reconstruction with 2 dimensions:

| | <1/m | 0 | 1-2/m | 1-2/w | 3-4/w | 5-6/w | 7/w |
|---|---|---|---|---|---|---|---|
| 1 | 0.273 | 0.478 | 0.068 | 0.06 | 0.029 | 0.012 | 0.08 |
| 2 | 0.273 | 0.478 | 0.068 | 0.06 | 0.029 | 0.012 | 0.08 |
| 3 | 0.273 | 0.478 | 0.068 | 0.06 | 0.029 | 0.012 | 0.08 |
| 4 | 0.273 | 0.478 | 0.068 | 0.06 | 0.029 | 0.012 | 0.08 |
| 5 | 0.273 | 0.478 | 0.068 | 0.06 | 0.029 | 0.012 | 0.08 |
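The reconstruction formula can be checked directly on the toy indicator matrix from before (again my own numpy sketch): with full rank the original $A$ is recovered exactly, and each variable's block of the rank-$K$ reconstruction still sums to one per row, even though individual entries need not lie in $[0, 1]$ - one motivation for a genuine probability model.

```python
import numpy as np

A = np.array([[1, 0, 1, 0, 0],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0]], dtype=float)
n, m = A.shape[0], 2
p = A.mean(axis=0)

Z = (A - p) / (np.sqrt(m * n) * np.sqrt(p))
u, s, vt = np.linalg.svd(Z, full_matrices=False)

def reconstruct(K):
    """Rank-K MCA reconstruction: A_hat = 1p^T + sqrt(mn) U_K D_K V_K^T D_p^{1/2}."""
    low_rank = (u[:, :K] * s[:K]) @ vt[:K]
    return p + np.sqrt(m * n) * low_rank * np.sqrt(p)

A_hat = reconstruct(2)          # fuzzy, probability-like values
A_full = reconstruct(len(s))    # full rank: exact recovery
```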
26. ### Outline

(1) Introduction; (2) Multiple Correspondence Analysis; (3) MultiLogit model for MCA
27. ### Models for categorical variables

• Log-linear models - the gold standard (Christensen, 1990; Agresti, 2013) ⇒ problems with high-dimensional data.
• Latent variable models:
  • categorical latent variables: latent class models (Goodman, 1974) - unsupervised clustering with one latent variable. Nonparametric Bayesian extensions (Dunson, 2009, 2012)
  • continuous latent variables: latent-trait models (Lazarsfeld, 1968) - item response theory (psychology & education; Van der Linden, 1997)
  • fixed: often one latent variable, difficult to estimate.
  • random: Gaussian distribution on the latent variables (Moustaki, 2000; Sanchez, 2013)
⇒ Binary data: Collins, Dasgupta & Schapire (2001), Buntine (2002), Hoff (2009), De Leeuw (2006), Li & Tao (2013)
28. ### Multilogit-bilinear model

$$P(x_{ij} = c) = \pi_{ijc} = \frac{e^{\theta_{ij}(c)}}{\sum_{c'=1}^{C_j} e^{\theta_{ij}(c')}}, \qquad \theta_{ij}(c) = \beta_j(c) + \Gamma^j_i(c) = \beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)$$

With $\tilde v_j(c) = (\sqrt{d_1}\, v_{j1}(c), \sqrt{d_2}\, v_{j2}(c))$: question $j$, category $c$, one point per each of the $C_j$ categories. With latent variables $\tilde u_i = D^{1/2} u_i$:

$$P(x_{ij} = c) \propto \exp\left( \tilde\beta_j(c) - \frac{1}{2} \| \tilde v_j(c) - \tilde u_i \|^2 \right)$$

[Figure: latent space, Dimension 1 x Dimension 2, showing category points $\tilde v_j(1), \ldots, \tilde v_j(4)$ and individuals $\tilde u_1$, $\tilde u_2$.]
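The equivalence between the softmax form and the "distance in latent space" form is a one-line algebra fact: expand the squared norm, and the $\|\tilde u_i\|^2$ term is constant over categories and cancels in the normalization. A small numpy sketch with made-up numbers (mine, for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
K, C = 2, 4                      # latent dimension, number of categories

beta = rng.normal(size=C)        # main effects beta_j(c)
v = rng.normal(size=(C, K))      # scaled category points v~_j(c)
u = rng.normal(size=K)           # latent position u~_i of one individual

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

# Multilogit form: P(x_ij = c) = softmax(beta_j(c) + <u_i, v_j(c)>)
pi = softmax(beta + v @ u)

# Ideal-point form: P(x_ij = c) prop. to exp(beta~_j(c) - ||v~_j(c) - u~_i||^2 / 2),
# where beta~_j(c) = beta_j(c) + ||v~_j(c)||^2 / 2 absorbs the norm of v
beta_tilde = beta + 0.5 * (v ** 2).sum(axis=1)
pi_dist = softmax(beta_tilde - 0.5 * ((v - u) ** 2).sum(axis=1))
```

The two probability vectors coincide, which is what licenses the "individuals close to a category point are likely to choose it" reading of the maps.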
29. ### Relationship with MCA

Data: $X_{n\times m}$ - $A = [A_1 | \cdots | A_m]$, $A_j \in \{0,1\}^{n\times C_j}$
Model: $\pi_{ijc} = \frac{e^{\beta_j(c) + \Gamma^j_i(c)}}{\sum_{c'=1}^{C_j} e^{\beta_j(c') + \Gamma^j_i(c')}}$
Parameters: $\zeta = \begin{pmatrix} \beta \\ \mathrm{vec}(\Gamma) \end{pmatrix}$; independence model $\zeta_0 = \begin{pmatrix} \beta_0 \\ 0 \end{pmatrix}$
30. ### Relationship with MCA

Data: $X_{n\times m}$ - $A = [A_1 | \cdots | A_m]$, $A_j \in \{0,1\}^{n\times C_j}$
Model: $\pi_{ijc} = \frac{e^{\beta_j(c) + \Gamma^j_i(c)}}{\sum_{c'=1}^{C_j} e^{\beta_j(c') + \Gamma^j_i(c')}}$
Parameters: $\zeta = \begin{pmatrix} \beta \\ \mathrm{vec}(\Gamma) \end{pmatrix}$; independence model $\zeta_0 = \begin{pmatrix} \beta_0 \\ 0 \end{pmatrix}$
⇒ Rationale: Taylor-expand the log-likelihood $\ell$ around the independence model $\zeta_0$:

$$\tilde\ell(\beta, \Gamma) = \ell(\zeta_0) + \nabla\ell(\zeta_0)^T (\zeta - \zeta_0) + \frac{1}{2} (\zeta - \zeta_0)^T \nabla^2\ell(\zeta_0) (\zeta - \zeta_0)$$

$\tilde\ell(\beta, \Gamma)$ is a quadratic function of its arguments, so maximizing it amounts to a generalized SVD ⇒ MCA.
31. ### Relationship with MCA

Data: $X_{n\times m}$ - $A = [A_1 | \cdots | A_m]$, $A_j \in \{0,1\}^{n\times C_j}$
Model: $\pi_{ijc} = \frac{e^{\beta_j(c) + \Gamma^j_i(c)}}{\sum_{c'=1}^{C_j} e^{\beta_j(c') + \Gamma^j_i(c')}}$
Parameters: $\zeta = \begin{pmatrix} \beta \\ \mathrm{vec}(\Gamma) \end{pmatrix}$; independence model $\zeta_0 = \begin{pmatrix} \beta_0 \\ 0 \end{pmatrix}$
⇒ The joint likelihood is $\prod_{i=1}^{n} \prod_{j=1}^{m} \prod_{c=1}^{C_j} \pi_{ijc}^{A^j_{ic}}$ (independence)
⇒ The log-likelihood for the MultiLogit Bilinear model is:

$$\ell = \sum_{i,j,c} A^j_{ic} \log(\pi_{ijc}) = \sum_{i,j,c} A^j_{ic} \log \frac{\exp(\beta_j(c) + \Gamma^j_i(c))}{\sum_{c'=1}^{C_j} \exp(\beta_j(c') + \Gamma^j_i(c'))}$$
32. ### Relationship with MCA

$$\ell(\beta, \Gamma; A) = \sum_{i,j} \left[ \beta_j(x_{ij}) + \Gamma^j_i(x_{ij}) - \log \sum_{c=1}^{C_j} e^{\beta_j(c) + \Gamma^j_i(c)} \right]$$

$$\frac{\partial \ell}{\partial \Gamma^j_i(c)} = \mathbf{1}_{x_{ij}=c} - \frac{e^{\beta_j(c) + \Gamma^j_i(c)}}{\sum_{c'=1}^{C_j} e^{\beta_j(c') + \Gamma^j_i(c')}} = A^j_{ic} - \pi_{ijc} \quad (4)$$

$$\frac{\partial^2 \ell}{\partial \Gamma^j_i(c)\, \partial \Gamma^{j'}_{i'}(c')} = \begin{cases} \pi_{ijc}\pi_{ijc'} - \pi_{ijc}\mathbf{1}_{c=c'} & j = j',\ i = i' \\ 0 & \text{otherwise} \end{cases} \quad (5)$$

Evaluating (4) at $\zeta_0 = (\beta_0 = \log(p), 0)$ gives $A^j_{ic} - p_j(c)$ - idem for (5), so

$$\tilde\ell(\beta, \Gamma) \approx \langle \Gamma, A - \mathbf{1}p^T \rangle - \frac{1}{2} \| \Gamma D_p^{1/2} \|_F^2$$
33. ### Relationship with MCA

Lemma. Let $G \in \mathbb{R}^{n\times m}$, $H_1 \in \mathbb{R}^{n\times n}$, $H_2 \in \mathbb{R}^{m\times m}$, with $H_1, H_2 \succ 0$. Then

$$\operatorname*{argmax}_{\Gamma:\ \mathrm{rank}(\Gamma)\le K}\ \langle \Gamma, G \rangle - \frac{1}{2} \| H_1 \Gamma H_2 \|_F^2 \quad \text{is} \quad \Gamma^* = H_1^{-1}\, \mathrm{SVD}_K(H_1^{-1} G H_2^{-1})\, H_2^{-1}$$

Thus, using the Lemma, the maximizer of $\langle \Gamma, A - \mathbf{1}p^T \rangle - \frac{1}{2}\|\Gamma D_p^{1/2}\|_F^2$ is given by the rank-$K$ SVD of $(A - \mathbf{1}p^T) D_p^{-1/2}$, which is precisely the SVD performed in MCA.

Theorem. The one-step likelihood estimate for the MultiLogit Bilinear model with rank constraint $K$, obtained by expanding around the independence model $(\beta_0 = \log p, \Gamma_0 = 0)$, is $(\beta_0, \Gamma_{MCA})$.
34. ### Maximizing the likelihood

Data: $A = [A_1, \ldots, A_m]$
Model: $\pi_{ijc} = \frac{\exp(\theta_{ij}(c))}{\sum_{c'=1}^{C_j} \exp(\theta_{ij}(c'))}$, with $\theta_{ij}(c) = \beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)$
Identification constraints: $\beta_j^T \mathbf{1} = 0$, $\mathbf{1}^T U = 0$, $U^T U = I$, $\mathbf{1}^T V_j = 0$
Maximum dimensionality: $K = \min(n-1, \sum_{j=1}^{m} C_j - m)$
Estimation: MLE
⇒ Problem: overfitting, $\theta_{ij}(c) \to \infty$ or $\theta_{ij}(c) \to -\infty$
⇒ Solution: penalized likelihood

$$L(\beta, U, D, V) = -\sum_{i=1}^{n}\sum_{j=1}^{m}\sum_{c=1}^{C_j} A^j_{ic} \log(\pi_{ijc}) + \lambda \sum_{k=1}^{K} d_k$$
35. ### Majorization

• Majorization (or MM) algorithms (De Leeuw & Heiser, 1977; Lange, 2004) use in each iteration a majorizing function $g(\theta, \theta_0)$.
• The current estimate $\theta_0$ is called the supporting point.
• Requirements: (1) $f(\theta_0) = g(\theta_0, \theta_0)$; (2) $f(\theta) \le g(\theta, \theta_0)$.
• Sandwich inequality: $f(\theta^+) \le g(\theta^+, \theta_0) \le g(\theta_0, \theta_0) = f(\theta_0)$, with $\theta^+ = \operatorname{argmin}_\theta g(\theta, \theta_0)$.
• Any majorization algorithm is guaranteed to descend.
36. ### Majorizing function

$$f_{ij}(\theta_i) = -\sum_{c=1}^{C_j} A^j_{ic} \log(\pi_{ijc}) = -\sum_{c=1}^{C_j} A^j_{ic} \log \frac{\exp(\theta_{ij}(c))}{\sum_{c'=1}^{C_j} \exp(\theta_{ij}(c'))}$$

with $\nabla f_{ij}(\theta_i) = \pi_{ij} - A_{ij}$ and $\nabla^2 f_{ij}(\theta_i) = \mathrm{Diag}(\pi_{ij}) - \pi_{ij}\pi_{ij}^T$ (note that $f$ is the negative log-likelihood, hence convex).

Theorem. $g_{ij}(\theta_i, \theta_i^{(0)}) = f_{ij}(\theta_i^{(0)}) + (\theta_i - \theta_i^{(0)})^T \nabla f_{ij}(\theta_i^{(0)}) + \frac{1}{4} \| \theta_i - \theta_i^{(0)} \|^2$ is a majorizing function of $f_{ij}(\theta_i)$ (using De Leeuw, 2005).

Proof: show $h(\theta_i) = g_{ij}(\theta_i, \theta_i^{(0)}) - f_{ij}(\theta_i) \ge 0$:
• For $\theta_i = \theta_i^{(0)}$ we have $g_{ij}(\theta_i^{(0)}, \theta_i^{(0)}) - f_{ij}(\theta_i^{(0)}) = 0$
• At $\theta_i^{(0)}$ we have $\nabla f_{ij}(\theta_i^{(0)}) = \nabla g_{ij}(\theta_i^{(0)}, \theta_i^{(0)})$
• $\frac{1}{2} I - \nabla^2 f_{ij}(\theta_i)$ is positive semi-definite (the largest eigenvalue of $\nabla^2 f_{ij}(\theta_i)$ is smaller than $1/2$)
37. ### Majorizing function

The Hessian $\nabla^2 f_{ij}(\theta_i) = \mathrm{Diag}(\pi_{ij}) - \pi_{ij}\pi_{ij}^T$ has the form

$$H = \begin{pmatrix} \pi_1(1-\pi_1) & -\pi_1\pi_2 & -\pi_1\pi_3 & \cdots \\ -\pi_1\pi_2 & \pi_2(1-\pi_2) & -\pi_2\pi_3 & \cdots \\ -\pi_1\pi_3 & -\pi_2\pi_3 & \pi_3(1-\pi_3) & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$

Gerschgorin disks: any eigenvalue $\phi$ is at most a diagonal element plus the sum of the absolute off-diagonal values of its row (or column):

$$\phi \le \pi_{ijc}(1-\pi_{ijc}) + \pi_{ijc}\sum_{c'\ne c}\pi_{ijc'} = \pi_{ijc} - \pi_{ijc}^2 + \pi_{ijc}\left(\sum_{c'=1}^{C_j}\pi_{ijc'} - \pi_{ijc}\right) = 2\pi_{ijc}(1 - \pi_{ijc}),$$

and $2\pi_{ijc}(1 - \pi_{ijc})$ reaches its maximum of $1/2$ at $\pi_{ijc} = 1/2$.
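Both the eigenvalue bound and the majorization property can be verified numerically. The following is my own sketch for a single multinomial cell (a one-hot observation with $C = 5$ categories), not the authors' implementation; it checks the Hessian bound, that $g \ge f$ at random points, and the descent of one MM step $\theta^+ = \theta_0 - 2\nabla f(\theta_0)$ (the minimizer of the quadratic majorizer).

```python
import numpy as np

rng = np.random.default_rng(2)
C = 5
a = np.zeros(C); a[1] = 1.0                 # one-hot observation A^j_i

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def f(theta):
    """Negative multinomial log-likelihood: -a.theta + logsumexp(theta)."""
    return -a @ theta + np.log(np.exp(theta - theta.max()).sum()) + theta.max()

def grad(theta):
    return softmax(theta) - a               # grad f = pi - a

def g(theta, theta0):
    """Quadratic majorizer with curvature 1/4 (De Leeuw, 2005)."""
    d = theta - theta0
    return f(theta0) + d @ grad(theta0) + 0.25 * d @ d

theta0 = rng.normal(size=C)

# 1) Hessian Diag(pi) - pi pi^T has largest eigenvalue <= 1/2
pi0 = softmax(theta0)
H = np.diag(pi0) - np.outer(pi0, pi0)
lam_max = np.linalg.eigvalsh(H).max()

# 2) g majorizes f (checked at random points)
points = [theta0 + 3 * rng.normal(size=C) for _ in range(100)]
majorizes = all(g(t, theta0) >= f(t) - 1e-9 for t in points)

# 3) one MM step decreases f (sandwich inequality)
theta1 = theta0 - 2 * grad(theta0)
```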
39. ### Updating parameters

$$\theta_{ij}(c) = \beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)$$

$$L(\beta, U, D, V) \le \frac{1}{4} \sum_{i,j,c} (z_{ijc} - \theta_{ij}(c))^2 + \lambda \sum_{k=1}^{K} d_k + \text{const}, \qquad z_{ijc} = \beta_j^{(0)}(c) + \sum_{k} d_k^{(0)} u_{ik}^{(0)} v_{jk}^{(0)}(c) + 2\left(A^j_{ic} - \pi_{ijc}(\theta^{(0)})\right)$$

• Update β: $\beta = n^{-1} Z^T \mathbf{1}$ (column means of Z)
• Update U and V: let $(I - n^{-1}\mathbf{1}\mathbf{1}^T) Z = P \Phi Q^T$ be the SVD; then $U = P$ and $V = Q$.
• Update D: minimize $\sum_{k=1}^{K} \left[(\phi_k - d_k)^2 + \lambda d_k\right]$, giving $d_k = \max(0, \phi_k - \lambda)$.

Nuclear norm: there is automatic dimension selection.
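The D-update is a soft-thresholding of singular values, i.e. the proximal operator of the nuclear norm. A numpy sketch of mine (a stand-in working matrix, following the slide's update $d_k = \max(0, \phi_k - \lambda)$, which is the nonnegative minimizer of $\frac{1}{2}(\phi_k - d_k)^2 + \lambda d_k$):

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(8, 6))                 # stand-in for the working matrix
lam = 1.0

Zc = Z - Z.mean(axis=0)                     # column-center, as in the beta update
P, phi, Qt = np.linalg.svd(Zc, full_matrices=False)

# Soft-threshold the singular values: d_k = max(0, phi_k - lambda).
# Some d_k become exactly zero, so the rank is selected automatically.
d = np.maximum(0.0, phi - lam)
Theta = (P * d) @ Qt                        # shrunken low-rank interaction
selected_rank = int((d > 0).sum())
```

This is why the penalty both shrinks and selects dimensions: singular values below the threshold are set exactly to zero.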
40. ### Selecting λ

⇒ Two steps: QUT to select the rank K; shrinkage with CV.
Rationale, by analogy with the Lasso:
• the Lasso is often used for screening (Bühlmann & van de Geer, 2011)
• selecting λ with CV or Stein's method focuses on predictive properties
• the threshold that is optimal for prediction is not the one that is optimal for selecting variables
⇒ Quantile Universal Threshold (Sardy, 2016): select the threshold at the bulk edge of what a threshold should be under the null. Guaranteed variable screening with high probability. Be careful: biased!
41. ### Quantile Universal Threshold

Example, PCA: $X = \mu + \varepsilon$, with $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$ → $\hat\mu_K = \sum_{k=1}^{K} u_k d_k v_k^T$.
Soft-thresholding: $\operatorname{argmin}_\mu \|X - \mu\|_2^2 + \lambda \|\mu\|_*$ → singular values shrunk to $d_k \max\left(1 - \frac{\lambda}{d_k}, 0\right)$.
⇒ Selecting λ to get a good estimate of the rank:
1. Generate data under the null hypothesis of no signal, $\mu = 0$
2. Compute the first singular value $d_1$
3. Repeat steps 1 and 2 1000 times
4. Use the $(1-\alpha)$-quantile of the distribution of $d_1$ as the threshold
(Exact results: Zanella, 2009; asymptotic results from random matrix theory: Shabalin, 2013; Paul, 2007; Baik, 2006.)
⇒ Assumes σ is known!
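The QUT recipe above is easy to simulate. This is my own sketch for the Gaussian PCA case with σ = 1 (200 replicates instead of 1000, to keep it fast):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, sigma = 50, 20, 1.0
reps, alpha = 200, 0.05

# Null distribution of the first singular value: X = eps, eps_ij ~ N(0, sigma^2)
d1 = np.array([
    np.linalg.svd(sigma * rng.normal(size=(n, m)), compute_uv=False)[0]
    for _ in range(reps)
])

# QUT threshold: the (1 - alpha)-quantile of d_1 under the null
lam_qut = np.quantile(d1, 1 - alpha)
```

Any observed singular value above `lam_qut` is then unlikely under pure noise; random matrix theory places $d_1$ near $\sigma(\sqrt{n} + \sqrt{m})$, so the simulated quantile should sit close to that bulk edge.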
42. ### Selecting λ

⇒ Two steps: QUT to select the rank K; shrinkage with CV.
Model: $\pi_{ijc} = \frac{\exp\left(\beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)\right)}{\sum_{c'=1}^{C_j} \exp\left(\beta_j(c') + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c')\right)}$
Likelihood: $L(\beta, U, D, V) = -\sum_{i=1}^{n}\sum_{j=1}^{m}\sum_{c=1}^{C_j} A^j_{ic} \log(\pi_{ijc}) + \lambda \sum_{k=1}^{K} d_k$
1. Generate under the null of no interaction and take λ as the quantile of the distribution of $d_1$: good rank recovery.
2. For the rank $K_{QUT}$, estimate λ by cross-validation to determine the amount of shrinkage (Lasso + LS): k-fold CV, choosing the λ with the best out-of-sample deviance.
43. ### Simulations

• n: 50, 100, 300
• m: 20, 100, 300 - 3 categories per variable
• interaction rank K: 2, 6
• ratio $d_1/d_2$: 2, 1
• the strength of the interaction (low, strong)

$$\tilde u_i \sim \mathcal{N}_K\left(0, \mathrm{diag}(d_1, \ldots, d_K)\right), \qquad \tilde v_j(c) \sim \mathcal{N}_K\left(0, \mathrm{diag}(d_1, \ldots, d_K)\right)$$

$$\theta^c_{ij} = -\frac{1}{2} \| \tilde u_i - \tilde v_j(c) \|^2, \qquad P(x_{ij} = c) \propto e^{\theta^c_{ij}}$$
44. ### Simulations

| # | n | p | rank | ratio | strength | model | MCA |
|---|---|---|---|---|---|---|---|
| 1 | 50 | 20 | 2 | 1 | 0.1 | 0.044 | 0.035 |
| 2 | 50 | 20 | 2 | 1 | 1 | 0.020 | 0.045 |
| 3 | 50 | 20 | 2 | 2 | 0.1 | 0.048 | 0.036 |
| 4 | 50 | 20 | 2 | 2 | 1 | 0.0206 | 0.042 |
| 5 | 50 | 20 | 6 | 1 | 0.1 | 0.111 | 0.064 |
| 6 | 50 | 20 | 6 | 1 | 1 | 0.045 | 0.026 |
| 7 | 50 | 20 | 6 | 2 | 0.1 | 0.115 (0.028) | 0.071 |
| 8 | 50 | 20 | 6 | 2 | 1 | 0.032 | 0.051 |
| 9 | 300 | 100 | 2 | 1 | 0.1 | 0.005 | 0.006 |
| 10 | 300 | 100 | 2 | 1 | 1 | 0.004 | 0.042 |
| 11 | 300 | 100 | 2 | 2 | 0.1 | 0.0047 | 0.005 |
| 12 | 300 | 100 | 2 | 2 | 1 | 0.0037 (0.00369) | 0.040 |
| 13 | 300 | 300 | 2 | 1 | 0.1 | 0.003 | 0.004 |
| 14 | 300 | 300 | 2 | 1 | 1 | 0.002 | 0.039 |
| 15 | 300 | 300 | 2 | 2 | 0.1 | 0.003 | 0.004 |
| 16 | 300 | 300 | 2 | 2 | 1 | 0.002 | 0.039 |
| 17 | 300 | 100 | 6 | 1 | 0.1 | 0.019 | 0.015 |
| 18 | 300 | 100 | 6 | 1 | 1 | 0.011 | 0.023 |
| 19 | 300 | 100 | 6 | 2 | 0.1 | 0.018 (0.010) | 0.017 |
| 20 | 300 | 100 | 6 | 2 | 1 | 0.010 | 0.056 |
| 21 | 300 | 300 | 6 | 1 | 0.1 | 0.011 | 0.008 |
| 22 | 300 | 300 | 6 | 1 | 1 | 0.006 | 0.022 |
| 23 | 300 | 300 | 6 | 2 | 0.1 | 0.009 | 0.012 |

47. ### Overfitting

⇒ n = 50, p = 20, r = 6, strength = 0.1 ⇒ Alcohol data....

[Figure: deviance (`essai$dev`) against iteration index.]
48. ### Overfitting

⇒ n = 50, p = 20, r = 6, strength = 0.1 ⇒ Alcohol data....

[Figure: deviance (`essai$dev`) against iteration index.]
49. ### Conclusion

MCA can be seen as a linearized estimate of the parameters of the multinomial logit bilinear model ⇒ MCA is a proxy to estimate the model's parameters (when the interaction is small).
• graphics
• mixed data (quantitative, qualitative) - FAMD / Multiple Factor Analysis for groups of variables
• mixture of MCA / mixture of PPCA
• selecting the rank with BIC?? Fixed effects, asymptotics in n and p?
• regularization in MCA to tackle overfitting issues
• missing values