Slide 1

Slide 1 text

Multiple Correspondence Analysis & the MultiLogit Model
Julie Josse - William Fithian - Patrick Groenen
Agrocampus, INRIA - Berkeley Statistics - Econometric Institute, Rotterdam
AgroParisTech, Paris, 13 June 2016

Slide 2

Slide 2 text

Outline
1. Introduction
2. Multiple Correspondence Analysis
3. MultiLogit model for MCA

Slide 3

Slide 3 text

Exploratory multivariate data analysis

Descriptive methods, data visualization:
• Principal Component Analysis ⇒ continuous variables
• Correspondence Analysis ⇒ contingency tables
• Multiple Correspondence Analysis ⇒ categorical variables

⇒ Dimensionality reduction: describe the data with a smaller number of variables
⇒ Geometrical approach: importance given to graphical displays
⇒ No probabilistic framework, in line with Benzécri's (1973) idea: "Let the data speak for themselves"

Slide 4

Slide 4 text

Underlying model?

⇒ SVD of certain matrices with specific row and column weights and metrics (used to compute the distances).

"Doing a data analysis, in good mathematics, is simply searching eigenvectors; all the science of it (the art) is just to find the right matrix to diagonalize." (Benzécri, 1973)

⇒ Specific choices of weights and metrics can be viewed as inducing specific models for the data under analysis.

⇒ Understanding the connections between exploratory multivariate methods and their cognate models helps with selecting the number of PCs, handling missing values, estimation with the SVD, graphics, etc.

Slide 5

Slide 5 text

The linear-bilinear model & PCA

⇒ The fixed-effects model (Caussinus, 1986) for $X \in \mathbb{R}^{n \times m}$:
$$x_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma^2), \qquad \mu_{ij} = \beta_j + \Gamma_{ij} = \beta_j + \sum_{k=1}^{K} d_k u_{ik} v_{jk},$$
with identifiability constraint $U^T U = V^T V = I_K$, where $U \in \mathbb{R}^{n \times K}$ and $V \in \mathbb{R}^{m \times K}$.

⇒ Population data... (sensory analysis) - (PPCA: random effects)

Slide 6

Slide 6 text

The linear-bilinear model & PCA

⇒ The fixed-effects model (Caussinus, 1986) for $X \in \mathbb{R}^{n \times m}$:
$$x_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma^2), \qquad \mu_{ij} = \beta_j + \Gamma_{ij} = \beta_j + \sum_{k=1}^{K} d_k u_{ik} v_{jk},$$
with identifiability constraint $U^T U = V^T V = I_K$.

⇒ Population data... (sensory analysis) - (PPCA: random effects)

⇒ The MLE of $\Gamma$ amounts to the least-squares approximation of $Z = (I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T) X$: an SVD (ALS algorithms), $\hat{\Gamma} = U_K D_K V_K^T$, with PCA scores $F_K = U_K D_K$ and loadings $V_K$.

Fixed factor scores models (De Leeuw, 1997) - Anova: linear-bilinear models (Mandel, 1969; Denis, 1994), AMMI, biadditive models (Gabriel, 1978; Gower, 1995). Useful in Anova without replication.
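A minimal sketch of this estimate in base R (the function name is illustrative): column-center X, truncate the SVD at rank K, and read off the scores F_K = U_K D_K and the loadings V_K.

fixed_effects_pca <- function(X, K) {
  Z <- scale(X, center = TRUE, scale = FALSE)  # Z = (I_n - 11^T/n) X
  s <- svd(Z)
  U <- s$u[, 1:K, drop = FALSE]
  D <- diag(s$d[1:K], nrow = K)
  V <- s$v[, 1:K, drop = FALSE]
  list(scores   = U %*% D,         # F_K = U_K D_K
       loadings = V,               # V_K
       Gamma    = U %*% D %*% t(V))  # rank-K least-squares fit of Gamma
}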

Slide 7

Slide 7 text

The log-bilinear model & CA

⇒ The saturated log-linear model (Christensen, 1990; Agresti, 2013):
$$\log \mu_{ij} = \alpha_i + \beta_j + \Gamma_{ij}$$

⇒ The RC association model (Goodman, 1985; Gower, 2011; GAMMI):
$$\log \mu_{ij} = \alpha_i + \beta_j + \sum_{k=1}^{K} d_k u_{ik} v_{jk}$$

Estimation: iterative weighted least squares, steps of GLM.

Slide 8

Slide 8 text

The log-bilinear model & CA

⇒ The saturated log-linear model (Christensen, 1990; Agresti, 2013):
$$\log \mu_{ij} = \alpha_i + \beta_j + \Gamma_{ij}$$

⇒ The RC association model (Goodman, 1985; Gower, 2011; GAMMI):
$$\log \mu_{ij} = \alpha_i + \beta_j + \sum_{k=1}^{K} d_k u_{ik} v_{jk}$$

Estimation: iterative weighted least squares, steps of GLM.

⇒ CA (Greenacre, 1984): text corpora, spectral clustering on graphs:
$$z_{ij} = \frac{x_{ij}/N - r_i c_j}{\sqrt{r_i c_j}}, \quad \text{i.e.} \quad Z = D_r^{-1/2} (X/N - r c^T) D_c^{-1/2}$$
If $X$ is an adjacency matrix, $Z$ is the symmetric normalized graph Laplacian.

Slide 9

Slide 9 text

CA approximates the log-bilinear model

⇒ SVD: $Z = U_K D_K V_K^T$. Standard row and column coordinates: $\tilde{U}_K = D_r^{-1/2} U_K$, $\tilde{V}_K = D_c^{-1/2} V_K$. If the low-rank approximation is good:
$$\tilde{U}_K D_K \tilde{V}_K^T \approx D_r^{-1/2} Z D_c^{-1/2} = D_r^{-1} (X/N - r c^T) D_c^{-1} \qquad (1)$$

By "solving for X" in (1), we get the reconstruction formula:
$$X/N = r c^T + D_r (\tilde{U}_K D_K \tilde{V}_K^T) D_c, \quad \text{i.e.} \quad \frac{\hat{x}_{ij}}{N} = r_i c_j \Big(1 + \sum_{k=1}^{K} d_k \tilde{u}_{ik} \tilde{v}_{jk}\Big) \qquad (2)$$

Slide 10

Slide 10 text

CA approximates the log-bilinear model

⇒ SVD: $Z = U_K D_K V_K^T$. Standard row and column coordinates: $\tilde{U}_K = D_r^{-1/2} U_K$, $\tilde{V}_K = D_c^{-1/2} V_K$. If the low-rank approximation is good:
$$\tilde{U}_K D_K \tilde{V}_K^T \approx D_r^{-1/2} Z D_c^{-1/2} = D_r^{-1} (X/N - r c^T) D_c^{-1} \qquad (1)$$

By "solving for X" in (1), we get the reconstruction formula:
$$X/N = r c^T + D_r (\tilde{U}_K D_K \tilde{V}_K^T) D_c, \quad \text{i.e.} \quad \frac{\hat{x}_{ij}}{N} = r_i c_j \Big(1 + \sum_{k=1}^{K} d_k \tilde{u}_{ik} \tilde{v}_{jk}\Big) \qquad (2)$$

⇒ Connection (Escofier, 1982): when $\sum_{k=1}^{K} d_k \tilde{u}_{ik} \tilde{v}_{jk} \ll 1$, since $\log(1 + x) \approx x$, eq. (2) becomes
$$\log(\hat{x}_{ij}) \approx \log(N) + \log(r_i) + \log(c_j) + \sum_{k=1}^{K} d_k \tilde{u}_{ik} \tilde{v}_{jk}$$
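A minimal sketch in base R of the CA computation and the reconstruction formula (2); names are illustrative and X is a contingency table in matrix form.

ca_reconstruct <- function(X, K) {
  N <- sum(X)
  P <- X / N
  r <- rowSums(P); cc <- colSums(P)                 # margins r and c
  Z <- diag(1 / sqrt(r)) %*% (P - r %o% cc) %*% diag(1 / sqrt(cc))
  s <- svd(Z)
  U <- diag(1 / sqrt(r))  %*% s$u[, 1:K, drop = FALSE]  # standard row coords
  V <- diag(1 / sqrt(cc)) %*% s$v[, 1:K, drop = FALSE]  # standard col coords
  # formula (2): x_ij hat = N r_i c_j (1 + sum_k d_k u_ik v_jk)
  N * (r %o% cc) * (1 + U %*% diag(s$d[1:K], nrow = K) %*% t(V))
}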

Slide 11

Slide 11 text

Outline
1. Introduction
2. Multiple Correspondence Analysis
3. MultiLogit model for MCA

Slide 12

Slide 12 text

Data - examples

• Large-scale survey datasets in the social sciences
• Medical research: understanding the genetic and environmental risk factors of diseases. Example, diabetes: 300 questions (56 pages of questionnaire!) on food consumption habits, previous illnesses in the family, the presence of animals in the household, the kind of paint used in the rooms, etc.
• Genetic studies: the relationships between sequences of ACGT nucleotides

Slide 13

Slide 13 text

Alcohol data, INPES (Santé publique France)

Summary of the variables (R output):

                     region          sexe        age           year          edu
 Ile de France             : 8120   F:29776   18_25: 6920   2005:27907   1:12684
 Rhône Alpes               : 5421   M:23165   26_34: 9401   2010:25034   2:23521
 Provence Alpes Cote d'Azur: 4116             35_44:10899                3: 6563
 Nord Pas de Calais        : 3819             45_54: 9505                4:10173
 Pays de Loire             : 3152             55_64: 9503
 Bretagne                  : 3038             65_+ : 6713
 (Other)                   :25275

   drunk          alcohol         glasses       binge
 0    :44237   <1/m :12889    0    : 2812   <2/m:10323
 1-2  : 4952   0    : 6133    0-2  :37867   0   :34345
 10-19:  839   1-2/m: 7583    10+  :  590   1/m : 6018
 20-29:  212   1-2/w: 9526    3-4  : 9486   1/w : 1881
 3-5  : 1908   3-4/w: 6815    5-6  : 1795   7/w :  374
 30+  :  404   5-6/w: 3402    7-9  :  391
 6-9  :  389   7/w  : 6593

First rows of the data:

       region sexe   age year edu drunk alcohol glasses binge
1 Rhône Alpes    M 45_54 2005   1     0       0     0-2     0
2 Rhône Alpes    M 45_54 2005   2     0       0     0-2     0
3 Rhône Alpes    M 55_64 2005   2     0       0     0-2     0
4 Rhône Alpes    M 18_25 2005   3     0       0     0-2     0
5 Rhône Alpes    M 18_25 2005   2     0       0     0-2     0
6 Rhône Alpes    M 26_34 2005   2     0       0     0-2     0

Slide 14

Slide 14 text

Coding categorical variables

$X$: $n$ individuals and $m$ variables. The indicator matrix $A = [A^1 | \cdots | A^m]$, $A^j \in \{0,1\}^{n \times C_j}$, with row $i$ corresponding to a dummy coding of $x_{ij}$.

X =
 1 1
 2 3
 1 2
 2 3
 2 2
 2 2
⟺ A =
 1 0 | 1 0 0
 0 1 | 0 0 1
 1 0 | 0 1 0
 0 1 | 0 0 1
 0 1 | 0 1 0
 0 1 | 0 1 0
⟹ B = A^T A =
 2 0 1 1 0
 0 4 0 2 2
 1 0 1 0 0
 1 2 0 3 0
 0 2 0 0 2

$p_j(c) = \frac{1}{n} A^j_{\cdot c}$: the c-th normalized column margin of $A^j$, with $p = (p_1, \ldots, p_m)^T$. All row margins of $A$ are exactly $m$.

tab.disjonctif(don)[1:5, 22:47]
  Rhône Alpes F M 18_25 26_34 35_44 45_54 55_64 65_+ 2005 2010 1 2 3 4 0 1-2 10-19 20-29 3-5 30+ 6-9 <1/m
1           1 0 1     0     0     0     1     0    0    1    0 1 0 0 0 1   0     0     0   0   0   0    0
2           1 0 1     0     0     0     1     0    0    1    0 0 1 0 0 1   0     0     0   0   0   0    0
3           1 0 1     0     0     0     0     1    0    1    0 0 1 0 0 1   0     0     0   0   0   0    0
4           1 0 1     1     0     0     0     0    0    1    0 0 0 1 0 1   0     0     0   0   0   0    0
5           1 0 1     1     0     0     0     0    0    1    0 0 1 0 0 1   0     0     0   0   0   0    0
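A minimal sketch of this coding in base R for the toy X above: model.matrix with the intercept dropped builds one 0/1 column per category (FactoMineR's tab.disjonctif, shown above, does the same for a whole data frame of factors).

X <- data.frame(v1 = factor(c(1, 2, 1, 2, 2, 2)),
                v2 = factor(c(1, 3, 2, 3, 2, 2)))
A <- cbind(model.matrix(~ v1 - 1, X), model.matrix(~ v2 - 1, X))
B <- t(A) %*% A           # Burt matrix B = A^T A
p <- colSums(A) / nrow(A) # category margins p_j(c)
rowSums(A)                # every row margin equals m = 2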

Slide 15

Slide 15 text

Multiple Correspondence Analysis

Define
$$Z_A = \frac{1}{\sqrt{mn}} (A - \mathbf{1} p^T) D_p^{-1/2}$$
and take $\Gamma^{MCA}$ to be the rank-K SVD $U_K D_K V_K^T$ of $Z_A$.

Homogeneity Analysis (Gifi, 1990; J. de Leeuw, J. Meulman), Dual Scaling (Nishisato, 1980; Guttman, 1941)

⇒ Interpreting the graphical displays where rows are represented with $F = U_K D_K$ and categories with $\tilde{V}_K = D_p^{-1/2} V_K$

Properties:
• $F_k = \arg\max_{F_k \in \mathbb{R}^n} \sum_{j=1}^{m} \eta^2(F_k, X_j)$ - counterpart of PCA
• the distances between the rows and between the columns coincide with the χ² distances.
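A minimal sketch of this SVD in base R (illustrative names; FactoMineR::MCA is the packaged equivalent). A is the indicator matrix from the previous slide and m the number of variables.

mca_svd <- function(A, m, K) {
  n <- nrow(A)
  p <- colMeans(A)                                       # p_j(c) = A_{.c}/n
  Z <- sweep(A - rep(1, n) %o% p, 2, sqrt(p), "/") / sqrt(m * n)
  s <- svd(Z)
  list(F = s$u[, 1:K, drop = FALSE] %*% diag(s$d[1:K], nrow = K), # row coords
       V = sweep(s$v[, 1:K, drop = FALSE], 1, sqrt(p), "/"),      # categories
       d = s$d[1:K])
}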

Slide 16

Slide 16 text

Individuals graph

Distance between individuals:
$$d^2_{i,i'} = \frac{n}{m} \sum_{j=1}^{m} \sum_{c=1}^{C_j} \frac{(A^j_{ic} - A^j_{i'c})^2}{p_j(c)}$$

• individuals differ when they do not take the same levels
• be careful, the frequencies of the levels matter!

Individuals are not interesting in themselves.

[Figure: MCA factor map of the individuals, Dim 1 (10.87%) vs. Dim 2 (7.86%)]

Slide 17

Slide 17 text

Individuals graph

Distance between individuals:
$$d^2_{i,i'} = \frac{n}{m} \sum_{j=1}^{m} \sum_{c=1}^{C_j} \frac{(A^j_{ic} - A^j_{i'c})^2}{p_j(c)}$$

Individuals are not interesting in themselves. Use the categories for the interpretation.

Slide 18

Slide 18 text

Individuals graph

Distance between individuals:
$$d^2_{i,i'} = \frac{n}{m} \sum_{j=1}^{m} \sum_{c=1}^{C_j} \frac{(A^j_{ic} - A^j_{i'c})^2}{p_j(c)}$$

Individuals are not interesting in themselves. Use the categories for the interpretation: a category lies at the barycenter of the individuals taking that category.

Slide 21

Slide 21 text

Categories graph

Distance between categories:
$$d^2_{c,c'} = \sum_{i=1}^{n} \left( \frac{A_{ic}}{p(c)} - \frac{A_{ic'}}{p(c')} \right)^2$$

• two levels are close if the individuals taking these levels are the same (e.g., 65 years old & retiree) / if the individuals take the same levels for the other variables (e.g., 60 years old & 65 years old)
• rare levels are far away from the others

Slide 22

Slide 22 text

Supplementary variables

[Figure: two maps on Dim 1 (10.87%) vs. Dim 2 (7.86%); left: the supplementary categories of region (Alsace, Aquitaine, ..., Rhône Alpes), sexe (F, M), age (18_25, ..., 65_+), year (2005, 2010) and edu (edu.1-edu.4); right: the age categories 18_25 through 65_+.]

Slide 23

Slide 23 text

Supplementary variables

[Figure: MCA factor map, Dim 1 (10.87%) vs. Dim 2 (7.86%), showing the categories of drunk, alcohol, glasses and binge.]

Slide 24

Slide 24 text

Reconstruction formula

Objective: best rank-K approximation of $Z_A = \frac{1}{\sqrt{mn}} (A - \mathbf{1}p^T) D_p^{-1/2}$

Solution: the SVD $\hat{Z}_A = U_K D_K V_K^T$. The standard row and column coordinates are $\tilde{U}_K = U_K$ and $\tilde{V}_K = D_p^{-1/2} V_K$. If the low-rank approximation is good, we have, up to the $\frac{1}{\sqrt{mn}}$ normalization constant,
$$\tilde{U}_K D_K \tilde{V}_K^T \approx Z_A D_p^{-1/2} \propto (A - \mathbf{1}p^T) D_p^{-1}, \qquad (3)$$
in a weighted least-squares sense. By "solving for A" in (3), we obtain the reconstruction formula:
$$A \approx \mathbf{1}p^T + (\tilde{U}_K D_K \tilde{V}_K^T) D_p$$

Slide 25

Slide 25 text

Reconstruction with MCA

$$\hat{A}^j_{ic} \approx p_j(c) \Big( 1 + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c) \Big)$$

Category c chosen by person i on variable j is modeled by: main effect + low-rank structure.

Initial data:
  <1/m 0 1-2/m 1-2/w 3-4/w 5-6/w 7/w
1    0 1     0     0     0     0   0
2    0 1     0     0     0     0   0
3    0 1     0     0     0     0   0
4    0 1     0     0     0     0   0
5    0 1     0     0     0     0   0

Reconstruction with 2 dimensions:
      <1/m     0 1-2/m 1-2/w 3-4/w 5-6/w  7/w
[1,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
[2,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
[3,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
[4,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
[5,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
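A minimal sketch of this reconstruction, reusing the mca_svd() sketch from the MCA slide (names are illustrative); the sqrt(m * n) factor undoes the normalization built into Z_A, which the slide's formula leaves implicit.

mca_reconstruct <- function(A, m, K) {
  fit <- mca_svd(A, m, K)
  p <- colMeans(A)
  # A_hat = 1 p^T + sqrt(mn) (U_K D_K V~_K^T) D_p
  rep(1, nrow(A)) %o% p +
    sqrt(m * nrow(A)) * (fit$F %*% t(fit$V)) %*% diag(p)
}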

Slide 26

Slide 26 text

Outline
1. Introduction
2. Multiple Correspondence Analysis
3. MultiLogit model for MCA

Slide 27

Slide 27 text

Models for categorical variables

• Log-linear models - the gold standard (Christensen, 1990; Agresti, 2013) ⇒ problems with high-dimensional data.
• Latent variable models:
  • categorical latents: latent class models (Goodman, 1974) - unsupervised clustering with one latent variable. Nonparametric Bayesian extensions (Dunson, 2009, 2012)
  • continuous latents: latent-trait models (Lazarsfeld, 1968) - item response theory (psychology & education; Van der Linden, 1997)
  • fixed: often one latent variable, difficult to estimate.
  • random: Gaussian distribution on the latent variables (Moustaki, 2000; Sanchez, 2013)

⇒ Binary data: Collins, Dasgupta & Schapire (2001), Buntine (2002), Hoff (2009), De Leeuw (2006), Li & Tao (2013)

Slide 28

Slide 28 text

Multilogit-bilinear model

$$P(x_{ij} = c) = \pi_{ijc} = \frac{e^{\theta_{ij}(c)}}{\sum_{c'=1}^{C_j} e^{\theta_{ij}(c')}}, \qquad \theta_{ij}(c) = \beta_j(c) + \Gamma^j_i(c) = \beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)$$

With $\tilde{v}_j(c) = (\sqrt{d_1}\, v_{j1}(c), \sqrt{d_2}\, v_{j2}(c))$, question j is represented by one point per category ($C_j$ points), and the latent variables are $\tilde{u}_i = D^{1/2} u_i$:
$$P(x_{ij} = c) \propto \exp\Big( \tilde{\beta}_j(c) - \frac{1}{2} \| \tilde{v}_j(c) - \tilde{u}_i \|^2 \Big)$$

[Figure: latent space, Dimension 1 vs. Dimension 2, with category points $v_j(1), \ldots, v_j(4)$ and individuals $u_1$, $u_2$.]
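A minimal sketch in R of the model's cell probabilities for one variable j (illustrative names; beta_j, U, d and V_j are assumed given):

softmax_rows <- function(theta) {
  e <- exp(theta - apply(theta, 1, max))  # subtract row max for stability
  e / rowSums(e)
}
theta_j <- function(beta_j, U, d, V_j) {
  # beta_j: length-Cj main effects; U: n x K scores; d: K singular values;
  # V_j: Cj x K category loadings for variable j
  rep(1, nrow(U)) %o% beta_j + U %*% diag(d, nrow = length(d)) %*% t(V_j)
}
# pi_j <- softmax_rows(theta_j(beta_j, U, d, V_j))  # n x Cj probabilities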

Slide 29

Slide 29 text

Relationship with MCA

Data: $X_{n \times m}$ - $A = [A^1 | \cdots | A^m]$, $A^j \in \{0,1\}^{n \times C_j}$

Model: $\pi_{ijc} = \frac{e^{\beta_j(c) + \Gamma^j_i(c)}}{\sum_{c'=1}^{C_j} e^{\beta_j(c') + \Gamma^j_i(c')}}$

Parameters: $\zeta = \begin{pmatrix} \beta \\ \mathrm{vec}(\Gamma) \end{pmatrix}$; independence model: $\zeta_0 = \begin{pmatrix} \beta^0 \\ 0 \end{pmatrix}$

Slide 30

Slide 30 text

Relationship with MCA

Data: $X_{n \times m}$ - $A = [A^1 | \cdots | A^m]$, $A^j \in \{0,1\}^{n \times C_j}$

Model: $\pi_{ijc} = \frac{e^{\beta_j(c) + \Gamma^j_i(c)}}{\sum_{c'=1}^{C_j} e^{\beta_j(c') + \Gamma^j_i(c')}}$

Parameters: $\zeta = \begin{pmatrix} \beta \\ \mathrm{vec}(\Gamma) \end{pmatrix}$; independence model: $\zeta_0 = \begin{pmatrix} \beta^0 \\ 0 \end{pmatrix}$

⇒ Rationale: Taylor expand the log-likelihood $\ell$ around the independence model $\zeta_0$:
$$\tilde{\ell}(\beta, \Gamma) = \ell(\zeta_0) + \nabla\ell(\zeta_0)^T (\zeta - \zeta_0) + \frac{1}{2} (\zeta - \zeta_0)^T \nabla^2\ell(\zeta_0) (\zeta - \zeta_0)$$
$\tilde{\ell}(\beta, \Gamma)$ is a quadratic function of its arguments, so maximizing it amounts to a generalized SVD ⇒ MCA.

Slide 31

Slide 31 text

Relationship with MCA

Data: $X_{n \times m}$ - $A = [A^1 | \cdots | A^m]$, $A^j \in \{0,1\}^{n \times C_j}$

Model: $\pi_{ijc} = \frac{e^{\beta_j(c) + \Gamma^j_i(c)}}{\sum_{c'=1}^{C_j} e^{\beta_j(c') + \Gamma^j_i(c')}}$

Parameters: $\zeta = \begin{pmatrix} \beta \\ \mathrm{vec}(\Gamma) \end{pmatrix}$; independence model: $\zeta_0 = \begin{pmatrix} \beta^0 \\ 0 \end{pmatrix}$

⇒ Rationale: Taylor expand the log-likelihood $\ell$ around the independence model $\zeta_0$:
$$\tilde{\ell}(\beta, \Gamma) = \ell(\zeta_0) + \nabla\ell(\zeta_0)^T (\zeta - \zeta_0) + \frac{1}{2} (\zeta - \zeta_0)^T \nabla^2\ell(\zeta_0) (\zeta - \zeta_0)$$
$\tilde{\ell}(\beta, \Gamma)$ is a quadratic function of its arguments, so maximizing it amounts to a generalized SVD ⇒ MCA.

⇒ The joint likelihood is $\prod_{i=1}^{n} \prod_{j=1}^{m} \prod_{c=1}^{C_j} \pi_{ijc}^{A^j_{ic}}$ (independence)

⇒ The log-likelihood for the MultiLogit Bilinear model is:
$$\ell = \sum_{i,j,c} A^j_{ic} \log(\pi_{ijc}) = \sum_{i,j,c} A^j_{ic} \log \frac{\exp(\beta_j(c) + \Gamma^j_i(c))}{\sum_{c'=1}^{C_j} \exp(\beta_j(c') + \Gamma^j_i(c'))}$$
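A minimal sketch in R of this log-likelihood for one variable j, with a log-sum-exp for numerical stability (illustrative names; summing loglik_j over j gives the full log-likelihood):

loglik_j <- function(Aj, theta) {
  # Aj: n x Cj indicator block; theta: matching linear predictors
  mx <- apply(theta, 1, max)
  logZ <- mx + log(rowSums(exp(theta - mx)))  # log of the softmax denominator
  sum(Aj * (theta - logZ))                    # sum_c A_ic log(pi_ic)
}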

Slide 32

Slide 32 text

Relationship with MCA

$$\ell(\beta, \Gamma; A) = \sum_{i,j} \Big[ \beta_j(A^j_i) + \Gamma^j_i(A^j_i) - \log \Big( \sum_{c=1}^{C_j} e^{\beta_j(c) + \Gamma^j_i(c)} \Big) \Big]$$

$$\frac{\partial \ell}{\partial \Gamma^j_i(c)} = \mathbb{1}_{x_{ij}=c} - \frac{e^{\beta_j(c) + \Gamma^j_i(c)}}{\sum_{c'=1}^{C_j} e^{\beta_j(c') + \Gamma^j_i(c')}} = A^j_{ic} - \pi_{ijc} \qquad (4)$$

$$\frac{\partial^2 \ell}{\partial \Gamma^j_i(c)\, \partial \Gamma^{j'}_{i'}(c')} = \begin{cases} \pi_{ijc}\pi_{ijc'} - \pi_{ijc} \mathbb{1}_{c=c'} & \text{if } j = j',\ i = i' \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

Evaluating (4) at $\zeta_0 = (\beta^0 = \log(p), 0)$ gives $A^j_{ic} - p_j(c)$; idem for (5).

$$\tilde{\ell}(\beta, \Gamma) \approx \langle \Gamma,\, A - \mathbf{1}p^T \rangle - \frac{1}{2} \| \Gamma D_p^{1/2} \|_F^2$$

Slide 33

Slide 33 text

Relationship with MCA

Lemma. Let $G \in \mathbb{R}^{n \times m}$, $H_1 \in \mathbb{R}^{n \times n}$, $H_2 \in \mathbb{R}^{m \times m}$, with $H_1, H_2 \succ 0$. Then
$$\underset{\Gamma:\, \mathrm{rank}(\Gamma) \le K}{\mathrm{argmax}}\ \langle \Gamma, G \rangle - \frac{1}{2} \| H_1 \Gamma H_2 \|_F^2 \ = \ \Gamma^* = H_1^{-1}\, \mathrm{SVD}_K\big(H_1^{-1} G H_2^{-1}\big)\, H_2^{-1}$$

Thus, by the Lemma, the maximizer of $\langle \Gamma,\, A - \mathbf{1}p^T \rangle - \frac{1}{2} \| \Gamma D_p^{1/2} \|_F^2$ is given by the rank-K SVD of $(A - \mathbf{1}p^T) D_p^{-1/2}$, which is precisely the SVD performed in MCA.

Theorem. The one-step likelihood estimate for the MultiLogit Bilinear model with rank constraint K, obtained by expanding around the independence model $(\beta^0 = \log p,\ \Gamma^0 = 0)$, is $(\beta^0, \Gamma^{MCA})$.

Slide 34

Slide 34 text

Maximizing the likelihood

Data: $A = [A^1, \ldots, A^m]$

Model: $\pi_{ijc} = \frac{\exp(\theta_{ij}(c))}{\sum_{c'=1}^{C_j} \exp(\theta_{ij}(c'))}$, $\qquad \theta_{ij}(c) = \beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)$

Identification constraints: $\beta_j^T \mathbf{1} = 0$, $\mathbf{1}^T U = 0$, $U^T U = I$, $\mathbf{1}^T V_j = 0$

Maximum dimensionality: $K = \min(n - 1, \sum_{j=1}^{m} C_j - m)$

Estimation: MLE
⇒ Problem: overfitting, $\theta_{ij}(c) \to \infty$ or $\theta_{ij}(c) \to -\infty$
⇒ Solution: penalized likelihood
$$L(\beta, U, D, V) = - \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{c=1}^{C_j} A^j_{ic} \log(\pi_{ijc}) + \lambda \sum_{k=1}^{K} d_k$$

Slide 35

Slide 35 text

Majorization

• Majorization (or MM) algorithms (De Leeuw & Heiser, 1977; Lange, 2004) use in each iteration a majorizing function $g(\theta, \theta_0)$.
• The current estimate $\theta_0$ is called the supporting point.
• Requirements:
  1. $f(\theta_0) = g(\theta_0, \theta_0)$
  2. $f(\theta) \le g(\theta, \theta_0)$
• Sandwich inequality: $f(\theta^+) \le g(\theta^+, \theta_0) \le g(\theta_0, \theta_0) = f(\theta_0)$ with $\theta^+ = \mathrm{argmin}_\theta\ g(\theta, \theta_0)$
• Any majorization algorithm is guaranteed to descend.
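A minimal generic MM loop in R (illustrative names): majorize_argmin(theta0) is assumed to return argmin_theta g(theta, theta0); the sandwich inequality then guarantees f can only decrease from one iteration to the next.

mm_descent <- function(f, majorize_argmin, theta0, tol = 1e-8, maxit = 500) {
  for (it in seq_len(maxit)) {
    theta <- majorize_argmin(theta0)        # theta+ = argmin g(., theta0)
    done <- abs(f(theta0) - f(theta)) < tol
    theta0 <- theta                         # f(theta+) <= f(theta0)
    if (done) break
  }
  theta0
}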

Slide 36

Slide 36 text

Majorizing Function

$$f_{ij}(\theta_i) = - \sum_{c=1}^{C_j} A^j_{ic} \log(\pi_{ijc}) = - \sum_{c=1}^{C_j} A^j_{ic} \log \frac{\exp(\theta_{ij}(c))}{\sum_{c'=1}^{C_j} \exp(\theta_{ij}(c'))}$$

with $\nabla f_{ij}(\theta_i) = A_{ij} - \pi_{ij}$ and $\nabla^2 f_{ij}(\theta_i) = \mathrm{Diag}(\pi_{ij}) - \pi_{ij}\pi_{ij}^T$

Theorem. $g_{ij}(\theta_i, \theta_i^{(0)}) = f_{ij}(\theta_i^{(0)}) + (\theta_i - \theta_i^{(0)})^T \nabla f_{ij}(\theta_i^{(0)}) + \frac{1}{4} \| \theta_i - \theta_i^{(0)} \|^2$ is a majorizing function of $f_{ij}(\theta_i)$ (using De Leeuw, 2005).

Proof: show $h(\theta_i) = g_{ij}(\theta_i, \theta_i^{(0)}) - f_{ij}(\theta_i) \ge 0$:
• For $\theta_i = \theta_i^{(0)}$ we have $g_{ij}(\theta_i^{(0)}, \theta_i^{(0)}) - f_{ij}(\theta_i^{(0)}) = 0$
• At $\theta_i^{(0)}$ we have $\nabla f_{ij}(\theta_i^{(0)}) = \nabla g_{ij}(\theta_i^{(0)}, \theta_i^{(0)})$
• $\frac{1}{2} I - \nabla^2 f_{ij}(\theta_i)$ is positive semi-definite (the largest eigenvalue of $\nabla^2 f_{ij}(\theta_i)$ is smaller than 1/2)

Slide 37

Slide 37 text

Majorizing Function

$$f_{ij}(\theta_i) = - \sum_{c=1}^{C_j} A^j_{ic} \log(\pi_{ijc}) = - \sum_{c=1}^{C_j} A^j_{ic} \log \frac{\exp(\theta_{ij}(c))}{\sum_{c'=1}^{C_j} \exp(\theta_{ij}(c'))}$$

with $\nabla f_{ij}(\theta_i) = A_{ij} - \pi_{ij}$ and $\nabla^2 f_{ij}(\theta_i) = \mathrm{Diag}(\pi_{ij}) - \pi_{ij}\pi_{ij}^T$

Theorem. $g_{ij}(\theta_i, \theta_i^{(0)}) = f_{ij}(\theta_i^{(0)}) + (\theta_i - \theta_i^{(0)})^T \nabla f_{ij}(\theta_i^{(0)}) + \frac{1}{4} \| \theta_i - \theta_i^{(0)} \|^2$ is a majorizing function of $f_{ij}(\theta_i)$ (using De Leeuw, 2005).

$$H = \begin{pmatrix} \pi_1(1-\pi_1) & -\pi_1\pi_2 & -\pi_1\pi_3 & \cdots \\ -\pi_1\pi_2 & \pi_2(1-\pi_2) & -\pi_2\pi_3 & \cdots \\ -\pi_1\pi_3 & -\pi_2\pi_3 & \pi_3(1-\pi_3) & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$

Gerschgorin disks: any eigenvalue $\phi$ is at most a diagonal element plus the sum of the absolute off-diagonal values in its row (or column):
$$\phi \le \pi_{ijc} - \pi_{ijc}^2 + \pi_{ijc} \sum_{c' \ne c} \pi_{ijc'} = \pi_{ijc} - \pi_{ijc}^2 + \pi_{ijc} \Big( \sum_{c'=1}^{C_j} \pi_{ijc'} - \pi_{ijc} \Big) = 2(\pi_{ijc} - \pi_{ijc}^2) = 2\pi_{ijc}(1 - \pi_{ijc})$$
and $2\pi_{ijc}(1 - \pi_{ijc})$ reaches its maximum of 1/2 at $\pi_{ijc} = 1/2$.
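A quick numerical check in R (illustrative, not from the slides) that the bound holds: for random probability vectors, the largest eigenvalue of the multinomial Hessian Diag(pi) - pi pi^T never exceeds 1/2.

set.seed(1)
max(replicate(1000, {
  p <- runif(6); p <- p / sum(p)   # a random probability vector
  max(eigen(diag(p) - p %o% p, symmetric = TRUE)$values)
}))
# always <= 1/2, consistent with the Gerschgorin argument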

Slide 39

Slide 39 text

Updating parameters

$$\theta_{ij}(c) = \beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)$$

Majorization gives, up to a constant $c$,
$$L(\beta, U, D, V) \le \frac{1}{4} \sum_{i,j,c} (z_{ijc} - \theta_{ij}(c))^2 + \lambda \sum_{k=1}^{K} d_k + c, \qquad z_{ijc} = \beta^{(0)}_j(c) + \sum_k u^{(0)}_{ik} d^{(0)}_k v^{(0)}_{jk}(c) + 2\big(A^j_{ic} - \pi_{ijc}(\theta^{(0)})\big)$$

• Update β: $\beta = n^{-1} Z^T \mathbf{1}$
• Update U and V: let $(I - n^{-1}\mathbf{1}\mathbf{1}^T) Z = P \Phi Q^T$ be the SVD; set $U = P$ and $V = Q$.
• Update D: minimizing $\sum_{k=1}^{K} [(\phi_k - d_k)^2 + \lambda d_k]$ gives the soft-thresholding $d_k = \max(0, \phi_k - \lambda)$

Nuclear-norm penalty: there is automatic dimension selection.
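A minimal sketch in R of one such update (illustrative names; theta0 and pi0 are the n x sum(Cj) matrices of current linear predictors and probabilities, column-aligned with A):

mm_update <- function(A, theta0, pi0, lambda, K) {
  n <- nrow(A)
  Z <- theta0 + 2 * (A - pi0)              # working response z_ijc
  beta <- colMeans(Z)                      # beta = n^-1 Z^T 1
  s <- svd(sweep(Z, 2, beta))              # SVD of the column-centered Z
  d <- pmax(0, s$d[1:K] - lambda)          # soft-threshold singular values
  theta <- rep(1, n) %o% beta +
    s$u[, 1:K, drop = FALSE] %*% diag(d, nrow = K) %*%
    t(s$v[, 1:K, drop = FALSE])
  list(theta = theta, beta = beta, d = d)
}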

Slide 40

Slide 40 text

Selecting λ

⇒ Two steps: QUT to select the rank K, then shrinkage tuned by CV.

Rationale, by analogy with the Lasso:
• the Lasso is often used for screening (Bühlmann & van de Geer, 2011)
• selecting λ by CV or Stein-type criteria focuses on predictive properties
• the threshold that is optimal for prediction is not the one that is optimal for selecting variables

⇒ Quantile Universal Threshold (Sardy, 2016): select the threshold at the bulk edge of what a threshold should be under the null. Guaranteed variable screening with high probability. Be careful: biased!

Slide 41

Slide 41 text

Quantile Universal Threshold

Example, PCA: $X = \mu + \varepsilon$, with $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$ → $\hat{\mu}_K = \sum_{k=1}^{K} u_k d_k v_k^T$

Soft-thresholding: $\mathrm{argmin}_\mu\, \| X - \mu \|_2^2 + \lambda \| \mu \|_*$ → $d_k \max\big(1 - \frac{\lambda}{d_k}, 0\big)$

⇒ Selecting λ to get a good estimate of the rank:
1. Generate data under the null hypothesis of no signal, μ = 0
2. Compute the first singular value $d_1$
3. Repeat steps 1 and 2, e.g. 1000 times
4. Use the (1 − α)-quantile of the distribution of $d_1$ as the threshold

(Exact results: Zanella, 2009; asymptotic results from random matrix theory: Shabalin, 2013; Paul, 2007; Baik, 2006...)

⇒ Assumes σ is known!
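A minimal sketch in R of this null-simulation recipe (the function name is illustrative; as the slide notes, sigma is assumed known):

qut_pca <- function(n, p, sigma = 1, alpha = 0.05, B = 1000) {
  # largest singular value of B pure-noise matrices
  d1 <- replicate(B, svd(matrix(rnorm(n * p, sd = sigma), n, p))$d[1])
  quantile(d1, 1 - alpha)   # threshold lambda at the bulk edge
}
# e.g. lambda <- qut_pca(50, 20)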

Slide 42

Slide 42 text

Selecting λ

⇒ Two steps: QUT to select the rank K, then shrinkage tuned by CV.

Model: $\pi_{ijc} = \frac{\exp\big(\beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)\big)}{\sum_{c'=1}^{C_j} \exp\big(\beta_j(c') + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c')\big)}$

Penalized likelihood: $L(\beta, U, D, V) = - \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{c=1}^{C_j} A^j_{ic} \log(\pi_{ijc}) + \lambda \sum_{k=1}^{K} d_k$

1. Generate under the null of no interaction and take as λ a quantile of the distribution of $d_1$: good rank recovery
2. For the rank $K_{QUT}$, estimate λ by cross-validation to determine the amount of shrinkage (Lasso + LS): k-fold CV, choosing the λ with the best out-of-sample deviance.

Slide 43

Slide 43 text

Simulations

• n: 50, 100, 300
• m: 20, 100, 300 - 3 categories per variable
• interaction rank K: 2, 6
• ratio $d_1/d_2$: 2, 1
• strength of the interaction: low, strong

Generative model (see the sketch below):
$$\tilde{u}_i \sim \mathcal{N}_K\big(0, \mathrm{diag}(d_1, \ldots, d_K)\big), \qquad \tilde{v}_j(c) \sim \mathcal{N}_K\big(0, \mathrm{diag}(d_1, \ldots, d_K)\big)$$
$$\theta^c_{ij} = -\frac{1}{2} \| \tilde{u}_i - \tilde{v}_j(c) \|^2, \qquad P(x_{ij} = c) \propto e^{\theta^c_{ij}}$$
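A minimal sketch in R of this generative design for one setting (the function name is illustrative; d holds the latent variances d_1, ..., d_K):

simulate_mca_data <- function(n, m, K = 2, Cj = 3, d = c(2, 1)) {
  # latent positions; column k has variance d_k
  U <- matrix(rnorm(n * K, sd = sqrt(d)), n, K, byrow = TRUE)
  X <- matrix(NA_integer_, n, m)
  for (j in 1:m) {
    V <- matrix(rnorm(Cj * K, sd = sqrt(d)), Cj, K, byrow = TRUE)
    theta <- -0.5 * outer(1:n, 1:Cj, Vectorize(function(i, c)
      sum((U[i, ] - V[c, ])^2)))          # theta_ij(c) = -||u_i - v_j(c)||^2/2
    pr <- exp(theta) / rowSums(exp(theta))  # softmax over categories
    X[, j] <- apply(pr, 1, function(p) sample(Cj, 1, prob = p))
  }
  X
}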

Slide 44

Slide 44 text

Simulations

     n   p rank ratio strength  model             MCA
 1  50  20    2     1      0.1  0.044             0.035
 2  50  20    2     1      1    0.020             0.045
 3  50  20    2     2      0.1  0.048             0.036
 4  50  20    2     2      1    0.0206            0.042
 5  50  20    6     1      0.1  0.111             0.064
 6  50  20    6     1      1    0.045             0.026
 7  50  20    6     2      0.1  0.115 (0.028)     0.071
 8  50  20    6     2      1    0.032             0.051
 9 300 100    2     1      0.1  0.005             0.006
10 300 100    2     1      1    0.004             0.042
11 300 100    2     2      0.1  0.0047            0.005
12 300 100    2     2      1    0.0037 (0.00369)  0.040
13 300 300    2     1      0.1  0.003             0.004
14 300 300    2     1      1    0.002             0.039
15 300 300    2     2      0.1  0.003             0.004
16 300 300    2     2      1    0.002             0.039
17 300 100    6     1      0.1  0.019             0.015
18 300 100    6     1      1    0.011             0.023
19 300 100    6     2      0.1  0.018 (0.010)     0.017
20 300 100    6     2      1    0.010             0.056
21 300 300    6     1      0.1  0.011             0.008
22 300 300    6     1      1    0.006             0.022
23 300 300    6     2      0.1  0.009             0.012

Slide 45

Slide 45 text

Simulation

[Figure: simulation results]

Slide 46

Slide 46 text

Simulation

[Figure: simulation results]

Slide 47

Slide 47 text

Overfitting

⇒ n = 50, p = 20, r = 6, strength = 0.1
⇒ Alcohol data...

[Figure: deviance (essai$dev) vs. iteration index; x axis 0-15000, y axis 300-800]

Slide 48

Slide 48 text

Overfitting

⇒ n = 50, p = 20, r = 6, strength = 0.1
⇒ Alcohol data...

[Figure: deviance (essai$dev) vs. iteration index; x axis 0-600, y axis 20000-26000]

Slide 49

Slide 49 text

Conclusion

MCA can be seen as a linearized estimate of the parameters of the multinomial logit bilinear model
⇒ MCA is a proxy for estimating the model's parameters (small interactions)

• graphics
• mixed data (quantitative, qualitative) - FAMD / Multiple Factor Analysis for groups of variables
• mixtures of MCA / mixtures of PPCA
• selecting the rank with BIC?? Fixed effects, asymptotics, n - p?
• regularization in MCA to tackle overfitting issues
• missing values