Multinomial Logit bilinear model

Julie Josse

June 14, 2016

Transcript

  1. Introduction MCA MultiLogit Multiple Correspondence Analysis & the MultiLogit Model

    Julie Josse - William Fithian - Patrick Groenen
    Agrocampus, INRIA - Berkeley Statistics - Econometric Rotterdam
    AgroParisTech, Paris, 13 June 2016
    1 / 35
  2. Outline

    1 Introduction
    2 Multiple Correspondence Analysis
    3 MultiLogit model for MCA
  3. Exploratory multivariate data analysis

    Descriptive methods, data visualization:
    • Principal Component Analysis ⇒ continuous variables
    • Correspondence Analysis ⇒ contingency table
    • Multiple Correspondence Analysis ⇒ categorical variables
    ⇒ Dimensionality reduction (describe the data with a smaller number of variables)
    ⇒ Geometrical approach: emphasis on graphical displays
    ⇒ No probabilistic framework, in line with Benzecri (1973)’s idea: “Let the data speak for themselves”
  4. Underlying model?

    ⇒ SVD of certain matrices with specific row and column weights and metrics (used to compute the distances).
    “Doing a data analysis, in good mathematics, is simply searching eigenvectors, all the science of it (the art) is just to find the right matrix to diagonalize.” (Benzecri, 1973)
    ⇒ Specific choices of weights and metrics can be viewed as inducing specific models for the data under analysis.
    ⇒ Understanding the connections between exploratory multivariate methods and their cognate models (selecting the number of PCs, missing values; estimation with SVD, graphics, etc.)
  6. The linear-bilinear model & PCA

    ⇒ The fixed-effects model (Caussinus, 1986) for $X \in \mathbb{R}^{n \times m}$:
    $x_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma^2)$, with $\mu_{ij} = \beta_j + \Gamma_{ij} = \beta_j + \sum_{k=1}^{K} d_k u_{ik} v_{jk}$,
    with identifiability constraint $U^T U = V^T V = I_K$, where $U$ is $n \times K$ and $V$ is $m \times K$.
    ⇒ Population data... (sensory analysis) - (PPCA: random effect)
    ⇒ The MLE of $\Gamma$ amounts to the least-squares approximation of $Z = (I_n - \frac{1}{n} 1 1^T) X$: SVD (ALS algorithms)
    $\hat{\Gamma} = U_K D_K V_K^T$, with PCA scores $F_K = U_K D_K$ and loadings $V_K$
    Fixed factor scores models (De Leeuw, 1997) - Anova: linear-bilinear models (Mandel, 1969; Denis, 1994), AMMI, biadditive models (Gabriel, 1978; Gower, 1995). Useful in Anova without replication.
  8. The log-bilinear model & CA

    ⇒ The saturated log-linear model (Christensen, 1990; Agresti, 2013):
    $\log \mu_{ij} = \alpha_i + \beta_j + \Gamma_{ij}$
    ⇒ The RC association model (Goodman, 1985; Gower, 2011; GAMMI):
    $\log \mu_{ij} = \alpha_i + \beta_j + \sum_{k=1}^{K} d_k u_{ik} v_{jk}$
    Estimation: iterative weighted least squares, steps of GLM.
    ⇒ CA (Greenacre, 1984), text corpora, spectral clustering on graphs:
    $z_{ij} = \frac{x_{ij}/N - r_i c_j}{\sqrt{r_i c_j}}$, i.e. $Z = D_r^{-1/2} (X/N - r c^T) D_c^{-1/2}$
    If $X$ is an adjacency matrix, $Z$ is the symmetric normalized graph Laplacian.
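The CA matrix Z above can be computed directly from a contingency table. The sketch below is only an illustration: the function name and the two toy tables are made up for the example, and only the Python standard library is used.

```python
import math

def ca_standardized_residuals(table):
    """Return Z with z_ij = (x_ij / N - r_i c_j) / sqrt(r_i c_j)."""
    total = sum(sum(row) for row in table)
    r = [sum(row) / total for row in table]        # row masses r_i
    c = [sum(col) / total for col in zip(*table)]  # column masses c_j
    return [[(x / total - ri * cj) / math.sqrt(ri * cj)
             for x, cj in zip(row, c)]
            for row, ri in zip(table, r)]

# Hypothetical toy tables: one exactly independent, one with association.
independent = [[10, 20], [30, 60]]
associated = [[30, 10], [10, 30]]

Z_ind = ca_standardized_residuals(independent)
Z_assoc = ca_standardized_residuals(associated)

# Sum of squared entries of Z (the total inertia); N times it is the Pearson chi-square.
total_inertia = sum(z * z for row in Z_assoc for z in row)
```

Under exact independence every entry of Z is zero, so the SVD of Z only picks up departures from independence.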
  10. CA approximates the log-bilinear model

    ⇒ SVD: $Z = U_K D_K V_K^T$. Standard row and column coordinates: $\tilde{U}_K = D_r^{-1/2} U_K$, $\tilde{V}_K = D_c^{-1/2} V_K$.
    If the low-rank approximation is good:
    $\tilde{U}_K D_K \tilde{V}_K^T \approx D_r^{-1/2} Z D_c^{-1/2} = D_r^{-1} (X/N - r c^T) D_c^{-1}$ (1)
    By “solving for X” in (1), we get the reconstruction formula:
    $X/N = r c^T + D_r (\tilde{U}_K D_K \tilde{V}_K^T) D_c$, i.e. $\frac{\hat{x}_{ij}}{N} = r_i c_j \left(1 + \sum_{k=1}^{K} d_k \tilde{u}_{ik} \tilde{v}_{jk}\right)$ (2)
    ⇒ Connection (Escofier, 1982): when $\sum_{k=1}^{K} d_k \tilde{u}_{ik} \tilde{v}_{jk} \ll 1$, eq. (2) gives:
    $\log(\hat{x}_{ij}) \approx \log(N) + \log(r_i) + \log(c_j) + \sum_{k=1}^{K} d_k \tilde{u}_{ik} \tilde{v}_{jk}$
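The step from the reconstruction formula to the log-bilinear form is a first-order expansion of the logarithm. A sketch of it, writing $\tilde{u}_{ik}$, $\tilde{v}_{jk}$ for the standard coordinates and $\varepsilon_{ij}$ for the interaction term (a symbol introduced here for convenience):

```latex
\begin{align*}
\log \hat{x}_{ij}
  &= \log N + \log r_i + \log c_j + \log\left(1 + \varepsilon_{ij}\right),
  \qquad \varepsilon_{ij} = \sum_{k=1}^{K} d_k \tilde{u}_{ik} \tilde{v}_{jk}, \\
  &= \log N + \log r_i + \log c_j + \varepsilon_{ij} + O(\varepsilon_{ij}^2),
  \qquad \text{since } \log(1 + \varepsilon) = \varepsilon - \tfrac{\varepsilon^2}{2} + \cdots
\end{align*}
```

The approximation is therefore accurate only when the interaction is small relative to the independence part $r_i c_j$.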
  11. Outline

    1 Introduction
    2 Multiple Correspondence Analysis
    3 MultiLogit model for MCA
  12. Data - examples

    • large-scale survey datasets in the social sciences
    • medical research: understand the genetic and environmental risk factors of diseases. Ex. diabetes: 300 questions (56 pages of questionnaire!) on food consumption habits, previous illnesses in the family, the presence of animals in the household, the kind of paint used in the rooms, etc.
    • genetic study: the relationship between a sequence of ACGT nucleotides
  13. Alcohol data

    INPES (Santé publique France)

                           region      sexe          age           year         edu
    Ile de France             : 8120   F:29776   18_25: 6920   2005:27907   1:12684
    Rhône Alpes               : 5421   M:23165   26_34: 9401   2010:25034   2:23521
    Provence Alpes Cote d’Azur: 4116             35_44:10899                3: 6563
    Nord Pas de Calais        : 3819             45_54: 9505                4:10173
    Pays de Loire             : 3152             55_64: 9503
    Bretagne                  : 3038             65_+ : 6713
    (Other)                   :25275

       drunk         alcohol       glasses       binge
    0    :44237   <1/m :12889   0  : 2812   <2/m:10323
    1-2  : 4952   0    : 6133   0-2:37867   0   :34345
    10-19:  839   1-2/m: 7583   10+:  590   1/m : 6018
    20-29:  212   1-2/w: 9526   3-4: 9486   1/w : 1881
    3-5  : 1908   3-4/w: 6815   5-6: 1795   7/w :  374
    30+  :  404   5-6/w: 3402   7-9:  391
    6-9  :  389   7/w  : 6593

           region sexe   age year edu drunk alcohol glasses binge
    1 Rhône Alpes    M 45_54 2005   1     0       0     0-2     0
    2 Rhône Alpes    M 45_54 2005   2     0       0     0-2     0
    3 Rhône Alpes    M 55_64 2005   2     0       0     0-2     0
    4 Rhône Alpes    M 18_25 2005   3     0       0     0-2     0
    5 Rhône Alpes    M 18_25 2005   2     0       0     0-2     0
    6 Rhône Alpes    M 26_34 2005   2     0       0     0-2     0
  14. Coding categorical variables

    X: n individuals and m variables - the indicator matrix $A = [A^1 | \cdots | A^m]$, $A^j \in \{0, 1\}^{n \times C_j}$, with row $i$ corresponding to a dummy coding of $x_{ij}$.

    $X = \begin{pmatrix} 1 & 1 \\ 2 & 3 \\ 1 & 2 \\ 2 & 3 \\ 2 & 2 \\ 2 & 2 \end{pmatrix} \iff A = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 \end{pmatrix} \implies B = \begin{pmatrix} 2 & 0 & 1 & 1 & 0 \\ 0 & 4 & 0 & 2 & 2 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 2 & 0 & 3 & 0 \\ 0 & 2 & 0 & 0 & 2 \end{pmatrix}$

    $p_j(c) = \frac{1}{n} A^j_{\cdot c}$: the $c$th normalized column margin of $A^j$, with $p = (p_1, \ldots, p_m)^T$. All row margins of $A$ are exactly $m$.

    tab.disjonctif(don)[1:5, 22:47]
      Rhône Alpes F M 18_25 26_34 35_44 45_54 55_64 65_+ 2005 2010 1 2 3 4 0 1-2 10-19 20-29 3-5 30+ 6-9 <1/m
    1           1 0 1     0     0     0     1     0    0    1    0 1 0 0 0 1   0     0     0   0   0   0    0
    2           1 0 1     0     0     0     1     0    0    1    0 0 1 0 0 1   0     0     0   0   0   0    0
    3           1 0 1     0     0     0     0     1    0    1    0 0 1 0 0 1   0     0     0   0   0   0    0
    4           1 0 1     1     0     0     0     0    0    1    0 0 0 1 0 1   0     0     0   0   0   0    0
    5           1 0 1     1     0     0     0     0    0    1    0 0 1 0 0 1   0     0     0   0   0   0    0
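The small example on this slide can be reproduced in a few lines. This is a minimal standard-library sketch, not the FactoMineR implementation behind `tab.disjonctif`; the helper name `indicator_matrix` is made up, and categories are assumed to be labelled 1..C_j. For this example, B coincides with the cross-product $A^T A$.

```python
def indicator_matrix(X):
    """Dummy-code each column of X (categories labelled 1..C_j), side by side."""
    n_levels = [max(col) for col in zip(*X)]  # C_j for each variable
    A = []
    for row in X:
        coded = []
        for x, C in zip(row, n_levels):
            coded.extend(1 if x == c else 0 for c in range(1, C + 1))
        A.append(coded)
    return A

# The 6 x 2 data matrix from the slide.
X = [[1, 1], [2, 3], [1, 2], [2, 3], [2, 2], [2, 2]]
n, m = len(X), len(X[0])

A = indicator_matrix(X)

# Cross-product of the indicator columns: B = A^T A.
columns = list(zip(*A))
B = [[sum(a * b for a, b in zip(c1, c2)) for c2 in columns] for c1 in columns]

# Normalized column margins p_j(c) and row margins of A.
p = [sum(col) / n for col in columns]
row_margins = [sum(row) for row in A]
```

Each row of A sums to m because every individual takes exactly one category per variable.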
  15. Multiple Correspondence Analysis

    $Z_A = \frac{1}{\sqrt{mn}} (A - 1 p^T) D_p^{-1/2}$
    Define $\Gamma^{MCA}$ to be the rank-$K$ SVD $U_K D_K V_K^T$ of $Z_A$.
    Homogeneity Analysis (Gifi, 1990; J. de Leeuw, J. Meulman), Dual scaling (Nishisato, 1980; Guttman, 1941)
    ⇒ Interpreting the graphical displays, where rows are represented with $F = U_K D_K$ and categories with $\tilde{V}_K = D_p^{-1/2} V_K$
    Properties:
    • $F_k = \arg\max_{F_k \in \mathbb{R}^n} \sum_{j=1}^{m} \eta^2(F_k, X_j)$ - counterpart of PCA
    • the distances between the rows and between the columns coincide with the $\chi^2$ distances.
  16. Individuals graph

    Distance between individuals: $d^2_{i,i'} = \frac{n}{m} \sum_{j=1}^{m} \sum_{c=1}^{C_j} \frac{(A^j_{ic} - A^j_{i'c})^2}{p_j(c)}$
    • individuals differ when they do not take the same levels
    • be careful, the frequencies of the levels matter!
    Individuals not interesting.
    [Figure: MCA factor map of the individuals; axes Dim 1 (10.87%) and Dim 2 (7.86%)]
  18. Individuals graph

    Distance between individuals: $d^2_{i,i'} = \frac{n}{m} \sum_{j=1}^{m} \sum_{c=1}^{C_j} \frac{(A^j_{ic} - A^j_{i'c})^2}{p_j(c)}$
    Individuals not interesting.
    Using categories for the interpretation: a category is at the barycenter of the individuals taking the category.
  21. Categories graph

    Distance between categories: $d^2_{c,c'} = \sum_{i=1}^{n} \left( \frac{A_{ic}}{p(c)} - \frac{A_{ic'}}{p(c')} \right)^2$
    • two levels are close if the individuals taking them are the same (e.g. 65 years & retiree), or if these individuals take the same levels for the other variables (e.g. 60 years & 65 years)
    • rare levels are far away from the others
  22. Supplementary variables

    [Figure: two MCA maps of supplementary categories, axes Dim 1 (10.87%) and Dim 2 (7.86%). First panel: regions (Alsace to Rhône Alpes), sexe (F, M), age (18_25 to 65_+), year (2005, 2010) and edu (edu.1 to edu.4). Second panel: age categories only.]
  23. Supplementary variables

    [Figure: MCA factor map, axes Dim 1 (10.87%) and Dim 2 (7.86%), showing the categories of drunk, alcohol, glasses and binge.]
  24. Reconstruction formula

    Objective: best rank-$K$ approximation of $Z_A = \frac{1}{\sqrt{mn}} (A - 1 p^T) D_p^{-1/2}$
    Solution: SVD $\hat{Z}_A = U_K D_K V_K^T$. The standard row and column coordinates are $\tilde{U}_K = U_K$ and $\tilde{V}_K = D_p^{-1/2} V_K$.
    If the low-rank approximation is good, we have (up to the scaling factor $1/\sqrt{mn}$):
    $\tilde{U}_K D_K \tilde{V}_K^T \approx Z_A D_p^{-1/2} = (A - 1 p^T) D_p^{-1}$, (3)
    in a weighted least-squares sense. By “solving for A” in (3), we obtain the reconstruction formula:
    $A \approx 1 p^T + (\tilde{U}_K D_K \tilde{V}_K^T) D_p$
  25. Reconstruction with MCA

    $\hat{A}^j_{ic} \approx p_j(c) \left( 1 + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c) \right)$
    Category $c$ chosen by person $i$ on variable $j$ is modeled by: main effect + low-rank structure

    initial data
      <1/m 0 1-2/m 1-2/w 3-4/w 5-6/w 7/w
    1    0 1     0     0     0     0   0
    2    0 1     0     0     0     0   0
    3    0 1     0     0     0     0   0
    4    0 1     0     0     0     0   0
    5    0 1     0     0     0     0   0

    reconstruction with 2 dimensions
          <1/m     0 1-2/m 1-2/w 3-4/w 5-6/w  7/w
    [1,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
    [2,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
    [3,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
    [4,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
    [5,] 0.273 0.478 0.068  0.06 0.029 0.012 0.08
  26. Outline

    1 Introduction
    2 Multiple Correspondence Analysis
    3 MultiLogit model for MCA
  27. Models for categorical variables

    • Log-linear models - gold standard (Christensen, 1990; Agresti, 2013) ⇒ Problem with high-dimensional data.
    • Latent variable models:
      • categorical: latent class models (Goodman, 1974) - unsupervised clustering for one latent variable. Nonparametric Bayesian extensions (Dunson, 2009, 2012)
      • continuous: latent-trait models (Lazar, 1968) - item response theory (psychology & education, Van der Linden, 1997)
        • fixed: often one latent variable, difficult to estimate.
        • random: Gaussian distribution on the latent variables (Moustaki, 2000; Sanchez, 2013)
    ⇒ Binary data: Collins, Dasgupta, & Schapire (2001), Buntine (2002), Hoff (2009), De Leeuw (2006), Li & Tao (2013)
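  28. Multilogit-bilinear model

    $P(x_{ij} = c) = \pi_{ijc} = \frac{e^{\theta_{ij}(c)}}{\sum_{c'=1}^{C_j} e^{\theta_{ij}(c')}}$, $\theta_{ij}(c) = \beta_j(c) + \Gamma^j_i(c) = \beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)$
    $\tilde{v}_j(c) = (\sqrt{d_1} v_{j1}(c), \sqrt{d_2} v_{j2}(c))$: coordinates of category $c$ of question $j$, one point per category ($C_j$ categories).
    The latent variables: $\tilde{u}_i = D_2^{1/2} u_i$
    $P(x_{ij} = c) \propto \exp\left( \tilde{\beta}_j(c) - \frac{1}{2} \| \tilde{v}_j(c) - \tilde{u}_i \|^2 \right)$
    [Figure: latent space (Dimension 1, Dimension 2) showing the category points $v_j(1)$, $v_j(2)$, $v_j(3)$, $v_j(4)$ and the individuals $u_1$, $u_2$]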
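The bilinear and the distance forms of the model give the same probabilities. The sketch below checks this numerically for one individual and one question with three categories. All numbers are made up, and the relation $\tilde{\beta}_j(c) = \beta_j(c) + \frac{1}{2}\|\tilde{v}_j(c)\|^2$ is an assumption derived here by expanding the square; the slide does not define $\tilde{\beta}$ explicitly.

```python
import math

def softmax(theta):
    """Multinomial logit probabilities, computed stably."""
    mx = max(theta)
    w = [math.exp(t - mx) for t in theta]
    s = sum(w)
    return [x / s for x in w]

# Hypothetical parameters: K = 2 dimensions, C_j = 3 categories.
beta = [0.2, -0.1, -0.1]                      # beta_j(c)
d = [2.0, 0.5]                                # d_1, d_2
u = [0.3, -0.4]                               # u_i
v = [[0.5, 0.1], [-0.2, 0.3], [-0.3, -0.4]]   # v[c][k] = v_jk(c)
K, C = len(d), len(beta)

# Bilinear form: theta_ij(c) = beta_j(c) + sum_k d_k u_ik v_jk(c).
theta = [beta[c] + sum(d[k] * u[k] * v[c][k] for k in range(K)) for c in range(C)]
pi_bilinear = softmax(theta)

# Distance form with rescaled coordinates u~ = D^(1/2) u and v~ = D^(1/2) v.
u_t = [math.sqrt(d[k]) * u[k] for k in range(K)]
v_t = [[math.sqrt(d[k]) * v[c][k] for k in range(K)] for c in range(C)]
# Assumed definition: beta~_j(c) = beta_j(c) + 0.5 * ||v~_j(c)||^2.
beta_t = [beta[c] + 0.5 * sum(x * x for x in v_t[c]) for c in range(C)]
theta_dist = [beta_t[c] - 0.5 * sum((v_t[c][k] - u_t[k]) ** 2 for k in range(K))
              for c in range(C)]
pi_distance = softmax(theta_dist)
```

The two linear predictors differ only by $-\frac{1}{2}\|\tilde{u}_i\|^2$, which does not depend on $c$ and cancels in the normalization.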
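  31. Relationship with MCA

    Data: $X_{n \times m}$ - $A = [A^1 | \cdots | A^m]$, $A^j \in \{0, 1\}^{n \times C_j}$
    Model: $\pi_{ijc} = \frac{e^{\beta_j(c) + \Gamma^j_i(c)}}{\sum_{c'=1}^{C_j} e^{\beta_j(c') + \Gamma^j_i(c')}}$
    Param: $\zeta = (\beta, \mathrm{vec}(\Gamma))$; $\zeta_0 = (\beta_0, 0)$
    ⇒ Rationale: Taylor expand the log-likelihood $\ell$ around the independence model $\zeta_0$:
    $\tilde{\ell}(\beta, \Gamma) = \ell(\zeta_0) + \nabla \ell(\zeta_0)^T (\zeta - \zeta_0) + \frac{1}{2} (\zeta - \zeta_0)^T \nabla^2 \ell(\zeta_0) (\zeta - \zeta_0)$
    $\tilde{\ell}(\beta, \Gamma)$ is a quadratic function of its arguments, so maximizing it amounts to a generalized SVD ⇒ MCA.
    ⇒ The joint likelihood is $\prod_{i=1}^{n} \prod_{j=1}^{m} \prod_{c=1}^{C_j} \pi_{ijc}^{A^j_{ic}}$ (independence)
    ⇒ The log-likelihood for the MultiLogit Bilinear model is:
    $\ell = \sum_{i,j,c} A^j_{ic} \log(\pi_{ijc}) = \sum_{i,j,c} A^j_{ic} \log \frac{\exp(\beta_j(c) + \Gamma^j_i(c))}{\sum_{c'=1}^{C_j} \exp(\beta_j(c') + \Gamma^j_i(c'))}$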
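  32. Relationship with MCA

    $\ell(\beta, \Gamma; A) = \sum_{i,j} \left[ \beta_j(A^j_i) + \Gamma^j_i(A^j_i) - \log \left( \sum_{c=1}^{C_j} e^{\beta_j(c) + \Gamma^j_i(c)} \right) \right]$
    $\frac{\partial \ell}{\partial \Gamma^j_i(c)} = 1_{x_{ij} = c} - \frac{e^{\beta_j(c) + \Gamma^j_i(c)}}{\sum_{c'=1}^{C_j} e^{\beta_j(c') + \Gamma^j_i(c')}} = A^j_{ic} - \pi_{ijc}$ (4)
    $\frac{\partial^2 \ell}{\partial \Gamma^j_i(c) \, \partial \Gamma^{j'}_{i'}(c')} = \begin{cases} \pi_{ijc} \pi_{ijc'} - \pi_{ijc} 1_{c = c'} & j = j', \ i = i' \\ 0 & \text{o.w.} \end{cases}$ (5)
    Evaluating (4) at $\zeta_0 = (\beta_0 = \log(p), 0)$ gives $A^j_{ic} - p_j(c)$ - idem (5)
    $\tilde{\ell}(\beta, \Gamma) \approx \langle \Gamma, A - 1 p^T \rangle - \frac{1}{2} \| \Gamma D_p^{1/2} \|_F^2$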
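Equation (4) can be checked by finite differences. The sketch below does so for a single variable with three categories and four individuals, at the independence point $\beta_0 = \log p$, $\Gamma = 0$. The margins, the observed categories and the helper names are all hypothetical.

```python
import math

def probs(theta):
    """Multinomial logit probabilities for one (i, j) pair."""
    mx = max(theta)
    w = [math.exp(t - mx) for t in theta]
    s = sum(w)
    return [x / s for x in w]

def log_lik(gamma, beta, observed):
    """Sum over individuals of theta[observed] - log(sum_c exp(theta[c]))."""
    total = 0.0
    for g_row, c_obs in zip(gamma, observed):
        theta = [b + g for b, g in zip(beta, g_row)]
        mx = max(theta)
        total += theta[c_obs] - (mx + math.log(sum(math.exp(t - mx) for t in theta)))
    return total

p = [0.5, 0.3, 0.2]                   # hypothetical margins p_j(c)
beta0 = [math.log(x) for x in p]      # independence model: beta_0 = log(p)
observed = [0, 2, 1, 0]               # category index taken by each individual
gamma0 = [[0.0] * len(p) for _ in observed]   # Gamma = 0

# Analytic gradient from eq. (4): A_ic - pi_ic, which is A_ic - p(c) here.
analytic = [[(1.0 if c == c_obs else 0.0) - p[c] for c in range(len(p))]
            for c_obs in observed]

# Central finite differences of the log-likelihood with respect to Gamma_i(c).
eps = 1e-6
numeric = []
for i in range(len(observed)):
    row = []
    for c in range(len(p)):
        plus = [r[:] for r in gamma0]
        minus = [r[:] for r in gamma0]
        plus[i][c] += eps
        minus[i][c] -= eps
        row.append((log_lik(plus, beta0, observed)
                    - log_lik(minus, beta0, observed)) / (2 * eps))
    numeric.append(row)

max_abs_diff = max(abs(a - b) for ra, rb in zip(analytic, numeric)
                   for a, b in zip(ra, rb))
```

At $\zeta_0$ the fitted probabilities equal the margins, so the gradient is the centered indicator $A - 1p^T$ that MCA decomposes.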
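  33. Relationship with MCA

    Lemma: Let $G \in \mathbb{R}^{n \times m}$, $H_1 \in \mathbb{R}^{n \times n}$, $H_2 \in \mathbb{R}^{m \times m}$, with $H_1, H_2 \succ 0$. The solution of
    $\arg\max_{\Gamma : \mathrm{rank}(\Gamma) \le K} \ \langle \Gamma, G \rangle - \frac{1}{2} \| H_1 \Gamma H_2 \|_F^2$
    is $\Gamma^* = H_1^{-1} \, \mathrm{SVD}_K(H_1^{-1} G H_2^{-1}) \, H_2^{-1}$.
    Thus, using the Lemma, the maximizer of $\langle \Gamma, A - 1 p^T \rangle - \frac{1}{2} \| \Gamma D_p^{1/2} \|_F^2$ is given by the rank-$K$ SVD of $(A - 1 p^T) D_p^{-1/2}$, which is precisely the SVD performed in MCA.
    Theorem: The one-step likelihood estimate for the MultiLogit Bilinear model with rank constraint $K$, obtained by expanding around the independence model $(\beta_0 = \log p, \Gamma_0 = 0)$, is $(\beta_0, \Gamma^{MCA})$.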
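The lemma follows from completing the square. A sketch of the argument, assuming $H_1$ and $H_2$ are symmetric positive definite and writing $M = H_1^{-1} G H_2^{-1}$ (a symbol introduced here, not on the slide); the last step relies on the Eckart-Young theorem for the best rank-$K$ approximation in Frobenius norm.

```latex
\begin{align*}
\langle \Gamma, G \rangle - \tfrac{1}{2} \| H_1 \Gamma H_2 \|_F^2
  &= -\tfrac{1}{2} \| H_1 \Gamma H_2 - M \|_F^2 + \tfrac{1}{2} \| M \|_F^2,
  \qquad M = H_1^{-1} G H_2^{-1}, \\
\text{because} \quad \langle H_1 \Gamma H_2, M \rangle
  &= \operatorname{tr}\left( H_2 \Gamma^T H_1 H_1^{-1} G H_2^{-1} \right)
   = \operatorname{tr}\left( \Gamma^T G \right)
   = \langle \Gamma, G \rangle .
\end{align*}
% H_1 and H_2 are invertible, so rank(H_1 \Gamma H_2) = rank(\Gamma):
% the best choice is H_1 \Gamma H_2 = SVD_K(M).
\[
\Gamma^* = H_1^{-1} \, \mathrm{SVD}_K\left( H_1^{-1} G H_2^{-1} \right) H_2^{-1}
\]
% MCA case: H_1 = I, H_2 = D_p^{1/2}, G = A - 1 p^T.
\[
\Gamma^* = \mathrm{SVD}_K\left( (A - 1 p^T) D_p^{-1/2} \right) D_p^{-1/2}
\]
```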
  34. Maximizing the likelihood

    Data: $A = [A^1, \ldots, A^m]$
    Model: $\pi_{ijc} = \frac{\exp(\theta_{ij}(c))}{\sum_{c'=1}^{C_j} \exp(\theta_{ij}(c'))}$, $\theta_{ij}(c) = \beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)$
    Identification constraints: $\beta_j^T 1 = 0$, $1^T U = 0$, $U^T U = I$, $1^T V_j = 0$
    Maximum dimensionality: $K = \min(n - 1, \sum_{j=1}^{m} C_j - m)$
    Estimation: MLE
    ⇒ Problem: overfitting, $\theta_{ij}(c) \to \infty$ or $\theta_{ij}(c) \to -\infty$
    ⇒ Solution: penalized likelihood.
    $L(\beta, U, D, V) = -\sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{c=1}^{C_j} A^j_{ic} \log(\pi_{ijc}) + \lambda \sum_{k=1}^{K} d_k$
  35. Majorization

    • Majorization (or MM) algorithms (De Leeuw & Heiser, 1977; Lange, 2004) use in each iteration a majorizing function $g(\theta, \theta_0)$.
    • The current estimate $\theta_0$ is called the supporting point.
    • Requirements:
      1 $f(\theta_0) = g(\theta_0, \theta_0)$.
      2 $f(\theta) \le g(\theta, \theta_0)$.
    • Sandwich inequality: $f(\theta^+) \le g(\theta^+, \theta_0) \le g(\theta_0, \theta_0) = f(\theta_0)$ with $\theta^+ = \arg\min_\theta g(\theta, \theta_0)$
    • Any majorization algorithm is guaranteed to descend.
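A one-dimensional toy case shows the descent property at work. The function below is not from the talk: it is a binary logistic loss $f(\theta) = \log(1 + e^\theta) - a\theta$ with a made-up target $a = 0.8$, whose second derivative is at most 1/4, so a quadratic with curvature 1/4 majorizes it at any supporting point.

```python
import math

A_TARGET = 0.8  # hypothetical observed proportion

def f(theta):
    """Toy loss: binary logistic negative log-likelihood."""
    return math.log1p(math.exp(theta)) - A_TARGET * theta

def grad(theta):
    return 1.0 / (1.0 + math.exp(-theta)) - A_TARGET

def majorizer(theta, theta0):
    """g(theta, theta0) = f(theta0) + f'(theta0)(theta - theta0) + (1/8)(theta - theta0)^2."""
    return f(theta0) + grad(theta0) * (theta - theta0) + 0.125 * (theta - theta0) ** 2

def mm_step(theta0):
    """Minimizer of the majorizer: theta0 - 4 f'(theta0)."""
    return theta0 - 4.0 * grad(theta0)

theta = -3.0
values = [f(theta)]
for _ in range(50):
    theta = mm_step(theta)
    values.append(f(theta))
```

The loss values never increase from one iteration to the next, and the iterates approach the minimizer $\log(0.8 / 0.2) = \log 4$.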
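  36. Majorizing Function

    $f_{ij}(\theta_i) = -\sum_{c=1}^{C_j} A^j_{ic} \log(\pi_{ijc}) = -\sum_{c=1}^{C_j} A^j_{ic} \log \frac{\exp(\theta_{ij}(c))}{\sum_{c'=1}^{C_j} \exp(\theta_{ij}(c'))}$
    With $\nabla f_{ij}(\theta_i) = \pi_{ij} - A_{ij}$ (the sign follows from $f_{ij}$ being the negative log-likelihood) and $\nabla^2 f_{ij}(\theta_i) = \mathrm{Diag}(\pi_{ij}) - \pi_{ij} \pi_{ij}^T$
    Theorem: $g_{ij}(\theta_i, \theta_i^{(0)}) = f_{ij}(\theta_i^{(0)}) + (\theta_i - \theta_i^{(0)})^T \nabla f_{ij}(\theta_i^{(0)}) + \frac{1}{4} \| \theta_i - \theta_i^{(0)} \|^2$ is a majorizing function of $f_{ij}(\theta_i)$ (using De Leeuw, 2005)
    Proof: $h(\theta_i) = g_{ij}(\theta_i, \theta_i^{(0)}) - f_{ij}(\theta_i) \ge 0$:
    • For $\theta_i = \theta_i^{(0)}$ we have $g_{ij}(\theta_i^{(0)}, \theta_i^{(0)}) - f_{ij}(\theta_i^{(0)}) = 0$
    • At $\theta_i^{(0)}$ we have $\nabla f_{ij}(\theta_i^{(0)}) = \nabla g_{ij}(\theta_i^{(0)}, \theta_i^{(0)})$
    • $\frac{1}{2} I - \nabla^2 f_{ij}(\theta_i)$ is positive semi-definite (the largest eigenvalue of $\nabla^2 f_{ij}(\theta_i)$ is at most 1/2)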
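The majorization claim can be probed numerically for a single $(i, j)$ pair. The sketch below draws random pairs $(\theta, \theta^{(0)})$ and records the gap $g - f$; it takes the gradient of the negative log-likelihood as $\pi - A$. The observed category, the number of categories and the sampling range are arbitrary choices for the example.

```python
import math
import random

def softmax(theta):
    mx = max(theta)
    w = [math.exp(t - mx) for t in theta]
    s = sum(w)
    return [x / s for x in w]

def f(theta, a):
    """Negative multinomial log-likelihood for one (i, j) pair; a is the indicator row."""
    mx = max(theta)
    log_norm = mx + math.log(sum(math.exp(t - mx) for t in theta))
    return log_norm - sum(ai * t for ai, t in zip(a, theta))

def g(theta, theta0, a):
    """Quadratic majorizer with curvature bound 1/2, i.e. a 1/4 squared-norm term."""
    grad0 = [p - ai for p, ai in zip(softmax(theta0), a)]   # gradient of f at theta0
    diff = [t - t0 for t, t0 in zip(theta, theta0)]
    return (f(theta0, a)
            + sum(dk * gk for dk, gk in zip(diff, grad0))
            + 0.25 * sum(dk * dk for dk in diff))

rng = random.Random(0)
a = [0, 1, 0, 0]   # hypothetical observation: second of four categories
gaps = []
for _ in range(1000):
    theta0 = [rng.uniform(-3.0, 3.0) for _ in a]
    theta = [rng.uniform(-3.0, 3.0) for _ in a]
    gaps.append(g(theta, theta0, a) - f(theta, a))
min_gap = min(gaps)
touch_gap = g([0.1, -0.2, 0.3, 0.0], [0.1, -0.2, 0.3, 0.0], a) - f([0.1, -0.2, 0.3, 0.0], a)
```

A random search like this is a sanity check rather than a proof; the proof rests on the eigenvalue bound on the Hessian.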
  37. Majorizing Function

    Bounding the largest eigenvalue of the Hessian $\nabla^2 f_{ij}(\theta_i) = \mathrm{Diag}(\pi_{ij}) - \pi_{ij} \pi_{ij}^T$, written $H$ with the indices $ij$ dropped:
    $H = \begin{pmatrix} \pi_1 (1 - \pi_1) & -\pi_1 \pi_2 & -\pi_1 \pi_3 & \cdots \\ -\pi_1 \pi_2 & \pi_2 (1 - \pi_2) & -\pi_2 \pi_3 & \cdots \\ \vdots & \vdots & \pi_3 (1 - \pi_3) & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}$
    Gerschgorin disks: an eigenvalue $\phi$ is never larger than a diagonal element plus the sum of the absolute off-diagonal values in its row (or column):
    $\phi \le \pi_{ijc} - \pi_{ijc}^2 + \pi_{ijc} \sum_{c' \ne c} \pi_{ijc'}$
    $\phi \le \pi_{ijc} - \pi_{ijc}^2 + \pi_{ijc} \sum_{c'=1}^{C_j} \pi_{ijc'} - \pi_{ijc}^2$
    $\phi \le 2 (\pi_{ijc} - \pi_{ijc}^2) = 2 \pi_{ijc} (1 - \pi_{ijc})$
    $2 \pi_{ijc} (1 - \pi_{ijc})$ reaches its maximum of 1/2 at $\pi_{ijc} = 1/2$.
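The Gerschgorin bound is easy to evaluate for any probability vector. A small sketch with hypothetical helper names, drawing random probability vectors and recording the largest row bound:

```python
import random

def hessian(pi):
    """Diag(pi) - pi pi^T for one multinomial probability vector."""
    C = len(pi)
    return [[pi[c] * (1.0 - pi[c]) if c == k else -pi[c] * pi[k]
             for k in range(C)] for c in range(C)]

def gerschgorin_upper_bound(H):
    """Largest value of (diagonal element + sum of absolute off-diagonal row entries)."""
    C = len(H)
    return max(H[c][c] + sum(abs(H[c][k]) for k in range(C) if k != c)
               for c in range(C))

rng = random.Random(1)
bounds = []
for _ in range(1000):
    w = [rng.random() + 1e-9 for _ in range(5)]
    total = sum(w)
    pi = [x / total for x in w]
    bounds.append(gerschgorin_upper_bound(hessian(pi)))
max_bound = max(bounds)

# The bound is attained when one category has probability 1/2.
bound_at_half = gerschgorin_upper_bound(hessian([0.5, 0.5]))
```

Each row bound equals $2\pi_c(1 - \pi_c)$, so it can never exceed 1/2, which is where the 1/4 coefficient of the quadratic majorizer comes from.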
  39. Updating parameters

    $\theta_{ij}(c) = \beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c)$
    $L(\beta, U, D, V) \le \frac{1}{4} \sum_{i,j,c} A^j_{ic} (z_{ijc} - \theta_{ij}(c))^2 + \lambda \sum_{k=1}^{K} d_k + \text{const}$
    $z_{ijc} = \beta^{(0)}_{jc} + u^{(0)T}_i D^{(0)} v^{(0)}_{jc} + 2 (A_{ijc} - \pi_{ijc}(\theta^{(0)}))$
    • Update $\beta$: $\beta = n^{-1} Z^T 1$
    • Update $U$ and $V$: let $(I - n^{-1} 1 1^T) Z = P \Phi Q^T$ be the SVD; $U = P$ and $V = Q$.
    • Update $D$: minimize $\sum_{k=1}^{K} [(\phi_k - d_k)^2 + \lambda d_k]$, giving $d_k = \max(0, \phi_k - \lambda)$
    Nuclear norm: there is automatic dimension selection.
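The D update is a soft-thresholding of the singular values, and the number of non-zero values after thresholding is the selected dimension. A minimal sketch of the update rule as stated on the slide, with made-up singular values:

```python
def soft_threshold(singular_values, lam):
    """Apply d_k = max(0, phi_k - lam) to each singular value."""
    return [max(0.0, phi - lam) for phi in singular_values]

phi = [3.0, 1.5, 0.4]   # hypothetical singular values of the centered working matrix
lam = 1.0

d = soft_threshold(phi, lam)
selected_rank = sum(1 for dk in d if dk > 0.0)
```

Here the third dimension is set exactly to zero, which is the automatic dimension selection mentioned above.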
  40. Selecting λ

    ⇒ 2 steps: QUT to select the rank K - Shrinkage with CV
    Rationale with Lasso:
    • Lasso often used for screening (Buhlmann & van de Geer, 2011)
    • Selecting λ with CV or STEIN focuses on predictive properties
    • Optimal threshold for prediction ≠ optimal threshold for selecting variables
    ⇒ Quantile Universal Threshold (Sardy, 2016): select the threshold at the bulk edge of what a threshold should be under the null. Guaranteed variable screening with high probability. Be careful, biased!
  41. Quantile Universal Threshold

    Ex. PCA: $X = \mu + \varepsilon$, with $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$ → $\hat{\mu}_K = \sum_{k=1}^{K} u_k d_k v_k^T$
    Soft-threshold: $\arg\min_\mu \| X - \mu \|_2^2 + \lambda \| \mu \|_*$ → $d_k \max\left( 1 - \frac{\lambda}{d_k}, 0 \right)$
    ⇒ Selecting λ to have a good estimate of the rank:
    1 Generate data under the null hypothesis of no signal, $\mu = 0$
    2 Compute the first singular value $d_1$
    3 Repeat steps 1 and 2 1000 times
    4 Use the $(1 - \alpha)$-quantile of the distribution of $d_1$ as threshold
    (Exact results: Zanella 2009; asymptotic results, random matrix theory: Shabalin 2013, Paul 2007, Baik 2006...)
    ⇒ Assumes σ is known!
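The four steps above can be sketched for the Gaussian PCA case with the standard library only. Everything in this block is illustrative: the function names are made up, the largest singular value is approximated by power iteration rather than a full SVD, and the default number of simulations is reduced from the 1000 on the slide to keep the run short.

```python
import math
import random

def largest_singular_value(X, n_iter=100):
    """Approximate the largest singular value of X by power iteration on X^T X."""
    n_rows, n_cols = len(X), len(X[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(n_iter):
        Xv = [sum(x * vj for x, vj in zip(row, v)) for row in X]                 # X v
        w = [sum(X[i][j] * Xv[i] for i in range(n_rows)) for j in range(n_cols)]  # X^T X v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Xv = [sum(x * vj for x, vj in zip(row, v)) for row in X]
    return math.sqrt(sum(x * x for x in Xv))

def qut_threshold(n, m, sigma, alpha=0.05, n_sim=100, seed=0):
    """(1 - alpha)-quantile of the first singular value of pure-noise n x m matrices."""
    rng = random.Random(seed)
    d1 = []
    for _ in range(n_sim):
        noise = [[rng.gauss(0.0, sigma) for _ in range(m)] for _ in range(n)]  # step 1
        d1.append(largest_singular_value(noise))                               # step 2
    d1.sort()                                                                  # steps 3-4
    index = min(n_sim - 1, max(0, int(math.ceil((1.0 - alpha) * n_sim)) - 1))
    return d1[index]

lam = qut_threshold(n=15, m=6, sigma=1.0)
lam_double_sigma = qut_threshold(n=15, m=6, sigma=2.0)
```

With the same seed, doubling σ doubles the threshold, which illustrates why the procedure needs σ to be known.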
  42. Selecting λ

    ⇒ 2 steps: QUT to select the rank K - Shrinkage with CV
    Model: $\pi_{ijc} = \frac{\exp(\beta_j(c) + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c))}{\sum_{c'=1}^{C_j} \exp(\beta_j(c') + \sum_{k=1}^{K} d_k u_{ik} v_{jk}(c'))}$
    Lik: $L(\beta, U, D, V) = -\sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{c=1}^{C_j} A^j_{ic} \log(\pi_{ijc}) + \lambda \sum_{k=1}^{K} d_k$
    1 Generate data under the null of no interaction and take as λ the quantile of the distribution of $d_1$: good rank recovery
    2 For a rank $K_{QUT}$, estimate λ with cross-validation to determine the amount of shrinkage (Lasso + LS). k-fold CV: the λ with the best out-of-sample deviance is chosen.
  43. Simulations

    • n: 50, 100, 300
    • m: 20, 100, 300 - 3 categories per variable
    • Interaction K: 2, 6
    • $d_1 / d_2$: 2, 1
    • the strength of the interaction (low, strong).
    $\tilde{u}_i \sim \mathcal{N}_K(0, \mathrm{diag}(d_1, \ldots, d_K))$, $\tilde{v}_j(c) \sim \mathcal{N}_K(0, \mathrm{diag}(d_1, \ldots, d_K))$
    $\theta^c_{ij} = -\frac{1}{2} \| \tilde{u}_i - \tilde{v}_j(c) \|^2$, $P(x_{ij} = c) \propto e^{\theta^c_{ij}}$
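A data set of this kind can be generated as follows. This is only a sketch of the simulation design, assuming the covariance is diagonal with variances $d_1, \ldots, d_K$; the function name, the seed and the chosen sizes are made up, and this is not the code used for the results in the talk.

```python
import math
import random

def simulate(n, m, n_categories, d, seed=0):
    """Draw X (n x m, categories 1..n_categories) from the distance model."""
    rng = random.Random(seed)
    K = len(d)

    def draw_point():
        # One draw from N_K(0, diag(d)): independent coordinates with variance d_k.
        return [rng.gauss(0.0, math.sqrt(dk)) for dk in d]

    u = [draw_point() for _ in range(n)]                                 # u~_i
    v = [[draw_point() for _ in range(n_categories)] for _ in range(m)]  # v~_j(c)

    X, P = [], []
    for i in range(n):
        x_row, p_row = [], []
        for j in range(m):
            theta = [-0.5 * sum((u[i][k] - v[j][c][k]) ** 2 for k in range(K))
                     for c in range(n_categories)]
            mx = max(theta)
            w = [math.exp(t - mx) for t in theta]
            total = sum(w)
            probs = [x / total for x in w]
            x_row.append(rng.choices(range(1, n_categories + 1), weights=probs)[0])
            p_row.append(probs)
        X.append(x_row)
        P.append(p_row)
    return X, P

# Smallest setting of the design: n = 50, m = 20, 3 categories, K = 2, d_1/d_2 = 2.
X, P = simulate(n=50, m=20, n_categories=3, d=[2.0, 1.0])
```

The returned probabilities P make it possible to compare an estimator's fitted probabilities with the truth, which simulated data allow and real data do not.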
  44. Simulations

          n    p  rank  ratio  strength  model             MCA
     1   50   20     2      1       0.1  0.044             0.035
     2   50   20     2      1       1    0.020             0.045
     3   50   20     2      2       0.1  0.048             0.036
     4   50   20     2      2       1    0.0206            0.042
     5   50   20     6      1       0.1  0.111             0.064
     6   50   20     6      1       1    0.045             0.026
     7   50   20     6      2       0.1  0.115 (0.028)     0.071
     8   50   20     6      2       1    0.032             0.051
     9  300  100     2      1       0.1  0.005             0.006
    10  300  100     2      1       1    0.004             0.042
    11  300  100     2      2       0.1  0.0047            0.005
    12  300  100     2      2       1    0.0037 (0.00369)  0.040
    13  300  300     2      1       0.1  0.003             0.004
    14  300  300     2      1       1    0.002             0.039
    15  300  300     2      2       0.1  0.003             0.004
    16  300  300     2      2       1    0.002             0.039
    17  300  100     6      1       0.1  0.019             0.015
    18  300  100     6      1       1    0.011             0.023
    19  300  100     6      2       0.1  0.018 (0.010)     0.017
    20  300  100     6      2       1    0.010             0.056
    21  300  300     6      1       0.1  0.011             0.008
    22  300  300     6      1       1    0.006             0.022
    23  300  300     6      2       0.1  0.009             0.012
  45. Simulation

  47. Overfitting

    ⇒ n = 50, p = 20, r = 6, strength = 0.1
    ⇒ Alcohol data....
    [Figure: essai$dev plotted against Index; Index from 0 to 15000, essai$dev from 300 to 800]
  48. Overfitting

    ⇒ n = 50, p = 20, r = 6, strength = 0.1
    ⇒ Alcohol data....
    [Figure: essai$dev plotted against Index; Index from 0 to 600, essai$dev from 20000 to 26000]
  49. Conclusion

    MCA can be seen as a linearized estimate of the parameters of the multinomial logit bilinear model
    ⇒ MCA is a proxy to estimate the model’s parameters (small interaction)
    • graphics
    • mixed data (quantitative, qualitative) - FAMD / Multiple Factor Analysis for groups of variables
    • mixture of MCA / mixture of PPCA
    • selecting the rank with BIC?? Fixed effect, asymptotics, n - p?
    • regularization in MCA to tackle overfitting issues
    • missing values