Multinomial Logit bilinear model

Julie Josse

June 14, 2016

Transcript

  1. Multiple Correspondence Analysis & the MultiLogit Model
    Julie Josse - William Fithian - Patrick Groenen
    Agrocampus, INRIA - Berkeley Statistics - Econometric Institute, Rotterdam
    AgroParisTech, Paris, 13 June 2016

  2. Outline
    1 Introduction
    2 Multiple Correspondence Analysis
    3 MultiLogit model for MCA

  3. Exploratory multivariate data analysis
    Descriptive methods, data visualization:
    • Principal Component Analysis ⇒ continuous variables
    • Correspondence Analysis ⇒ contingency tables
    • Multiple Correspondence Analysis ⇒ categorical variables
    ⇒ Dimensionality reduction (describe the data with a smaller
    number of variables)
    ⇒ Geometrical approach: importance given to graphical displays
    ⇒ No probabilistic framework, in line with Benzécri (1973)'s idea:
    “Let the data speak for themselves”

  4. Underlying model?
    ⇒ SVD of certain matrices with specific row and column weights
    and metrics (used to compute the distances).
    “Doing a data analysis, in good mathematics, is simply searching
    eigenvectors; all the science of it (the art) is just to find the right
    matrix to diagonalize.” (Benzécri, 1973)
    ⇒ Specific choices of weights and metrics can be viewed as
    inducing specific models for the data under analysis.
    ⇒ Understanding the connections between exploratory multivariate
    methods and their cognate models helps with selecting the number of PCs,
    missing values, estimation with the SVD, graphics, etc.


  6. The linear-bilinear model & PCA
    ⇒ The fixed-effects model (Caussinus, 1986) for X ∈ R^{n×m}:
    x_ij ∼ N(µ_ij, σ²), with µ_ij = β_j + Γ_ij = β_j + Σ_{k=1}^K d_k u_ik v_jk,
    with identifiability constraint U^T U = V^T V = I_K, U ∈ R^{n×K}, V ∈ R^{m×K}.
    ⇒ Population data... (sensory analysis) - (PPCA: random effect)
    ⇒ MLE of Γ amounts to the LS approximation of Z = (I_n − (1/n) 1 1^T) X:
    SVD (ALS algorithms) gives Γ̂ = U_K D_K V_K^T,
    PCA scores F_K = U_K D_K and loadings V_K.
    Fixed factor scores models (De Leeuw, 1997) - Anova: linear-bilinear models
    (Mandel, 1969; Denis, 1994), AMMI, biadditive models (Gabriel, 1978; Gower, 1995).
    Useful in Anova without replication.
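
    As a quick illustration (toy data, not from the talk), the fixed-effects fit reduces to column-centering followed by a truncated SVD; a minimal R sketch:

      set.seed(1)
      X <- matrix(rnorm(100 * 10), 100, 10)          # toy data matrix, n = 100, m = 10
      K <- 2
      Z <- scale(X, center = TRUE, scale = FALSE)    # Z = (I_n - 11'/n) X
      s <- svd(Z)
      Gamma_hat <- s$u[, 1:K] %*% diag(s$d[1:K]) %*% t(s$v[, 1:K])   # rank-K MLE of Gamma
      F_K <- s$u[, 1:K] %*% diag(s$d[1:K])           # PCA scores
      V_K <- s$v[, 1:K]                              # loadings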


  8. The log-bilinear model & CA
    ⇒ The saturated log-linear model (Christensen, 1990; Agresti, 2013):
    log µ_ij = α_i + β_j + Γ_ij
    ⇒ The RC association model (Goodman, 1985; Gower, 2011; GAMMI):
    log µ_ij = α_i + β_j + Σ_{k=1}^K d_k u_ik v_jk
    Estimation: iterative weighted least squares, steps of GLM.
    ⇒ CA (Greenacre, 1984): text corpora, spectral clustering on graphs:
    z_ij = (x_ij/N − r_i c_j) / √(r_i c_j),   i.e.   Z = D_r^{−1/2} (X/N − r c^T) D_c^{−1/2}
    If X is an adjacency matrix, Z is the symmetric normalized graph Laplacian.


  10. CA approximates the log-bilinear model
    ⇒ SVD Z = U_K D_K V_K^T. Standard row and column coordinates:
    Ũ_K = D_r^{−1/2} U_K, Ṽ_K = D_c^{−1/2} V_K. If the low-rank approximation is good:
    Ũ_K D_K Ṽ_K^T ≈ D_r^{−1/2} Z D_c^{−1/2} = D_r^{−1} (X/N − r c^T) D_c^{−1}    (1)
    By “solving for X” in (1), we get the reconstruction formula:
    X/N ≈ r c^T + D_r (Ũ_K D_K Ṽ_K^T) D_c,   i.e.   x̂_ij/N = r_i c_j (1 + Σ_{k=1}^K d_k u_ik v_jk)    (2)
    ⇒ Connection (Escofier, 1982): when Σ_{k=1}^K d_k u_ik v_jk ≪ 1, eq. (2) gives:
    log(x̂_ij) ≈ log(N) + log(r_i) + log(c_j) + Σ_{k=1}^K d_k u_ik v_jk
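
    A small R sketch with an invented 3×3 contingency table (not the slides' data) showing the CA matrix Z, its SVD, and the reconstruction formula (2):

      X <- matrix(c(20, 10,  5,
                     8, 15, 12,
                     4,  6, 20), nrow = 3, byrow = TRUE)   # toy contingency table
      N <- sum(X); P <- X / N
      r <- rowSums(P); c <- colSums(P)
      Z <- diag(1 / sqrt(r)) %*% (P - r %*% t(c)) %*% diag(1 / sqrt(c))
      s <- svd(Z)
      K <- 1
      U <- diag(1 / sqrt(r)) %*% s$u[, 1:K, drop = FALSE]   # standard row coordinates
      V <- diag(1 / sqrt(c)) %*% s$v[, 1:K, drop = FALSE]   # standard column coordinates
      Xhat <- N * (r %*% t(c)) * (1 + U %*% diag(s$d[1:K], K) %*% t(V))   # eq. (2)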

  11. Outline
    1 Introduction
    2 Multiple Correspondence Analysis
    3 MultiLogit model for MCA

  12. Data - examples
    • large-scale survey datasets in the social sciences
    • medical research: understanding the genetic and environmental
    risk factors of diseases. E.g. diabetes: 300 questions (56 pages
    of questionnaire!) on food consumption habits, previous illnesses
    in the family, the presence of animals in the household, the kind
    of paint used in the rooms, etc.
    • genetic studies: relationships within a sequence of ACGT
    nucleotides

  13. Alcohol data
    INPES (Santé publique France)
    Summary of the categorical variables (marginal counts) and the first rows of the data:
    region sexe age year edu
    Ile de France : 8120 F:29776 18_25: 6920 2005:27907 1:12684
    Rhône Alpes : 5421 M:23165 26_34: 9401 2010:25034 2:23521
    Provence Alpes Cote d’Azur: 4116 35_44:10899 3: 6563
    Nord Pas de Calais : 3819 45_54: 9505 4:10173
    Pays de Loire : 3152 55_64: 9503
    Bretagne : 3038 65_+ : 6713
    (Other) :25275
    drunk alcohol glasses binge
    0 :44237 <1/m :12889 0 : 2812 <2/m:10323
    1-2 : 4952 0 : 6133 0-2:37867 0 :34345
    10-19: 839 1-2/m: 7583 10+: 590 1/m : 6018
    20-29: 212 1-2/w: 9526 3-4: 9486 1/w : 1881
    3-5 : 1908 3-4/w: 6815 5-6: 1795 7/w : 374
    30+ : 404 5-6/w: 3402 7-9: 391
    6-9 : 389 7/w : 6593
    region sexe age year edu drunk alcohol glasses binge
    1 Rhône Alpes M 45_54 2005 1 0 0 0-2 0
    2 Rhône Alpes M 45_54 2005 2 0 0 0-2 0
    3 Rhône Alpes M 55_64 2005 2 0 0 0-2 0
    4 Rhône Alpes M 18_25 2005 3 0 0 0-2 0
    5 Rhône Alpes M 18_25 2005 2 0 0 0-2 0
    6 Rhône Alpes M 26_34 2005 2 0 0 0-2 0

  14. Coding categorical variables
    X: n individuals and m variables - the indicator matrix A = [A^1 | ··· | A^m],
    A^j ∈ {0, 1}^{n×C_j}, with row i of A^j corresponding to a dummy coding of x_ij.
    X =  1 1          A =  1 0 | 1 0 0          B = AᵀA =  2 0 1 1 0
         2 3               0 1 | 0 0 1                     0 4 0 2 2
         1 2               1 0 | 0 1 0                     1 0 1 0 0
         2 3      ⟺        0 1 | 0 0 1     ⟹               1 2 0 3 0
         2 2               0 1 | 0 1 0                     0 2 0 0 2
         2 2               0 1 | 0 1 0
    p_j(c) = (1/n) Σ_i A^j_ic: the cth normalized column margin of A^j, with p = (p_1, . . . , p_m)^T.
    All row margins of A are exactly m.
    tab.disjonctif(don)[1:5, 22:47]
    Rhône Alpes F M 18_25 26_34 35_44 45_54 55_64 65_+ 2005 2010 1 2 3 4 0 1-2 10-19 20-29 3-5 30+ 6-9 <1/m
    1 1 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0
    2 1 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0
    3 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0
    4 1 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0
    5 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0
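
    The same toy coding can be reproduced in R with FactoMineR's tab.disjonctif() (the helper shown at the bottom of the slide); the variable names v1, v2 are illustrative:

      library(FactoMineR)
      X <- data.frame(v1 = factor(c(1, 2, 1, 2, 2, 2)),
                      v2 = factor(c(1, 3, 2, 3, 2, 2)))
      A <- tab.disjonctif(X)   # 6 x 5 indicator (disjunctive) matrix
      B <- t(A) %*% A          # the 5 x 5 matrix B of the slide (the Burt table)
      p <- colMeans(A)         # p_j(c), the normalized column margins
      rowSums(A)               # every row margin equals m = 2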

  15. Multiple Correspondence Analysis
    Z_A = (1/√(mn)) (A − 1 p^T) D_p^{−1/2}
    Define Γ_MCA to be the rank-K SVD U_K D_K V_K^T of Z_A.
    Homogeneity Analysis (Gifi, 1990; J. de Leeuw, J. Meulman), Dual scaling (Nishisato,
    1980; Guttman, 1941)
    ⇒ Interpreting the graphical displays where rows are represented
    with F = U_K D_K and categories with Ṽ_K = D_p^{−1/2} V_K
    Properties:
    • F_k = argmax_{F_k ∈ R^n} Σ_{j=1}^m η²(F_k, X_j) - counterpart of the PCA criterion
    • the distances between the rows and between the columns
    coincide with the χ² distances.
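
    A minimal R sketch of the MCA SVD, applied to the toy indicator matrix of slide 14 and following the definition of Z_A above (names are illustrative):

      A <- matrix(c(1,0,1,0,0, 0,1,0,0,1, 1,0,0,1,0, 0,1,0,0,1, 0,1,0,1,0, 0,1,0,1,0),
                  nrow = 6, byrow = TRUE)
      n <- nrow(A); m <- 2
      p <- colMeans(A)                                        # p_j(c)
      ZA <- (A - rep(1, n) %*% t(p)) %*% diag(1 / sqrt(p)) / sqrt(m * n)
      s <- svd(ZA)
      K <- 2
      F_K  <- s$u[, 1:K] %*% diag(s$d[1:K])                   # row (individual) coordinates
      Vtil <- diag(1 / sqrt(p)) %*% s$v[, 1:K]                # category coordinates D_p^{-1/2} V_K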

  16. Individuals graph
    Distance between individuals:
    d²_{i,i′} = (n/m) Σ_{j=1}^m Σ_{c=1}^{C_j} (A^j_ic − A^j_{i′c})² / p_j(c)
    • two individuals differ when they do not take the same levels
    • be careful, the frequencies of the levels matter!
    Individuals are not interesting in themselves.
    [Figure: MCA factor map of the individuals - Dim 1 (10.87%), Dim 2 (7.86%)]


  18. Individuals graph
    Distance between individuals:
    d²_{i,i′} = (n/m) Σ_{j=1}^m Σ_{c=1}^{C_j} (A^j_ic − A^j_{i′c})² / p_j(c)
    Individuals are not interesting in themselves.
    Using categories for the interpretation: a category lies at the
    barycenter of the individuals taking that category.



  21. Categories graph
    Distance between categories:
    d²_{c,c′} = Σ_{i=1}^n ( A_ic / p(c) − A_ic′ / p(c′) )²
    • two levels are close if the individuals taking these levels are
    the same (e.g. 65 years & retiree), or if those individuals take the
    same levels for the other variables (e.g. 60 years & 65 years)
    • rare levels are far away from the others

  22. Supplementary variables
    [Figure: two factor maps - Dim 1 (10.87%), Dim 2 (7.86%) - showing the categories of
    region, sexe (F/M), age, year (2005/2010) and edu; the second panel shows the age
    categories only]

  23. Supplementary variables
    [Figure: MCA factor map - Dim 1 (10.87%), Dim 2 (7.86%) - showing the categories of
    drunk, alcohol, glasses and binge]

  24. Reconstruction formula
    Objective: best rank-K approximation of Z_A = (1/√(mn)) (A − 1 p^T) D_p^{−1/2}
    Solution: the SVD Ẑ_A = U_K D_K V_K^T.
    The standard row and column coordinates are Ũ_K = U_K and
    Ṽ_K = D_p^{−1/2} V_K. If the low-rank approximation is good, we have:
    Ũ_K D_K Ṽ_K^T ≈ Z_A D_p^{−1/2} = (A − 1 p^T) D_p^{−1},    (3)
    in a weighted least-squares sense. By “solving for A” in (3), we
    obtain the reconstruction formula:
    A ≈ 1 p^T + (Ũ_K D_K Ṽ_K^T) D_p
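
    A self-contained R sketch of this reconstruction on the toy indicator matrix of slide 14; inverting the explicit definition of Z_A used here keeps the √(mn) factor visible, which the slide's notation leaves implicit:

      A <- matrix(c(1,0,1,0,0, 0,1,0,0,1, 1,0,0,1,0, 0,1,0,0,1, 0,1,0,1,0, 0,1,0,1,0),
                  nrow = 6, byrow = TRUE)
      n <- nrow(A); m <- 2; p <- colMeans(A); K <- 2
      ZA <- (A - rep(1, n) %*% t(p)) %*% diag(1 / sqrt(p)) / sqrt(m * n)
      s <- svd(ZA)
      Ahat <- rep(1, n) %*% t(p) +
        sqrt(m * n) * s$u[, 1:K] %*% diag(s$d[1:K]) %*% t(s$v[, 1:K]) %*% diag(sqrt(p))
      round(Ahat, 2)   # each entry: main effect p_j(c) plus a low-rank correction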

  25. Reconstruction with MCA
    Â^j_ic ≈ p_j(c) (1 + Σ_{k=1}^K d_k u_ik v_jk(c))
    Category c chosen by person i on variable j is modeled by: main
    effect + low-rank structure
    initial data
    <1/m 0 1-2/m 1-2/w 3-4/w 5-6/w 7/w
    1 0 1 0 0 0 0 0
    2 0 1 0 0 0 0 0
    3 0 1 0 0 0 0 0
    4 0 1 0 0 0 0 0
    5 0 1 0 0 0 0 0
    reconstruction with 2 dimensions
    <1/m 0 1-2/m 1-2/w 3-4/w 5-6/w 7/w
    [1,] 0.273 0.478 0.068 0.06 0.029 0.012 0.08
    [2,] 0.273 0.478 0.068 0.06 0.029 0.012 0.08
    [3,] 0.273 0.478 0.068 0.06 0.029 0.012 0.08
    [4,] 0.273 0.478 0.068 0.06 0.029 0.012 0.08
    [5,] 0.273 0.478 0.068 0.06 0.029 0.012 0.08

  26. Outline
    1 Introduction
    2 Multiple Correspondence Analysis
    3 MultiLogit model for MCA

  27. Models for categorical variables
    • Log-linear models - the gold standard (Christensen, 1990; Agresti, 2013)
    ⇒ Problematic with high-dimensional data.
    • Latent variable models:
    • categorical latent variables: latent class models (Goodman, 1974) -
    unsupervised clustering with one latent variable.
    Nonparametric Bayesian extensions (Dunson, 2009, 2012)
    • continuous latent variables: latent-trait models (Lazarsfeld, 1968) - item
    response theory (psychology & education, Van der Linden, 1997)
    • fixed: often one latent variable, difficult to estimate.
    • random: Gaussian distribution on the latent variables
    (Moustaki, 2000; Sanchez, 2013)
    ⇒ Binary data: Collins, Dasgupta & Schapire (2001), Buntine
    (2002), Hoff (2009), De Leeuw (2006), Li & Tao (2013)

  28. Multilogit-bilinear model
    P(x_ij = c) = π_ijc = e^{θ_ij(c)} / Σ_{c′=1}^{C_j} e^{θ_ij(c′)},
    θ_ij(c) = β_j(c) + Γ^j_i(c) = β_j(c) + Σ_{k=1}^K d_k u_ik v_jk(c)
    ṽ_j(c) = (√d₁ v_j1(c), √d₂ v_j2(c)):
    question j, category c is represented by one point (C_j points per question).
    With the latent variables ũ_i = D₂^{1/2} u_i:
    P(x_ij = c) ∝ exp( β̃_j(c) − (1/2) ‖ṽ_j(c) − ũ_i‖² )
    [Figure: latent space (Dimension 1 × Dimension 2) with the category points
    v_j(1), ..., v_j(4) of one question and the individual points u_1, u_2]
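
    A toy evaluation of the cell probabilities in R (all numbers below are illustrative, not estimates from the data):

      softmax <- function(theta) { e <- exp(theta - max(theta)); e / sum(e) }
      beta_j <- c(0.2, -0.1, 0.0)                   # main effects beta_j(c), C_j = 3
      u_i    <- c(0.5, -0.3)                        # latent coordinates u_ik, K = 2
      d      <- c(1.5, 0.8)                         # singular values d_k
      v_j    <- rbind(c(0.4, -0.2, -0.2),           # category coordinates v_jk(c), one row per k
                      c(0.1,  0.3, -0.4))
      theta_ij <- beta_j + as.vector((d * u_i) %*% v_j)   # theta_ij(c) = beta_j(c) + sum_k d_k u_ik v_jk(c)
      pi_ij    <- softmax(theta_ij)                 # pi_ijc, sums to 1 over the C_j categories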



  31. Relationship with MCA
    Data: X_{n×m} - A = [A^1 | ··· | A^m], A^j ∈ {0, 1}^{n×C_j}
    Model: π_ijc = e^{β_j(c) + Γ^j_i(c)} / Σ_{c′=1}^{C_j} e^{β_j(c′) + Γ^j_i(c′)}
    Parameters: ζ = (β, vec(Γ)); ζ₀ = (β₀, 0)
    ⇒ Rationale: Taylor expand the log-likelihood ℓ around the independence model ζ₀:
    ℓ̃(β, Γ) = ℓ(ζ₀) + ∇ℓ(ζ₀)^T (ζ − ζ₀) + (1/2) (ζ − ζ₀)^T ∇²ℓ(ζ₀) (ζ − ζ₀)
    ℓ̃(β, Γ) is a quadratic function of its arguments, so maximizing it
    amounts to a generalized SVD ⇒ MCA.
    ⇒ The joint likelihood is Π_{i=1}^n Π_{j=1}^m Π_{c=1}^{C_j} π_ijc^{A^j_ic} (independence)
    ⇒ The log-likelihood of the MultiLogit Bilinear model is:
    ℓ = Σ_{i,j,c} A^j_ic log(π_ijc) = Σ_{i,j,c} A^j_ic log[ exp(β_j(c) + Γ^j_i(c)) / Σ_{c′=1}^{C_j} exp(β_j(c′) + Γ^j_i(c′)) ]

  32. Relationship with MCA
    ℓ(β, Γ; A) = Σ_{i,j} [ β_j(A^j_i) + Γ^j_i(A^j_i) − log( Σ_{c=1}^{C_j} e^{β_j(c) + Γ^j_i(c)} ) ]
    ∂ℓ / ∂Γ^j_i(c) = 1_{x_ij = c} − e^{β_j(c) + Γ^j_i(c)} / Σ_{c′=1}^{C_j} e^{β_j(c′) + Γ^j_i(c′)} = A^j_ic − π_ijc    (4)
    ∂²ℓ / ∂Γ^j_i(c) ∂Γ^{j′}_{i′}(c′) = { π_ijc π_ijc′ − π_ijc 1_{c=c′}   if j = j′ and i = i′;   0 otherwise }    (5)
    Evaluating (4) at ζ₀ = (β₀ = log(p), 0) gives A^j_ic − p_j(c); idem for (5).
    ℓ̃(β, Γ) ≈ ⟨Γ, A − 1 p^T⟩ − (1/2) ‖Γ D_p^{1/2}‖²_F

  33. Relationship with MCA
    Lemma
    Let G ∈ R^{n×m}, H1 ∈ R^{n×n}, H2 ∈ R^{m×m}, with H1, H2 ≻ 0. Then
    argmax_{Γ: rank(Γ) ≤ K}  ⟨Γ, G⟩ − (1/2) ‖H1 Γ H2‖²_F
    is attained at  Γ* = H1^{−1} [SVD_K(H1^{−1} G H2^{−1})] H2^{−1}
    Thus, using the Lemma, the maximizer of ⟨Γ, A − 1 p^T⟩ − (1/2) ‖Γ D_p^{1/2}‖²_F
    is given by the rank-K SVD of (A − 1 p^T) D_p^{−1/2}, which is precisely
    the SVD performed in MCA.
    Theorem
    The one-step likelihood estimate for the MultiLogit Bilinear model
    with rank constraint K, obtained by expanding around the
    independence model (β₀ = log p, Γ₀ = 0), is (β₀, Γ_MCA).
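
    A quick numerical illustration of the Lemma in R (arbitrary toy matrices; diagonal H1, H2 for simplicity):

      svdK <- function(M, K) {
        s <- svd(M)
        s$u[, 1:K, drop = FALSE] %*% diag(s$d[1:K], K) %*% t(s$v[, 1:K, drop = FALSE])
      }
      set.seed(2)
      G  <- matrix(rnorm(5 * 4), 5, 4)
      H1 <- diag(runif(5, 1, 2))              # positive definite (diagonal here)
      H2 <- diag(runif(4, 1, 2))
      Gamma_star <- solve(H1) %*% svdK(solve(H1) %*% G %*% solve(H2), K = 2) %*% solve(H2)
      # With H1 = I and H2 = D_p^{1/2}, the inner SVD is that of (A - 1p') D_p^{-1/2}: the MCA SVD.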

  34. Maximizing the likelihood
    Data: A = [A^1, . . . , A^m]
    Model: π_ijc = exp(θ_ij(c)) / Σ_{c′=1}^{C_j} exp(θ_ij(c′)),   θ_ij(c) = β_j(c) + Σ_{k=1}^K d_k u_ik v_jk(c)
    Identification constraints: β_j^T 1 = 0, 1^T U = 0, U^T U = I, 1^T V^j = 0
    Maximum dimensionality: K = min(n − 1, Σ_{j=1}^m C_j − m)
    Estimation: MLE
    ⇒ Problem: overfitting: θ_ij(c) → ∞ or θ_ij(c) → −∞
    ⇒ Solution: penalized likelihood:
    L(β, U, D, V) = − Σ_{i=1}^n Σ_{j=1}^m Σ_{c=1}^{C_j} A^j_ic log(π_ijc) + λ Σ_{k=1}^K d_k

  35. Majorization
    • Majorization (or MM) algorithms (De Leeuw & Heiser, 1977; Lange,
    2004) use in each iteration a majorizing function g(θ, θ₀).
    • The current estimate θ₀ is called the supporting point.
    • Requirements:
    1 f(θ₀) = g(θ₀, θ₀).
    2 f(θ) ≤ g(θ, θ₀).
    • Sandwich inequality: f(θ⁺) ≤ g(θ⁺, θ₀) ≤ g(θ₀, θ₀) = f(θ₀)
    with θ⁺ = argmin_θ g(θ, θ₀)
    • Any majorization algorithm is thus guaranteed to descend.

  36. Majorizing Function
    f_ij(θ_i) = − Σ_{c=1}^{C_j} A^j_ic log(π_ijc) = − Σ_{c=1}^{C_j} A^j_ic log[ exp(θ_ij(c)) / Σ_{c′=1}^{C_j} exp(θ_ij(c′)) ]
    with ∇f_ij(θ_i) = π_ij − A_ij and ∇²f_ij(θ_i) = Diag(π_ij) − π_ij π_ij^T
    Theorem
    g_ij(θ_i, θ_i^(0)) = f_ij(θ_i^(0)) + (θ_i − θ_i^(0))^T ∇f_ij(θ_i^(0)) + (1/4) ‖θ_i − θ_i^(0)‖²
    is a majorizing function of f_ij(θ_i) (using De Leeuw, 2005)
    Proof: show h(θ_i) = g_ij(θ_i, θ_i^(0)) − f_ij(θ_i) ≥ 0:
    • For θ_i = θ_i^(0) we have g_ij(θ_i^(0), θ_i^(0)) − f_ij(θ_i^(0)) = 0
    • At θ_i^(0) we have ∇f_ij(θ_i^(0)) = ∇g_ij(θ_i^(0), θ_i^(0))
    • (1/2) I − ∇²f_ij(θ_i) is positive semi-definite (the largest eigenvalue of
    ∇²f_ij(θ_i) is smaller than 1/2)
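
    A numerical spot-check of the Theorem in R (toy values; the gradient coded here is that of f itself, i.e. π − A):

      f    <- function(theta, a) -sum(a * theta) + log(sum(exp(theta)))    # -sum_c a_c log pi_c
      grad <- function(theta, a) exp(theta) / sum(exp(theta)) - a          # pi - a
      g <- function(theta, theta0, a)
        f(theta0, a) + sum((theta - theta0) * grad(theta0, a)) + 0.25 * sum((theta - theta0)^2)
      theta0 <- c(0.1, -0.2, 0.3); a <- c(1, 0, 0)                         # supporting point, one-hot A
      all(replicate(2000, { th <- rnorm(3, sd = 2); g(th, theta0, a) >= f(th, a) }))  # TRUE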

  37. Majorizing Function
    f_ij(θ_i) = − Σ_{c=1}^{C_j} A^j_ic log(π_ijc) = − Σ_{c=1}^{C_j} A^j_ic log[ exp(θ_ij(c)) / Σ_{c′=1}^{C_j} exp(θ_ij(c′)) ]
    with ∇f_ij(θ_i) = π_ij − A_ij and ∇²f_ij(θ_i) = Diag(π_ij) − π_ij π_ij^T
    Theorem
    g_ij(θ_i, θ_i^(0)) = f_ij(θ_i^(0)) + (θ_i − θ_i^(0))^T ∇f_ij(θ_i^(0)) + (1/4) ‖θ_i − θ_i^(0)‖²
    is a majorizing function of f_ij(θ_i) (using De Leeuw, 2005)
    The Hessian has the form
    H = [ π₁(1−π₁)   −π₁π₂      −π₁π₃      ...
          −π₁π₂      π₂(1−π₂)   −π₂π₃      ...
          −π₁π₃      −π₂π₃      π₃(1−π₃)   ...
          ...        ...        ...        ... ]
    Gershgorin disks: any eigenvalue φ is always smaller than a diagonal element
    plus the sum of the absolute off-diagonal values of its row (or column):
    φ ≤ π_ijc − π²_ijc + π_ijc Σ_{c′≠c} π_ijc′
      ≤ π_ijc − π²_ijc + π_ijc Σ_{c′=1}^{C_j} π_ijc′ − π²_ijc
      = 2 (π_ijc − π²_ijc) = 2 π_ijc (1 − π_ijc)
    2 π_ijc (1 − π_ijc) reaches its maximum of 1/2 at π_ijc = 1/2.


  39. Updating parameters
    θ_ij(c) = β_j(c) + Σ_{k=1}^K d_k u_ik v_jk(c)
    L(β, U, D, V) ≤ (1/4) Σ_{i,j,c} A^j_ic (z_ijc − θ_ij(c))² + λ Σ_{k=1}^K d_k + const
    z_ijc = β_j^(0)(c) + u_i^(0)T D^(0) v_j^(0)(c) + 2 (A^j_ic − π_ijc(θ^(0)))
    • Update β: β = n^{−1} Z^T 1
    • Update U and V: let (I − n^{−1} 1 1^T) Z = P Φ Q^T be the SVD;
    then U = P and V = Q.
    • Update D: minimize Σ_{k=1}^K [(φ_k − d_k)² + λ d_k], giving
    d_k = max(0, φ_k − λ); nuclear-norm penalty: there is automatic
    dimension selection.
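
    A compact sketch of one such update in R, assuming the working response matrix Z has already been built; this illustrates the three update steps, not the authors' implementation:

      mm_step <- function(Z, lambda, K) {
        beta <- colMeans(Z)                         # beta = n^{-1} Z' 1
        Zc   <- sweep(Z, 2, beta)                   # (I - 11'/n) Z
        s    <- svd(Zc)
        d    <- pmax(0, s$d[1:K] - lambda)          # soft-threshold: some d_k become exactly 0
        list(beta = beta, U = s$u[, 1:K], D = d, V = s$v[, 1:K])
      }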

  40. Selecting λ
    ⇒ Two steps: QUT to select the rank K - shrinkage with CV
    Rationale, by analogy with the Lasso:
    • The Lasso is often used for screening (Bühlmann & van de Geer, 2011)
    • Selecting λ with CV or Stein focuses on predictive properties
    • The threshold that is optimal for prediction ≠ optimal for selecting variables
    ⇒ Quantile Universal Threshold (Sardy, 2016): select the threshold
    at the bulk edge of what a threshold should be under the null.
    Guaranteed variable screening with high probability. Be careful: biased!

  41. Quantile Universal Threshold
    Example with PCA: X = µ + ε, with ε_ij ∼ N(0, σ²) → µ̂_K = Σ_{k=1}^K d_k u_k v_k^T
    Soft-thresholding: argmin_µ ‖X − µ‖²_2 + λ ‖µ‖_* → d_k max(1 − λ/d_k, 0)
    ⇒ Selecting λ to get a good estimate of the rank:
    1 Generate data under the null hypothesis of no signal, µ = 0
    2 Compute the first singular value d₁
    3 Repeat steps 1 and 2 1000 times
    4 Use the (1 − α)-quantile of the distribution of d₁ as the threshold
    (Exact results: Zanella, 2009; asymptotic results from random matrix theory:
    Shabalin, 2013; Paul, 2007; Baik, 2006...)
    ⇒ This supposes σ is known!
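
    The recipe can be sketched in R as follows (toy sizes, σ assumed known):

      n <- 50; p <- 20; sigma <- 1; alpha <- 0.05
      d1_null    <- replicate(1000, svd(matrix(rnorm(n * p, sd = sigma), n, p))$d[1])
      lambda_qut <- quantile(d1_null, 1 - alpha)    # threshold at the bulk edge under the null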

  42. Selecting λ
    ⇒ Two steps: QUT to select the rank K - shrinkage with CV
    Model: π_ijc = exp(β_j(c) + Σ_{k=1}^K d_k u_ik v_jk(c)) / Σ_{c′=1}^{C_j} exp(β_j(c′) + Σ_{k=1}^K d_k u_ik v_jk(c′))
    Likelihood: L(β, U, D, V) = − Σ_{i=1}^n Σ_{j=1}^m Σ_{c=1}^{C_j} A^j_ic log(π_ijc) + λ Σ_{k=1}^K d_k
    1 Generate under the null of no interaction and take λ as the
    quantile of the distribution of d₁: good rank recovery
    2 For the rank K_QUT, estimate λ by cross-validation to
    determine the amount of shrinkage (Lasso + LS)
    k-fold CV; the λ with the best out-of-sample deviance is chosen.

  43. Simulations
    • n: 50, 100, 300
    • m: 20, 100, 300 - 3 categories per variable
    • rank of the interaction K: 2, 6
    • ratio d₁/d₂: 2, 1
    • strength of the interaction (low, strong)
    ũ_i ∼ N_K(0, diag(d₁, ..., d_K)),   ṽ_j(c) ∼ N_K(0, diag(d₁, ..., d_K))
    θ^c_ij = −(1/2) ‖ũ_i − ṽ_j(c)‖²,   P(x_ij = c) ∝ e^{θ^c_ij}
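
    One configuration of this design can be simulated in R as follows (n = 50, m = 20, K = 2, d = (2, 1); the "strength" factor is omitted in this sketch):

      set.seed(3)
      n <- 50; m <- 20; C <- 3; K <- 2; d <- c(2, 1)
      U <- matrix(rnorm(n * K), n, K) %*% diag(sqrt(d))                 # u_i ~ N_K(0, diag(d))
      V <- lapply(1:m, function(j) matrix(rnorm(C * K), C, K) %*% diag(sqrt(d)))  # v_j(c)
      X <- sapply(1:m, function(j) {
        D2    <- as.matrix(dist(rbind(U, V[[j]])))[1:n, n + (1:C)]^2    # ||u_i - v_j(c)||^2
        theta <- -0.5 * D2
        apply(theta, 1, function(t) sample(1:C, 1, prob = exp(t - max(t))))
      })
      dim(X)   # n x m matrix of sampled categories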

  44. Simulations
    n p rank ratio strength model MCA
    1 50 20 2 1 0.1 0.044 0.035
    2 50 20 2 1 1 0.020 0.045
    3 50 20 2 2 0.1 0.048 0.036
    4 50 20 2 2 1 0.0206 0.042
    5 50 20 6 1 0.1 0.111 0.064
    6 50 20 6 1 1 0.045 0.026
    7 50 20 6 2 0.1 0.115 (0.028) 0.071
    8 50 20 6 2 1 0.032 0.051
    9 300 100 2 1 0.1 0.005 0.006
    10 300 100 2 1 1 0.004 0.042
    11 300 100 2 2 0.1 0.0047 0.005
    12 300 100 2 2 1 0.0037 (0.00369) 0.040
    13 300 300 2 1 0.1 0.003 0.004
    14 300 300 2 1 1 0.002 0.039
    15 300 300 2 2 0.1 0.003 0.004
    16 300 300 2 2 1 0.002 0.039
    17 300 100 6 1 0.1 0.019 0.015
    18 300 100 6 1 1 0.011 0.023
    19 300 100 6 2 0.1 0.018 (0.010) 0.017
    20 300 100 6 2 1 0.010 0.056
    21 300 300 6 1 0.1 0.011 0.008
    22 300 300 6 1 1 0.006 0.022
    23 300 300 6 2 0.1 0.009 0.012

  45. Simulation


  47. Overfitting
    ⇒ n = 50, p = 20, r = 6, strength = 0.1 ⇒ Alcohol data....
    [Figure: deviance (essai$dev) against iteration index; index up to 15000,
    deviance between roughly 300 and 800]

  48. Overfitting
    ⇒ n = 50, p = 20, r = 6, strength = 0.1 ⇒ Alcohol data....
    [Figure: deviance (essai$dev) against iteration index; index up to 600,
    deviance between roughly 20000 and 26000]

  49. Conclusion
    MCA can be seen as a linearized estimate of the parameters of the
    multinomial logit bilinear model ⇒ MCA is a proxy to estimate the
    model's parameters (small interactions)
    • graphics
    • mixed data (quantitative, categorical) - FAMD / Multiple Factor Analysis
    for groups of variables
    • mixture of MCA / mixture of PPCA
    • selecting the rank with BIC?? Fixed effects, asymptotics, n vs. p?
    • regularization in MCA to tackle overfitting issues
    • missing values