
Arthur Tenenhaus

S³ Seminar
February 13, 2015

Arthur Tenenhaus

(CentraleSupelec, Laboratoire des Signaux et Systèmes, France)

https://s3-seminar.github.io/seminars/arthur-tenenhaus

Title — Structured data analysis

Abstract — In contrast to standard data, which are structured as a single individuals × variables data matrix, structured data are characterized by multiple, heterogeneous and interconnected sources of information, potentially of high dimension. In addition, each source of information may itself have a complex structure (e.g. a tensor structure). Analyzing such data while taking their natural structure into account appears essential, but it requires the development of new statistical techniques, which have been the core of my research for many years. More specifically, I will present a unified framework for multiblock, multigroup and multiway data analysis through Regularized Generalized Canonical Correlation Analysis.


Transcript

  1. Structured data analysis with RGCCA Arthur Tenenhaus 2015/02/13

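  2. Overview of the presentation

    Part I. Multi-block analysis: blocks X1, X2, ..., XJ with p1, p2, p3, ... variables, all observed on the same n individuals.
    Part II. Multi-block and multi-way analysis (diagram labels: X2, X21).
    Part III. Multi-group analysis: groups X1, X2, ..., XI with n1, n2, ..., nI individuals, all described by the same p variables.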
  3. Part I. Multi-block analysis

    M. Tenenhaus (HEC, Paris), V. Guillemot (ICM), T. Löfstedt (CEA, Neurospin), C. Philippe (IGR), V. Frouin (CEA, Neurospin), P. Groenen (Erasmus University, Rotterdam)
  4. Economic inequality and political instability (Data from Russett, 1964)

    Economic inequality
      Agricultural inequality
        GINI : Inequality of land distributions
        FARM : % farmers that own half of the land (> 50)
        RENT : % farmers that rent all their land
      Industrial development
        GNPR : Gross national product per capita ($ 1955)
        LABO : % of labor force employed in agriculture
    Political instability
        INST : Instability of executive (45-61)
        ECKS : Nb of violent internal war incidents (46-61)
        DEAT : Nb of people killed as a result of civic group violence (50-62)
        D-STAB : Stable democracy
        D-UNST : Unstable democracy
        DICT : Dictatorship
  5. Economic inequality and political instability (Data from Russett, 1964)

    Three data blocks: X1 = (Gini, Farm, Rent), X2 = (Gnpr, Labo), X3 = (Inst, Ecks, Deat, Demo).

    Country      Gini  Farm  Rent  Gnpr  Labo  Inst  Ecks  Deat  Demo
    Argentina    86.3  98.2  32.9   374    25  13.6    57   217     2
    Australia    92.9  99.6     *  1215    14  11.3     0     0     1
    Austria      74.0  97.4  10.7   532    32  12.8     4     0     2
    …
    France       58.3  86.1  26.0  1046    26  16.3    46     1     2
    …
    Yugoslavia   43.7  79.8   0.0   297    67   0.0     9     0     3

    Demo: 1 = Stable democracy, 2 = Unstable democracy, 3 = Dictatorship
  6. Path diagram

    Agricultural inequality (X1): GINI, FARM, RENT
    Industrial development (X2): GNPR, LABO
    Political instability (X3): INST, ECKS, DEAT, D-STB, D-INS, DICT
    Design: C13 = 1, C23 = 1, C12 = 0
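  7. Block components

    With $x_{jh}$ denoting the $h$-th variable (column) of block $X_j$:
    $y_1 = X_1 a_1 = a_{11} x_{11} + a_{12} x_{12} + a_{13} x_{13}$
    $y_2 = X_2 a_2 = a_{21} x_{21} + a_{22} x_{22}$
    $y_3 = X_3 a_3 = a_{31} x_{31} + a_{32} x_{32} + a_{33} x_{33} + a_{34} x_{34} + a_{35} x_{35} + a_{36} x_{36}$

    Block components should satisfy two properties at the same time:
    (i) Block components explain their own block well.
    (ii) Block components are as correlated as possible for connected blocks.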
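  8. Covariance-based criteria

    $c_{jk} = 1$ if blocks are linked, 0 otherwise, and $c_{jj} = 0$.

    Some multi-block methods (GENERALIZED CANONICAL CORRELATION ANALYSIS):
    SUMCOR (Horst, 1961): maximize $\sum_{j \neq k} c_{jk} \, \mathrm{cor}(X_j a_j, X_k a_k)$
    SSQCOR (Mathes, 1993 ; Hanafi, 2004): maximize $\sum_{j \neq k} c_{jk} \, \mathrm{cor}^2(X_j a_j, X_k a_k)$
    SABSCOR (Mathes, 1993 ; Hanafi, 2004): maximize $\sum_{j \neq k} c_{jk} \, |\mathrm{cor}(X_j a_j, X_k a_k)|$

    Some modified multi-block methods (GENERALIZED CANONICAL COVARIANCE ANALYSIS):
    SUMCOV (Van de Geer, 1984): maximize over all $\|a_j\| = 1$: $\sum_{j \neq k} c_{jk} \, \mathrm{cov}(X_j a_j, X_k a_k)$
    SSQCOV (Hanafi & Kiers, 2006): maximize over all $\|a_j\| = 1$: $\sum_{j \neq k} c_{jk} \, \mathrm{cov}^2(X_j a_j, X_k a_k)$
    SABSCOV (Krämer, 2006): maximize over all $\|a_j\| = 1$: $\sum_{j \neq k} c_{jk} \, |\mathrm{cov}(X_j a_j, X_k a_k)|$

    Link between the two families: $\mathrm{cov}^2(X_j a_j, X_k a_k) = \mathrm{var}(X_j a_j) \, \mathrm{cor}^2(X_j a_j, X_k a_k) \, \mathrm{var}(X_k a_k)$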
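  9. Covariance-based criteria

    $c_{jk} = 1$ if blocks are linked, 0 otherwise, and $c_{jj} = 0$.

    SUMCOR: maximize over all $\mathrm{var}(X_j a_j) = 1$: $\sum_{j \neq k} c_{jk} \, \mathrm{cov}(X_j a_j, X_k a_k)$
    SSQCOR: maximize over all $\mathrm{var}(X_j a_j) = 1$: $\sum_{j \neq k} c_{jk} \, \mathrm{cov}^2(X_j a_j, X_k a_k)$
    SABSCOR: maximize over all $\mathrm{var}(X_j a_j) = 1$: $\sum_{j \neq k} c_{jk} \, |\mathrm{cov}(X_j a_j, X_k a_k)|$
    SUMCOV: maximize over all $\|a_j\| = 1$: $\sum_{j \neq k} c_{jk} \, \mathrm{cov}(X_j a_j, X_k a_k)$
    SSQCOV: maximize over all $\|a_j\| = 1$: $\sum_{j \neq k} c_{jk} \, \mathrm{cov}^2(X_j a_j, X_k a_k)$
    SABSCOV: maximize over all $\|a_j\| = 1$: $\sum_{j \neq k} c_{jk} \, |\mathrm{cov}(X_j a_j, X_k a_k)|$

    $\mathrm{cov}^2(X_j a_j, X_k a_k) = \mathrm{var}(X_j a_j) \, \mathrm{cor}^2(X_j a_j, X_k a_k) \, \mathrm{var}(X_k a_k)$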
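
The identity linking squared covariance to squared correlation and the two variances can be checked numerically. The snippet below is a minimal sketch, not taken from the slides: the two toy components `y1` and `y2` are made-up values, and the helper names are hypothetical. It uses the biased (1/n) estimators throughout.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Biased (1/n) covariance of two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def var(u):
    return cov(u, u)

def cor(u, v):
    return cov(u, v) / math.sqrt(var(u) * var(v))

# Two toy block components (hypothetical values, not from the slides).
y1 = [1.0, 2.0, 4.0, 7.0]
y2 = [2.0, 1.0, 5.0, 6.0]

lhs = cov(y1, y2) ** 2                       # cov^2(y1, y2)
rhs = var(y1) * cor(y1, y2) ** 2 * var(y2)   # var * cor^2 * var
```

Under a unit-variance constraint the two variance factors equal 1, which is why the correlation criteria can be written with covariances.
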
  10. RGCCA optimization problem

    $$\operatorname*{argmax}_{a_1, a_2, \ldots, a_J} \sum_{j \neq k} c_{jk} \, g\big(\mathrm{cov}(X_j a_j, X_k a_k)\big)$$

    subject to the constraints

    $$(1 - \tau_j) \, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1, \quad j = 1, \ldots, J$$

    where:
    - $g$ is the identity (Horst scheme), the square (Factorial scheme) or the absolute value (Centroid scheme);
    - $\tau_j$ is a shrinkage constant between 0 and 1;
    - $c_{jk} = 1$ if $X_j$ and $X_k$ are connected, 0 otherwise.

    A monotone convergent algorithm is available for this problem. The Schäfer and Strimmer formula can be used for an optimal determination of the shrinkage constants.

    • Tenenhaus A. and Tenenhaus M., Regularized Generalized Canonical Correlation Analysis, Psychometrika, vol. 76, Issue 2, pp. 257-284, 2011
    • Tenenhaus A., Philippe C., Frouin V., Kernel Generalized Canonical Correlation Analysis, Computational Statistics and Data Analysis, in revision.
    • Tenenhaus A. and Guillemot V. (2013): RGCCA Package. http://cran.project.org/web/packages/RGCCA/index.html
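
To make the role of the scheme function g concrete, here is a small sketch that evaluates the RGCCA criterion for given covariances between block components. The function name `rgcca_criterion` and the covariance values are hypothetical; only the three schemes and the Russett design (C13 = C23 = 1, C12 = 0) come from the slides.

```python
def rgcca_criterion(covs, C, scheme="factorial"):
    """Sum over j != k of c_jk * g(cov_jk).

    covs[j][k] holds cov(X_j a_j, X_k a_k); C is the 0/1 design matrix.
    """
    g = {
        "horst": lambda x: x,          # identity
        "factorial": lambda x: x * x,  # square
        "centroid": abs,               # absolute value
    }[scheme]
    J = len(C)
    return sum(C[j][k] * g(covs[j][k])
               for j in range(J) for k in range(J) if j != k)

# Design of the Russett example: C13 = C23 = 1, C12 = 0 (symmetric).
C = [[0, 0, 1],
     [0, 0, 1],
     [1, 1, 0]]
# Hypothetical covariances between the three block components.
covs = [[1.0, 0.2, 0.5],
        [0.2, 1.0, -0.4],
        [0.5, -0.4, 1.0]]

horst = rgcca_criterion(covs, C, "horst")
factorial = rgcca_criterion(covs, C, "factorial")
centroid = rgcca_criterion(covs, C, "centroid")
```

With the Horst scheme the negative covariance lowers the criterion, whereas the factorial and centroid schemes reward strong links of either sign.
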
  11. Choice of the shrinkage constant $\tau_j$ (part 1)

    $$\operatorname*{argmax}_{a_1, a_2} \mathrm{cov}(X_1 a_1, X_2 a_2) \quad \text{subject to the constraints} \quad (1 - \tau_j) \, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1, \; j = 1, 2$$

    Special cases:

    Method | Criterion | Constraints
    PLS regression | Maximize $\mathrm{Cov}(X_1 a_1, X_2 a_2)$ | $\|a_1\| = \|a_2\| = 1$
    Canonical Correlation Analysis | Maximize $\mathrm{Cor}(X_1 a_1, X_2 a_2)$ | $\mathrm{Var}(X_1 a_1) = \mathrm{Var}(X_2 a_2) = 1$
    Redundancy analysis of X1 with respect to X2 | Maximize $\mathrm{Cor}(X_1 a_1, X_2 a_2) \, \mathrm{Var}(X_1 a_1)^{1/2}$ | $\|a_1\| = 1$, $\mathrm{Var}(X_2 a_2) = 1$

    Slide annotations: Components $X_1 a_1$ and $X_2 a_2$ are well correlated. 1st component is stable. No stability condition for 2nd component.
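
As a worked check (added here, not part of the original slides), the two extreme values of the shrinkage constant turn the RGCCA constraint into the constraints listed for the special cases. The assignment of tau values to each method is derived from those constraints rather than quoted from the slide.

```latex
\begin{align*}
\tau_j = 1 &: \quad (1 - 1)\,\mathrm{var}(X_j a_j) + 1 \cdot \|a_j\|^2 = 1
  \;\Longleftrightarrow\; \|a_j\| = 1, \\
\tau_j = 0 &: \quad (1 - 0)\,\mathrm{var}(X_j a_j) + 0 \cdot \|a_j\|^2 = 1
  \;\Longleftrightarrow\; \mathrm{var}(X_j a_j) = 1. \\[4pt]
\tau_1 = \tau_2 = 1 &: \quad \max \mathrm{cov}(X_1 a_1, X_2 a_2)
  \text{ with } \|a_1\| = \|a_2\| = 1 \quad \text{(PLS regression)}, \\
\tau_1 = \tau_2 = 0 &: \quad \mathrm{cov}(X_1 a_1, X_2 a_2) = \mathrm{cor}(X_1 a_1, X_2 a_2)
  \text{ since both variances equal } 1 \quad \text{(CCA)}, \\
\tau_1 = 1,\ \tau_2 = 0 &: \quad \mathrm{cov}(X_1 a_1, X_2 a_2)
  = \mathrm{cor}(X_1 a_1, X_2 a_2)\,\mathrm{var}(X_1 a_1)^{1/2}
  \text{ since } \mathrm{var}(X_2 a_2) = 1 \quad \text{(redundancy analysis)}.
\end{align*}
```
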
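  12. Choice of the shrinkage constant $\tau_j$ (part 2)

    $$\operatorname*{argmax}_{a_1, a_2, \ldots, a_J} \sum_{j \neq k} c_{jk} \, g\big(\mathrm{cov}(X_j a_j, X_k a_k)\big) \quad \text{subject to the constraints} \quad (1 - \tau_j) \, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1, \; j = 1, \ldots, J$$

    $\tau_j$ ranges from 0 (favoring correlation) to 1 (favoring stability).
    The Schäfer and Strimmer formula can be used for an optimal determination of the shrinkage constants.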
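  13. Choice of the design matrix C

    Hierarchical models

    (a) One second order block:
    $$\max_{a_1, a_2, \ldots, a_{J+1}} \sum_{j=1}^{J} g\big(\mathrm{cov}(X_j a_j, X_{J+1} a_{J+1})\big) \quad \text{s.t.} \quad (1 - \tau_j) \, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1, \; j = 1, \ldots, J + 1$$

    (b) Several second order blocks:
    $$\max_{a_1, a_2, \ldots, a_J} \sum_{j=1}^{J_1} \sum_{k=J_1+1}^{J} g\big(\mathrm{cov}(X_j a_j, X_k a_k)\big) \quad \text{s.t.} \quad (1 - \tau_j) \, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1, \; j = 1, \ldots, J$$

    Very often: $X_1, \ldots, X_{J_1}$ = Predictor blocks and $X_{J_1+1}, \ldots, X_J$ = Response blocks.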
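
The two hierarchical designs can be encoded as 0/1 design matrices. The sketch below is an illustration under assumptions: the function names are hypothetical, blocks are indexed from 0, and C is built symmetric, so a sum over all j ≠ k counts each link twice, which only rescales the criterion.

```python
def one_second_order_block(J):
    """Design for J first-order blocks, each linked only to block J+1."""
    size = J + 1
    C = [[0] * size for _ in range(size)]
    for j in range(J):
        C[j][J] = C[J][j] = 1
    return C

def predictors_to_responses(J1, J):
    """Design linking predictor blocks 1..J1 to response blocks J1+1..J."""
    C = [[0] * J for _ in range(J)]
    for j in range(J1):
        for k in range(J1, J):
            C[j][k] = C[k][j] = 1
    return C

C_a = one_second_order_block(2)        # two blocks plus one second order block
C_b = predictors_to_responses(2, 4)    # two predictor and two response blocks
```

In both designs the diagonal stays at zero, matching the convention cjj = 0 used throughout the talk.
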
  14. Special cases of RGCCA (among others)

    Two-block case:
    • PLS Regression: Wold S., Martens & Wold H. (1983): The multivariate calibration problem in chemistry solved by the PLS method. In Proc. Conf. Matrix Pencils, Ruhe A. & Kåstrøm B. (Eds), March 1982, Lecture Notes in Mathematics, Springer Verlag, Heidelberg, p. 286-293.
    • Redundancy analysis: Barker M. & Rayens W. (2003): Partial least squares for discrimination, Journal of Chemometrics, 17, 166-173.
    • Regularized CCA: Vinod H. D. (1976): Canonical ridge and econometrics of joint production. Journal of Econometrics, 4, 147–166.
    • Inter-battery factor analysis: Tucker L.R. (1958): An inter-battery method of factor analysis, Psychometrika, vol. 23, n°2, pp. 111-136.

    Multi-block case:
    • MCOA: Chessel D. and Hanafi M. (1996): Analyse de la co-inertie de K nuages de points. Revue de Statistique Appliquée, 44, 35-60.
    • SSQCOV: Hanafi M. & Kiers H.A.L. (2006): Analysis of K sets of data, with differential emphasis on agreement between and within sets, Computational Statistics & Data Analysis, 51, 1491-1508.
    • SUMCOR: Horst P. (1961): Relations among m sets of variables, Psychometrika, vol. 26, pp. 126-149.
    • SSQCOR: Kettenring J.R. (1971): Canonical analysis of several sets of variables, Biometrika, 58, 433-451.
    • MAXDIFF: Van de Geer J. P. (1984): Linear relations among k sets of variables. Psychometrika, 49, 70-94.
    • PLS path modeling (mode B): Tenenhaus M., Esposito Vinzi V., Chatelin Y.-M., Lauro C. (2005): PLS path modeling. Computational Statistics and Data Analysis, 48, 159-205.
    • Generalized Orthogonal MCOA: Vivien M. & Sabatier R. (2003): Generalized orthogonal multiple co-inertia analysis (-PLS): new multiblock component and regression methods, Journal of Chemometrics, 17, 287-301.
    • Carroll's GCCA: Carroll, J.D. (1968): A generalization of canonical correlation analysis to three or more sets of variables, Proc. 76th Conv. Am. Psych. Assoc., pp. 227-228.
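  15. Monotone convergent algorithms for these criteria

    $$\operatorname*{argmax}_{a_1, a_2, \ldots, a_J} \sum_{j \neq k} c_{jk} \, g\big(\mathrm{cov}(X_j a_j, X_k a_k)\big) \quad \text{subject to} \quad (1 - \tau_j) \, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1, \; j = 1, \ldots, J$$

    Two key ingredients:
    (i) Block relaxation
    (ii) Majorization by Minorization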
  16. The RGCCA algorithm (primal version)

    $c_{jk} = 1$ if blocks are linked, 0 otherwise, and $c_{jj} = 0$.

    Initial step: choose $a_j$.

    Outer estimation (explains the block): $y_j = X_j a_j$, with $(1 - \tau_j) \, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1$.

    Inner estimation (explains relation between blocks): $z_j = \sum_{k \neq j} e_{jk} y_k$

    Choice of weights $e_{jk}$:
    - Horst : $e_{jk} = c_{jk}$
    - Centroid : $e_{jk} = c_{jk} \, \mathrm{sign}\big(\mathrm{cor}(y_j, y_k)\big)$
    - Factorial : $e_{jk} = c_{jk} \, \mathrm{cov}(y_j, y_k)$

    Update:
    $$a_j = \Big[ z_j^t X_j \big( \tau_j I + (1 - \tau_j) \tfrac{1}{n} X_j^t X_j \big)^{-1} X_j^t z_j \Big]^{-1/2} \big( \tau_j I + (1 - \tau_j) \tfrac{1}{n} X_j^t X_j \big)^{-1} X_j^t z_j$$

    Dimension of the matrix to invert = $p_j \times p_j$.

    Iterate until convergence of the criterion.
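
The primal iteration on this slide can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the RGCCA package code: the function name `rgcca_primal`, the random initialisation, the stopping rule, and the use of biased (1/n) covariances on column-centered blocks are all choices made here. Blocks are updated one after another using the most recent components.

```python
import numpy as np

def rgcca_primal(blocks, C, tau, scheme="factorial", n_iter=100, tol=1e-10, seed=0):
    """Sketch of the primal RGCCA iteration for column-centered blocks."""
    rng = np.random.default_rng(seed)
    n = blocks[0].shape[0]
    J = len(blocks)
    # M_j = tau_j I + (1 - tau_j) (1/n) X_j^t X_j  (p_j x p_j)
    M = [tau[j] * np.eye(X.shape[1]) + (1 - tau[j]) * (X.T @ X) / n
         for j, X in enumerate(blocks)]
    cov = lambda u, v: float(u @ v) / n   # biased covariance of centered vectors
    g = {"horst": lambda x: x, "factorial": lambda x: x * x, "centroid": abs}[scheme]

    def criterion(y):
        return sum(C[j][k] * g(cov(y[j], y[k]))
                   for j in range(J) for k in range(J) if j != k)

    # Initial step: random a_j rescaled so that a_j^t M_j a_j = 1.
    a = []
    for j, X in enumerate(blocks):
        a0 = rng.standard_normal(X.shape[1])
        a.append(a0 / np.sqrt(a0 @ M[j] @ a0))
    y = [X @ aj for X, aj in zip(blocks, a)]
    history = [criterion(y)]

    for _ in range(n_iter):
        for j, X in enumerate(blocks):
            # Inner estimation: z_j = sum over k != j of e_jk y_k
            z = np.zeros(n)
            for k in range(J):
                if k == j or C[j][k] == 0:
                    continue
                c = cov(y[j], y[k])
                e = {"horst": 1.0, "factorial": c, "centroid": np.sign(c)}[scheme]
                z += C[j][k] * e * y[k]
            # Outer estimation: a_j proportional to M_j^{-1} X_j^t z_j,
            # normalised so that the RGCCA constraint holds.
            w = np.linalg.solve(M[j], X.T @ z)
            a[j] = w / np.sqrt(z @ X @ w)
            y[j] = X @ a[j]
        history.append(criterion(y))
        if history[-1] - history[-2] < tol:
            break
    return a, y, history
```

A possible call, on hypothetical random data with the Russett design, is `rgcca_primal(blocks, [[0, 0, 1], [0, 0, 1], [1, 1, 0]], tau=[1.0, 0.5, 0.0])`; the returned `history` lets one inspect the criterion at each sweep.
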
  17. The RGCCA algorithm (dual version)

    $c_{jk} = 1$ if blocks are linked, 0 otherwise, and $c_{jj} = 0$.

    Initial step: choose $\alpha_j$, with $a_j = X_j^t \alpha_j$.

    Outer estimation (explains the block): $y_j = X_j X_j^t \alpha_j$, with $\alpha_j^t \big[ \tau_j I_n + (1 - \tau_j) \tfrac{1}{n} X_j X_j^t \big] X_j X_j^t \alpha_j = 1$.

    Inner estimation (explains relation between blocks): $z_j = \sum_{k \neq j} e_{jk} y_k$

    Choice of weights $e_{jk}$:
    - Horst : $e_{jk} = c_{jk}$
    - Centroid : $e_{jk} = c_{jk} \, \mathrm{sign}\big(\mathrm{cor}(y_j, y_k)\big)$
    - Factorial : $e_{jk} = c_{jk} \, \mathrm{cov}(y_j, y_k)$

    Update:
    $$\alpha_j = \Big[ z_j^t X_j X_j^t \big( \tau_j I + (1 - \tau_j) \tfrac{1}{n} X_j X_j^t \big)^{-1} z_j \Big]^{-1/2} \big( \tau_j I + (1 - \tau_j) \tfrac{1}{n} X_j X_j^t \big)^{-1} z_j$$

    Dimension of the matrix to invert = $n \times n$.

    Iterate until convergence of the criterion.
  18. Weight vectors

    [Figure: path diagram with the estimated weights. Agricultural inequality (X1): GINI, FARM, RENT. Industrial development (X2): GNPR, LABO. Political instability (X3): ECKS, DEAT, D-STB, INST, DICT. Weights shown: 0.66, 0.74, 0.10, 0.69, -0.72, 0.17, 0.44, 0.48, -0.56, 0.49. Correlations between connected components: Corr = 0.428 and Corr = -0.767.]

    Small dimensional block settings ⟹ primal algorithm for RGCCA
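  19. Bootstrap confidence intervals

    [Figure: bootstrap confidence intervals for the weights of GINI, FARM, RENT, GNPR, LABO, ECKS, DEAT, D-STB, INST and DICT in the path diagram (Agr. ineq., Ind. dev., Pol. inst.). Weights shown: 0.66, 0.74, 0.10, 0.69, -0.72, 0.17, 0.44, 0.48, -0.56, 0.49. Corr = 0.428 and Corr = -0.767.]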
  20. Data visualization

    [Figure axes: Agricultural inequality, Industrial development]

    These countries experienced a period of dictatorship after 1964:
    Greece: colonels' dictatorship, 1967-1974
    Chile: Pinochet's military regime, 1973-1990
    Argentina: military dictatorship, 1976-1983
    Brazil: Branco's military dictatorship, 1964-1985
  21. Glioma Cancer Data (Department of Pediatric Oncology of the Gustave Roussy Institute)

    Blocks: Transcriptomic data (X1) = Gene 1 to Gene 15201, CGH data (X2) = CGH1 to CGH 1909, outcome (X3) = Localization.

    Patient      Gene 1  Gene 2   …   Gene 15201   CGH1    …   CGH 1909   Localization
    Patient 1     0.18   -0.21    …     -0.73      0.00    …    -0.55     Hemisphere
    Patient 2     1.15   -0.45    …      0.27     -0.30    …     0.00     Midline
    Patient 3     1.35    0.17    …      0.22      0.33    …     0.64     DIPG
    …
    Patient 53    1.39    0.18    …     -0.17      0.00    …     0.43     Hemisphere
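
The outcome block is categorical (Hemisphere, Midline, DIPG), whereas the RGCCA view on the next slide shows it as two 0/1 columns, Hemisphere and DIPG. A minimal sketch of such indicator coding is given below. The helper name `dummy_code` is hypothetical, and treating Midline as the all-zero reference category is inferred from the values displayed for Patient 2.

```python
def dummy_code(labels, columns):
    """One 0/1 indicator column per category in `columns`.

    Categories not listed in `columns` are coded as a row of zeros.
    """
    return [[1 if lab == col else 0 for col in columns] for lab in labels]

# Localization of patients 1, 2, 3 and 53, as shown in the table.
localization = ["Hemisphere", "Midline", "DIPG", "Hemisphere"]
X3 = dummy_code(localization, columns=["Hemisphere", "DIPG"])
```
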
  22. Glioma Cancer Data: from an RGCCA viewpoint (Department of Pediatric Oncology of the Gustave Roussy Institute)

    High dimensional block settings ⟹ dual algorithm for RGCCA

    Block 1:
    Patient      Gene 1   …   Gene 15201
    Patient 1     0.18    …     -0.73
    Patient 2     1.15    …      0.27
    Patient 3     1.35    …      0.22
    …
    Patient 53    1.39    …     -0.17

    Block 2:
    Patient      CGH1     …   CGH 1909
    Patient 1     0.00    …     -0.55
    Patient 2    -0.30    …      0.00
    Patient 3     0.33    …      0.64
    …
    Patient 53    0.00    …      0.43

    Block 3:
    Patient      Hemisphere   DIPG
    Patient 1        1          0
    Patient 2        0          0
    Patient 3        0          1
    …
    Patient 53       1          0

    RGCCA with factorial scheme, $\tau_1 = 1$, $\tau_2 = 1$ and $\tau_3 = 0$.
    Design labels shown on the diagram: C13 = 1, C23 = 1, C12 = 1, C12 = 0.
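
With 53 patients and thousands of variables per block, the dual update solves an n × n system instead of a p × p one. The sketch below checks numerically that the primal and dual updates of the two algorithm slides give the same weight vector through a = X^t α. Function names, the toy dimensions and the random data are assumptions made for illustration; τ > 0 is used so that both systems are invertible.

```python
import numpy as np

def primal_update(X, z, tau):
    """a_j from a p x p linear system."""
    n, p = X.shape
    M = tau * np.eye(p) + (1 - tau) * (X.T @ X) / n
    w = np.linalg.solve(M, X.T @ z)
    return w / np.sqrt(z @ X @ w)

def dual_update(X, z, tau):
    """alpha_j from an n x n linear system; a_j = X^t alpha_j."""
    n, _ = X.shape
    K = X @ X.T
    N = tau * np.eye(n) + (1 - tau) * K / n
    v = np.linalg.solve(N, z)
    return v / np.sqrt(z @ K @ v)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 40))      # few individuals, many variables
X = X - X.mean(axis=0)                # column-centered block
z = rng.standard_normal(5)            # hypothetical inner component
a = primal_update(X, z, tau=0.3)
alpha = dual_update(X, z, tau=0.3)
```

The agreement follows from the identity (τI_p + c X^t X)^{-1} X^t = X^t (τI_n + c X X^t)^{-1}, which holds for any c whenever both inverses exist.
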
  23. None
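  24. Block components

    $y_1 = X_1 a_1 = a_{11} \, \mathrm{Gene}_1 + \cdots + a_{1,15201} \, \mathrm{Gene}_{15201}$
    $y_2 = X_2 a_2 = a_{21} \, \mathrm{CGH}_1 + \cdots + a_{2,1909} \, \mathrm{CGH}_{1909}$
    $y_3 = X_3 a_3 = a_{31} \, \mathrm{Hemisphere} + a_{32} \, \mathrm{DIPG}$

    Block components should satisfy three properties at the same time:
    (i) Block components explain their own block well.
    (ii) Block components are as correlated as possible for connected blocks.
    (iii) Block components are built from sparse weight vectors $a_j$.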
  25. Structured variable selection for RGCCA

    [Diagram labels: Genotype, Gene Expression, Functional MRI, Behavioral data (Clinic, psychometric), Intermediate phenotype, Final phenotype]
  26. (Structured) variable selection for RGCCA

    $$\operatorname*{argmax}_{a_1, a_2, \ldots, a_J} \sum_{j \neq k} c_{jk} \, g\big(\mathrm{cov}(X_j a_j, X_k a_k)\big) \quad \text{subject to} \quad \|a_j\|_2 = 1 \ \text{and} \ \Omega(a_j) \leq s_j, \; j = 1, \ldots, J$$

    • LASSO: $\Omega(a_j) = \|a_j\|_1$
    • Group LASSO: $\Omega(a_j) = \sum_{g \in G} \|a_{jg}\|_2$
    • Fused LASSO: $\Omega(a_j) = \sum_{h=1}^{p_j} |a_{jh}| + \sum_{h=2}^{p_j} |a_{jh} - a_{j,h-1}|$

    • Tenenhaus A., Philippe C., Guillemot V., Lê Cao K.-A., Grill J., Frouin V., Variable Selection for Generalized Canonical Correlation Analysis, Biostatistics, 15 (3), 569-583, 2014.
    • Löfstedt T., Hadj-Salem F., Guillemot V., Philippe C., Duchesnay E., Frouin V., and Tenenhaus A., (2014). Structured variable selection for generalized canonical correlation analysis. In: Proceedings of the 8th International Conference on Partial Least Squares and Related Methods (PLS14), Paris, France.
    • Tenenhaus A. and Guillemot V. (2013): RGCCA Package. http://cran.project.org/web/packages/RGCCA/index.html
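
The three penalties named on this slide can be written as small functions. The sketch below uses their standard textbook forms, which is an assumption: the exact expressions on the slide did not survive extraction, and the fused LASSO is shown without separate tuning weights. Function names and the toy weight vector are hypothetical.

```python
import math

def lasso(a):
    """L1 norm: sum of absolute values."""
    return sum(abs(x) for x in a)

def group_lasso(a, groups):
    """Sum of the Euclidean norms of the sub-vectors indexed by each group."""
    return sum(math.sqrt(sum(a[h] ** 2 for h in g)) for g in groups)

def fused_lasso(a):
    """L1 norm plus absolute differences between consecutive weights."""
    return lasso(a) + sum(abs(a[h] - a[h - 1]) for h in range(1, len(a)))

a = [3.0, -4.0, 0.0, 1.0]                    # toy weight vector
p_lasso = lasso(a)
p_group = group_lasso(a, groups=[[0, 1], [2, 3]])
p_fused = fused_lasso(a)
```

The group penalty shrinks whole groups of weights together, while the fused penalty favors weights that vary smoothly along an ordering of the variables.
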
  27. Signature stability

  28. Predictive performances

  29. Visualization GE1 CGH1

  30. Part II. Multi-block and Multi-way analysis

    G. Lechuga (CentraleSupélec, L2S), L. Le Brusquet (CentraleSupélec, L2S), L. Puybasset, V. Perlbarg & D. Galanaud (Hôpital La Pitié-Salpêtrière)
  31. Multimodal neuroimaging from a multiblock viewpoint (Brain and Spine Institute, La Pitié-Salpêtrière Hospital)

    Anatomical MRI (X1), $p_1 \sim 10^4$, $n \sim 100$:
    Patient      Anat 1   …   Anat p1
    Patient 1     0.18    …    -0.73
    Patient 2     1.15    …     0.27
    Patient 3     1.35    …     0.22
    …
    Patient n     1.39    …    -0.17

    Functional MRI (X2), $p_2 \sim 10^4$, $n \sim 100$:
    Patient      fMRI 1   …   fMRI p2
    Patient 1     0.00    …    -0.55
    Patient 2    -0.30    …     0.00
    Patient 3     0.33    …     0.64
    …
    Patient n     0.00    …     0.43

    Behavior (X3), $p_3 \sim 10$:
    Patient      BEH 1    …   BEH p3
    Patient 1     0.18    …    -0.73
    Patient 2     1.15    …     0.27
    Patient 3     1.35    …     0.22
    …
    Patient n     1.39    …    -0.17

    Design labels shown on the diagram: C13 = 1, C23 = 1, C12 = 1, C13 = 0.
  32. From Multiblock data to …

    Blocks observed on the same n individuals: X1 (Anatomical MRI), X2 (Diffusion MRI), X3 (Functional MRI) and X4 (PET), each with p1 variables, and X5 (Final Phenotype) with p2 variables.
    RGCCA algorithm: one weight vector per block (a1, a2, a3, a4, a5), hence 4 × p1 parameters to estimate for X1, …, X4.
    Multiway RGCCA (MGCCA)

    • Tenenhaus A., Le Brusquet L. Regularized Generalized Canonical Correlation Analysis extended to three way data, International Conference of the ERCIM WG on Computational and Methodological Statistics, 2014
    • Tenenhaus A., Le Brusquet L. Three-way Regularized Generalized Canonical Correlation Analysis, ThRee-way methods In Chemistry And Psychology (TRICAP), 2015
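  33. … to Multiblock / Multiway data

    P1 = three-way array built from X1, X2, X3 and X4; P2 = X5; C12 = 1.

    $$\max_{b_1, b_2} \mathrm{cov}(P_1 b_1, P_2 b_2) \quad \text{subject to} \quad \|b_j\| = 1, \; j = 1, 2, \quad \text{and} \quad b_1 = b_1^K \otimes b_1^J$$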
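  34. … to Multiblock / Multiway data

    P1 and P2 are connected (C12 = 1). MGCCA imposes the Kronecker structure $b_1 = b_1^K \otimes b_1^J$ on the weight vector of P1, while P2 keeps an unstructured weight vector $b_2$.

    4 + p1 parameters to estimate instead of 4 × p1.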
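
The saving in parameters comes from writing the long weight vector as a Kronecker product of two short ones. The toy sketch below illustrates this with 4 modalities and a hypothetical p1 = 5 (the talk mentions p1 of the order of 10^4). The helper `kron` and the numerical values are made up for illustration.

```python
def kron(u, v):
    """Kronecker product of two vectors: entry (k, j) equals u[k] * v[j]."""
    return [uk * vj for uk in u for vj in v]

K, p1 = 4, 5                       # 4 modalities, toy number of variables
b_K = [0.5, 0.5, 0.5, 0.5]         # one weight per modality (unit norm)
b_J = [0.6, 0.8, 0.0, 0.0, 0.0]    # one weight per variable (unit norm)

b = kron(b_K, b_J)                 # structured weight vector of length K * p1
free_structured = K + p1           # 4 + p1 parameters
free_unstructured = K * p1         # 4 x p1 parameters
```

Because the norm of a Kronecker product is the product of the norms, two unit-norm factors automatically give a unit-norm weight vector.
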
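  35. MGCCA optimization problem

    $$\max_{b_1, \ldots, b_J} \sum_{j \neq k} c_{jk} \, g\big(\mathrm{cov}(P_j b_j, P_k b_k)\big) \quad \text{subject to} \quad \|b_j\| = 1 \ \text{and} \ b_j = b_j^K \otimes b_j^J, \; j = 1, \ldots, J$$

    [Diagram: three three-way arrays P1, P2 and P3, each made of frontal slices $P_{..1}, \ldots, P_{..k}, \ldots, P_{..K}$, with weight vectors $(b_1^J, b_1^K)$, $(b_2^J, b_2^K)$ and $(b_3^J, b_3^K)$. Design labels shown: C13 = 1, C23 = 1, C12 = 1, C13 = 0.]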
  36. MGCCA algorithm

    $c_{jk} = 1$ if blocks/groups are connected and 0 otherwise.
    $P_j = [P_{j,..1}, \ldots, P_{j,..K}]$ is the three-way array unfolded slice by slice.

    Initial step: choose $b_j$.

    Outer component (summarizes the block): $y_j = P_j b_j$

    Inner component (takes into account relationships between blocks): $z_j = \sum_{k \neq j} e_{jk} y_k$

    Choice of weights $e_{jk}$:
    - Horst : $e_{jk} = c_{jk}$
    - Centroid : $e_{jk} = c_{jk} \, \mathrm{sign}\big(\mathrm{cov}(y_j, y_k)\big)$
    - Factorial : $e_{jk} = c_{jk} \, \mathrm{cov}(y_j, y_k)$

    Update:
    $$b_j = \operatorname*{argmax}_{b} \ \mathrm{cov}(P_j b, z_j) \quad \text{subject to} \quad \|b\| = 1 \ \text{and} \ b = b^K \otimes b^J$$

    $b_j^K$ and $b_j^J$ are obtained by SVD.

    Iterate until convergence of the criterion.
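
The slide states that the two factors of the weight vector are obtained by SVD. One way to see this, sketched below under the assumption that the constraint is a unit-norm Kronecker product, is that the covariance between P(b^K ⊗ b^J) and z is a bilinear form in (b^K, b^J), so it is maximized by the leading singular vectors of a K × J matrix. The function name, the column ordering of the unfolded array (slice k occupies columns k·J to k·J + J − 1) and the toy data are assumptions.

```python
import numpy as np

def mgcca_update(P, z, K, J):
    """Maximise cov(P (bK kron bJ), z) over unit-norm bK and bJ.

    P is the column-centered unfolded array (n x K*J); covariances use 1/n.
    """
    n = P.shape[0]
    # Q[k, j] = covariance between column (k, j) of P and z
    Q = (P.T @ z).reshape(K, J) / n
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    # Leading singular pair maximises bK^t Q bJ; s[0] is the attained value.
    return U[:, 0], Vt[0, :], s[0]

rng = np.random.default_rng(0)
n, K, J = 20, 3, 4
P = rng.standard_normal((n, K * J))
P = P - P.mean(axis=0)
z = rng.standard_normal(n)             # hypothetical inner component
b_K, b_J, top = mgcca_update(P, z, K, J)
b = np.kron(b_K, b_J)                  # structured weight vector
value = float((P @ b) @ z) / n         # covariance reached by the update
```
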
  37. MGCCA results

    Goal: predict the long term recovery of patients after traumatic brain injury.

    [Figure: blocks P1 and P2 with weight vectors $b_1^J$, $b_1^K$ and $b_2$. $b_1^J$: influence of spatial positions (discriminating voxels within the white matter bundles). $b_1^K$: influence of the modalities. Groups shown: Bad prognosis, Good prognosis, Control.]
  38. Part III. Multi-group analysis

    M. Tenenhaus (HEC, Paris)
  39. Part III: multigroup data analysis

    • SETTINGS: The same set of variables is measured on individuals structured in several groups. Groups are centered and normalized (unit norm).
    • OBJECTIVE: investigate the relationships between variables within the various groups.

    [Diagram: I groups with n1, n2, ..., nI individuals, all described by the same p variables.]

    Tenenhaus, A. and Tenenhaus, M. (2014). Regularized Generalized Canonical Correlation Analysis for multiblock or multigroup data analysis. European Journal of Operational Research, 238: 391–403.
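  40. RGCCA for multiblock data analysis

    $$\operatorname*{argmax}_{a_1, a_2, \ldots, a_J} \sum_{j \neq k} c_{jk} \, g\big(\mathrm{cov}(X_j a_j, X_k a_k)\big) \quad \text{subject to the constraints} \quad (1 - \tau_j) \, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1, \; j = 1, \ldots, J$$

    Block component: $y_j = X_j a_j$

    $\mathrm{cov}^2(X_j a_j, X_k a_k) = \mathrm{var}(X_j a_j) \, \mathrm{cor}^2(X_j a_j, X_k a_k) \, \mathrm{var}(X_k a_k)$

    Block components should satisfy two properties at the same time:
    (i) Block components explain their own block well.
    (ii) Block components are as correlated as possible for connected blocks.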
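  41. RGCCA for multigroup data analysis

    $$\operatorname*{argmax}_{a_1, \ldots, a_I} \sum_{j \neq k} c_{jk} \, g\big(\langle X_j^t X_j a_j, X_k^t X_k a_k \rangle\big) \quad \text{subject to} \quad (1 - \tau_j) \|X_j a_j\|^2 + \tau_j \|a_j\|^2 = 1, \; j = 1, \ldots, I$$

    Group components: $y_j = X_j a_j$. Group loadings: $X_j^t y_j$.

    $\langle X_j^t y_j, X_k^t y_k \rangle = \cos(X_j^t y_j, X_k^t y_k) \times \|X_j^t y_j\| \times \|X_k^t y_k\|$, where a large cosine means similar loadings.

    Group loadings and group components should satisfy the following properties at the same time:
    • Each group component explains its own block well.
    • The angle between loadings is small if groups are connected.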
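  42. Conclusion

    Multiblock analysis: blocks X1, X2, ..., XJ with p1, p2, ..., pJ variables, observed on the same n individuals.
    Multiblock/Multiway analysis: blocks that may themselves be multi-way arrays.
    Multigroup analysis: groups X1, X2, ..., XI with n1, n2, ..., nI individuals, described by the same p variables.

    RGCCA for multiblock, multigroup or multiway data allows analyzing the data in their natural (but complex) structure.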