Slide 1

Slide 1 text

Structured data analysis with RGCCA Arthur Tenenhaus 2015/02/13

Slide 2

Slide 2 text

Overview of the presentation
Part I. Multi-block analysis (blocks X1, X2, ..., XJ; n individuals; p1, p2, ..., pJ variables)
Part II. Multi-block and Multi-way analysis (blocks X1, X2, ..., XJ; n individuals)
Part III. Multi-group analysis (groups X1, X2, ..., XI; n1, n2, ..., nI individuals; p variables)

Slide 3

Slide 3 text

Part I. Multi-block analysis (blocks X1, X2, ..., XJ; n individuals; p1, p2, ..., pJ variables)
• M. Tenenhaus (HEC, Paris)
• V. Guillemot (ICM)
• T. Löfstedt (CEA, Neurospin)
• C. Philippe (IGR)
• V. Frouin (CEA, Neurospin)
• P. Groenen (Erasmus University, Rotterdam)

Slide 4

Slide 4 text

Economic inequality and political instability
Data from Russett (1964)
Economic inequality
  Agricultural inequality
  • GINI: Inequality of land distributions
  • FARM: % farmers that own half of the land (> 50)
  • RENT: % farmers that rent all their land
  Industrial development
  • GNPR: Gross national product per capita ($ 1955)
  • LABO: % of labor force employed in agriculture
Political instability
  • INST: Instability of executive (45-61)
  • ECKS: Nb of violent internal war incidents (46-61)
  • DEAT: Nb of people killed as a result of civic group violence (50-62)
  • D-STAB: Stable democracy
  • D-UNST: Unstable democracy
  • DICT: Dictatorship

Slide 5

Slide 5 text

Economic inequality and political instability (Data from Russett, 1964)
Three data blocks: X1 = (Gini, Farm, Rent), X2 = (Gnpr, Labo), X3 = (Inst, Ecks, Deat, Demo)
Country      Gini  Farm  Rent  Gnpr  Labo  Inst  Ecks  Deat  Demo
Argentina    86.3  98.2  32.9   374    25  13.6    57   217     2
Australia    92.9  99.6     *  1215    14  11.3     0     0     1
Austria      74.0  97.4  10.7   532    32  12.8     4     0     2
...
France       58.3  86.1  26.0  1046    26  16.3    46     1     2
...
Yugoslavia   43.7  79.8   0.0   297    67   0.0     9     0     3
Demo: 1 = Stable democracy, 2 = Unstable democracy, 3 = Dictatorship

Slide 6

Slide 6 text

Path diagram
• Agricultural inequality (X1), "Agr. ineq.": GINI, FARM, RENT
• Industrial development (X2), "Ind. dev.": GNPR, LABO
• Political instability (X3), "Pol. inst.": ECKS, DEAT, D-STB, D-INS, INST, DICT
Design: C13 = 1, C23 = 1, C12 = 0

Slide 7

Slide 7 text

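Block components
$y_1 = X_1 a_1 = a_{11}\,\mathrm{GINI} + a_{12}\,\mathrm{FARM} + a_{13}\,\mathrm{RENT}$
$y_2 = X_2 a_2 = a_{21}\,\mathrm{GNPR} + a_{22}\,\mathrm{LABO}$
$y_3 = X_3 a_3 = a_{31}\,\mathrm{INST} + a_{32}\,\mathrm{ECKS} + a_{33}\,\mathrm{DEAT} + a_{34}\,\mathrm{D.STB} + a_{35}\,\mathrm{D.INS} + a_{36}\,\mathrm{DICT}$
Block components should satisfy two properties at the same time:
(i) Block components explain their own block well.
(ii) Block components are as correlated as possible for connected blocks.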

Slide 8

Slide 8 text

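Some multi-block methods
$c_{jk} = 1$ if blocks $X_j$ and $X_k$ are linked, 0 otherwise, and $c_{jj} = 0$.
GENERALIZED CANONICAL CORRELATION ANALYSIS
• SUMCOR (Horst, 1961): $\max_{a_1, \dots, a_J} \sum_{j,k} c_{jk}\, \mathrm{cor}(X_j a_j, X_k a_k)$
• SSQCOR (Mathes, 1993; Hanafi, 2004): $\max_{a_1, \dots, a_J} \sum_{j,k} c_{jk}\, \mathrm{cor}^2(X_j a_j, X_k a_k)$
• SABSCOR (Mathes, 1993; Hanafi, 2004): $\max_{a_1, \dots, a_J} \sum_{j,k} c_{jk}\, |\mathrm{cor}(X_j a_j, X_k a_k)|$
Some modified multi-block methods (covariance-based criteria)
GENERALIZED CANONICAL COVARIANCE ANALYSIS
• SUMCOV (Van de Geer, 1984): $\max_{\text{all } \|a_j\| = 1} \sum_{j,k} c_{jk}\, \mathrm{cov}(X_j a_j, X_k a_k)$
• SSQCOV (Hanafi & Kiers, 2006): $\max_{\text{all } \|a_j\| = 1} \sum_{j,k} c_{jk}\, \mathrm{cov}^2(X_j a_j, X_k a_k)$
• SABSCOV (Krämer, 2006): $\max_{\text{all } \|a_j\| = 1} \sum_{j,k} c_{jk}\, |\mathrm{cov}(X_j a_j, X_k a_k)|$
Link between the two families: $\mathrm{cov}^2(X_j a_j, X_k a_k) = \mathrm{var}(X_j a_j)\, \mathrm{cor}^2(X_j a_j, X_k a_k)\, \mathrm{var}(X_k a_k)$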

Slide 9

Slide 9 text

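Covariance-based criteria
$c_{jk} = 1$ if blocks $X_j$ and $X_k$ are linked, 0 otherwise, and $c_{jj} = 0$.
• SUMCOR: $\max_{\text{all } \mathrm{var}(X_j a_j) = 1} \sum_{j,k} c_{jk}\, \mathrm{cov}(X_j a_j, X_k a_k)$
• SSQCOR: $\max_{\text{all } \mathrm{var}(X_j a_j) = 1} \sum_{j,k} c_{jk}\, \mathrm{cov}^2(X_j a_j, X_k a_k)$
• SABSCOR: $\max_{\text{all } \mathrm{var}(X_j a_j) = 1} \sum_{j,k} c_{jk}\, |\mathrm{cov}(X_j a_j, X_k a_k)|$
• SUMCOV: $\max_{\text{all } \|a_j\| = 1} \sum_{j,k} c_{jk}\, \mathrm{cov}(X_j a_j, X_k a_k)$
• SSQCOV: $\max_{\text{all } \|a_j\| = 1} \sum_{j,k} c_{jk}\, \mathrm{cov}^2(X_j a_j, X_k a_k)$
• SABSCOV: $\max_{\text{all } \|a_j\| = 1} \sum_{j,k} c_{jk}\, |\mathrm{cov}(X_j a_j, X_k a_k)|$
$\mathrm{cov}^2(X_j a_j, X_k a_k) = \mathrm{var}(X_j a_j)\, \mathrm{cor}^2(X_j a_j, X_k a_k)\, \mathrm{var}(X_k a_k)$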
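
The six criteria differ only in the statistic (cor or cov) and in the function applied to it (identity, square, absolute value). A minimal sketch, assuming the block components are already computed; the helper names `cov`, `cor` and `criterion` and the toy components are illustrative and not part of the slides:

```python
import numpy as np

def cov(u, v):
    """Covariance with divisor n (assumption: population convention)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.mean((u - u.mean()) * (v - v.mean())))

def cor(u, v):
    """Pearson correlation built from the covariance above."""
    return cov(u, v) / np.sqrt(cov(u, u) * cov(v, v))

def criterion(components, C, stat, g):
    """Sum over j != k of c_jk * g(stat(y_j, y_k))."""
    J = len(components)
    return sum(C[j][k] * g(stat(components[j], components[k]))
               for j in range(J) for k in range(J) if j != k)

identity = lambda x: x
square = lambda x: x ** 2

# Toy components: y2 is perfectly correlated with y1, y3 perfectly anti-correlated.
y1 = np.array([1.0, 2.0, 3.0, 4.0])
components = [y1, 2 * y1, -y1]
C = np.ones((3, 3)) - np.eye(3)          # all blocks linked, c_jj = 0

sumcor = criterion(components, C, cor, identity)
ssqcor = criterion(components, C, cor, square)
sabscor = criterion(components, C, cor, abs)
ssqcov = criterion(components, C, cov, square)
```

Because the sum runs over ordered pairs with a symmetric C, each linked pair is counted twice; this rescales the criteria but does not change their maximizers.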

Slide 10

Slide 10 text

RGCCA optimization problem
$\operatorname{argmax}_{a_1, a_2, \dots, a_J} \sum_{j \neq k} c_{jk}\, g(\mathrm{cov}(X_j a_j, X_k a_k))$
subject to the constraints $(1 - \tau_j)\, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1$, $j = 1, \dots, J$
where:
• $c_{jk} = 1$ if $X_j$ and $X_k$ are connected, 0 otherwise
• $g$ = identity (Horst scheme), square (factorial scheme) or absolute value (centroid scheme)
• $\tau_j$ = shrinkage constant between 0 and 1
A monotone convergent algorithm
The Schäfer and Strimmer formula can be used for an optimal determination of the shrinkage constants.
• Tenenhaus A. and Tenenhaus M., Regularized Generalized Canonical Correlation Analysis, Psychometrika, vol. 76, Issue 2, pp. 257-284, 2011
• Tenenhaus A., Philippe C., Frouin V., Kernel Generalized Canonical Correlation Analysis, Computational Statistics and Data Analysis, in revision.
• Tenenhaus A. and Guillemot V. (2013): RGCCA Package. http://cran.project.org/web/packages/RGCCA/index.html
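
The constraint $(1 - \tau_j)\, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1$ can be enforced by rescaling any nonzero weight vector. A small sketch under stated assumptions: the variance uses divisor n, and the names `normalize` and `SCHEMES` are hypothetical helpers, not functions of the RGCCA package:

```python
import numpy as np

def normalize(a, X, tau):
    """Rescale a so that (1 - tau) * var(X a) + tau * ||a||^2 = 1."""
    y = X @ a
    var_y = np.mean((y - y.mean()) ** 2)            # variance with divisor n
    scale = np.sqrt((1 - tau) * var_y + tau * (a @ a))
    return a / scale

# The three choices of g named on the slide.
SCHEMES = {
    "horst": lambda x: x,          # identity
    "factorial": lambda x: x ** 2, # square
    "centroid": abs,               # absolute value
}

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))       # synthetic block, for illustration only
a = rng.normal(size=4)
a_half = normalize(a, X, tau=0.5)
```

With `tau=1` the rescaling gives a unit-norm weight vector; with `tau=0` it gives a unit-variance component.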

Slide 11

Slide 11 text

Choice of the shrinkage constant $\tau_j$ (part 1)
$\operatorname{argmax}_{a_1, a_2} \mathrm{cov}(X_1 a_1, X_2 a_2)$ subject to the constraints $(1 - \tau_j)\, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1$, $j = 1, 2$
Special cases
Method | Criterion | Constraints
PLS regression | Maximize $\mathrm{Cov}(X_1 a_1, X_2 a_2)$ | $\|a_1\| = \|a_2\| = 1$
Canonical Correlation Analysis | Maximize $\mathrm{Cor}(X_1 a_1, X_2 a_2)$ | $\mathrm{Var}(X_1 a_1) = \mathrm{Var}(X_2 a_2) = 1$
Redundancy analysis of X1 with respect to X2 | Maximize $\mathrm{Cor}(X_1 a_1, X_2 a_2)\, \mathrm{Var}(X_1 a_1)^{1/2}$ | $\|a_1\| = \mathrm{Var}(X_2 a_2) = 1$
Interpretation:
• Components $X_1 a_1$ and $X_2 a_2$ are well correlated.
• 1st component is stable.
• No stability condition for 2nd component.
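
To see how the three special cases follow from the two-block problem, one can substitute the extreme values of the shrinkage constants into the constraint and use $\mathrm{cov}(u, v) = \mathrm{cor}(u, v)\, \mathrm{var}(u)^{1/2}\, \mathrm{var}(v)^{1/2}$. The worked substitution below is a sketch of that reasoning, not a reproduction of the slide:

```latex
\begin{aligned}
\tau_1 = \tau_2 = 1 &:\quad \|a_1\| = \|a_2\| = 1
  &&\Rightarrow\ \max\ \mathrm{cov}(X_1 a_1, X_2 a_2)
  &&\text{(PLS regression)} \\
\tau_1 = \tau_2 = 0 &:\quad \mathrm{var}(X_1 a_1) = \mathrm{var}(X_2 a_2) = 1
  &&\Rightarrow\ \mathrm{cov}(X_1 a_1, X_2 a_2) = \mathrm{cor}(X_1 a_1, X_2 a_2)
  &&\text{(CCA)} \\
\tau_1 = 1,\ \tau_2 = 0 &:\quad \|a_1\| = \mathrm{var}(X_2 a_2) = 1
  &&\Rightarrow\ \mathrm{cov}(X_1 a_1, X_2 a_2)
     = \mathrm{cor}(X_1 a_1, X_2 a_2)\, \mathrm{var}(X_1 a_1)^{1/2}
  &&\text{(redundancy analysis)}
\end{aligned}
```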

Slide 12

Slide 12 text

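Choice of the shrinkage constant $\tau_j$ (part 2)
$\operatorname{argmax}_{a_1, a_2, \dots, a_J} \sum_{j \neq k} c_{jk}\, g(\mathrm{cov}(X_j a_j, X_k a_k))$
subject to the constraints $(1 - \tau_j)\, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1$, $j = 1, \dots, J$
• $\tau_j = 0$: favoring correlation
• $\tau_j = 1$: favoring stability
The Schäfer and Strimmer formula can be used for an optimal determination of the shrinkage constants.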

Slide 13

Slide 13 text

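Choice of the design matrix C
Hierarchical models
(a) One second order block:
$\max_{a_1, a_2, \dots, a_{J+1}} \sum_{j=1}^{J} g(\mathrm{cov}(X_j a_j, X_{J+1} a_{J+1}))$
subject to $(1 - \tau_j)\, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1$, $j = 1, \dots, J + 1$
(b) Several second order blocks:
$\max_{a_1, a_2, \dots, a_J} \sum_{j=1}^{J_1} \sum_{k=J_1+1}^{J} g(\mathrm{cov}(X_j a_j, X_k a_k))$
subject to $(1 - \tau_j)\, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1$, $j = 1, \dots, J$
Very often: $X_1, \dots, X_{J_1}$ = predictor blocks and $X_{J_1+1}, \dots, X_J$ = response blocks.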
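
The two hierarchical designs correspond to simple 0/1 patterns in the design matrix C. A minimal sketch, with hypothetical helper names and 0-based block indices (block J+1 of the slide is index J here):

```python
import numpy as np

def one_second_order_block(J):
    """Design (a): blocks 1..J are each connected to block J+1 only."""
    C = np.zeros((J + 1, J + 1), dtype=int)
    C[:J, J] = 1
    C[J, :J] = 1                    # keep C symmetric, with c_jj = 0
    return C

def predictors_to_responses(J1, J):
    """Design (b): each of the first J1 blocks is connected to each of
    the remaining J - J1 blocks, with no links inside either set."""
    C = np.zeros((J, J), dtype=int)
    C[:J1, J1:] = 1
    C[J1:, :J1] = 1
    return C

C_a = one_second_order_block(3)
C_b = predictors_to_responses(2, 5)
```

The matrices are kept symmetric so that they can be used directly in a sum over ordered pairs j ≠ k.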

Slide 14

Slide 14 text

Special cases of RGCCA (among others)
Two-block case
• PLS regression: Wold S., Martens & Wold H. (1983): The multivariate calibration problem in chemistry solved by the PLS method. In Proc. Conf. Matrix Pencils, Ruhe A. & Kågström B. (Eds), March 1982, Lecture Notes in Mathematics, Springer Verlag, Heidelberg, p. 286-293.
• Redundancy analysis: Barker M. & Rayens W. (2003): Partial least squares for discrimination, Journal of Chemometrics, 17, 166-173.
• Regularized CCA: Vinod H. D. (1976): Canonical ridge and econometrics of joint production. Journal of Econometrics, 4, 147–166.
• Inter-battery factor analysis: Tucker L.R. (1958): An inter-battery method of factor analysis, Psychometrika, vol. 23, n°2, pp. 111-136.
Multi-block case
• MCOA: Chessel D. and Hanafi M. (1996): Analyse de la co-inertie de K nuages de points. Revue de Statistique Appliquée, 44, 35-60.
• SSQCOV: Hanafi M. & Kiers H.A.L. (2006): Analysis of K sets of data, with differential emphasis on agreement between and within sets, Computational Statistics & Data Analysis, 51, 1491-1508.
• SUMCOR: Horst P. (1961): Relations among m sets of variables, Psychometrika, vol. 26, pp. 126-149.
• SSQCOR: Kettenring J.R. (1971): Canonical analysis of several sets of variables, Biometrika, 58, 433-451.
• MAXDIFF: Van de Geer J. P. (1984): Linear relations among k sets of variables. Psychometrika, 49, 70-94.
• PLS path modeling (mode B): Tenenhaus M., Esposito Vinzi V., Chatelin Y.-M., Lauro C. (2005): PLS path modeling. Computational Statistics and Data Analysis, 48, 159-205.
• Generalized Orthogonal MCOA: Vivien M. & Sabatier R. (2003): Generalized orthogonal multiple co-inertia analysis (-PLS): new multiblock component and regression methods, Journal of Chemometrics, 17, 287-301.
• Carroll's GCCA: Carroll, J.D. (1968): A generalization of canonical correlation analysis to three or more sets of variables, Proc. 76th Conv. Am. Psych. Assoc., pp. 227-228.

Slide 15

Slide 15 text

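Monotone convergent algorithms for these criteria
$\operatorname{argmax}_{a_1, a_2, \dots, a_J} \sum_{j \neq k} c_{jk}\, g(\mathrm{cov}(X_j a_j, X_k a_k))$
subject to $(1 - \tau_j)\, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1$, $j = 1, \dots, J$
Two key ingredients:
(i) Block relaxation
(ii) Majorization by Minorization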

Slide 16

Slide 16 text

The RGCCA algorithm (primal version)
$c_{jk} = 1$ if blocks are linked, 0 otherwise, and $c_{jj} = 0$.
Initial step: choose $a_j$ for each block.
Iterate until convergence of the criterion:
• Outer estimation (explains the block): $y_j = X_j a_j$, with $(1 - \tau_j)\, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1$
• Inner estimation (explains the relations between blocks): $z_j = \sum_{k \neq j} e_{jk} y_k$
• Update of the weights: $a_j = \left[ z_j^t X_j \left( \tau_j I + (1 - \tau_j) \frac{1}{n} X_j^t X_j \right)^{-1} X_j^t z_j \right]^{-1/2} \left( \tau_j I + (1 - \tau_j) \frac{1}{n} X_j^t X_j \right)^{-1} X_j^t z_j$
Choice of weights $e_{jk}$:
- Horst: $e_{jk} = c_{jk}$
- Centroid: $e_{jk} = c_{jk}\, \mathrm{sign}(\mathrm{cor}(y_j, y_k))$
- Factorial: $e_{jk} = c_{jk}\, \mathrm{cov}(y_j, y_k)$
Dimension of the matrix to invert = $p_j \times p_j$
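
The steps above can be sketched in NumPy as follows. This is an illustrative reading of the slide, not the RGCCA package code: the function name `rgcca_primal`, the random initialization, the stopping rule and the synthetic data are all assumptions, and the sketch supposes a symmetric C and an invertible matrix $\tau_j I + (1 - \tau_j) X_j^t X_j / n$ (for example $n > p_j$ when $\tau_j = 0$).

```python
import numpy as np

def rgcca_primal(blocks, C, tau, scheme="factorial", tol=1e-12, max_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    blocks = [X - X.mean(axis=0) for X in blocks]          # column-centred blocks
    n, J = blocks[0].shape[0], len(blocks)
    g = {"horst": lambda x: x, "factorial": lambda x: x ** 2, "centroid": abs}[scheme]
    # M_j = tau_j I + (1 - tau_j) X_j^t X_j / n, so that a^t M_j a is the constraint value
    M = [tau[j] * np.eye(X.shape[1]) + (1 - tau[j]) * (X.T @ X) / n
         for j, X in enumerate(blocks)]
    a = []
    for j, X in enumerate(blocks):                          # initial step
        a0 = rng.normal(size=X.shape[1])
        a.append(a0 / np.sqrt(a0 @ M[j] @ a0))
    y = [X @ aj for X, aj in zip(blocks, a)]
    cov = lambda u, v: (u @ v) / n                          # components are centred

    def criterion():
        return sum(C[j, k] * g(cov(y[j], y[k]))
                   for j in range(J) for k in range(J) if j != k)

    history = [criterion()]
    for _ in range(max_iter):
        for j, X in enumerate(blocks):
            # inner estimation: z_j = sum_k e_jk y_k
            z = np.zeros(n)
            for k in range(J):
                if k == j or C[j, k] == 0:
                    continue
                c = cov(y[j], y[k])
                e = {"horst": 1.0, "factorial": c, "centroid": np.sign(c)}[scheme]
                z += C[j, k] * e * y[k]
            # weight update, then outer estimation y_j = X_j a_j
            v = X.T @ z
            d = np.linalg.solve(M[j], v)                    # p_j x p_j linear system
            a[j] = d / np.sqrt(v @ d)
            y[j] = X @ a[j]
        history.append(criterion())
        if history[-1] - history[-2] < tol:
            break
    return a, y, history

# Usage on synthetic data with the design C13 = C23 = 1, C12 = 0.
rng = np.random.default_rng(1)
n = 50
latent = rng.normal(size=n)
blocks = [np.outer(latent, rng.normal(size=p)) + rng.normal(size=(n, p)) for p in (3, 2, 4)]
C = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]])
tau = [1.0, 0.5, 0.0]
a, y, history = rgcca_primal(blocks, C, tau)
```

The list `history` stores the criterion after each sweep, which makes the monotone behaviour easy to inspect.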

Slide 17

Slide 17 text

The RGCCA algorithm (dual version)
$c_{jk} = 1$ if blocks are linked, 0 otherwise, and $c_{jj} = 0$.
Initial step: choose $\alpha_j$ for each block, with $a_j = X_j^t \alpha_j$.
Iterate until convergence of the criterion:
• Outer estimation (explains the block): $y_j = X_j X_j^t \alpha_j$, with $\alpha_j^t X_j X_j^t \left( \tau_j I + (1 - \tau_j) \frac{1}{n} X_j X_j^t \right) \alpha_j = 1$
• Inner estimation (explains the relations between blocks): $z_j = \sum_{k \neq j} e_{jk} y_k$
• Update of the weights: $\alpha_j = \left[ z_j^t X_j X_j^t \left( \tau_j I + (1 - \tau_j) \frac{1}{n} X_j X_j^t \right)^{-1} z_j \right]^{-1/2} \left( \tau_j I + (1 - \tau_j) \frac{1}{n} X_j X_j^t \right)^{-1} z_j$
Choice of weights $e_{jk}$:
- Horst: $e_{jk} = c_{jk}$
- Centroid: $e_{jk} = c_{jk}\, \mathrm{sign}(\mathrm{cor}(y_j, y_k))$
- Factorial: $e_{jk} = c_{jk}\, \mathrm{cov}(y_j, y_k)$
Dimension of the matrix to invert = $n \times n$
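
The dual update solves an $n \times n$ system instead of a $p_j \times p_j$ one and gives the same weight vector through $a_j = X_j^t \alpha_j$. The sketch below checks this numerically on a synthetic block with many more columns than rows; `primal_update` and `dual_update` are hypothetical helper names, and $\tau_j > 0$ is assumed so that both systems are well conditioned:

```python
import numpy as np

def primal_update(X, z, tau):
    """a_j from a p x p system: (tau I + (1 - tau) X^t X / n)^{-1} X^t z, rescaled."""
    n, p = X.shape
    M = tau * np.eye(p) + (1 - tau) * (X.T @ X) / n
    v = X.T @ z
    d = np.linalg.solve(M, v)
    return d / np.sqrt(v @ d)

def dual_update(X, z, tau):
    """alpha_j from an n x n system: (tau I + (1 - tau) X X^t / n)^{-1} z, rescaled."""
    n = X.shape[0]
    K = X @ X.T                                   # n x n Gram matrix
    d = np.linalg.solve(tau * np.eye(n) + (1 - tau) * K / n, z)
    return d / np.sqrt(z @ K @ d)

rng = np.random.default_rng(0)
n, p, tau_j = 10, 200, 0.5                        # high dimensional block: p >> n
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
z = rng.normal(size=n)

a_primal = primal_update(X, z, tau_j)
alpha = dual_update(X, z, tau_j)
a_dual = X.T @ alpha                              # back to the primal weights
```

For a block such as the 15201 gene expressions measured on 53 patients, the dual form replaces a 15201 × 15201 system by a 53 × 53 one.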

Slide 18

Slide 18 text

Weight vectors
• Agricultural inequality (X1), "Agr. ineq.": GINI, FARM, RENT
• Industrial development (X2), "Ind. dev.": GNPR, LABO
• Political instability (X3), "Pol. inst.": ECKS, DEAT, D-STB, INST, DICT
[Figure: path diagram annotated with the estimated weights (0.66, 0.74, 0.10, 0.69, -0.72, 0.17, 0.44, 0.48, -0.56, 0.49) and with Corr = 0.428 and Corr = -0.767]
Small dimensional block settings ⟹ primal algorithm for RGCCA

Slide 19

Slide 19 text

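Bootstrap confidence intervals
[Figure: bootstrap confidence intervals for the weights of GINI, FARM, RENT (Agr. ineq.), GNPR, LABO (Ind. dev.) and ECKS, DEAT, D-STB, INST, DICT (Pol. inst.); estimated weights 0.66, 0.74, 0.10, 0.69, -0.72, 0.17, 0.44, 0.48, -0.56, 0.49; Corr = 0.428 and Corr = -0.767]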

Slide 20

Slide 20 text

Data visualization
[Figure: countries plotted on the Agricultural inequality and Industrial development components]
These countries experienced a period of dictatorship after 1964:
• Greece: colonels' dictatorship, 1967-1974
• Chile: Pinochet's military regime, 1973-1990
• Argentina: military dictatorship, 1976-1983
• Brazil: Branco's military dictatorship, 1964-1985

Slide 21

Slide 21 text

Glioma Cancer Data (Department of Pediatric Oncology of the Gustave Roussy Institute)
Blocks: transcriptomic data (X1) = Gene 1 to Gene 15201; CGH data (X2) = CGH1 to CGH 1909; outcome (X3) = Localization
             Gene 1  Gene 2  ...  Gene 15201   CGH1  ...  CGH 1909  Localization
Patient 1      0.18   -0.21  ...       -0.73   0.00  ...     -0.55  Hemisphere
Patient 2      1.15   -0.45  ...        0.27  -0.30  ...      0.00  Midline
Patient 3      1.35    0.17  ...        0.22   0.33  ...      0.64  DIPG
...
Patient 53     1.39    0.18  ...       -0.17   0.00  ...      0.43  Hemisphere

Slide 22

Slide 22 text

Glioma Cancer Data: from an RGCCA viewpoint (Department of Pediatric Oncology of the Gustave Roussy Institute)
High dimensional block settings ⟹ dual algorithm for RGCCA
Block 1:
             Gene 1  ...  Gene 15201
Patient 1      0.18  ...       -0.73
Patient 2      1.15  ...        0.27
Patient 3      1.35  ...        0.22
...
Patient 53     1.39  ...       -0.17
Block 2:
             CGH1  ...  CGH 1909
Patient 1    0.00  ...     -0.55
Patient 2   -0.30  ...      0.00
Patient 3    0.33  ...      0.64
...
Patient 53   0.00  ...      0.43
Block 3:
             Hemisphere  DIPG
Patient 1             1     0
Patient 2             0     0
Patient 3             0     1
...
Patient 53            1     0
RGCCA with factorial scheme, $\tau_1 = 1$, $\tau_2 = 1$ and $\tau_3 = 0$
Design coefficients shown on the diagram: C13 = 1, C23 = 1, C12 = 1, C12 = 0
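
The third block is the tumour localization recoded as two indicator columns, Hemisphere and DIPG, so that Midline corresponds to the row (0, 0). A small sketch of that coding for the four patients shown on the slides; the variable names are illustrative:

```python
import numpy as np

# Localization of patients 1, 2, 3 and 53, as listed on the data slide.
localization = ["Hemisphere", "Midline", "DIPG", "Hemisphere"]

# Indicator columns kept in the outcome block; Midline is the omitted level.
levels = ["Hemisphere", "DIPG"]

X3 = np.array([[int(loc == lev) for lev in levels] for loc in localization])
```

Dropping one level avoids an exactly collinear set of indicator columns once the block is centred.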

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

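Block components
$y_1 = X_1 a_1 = a_{11}\,\mathrm{Gene}_1 + \dots + a_{1,15201}\,\mathrm{Gene}_{15201}$
$y_2 = X_2 a_2 = a_{21}\,\mathrm{CGH}_1 + \dots + a_{2,1909}\,\mathrm{CGH}_{1909}$
$y_3 = X_3 a_3 = a_{31}\,\mathrm{Hemisphere} + a_{32}\,\mathrm{DIPG}$
Block components should satisfy three properties at the same time:
(i) Block components explain their own block well.
(ii) Block components are as correlated as possible for connected blocks.
(iii) Block components are built from sparse weight vectors $a_j$.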

Slide 25

Slide 25 text

Structured variable selection for RGCCA
[Diagram labels: Genotype; Gene Expression; Functional MRI; Behavioral data (Clinic, psychometric); Intermediate phenotype; Final phenotype]

Slide 26

Slide 26 text

(Structured) variable selection for RGCCA
$\operatorname{argmax}_{a_1, a_2, \dots, a_J} \sum_{j \neq k} c_{jk}\, g(\mathrm{cov}(X_j a_j, X_k a_k))$
subject to $\|a_j\|_2 = 1$ and $\Omega(a_j) \le s_j$, $j = 1, \dots, J$
• LASSO: $\Omega(a_j) = \|a_j\|_1$
• Group LASSO: $\Omega(a_j) = \sum_{g \in G} \|a_{jg}\|_2$
• Fused LASSO: $\Omega(a_j) = \sum_{h=1}^{p_j} |a_{jh}| + \sum_{h=2}^{p_j} |a_{jh} - a_{j,h-1}|$
• Tenenhaus A., Philippe C., Guillemot V., Lê Cao K.-A., Grill J., Frouin V., Variable Selection for Generalized Canonical Correlation Analysis, Biostatistics, 15 (3), 569-583, 2014.
• Löfstedt T., Hadj-Salem F., Guillemot V., Philippe C., Duchesnay E., Frouin V., and Tenenhaus A., (2014). Structured variable selection for generalized canonical correlation analysis. In: Proceedings of the 8th International Conference on Partial Least Squares and Related Methods (PLS14), Paris, France.
• Tenenhaus A. and Guillemot V. (2013): RGCCA Package. http://cran.project.org/web/packages/RGCCA/index.html
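
The three penalties can be evaluated directly on a weight vector. A minimal sketch, assuming the standard unweighted forms of the LASSO, group LASSO and fused LASSO penalties; the function names, the example vector and the groups are illustrative:

```python
import numpy as np

def lasso(a):
    """L1 norm of the weight vector."""
    return float(np.sum(np.abs(a)))

def group_lasso(a, groups):
    """Sum of the Euclidean norms of the sub-vectors indexed by each group."""
    return float(sum(np.linalg.norm(a[idx]) for idx in groups))

def fused_lasso(a):
    """L1 norm plus the L1 norm of successive differences."""
    return float(np.sum(np.abs(a)) + np.sum(np.abs(np.diff(a))))

a = np.array([3.0, -4.0, 0.0, 0.0, 1.0])
groups = [[0, 1], [2, 3, 4]]          # hypothetical partition of the variables

omega_lasso = lasso(a)
omega_group = group_lasso(a, groups)
omega_fused = fused_lasso(a)
```

The group penalty acts on whole groups of variables, while the fused penalty also charges for jumps between neighbouring weights, which suits ordered variables such as CGH probes along the genome.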

Slide 27

Slide 27 text

Signature stability

Slide 28

Slide 28 text

Predictive performances

Slide 29

Slide 29 text

Visualization GE1 CGH1

Slide 30

Slide 30 text

Part II. Multi-block and Multi-way analysis (blocks X1, X2, ..., XJ; n individuals)
• G. Lechuga (CentraleSupélec, L2S)
• L. Le Brusquet (CentraleSupélec, L2S)
• L. Puybasset, V. Perlbarg & D. Galanaud (Hôpital La Pitié-Salpêtrière)

Slide 31

Slide 31 text

Multimodal neuroimaging from a multiblock viewpoint (Brain and Spine Institute, La Pitié-Salpêtrière Hospital)
Anatomical MRI (X1), $p_1 \sim 10^4$, $n \sim 100$:
             Anat 1  ...  Anat p1
Patient 1      0.18  ...    -0.73
Patient 2      1.15  ...     0.27
Patient 3      1.35  ...     0.22
...
Patient n      1.39  ...    -0.17
Functional MRI (X2), $p_2 \sim 10^4$, $n \sim 100$:
             fMRI 1  ...  fMRI p2
Patient 1      0.00  ...    -0.55
Patient 2     -0.30  ...     0.00
Patient 3      0.33  ...     0.64
...
Patient n      0.00  ...     0.43
Behavior (X3), $p_3 \sim 10$:
             BEH 1  ...  BEH p3
Patient 1     0.18  ...   -0.73
Patient 2     1.15  ...    0.27
Patient 3     1.35  ...    0.22
...
Patient n     1.39  ...   -0.17
Design coefficients shown on the diagram: C13 = 1, C23 = 1, C12 = 1, C13 = 0

Slide 32

Slide 32 text

From Multiblock data to … Multiway RGCCA (MGCCA)
Blocks X1 (Anatomical MRI), X2 (Diffusion MRI), X3 (Functional MRI) and X4 (PET), each with p1 variables, and X5 (Final Phenotype) with p2 variables, all measured on the same n individuals.
RGCCA algorithm: weight vectors a1, a2, a3, a4, a5, with 4×p1 parameters to estimate.
• Tenenhaus A., Le Brusquet L. Regularized Generalized Canonical Correlation Analysis extended to three way data, International Conference of the ERCIM WG on Computational and Methodological Statistics, 2014
• Tenenhaus A., Le Brusquet L. Three-way Regularized Generalized Canonical Correlation Analysis, ThRee-way methods In Chemistry And Psychology (TRICAP), 2015

Slide 33

Slide 33 text

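… to Multiblock / Multiway data
The blocks X1, X2, X3, X4 form the three-way block P1 (weight vectors $b_1^J$ and $b_1^K$); X5 forms the block P2 (weight vector $b_2$). Design: C12 = 1.
$\max_{b_1, b_2} \mathrm{cov}(P_1 b_1, P_2 b_2)$
subject to $\|b_j\| = 1$, $j = 1, 2$, and $b_1 = b_1^K \otimes b_1^J$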

Slide 34

Slide 34 text

… to Multiblock / Multiway data
MGCCA: block P1 (weight vectors $b_1^J$ and $b_1^K$) and block P2 (weight vector $b_2$), with C12 = 1.
$b_1 = b_1^K \otimes b_1^J$
4 + p1 parameters to estimate instead of 4 × p1
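
The Kronecker constraint means that the component of a three-way block can be computed slice by slice, and that only J + K weights are free instead of J × K. A short numerical check under stated assumptions: the sizes are arbitrary, and the unfolding places the K slices side by side, which is the ordering that matches `np.kron(bK, bJ)`:

```python
import numpy as np

rng = np.random.default_rng(0)
n, J, K = 6, 5, 4                       # hypothetical sizes: K slices of J variables
P3 = rng.normal(size=(n, J, K))         # three-way block

# Unfolded block [P^{..1}, ..., P^{..K}] of size n x (J*K).
P = np.concatenate([P3[:, :, k] for k in range(K)], axis=1)

bJ = rng.normal(size=J)                 # weights along the J mode
bK = rng.normal(size=K)                 # weights along the K mode
b = np.kron(bK, bJ)                     # structured weight vector of length J*K

y_unfolded = P @ b
y_multiway = sum(bK[k] * (P3[:, :, k] @ bJ) for k in range(K))

free_parameters = {"unstructured": J * K, "kronecker": J + K}
```

With K = 4 modalities and J = p1 variables per modality, this is the reduction from 4 × p1 to 4 + p1 parameters stated on the slide.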

Slide 35

Slide 35 text

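MGCCA optimization problem
$\max_{b_1, \dots, b_J} \sum_{j \neq k} c_{jk}\, g(\mathrm{cov}(P_j b_j, P_k b_k))$
subject to $\|b_j\| = 1$, $j = 1, \dots, J$, and $b_j = b_j^K \otimes b_j^J$, $j = 1, \dots, J$
[Diagram: three three-way blocks P1, P2, P3 with slices $P_j^{..1}, \dots, P_j^{..k}, \dots, P_j^{..K}$ and weight vectors $b_j^J$, $b_j^K$; design coefficients shown: C13 = 1, C23 = 1, C12 = 1, C13 = 0]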

Slide 36

Slide 36 text

MGCCA algorithm
$c_{jk} = 1$ if blocks/groups are connected and 0 otherwise; $P_j = [P_j^{..1}, \dots, P_j^{..K}]$.
Initial step: choose $b_j$ for each block.
Iterate until convergence of the criterion:
• Outer component (summarizes the block): $y_j = P_j b_j$
• Inner component (takes into account relationships between blocks): $z_j = \sum_{k \neq j} e_{jk} y_k$
• Update of the weights: $b_j = \operatorname{argmax}_{b = b^K \otimes b^J,\ \|b\| = 1} \mathrm{cov}(P_j b, z_j)$, where $b_j^K$ and $b_j^J$ are obtained by SVD
Choice of weights $e_{jk}$:
- Horst: $e_{jk} = c_{jk}$
- Centroid: $e_{jk} = c_{jk}\, \mathrm{sign}(\mathrm{cov}(y_j, y_k))$
- Factorial: $e_{jk} = c_{jk}\, \mathrm{cov}(y_j, y_k)$
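
One way to read the SVD step: for $b = b^K \otimes b^J$, the quantity $z_j^t P_j b$ equals $(b^J)^t Q\, b^K$ with $Q_{jk} = \langle P_j^{..k}[:, j], z_j \rangle$, so the maximizer over unit-norm vectors is given by the leading singular vectors of $Q$. The sketch below follows that reading; the function name `mgcca_update` and the synthetic data are assumptions, and this is not presented as the authors' implementation:

```python
import numpy as np

def mgcca_update(P3, z):
    """Maximize z^t P b over b = kron(bK, bJ) with ||bJ|| = ||bK|| = 1.

    P3 has shape (n, J, K); P is its unfolding [P^{..1}, ..., P^{..K}].
    """
    Q = np.einsum("njk,n->jk", P3, z)           # Q[j, k] = <P3[:, j, k], z>
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    bJ, bK = U[:, 0], Vt[0, :]                  # leading singular vectors
    return np.kron(bK, bJ), bJ, bK, s[0]

rng = np.random.default_rng(0)
n, J, K = 20, 6, 4
P3 = rng.normal(size=(n, J, K))
P3 -= P3.mean(axis=0)                           # centre every variable
z = rng.normal(size=n)
z -= z.mean()

b, bJ, bK, top_singular_value = mgcca_update(P3, z)
P = np.concatenate([P3[:, :, k] for k in range(K)], axis=1)
attained = z @ (P @ b)                          # n * cov(P b, z) for centred data
```

Since the singular vectors have unit norm, `b` has unit norm as well, and `attained` equals the largest singular value of `Q`.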

Slide 37

Slide 37 text

MGCCA results
Predict the long term recovery of patients after traumatic brain injury
Blocks: P1 (weight vectors $b_1^J$ and $b_1^K$) and P2 (weight vector $b_2$)
• $b_1^J$: influence of spatial positions (discriminating voxels within the white matter bundles)
• $b_1^K$: influence of the modalities
Groups shown: Bad prognosis, Good prognosis, Control

Slide 38

Slide 38 text

Part III. Multi-group analysis (groups X1, X2, ..., XI; n1, n2, ..., nI individuals; p variables)
• M. Tenenhaus (HEC, Paris)

Slide 39

Slide 39 text

Part III: multigroup data analysis
• SETTINGS: The same set of variables is measured on individuals structured in several groups (n1, n2, ..., nI individuals; p variables). Groups are centered and normalized (unit norm).
• OBJECTIVE: Investigate the relationships between variables within the various groups.
Tenenhaus, A. and Tenenhaus, M. (2014). Regularized Generalized Canonical Correlation Analysis for multiblock or multigroup data analysis. European Journal of Operational Research, 238: 391–403.

Slide 40

Slide 40 text

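RGCCA for multiblock data analysis
$\operatorname{argmax}_{a_1, a_2, \dots, a_J} \sum_{j \neq k} c_{jk}\, g(\mathrm{cov}(X_j a_j, X_k a_k))$
subject to the constraints $(1 - \tau_j)\, \mathrm{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1$, $j = 1, \dots, J$
Block component: $X_j a_j$
$\mathrm{cov}^2(X_j a_j, X_k a_k) = \mathrm{var}(X_j a_j)\, \mathrm{cor}^2(X_j a_j, X_k a_k)\, \mathrm{var}(X_k a_k)$
Block components should satisfy two properties at the same time:
(i) Block components explain their own block well.
(ii) Block components are as correlated as possible for connected blocks.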

Slide 41

Slide 41 text

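RGCCA for multigroup data analysis
$\operatorname{argmax}_{a_1, \dots, a_I} \sum_{j \neq k} c_{jk}\, g(\langle X_j^t y_j, X_k^t y_k \rangle)$
subject to $(1 - \tau_j) \|X_j a_j\|^2 + \tau_j \|a_j\|^2 = 1$, $j = 1, \dots, I$
where $y_j = X_j a_j$ are the group components, $X_j^t y_j$ are the group loadings, and
$\langle X_j^t y_j, X_k^t y_k \rangle = \cos(X_j^t y_j, X_k^t y_k) \times \|X_j^t y_j\| \times \|X_k^t y_k\|$
Group loadings and group components should satisfy the following properties at the same time:
• Each group component explains its own group well.
• The angle between loadings is small if groups are connected (similar loadings).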
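
The decomposition of the inner product into a cosine and two norms can be illustrated on two synthetic groups that share the same variables. This is a sketch under assumptions: the group loadings are taken to be $X_j^t y_j$ with $y_j = X_j a_j$, "normalized (unit norm)" is read as unit-norm columns within each group, and `standardize_group` is a hypothetical helper:

```python
import numpy as np

def standardize_group(X):
    """Centre each column and scale it to unit Euclidean norm within the group."""
    Xc = X - X.mean(axis=0)
    return Xc / np.linalg.norm(Xc, axis=0)

rng = np.random.default_rng(0)
p = 5                                            # same variables in both groups
X1 = standardize_group(rng.normal(size=(30, p))) # group 1, n1 = 30
X2 = standardize_group(rng.normal(size=(45, p))) # group 2, n2 = 45

a1 = rng.normal(size=p)
a2 = rng.normal(size=p)
y1, y2 = X1 @ a1, X2 @ a2                        # group components
l1, l2 = X1.T @ y1, X2.T @ y2                    # group loadings (length p each)

inner = l1 @ l2
cosine = inner / (np.linalg.norm(l1) * np.linalg.norm(l2))
```

A large inner product therefore requires both long loading vectors (components that explain their group) and a small angle between them (similar loadings across connected groups).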

Slide 42

Slide 42 text

Conclusion
• Multiblock analysis: blocks X1, X2, ..., XJ (n individuals; p1, p2, ..., pJ variables)
• Multiblock/Multiway analysis: blocks X1, X2, ..., XJ (n individuals)
• Multigroup analysis: groups X1, X2, ..., XI (n1, n2, ..., nI individuals; p variables)
RGCCA for multiblock, multigroup or multiway data allows analyzing the data in their natural (but complex) structure.