Multi-blocks methods

Data Common Structure Groups Study Partial Analyses To go further
Example Multiple Factor Analysis Julie Josse Applied Mathematics Department, Agrocampus Ouest Stat 300 Stanford, July 2015 1 / 58

Example Multiple Factor Analysis 1 Data - Issues 2 Common Structure 3 Groups Study 4 Partial Analyses 5 Example 2 / 58

Example Multi-blocks data set 3 Groups of variables (MFA) Groups of variables are quantitative and/ or qualitative Objectives: - study the link between the sets of variables - balance the influence of each group of variables - give the classical graphs but also specific graphs: groups of variables - partial representation Examples: - Genomic: DNA, protein - Sensory analysis: sensorial, physico-chemical - Comparison of coding (quantitative / qualitative) • Sensory analysis: products - sensorial, physico-chemical • Survey: individuals - questionnaires themes (students health: addicted consumptions, psychological conditions, sleep, id) • Economy: countries - economic indicators each year • Biology: samples - Omics data (brain tumors: CGH, transcriptome; mouse: transcriptome, hepatic fatty acid measurements) ⇒ Generalized Canonical Correlation, Procrustes, Statis, etc. ⇒ MFA (Escoﬁer & Pagès, 1998) ⇒ Continuous / categorical / contingency sets of variables 3 / 58

Example Example: gliomas brain tumors The data <Experiment> Gliomas: Brain tumors, WHO classification - astrocytoma (A)……….……… x5 - oligodendroglioma (O)……… x8 - oligo-astrocytoma (OA)…… x6 - glioblastoma (GBM)………… x24 43 tumor samples (Bredel et al.,2005) - transcriptional modification (RNA), Microarrays - damage to DNA, CGH arrays • Transcriptional modiﬁcation (RNA), microarrays: 489 variables • Damage to DNA (CGH array): 113 variables ‘-omics’ data 1 j 1 J 1 1 i I Tumors 1 j 2 J 2 <Merged data tables> The data, the expectations <Genome alteration> <Transcriptome> 4 / 58

Example Objectives • Study the similarities between individuals with respect to all the variables • Study the linear relationships between variables ⇒ taking into account the structure on the data (balance the inﬂuence of each group) • Find the common structure with respect to all the groups - highlight the speciﬁcities of each group • Compare the typologies obtained from each group of variables (separate analyses) 5 / 58

Example Principal component methods The core of principal component methods is PCA on particular matrices "Doing a data analysis, in good mathematics, is simply searching eigenvectors, all the science of it (the art) is just to ﬁnd the right matrix to diagonalize" Benzécri MFA is a particular weighted PCA! 6 / 58

Example Balancing the groups of variables MFA is a weighted PCA: • compute the ﬁrst eigenvalue λj 1 of each group of variables • perform a global PCA on the weighted data table:   X1 λ1 1 ; X2 λ2 1 ; ...; XJ λJ 1   ⇒ Same idea as in PCA when variables are standardized: variables are weighted to compute distances between individuals i and i 8 variables highly correlated 2 var i i′ 7 / 58

Example Balancing the groups of variables Transcriptome Genome λ1 162 12 λ2 35 10 λ3 21 5 This weighting allows that: • Same weight for all the variables of one group: the structure of the group is preserved • For each group the variance of the main dimension of variability (ﬁrst eigenvalue) is equal to 1 • No group can generate by itself the ﬁrst global dimension • A multidimensional group will contribute to the construction of more dimensions than a one-dimensional group 8 / 58

Example Individuals and variables representations q −2 −1 0 1 2 3 −3 −2 −1 0 1 2 Individual factor map Dim 1 (20.99 %) Dim 2 (13.51 %) q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q AA3 AO1 AO2 AO3 AOA1 AOA2 AOA3 AOA4 AOA6 AOA7 GBM1 GBM11 GBM15 GBM16 GBM21 GBM22 GBM23 GBM24 GBM25 GBM26 GBM27 GBM28 GBM29 GBM3 GBM30 GBM31 GBM4 GBM5 GBM6 GBM9 GNN1 GS1 GS2 JPA2 JPA3 LGG1 O1 O2 O3 O4 O5 sGBM1 sGBM3 A GBM O OA q −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 Correlation circle Dim 1 (20.99 %) Dim 2 (13.51 %) EXT1 KIAA0934 SVIL MGC39606 ADARB2 PARD3 SPAG6 AKR1C1 CCDC3 AKR1C2 FLJ40873 GYPA VASP ACAD8 NPD014 USP6NL CACNB2 CLCA3 COX7A1 CEAL1 FLJ38944 LOC56931 KIAA1543 HBXIP ZNF226 EDG1 SORT1 LOC283070 DCLRE1B ZNF549 KLRC3 MGC4728 DDEF1 DMN KCNJ1 MGC18216 PSG5 FBXO32 FLJ12586 SBP1 MGC42367 OSR1 D21S2089E HNRNPG.T AP4B1 ZNF160 LRBA LILRA1 EGLN2 LU GPSM2 APOC1 IGF1R TLL2 CA11 MGC39581 KCNK6 TFDP2 ZNF233 COX6A2 UBE4B NDUFB11 FLJ12572 RBP7 IGSF3 BLVRB PSG1 MYBPC2 MSN H08563 BGN TNFRSF12A CAPG MYCBP2 EMP3 RTN3 RGS16 ITGA5 PPP3CB LOC541471 H78560 ASPA TIMP1 AA398420 IMAGE.33267 SOCS3 PDPN CHI3L2 MASP1 PLP2 S100A11 SETD5 T97457 UBA52 DLL3 TSPYL1 FAM84B RALY LOC613212 CLIC1 VEGF AA281932 COL1A1 CD47 IFI30 ATP6V1C1 PSD3 RUNX1 LHFPL2 C9orf48 CA12 EDNRB PPP3CA LGALS3 STEAP3 IGFBP3 KLRC3.1 KLRC2 KLRC1 NPL ARPP.19 PDE2A COL4A2 TIMP3 PLAUR PNCK WDR7 X37864 NNMT TBC1D7 MRC2 ABCA5 TCF7L1 CD58 MST1 DECR2 AA131320 LOC57228 MST150 PEG3 TMEM49 FABP3 SLC16A3 SERPINH1 ADM NMNAT2 GABBR1 VAMP2 CSPG2 WASF1 CBX6 RAB27B COL3A1 KIAA1644 NLGN1 C16orf5 LOC400451 AEBP1 COL1A2 COL6A2 HEY1 HSPG2 RPS19BP1 SUV39H1 RFXAP TNC DKFZp313A2432 STC1 PDXP MOAP1 MAPKBP1 ID4 CACNB3 SERPING1 RND3 HMOX1 MTHFD2 R70506 APOC2 ANXA1 AA489629 H86813 SCAMP5 RPS3A PLAU R61377 NEFH STMN1 FBXL16 C1R WWTR1 EPHB1 ANXA5 AA906888 AI262682 LOC146795 SOX4 COL9A2 SERPINA3 AA029415 PRKCZ FLJ35740 FN1 LOC388610 AI871056 FGF13 STXBP1 AI357047 SP100 HCLS1 MGC26694 LSP1 PRKAR1B YWHAG CLTB AI335002 A2BP1 HLA.A SNCG IGFBP5 PYCR1 ABCC3 L3MBTL4 DPYD H20822 AI822135 LMO7 ADFP KCNMA1 PCDHGC3 MSTP9 AA479357 LTF ZNF217 AA490257 LOC389831 AA975768 NOTCH1 MB RAB30 FCGR2B C4A FLJ38984 PYGL SNRPN KNS2 GBP1 IL32 LY96 AA181288 IGF2BP3 CALM1 CASP1 PDLIM7 DEF6 W93688 USP3 X38595 H87106 CDKN1A PHLDA1 GOT1 EFHB GYPC MYC CDC20B TNFAIP3 POSTN PLOD2 CDKN2D AA669383 CLSTN3 MDK MVP CCL2 FAM46A PLA2G2A KCNQ2 CKMT1A H24428 KIAA0963 NSF RAB11FIP4 CD53 AA873230 URP2 CAMK2A APCS HPRT1 LAMA4 ITPKA HK2 RBP4 DDIT4 OSBPL1A CAMKK2 L1CAM PYHIN1 H91845 SLC35A2 FCGRT AI005038 PTPRZ1 ADCY1 HAMP CD44 FAS NUAK1 DNASE1L1 GPNMB HBA1 T62491 FBXO2 VSNL1 SPINT2 C8orf4 NCF2 PRSS3 PLTP CAP2 LAMB1 EDG3 INPP5F PDE4DIP MMP2 S100A10 LAPTM5 PRRX1 IL1RAP HLA.G TSPAN4 ITGA9 CCR1 MAL2 DSCR1L1 C6orf12 DDN CBLN2 GBP2 PRKCB1 F13A1 S100A1 R52960 BCL2A1 YWHAH FREQ UGCG SERPINI1 NLK ANK3 AI002301 NCF1 CA11.1 NY.SAR.48 AA598555 CYBA ID3 TAP1 TGFBI AI263051 TOMM40 C1orf187 IQSEC1 DES NPC2 AIF1 HLA.F AI350724 PRSS1 SAA2 CYR61 T51726 SYNGR3 ITGA3 CHN1 AA702986 FBXW7 AA598631 LUM PCP4 SRPX IGHG1 FAM84A H41096 HPCA CTHRC1 AA401952 DYNLT1 GAS1 RAB20 ESM1 AA424849 NR4A1 EPB49 MDFI LYN TXNDC PALLD R70684 CAV1 ZFHX1B H10054 NDST1 SPARC SCN2B SYT7 MED11 PARP14 MICAL2 TncRNA BNIP2 FZD7 GADD45B FBL LOC283130 STK17A PRKCG HLA.DRB1 PRG1 N98591 PPP1R14A SLC15A2 NPTX1 MGP ⇒ What can be added to interpret? 9 / 58

Example Individuals and variables representations q −2 −1 0 1 2 3 −3 −2 −1 0 1 2 Individual factor map Dim 1 (20.99 %) Dim 2 (13.51 %) q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q AA3 AO1 AO2 AO3 AOA1 AOA2 AOA3 AOA4 AOA6 AOA7 GBM1 GBM11 GBM15 GBM16 GBM21 GBM22 GBM23 GBM24 GBM25 GBM26 GBM27 GBM28 GBM29 GBM3 GBM30 GBM31 GBM4 GBM5 GBM6 GBM9 GNN1 GS1 GS2 JPA2 JPA3 LGG1 O1 O2 O3 O4 O5 sGBM1 sGBM3 A GBM O OA A GBM O OA Figure 4: Multi-way glioma data set: Characteristics of oligodendrogliomas are linked to modifica the genomic status of genes located on 1p and 19q positions. ⇒ Do the means of the tumors coordinates per stage on each dimension significantly differ from each other? 10 / 58

Example Groups study ⇒ Synthetic comparison of the groups ⇒ Are the relative positions of individuals globally similar from one group to another? Are the partial clouds similar? ⇒ Do the groups bring the same information? 11 / 58

Example Principal component in MFA MFA = weighted PCA ⇒ ﬁrst principal component of MFA maximizes J j=1 k∈Kj cov2   x.k λj 1 , F1   = J j=1 Lg (F1, Kj ) Lg (F1, Kj ) =< WKj λ1 , F1F1 >= trace(WKj F1F1 ) ⇒ F1 the most related to the groups in the Lg sense 12 / 58

Example Representation of the groups Group j has the coordinates (Lg (F1, Kj ), Lg (F2, Kj )) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Groups representation Dim 1 (20.99 %) Dim 2 (13.51 %) CGH expr WHO • 2 groups are all the more close that they induce the same structure • The 1st dimension is common to all the groups • 2nd dimension mainly due to CGH 0 ≤ Lg (F1, Kj ) = 1 λj 1 k∈Kj cov2(x.k, F1 ) ≤λj 1 ≤ 1 ⇒ Could you predict the results of the PCA for each group? 13 / 58

Example The RV coeﬃcient Xj(I×Kj ) and Xm(I×Km) not directly comparable Wj(I×I) = Xj Xj and Wm(I×I) = XmXm can be compared Inner product matrices = relative position of the individuals Covariance between two groups: < Wj , Wm >= k∈Kj l∈Km cov2(x.k, x.l ) Correlation between two groups (Escouﬁer, 1973): RV (Kj , Km ) = < Wj , Wm > Wj Wm 0 ≤ RV ≤ 1 RV = 0: variables of Kj are uncorrelated with variables of Km RV = 1: the two clouds of points are homothetic ⇒ Extension of the notion of correlation matrix 14 / 58

Example Similarity between two groups Measure of similarity between groups Kj and Km: Lg (Kj , Km ) = k∈Kj l∈Km cov2 x.k λk 1 , x.l λl 1 Ramsay (1984): "Matrices may be similar or dissimilar in a many ways" Canonical correlation (Hotteling, 1936), Mantel (1967), Procrustes (Gower, 1971), dCov (Szekely et al., 2007), kernel based HSIC (Gretton et al., 2005), etc... 15 / 58

Example Numeric indicators > res.mfa$group$Lg CGH expr WHO CGH 2.51 0.60 0.46 expr 0.60 1.10 0.36 WHO 0.46 0.36 0.50 > res.mfa$group$RV CGH expr WHO CGH 1.00 0.36 0.41 expr 0.36 1.00 0.48 WHO 0.41 0.48 1.00 Lg (Kj , Kj ) = Kj k=1 (λj k )2 (λj 1 )2 = 1+ Kj k=2 (λj k )2 (λj 1 )2 • CGH gives richer description (Lg greater) • RV: a standardized Lg • CGH and expr are not linked (RV=0.36) Contribution of each group to each component of the MFA > res.mfa$group$contrib Dim.1 Dim.2 Dim.3 CGH 45.8 93.3 78.1 expr 54.2 6.7 21.9 • Similar contribution of the 2 groups to the ﬁrst dimension • Second dimension only due to CGH 16 / 58

Example Partial analyses ⇒ Comparison of the groups through the individuals ⇒ Comparison of the typologies provided by each group in a common space ⇒ Are there individuals very particular with respect to one group? ⇒ Comparison of the separate PCA 17 / 58

Example Projection of partial points xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 Projection of group 1 Projection of group 2 Projection of group 3 Data MFA individuals configuration i i1 i2 i3 i Mean point Partial point 3 Partial point 2 Partial point 1 G1 G2 G3 RK = ⊕ RKj 18 / 58

Example Partial points opinion attitude individuals individual i What you think What you do behavioral conflict F1 F2 19 / 58

Example Partial points What you expected for the tutorial What you have learned during the tutorial Tutorial participants F F F F1 1 1 1 F F F F2 2 2 2 What you have learned during the tutorial What you expected for the tutorial What you have learned during the tutorial What you expected for the tutorial 20 / 58

Example Partial points What you expected for the tutorial What you have learned during the tutorial Tutorial participants F F F F1 1 1 1 F F F F2 2 2 2 What you have learned during the tutorial What you expected for the tutorial What you have learned during the tutorial What you expected for the tutorial Disappointed learner Happy learner 20 / 58

Example Representation of the partial points q −4 −2 0 2 4 6 −6 −4 −2 0 2 4 Individual factor map Dim 1 (20.99 %) Dim 2 (13.51 %) q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q AA3 AO1 AO2 AO3 AOA1 AOA2 AOA3 AOA4 AOA6 AOA7 GBM1 GBM11 GBM15 GBM16 GBM21 GBM22 GBM23 GBM24 GBM25 GBM26 GBM27 GBM28 GBM29 GBM3 GBM30 GBM31 GBM4 GBM5 GBM6 GBM9 GNN1 GS1 GS2 JPA2 JPA3 LGG1 O1 O2 O3 O4 O5 sGBM1 sGBM3 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q A GBM O OA CGH expr q −1 0 1 2 −2.0 −1.5 −1.0 −0.5 0.0 0.5 1.0 1.5 Individual factor map Dim 1 (20.99 %) Dim 2 (13.51 %) A GBM O OA CGH expr • an individual is at the barycentre of its partial points • an individual is all the more "homogeneous" that its superposed representations are close 21 / 58

Example Identify particular individuals q −2 −1 0 1 2 3 −3 −2 −1 0 1 2 Individual factor map Dim 1 (20.99 %) Dim 2 (13.51 %) q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q AA3 AO1 AO2 AO3 AOA1 AOA2 AOA3 AOA4 AOA6 AOA7 GBM1 GBM11 GBM15 GBM16 GBM21 GBM22 GBM23 GBM24 GBM25 GBM26 GBM27 GBM28 GBM29 GBM3 GBM30 GBM31 GBM4 GBM5 GBM6 GBM9 GNN1 GS1 GS2 JPA2 JPA3 LGG1 O1 O2 O3 O4 O5 sGBM1 sGBM3 q q A GBM O OA CGH expr 22 / 58

Example Numeric indicators I i=1 J j=1 (Fij q )2 = I i=1 J j=1 (Fiq )2 + I i=1 J j=1 (Fij q − Fiq )2 Total inertia = Between indiv. inertia + Within indiv. inertia > res.mfa$inertia.ratio Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 0.84 0.56 0.44 0.59 0.43 • For the ﬁrst dimension, the coordinates of each partial points are close (0.84 close to 1) • The within inertia can be decomposed by individuals res.mfa$ind$within.inertia 23 / 58

Example Representation of the partial components Do the separate analyses give similar dimensions as MFA? PCA i I 1 1 q Q 1 q Q 24 / 58

Example Representation of the partial components q −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 Partial axes Dim 1 (20.99 %) Dim 2 (13.51 %) Dim1.CGH Dim2.CGH Dim3.CGH Dim1.expr Dim2.expr Dim3.expr Dim1.WHO Dim2.WHO Dim3.WHO CGH expr WHO • The ﬁrst dimension of each group is well projected • CGH has same dimensions as MFA 25 / 58

Example Use of biological knowledge Genes can be grouped by gene ontology (GO) biological process GO:0006928 cell motility ANXA1 CALD1 EGFR ENPP2 FN1 FPRL2 LSP1 MSN PDPN PLAUR PRSS3 SAA2 SPINT2 TNFRSF12A VEGF WASF1 YARS GO:0009966 regulation of signal transduction CASP1 EDG2 F2R HCLS1 HMOX1 IGFBP3 IQSEC1 LYN MALT1 TCF7L1 TNFAIP3 TRIO VEGF YWHAG YWHAH GO:0052276 chromosome organisation and biogenesis CBX6 NUSAP1 PCOLN3 PTTG1 SUV39H1 TCF7L1 TSPYL1 26 / 58

Example Use of biological knowledge • Biological processes considered as supplementary groups of variables ‘-omics’ data 1 j 1 J 1 1 i I 1 j 2 J 2 M1 M2 M3 ….. Modules <MODULES of GENES> Tumors Modules Modular approach => Integration of the modules as groups of supplementary variables 27 / 58

Example Use of biological knowledge 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Groups representation Dim 1 (20.99 %) Dim 2 (13.51 %) CGH expr WHO q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Many biological processes induce the same structure on the individuals than MFA 28 / 58

Example To go further • Mixed data: MFA with 1 group = 1 variable continuous variables: PCA is recovered; categorical variables: MCA is recovered mixed: FAMD • MFA used for methodological purposes: • comparison of coding (continuous or categorical) • comparison between preprocessing (standardized PCA and unstandardized PCA) • comparison of results from diﬀerent analyses • Hierarchical Multiple Factor Analysis: Takes into account a hierarchy on the variables: variables are grouped and subgrouped (like in questionnaires structured in topics and subtopics) 29 / 58

Example Clustering: MFA as a preprocessing i i’ X1 X2 MFA balances the inﬂuence of the groups when computing distances between individuals d2(i, i ) = J j=1 1 λj Kj k=1 (xik − xi k )2 AHC or k-means onto the ﬁrst principal components (F.1, ..., F.Q) obtained from MFA allows to • take into account the groups structure in the clustering • make the clustering more robust by deleting the last dimensions 30 / 58

Example Clustering AHC onto the ﬁrst 5 principal components from MFA 0.0 0.4 0.8 Hierarchical clustering inertia gain O4 AO2 sGBM1 AOA6 AO1 AO3 GBM1 LGG1 AOA3 O3 O1 GBM6 O2 O5 AOA4 GBM25 GS1 GBM29 GBM31 GBM15 GBM26 JPA2 GBM9 GBM21 GBM22 GBM23 GBM27 GBM11 GBM4 GBM5 GBM30 GBM28 GS2 sGBM3 AOA1 GNN1 AOA2 AOA7 GBM24 JPA3 AA3 GBM3 GBM16 0.0 0.2 0.4 0.6 0.8 1.0 Cluster Dendrogram Individuals are sorted according to their coordinate on F.1 31 / 58

Example Partition from the tree An empirical number of clusters is suggested 0.0 0.4 0.8 Hierarchical clustering inertia gain O4 AO2 sGBM1 AOA6 AO1 AO3 GBM1 LGG1 AOA3 O3 O1 GBM6 O2 O5 AOA4 GBM25 GS1 GBM29 GBM31 GBM15 GBM26 JPA2 GBM9 GBM21 GBM22 GBM23 GBM27 GBM11 GBM4 GBM5 GBM30 GBM28 GS2 sGBM3 AOA1 GNN1 AOA2 AOA7 GBM24 JPA3 AA3 GBM3 GBM16 0.0 0.2 0.4 0.6 0.8 1.0 Cluster Dendrogram 32 / 58

Example Partition on the principal component map −2 −1 0 1 2 3 −4 −2 0 2 Factor map Dim 1 (20.99%) Dim 2 (13.51%) GBM3 GBM27 GBM4 GBM16 GBM5 GBM9 GBM30 GS2 GBM11 GBM21 GBM28 GBM29 GBM31 GBM22 JPA3 GBM15 GBM23 GBM24 GBM25 GS1 GBM26 JPA2 AA3 sGBM3 AOA4 sGBM1 AOA2 GBM1 GBM6 AO2 GNN1 LGG1AOA6 AOA3 AO1 O2 AOA1 AOA7 O5 AO3 O3 O1 O4 cluster 1 cluster 2 cluster 3 cluster 1 cluster 2 cluster 3 Continuous vision (principal component) and discontinuous (clusters) 33 / 58

Example Cluster description by variables v.test = ¯ x − ¯ x s2 I I−I I−1 H0 : random sampling of I values from I with ¯ x the mean of variable x in cluster , ¯ x (s) the mean (standard deviation) of the variable x in the data set, I the cardinal of cluster $desc.var$quanti$‘1‘ v.test Mean in category Overall mean sd in category Overall sd p.value TMEM49 4.488 -0.430 -1.424 0.722 1.277 0.000 TNFRSF12A 4.433 -0.794 -1.838 0.789 1.357 0.000 LGALS3 4.369 -0.222 -1.216 0.861 1.312 0.000 S100A11 4.300 -0.737 -1.500 0.525 1.024 0.000 BGN 4.273 2.105 1.106 0.697 1.348 0.000 IFI30 4.264 0.987 0.026 0.979 1.300 0.000 .... .... C9orf48 -4.411 -0.686 -0.037 0.540 0.848 0.000 PSD3 -4.594 -1.684 -1.024 0.419 0.829 0.000 AA398420 -4.635 0.324 1.134 0.635 1.007 0.000 34 / 58

Example Cluster description by observations • parangon: the closest observations to the centroid of the cluster min i∈ d(xi., C ) with C the centroid of cluster • speciﬁc observations: the furthest observations to the centroids of the other clusters (the observations sorted according to their distance from the highest to the smallest to the closest centroid) max i∈ min = d(xi., C ) desc.ind$para cluster: 1 GBM11 GBM28 GBM5 GBM25 GBM31 0.6649847 0.7001998 0.7973604 0.8869271 0.9674042 --------------------------------------------------------------- desc.ind$dist cluster: 1 GBM30 GS2 GBM21 GBM22 GBM27 3.227968 3.096048 3.031256 2.904327 2.778950 --------------------------------------------------------------- 35 / 58

Example Cluster description • by the principal components (observations coordinates): same description than for continuous variables $desc.axes$quanti$‘1‘ v.test Mean in category Overall mean sd in category Overall sd p.value Dim.2 2.919 0.511 0 0.465 1.010 0.004 Dim.1 -4.458 -0.974 0 0.560 1.259 0.000 • by categorical variables: chi-square and hypergeometric test $test.chi2 p.value df type 8.433474e-06 6 ⇒ Active and supplementary elements are used ⇒ Only signiﬁcant results are presented 36 / 58

Example Cluster description $‘1‘ Cla/Mod Mod/Cla Global p.value v.test type=GBM 75 94.73684 55.81395 3.300966e-06 4.651145 type=OA 0 0.00000 13.95349 2.207775e-02 -2.289028 type=O 0 0.00000 18.60465 5.071916e-03 -2.802430 $‘2‘ Cla/Mod Mod/Cla Global p.value v.test type=A 60 37.5 11.62791 0.0398214 2.055597 $‘3‘ Cla/Mod Mod/Cla Global p.value v.test type=O 100.0 50.00 18.60465 8.875341e-05 3.919444 type=GBM 12.5 18.75 55.81395 2.319983e-04 -3.681354 37 / 58

Example Complementarity between hierarchical clustering and partitioning • Partitioning after AHC: the k-means algorithm is initialized from the centroids of the partition obtained from the tree • consolidate the partition • loss of the hierarchy • AHC with many individuals: time-consuming ⇒ partitioning before AHC • compute k-means with approximately 100 clusters • AHC on the weighted centroids obtained from the k-means ⇒ top of the tree is approximately the same 38 / 58

Example Other methods: ade4 Two-table analysis Available methods individuals variables variables Function name Analysis name between Between-class analysis within Within-class analysis discrimin Discriminant analysis coinertia Coinertia analysis cca Canonical correspondence analysis pcaiv PCA on Instrumental Variables pcaivortho Orthogonal PCAIV procuste Procustes analysis niche Niche (OMI) analysis St´ ephane Dray (Univ. Lyon 1) CARME 2011, Rennes 22 / 31 39 / 58

Example Other methods: ade4 Other functions K-table variables individuals Function name Analysis name sepan K-table separate analyses pta Partial triadic analysis foucart Foucart analysis statis STATIS analysis mfa Multiple factor analysis mcoa Multiple coinertia analysis statico 2 K-table analysis St´ ephane Dray (Univ. Lyon 1) CARME 2011, Rennes 26 / 31 40 / 58

Example Other methods Predict one block with others: • Multi-block PLS regression • Multi-block PCA on instrumental variables.. 41 / 58

Example RV Tests Is there any (linear) relationship between the 2 sets? H0 : ρV = 0 Asymptotic tests: distributions normal, elliptical - rank (Robert et al, 1985, Cléroux, 1995, Cléroux & Ducharme, 1989) nRV ∼ λi Z2 i ⇒ sensitive to the departure from the distribution and to n 42 / 58

Example RV Tests Is there any (linear) relationship between the 2 sets? H0 : ρV = 0 Asymptotic tests: distributions normal, elliptical - rank (Robert et al, 1985, Cléroux, 1995, Cléroux & Ducharme, 1989) nRV ∼ λi Z2 i ⇒ sensitive to the departure from the distribution and to n Permutation tests: permute one matrix’s rows - compute the RV for n! permutations p-value: proportion of the values greater than the observed one ⇒ computationally costly (“old fashion" argument?) 42 / 58

Example RV Tests Is there any (linear) relationship between the 2 sets? H0 : ρV = 0 Asymptotic tests: distributions normal, elliptical - rank (Robert et al, 1985, Cléroux, 1995, Cléroux & Ducharme, 1989) nRV ∼ λi Z2 i ⇒ sensitive to the departure from the distribution and to n Permutation tests: permute one matrix’s rows - compute the RV for n! permutations p-value: proportion of the values greater than the observed one ⇒ computationally costly (“old fashion" argument?) Approximation of the permutation distribution • sampling from the permutations - package ade4 (RV.rtest) • moment matching: Pearson family, Edgeworth expansion 42 / 58

Example Moments matching The ﬁrst three moments under H0 (Kazi-Aoual et al., 1995) EH0 (RV ) = βx × βy n − 1 βx = (tr(X X))2 tr((X X)2) = ( λi )2 λ2 i . βx a measure of complexity 1 ≤ βx ≤ p RV large: n small and many orthogonal variables per group ⇒ Normal approximation: RVstd = RV − EH0 (RV ) VH0 (RV ) 43 / 58

Example Moments matching Problem: the exact distribution of the RVstd is often skewed Histogram of the standardized RV Density −1 0 1 2 3 4 0.0 0.1 0.2 0.3 0.4 0.5 Normal Gamma Edgeworth ⇒ Pearson type III f(x) (skewness= γ): (2/γ)4/γ2 Γ(4/γ2) 2+γx γ (4−γ2)/γ2 e−2(2+xγ)/γ2 ⇒ package FactoMineR (coeﬀRV) (Josse et al., 2008) 44 / 58

Example Back to the wine example! • 10 white wines from Val de Loire (5 Vouvray - 5 Sauvignon) • 27 continuous variables: sensory descriptors O.fruity O.passion O.citrus … Sweetness Acidity Bitterness Astringency Aroma.intensity Aroma.persistency Visual.intensity Odor.preferene Overall.preference Label S Michaud 4.3 2.4 5.7 … 3.5 5.9 4.1 1.4 7.1 6.7 5.0 6.0 5.0 Sauvignon S Renaudie 4.4 3.1 5.3 … 3.3 6.8 3.8 2.3 7.2 6.6 3.4 5.4 5.5 Sauvignon S Trotignon 5.1 4.0 5.3 … 3.0 6.1 4.1 2.4 6.1 6.1 3.0 5.0 5.5 Sauvignon S Buisse Domaine 4.3 2.4 3.6 … 3.9 5.6 2.5 3.0 4.9 5.1 4.1 5.3 4.6 Sauvignon S Buisse Cristal 5.6 3.1 3.5 … 3.4 6.6 5.0 3.1 6.1 5.1 3.6 6.1 5.0 Sauvignon V Aub Silex 3.9 0.7 3.3 … 7.9 4.4 3.0 2.4 5.9 5.6 4.0 5.0 5.5 Vouvray V Aub Marigny 2.1 0.7 1.0 … 3.5 6.4 5.0 4.0 6.3 6.7 6.0 5.1 4.1 Vouvray V Font Domaine 5.1 0.5 2.5 … 3.0 5.7 4.0 2.5 6.7 6.3 6.4 4.4 5.1 Vouvray V Font Brûlés 5.1 0.8 3.8 … 3.9 5.4 4.0 3.1 7.0 6.1 7.4 4.4 6.4 Vouvray V Font Coteaux 4.1 0.9 2.7 … 3.8 5.1 4.3 4.3 7.3 6.6 6.3 6.0 5.7 Vouvray 45 / 58

Example Back to the wine example! • 3 panels (oenologists, naive consumers, our students!) • 60 preference scores: taste evaluation 1 - 10 Categorical Continuous variables Student (15) wine 10 … wine 2 wine 1 Label (1) Preference (60) Consu mer (15) Expert (27) • How are the products described by the panels? • Do the panels describe the products in a same way? Is there a speciﬁc description done by one panel? 46 / 58

Example Practice with R 1 Deﬁne groups of active and supplementary variables 2 Scale or not the variables 3 Perform MFA 4 Choose the number of dimensions to interpret 5 Simultaneously interpret the individuals and variables graphs 6 Study the groups of variables 7 Study the partial representations 8 Use indicators to enrich the interpretation 47 / 58

Example Practice with R library(FactoMineR) Expert <- read.table("http://factominer.free.fr/docs/Expert_wine.csv", header=TRUE, sep=";", row.names=1) Consu <- read.table(".../Consumer_wine.csv",header=T,sep=";",row.names=1) Stud <- read.table(".../Student_wine.csv",header=T,sep=";",row.names=1) Pref <- read.table(".../Preference_wine.csv",header=T,sep=";",row.names=1) palette(c("black","red","blue","orange","darkgreen","maroon","darkviolet")) complet <- cbind.data.frame(Expert[,1:28],Consu[,2:16],Stud[,2:16],Pref) res.mfa <- MFA(complet,group=c(1,27,15,15,60),type=c("n",rep("s",4)), num.group.sup=c(1,5),graph=FALSE, name.group=c("Label","Expert","Consumer","Student","Preference")) plot(res.mfa,choix="group",palette=palette()) plot(res.mfa,choix="var",invisible="quanti.sup",hab="group",palette=palette()) plot(res.mfa,choix="ind",partial="all",habillage="group",palette=palette()) plot(res.mfa,choix="axes",habillage="group",palette=palette()) dimdesc(res.mfa) write.infile(res.mfa,file="my_wine_results.csv") #to export a list 48 / 58

Example Representation of the individuals -2 -1 0 1 2 3 -3 -2 -1 0 1 Dim 1 (42.52 %) Dim 2 (24.42 %) S Michaud S Renaudie S Trotignon S Buisse Domaine S Buisse Cristal V Aub Silex V Aub Marigny V Font Domaine V Font Brûlés V Font Coteaux Sauvignon Vouvray Sauvignon Vouvray • The two labels are well separated • Vouvray are sensorially more diﬀerent • Several groups of wines, ... 49 / 58

Example Representation of the active variables -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 -1.0 -0.5 0.0 0.5 1.0 Dim 1 (42.52 %) Dim 2 (24.42 %) Expert Consumer Student O.Intensity.before.shaking O.Intensity.after.shaking Expression O.fruity O.passion O.citrus O.candied.fruit O.vanilla O.wooded O.mushroom O.plante O.flower O.alcohol Typicity Attack.intensity Sweetness Acidity Bitterness Astringency Freshness Oxidation Smoothness A.intensity A.persistency Visual.intensity Grade Surface.feeling O.Intensity.before.shaking_C O.Intensity.after.shaking_C O.alcohol_C O.plante_C O.mushroom_C O.passion_C O.Typicity_C A.intensity_C Sweetness_C Acidity_C Bitterness_C Astringency_C A.alcohol_C Balance_C Typical_C O.Intensity.before.shaking_S O.Intensity.after.shaking_S O.alcohol_S O.plante_S O.mushroom_S O.passion_S O.Typicity_S A.intensity_S Sweetness_S Acidity_S Bitterness_S Astringency_S A.alcohol_S Balance_S Typical_S 50 / 58

Example Representation of the active variables -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 -1.0 -0.5 0.0 0.5 1.0 Dim 1 (42.52 %) Dim 2 (24.42 %) Expert Consumer Student O.passion Sweetness Acidity O.passion_C Sweetness_C Acidity_C O.passion_S Sweetness_S Acidity_S 50 / 58

Example Representation of the groups 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Dim 1 (42.52 %) Dim 2 (24.42 %) Expert Consumer Student Preference Label • 2 groups are all the more close that they induce the same structure • The 1st dimension is common to all the panels • 2nd dimension mainly due to the experts • Preference linked to sensory description 51 / 58

Example Representation of the partial points -4 -2 0 2 4 -3 -2 -1 0 1 2 Dim 1 (42.52 %) Dim 2 (24.42 %) S Michaud S Renaudie S Trotignon S Buisse Domaine S Buisse Cristal V Aub Silex V Aub Marigny V Font Domaine V Font Brûlés V Font Coteaux Sauvignon Vouvray Expert Consumer Student 52 / 58

Example Representation of the partial dimensions -1.5 -1.0 -0.5 0.0 0.5 1.0 -1.0 -0.5 0.0 0.5 1.0 Dim 1 (42.52 %) Dim 2 (24.42 %) Dim1.Expert Dim2.Expert Dim1.Consumer Dim2.Consumer Dim1.Student Dim2.Student Dim1.Preference Dim2.Preference Dim1.Label Expert Consumer Student Preference Label • The two ﬁrst dimensions of each group are well projected • Consumer has same dimensions as MFA 53 / 58

Example Representation of supplementary continuous variables -1.0 -0.5 0.0 0.5 1.0 -1.0 -0.5 0.0 0.5 1.0 Dim 1 (42.52 %) Dim 2 (24.42 %) ⇒ Preferences do not participated to the construction of the dimensions ⇒ Preferences are linked to sensory description 54 / 58

Example Representation of supplementary continuous variables -2 -1 0 1 2 3 -3 -2 -1 0 1 Dim 1 (42.52 %) Dim 2 (24.42 %) S Michaud S Renaudie S Trotignon S Buisse Domaine S Buisse Cristal V Aub Silex V Aub Marigny V Font Domaine V Font Brûlés V Font Coteaux Sauvignon Vouvray Sauvignon Vouvray -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 -1.0 -0.5 0.0 0.5 1.0 Dim 1 (42.52 %) Dim 2 (24.42 %) Expert Consumer Student O.passion Sweetness Acidity O.passion_C Sweetness_C Acidity_C O.passion_S Sweetness_S Acidity_S 54 / 58

Example Representation of supplementary continuous variables ⇒ Main information: the favourite is Vouvray Aubuisières Silex 54 / 58

Example Helps to interpret • Contribution of each group of variables to each component of the MFA > res.mfa$group$contrib Dim.1 Dim.2 Dim.3 Expert 30.5 46.0 33.7 Consumer 33.2 23.1 31.2 Student 36.3 30.9 35.1 • Similar contribution of the 3 groups to the ﬁrst dimension • Second dimension mainly due to the expert • Correlation between the global cloud and each partial cloud > res.mfa$group$correlation Dim.1 Dim.2 Dim.3 Expert 0.95 0.95 0.96 Consumer 0.95 0.83 0.87 Student 0.99 0.99 0.84 First components are highly linked to the 3 groups: the 3 clouds of points are nearly homothetic 55 / 58

Example Similarity measures between groups > res.mfa$group$Lg Expert Consumer Student Preference Label MFA Expert 1.45 0.94 1.17 1.01 0.89 1.33 Consumer 0.94 1.25 1.04 1.11 0.28 1.21 Student 1.17 1.04 1.29 1.03 0.62 1.31 Preference 1.01 1.11 1.03 1.47 0.37 1.18 Label 0.89 0.28 0.62 0.37 1.00 0.67 MFA 1.33 1.21 1.31 1.18 0.67 1.44 > res.mfa$group$RV Expert Consumer Student Preference Label MFA Expert 1.00 0.70 0.85 0.69 0.74 0.92 Consumer 0.70 1.00 0.82 0.82 0.25 0.90 Student 0.85 0.82 1.00 0.75 0.55 0.96 Preference 0.69 0.82 0.75 1.00 0.31 0.81 Label 0.74 0.25 0.55 0.31 1.00 0.56 MFA 0.92 0.90 0.96 0.81 0.56 1.00 • Expert gives a richer description (Lg greater) • Groups Student and Expert are linked (RV = 0.85) • Group Student is the closest to the overall (RV = 0.96) 56 / 58

Example Partition from the tree An empirical number of clusters is suggested (minq Wq−Wq+1 Wq−1−Wq ) 0.0 0.5 1.0 1.5 2.0 Hierarchical Clustering inertia gain V Aub Silex S Trotignon S Renaudie S Michaud S Buisse Domaine S Buisse Cristal V Font Brûlés V Font Domaine V Aub Marigny V Font Coteaux 0.0 0.5 1.0 1.5 2.0 Hierarchical Classification 57 / 58

Example Partition on the principal component map -2 -1 0 1 2 3 4 -3 -2 -1 0 1 2 Dim 1 (42.52%) Dim 2 (24.42%) V Aub Silex S Trotignon S Buisse Domaine S Renaudie S Michaud S Buisse Cristal V Font Brûlés V Font Domaine V Aub Marigny V Font Coteaux cluster 1 cluster 2 cluster 3 cluster 4 cluster 5 -2 -1 0 1 2 3 4 -3 -2 -1 0 1 2 Dim 1 (42.52%) Dim 2 (24.42%) V Aub Silex S Trotignon S Buisse Domaine S Renaudie S Michaud S Buisse Cristal V Font Brûlés V Font Domaine V Aub Marigny V Font Coteaux cluster 1 cluster 2 cluster 3 cluster 4 cluster 5 Continuous vision (principal component) and discontinuous (clusters) 58 / 58

Multi-blocks methods

Multi-blocks methods

More Decks by julie josse

Other Decks in Research

Featured

Transcript