Leveraging multi-omic data for integrative exploratory, predictive, and network analyses

Slide 1

Slide 1 text

Leveraging multi-omic data for integrative exploratory, predictive, and network analyses
Andrea Rau
KIM Data & Life Sciences Seminar, Montpellier Université d’Excellence, May 30, 2022
https://andrea-rau.com, @andreamrau
Slides: https://tinyurl.com/MUSE2022-Rau

Slide 2

Slide 2 text

The multi-omics data landscape
[Figure: omics layers arranged around gene expression]
• Gene expression
• Transcription factor (TF) expression
• Copy number alterations
• Promoter methylation
• microRNA expression
• Somatic mutations
• Germline genetic variation
• Enhancer accessibility
• Protein abundance
• Metabolite concentrations
• … + histone modifications + RNA processing/stability + 3D conformation + microbiome composition + …

Slide 3

Slide 3 text

Large-scale (public) matched multi-omics: The Cancer Genome Atlas (TCGA)
- Comprehensive, multi-dimensional maps of key genomic changes in 33 cancer types from n = 11k+ individuals
  ◦ RNA-seq, miRNA-seq, copy number alterations, methylation, somatic mutations, protein abundance, genotypes, histological data, clinical data → p ~ 100s to 1000s to 100k+
- Publicly available data (multi-tiered data depending on patient identifiability)
- Widely used by the research community (1000+ publications by TCGA network + independent researchers)
Image: Corces et al. (2018)

Slide 4

Slide 4 text

Smaller-scale matched multi-omics @ INRAE: H2020 GENE-SWitCH
The regulatory GENomE of Swine & Chicken: functional annotation during development (http://www.gene-switch.eu)
PIs: Elisabetta Giuffra and Hervé Acloque (INRAE)
Aim: deliver new underpinning knowledge on the functional genomes of the 2 main monogastric farm species to enable immediate translation to the pig and poultry sectors
- High-quality, richly annotated maps of pig and chicken genomes
  ◦ Developmental stages: early/late organogenesis, newborn/hatched, adult
  ◦ Sexes: {♀,♂} x 3 biological replicates
  ◦ Tissues: liver, skeletal muscle, small intestine, cerebellum, dorsal epidermis, lung, kidney
  ◦ Assays: RNA-seq, ATAC-seq, ChIP-seq, smRNA-seq, lrRNA-seq, methylation, Hi-C, whole genome sequences
  ◦ eQTLs in small intestine + skeletal muscle + liver in pigs
- Integrate functional information with phenotypic + genotypic data in genomic prediction models for greater power and interpretability
Images: http://www.fragencode.org; http://www.gene-switch.eu/project.html

Slide 5

Slide 5 text

Some challenges of multi-omic data analysis
- Anchor definition / matching of samples and/or biological entities (experimental design)
- Many more biological entities than individuals (p ≫ n) → overfitting
- Heterogeneous data modalities
- Normalization / standardization / pre-processing
- Substantial batch effects (i.e., technical noise)
- Missing or incomplete data (e.g., MI-MFA [1] for imputation)
- Validation/assessment of analysis outputs: lack of ground truth
- Scalability: computational power/memory, look-elsewhere effect
MultiAssayExperiment [2]: coordinated representation + storage + analysis of multi-omics data (https://bioconductor.org/packages/MultiAssayExperiment/)
Batch effects viewer: https://bioinformatics.mdanderson.org/BatchEffectsViewer/
[1] Voillet et al. (2016) BMC Bioinformatics; [2] Ramos et al. (2017) Cancer Research

Slide 6

Slide 6 text

What is multi-omic data integration?
Multi-{domain, way, view, modal, table, variate, omics} data
Requires an anchor to link modalities and must account for (known/unknown) interdependencies within and between modalities
[Figure: four integration layouts (horizontal, vertical, diagonal, mosaic), each drawn as assays arranged along samples and features axes]
Images adapted from Argelaguet et al. (2021) Nature Biotechnology; Rajasundaram & Selbig (2016) Current Opinion in Plant Biology

Slide 7

Slide 7 text

Why (and how) multi-omic data integration?
Exploration
• Uncover and describe interpretable structure among samples and underlying relationships among omics
• Clustering, unsupervised classification of individuals
Prediction
• Identify interpretable and concise set of biomarkers
• Accurately predict phenotypes (genomic prediction)
Network inference
• Identify dependencies among biological entities
• Extract mechanistic hypotheses and systems biology insights

Slide 8

Slide 8 text

Multi-omic integration: Exploration
• Which individuals in a large-scale cohort have highly aberrant multi-omic profiles for a given pathway of interest?
• Does patient prognosis correlate with large pathway deviation scores?
• Which genes / omics modalities drive these strongly aberrant scores?
Data: breast invasive carcinoma (BRCA; n = 504) and lung adenocarcinoma (LUAD; n = 144)
• (Batch-corrected) RNA-seq + promoter methylation + copy number alterations + miRNA-seq
• miRNA → gene mapping via miRTarBase (exact matches, Functional MTI predictions)
• 1136 MSigDB curated canonical pathways

Slide 9

Slide 9 text

padma: Pathway deviation scores using Multiple Factor Analysis
Define an individualized pathway-level deviation score based on multi-omic data using MFA
[Figure: tables A, B, C measured on the same individuals are weighted by 1/λA, 1/λB, 1/λC and combined in a global PCA (PC 1, PC 2), in which individual i is located]
http://bioconductor.org/packages/padma
Rau et al. (2022) Biostatistics, https://doi.org/10.1101/827022
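The weighting idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the padma implementation: the function name `mfa_deviation_scores` is hypothetical, each block is divided by its first singular value (the usual MFA weighting), and the score is taken as the plain Euclidean distance to the origin in the global component space. The exact scaling used by padma is not given on the slide and is an assumption here.

```python
import numpy as np

def mfa_deviation_scores(blocks):
    """Hypothetical helper: one deviation score per individual.

    blocks: list of (n_individuals x p_k) arrays measured on the same individuals.
    """
    weighted = []
    for X in blocks:
        X = np.asarray(X, dtype=float)
        # Center and scale each variable; constant columns stay at zero.
        Xc = X - X.mean(axis=0)
        sd = Xc.std(axis=0)
        sd[sd == 0] = 1.0
        Z = Xc / sd
        # MFA weighting: divide the block by its first singular value so that
        # no single table dominates the global analysis.
        s1 = np.linalg.svd(Z, compute_uv=False)[0]
        weighted.append(Z / s1 if s1 > 0 else Z)
    # Global PCA on the concatenated, weighted tables.
    G = np.hstack(weighted)
    U, s, _ = np.linalg.svd(G, full_matrices=False)
    coords = U * s  # coordinates of individuals on the global components
    # Assumption: score = unscaled distance to the origin over all components.
    return np.sqrt((coords ** 2).sum(axis=1))
```

Individuals whose profile departs from the cohort average across several tables at once end up far from the origin and therefore receive large scores.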

Slide 10

Slide 10 text

Which individuals have the most highly aberrant multi-omic profiles?
D4-GDP dissociation inhibitor signaling pathway, LUAD (Cox PH*, BH padj = 0.0111)
Rau et al. (2022) Biostatistics

Slide 11

Slide 11 text

Which genes/omics drive large pathway deviation scores?
→ CASP1, CASP3, CASP8 have large gene-level deviation scores for the two most extreme individuals…
Rau et al. (2022) Biostatistics

Slide 12

Slide 12 text

padma results on TCGA multi-omic data (RNA-seq + miRNA-seq + methylation + CNA data, MSigDB canonical pathways)
Innovative use of the existing MFA method to quantify and graphically explore individualized multi-omic pathway deviation scores
• Larger padma deviation scores = increasingly aberrant pathway variation, associated with significantly worse prognosis (survival, histological grade) in breast and lung cancer
• Potential outlier detection tool in precision medicine & agriculture applications
Next steps…
• Incorporation of known hierarchical structure among genes in pathway
• Extensions for highly structured data (e.g., multi-omic data from divergent chicken lines subject to feed/heat stress or maize diversity panels under control/cold conditions)

Slide 14

Slide 14 text

Multi-omic integration: Genomic Prediction
Genomic prediction of phenotypes and breeding values is now widely used in most major plant and animal breeding programs
Phenotypes ~ Genotypes → Increase rates of genetic gain through:
• Better accuracy of estimated breeding values
• Reduction of generation intervals
• Genome-guided mate selection
Increased availability of additional omics data has the potential to improve prediction and enhance QTL discovery via inclusion as prior biological information
Goal: accurate + interpretable phenotype prediction
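As a point of reference for "Phenotypes ~ Genotypes", the sketch below fits a ridge regression of phenotypes on SNP genotypes and predicts new individuals. It is a deliberately simple stand-in, not one of the Bayesian models discussed in the talk; the function names and the penalty `lam` are hypothetical choices made for illustration.

```python
import numpy as np

def ridge_snp_effects(genotypes, phenotypes, lam=1.0):
    """Estimate SNP effects by ridge regression (illustrative stand-in)."""
    X = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotypes, dtype=float)
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    p = Xc.shape[1]
    # Solve (Xc'Xc + lam * I) beta = Xc'(y - mean(y)); the penalty keeps the
    # system well-posed even when there are more SNPs than individuals.
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y_mean))
    return beta, x_mean, y_mean

def predict_phenotypes(genotypes, beta, x_mean, y_mean):
    """Predict phenotypes of new individuals from their genotypes."""
    return y_mean + (np.asarray(genotypes, dtype=float) - x_mean) @ beta
```

A single shared penalty shrinks all SNP effects equally, which is exactly what annotation-aware models aim to relax by letting prior biological information change how strongly each SNP is shrunk.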

Slide 15

Slide 15 text

Bayesian models for genomic prediction
Erbe et al. (2012) Journal of Dairy Science; Kemper et al. (2015) Genetics Selection Evolution
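The slide's formulas did not survive extraction. For orientation, the BayesR model of Erbe et al. (2012) is usually written as below: a four-component mixture prior on SNP effects with one null class and three classes of increasing variance. The symbols ($\mu$, $\beta_j$, $\pi$, $\sigma_g^2$, $\sigma_e^2$, $\alpha$) are notation chosen here, not taken from the slide.

```latex
\begin{aligned}
y &= \mu \mathbf{1} + X\beta + e, \qquad e \sim \mathcal{N}(0, \sigma_e^2 I), \\
\beta_j \mid \pi, \sigma_g^2 &\sim \pi_1\,\delta_0
  + \pi_2\,\mathcal{N}(0, 10^{-4}\sigma_g^2)
  + \pi_3\,\mathcal{N}(0, 10^{-3}\sigma_g^2)
  + \pi_4\,\mathcal{N}(0, 10^{-2}\sigma_g^2), \\
\pi = (\pi_1, \dots, \pi_4) &\sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_4).
\end{aligned}
```

In annotation-aware variants such as BayesRC, the mixing proportions are, roughly speaking, allowed to differ by annotation category, so that SNPs in an enriched category can be assigned to the non-null classes more often.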

Slide 16

Slide 16 text

Overlapping annotations in genomic prediction
[Figure: SNP genotypes (coded 0/1/2) feed BayesR to predict GBV, with SNP effect classes null, low, medium, and high. BayesRC uses single-annotated SNPs (e.g., AnimalQTLdb GWAS hits). For multi-annotated SNPs (e.g., also unmethylated in piglet liver or in accessible chromatin in embryo liver), the choice is BayesRCπ or BayesRC+.]
Mollandin et al. (2022), https://doi.org/10.21203/rs.3.rs-1366477/v1; https://github.com/fmollandin/BayesRCO

Slide 17

Slide 17 text

Improved Bayesian models for genomic prediction
PhD work of Fanny Mollandin (H2020 GENE-SWitCH)
[Figure panels: Cumulative; Preferential assignment]
Mollandin et al. (2022), https://doi.org/10.21203/rs.3.rs-1366477/v1; https://github.com/fmollandin/BayesRCO

Slide 18

Slide 18 text

BayesRCO for genomic prediction: simulations
PhD work of Fanny Mollandin (H2020 GENE-SWitCH)
• Phenotypes simulated from real cattle 50k genotypes (n ~ 2500) with various heritabilities, number/sizes of QTLs
• Types of annotation categories ⇒ strongly/moderately/weakly enriched or unenriched
3 scenarios:
• A = 1 strong + 1 moderate + remaining SNPs
• B = 1 strong + 1 moderate + 1 weak + 1 unenriched + remaining SNPs
• C = 2 strong + 2 moderate + 3 weak + 2 unenriched + remaining SNPs
Improvement in validation prediction and QTL ranking (posterior variance) compared to BayesR
[Figure panels: BayesRC, BayesRCπ, BayesRC+]
Mollandin et al. (2022), https://doi.org/10.21203/rs.3.rs-1366477/v1; https://github.com/fmollandin/BayesRCO

Slide 19

Slide 19 text

BayesRCO for genomic prediction: PIG-HEAT data
PhD work of Fanny Mollandin (H2020 GENE-SWitCH), collaboration with Hélène Gilbert (GenPhySE)
• 60k genotypes for n ~1200 pigs in 2 environments
• 11 (overlapping) annotation categories extracted from PigQTLdb [1] trait hierarchies
• Focus on average daily weight gain and backfat thickness, sibling-structured 10-fold CV
[Figure panel: BayesRCπ]
Next steps…
• Annotation categories generated using GENE-SWitCH multi-omics data
[1] https://www.animalgenome.org/cgi-bin/QTLdb/SS/index
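The slide does not say how the sibling structure enters the cross-validation. One common reading, assumed here, is that all members of a family are kept in the same fold so that no animal is predicted from its own siblings. The sketch below implements that reading with a hypothetical helper `family_folds`; the actual scheme used for the PIG-HEAT analysis may differ.

```python
from collections import Counter

def family_folds(family_ids, n_folds=10):
    """Assign each animal a fold index so that a family never spans two folds.

    Assumption: "sibling-structured" means whole families are held out together.
    """
    family_sizes = Counter(family_ids)
    # Place the largest families first; ties are broken by label so the
    # assignment is reproducible.
    ordered = sorted(family_sizes, key=lambda fam: (-family_sizes[fam], str(fam)))
    fold_sizes = [0] * n_folds
    fold_of_family = {}
    for fam in ordered:
        # Greedy balancing: put the family in the currently smallest fold.
        smallest = fold_sizes.index(min(fold_sizes))
        fold_of_family[fam] = smallest
        fold_sizes[smallest] += family_sizes[fam]
    return [fold_of_family[fam] for fam in family_ids]
```

Each fold is then held out in turn: the model is trained on animals from the other folds and evaluated on the held-out families.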

Slide 21

Slide 21 text

Multi-omic integration: Networks
Graphs typically used to describe interactions: nodes = individual molecules, edges = interactions (dependencies)
• Homogeneous network (homogeneous nodes, single view)
• Multiplex network (homogeneous nodes, multiple views)
• Multi-layered network (heterogeneous nodes, multiple views)
Image adapted from Lee et al. (2020) Frontiers in Genetics

Slide 22

Slide 22 text

Copula models for mixed-type data networks
• (Sparse) graphical models often preferred to pairwise associations for network inference
• Remove indirect associations by identifying conditional dependencies
• For continuous data, graphical Gaussian models are a popular choice
• But multi-omic data represent mixed-type data (continuous, counts, binary, …) that may have nonconstant correlations across their distribution
• One strategy: couple univariate marginal distributions of variable pairs with copulae
⇒ Map (≠ transform!) mixed-type data into other variables where correlation can be easily defined
[Figure: the full joint probability distribution is expressed through the marginal distribution of each variable and a function coupling the marginals together (= « copula »)]
Image from https://analystprep.com/study-notes/frm/part-1/quantitative-analysis/correlations-and-copulas
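To make the mapping idea concrete, the sketch below sends each variable to standard-normal scores through its empirical ranks and then computes an ordinary correlation on those scores, which is the simplest rank-based flavor of a Gaussian copula. This is an illustrative assumption, not the method of the DINAMIC project: the function names are hypothetical, and with heavily tied data (binary variables, low counts) rank-based scores are only a rough approximation.

```python
from statistics import NormalDist

def normal_scores(values):
    """Map values to standard-normal scores via their (tie-averaged) ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank within a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    # Divide by n + 1 so probabilities stay strictly inside (0, 1).
    return [nd.inv_cdf(r / (n + 1)) for r in ranks]

def pearson(x, y):
    """Plain Pearson correlation (inputs must not be constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def gaussian_copula_correlation(x, y):
    """Correlation between two variables on the normal-score scale."""
    return pearson(normal_scores(x), normal_scores(y))
```

Because only ranks are used, the result is unchanged by any monotone rescaling of either variable, which is what allows continuous measurements and counts to be compared on a common scale.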

Slide 23

Slide 23 text

DINAMIC: Differential network analysis of mixed-type data with copulae
INRAE DIGIT-BIO Metaprogramme (2021-2023)

Slide 24

Slide 24 text

Some final remarks on multi-omics integration
Multi-omic data integration often requires a combination of software tools + computational expertise + domain expertise… and answering questions that we have not yet thought to ask [1]
• Utility of tools for rapid querying + (interactive) exploration of fully processed data without advanced coding knowledge
• Reproducibility
Communication + vocabulary is key! Matrix factorization? Decomposition? Latent factor model? ...
Moving forward, dealing with partially matched data:
◦ Multi-task learning (mosaic integration)
◦ Transfer learning (exploit large-scale reference atlases)
Emergence of single-cell / spatial / time-course multi-omics data
[1] Stein-O’Brien et al. (2018) Trends in Genetics

Slide 25

Slide 25 text

Acknowledgements
Part of this work has received funding from the EU’s Horizon 2020 Research and Innovation Programme under grant agreement n°817998.