Leveraging multi-omic data for integrative exploratory, predictive, and network analyses

Andrea Rau
KIM Data & Life Sciences seminar, Montpellier Université d’Excellence, May 30, 2022
https://andrea-rau.com | @andreamrau | slides: https://tinyurl.com/MUSE2022-Rau

2 The multi-omics data landscape
[Figure labels] Gene expression; transcription factor (TF) expression; copy number alterations; promoter methylation; microRNA expression; somatic mutations; germline genetic variation; enhancer accessibility; protein abundance; metabolite concentrations; …
+ Histone modifications + RNA processing/stability + 3D conformation + Microbiome composition + …

3 Large-scale (public) matched multi-omics: The Cancer Genome Atlas (TCGA)
- Comprehensive, multi-dimensional maps of key genomic changes in 33 cancer types from n = 11k+ individuals
  ◦ RNA-seq, miRNA-seq, copy number alterations, methylation, somatic mutations, protein abundance, genotypes, histological data, clinical data → p ~ 100s to 1000s to 100k+
- Publicly available data (multi-tiered data depending on patient identifiability)
- Widely used by the research community (1000+ publications by the TCGA network + independent researchers)
Image: Corces et al. (2018)

4 Smaller-scale matched multi-omics @ INRAE: H2020 GENE-SWitCH
The regulatory GENomE of Swine & Chicken: functional annotation during development
PIs: Elisabetta Giuffra and Hervé Acloque (INRAE)
Aim: deliver new underpinning knowledge on functional genomes of the 2 main monogastric farm species to enable immediate translation to the pig and poultry sectors
- High-quality, richly annotated maps of pig and chicken genomes
  ◦ Developmental stages: early/late organogenesis, newborn/hatched, adult
  ◦ Sexes: {♀,♂} x 3 biological replicates
  ◦ Tissues: liver, skeletal muscle, small intestine, cerebellum, dorsal epidermis, lung, kidney
  ◦ Assays: RNA-seq, ATAC-seq, ChIP-seq, smRNA-seq, lrRNA-seq, methylation, Hi-C, whole genome sequences
  ◦ eQTLs in small intestine + skeletal muscle + liver in pigs
- Integrate functional information with phenotypic + genotypic data in genomic prediction models for greater power and interpretability
http://www.gene-switch.eu
Images: http://www.fragencode.org; http://www.gene-switch.eu/project.html

5 Some challenges of multi-omic data analysis
- Anchor definition / matching of samples and/or biological entities (experimental design)
- Many more biological entities than individuals (p ≫ n) → overfitting
- Heterogeneous data modalities
- Normalization / standardization / pre-processing
- Substantial batch effects (i.e., technical noise)
- Missing or incomplete data (e.g., MI-MFA¹ for imputation)
- Validation/assessment of analysis outputs: lack of ground truth
- Scalability: computational power/memory, look-elsewhere effect
MultiAssayExperiment: coordinated representation + storage + analysis of multi-omics data², https://bioconductor.org/packages/MultiAssayExperiment/
Batch effects viewer: https://bioinformatics.mdanderson.org/BatchEffectsViewer/
¹ Voillet et al. (2016) BMC Bioinformatics; ² Ramos et al. (2017) Cancer Research
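The first challenge on this slide, anchoring samples across assays, can be illustrated with a very small sketch. This is not how MultiAssayExperiment works internally; it is a plain-Python illustration that assumes each assay is a dictionary keyed by sample ID. The function name `match_samples` and the toy data are hypothetical.

```python
def match_samples(assays):
    """Return the sample IDs shared by all assays, and each assay restricted to them.

    `assays` maps an assay name to a dict {sample_id: measurements}.
    """
    # Samples present in every assay act as the anchor for vertical integration.
    shared = set.intersection(*(set(a) for a in assays.values()))
    ordered = sorted(shared)  # one fixed order so rows line up across assays
    matched = {name: [a[s] for s in ordered] for name, a in assays.items()}
    return ordered, matched


# Hypothetical toy data: three assays measured on partially overlapping samples.
assays = {
    "rnaseq": {"S1": [5.1, 0.2], "S2": [4.8, 0.0], "S3": [6.0, 1.1]},
    "methylation": {"S2": [0.91], "S3": [0.15], "S4": [0.40]},
    "cna": {"S1": [0], "S2": [1], "S3": [-1]},
}
samples, matched = match_samples(assays)
print(samples)
```

Keeping only fully matched samples is the simplest choice; it discards S1 and S4 here, which is exactly the loss that imputation or mosaic-integration approaches try to avoid.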

6 What is multi-omic data integration?
Multi-{domain, way, view, modal, table, variate, omics} data
Requires an anchor to link modalities, and must account for (known/unknown) interdependencies within and between modalities
[Figure: four integration settings (Horizontal, Vertical, Diagonal, Mosaic), each drawn as data blocks with samples, features, and assays as axes]
Images adapted from Argelaguet et al. (2021) Nature Biotechnology; Rajasundaram & Selbig (2016) Current Opinion in Plant Biology

7 Why (and how) multi-omic data integration?
Exploration
• Uncover and describe interpretable structure among samples and underlying relationships among omics
• Clustering, unsupervised classification of individuals
Prediction
• Identify an interpretable and concise set of biomarkers
• Accurately predict phenotypes (genomic prediction)
Network inference
• Identify dependencies among biological entities
• Extract mechanistic hypotheses and systems biology insights

8 Multi-omic integration: Exploration
• Which individuals in a large-scale cohort have highly aberrant multi-omic profiles for a given pathway of interest?
• Does patient prognosis correlate with large pathway deviation scores?
• Which genes / omics modalities drive these strongly aberrant scores?
Data: breast invasive carcinoma (BRCA; n = 504) and lung adenocarcinoma (LUAD; n = 144)
• (Batch-corrected) RNA-seq + promoter methylation + copy number alterations + miRNA-seq
• miRNA → gene mapping via miRTarBase (exact matches, Functional MTI predictions)
• 1136 MSigDB curated canonical pathways

9 padma: Pathway deviation scores using Multiple Factor Analysis
Define an individualized pathway-level deviation score based on multi-omic data using MFA
[Figure: tables A, B, C measured on the same individuals, weighted by 1/λA, 1/λB, 1/λC; individual i shown on components PC 1 and PC 2]
http://bioconductor.org/packages/padma
Rau et al. (2022) Biostatistics, https://doi.org/10.1101/827022
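To make the MFA idea on this slide concrete, here is a minimal NumPy sketch. It is not the padma implementation: it assumes complete numeric tables with the same individuals in each, standardizes every variable, weights each table by its first eigenvalue (the 1/λ weights in the figure), runs a global PCA, and uses each individual's distance from the origin as an illustrative deviation score. The function name `mfa_deviation_scores` and the simulated data are hypothetical.

```python
import numpy as np


def mfa_deviation_scores(tables):
    """Illustrative MFA-style deviation scores.

    tables: list of (n_individuals x p_k) arrays sharing the same rows.
    Returns (scores, coords): distance to the origin and global PCA coordinates.
    """
    n = tables[0].shape[0]
    weighted = []
    for X in tables:
        # Center and scale each variable; constant columns are left at zero.
        Z = X - X.mean(axis=0)
        sd = Z.std(axis=0)
        sd[sd == 0] = 1.0
        Z = Z / sd
        # First eigenvalue of this table's separate PCA.
        lam = np.linalg.svd(Z, compute_uv=False)[0] ** 2 / n
        # Dividing by sqrt(lambda) is equivalent to weighting the table by 1/lambda.
        weighted.append(Z / np.sqrt(lam))
    G = np.hstack(weighted)
    # Global PCA of the concatenated, weighted tables.
    U, S, _ = np.linalg.svd(G, full_matrices=False)
    coords = U * S
    scores = np.sqrt((coords ** 2).sum(axis=1))
    return scores, coords


# Simulated data (assumption): 20 individuals, 3 tables, individual 0 is aberrant.
rng = np.random.default_rng(0)
tables = [rng.normal(size=(20, 4)) for _ in range(3)]
for X in tables:
    X[0] += 8.0
scores, coords = mfa_deviation_scores(tables)
print(int(np.argmax(scores)))
```

The per-table weighting keeps a table with many correlated variables from dominating the global components, which is the reason MFA is used here instead of a single PCA on the concatenated data.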

10 Which individuals have the most highly aberrant multi-omic profiles?
[Figure] D4-GDP dissociation inhibitor signaling pathway, LUAD (Cox PH*, BH padj = 0.0111)
Rau et al. (2022) Biostatistics
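The "BH padj" on this slide refers to a Benjamini-Hochberg adjusted p-value, which matters here because one test is run per pathway. As a reminder of what that adjustment does, here is a minimal plain-Python sketch; the p-values are made up for illustration and are not results from the talk.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, e.g. one per pathway.
pvals = [0.001, 0.02, 0.03, 0.5]
padj = benjamini_hochberg(pvals)
print(padj)
```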

11 Which genes/omics drive large pathway deviation scores?
→ CASP1, CASP3, CASP8 have large gene-level deviation scores for the two most extreme individuals…
Rau et al. (2022) Biostatistics

12 padma results on TCGA multi-omic data (RNA-seq + miRNA-seq + methylation + CNA data, MSigDB canonical pathways)
Innovative use of an existing MFA method to quantify and graphically explore individualized multi-omic pathway deviation scores
• Larger padma deviation scores = increasingly aberrant pathway variation, associated with significantly worse prognosis (survival, histological grade) in breast and lung cancer
• Potential outlier detection tool in precision medicine & agriculture applications
Next steps…
• Incorporation of known hierarchical structure among genes in pathway
• Extensions for highly structured data (e.g., multi-omic data from divergent chicken lines subject to feed/heat stress or maize diversity panels under control/cold conditions)

14 Multi-omic integration: Genomic Prediction
Genomic prediction of phenotypes and breeding values is now widely used in most major plant and animal breeding programs
Phenotypes ~ Genotypes
→ Increase rates of genetic gain through:
• Better accuracy of estimated breeding values
• Reduction of generation intervals
• Genome-guided mate selection
Increased availability of additional omics data has the potential to improve prediction and enhance QTL discovery via inclusion as prior biological information
Goal: accurate + interpretable phenotype prediction

15 Bayesian models for genomic prediction
Erbe et al. (2012) Journal of Dairy Science; Kemper et al. (2015) Genetics Selection Evolution

16 Overlapping annotations in genomic prediction
[Figure labels] Genotype (SNP sequence and numeric genotype coding) → BayesR, with SNP effect classes Null / Low / Medium / High → predict GBV
[Figure labels] Annotations: AnimalQTLdb GWAS hits; unmethylated (piglet liver); accessible chromatin (embryo liver)
[Figure labels] Single-annotated SNPs: BayesRC. Multi-annotated SNP: BayesRCπ or BayesRC+?
Mollandin et al. (2022), https://doi.org/10.21203/rs.3.rs-1366477/v1; https://github.com/fmollandin/BayesRCO
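The Null / Low / Medium / High labels correspond to the BayesR prior, in which each SNP effect is drawn from a mixture of four zero-mean normal distributions with variances 0, 10⁻⁴, 10⁻³ and 10⁻² times the additive genetic variance (Erbe et al. 2012). The sketch below only simulates effects from such a prior, with mixing proportions that differ by annotation category in the spirit of BayesRC. It is not the BayesRCO code, it performs no inference, and the category names and proportions are invented for illustration.

```python
import random

# BayesR variance classes (null, low, medium, high) as fractions of the
# additive genetic variance, following Erbe et al. (2012).
VARIANCE_FRACTIONS = (0.0, 1e-4, 1e-3, 1e-2)


def sample_snp_effects(annotations, mix_by_category, sigma2_g, seed=0):
    """Draw one effect per SNP from its category's four-component mixture."""
    rng = random.Random(seed)
    effects = []
    for category in annotations:
        pi = mix_by_category[category]
        k = rng.choices(range(4), weights=pi)[0]  # pick the effect class
        sd = (VARIANCE_FRACTIONS[k] * sigma2_g) ** 0.5
        effects.append(rng.gauss(0.0, sd) if sd > 0 else 0.0)
    return effects


# Hypothetical mixing proportions: the annotated category is enriched for
# non-null effects, the remaining SNPs are almost all null.
mix = {
    "qtl_db": (0.80, 0.10, 0.07, 0.03),
    "other": (0.99, 0.006, 0.003, 0.001),
}
annotations = ["qtl_db"] * 200 + ["other"] * 800
effects = sample_snp_effects(annotations, mix, sigma2_g=1.0)
nonzero = {
    c: sum(1 for a, e in zip(annotations, effects) if a == c and e != 0.0)
    for c in mix
}
print(nonzero)
```

Each SNP belongs to exactly one category in this sketch. The question raised on the slide is what to do when a SNP carries several annotations at once, which is what BayesRCπ and BayesRC+ address.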

17 Improved Bayesian models for genomic prediction
PhD work of Fanny Mollandin (H2020 GENE-SWitCH)
[Figure panels: Cumulative; Preferential assignment]
Mollandin et al. (2022), https://doi.org/10.21203/rs.3.rs-1366477/v1; https://github.com/fmollandin/BayesRCO

18 BayesRCO for genomic prediction: simulations
PhD work of Fanny Mollandin (H2020 GENE-SWitCH)
• Phenotypes simulated from real cattle 50k genotypes (n ~ 2500) with various heritabilities, number/sizes of QTLs
• Types of annotation categories ⇒ strongly/moderately/weakly enriched or unenriched
3 scenarios
• A = 1 strong + 1 moderate + remaining SNPs
• B = 1 strong + 1 moderate + 1 weak + 1 unenriched + remaining SNPs
• C = 2 strong + 2 moderate + 3 weak + 2 unenriched + remaining SNPs
[Figure: improvement in validation prediction and QTL ranking (posterior variance) compared to BayesR; panels BayesRC, BayesRCπ, BayesRC+]
Mollandin et al. (2022), https://doi.org/10.21203/rs.3.rs-1366477/v1; https://github.com/fmollandin/BayesRCO

19 BayesRCO for genomic prediction: PIG-HEAT data
PhD work of Fanny Mollandin (H2020 GENE-SWitCH), collaboration with Hélène Gilbert (GenPhySE)
• 60k genotypes for n ~ 1200 pigs in 2 environments
• 11 (overlapping) annotation categories extracted from PigQTLdb¹ trait hierarchies
• Focus on average daily weight gain and backfat thickness, sibling-structured 10-fold CV
[Figure label: BayesRCπ]
Next steps…
• Annotation categories generated using GENE-SWitCH multi-omics data
¹ https://www.animalgenome.org/cgi-bin/QTLdb/SS/index
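The slide does not spell out what "sibling-structured" means. One common reading, assumed here, is that animals from the same family are kept in the same fold so that close relatives never sit on both sides of a train/validation split. The sketch below builds such folds in plain Python; the function name `family_folds` and the family IDs are hypothetical, and the actual design used for the PIG-HEAT analysis may differ.

```python
import random
from collections import defaultdict


def family_folds(family_ids, n_folds=10, seed=0):
    """Assign individuals to folds so that each family stays within one fold."""
    families = defaultdict(list)
    for idx, fam in enumerate(family_ids):
        families[fam].append(idx)
    fam_list = sorted(families)
    random.Random(seed).shuffle(fam_list)
    folds = [[] for _ in range(n_folds)]
    for fam in fam_list:
        # Put the whole family into the currently smallest fold.
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(families[fam])
    return folds


# Hypothetical pedigree: 60 families of 4 siblings each.
family_ids = [animal // 4 for animal in range(240)]
folds = family_folds(family_ids, n_folds=10)
print([len(f) for f in folds])
```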

21 Multi-omic integration: Networks
Graphs typically used to describe interactions: nodes = individual molecules, edges = interactions (dependencies)
• Homogeneous network (homogeneous nodes, single view)
• Multiplex network (homogeneous nodes, multiple views)
• Multi-layered network (heterogeneous nodes, multiple views)
Image adapted from Lee et al. (2020) Frontiers in Genetics

22 Copula models for mixed-type data networks
• (Sparse) graphical models often preferred to pairwise associations for network inference
• Remove indirect associations by identifying conditional dependencies
• For continuous data, graphical Gaussian models are a popular choice
• But multi-omic data represent mixed-type data (continuous, counts, binary, …) that may have nonconstant correlations across their distribution
• One strategy: couple univariate marginal distributions of variable pairs with copulae ⇒ Map (≠ transform!) mixed-type data into other variables where correlation can be easily defined
[Figure annotations: full joint probability distribution; marginal distribution of each variable; function coupling marginals together (= "copula")]
Image from https://analystprep.com/study-notes/frm/part-1/quantitative-analysis/correlations-and-copulas
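As a rough intuition for why coupling the marginals helps, the sketch below pushes two differently distributed variables through their empirical CDFs onto standard-normal scores and compares the correlation before and after. This is a deliberately simplified stand-in: it does transform the observed values, which the slide explicitly distinguishes from the copula mapping, and a proper treatment of count or binary margins works with latent variables instead. The data and function names are invented for illustration.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map values to standard-normal scores via their empirical CDF (mid-ranks for ties)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        mid_rank = (start + end) / 2 + 1  # average 1-based rank within a tie block
        for j in range(start, end + 1):
            ranks[order[j]] = mid_rank
        start = end + 1
    std_normal = NormalDist()
    # Dividing by n + 1 keeps the probabilities strictly inside (0, 1).
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]


def pearson(x, y):
    """Plain Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical mixed-type pair: a continuous measure and a skewed count.
expression = [0.3, 1.2, 2.5, 2.9, 4.1, 5.0, 6.3, 8.8]
mutation_count = [0, 0, 1, 1, 2, 2, 5, 40]

raw_corr = pearson(expression, mutation_count)
score_corr = pearson(normal_scores(expression), normal_scores(mutation_count))
print(raw_corr, score_corr)
```

On the raw scale the single extreme count dominates the correlation; on the normal-score scale the monotone relationship between the two variables is much easier to see.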

23 DINAMIC: Differential network analysis of mixed-type data with copulae
INRAE DIGIT-BIO Metaprogramme (2021-2023)

24 Some final remarks on multi-omics integration
Multi-omic data integration often requires a combination of software tools + computational expertise + domain expertise…
• Utility of tools for rapid querying + (interactive) exploration of fully processed data without advanced coding knowledge
• Reproducibility
…and answering questions that we have not yet thought to ask¹
Communication + vocabulary is key! (Matrix factorization? Decomposition? Latent factor model? ...)
Moving forward, dealing with partially matched data:
◦ Multi-task learning (mosaic integration)
◦ Transfer learning (exploit large-scale reference atlases)
Emergence of single-cell / spatial / time-course multi-omics data
¹ Stein-O’Brien et al. (2018) Trends in Genetics

Acknowledgements
Part of this work has received funding from the EU’s Horizon 2020 Research and Innovation Programme under grant agreement n°817998.
