Leveraging multi-omic data for integrative exploratory, predictive, and network analyses

KIM -- Data & Life Sciences seminar @ Montpellier (May 30, 2022)

Abstract: The increased availability and affordability of high-throughput sequencing technologies in recent years have facilitated the use of multi-omic studies, expanding and enriching our understanding of complex systems across hierarchical biological levels. Integrative methods for these heterogeneous and multi-faceted omics data have shown promise for enhancing the interpretability of exploratory analyses, improving predictive power, and contributing to a holistic understanding of systems biology. However, such integrative analyses are accompanied by several major obstacles, including the potentially ambiguous relationships among omic levels, high dimensionality coupled with small sample sizes, technical artefacts due to batch effects, potentially incomplete or missing data… and the occasional difficulty in posing well-defined and answerable research questions of such data. In light of these challenges, in this talk I will discuss a few of our recent methodological contributions to integrate multi-omic data for (1) exploratory analyses, (2) genomic prediction, and (3) network inference, all with a focus on enhanced interpretability and user-friendly software implementations.

Andrea Rau

May 17, 2022

Transcript

  1. Leveraging multi-omic data for integrative exploratory, predictive, and network analyses

    Andrea Rau. KIM Data & Life Sciences seminar, Montpellier Université d’Excellence, May 30, 2022. https://andrea-rau.com, @andreamrau. Slides: https://tinyurl.com/MUSE2022-Rau
  2. The multi-omics data landscape

    Gene expression, transcription factor (TF) expression, copy number alterations, promoter methylation, microRNA expression, somatic mutations, germline genetic variation, enhancer accessibility, protein abundance, metabolite concentrations, … + histone modifications + RNA processing/stability + 3D conformation + microbiome composition + …
  3. Large-scale (public) matched multi-omics: The Cancer Genome Atlas (TCGA)

    - Comprehensive, multi-dimensional maps of key genomic changes in 33 cancer types from n = 11k+ individuals
      ◦ RNA-seq, miRNA-seq, copy number alterations, methylation, somatic mutations, protein abundance, genotypes, histological data, clinical data → p ~ 100s to 1000s to 100k+
    - Publicly available data (multi-tiered data depending on patient identifiability)
    - Widely used by the research community (1000+ publications by TCGA network + independent researchers)
    Image: Corces et al. (2018)
  4. Smaller-scale matched multi-omics @ INRAE: H2020 GENE-SWitCH

    The regulatory GENomE of Swine & Chicken: functional annotation during development (http://www.gene-switch.eu)
    PIs: Elisabetta Giuffra and Hervé Acloque (INRAE)
    Aim: deliver new underpinning knowledge on functional genomes of the 2 main monogastric farm species to enable immediate translation to the pig and poultry sectors
    - High-quality, richly annotated maps of pig and chicken genomes
      ◦ Developmental stages: early/late organogenesis, newborn/hatched, adult
      ◦ Sexes: {♀,♂} x 3 biological replicates
      ◦ Tissues: liver, skeletal muscle, small intestine, cerebellum, dorsal epidermis, lung, kidney
      ◦ Assays: RNA-seq, ATAC-seq, ChIP-seq, smRNA-seq, lrRNA-seq, methylation, Hi-C, whole genome sequences
      ◦ eQTLs in small intestine + skeletal muscle + liver in pigs
    Integrate functional information with phenotypic + genotypic data in genomic prediction models for greater power and interpretability
    Images: http://www.fragencode.org; http://www.gene-switch.eu/project.html
  5. Some challenges of multi-omic data analysis

    - Anchor definition / matching of samples and/or biological entities (experimental design)
    - Many more biological entities than individuals (p ≫ n) → overfitting
    - Heterogeneous data modalities
    - Normalization / standardization / pre-processing
    - Substantial batch effects (i.e., technical noise)
    - Missing or incomplete data (e.g., MI-MFA [1] for imputation)
    - Validation/assessment of analysis outputs: lack of ground truth
    - Scalability: computational power/memory, look-elsewhere effect
    https://bioinformatics.mdanderson.org/BatchEffectsViewer/
    MultiAssayExperiment: coordinated representation + storage + analysis of multi-omics data [2], https://bioconductor.org/packages/MultiAssayExperiment/
    [1] Voillet et al. (2016) BMC Bioinformatics; [2] Ramos et al. (2017) Cancer Research
  6. What is multi-omic data integration?

    Multi-{domain, way, view, modal, table, variate, omics} data
    Requires anchor to link modalities, account for (known/unknown) interdependencies within and between modalities
    [Diagram: four integration layouts (horizontal, vertical, diagonal, mosaic), each drawn with samples, features, and assays as axes]
    Images adapted from Argelaguet et al. (2021) Nature Biotechnology; Rajasundaram & Selbig (2016) Current Opinion in Plant Biology
  7. Why (and how) multi-omic data integration?

    Exploration
    • Uncover and describe interpretable structure among samples and underlying relationships among omics
    • Clustering, unsupervised classification of individuals
    Prediction
    • Identify interpretable and concise set of biomarkers
    • Accurately predict phenotypes (genomic prediction)
    Network inference
    • Identify dependencies among biological entities
    • Extract mechanistic hypotheses and systems biology insights
  8. Multi-omic integration: Exploration

    Which individuals in a large-scale cohort have highly aberrant multi-omic profiles for a given pathway of interest? Does patient prognosis correlate with large pathway deviation scores? Which genes / omics modalities drive these strongly aberrant scores?
    Breast invasive carcinoma (BRCA; n = 504) and lung adenocarcinoma (LUAD; n = 144)
    • (Batch-corrected) RNA-seq + promoter methylation + copy number alterations + miRNA-seq
    • miRNA → gene mapping via miRTarBase (exact matches, Functional MTI predictions)
    • 1136 MSigDB curated canonical pathways
  9. padma: Pathway deviation scores using Multiple Factor Analysis

    Define an individualized pathway-level deviation score based on multi-omic data using MFA
    [Diagram: tables A, B, C measured on the same individuals, weighted by 1/λA, 1/λB, 1/λC; projection onto PC 1 and PC 2]
    http://bioconductor.org/packages/padma
    Rau et al. (2022) Biostatistics, https://doi.org/10.1101/827022
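
The weighting shown on this slide can be sketched in a few lines. This is a simplified illustration, not the padma implementation: it assumes the standard MFA weighting (each table is centered, scaled, and weighted by the inverse of its first eigenvalue) and assumes the deviation score is the distance of each individual from the origin of the MFA space. Because the full set of MFA components is a rotation of the weighted, concatenated tables, that distance can be computed without the final PCA. The function names and the toy data are hypothetical.

```python
import math
import random


def center_scale(table):
    """Center each column and scale it to unit variance (rows = individuals)."""
    n = len(table)
    scaled_cols = []
    for col in zip(*table):
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        scaled_cols.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def first_eigenvalue(x, n_iter=200):
    """Largest eigenvalue of (1/n) X^T X, approximated by power iteration."""
    n, p = len(x), len(x[0])
    v = [float(j + 1) for j in range(p)]  # arbitrary non-zero start vector
    norm = math.sqrt(sum(a * a for a in v))
    v = [a / norm for a in v]
    lam = 0.0
    for _ in range(n_iter):
        xv = [sum(row[j] * v[j] for j in range(p)) for row in x]
        w = [sum(x[i][j] * xv[i] for i in range(n)) / n for j in range(p)]
        lam = math.sqrt(sum(a * a for a in w))
        if lam == 0.0:
            break
        v = [a / lam for a in w]
    return lam


def deviation_scores(tables):
    """Distance of each individual to the origin after MFA weighting."""
    n = len(tables[0])
    squared = [0.0] * n
    for table in tables:
        x = center_scale(table)
        lam = first_eigenvalue(x)
        if lam == 0.0:
            continue  # a constant table carries no information
        for i, row in enumerate(x):
            squared[i] += sum(v * v for v in row) / lam
    return [math.sqrt(s) for s in squared]


# Toy data (made up): two omic tables for the same 20 individuals.
rng = random.Random(1)
rna = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(20)]
meth = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(20)]
rna[0] = [8.0] * 5   # individual 0 is made aberrant on purpose
meth[0] = [8.0] * 3
scores = deviation_scores([rna, meth])
most_aberrant = scores.index(max(scores))
```

Dividing each table by its first eigenvalue keeps a table with many features from dominating the score simply because it is larger.
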
  10. Which individuals have the most highly aberrant multi-omic profiles?

    D4-GDP dissociation inhibitor signaling pathway, LUAD (Cox PH*, BH padj = 0.0111)
    Rau et al. (2022) Biostatistics
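
The adjusted p-value reported on this slide uses the Benjamini-Hochberg (BH) correction, which is needed because one survival model is fitted per pathway. A minimal sketch of the BH step-up adjustment follows; it should match what R's `p.adjust(p, method = "BH")` computes. The function name `bh_adjust` and the p-values are made up for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest so a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four pathways.
raw = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(raw)
```
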
  11. Which genes/omics drive large pathway deviation scores?

    → CASP1, CASP3, CASP8 have large gene-level deviation scores for the two most extreme individuals…
    Rau et al. (2022) Biostatistics
  12. padma results on TCGA multi-omic data (RNA-seq + miRNA-seq + methylation + CNA data, MSigDB canonical pathways)

    Innovative use of existing MFA method to quantify and graphically explore individualized multi-omic pathway deviation scores
    • Larger padma deviation scores = increasingly aberrant pathway variation with significantly worse prognosis (survival, histological grade) in breast and lung cancer
    • Potential outlier detection tool in precision medicine & agriculture applications
    Next steps…
    • Incorporation of known hierarchical structure among genes in pathway
    • Extensions for highly structured data (e.g., multi-omic data from divergent chicken lines subject to feed/heat stress or maize diversity panels under control/cold conditions)
  13. Why (and how) multi-omic data integration?

    Exploration
    • Uncover and describe interpretable structure among samples and underlying relationships among omics
    • Clustering, unsupervised classification of individuals
    Prediction
    • Identify interpretable and concise set of biomarkers
    • Accurately predict phenotypes (genomic prediction)
    Network inference
    • Identify dependencies among biological entities
    • Extract mechanistic hypotheses and systems biology insights
  14. Multi-omic integration: Genomic Prediction

    Genomic prediction of phenotypes and breeding values now widely used in most major plant and animal breeding programs
    Phenotypes ~ Genotypes → Increase rates of genetic gain through:
    • Better accuracy of estimated breeding values
    • Reduction of generation intervals
    • Genome-guided mate selection
    Increased availability of additional omics data has potential to improve prediction and enhance QTL discovery via inclusion as prior biological information
    Goal: accurate + interpretable phenotype prediction
  15. Bayesian models for genomic prediction

    Erbe et al. (2012) Journal of Dairy Science; Kemper et al. (2015) Genetics Selection Evolution
  16. Overlapping annotations in genomic prediction

    [Diagram labels: Genotype (…ACTCCGTAACTAGCCTACAAAGGCTAACTTACAAAAGATTTA… coded as …000001001201002100200010100001011001011110…); BayesR with SNP effect classes Null / Low / Medium / High; BayesRC with annotations such as AnimalQTLdb, GWAS hits, unmethylated (piglet liver), accessible chromatin (embryo liver); single-annotated SNPs vs. multi-annotated SNP: BayesRCπ or BayesRC+?; Predict GBV]
    Mollandin et al. (2022), https://doi.org/10.21203/rs.3.rs-1366477/v1; https://github.com/fmollandin/BayesRCO
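
The Null / Low / Medium / High classes on this slide correspond to the BayesR prior, in which each SNP effect is drawn from a mixture of a point mass at zero and three zero-mean normal distributions with variances of 0.0001, 0.001, and 0.01 times the additive genetic variance (Erbe et al. 2012). In BayesRC, each annotation category has its own mixing proportions. The sketch below only draws effects from such a prior. It is not the BayesRCO sampler, which estimates the proportions from data, and the category names and proportions are hypothetical.

```python
import random

# Variance of each effect class (null, low, medium, high), expressed as a
# fraction of the additive genetic variance.
CLASS_VARIANCE_FRACTIONS = (0.0, 1e-4, 1e-3, 1e-2)


def draw_snp_effect(mixing_props, genetic_variance, rng):
    """Draw (class index, effect) for one SNP from the four-class mixture."""
    k = rng.choices(range(4), weights=mixing_props)[0]
    sd = (CLASS_VARIANCE_FRACTIONS[k] * genetic_variance) ** 0.5
    effect = rng.gauss(0.0, sd) if sd > 0 else 0.0
    return k, effect


# Hypothetical mixing proportions: an enriched annotation category puts
# more prior mass on the non-null classes than unannotated SNPs do.
props_by_category = {
    "unannotated": (0.990, 0.007, 0.002, 0.001),
    "annotated": (0.900, 0.060, 0.030, 0.010),
}

rng = random.Random(42)
draws = {
    category: [draw_snp_effect(props, 1.0, rng) for _ in range(5000)]
    for category, props in props_by_category.items()
}
nonnull_fraction = {
    category: sum(k > 0 for k, _ in d) / len(d) for category, d in draws.items()
}
```
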
  17. Improved Bayesian models for genomic prediction

    PhD work of Fanny Mollandin (H2020 GENE-SWitCH)
    [Diagram panels: Cumulative; Preferential assignment]
    Mollandin et al. (2022), https://doi.org/10.21203/rs.3.rs-1366477/v1; https://github.com/fmollandin/BayesRCO
  18. BayesRCO for genomic prediction: simulations

    PhD work of Fanny Mollandin (H2020 GENE-SWitCH)
    • Phenotypes simulated from real cattle 50k genotypes (n ~ 2500) with various heritabilities, number/sizes of QTLs
    • Types of annotation categories ⇒ strongly/moderately/weakly enriched or unenriched
    3 scenarios:
    • A = 1 strong + 1 moderate + remaining SNPs
    • B = 1 strong + 1 moderate + 1 weak + 1 unenriched + remaining SNPs
    • C = 2 strong + 2 moderate + 3 weak + 2 unenriched + remaining SNPs
    Improvement in validation prediction and QTL ranking (posterior variance) compared to BayesR
    [Figure panels: BayesRC, BayesRCπ, BayesRC+]
    Mollandin et al. (2022), https://doi.org/10.21203/rs.3.rs-1366477/v1; https://github.com/fmollandin/BayesRCO
  19. BayesRCO for genomic prediction: PIG-HEAT data

    PhD work of Fanny Mollandin (H2020 GENE-SWitCH), collaboration with Hélène Gilbert (GenPhySE)
    • 60k genotypes for n ~ 1200 pigs in 2 environments
    • 11 (overlapping) annotation categories extracted from PigQTLdb [1] trait hierarchies
    • Focus on average daily weight gain and backfat thickness, sibling-structured 10-fold CV
    [Figure panel: BayesRCπ]
    Next steps…
    • Annotation categories generated using GENE-SWitCH multi-omics data
    [1] https://www.animalgenome.org/cgi-bin/QTLdb/SS/index
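
The slide does not spell out how the sibling-structured cross-validation folds were built. One plausible reading, assumed here, is that all members of a family are placed in the same fold, so that close relatives never appear in both the training and validation sets. The sketch below illustrates that assumption only; the function name, animal identifiers, and family labels are hypothetical.

```python
import random
from collections import defaultdict


def family_folds(family_of, n_folds=10, seed=0):
    """Assign animals to folds so that no family is split across folds."""
    members = defaultdict(list)
    for animal, family in family_of.items():
        members[family].append(animal)
    families = sorted(members)  # fixed starting order for reproducibility
    random.Random(seed).shuffle(families)
    # Greedy balancing: place larger families first, always into the fold
    # that currently holds the fewest animals.
    families.sort(key=lambda f: len(members[f]), reverse=True)
    folds = [[] for _ in range(n_folds)]
    for family in families:
        smallest = min(folds, key=len)
        smallest.extend(members[family])
    return folds


# Toy pedigree (made up): 120 pigs in 30 families of 4 siblings.
family_of = {f"pig{i}": f"fam{i // 4}" for i in range(120)}
folds = family_folds(family_of, n_folds=10, seed=0)
```

Each fold is then held out in turn as the validation set while the model is fitted on the other nine.
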
  20. Why (and how) multi-omic data integration?

    Exploration
    • Uncover and describe interpretable structure among samples and underlying relationships among omics
    • Clustering, unsupervised classification of individuals
    Prediction
    • Identify interpretable and concise set of biomarkers
    • Accurately predict phenotypes (genomic prediction)
    Network inference
    • Identify dependencies among biological entities
    • Extract mechanistic hypotheses and systems biology insights
  21. Multi-omic integration: Networks

    Graphs typically used to describe interactions: nodes = individual molecules, edges = interactions (dependencies)
    • Homogeneous network (homogeneous nodes, single view)
    • Multiplex network (homogeneous nodes, multiple views)
    • Multi-layered network (heterogeneous nodes, multiple views)
    Image adapted from Lee et al. (2020) Frontiers in Genetics
  22. Copula models for mixed-type data networks

    • (Sparse) graphical models often preferred to pairwise associations for network inference
    • Remove indirect associations by identifying conditional dependencies
    • For continuous data, graphical Gaussian models are a popular choice
    • But multi-omic data represent mixed-type data (continuous, counts, binary, …) that may have nonconstant correlations across their distribution
    • One strategy: couple univariate marginal distributions of variable pairs with copulae ⇒ Map (≠ transform!) mixed-type data into other variables where correlation can be easily defined
    [Diagram labels: full joint probability distribution; marginal distribution of each variable; function coupling marginals together (= « copula »)]
    Image from https://analystprep.com/study-notes/frm/part-1/quantitative-analysis/correlations-and-copulas
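
The decomposition labelled on this slide is Sklar's theorem: a joint distribution F(x1, …, xp) can be written as C(F1(x1), …, Fp(xp)), where the Fj are the marginal distributions and C is the copula. As a rough illustration of the idea, the sketch below uses a rank-based Gaussian copula: each variable is mapped to Gaussian scores through its empirical ranks, and the correlation is computed on those scores. This is one common and simple choice, not necessarily the estimator used in the work presented here; the function names and toy values are hypothetical.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map values to Gaussian scores via average ranks (ties share a rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # 1-based rank of the tie group
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    standard_normal = NormalDist()
    # Dividing by n + 1 keeps the pseudo-observations strictly inside (0, 1).
    return [standard_normal.inv_cdf(r / (n + 1)) for r in ranks]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Toy mixed-type data (made up): a skewed continuous variable and counts.
expression = [0.2, 1.5, 3.1, 4.8, 7.9, 12.4, 20.3, 55.0]
counts = [0, 0, 1, 2, 2, 5, 9, 40]
latent_correlation = pearson(normal_scores(expression), normal_scores(counts))
```

Because only ranks enter the scores, the result is unchanged by any strictly increasing rescaling of either variable, which is what makes the approach usable across data types. A full network analysis would go on to estimate conditional dependencies, for example from a sparse inverse of the resulting correlation matrix.
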
  23. Some final remarks on multi-omics integration

    …and answering questions that we have not yet thought to ask [1]
    Multi-omic data integration often requires a combination of software tools + computational expertise + domain expertise…
    • Utility of tools for rapid querying + (interactive) exploration of fully processed data without advanced coding knowledge
    • Reproducibility
    Communication + vocabulary is key! Matrix factorization? Decomposition? Latent factor model? ...
    Moving forward, dealing with partially matched data:
    ◦ Multi-task learning (mosaic integration)
    ◦ Transfer learning (exploit large-scale reference atlases)
    Emergence of single-cell / spatial / time-course multi-omics data
    [1] Stein-O’Brien et al. (2018) Trends in Genetics
  24. Acknowledgements

    Part of this work has received funding from the EU’s Horizon 2020 Research and Innovation Programme under grant agreement n°817998.