Integrative and interactive analyses of multi-omics data

Andrea Rau
JOBIM Everywhere @ Zoom, July 2, 2020
https://andrea-rau.com, @andreamrau
Slides: https://tinyurl.com/JOBIM2020-Rau

2 The gene regulatory landscape and multi-omics data
[Figure: schematic of the gene regulatory landscape; sequence labels TTTGCA / AAACGT and …GCAGCGTTCGA… / …GCAACGTTAGA…]
- Gene expression
- Transcription factor (TF) expression
- Copy number alterations
- Promoter methylation
- microRNA expression
- Somatic mutations
- Germline genetic variation
- Enhancer
- Accessibility
- Protein abundance
- Metabolite concentrations
- …
+ Histone modifications + RNA processing/stability + 3D conformation + Microbiome composition + …

3 The Cancer Genome Atlas (TCGA)
- Comprehensive, multi-dimensional maps of key genomic changes in 33 cancer types from 11k+ individuals
  ◦ RNA-seq, miRNA-seq, copy number alterations, methylation, somatic mutations, protein abundance, genotypes, histological data, clinical data
- Publicly available data (multi-tiered data depending on patient identifiability)
- Widely used by the research community (1000+ publications by the TCGA network + independent researchers)
Image: Corces et al. (2018)

4 Central nervous system (CNS) injury
- No regenerative response = disability
- Robust regenerative response = functional recovery
Study the gene regulatory network involved in CNS rewiring during optic nerve regeneration in zebrafish
Gene expression + chromatin accessibility (RNA-seq + ATAC-seq)
Dhara et al. (2019) Scientific Reports; Rau et al. (2019) G3

5 Some challenges of multi-omic data analysis
- Many more biological entities than individuals (p >> n)
- Experimental design
- Normalization / standardization / pre-processing, potentially heterogeneous quality across datasets, substantial batch effects
- Missing or incomplete data (e.g., MI-MFA¹)
- Look-everywhere effect
https://bioinformatics.mdanderson.org/BatchEffectsViewer/
MultiAssayExperiment: coordinated representation + storage + analysis of multi-omics data²
¹ Voillet et al. (2016); ² Ramos et al. (2017), https://bioconductor.org/packages/MultiAssayExperiment/

6 How do we integrate multi-omic data?
Multi-{domain, way, view, modal, table, omics} data
- Horizontal versus vertical integration
- Account for (known/unknown) interdependencies within and across data types
- (Partially) matched omics data across samples or biological entities (e.g., genes)
- In some contexts, limited/incomplete a priori knowledge of relevant phenotype groups for comparisons = unsupervised analysis
Multi-omic data → multivariate, multi-table methods
What question are we specifically addressing? How can we use multi-omic data to answer that question?
Image: Rajasundaram and Selbig (2016)

7 Broad umbrella of integrative data analysis
Many different answers, depending on the question…
Exploration / description
• Find underlying relationships between datasets
• Clustering, unsupervised classification
Prediction
• Identify small set of features (i.e., biomarkers) that yields best possible prediction
• Remove noisy or redundant features, curse of dimensionality
• Use set of features to understand the underlying biology
Causality
• Extract mechanistic hypotheses and insights
http://factominer.free.fr, http://mixomics.org/

8 Integrative multi-omics methods: Multivariate analysis
For a given pathway of interest, can we identify and quantify highly aberrant individuals in a sample based on multi-omic data?
Does patient prognosis correlate with large pathway deviation scores?
Which individuals have the most aberrant profiles for pathways of interest?
Which genes / omics drive these aberrant scores?

9 padma: Pathway deviation scores using Multiple Factor Analysis
Define an individualized pathway-level deviation score based on multi-omic data using MFA
[Figure: tables A, B, C (rows = individuals) weighted by 1/λA, 1/λB, 1/λC; individuals plotted on PC 1 vs. PC 2]
http://github.com/andreamrau/padma
Rau et al. (2020) Biostatistics, https://doi.org/10.1101/827022
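The 1/λ weighting shown for tables A, B and C is the standard MFA step in which each table is balanced by the first eigenvalue of its own PCA. The following is a minimal sketch of that weighting step only, in plain Python, and is not the padma implementation: the function names and the toy `rna` and `cna` tables are hypothetical, and each table is assumed to be non-constant.

```python
import math

def center(table):
    """Column-center a table given as a list of rows (individuals)."""
    n = len(table)
    means = [sum(col) / n for col in zip(*table)]
    return [[x - m for x, m in zip(row, means)] for row in table]

def first_eigenvalue(centered, iters=200):
    """Largest eigenvalue of X^T X / n, found by power iteration."""
    n = len(centered)
    cols = list(zip(*centered))
    cov = [[sum(a * b for a, b in zip(ca, cb)) / n for cb in cols] for ca in cols]
    vec = [1.0 + 0.1 * j for j in range(len(cols))]  # arbitrary non-zero start
    lam = 0.0
    for _ in range(iters):
        new = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        lam = math.sqrt(sum(x * x for x in new))
        if lam == 0.0:
            return 0.0
        vec = [x / lam for x in new]
    return lam

def mfa_weight(table):
    """Center a table and divide it by sqrt(lambda_1), which is
    equivalent to weighting its variables by 1 / lambda_1."""
    centered = center(table)
    scale = math.sqrt(first_eigenvalue(centered))
    return [[x / scale for x in row] for row in centered]

# Hypothetical toy tables: 4 individuals, 2 and 3 variables.
rna = [[1.0, 2.0], [2.0, 1.5], [3.0, 4.0], [4.0, 3.5]]
cna = [[100.0, 5.0, 20.0], [120.0, 7.0, 10.0],
       [90.0, 6.0, 30.0], [150.0, 9.0, 25.0]]

weighted_rna = mfa_weight(rna)
weighted_cna = mfa_weight(cna)
# Side-by-side concatenation: the input to the global PCA in MFA.
combined = [a + b for a, b in zip(weighted_rna, weighted_cna)]
```

After weighting, every table has a first eigenvalue of 1, so no single omic dominates the global analysis merely because of its scale or number of variables.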

10 Applying padma to TCGA multi-omics data
Breast invasive carcinoma (BRCA; n = 504) and lung adenocarcinoma (LUAD; n = 144)
• Batch correction performed using removeBatchEffect in limma
• RNA-seq + promoter methylation + copy number alterations + miRNA-seq
• miRNA → gene mapping provided by miRTarBase (exact matches, Functional MTI predictions)
• 1136 MSigDB curated canonical pathways (Biocarta, PID, Reactome, Sigma Aldrich, Signaling Gateway, Signal Transduction Knowledge Environment, Matrisome Project)
Patient prognosis measured using progression-free interval survival times (LUAD) and histological grade (BRCA)
Rau et al. (2020) Biostatistics

11 Which individuals have the most highly aberrant multi-omic profiles?
D4-GDP dissociation inhibitor signaling pathway, LUAD (Cox PH*, BH padj = 0.0111)
Rau et al. (2020) Biostatistics
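The "BH padj" reported here is a Benjamini-Hochberg adjusted p-value, which is needed because one test is run per pathway. As a minimal sketch of how such an adjustment is computed, the example below uses hypothetical raw p-values that are not taken from the study.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values, e.g. one per tested pathway.
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = bh_adjust(raw)
```

Each raw p-value is multiplied by m / rank, and a running minimum taken from the largest p-value downward keeps the adjusted values in the same order as the raw ones.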

12 Which genes/omics drive large pathway deviation scores?
→ CASP1, CASP3, and CASP8 all have high gene-level deviation scores for the two most extreme individuals…
Rau et al. (2020) Biostatistics

13 Which genes/omics drive large pathway deviation scores?
Rau et al. (2020) Biostatistics

14 padma results on TCGA breast and lung cancer (RNA-seq + miRNA-seq + methylation + CNA data, MSigDB canonical pathways)
• Larger padma deviation scores = increasingly aberrant pathway variation, associated with significantly worse prognosis (survival, histological grade) in breast and lung cancer
• Potential outlier detection tool
Innovative use of existing MFA method to calculate and graphically explore individualized multi-omic pathway deviation scores
Next steps…
• Incorporation of known hierarchical structure among genes in pathway
• Extensions for highly structured data (e.g., multi-omic data from divergent chicken lines subject to feed/heat stress or maize diversity panels under control/cold conditions)
Rau et al. (2020) Biostatistics

15 Integrative multi-omics methods: Clustering
Clustering individuals based on single-omics (especially gene expression) data is widely used to identify molecular subtypes of cancer
• PAM50, AIMS intrinsic subtypes
• Many methods have been developed
Recently, many integrative clustering methods have been proposed to make use of multi-omic data
• Rich literature in machine learning on multi-view methods
• Multi-omic specific methods: MVDA, iCluster+, MOFA, …
• Primarily de novo clustering from multi-omics data
How can an existing clustering be merged or split based on multi-omics data? e.g., subdivide intrinsic subtypes into distinct sub-groups of individuals

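16 maskmeans: Multi-view aggregation/splitting K-means
$X = (X^{(1)}, \ldots, X^{(v)}, \ldots, X^{(V)})$, where each view $X^{(v)}$ is scaled to unit variance and additionally divided by the size of its view: $X^{(v)} = X^{(v)}_{\text{scaled}} / p_v$
Aggregation/splitting of an initial clustering of the n individuals based on the minimization of a criterion similar to the multi-view fuzzy K-means algorithm*, with tuning parameters $\gamma, \delta > 1$:
$$\sum_{v=1}^{V} \sum_{k=1}^{K} \sum_{i=1}^{n} (\pi_{i,k})^{\delta} \, (\alpha_{k,v})^{\gamma} \, \left\| x_i^{(v)} - \mu_k^{(v)} \right\|^2$$
where $\Pi = (\pi_{i,k})$ is the clustering partition, $\mu_k^{(v)}$ are the per-view cluster centers, and $\alpha = (\alpha_{k,v})$ are the per-cluster, per-view weights
* Wang and Chen (2017); Godichon-Baggioni et al. (2020) AOAS; http://github.com/andreamrau/maskmeans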
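The view scaling and the criterion can be illustrated with a small sketch in plain Python. It is not the maskmeans code, and it rests on several assumptions: the criterion is taken to be a sum over views, clusters and individuals of (membership)^delta × (weight)^gamma × squared distance to the per-view cluster center; "scaled to unit variance" is taken to include centering; columns are non-constant; and all names and toy values are hypothetical.

```python
import math

def scale_view(view):
    """Center each column, scale it to unit sample variance,
    then divide by the view size p (number of columns)."""
    n, p = len(view), len(view[0])
    scaled_cols = []
    for col in zip(*view):
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        scaled_cols.append([(x - mean) / (sd * p) for x in col])
    return [list(row) for row in zip(*scaled_cols)]

def criterion(views, pi, centers, weights, gamma=2.0, delta=2.0):
    """views[v][i]: vector of individual i in view v
    pi[i][k]: membership of individual i in cluster k
    centers[v][k]: center of cluster k in view v
    weights[k][v]: weight of view v for cluster k"""
    total = 0.0
    for v, view in enumerate(views):
        for k, center in enumerate(centers[v]):
            for i, x in enumerate(view):
                dist2 = sum((a - b) ** 2 for a, b in zip(x, center))
                total += (pi[i][k] ** delta) * (weights[k][v] ** gamma) * dist2
    return total

# Hypothetical toy example: 2 individuals, 2 views, 2 clusters.
views = [[[0.0], [2.0]], [[1.0, 1.0], [1.0, 3.0]]]
pi = [[1.0, 0.0], [0.0, 1.0]]  # hard partition
centers = [[[0.0], [1.0]], [[1.0, 1.0], [1.0, 1.0]]]
weights = [[0.5, 0.5], [0.5, 0.5]]
value = criterion(views, pi, centers, weights)

scaled = scale_view([[1.0, 10.0], [3.0, 30.0]])
```

Dividing each view by its size keeps a view with many variables from dominating the squared distances simply because it contributes more terms.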

17 Multi-view splitting K-means algorithm
Given a (hard or fuzzy) clustering matrix $\Pi = (\pi_{i,k})$ with K clusters, at each step:
• Identify the cluster $\hat{k}$ that minimizes our criterion
• Split this cluster in two, $\tilde{C}_{\hat{k}_1}$ and $\tilde{C}_{\hat{k}_2}$, such that the criterion is minimized, under the constraint that $\tilde{\pi}_{i,\hat{k}_1} + \tilde{\pi}_{i,\hat{k}_2} = \pi_{i,\hat{k}}$ for all $i$
• Update per-view cluster centers
• Update weight matrix $\alpha = (\alpha_{k,v})$ for this split for all $v = 1, \ldots, V$ and $k = 1, \ldots, K + 1$
[Figure: example of successive cluster splits]
Godichon-Baggioni et al. (2020) AOAS
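One splitting step can be sketched as follows for the simplest case: a hard partition with equal, fixed weights, in which the criterion reduces (up to a constant) to the within-cluster sum of squares on the concatenated, scaled views. This is a simplification and not the maskmeans algorithm itself. The weight update is omitted, "the cluster that minimizes our criterion" is interpreted here as the cluster whose best two-way split gives the lowest total criterion, and all names and data are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def within_ss(points):
    """Sum of squared distances of points to their centroid."""
    if not points:
        return 0.0
    mu = centroid(points)
    return sum(sq_dist(p, mu) for p in points)

def two_means(points, iters=25):
    """Deterministic 2-means; returns a 0/1 label per point."""
    c0 = points[0]
    c1 = max(points, key=lambda p: sq_dist(p, c0))  # farthest point
    halves = [0] * len(points)
    for _ in range(iters):
        halves = [0 if sq_dist(p, c0) <= sq_dist(p, c1) else 1 for p in points]
        g0 = [p for p, h in zip(points, halves) if h == 0]
        g1 = [p for p, h in zip(points, halves) if h == 1]
        if not g0 or not g1:
            break
        c0, c1 = centroid(g0), centroid(g1)
    return halves

def split_step(data, labels):
    """Split the one cluster whose two-way split minimizes the total
    within-cluster sum of squares; other clusters are left unchanged."""
    clusters = sorted(set(labels))
    base = {k: within_ss([x for x, l in zip(data, labels) if l == k])
            for k in clusters}
    best = None
    for k in clusters:
        idx = [i for i, l in enumerate(labels) if l == k]
        if len(idx) < 2:
            continue
        halves = two_means([data[i] for i in idx])
        g0 = [data[i] for i, h in zip(idx, halves) if h == 0]
        g1 = [data[i] for i, h in zip(idx, halves) if h == 1]
        total = sum(base.values()) - base[k] + within_ss(g0) + within_ss(g1)
        if best is None or total < best[0]:
            best = (total, idx, halves)
    if best is None:
        return list(labels)
    new_labels = list(labels)
    for i, h in zip(best[1], best[2]):
        if h == 1:
            new_labels[i] = max(clusters) + 1  # label of the new cluster
    return new_labels

# Hypothetical data: cluster 1 mixes two well-separated groups.
data = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0], [9.2, 0.1]]
labels = [0, 0, 1, 1, 1, 1]
new_labels = split_step(data, labels)
```

With a hard partition, the constraint on memberships simply means that each member of the split cluster is assigned to exactly one of the two new clusters, while individuals in the other clusters keep their labels.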

18 maskmeans for TCGA breast cancer
n = 506 patients; focus on subset of 226 genes (TP53, MKI67, estrogen signaling and ErbB signaling pathways, and the SAM40 DNA methylation signature) and 149 miRNAs with avg normalized expression > 50
[Figure: clusters of size n = 61, 38, 228, 136, and 43; annotations: age at diagnosis + menopause status, number of lymph nodes]
Godichon-Baggioni et al. (2020) AOAS

19 [Figure labels: Biology, Statistics]
Visualize DE genes?
Cluster expression profiles?
Plot clusters in temporal order and output gene lists?
Coordinates for open chromatin proximal to cluster 2?
GO enrichment of genes proximal to accessible chromatin at t=2?

20 Interactivity in (multi-omic) data analysis
• Immediate feedback on how data/figures/results change when inputs are modified; the user becomes an active participant in the analysis
• Recent advances in R make interactive visualizations (plotly) and web applications (Shiny) more readily available
• Shiny apps allow R scripts to be rerun based on user inputs without the user having to run R code directly
• Can be shared locally or hosted on the web (Shinyapps.io or using a Shiny server)

21 Regeneration Rosetta Shiny app
{RNA-seq + ATAC-seq} x 5 time pts
* Expanded functionality beyond original study: dozens of supported organisms, deep investigation of regeneration-associated expression and chromatin accessibility
http://ls-shiny-prod.uwm.edu/rosetta
Dhara et al. (2019) Scientific Reports, doi: 10.1038/s41598-019-50485-6
Rau et al. (2019) G3: Genes|Genomes|Genetics, doi: 10.1534/g3.119.400729

22 Some final remarks on multi-omics
• Multi-omic data integration often requires a combination of software tools + technical expertise + domain expertise… and answering questions that we have not yet thought to ask¹
• Utility of tools for rapid querying + (interactive) exploration of fully processed data without advanced coding knowledge
• Reproducibility
• Communication + vocabulary is key! Matrix factorization? Decomposition? Latent factor model? ...
• Emergence of single-cell multi-omics data² and time-course multi-omics data
¹ Stein-O'Brien et al. (2018) Trends in Genetics
² Mathematical frameworks for integrative analysis of emerging biological data types: https://www.birs.ca/events/2020/5-day-workshops/20w5197

23 Acknowledgements
https://andrea-rau.com, @andreamrau
Slides: https://tinyurl.com/JOBIM2020-Rau
