Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Mixture models as a useful tool for identifying co-expressed genes from RNA-seq data with coseq

Dc971cfc929cb925baf3d41f48e25fa5?s=47 Andrea Rau
April 08, 2021

Mixture models as a useful tool for identifying co-expressed genes from RNA-seq data with coseq

Presentation at MiMo workshop on mixture models (April 8, 2021)
https://mimo2021.sciencesconf.org

Abstract: Complex studies of transcriptome dynamics are now routinely carried out using RNA sequencing (RNA-seq). A common goal in such studies is to identify groups of co-expressed genes that share similar expression profiles across several treatment conditions, time points, or tissues. These co-expression analyses serve both as an exploratory visualization tool as well as a hypothesis-generating tool for poorly annotated genes. In this talk, I will discuss some of the mixture-model based approaches we have developed in recent years for RNA-seq co-expression analysis, with a particular focus on the practical issues surrounding the use of such approaches and their implementation within the R/Bioconductor package ecosystem. Finally, as studies with matched multi-view data (i.e., at different biological levels of molecular information) are becoming increasingly common, I will also briefly discuss our recent work to integratively use multiple data views to aggregate or split existing clusters from multi-omics data.

Dc971cfc929cb925baf3d41f48e25fa5?s=128

Andrea Rau

April 08, 2021
Tweet

Transcript

  1. Mixture models as a useful tool for identifying co-expressed genes

    from RNA-seq data with coseq Andrea Rau @andreamrau April 8, 2021 @ MiMo workshop on mixture models (virtual) https://tinyurl.com/MiMo2021-Rau
  2. RNA-seq gene expression data (Pseudo)counts affected by amount of transcription

    + gene length + technical biases (library size, GC content) 1. Experimental design 2. Experiment: library preparation, manufacturer protocols, ... 3. Bioinformatics: QC, trimming, alignment, quantification 4. Analysis andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 2 / 20
  3. "Standard" RNA-seq multi-factor analysis pipeline 1. Exploratory analysis 2. Differential

    analysis Wald test with contrasts Likelihood ratio test to compare a full model to a reduced one 3. Clustering / co-expression analysis 4. Functional enrichment analysis to annotate/interpret results coseq DAVID, topGO, enrichplot, ShinyGO, … 1 2 3 4 DESeq2, edgeR, … andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 3 / 20
  4. "Standard" RNA-seq multi-factor analysis pipeline 1. Exploratory analysis 2. Differential

    analysis Wald test with contrasts Likelihood ratio test to compare a full model to a reduced one 3. Clustering / co-expression analysis 4. Functional enrichment analysis to annotate/interpret results coseq DAVID, topGO, enrichplot, ShinyGO, … 1 2 3 4 DESeq2, edgeR, … andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 3 / 20
  5. How should co-expression be defined for RNA-seq? : Cluster relative

    expression independent of overall expression 0 500 1000 1500 2000 2500 0 5 10 15 Sample Normalized expression A 0 2 4 6 8 0 5 10 15 Sample Log(normalized expression + 1) B 0.00 0.05 0.10 0.15 0.20 0.25 0 5 10 15 Sample Normalized expression profiles Group 1 2 3 4 C If yij is normalized expression of gene i in sample j, normalized profile for gene i is: pij = yij +1 yi +1 andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 4 / 20
  6. How should co-expression be defined for RNA-seq? : Cluster relative

    expression independent of overall expression 0 500 1000 1500 2000 2500 0 5 10 15 Sample Normalized expression A 0 2 4 6 8 0 5 10 15 Sample Log(normalized expression + 1) B 0.00 0.05 0.10 0.15 0.20 0.25 0 5 10 15 Sample Normalized expression profiles Group 1 2 3 4 C If yij is normalized expression of gene i in sample j, normalized profile for gene i is: pij = yij +1 yi +1 Normalized profiles pij are compositional, i.e. linear dependence of pi· = 1 imposes constraints that can be problematic. . . andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 4 / 20
  7. Mixture model strategies for co-expression With G. Celeux, M.-L. Martin-Magniette,

    C. Maugis-Rabusseau, A. Godichon-Baggioni andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 5 / 20
  8. Mixture model strategies for co-expression With G. Celeux, M.-L. Martin-Magniette,

    C. Maugis-Rabusseau, A. Godichon-Baggioni andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 5 / 20
  9. Mixture model strategies for co-expression With G. Celeux, M.-L. Martin-Magniette,

    C. Maugis-Rabusseau, A. Godichon-Baggioni andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 5 / 20
  10. Mixture model strategies for co-expression With G. Celeux, M.-L. Martin-Magniette,

    C. Maugis-Rabusseau, A. Godichon-Baggioni andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 5 / 20
  11. Mixture model strategies for co-expression With G. Celeux, M.-L. Martin-Magniette,

    C. Maugis-Rabusseau, A. Godichon-Baggioni andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 5 / 20
  12. Mixture model strategies for co-expression With G. Celeux, M.-L. Martin-Magniette,

    C. Maugis-Rabusseau, A. Godichon-Baggioni andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 5 / 20
  13. Mixture model strategies for co-expression With G. Celeux, M.-L. Martin-Magniette,

    C. Maugis-Rabusseau, A. Godichon-Baggioni andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 5 / 20
  14. Mixture model strategies for co-expression With G. Celeux, M.-L. Martin-Magniette,

    C. Maugis-Rabusseau, A. Godichon-Baggioni andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 5 / 20
  15. coseq R/Bioconductor package for RNA-seq co-expression BiocManager::install("coseq") library(coseq) ## S4

    method for signature matrix , data.frame , ## and DESeqDataSet coseq(object, K, , ...) andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 6 / 20
  16. coseq R/Bioconductor package for RNA-seq co-expression BiocManager::install("coseq") library(coseq) ## S4

    method for signature matrix , data.frame , ## and DESeqDataSet coseq(object, K, , ...) Several options: Clustering model (kmeans or Normal) and transformation (logclr, clr, . . . ) Type of normalization for RNA-seq data (TMM, DESeq, . . . ) Parallel computation using BiocParallel Output: S4 object of class coseqResults, with corresponding methods (plot, summary, show, clusters, likelihood, . . . ) andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 6 / 20
  17. coseq for clustering RNA-seq data data(fietz) counts <- exprs(fietz) ##

    RNA-seq from mice conds <- pData(fietz) ## 3 neocortex regions x 5 reps library(DESeq2) deseq <- DESeqDataSetFromMatrix(counts,conds,~tissue) deseq <- DESeq(deseq) coexp <- coseq(deseq, K=2:15, alpha=0.01) ## **************************************** ## Co-expression analysis on DESeq2 output: ## 6461 DE genes at p-adj < 0.01 ## **************************************** ## coseq analysis: kmeans approach & logclr ## K = 2 to 15 ## **************************************** ## Running K = 2 ... ## Running K = 3 ... andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 7 / 20
  18. Choice of number of clusters coexp An object of class

    coseqResults 6461 features by 15 samples. Models fit: K = 2 ... 15 Chosen clustering model: K = 10 library(capushe) plot(metadata(coexp)$capushe)) # ICL diagnostic graphic for GMM: plot(coexp, graphs="ICL") andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 8 / 20
  19. Clustering diagnostics with coseq plot(coexp, graphs="probapost_boxplots") 0.00 0.25 0.50 0.75

    1.00 2 6 9 3 7 5 1 4 8 10 Cluster Max conditional probability Genes attributed to cluster with largest conditional probability of membership given the estimated parameters: τik(θ) = πkfk(yi| θk) K =1 π f (yi| θ ) For K-means, associate clustering partition with spherical GMM andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 9 / 20
  20. Cluster visualization with coseq # plot(coexp) p <- plot(coexp, conds=conds$tissue,

    graphs="boxplots", collapse_reps="average")$boxplot 9 10 5 6 7 8 1 2 3 4 CP SVZ VZ CP SVZ VZ CP SVZ VZ CP SVZ VZ 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Conditions Average expression profiles Conditions CP SVZ VZ andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 10 / 20
  21. Cluster visualization with coseq library(ggplot2) p + ggtitle("RNA-seq profiles") +

    theme_bw() 9 10 5 6 7 8 1 2 3 4 CP SVZ VZ CP SVZ VZ CP SVZ VZ CP SVZ VZ 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Conditions Average expression profiles Conditions CP SVZ VZ RNA−seq profiles andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 11 / 20
  22. Some thoughts about software usability andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau

    12 / 20
  23. Some thoughts about software usability Fully worked examples in package

    vignette and extensive documentation ⇒ workflow prototype Inputs and outputs: Accept a range of input types (S4 classes and methods) Structure outputs according to the next step in analysis pipeline Incorporate feedback from users ⇒ Bioconductor forums, emails, ... Reasonable parameter default values (users almost always just use defaults!) Simplified package and function names, consistent naming conventions Exploit ggplot2 functionality for fully customizable plots andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 13 / 20
  24. Highlighting some recent applications using coseq Tomato co-expression (Sauvage et

    al., 2017) Decipher domestication footprints in wild + crop tomato Clusters enriched in relevant functional terms (carbohydrate metabolism, epigenetic regulation of expression) Results suggestive of rewiring of gene co-expression induced by domestication Varroa life cycle co-expression (Mondet et al., 2018) Full life-cycle transcriptomic catalog of a honeybee parasite Many clusters characterized by stage-specific expression (e.g. pre- and post-reproductive stages) Energetic and chitin metabolic process → transcriptional regulation → neurotransmitter/neuropeptide receptors involved in behavioural regulation = Adaptation of parasite to host andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 14 / 20
  25. From mono-omics to multi-omics... 2 Gene expression TTTGCA AAACGT TF

    Transcription factor expression Copy number alterations The gene regulatory landscape and multi-omics data Promoter methylation microRNA expression GCGTTC GCGTT CGTTC Somatic mutations Germline genetic variation Enhancer Accessibility Protein abundance Metabolite concentrations ⇒ Most integrative clustering approaches focus on classifying samples (rather than genes) andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 15 / 20
  26. maskmeans: Multi-view aggregation/splitting K-means With C. Maugis-Rabusseau, A. Godichon-Baggioni (Standardized)

    multi-vew data Z = Z(1), . . . , Z(v), . . . , Z(V ) , normalized by view size: X(v) = Z(v)/dv Aggregation/splitting of initial K clusters by minimizing: Crit (ΠK , α, µ) = K k=1 n i=1 V v=1 (αk,v )γ(πi,k)δ X(v) i − µ(v) k 2 , where γ, δ ≥ 1 and v αk,v = k πi,k = 1 for all k = 1, . . . , K andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 16 / 20
  27. maskmeans: Multi-view aggregation/splitting K-means With C. Maugis-Rabusseau, A. Godichon-Baggioni (Standardized)

    multi-vew data Z = Z(1), . . . , Z(v), . . . , Z(V ) , normalized by view size: X(v) = Z(v)/dv Aggregation/splitting of initial K clusters by minimizing: Crit (ΠK , α, µ) = K k=1 n i=1 V v=1 (αk,v )γ(πi,k)δ X(v) i − µ(v) k 2 , where γ, δ ≥ 1 and v αk,v = k πi,k = 1 for all k = 1, . . . , K Starting with initial membership probabilities Π(0) = (πi,k) for K0 1. Multi-view aggregation: aggregate clusters while minimizing Crit, where δ = 1 and αk,v = αv for all k 2. Multi-view splitting: split clusters while minimizing Crit ⇒ Special cases: hard vs fuzzy clustering, per-cluster weights or not andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 16 / 20
  28. Multi-view K-means splitting for TCGA multi-omics data TCGA multi-omics breast

    cancer data in n= 506 patients (226 genes + 149 miRNAs of interest) Multi-view K-means splitting to refine previously inferred BRCA subtypes (Basal, Her2+, Luminal A, Luminal B, normal) of individuals to reveal clinically interesting subgroupings n = 61 n = 38 n = 228 n = 136 n = 43 maskmeans for TCGA breast cancer n = 506 patients; focus on subset of 226 genes (TP53, MKI67, estrogen signaling and ErbB signaling pathways, and the SAM40 DNA methylation signature) and 149 miRNAs with avg normalized expression > 50 Age at diagnosis + menopause status Number of lymph nodes andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 17 / 20
  29. maskmeans: Multi-view aggregation and splitting K-means devtools::install_github("andreamrau/maskmeans") library(maskmeans) aggreg <-

    maskmeans(mv_data, clustering_init, type="aggregation", ...) split <- maskmeans(mv_data, clustering_init, type="splitting", Kmax, ...) mv_simulate(...) mv_plot(...) maskmeans_plot(...) andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 18 / 20
  30. Potential next steps for (multi-omic) co-expression Ideally, move from co-expression

    → co-regulation Co-regulation is explained by many things (biochemical, genetic, evolutionary, technological factors, ...): Compartimentalization / 3D organization of genome (Hi-C data) Local sharing of regulatory elements (e.g., transcription factors, ChIP-seq data) Housekeeping genes that are constitutively expressed across tissues ... Increasingly large/complex experimental designs (e.g., scRNA-seq data) → variable selection procedures andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 19 / 20
  31. In progress: hierarchical Bayesian mixture models for genomic prediction PhD

    work of F. Mollandin (H2020 GENE-SWitCH) y = µ1n + Xβ + e andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 20 / 20
  32. In progress: hierarchical Bayesian mixture models for genomic prediction PhD

    work of F. Mollandin (H2020 GENE-SWitCH) y = µ1n + Xβ + e BayesR : βi ∼ π1 δ(0) null +π2 N(0, 0.0001σ2 g ) small +π3 N(0, 0.001σ2 g ) medium +π4 N(0, 0.01σ2 g ) large andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 20 / 20
  33. In progress: hierarchical Bayesian mixture models for genomic prediction PhD

    work of F. Mollandin (H2020 GENE-SWitCH) y = µ1n + Xβ + e BayesR : βi ∼ π1 δ(0) null +π2 N(0, 0.0001σ2 g ) small +π3 N(0, 0.001σ2 g ) medium +π4 N(0, 0.01σ2 g ) large BayesRC+ with overlapping functional multi-omics categories: βi |a ∈ A(i) ∼ a∈A(i) 4 k=1 πk,aN(0, σ2 k ) andrea.rau@inrae.fr • @andreamrau • https://tinyurl.com/MiMo2021-Rau 20 / 20
  34. vignette("coseq") MixStatSeq ANR-JCJC grant (C. Maugis-Rabusseau) http://bioconductor.org/packages/coseq Rau, A., et

    al. (2015) Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics. Rau and Maugis-Rabusseau (2018) Transformation and model choice for RNA-seq co-expression. Briefings in Bioinformatics. Godichon-Baggioni et al. (2018) Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data. Journal of Applied Statistics.