HDR defense presentation

Statistical methods and software for the analysis of transcriptomic data
Andrea Rau HDR defense September 26, 2017 Universit´ e d’´ Evry-Val-d’Essonne

2005 B.A. Mathematics & French 2007 M.S. Applied Statistics 2010
Ph.D. Statistics Prof. R. W. Doerge, Dr. F. Jaﬀr´ ezic, Dr. J.-L. Foulley Reverse engineering gene networks using genomic time-course data 2010-2011 Post-doctoral fellowship Dr. G. Celeux, Dr. M.-L. Martin-Magniette Methods for RNA-seq co-expression analysis 2011-today Charg´ ee de Recherche Animal Genetics & Integrative Biology (GABI) unit, Jouy en Josas

Supervision and teaching Student (co-)supervision: 2 Ph.D. students and 1
3-month Erasmus+ Ph.D. mobility M´ elina Gallopin (defended December 2015), Gilles Monneret (expected December 2017), Manuel Revilla S´ anchez (defended June 2017) 7 Master’s students Rapha¨ elle Momal-Leisenring (2017), Fr´ ed´ eric Jehl (2017), Babacar Ciss (2016), Audrey Hulot (2015), Meriem Benabbas (2015), M´ elina Gallopin (2012), R´ emi Bancal (2012) Teaching: M1/M2 biostatistics courses at Universit´ e Paris-Sud (2011-2012), Ensai (2012-2017), UEVE (2016-2017) Researcher training schools: INRA-PEPI school (2012), RGB-Net COST action (2012), Agrocampus Ouest (2016-2017), Saclay Plant Sciences (2016), PiGutNet COST action (2017) Organization of BioBayes school on Bayesian statistics (2013) + co-author of subsequent book 3 / 35

Context: RNA-seq expression data 4 / 35

Contributions 5 / 35

Differential analysis of RNA-seq data Is there a statistically significant
difference in the expression of gene i between experimental conditions? Variable library sizes among experiments + technical biases ⇒ per-sample normalization factors (mj ) needed1 Discrete, non-negative, highly skewed data ⇒ transformation + precision weights or negative binomial models : yij ∼ NB (mj qij , φi ) log(qij ) = Xj βi Differential expression for gene i via hypothesis testing: H0i : cβi = 0 1Statomique consortium (Dillies et al., 2013) 7 / 35

Filtering in diﬀerential analyses Stringent multiple testing correction (e.g., FDR
control) due to large number of hypothesis tests Usually assume P-values are uniformly distributed under H0 p−values Frequency 0.0 0.2 0.4 0.6 0.8 1.0 0 200 400 600 800 1000 1200 Identify and remove genes that generate an uninformative signal Only test hypotheses for genes passing ﬁlter ⇒ tempered correction for multiple testing Filtering thresholds tend to be fairly ad hoc in practice 8 / 35

A data-based filter for RNA-seq data2 ˜ yj = yj
/mj : normalized read counts in sample j from condition C(j) Idea: identify a filtering threshold s that maximizes the filtering similarity among replicates within the same condition using the Jaccard index Js(˜ yj , ˜ yj ) = a a + b + c Sample j Normalized counts > s Normalized counts ≤ s Sample j Normalized counts > s a b Normalized counts ≤ s c d 2Joint work with M. Gallopin, G. Celeux, and F. Jaffr´ ezic (Rau et al., 2013) 9 / 35

A data-based filter for RNA-seq data Multiple replicates/conditions typically available
⇒ global filtering similarity by averaging pairwise Jaccard indices within conditions: Js (˜ y) = mean Js(˜ yj , ˜ yj ) : j < j and C(j) = C(j ) Data-based filter threshold: s = arg max s Js (˜ y) ⇒ Filter genes with normalized read counts ≤ s in all samples 10 / 35

A data-based ﬁlter for RNA-seq data Increased detection power for
moderately/strongly expressed genes 11 / 35

HTSFilter: A data-based ﬁlter for RNA-seq data > library(HTSFilter) >
library(DESeq2) > > dds <- DESeq(ddsObject) > filter <- HTSFilter(dds)$filteredData > res <- results(filter, independentFiltering=FALSE) 12 / 35

Gene co-expression is... Groups of co-transcribed genes Similar expression patterns/trends
over a range of experiments (correlation, partial correlation, clustering) Related to shared regulatory inputs, functional pathways, and biological process(es) characterize orphan genes Image sources: Google search “co-expression analysis” 14 / 35

How should co-expression be deﬁned for RNA-seq? : Cluster measure
of relative expression independent of overall expression 0 500 1000 1500 2000 2500 0 5 10 15 Sample Normalized expression A 0 2 4 6 8 0 5 10 15 Sample Log(normalized expression + 1) B 0.00 0.05 0.10 0.15 0.20 0.25 0 5 10 15 Sample Normalized expression profiles Group 1 2 3 4 C 15 / 35

How should co-expression be deﬁned for RNA-seq? : Cluster measure
of relative expression independent of overall expression 0 500 1000 1500 2000 2500 0 5 10 15 Sample Normalized expression A 0 2 4 6 8 0 5 10 15 Sample Log(normalized expression + 1) B 0.00 0.05 0.10 0.15 0.20 0.25 0 5 10 15 Sample Normalized expression profiles Group 1 2 3 4 C Normalized proﬁle for gene i: pij = ˜ yij +1 ˜ yi +1 15 / 35

Some strategies for mixture models for RNA-seq data3 f (·|K,
ΨK ) = n i=1 K k=1 πkfk(·|θk) 1 Directly model read counts: HTSCluster yi |gene i ∈ cluster k ∼ J j=1 Poisson(yij |µijk) 2 Apply appropriately chosen transformation: coseq g(yi )|gene i ∈ cluster k ∼ N(µk, Σk) 3 Model the compositional nature of proﬁles: coseq g(pi )|gene i ∈ cluster k ∼ N(µk, Σk) 3 Joint work with C. Maugis-Rabusseau, M.-L. Martin-Magniette, G. Celeux, A. Godichon-Baggioni. 16 / 35

(Per-cluster) covariance structures in RNA-seq 17 / 35

Transformations for normalized proﬁles Normalized proﬁles pij are compositional data,
i.e. linear dependence of pi· = 1 ⇒ pi imposes constraints on covariance matrices Σk that are problematic for Gaussian mixture models 18 / 35

Transformations for normalized profiles Normalized profiles pij are compositional data,
i.e. linear dependence of pi· = 1 ⇒ pi imposes constraints on covariance matrices Σk that are problematic for Gaussian mixture models : Transform normalized profiles to break the linear sum constraint garcsin(pij ) = arcsin √ pij ∈ [0, π/2], glogit(pij ) = log pij 1 − pij ∈ (−∞, ∞) ... and fit a Gaussian mixture model to transformed normalized profiles (for simplicity we’ll only consider the [pkLkCk] model: variable proportions, volume, orientation, and shape) 18 / 35

Choosing a transformation with model selection criteria g(x) = arbitrary
monotonic transformation of a data x If g(x) ∼ f (.|ΨK ) (a GMM) ⇒ initial data x is an i.i.d. sample from fg (.|ΨK ) (obviously not necessarily a GMM) 19 / 35

Choosing a transformation with model selection criteria g(x) = arbitrary
monotonic transformation of a data x If g(x) ∼ f (.|ΨK ) (a GMM) ⇒ initial data x is an i.i.d. sample from fg (.|ΨK ) (obviously not necessarily a GMM) : select the pair (K, g) minimizing the corrected criteria: BIC∗(K, g) = − log f (y; K, ˆ Ψ(K,g) ) − νK 2 log(n) − log (det(Jg )) ICL∗(K, g) = BIC∗(K, g) + Entropy(K) where Jg = Jacobian of the transformation g and ˆ Ψ(K,g) is the MLE for the model with K clusters and transformation g Note:# of parameters νK does not depend on g 19 / 35

Choosing between arcsin and logit with ICL First calculate log-determinant
of each transformation: log (det(Jarcsin)) = −nq ln(2) − 1 2 n i=1 q j=1 log (pij (1 − pij )) , log (det(Jlogit)) = −nq ln (ln(2)) − n i=1 q j=1 log (pij (1 − pij )) Then choose the transformation corresponding to the minimum of ICL∗(K, arcsin) and ICL∗(K, logit) 20 / 35

coseq: Co-expression analysis of RNA-seq data > library(coseq) > coexp_arcsin
<- coseq(counts, K=2:10, model="Normal", > transformation="arcsin") > coexp_logit <- coseq(counts, K=2:10, model="Normal", > transformation="logit") Options: parallelization provided via BiocParallel, ﬁltering, library size normalization, transformation choice (arcsin, logit, CLR, logCLR), approach (Gaussian/Poisson mixture models, K-means) Seamless integration with edgeR/DESeq2 pipeline 21 / 35

coseq: Co-expression analysis of RNA-seq data > compareICL(list(coexp_arcsin, coexp_logit)) 22
/ 35

coseq: Co-expression analysis of RNA-seq data > plot(coexp_arcsin, ...) 5
9 1 3 12 4 6 2 7 8 10 11 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 CP SVZ VZ CP SVZ VZ CP SVZ VZ CP SVZ VZ Conditions Average expression profiles Conditions CP SVZ VZ 23 / 35

coseq: Co-expression analysis of RNA-seq data > plot(coexp_arcsin, ...) Genes
attributed to cluster with largest conditional probability of membership given the estimated parameters: τik(θ) = πkfk(yi |θk) K =1 π f (yi |θ ) 23 / 35

Functional annotation for co-expression analysis Odds Exp. K GO ID
p-value Ratio Count Count Size Term 1 GO:0006396 0.0010 4.47 2.10 8 360 RNA processing 6 GO:0016203 0.0006 20.67 0.17 3 30 muscle attachment 11 GO:0019058 0.0008 124.10 0.05 2 3 viral life cycle [...] Integrate functional annotations into the model selection step4 ⇒ Integrated Completed Annotated Likelihood (ICAL) criterion: ICAL(K) = ICL(K) − G g=1 K k=1 ng k log ng k ng ng = # genes annotated for function g, ng k = # genes assigned to cluster k annotated for g 4Joint work with M. Gallopin, G. Celeux, and F. Jaﬀr´ ezic. 24 / 35

Markov equivalence in Directed Acyclic Graphs Graphs often used to
represent gene networks Markov equivalence: two diﬀerent networks can yield same joint distribution and observational data alone generally cannot orient edges 26 / 35

Markov equivalence in Directed Acyclic Graphs Graphs often used to
represent gene networks Markov equivalence: two diﬀerent networks can yield same joint distribution and observational data alone generally cannot orient edges Interventions can resolve this ambiguity 26 / 35

Gene expression interventions Gene knock-outs (KO): ⇒ Gene is made
entirely inoperative Crossbreeding chimeric and wild-type individuals Naturally occuring point mutations causing functional KO CRISPR-Cas9 genome editing Gene knock-downs (KD): ⇒ Gene’s expression is reduced RNA interference (RNAi) Source: https://algae.ucsd.edu/research/cab-comm/genetic-tools.html 27 / 35

Gene expression interventions Mathematically, interventions are a do(Xi = xi
) operation Do-calculus (Pearl, 2000): E(Xj |do(Xi = xi )) = E(Xj ) if Xj ∈ pa(Xi ) E(Xj |xi , pa(Xi ))P(pa(Xi ))dpa(Xi ) if Xj / ∈ pa(Xi ) Total causal eﬀects: βij = ∂ ∂x E(Xj |do(Xi = xi )) Equal to 0 if Xi is not an ancestor of Xj 28 / 35

Joint inference of steady-state + intervention expression Xj = expression
of gene j Gaussian Bayesian network: Xj = mj + i∈pa(j) wij Xi + εj with εj ∼ N(0, σ2 j ) for j = 1, . . . , p wij = 0 if and only if i ∈ pa(j) Directed acyclic graph (DAG), and nodes have been ordered so that i ∈ pa(j) ⇒ i < j (i.e., W = (wij ) is upper triangular) Model parameters are θ = (W, m, σ) Total causal eﬀects are β = (I − W)−1 = I + W + . . . + W p−1 29 / 35

Joint inference of steady-state + intervention expression MCMC-Mallows GBN model5
Log-likelihood of model for arbitrary combination of knock-outs (single, partially complete, multiple, ...) Explore the posterior distribution of node order and estimated causal eﬀects via an empirical MCMC algorithm Node ordering proposal via Mallows model Estimation of model parameters for given ordering via likelihood calculations Some extensions: ridge penalty for high(er) dimensional networks, marginal causal estimation for single KO 5Joint work with F. Jaﬀr´ ezic, G. Nuel, G. Monneret 30 / 35

MCMC-Mallows GBN model Simulations for several KO settings suggest that
our approach makes better use of partially available information than competiting methods Multiple KOs contribute additional valuable information 31 / 35

Current/future work: integrative multi-omic analysis Multi-omics: Gene expression + methylation
+ copy number alterations + chromatin accessibility + chromatin conformation + ... Source: Rajasundaram and Selbig (2016) 32 / 35

Current/future work: integrative multi-omic analysis Exploring molecular drivers of gene
expression in cancer genomes6 Comprehensive/coordinated eﬀort to improve the molecular understanding of major cancers through high-throughput genomics What are the relative contributions of other molecular data (methylation, copy number alterations, somatic mutations, transcription factor expression, miRNA abundance, genotypes) to the variation of gene expression within a cancer? Are there shared patterns across cancers? How can interesting patterns of molecular drivers be easily explored? R/Shiny interactive web application 6Joint work with P. L. Auer. 33 / 35

Current/future work: integrative multi-omic analysis Joint modeling of chromatin accessibility
and gene expression8 Fr-AgENCODE: French pilot project part of FAANG consortium with aim to improve functional annotation for livestock7 4 species / 3 tissues (CD4, CD8, liver) Multiple high-throughput assays: RNA-seq, ATAC-seq, Hi-C Goal: study multi-tissue/multi-species associations between the transcriptional and chromatin landscapes 7Analyses coordinated by E. Giuﬀra, S. Foissac, S. Djebali-Quelen, S. Lagarrigue 8Joint work with R. Momal-Leisenring. 34 / 35

Future work: integrative multi-omic analysis AgreenSkills+ mobility project Integrative analysis
of multi-omics data for improved detection power of functional genetic variants (October 2017-May 2019) Comprehensive data from the NHLBI Trans-Omic Precision Medicine (TOPmed) program ⇒ identify rare coding and functional noncoding variants underlying complex multifactorial diseases 60k individuals (26 studies) proﬁled with WGS, other ’omics to follow Goal: integrate public transcriptomic/epigenomic data (GTEx, ENCODE, Roadmap) to better target functionally relevant variants 35 / 35

Acknowledgements Ph.D. students: M´ elina Gallopin Manuel Revilla Sanchez Gilles
Monneret Master’s students: Rapha¨ elle Momal-Leisenring Fr´ ed´ eric Jehl Babacar Ciss Audrey Hulot Meriem Benabbas R´ emi Bancal Statomique consortium Funding: MixStatSeq ANR-JCJC grant INRA/GA departmental grants (Causality, COSI-net) HDR evaluation committee: David Causeur Anne-Laure Boulesteix Nathalie Villa-Vialaneix St´ ephane Robin Franck Picard Christophe Ambroise Collaborators: Gilles Celeux Marie-Laure Martin-Magniette Cathy Maugis-Rabusseau Antoine Godichon-Baggioni Gr´ egory Nuel Guillemette Marot Christopher Sauvage Sandrine Lagarrigue Paul Livermore Auer

Acknowledgements GABI research unit, and our director Claire Rogel-Gaillard Special
thanks to the PSGen team, especially Florence Jaﬀr´ ezic, Denis Lalo¨ e, Tatiana Zerjal, and the Bioinfo-Biostats pole

Thank you!

HDR defense presentation

HDR defense presentation

More Decks by Andrea Rau

Other Decks in Science

Featured

Transcript