Upgrade to Pro — share decks privately, control downloads, hide ads and more …

HDR defense presentation

Andrea Rau
September 26, 2017

HDR defense presentation

In recent years, high-throughput sequencing (HTS) has become an essential tool for genomic and transcriptomic studies. In particular, the use of HTS to directly sequence reverse-transcribed RNA molecules, known as RNA sequencing (RNA-seq), has revolutionized the study of gene expression. In turn, a great deal of methodological research has focused on developing analysis pipelines that are well suited to the characteristics of RNA-seq data. In this work, I focus on methodological contributions to three analytical goals: (1) the powerful detection of differentially expressed genes from RNA-seq data, in particular through a data-based filter for weakly expressed genes and a P-value combination approach for data arising from multiple related studies; (2) the identification of clusters of co-expressed genes from RNA-seq data using finite mixture models, appropriately chosen transformations, and adapted model selection criteria; and (3) the inference of gene regulatory networks from observational RNA-seq data or arbitrarily complex gene knockout expression data. In addition, I will present some of the open-source software packages I have developed and continue to maintain for the implementation of our proposed statistical methods. Finally, I will discuss some related research perspectives regarding methodological developments for multi-omics data integration.

Andrea Rau

September 26, 2017
Tweet

More Decks by Andrea Rau

Other Decks in Science

Transcript

  1. Statistical methods and software for the
    analysis of transcriptomic data
    Andrea Rau
    HDR defense
    September 26, 2017
    Universit´
    e d’´
    Evry-Val-d’Essonne

    View Slide

  2. 2005 B.A. Mathematics & French
    2007 M.S. Applied Statistics
    2010 Ph.D. Statistics
    Prof. R. W. Doerge, Dr. F. Jaffr´
    ezic, Dr. J.-L. Foulley
    Reverse engineering gene networks using genomic
    time-course data
    2010-2011 Post-doctoral fellowship
    Dr. G. Celeux, Dr. M.-L. Martin-Magniette
    Methods for RNA-seq co-expression analysis
    2011-today Charg´
    ee de Recherche
    Animal Genetics & Integrative Biology (GABI) unit,
    Jouy en Josas

    View Slide

  3. Supervision and teaching
    Student (co-)supervision:
    2 Ph.D. students and 1 3-month Erasmus+ Ph.D. mobility

    elina Gallopin (defended December 2015), Gilles Monneret (expected
    December 2017), Manuel Revilla S´
    anchez (defended June 2017)
    7 Master’s students
    Rapha¨
    elle Momal-Leisenring (2017), Fr´
    ed´
    eric Jehl (2017), Babacar Ciss
    (2016), Audrey Hulot (2015), Meriem Benabbas (2015), M´
    elina
    Gallopin (2012), R´
    emi Bancal (2012)
    Teaching:
    M1/M2 biostatistics courses at Universit´
    e Paris-Sud (2011-2012),
    Ensai (2012-2017), UEVE (2016-2017)
    Researcher training schools: INRA-PEPI school (2012), RGB-Net
    COST action (2012), Agrocampus Ouest (2016-2017), Saclay Plant
    Sciences (2016), PiGutNet COST action (2017)
    Organization of BioBayes school on Bayesian statistics (2013) +
    co-author of subsequent book
    3 / 35

    View Slide

  4. Context: RNA-seq expression data
    4 / 35

    View Slide

  5. Context: RNA-seq expression data
    4 / 35

    View Slide

  6. Context: RNA-seq expression data
    4 / 35

    View Slide

  7. Context: RNA-seq expression data
    4 / 35

    View Slide

  8. Context: RNA-seq expression data
    4 / 35

    View Slide

  9. Context: RNA-seq expression data
    4 / 35

    View Slide

  10. Contributions
    5 / 35

    View Slide

  11. Contributions
    6 / 35

    View Slide

  12. Differential analysis of RNA-seq data
    Is there a statistically significant difference in the expression
    of gene i between experimental conditions?
    Variable library sizes among experiments + technical biases
    ⇒ per-sample normalization factors (mj ) needed1
    Discrete, non-negative, highly skewed data
    ⇒ transformation + precision weights or negative binomial models :
    yij ∼ NB (mj qij , φi )
    log(qij ) = Xj βi
    Differential expression for gene i via hypothesis testing:
    H0i : cβi
    = 0
    1Statomique consortium (Dillies et al., 2013)
    7 / 35

    View Slide

  13. Filtering in differential analyses
    Stringent multiple testing
    correction (e.g., FDR control)
    due to large number of
    hypothesis tests
    Usually assume P-values are
    uniformly distributed under H0
    p−values
    Frequency
    0.0 0.2 0.4 0.6 0.8 1.0
    0 200 400 600 800 1000 1200
    Identify and remove genes that generate an uninformative signal
    Only test hypotheses for genes passing filter ⇒ tempered correction
    for multiple testing
    Filtering thresholds tend to be fairly ad hoc in practice
    8 / 35

    View Slide

  14. A data-based filter for RNA-seq data2
    ˜
    yj = yj /mj : normalized read counts in sample j from condition C(j)
    Idea: identify a filtering threshold s that maximizes the filtering
    similarity among replicates within the same condition using the
    Jaccard index
    Js(˜
    yj , ˜
    yj ) =
    a
    a + b + c
    Sample j
    Normalized
    counts > s
    Normalized
    counts ≤ s
    Sample j
    Normalized
    counts > s
    a b
    Normalized
    counts ≤ s
    c d
    2Joint work with M. Gallopin, G. Celeux, and F. Jaffr´
    ezic (Rau et al., 2013)
    9 / 35

    View Slide

  15. A data-based filter for RNA-seq data
    Multiple replicates/conditions typically available ⇒ global filtering
    similarity by averaging pairwise Jaccard indices within conditions:
    Js

    y) = mean Js(˜
    yj , ˜
    yj ) : j < j and C(j) = C(j )
    Data-based filter threshold:
    s = arg max
    s
    Js

    y)
    ⇒ Filter genes with normalized
    read counts ≤ s in all samples
    10 / 35

    View Slide

  16. A data-based filter for RNA-seq data
    Increased detection power for moderately/strongly expressed genes
    11 / 35

    View Slide

  17. HTSFilter: A data-based filter for RNA-seq data
    > library(HTSFilter)
    > library(DESeq2)
    >
    > dds <- DESeq(ddsObject)
    > filter <- HTSFilter(dds)$filteredData
    > res <- results(filter, independentFiltering=FALSE)
    12 / 35

    View Slide

  18. Contributions
    13 / 35

    View Slide

  19. Contributions
    13 / 35

    View Slide

  20. Gene co-expression is...
    Groups of co-transcribed genes
    Similar expression patterns/trends over a range of experiments
    (correlation, partial correlation, clustering)
    Related to shared regulatory inputs, functional pathways, and
    biological process(es) characterize orphan genes
    Image sources: Google search “co-expression analysis”
    14 / 35

    View Slide

  21. How should co-expression be defined for RNA-seq?
    : Cluster measure of relative expression independent of overall expression
    0
    500
    1000
    1500
    2000
    2500
    0 5 10 15
    Sample
    Normalized expression
    A
    0
    2
    4
    6
    8
    0 5 10 15
    Sample
    Log(normalized expression + 1)
    B
    0.00
    0.05
    0.10
    0.15
    0.20
    0.25
    0 5 10 15
    Sample
    Normalized expression profiles
    Group
    1
    2
    3
    4
    C
    15 / 35

    View Slide

  22. How should co-expression be defined for RNA-seq?
    : Cluster measure of relative expression independent of overall expression
    0
    500
    1000
    1500
    2000
    2500
    0 5 10 15
    Sample
    Normalized expression
    A
    0
    2
    4
    6
    8
    0 5 10 15
    Sample
    Log(normalized expression + 1)
    B
    0.00
    0.05
    0.10
    0.15
    0.20
    0.25
    0 5 10 15
    Sample
    Normalized expression profiles
    Group
    1
    2
    3
    4
    C
    Normalized profile for gene i: pij = ˜
    yij +1
    ˜
    yi +1
    15 / 35

    View Slide

  23. Some strategies for mixture models for RNA-seq data3
    f (·|K, ΨK ) =
    n
    i=1
    K
    k=1
    πkfk(·|θk)
    1 Directly model read counts: HTSCluster
    yi |gene i ∈ cluster k ∼
    J
    j=1
    Poisson(yij |µijk)
    2 Apply appropriately chosen transformation: coseq
    g(yi )|gene i ∈ cluster k ∼ N(µk, Σk)
    3 Model the compositional nature of profiles: coseq
    g(pi )|gene i ∈ cluster k ∼ N(µk, Σk)
    3
    Joint work with C. Maugis-Rabusseau, M.-L. Martin-Magniette, G. Celeux, A. Godichon-Baggioni.
    16 / 35

    View Slide

  24. (Per-cluster) covariance structures in RNA-seq
    17 / 35

    View Slide

  25. Transformations for normalized profiles
    Normalized profiles pij are compositional data, i.e. linear dependence
    of pi· = 1 ⇒ pi imposes constraints on covariance matrices Σk that
    are problematic for Gaussian mixture models
    18 / 35

    View Slide

  26. Transformations for normalized profiles
    Normalized profiles pij are compositional data, i.e. linear dependence
    of pi· = 1 ⇒ pi imposes constraints on covariance matrices Σk that
    are problematic for Gaussian mixture models
    : Transform normalized profiles to break the linear sum constraint
    garcsin(pij ) = arcsin

    pij ∈ [0, π/2],
    glogit(pij ) = log
    pij
    1 − pij
    ∈ (−∞, ∞)
    ... and fit a Gaussian mixture model to transformed normalized
    profiles (for simplicity we’ll only consider the [pkLkCk] model:
    variable proportions, volume, orientation, and shape)
    18 / 35

    View Slide

  27. Choosing a transformation with model selection criteria
    g(x) = arbitrary monotonic transformation of a data x
    If g(x) ∼ f (.|ΨK ) (a GMM) ⇒ initial data x is an i.i.d. sample from
    fg (.|ΨK ) (obviously not necessarily a GMM)
    19 / 35

    View Slide

  28. Choosing a transformation with model selection criteria
    g(x) = arbitrary monotonic transformation of a data x
    If g(x) ∼ f (.|ΨK ) (a GMM) ⇒ initial data x is an i.i.d. sample from
    fg (.|ΨK ) (obviously not necessarily a GMM)
    : select the pair (K, g) minimizing the corrected criteria:
    BIC∗(K, g) = − log f (y; K, ˆ
    Ψ(K,g)
    ) −
    νK
    2
    log(n) − log (det(Jg ))
    ICL∗(K, g) = BIC∗(K, g) + Entropy(K)
    where Jg = Jacobian of the transformation g and ˆ
    Ψ(K,g)
    is the MLE
    for the model with K clusters and transformation g
    Note:# of parameters νK does not depend on g
    19 / 35

    View Slide

  29. Choosing between arcsin and logit with ICL
    First calculate log-determinant of each transformation:
    log (det(Jarcsin)) = −nq ln(2) −
    1
    2
    n
    i=1
    q
    j=1
    log (pij (1 − pij )) ,
    log (det(Jlogit)) = −nq ln (ln(2)) −
    n
    i=1
    q
    j=1
    log (pij (1 − pij ))
    Then choose the transformation corresponding to the minimum of
    ICL∗(K, arcsin) and ICL∗(K, logit)
    20 / 35

    View Slide

  30. coseq: Co-expression analysis of RNA-seq data
    > library(coseq)
    > coexp_arcsin <- coseq(counts, K=2:10, model="Normal",
    > transformation="arcsin")
    > coexp_logit <- coseq(counts, K=2:10, model="Normal",
    > transformation="logit")
    Options: parallelization provided via BiocParallel, filtering, library
    size normalization, transformation choice (arcsin, logit, CLR,
    logCLR), approach (Gaussian/Poisson mixture models, K-means)
    Seamless integration with edgeR/DESeq2 pipeline
    21 / 35

    View Slide

  31. coseq: Co-expression analysis of RNA-seq data
    > compareICL(list(coexp_arcsin, coexp_logit))
    22 / 35

    View Slide

  32. coseq: Co-expression analysis of RNA-seq data
    > plot(coexp_arcsin, ...)
    5 9 1 3
    12 4 6 2
    7 8 10 11
    0.00
    0.25
    0.50
    0.75
    1.00
    0.00
    0.25
    0.50
    0.75
    1.00
    0.00
    0.25
    0.50
    0.75
    1.00
    CP SVZ VZ CP SVZ VZ CP SVZ VZ CP SVZ VZ
    Conditions
    Average expression profiles
    Conditions
    CP
    SVZ
    VZ
    23 / 35

    View Slide

  33. coseq: Co-expression analysis of RNA-seq data
    > plot(coexp_arcsin, ...)
    Genes attributed to cluster with largest conditional probability of
    membership given the estimated parameters:
    τik(θ) =
    πkfk(yi |θk)
    K
    =1
    π f (yi |θ )
    23 / 35

    View Slide

  34. Functional annotation for co-expression analysis
    Odds Exp.
    K GO ID p-value Ratio Count Count Size Term
    1 GO:0006396 0.0010 4.47 2.10 8 360 RNA processing
    6 GO:0016203 0.0006 20.67 0.17 3 30 muscle attachment
    11 GO:0019058 0.0008 124.10 0.05 2 3 viral life cycle
    [...]
    Integrate functional annotations into the model selection step4 ⇒
    Integrated Completed Annotated Likelihood (ICAL) criterion:
    ICAL(K) = ICL(K) −
    G
    g=1
    K
    k=1
    ng
    k
    log
    ng
    k
    ng
    ng = # genes annotated for function g,
    ng
    k
    = # genes assigned to cluster k annotated for g
    4Joint work with M. Gallopin, G. Celeux, and F. Jaffr´
    ezic.
    24 / 35

    View Slide

  35. Contributions
    25 / 35

    View Slide

  36. Contributions
    25 / 35

    View Slide

  37. Markov equivalence in Directed Acyclic Graphs
    Graphs often used to represent gene networks
    Markov equivalence: two different networks can yield same joint
    distribution and observational data alone generally cannot orient edges
    26 / 35

    View Slide

  38. Markov equivalence in Directed Acyclic Graphs
    Graphs often used to represent gene networks
    Markov equivalence: two different networks can yield same joint
    distribution and observational data alone generally cannot orient edges
    Interventions can resolve this ambiguity
    26 / 35

    View Slide

  39. Gene expression interventions
    Gene knock-outs (KO):
    ⇒ Gene is made entirely inoperative
    Crossbreeding chimeric and wild-type individuals
    Naturally occuring point mutations causing functional KO
    CRISPR-Cas9 genome editing
    Gene knock-downs (KD):
    ⇒ Gene’s expression is reduced
    RNA interference (RNAi)
    Source: https://algae.ucsd.edu/research/cab-comm/genetic-tools.html
    27 / 35

    View Slide

  40. Gene expression interventions
    Mathematically, interventions are a do(Xi = xi ) operation
    Do-calculus (Pearl, 2000):
    E(Xj |do(Xi = xi )) = E(Xj ) if Xj ∈ pa(Xi )
    E(Xj |xi , pa(Xi ))P(pa(Xi ))dpa(Xi ) if Xj /
    ∈ pa(Xi )
    Total causal effects:
    βij =

    ∂x E(Xj |do(Xi = xi ))
    Equal to 0 if Xi is not an ancestor of Xj
    28 / 35

    View Slide

  41. Joint inference of steady-state + intervention expression
    Xj = expression of gene j
    Gaussian Bayesian network:
    Xj = mj +
    i∈pa(j)
    wij Xi + εj with εj ∼ N(0, σ2
    j
    )
    for j = 1, . . . , p
    wij = 0 if and only if i ∈ pa(j)
    Directed acyclic graph (DAG), and nodes have been ordered so
    that i ∈ pa(j) ⇒ i < j (i.e., W = (wij ) is upper triangular)
    Model parameters are θ = (W, m, σ)
    Total causal effects are β = (I − W)−1 = I + W + . . . + W p−1
    29 / 35

    View Slide

  42. Joint inference of steady-state + intervention expression
    MCMC-Mallows GBN model5
    Log-likelihood of model for arbitrary combination of knock-outs
    (single, partially complete, multiple, ...)
    Explore the posterior distribution of node order and estimated causal
    effects via an empirical MCMC algorithm
    Node ordering proposal via Mallows model
    Estimation of model parameters for given ordering via likelihood
    calculations
    Some extensions: ridge penalty for high(er) dimensional networks,
    marginal causal estimation for single KO
    5Joint work with F. Jaffr´
    ezic, G. Nuel, G. Monneret
    30 / 35

    View Slide

  43. MCMC-Mallows GBN model
    Simulations for several KO settings suggest that our approach makes
    better use of partially available information than competiting methods
    Multiple KOs contribute additional valuable information
    31 / 35

    View Slide

  44. Current/future work: integrative multi-omic analysis
    Multi-omics: Gene expression + methylation + copy number alterations +
    chromatin accessibility + chromatin conformation + ...
    Source: Rajasundaram and Selbig (2016)
    32 / 35

    View Slide

  45. Current/future work: integrative multi-omic analysis
    Exploring molecular drivers of gene expression in cancer genomes6
    Comprehensive/coordinated effort to improve the
    molecular understanding of major cancers through
    high-throughput genomics
    What are the relative contributions of other molecular data
    (methylation, copy number alterations, somatic mutations,
    transcription factor expression, miRNA abundance, genotypes) to the
    variation of gene expression within a cancer?
    Are there shared patterns across cancers?
    How can interesting patterns of molecular drivers be easily explored?
    R/Shiny interactive web application
    6Joint work with P. L. Auer.
    33 / 35

    View Slide

  46. Current/future work: integrative multi-omic analysis
    Joint modeling of chromatin accessibility and gene expression8
    Fr-AgENCODE: French pilot project part of FAANG consortium with
    aim to improve functional annotation for livestock7
    4 species / 3 tissues (CD4, CD8, liver)
    Multiple high-throughput assays: RNA-seq, ATAC-seq, Hi-C
    Goal: study multi-tissue/multi-species associations between the
    transcriptional and chromatin landscapes
    7Analyses coordinated by E. Giuffra, S. Foissac, S. Djebali-Quelen, S. Lagarrigue
    8Joint work with R. Momal-Leisenring.
    34 / 35

    View Slide

  47. Future work: integrative multi-omic analysis
    AgreenSkills+ mobility project
    Integrative analysis of multi-omics data for improved detection power of
    functional genetic variants (October 2017-May 2019)
    Comprehensive data from the NHLBI Trans-Omic Precision Medicine
    (TOPmed) program ⇒ identify rare coding and functional noncoding
    variants underlying complex multifactorial diseases
    60k individuals (26 studies) profiled with WGS, other ’omics to follow
    Goal: integrate public transcriptomic/epigenomic data (GTEx,
    ENCODE, Roadmap) to better target functionally relevant variants
    35 / 35

    View Slide

  48. Acknowledgements
    Ph.D. students:

    elina Gallopin
    Manuel Revilla Sanchez
    Gilles Monneret
    Master’s students:
    Rapha¨
    elle Momal-Leisenring
    Fr´
    ed´
    eric Jehl
    Babacar Ciss
    Audrey Hulot
    Meriem Benabbas

    emi Bancal
    Statomique consortium
    Funding:
    MixStatSeq ANR-JCJC grant
    INRA/GA departmental grants
    (Causality, COSI-net)
    HDR evaluation committee:
    David Causeur
    Anne-Laure Boulesteix
    Nathalie Villa-Vialaneix
    St´
    ephane Robin
    Franck Picard
    Christophe Ambroise
    Collaborators:
    Gilles Celeux
    Marie-Laure Martin-Magniette
    Cathy Maugis-Rabusseau
    Antoine Godichon-Baggioni
    Gr´
    egory Nuel
    Guillemette Marot
    Christopher Sauvage
    Sandrine Lagarrigue
    Paul Livermore Auer

    View Slide

  49. Acknowledgements
    GABI research unit, and our director Claire Rogel-Gaillard
    Special thanks to the PSGen team, especially Florence Jaffr´
    ezic,
    Denis Lalo¨
    e, Tatiana Zerjal, and the Bioinfo-Biostats pole

    View Slide

  50. Thank you!

    View Slide