Differential expression analysis tools

Alyssa Frazee
September 07, 2014

Talk given at the ECCB 2014 workshop (http://www.eccb14.org/program/workshops/rna-seq) and at the University of Zurich, on tools we've built to do differential expression analysis without relying on existing gene/exon/isoform annotation.

Transcript

  1. Engineering annotation-agnostic tools for differential expression analysis
    Alyssa Frazee
    Johns Hopkins University
    @acfrazee

  2. Why annotation-agnostic?

  3. Why annotation-agnostic?
    (1) isoform-level analysis

  4. scientifically important
    •  cell differentiation (Trapnell 2010)
    •  organism development (Graveley 2010)
    •  cancer (Govindan 2012)

  5. but read-counting is challenging
    [Figure: transcripts labeled 332 to 338 plotted by genomic position, 24376000 to 24384000]

  6. Why annotation-agnostic?
    (2) allows for new discoveries

  7. [Slide contains no text]

  8. One solution: assembly
    [Figure: genome and some possible assemblies]

  9. Approach 1: avoid assembly altogether

  10. DER Finder
    PMID 24398039
    idea: scan genome base-by-base, highlight segments showing differential expression signal

  11. DER Finder
    [Figure: read coverage (log2(count+1)) for normal and tumor samples over chr22: 17684448−17684670, with the DE signal (t statistic) and states plotted against genomic position]

  12. DER Finder
    [Figure: read coverage (log2(count+1)) for normal and tumor samples over chr22: 17684448−17684670, with the DE signal (t statistic) and states plotted against genomic position]

  13. find signal at each nucleotide
    [Equation: linear model for expression with a covariate of interest and confounders; samples indexed by i, locations indexed by l_j, confounders indexed by k]

  14. find signal at each nucleotide
    [Equation: linear model for expression with a covariate of interest and confounders; samples indexed by i, locations indexed by l_j, confounders indexed by k]
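
The per-nucleotide fit can be sketched in base R. This is a minimal illustration on simulated data, not the DER Finder implementation: the coverage matrix, the single `confounder`, and the spiked-in region `de_bases` are all assumptions made for the example. It fits one design matrix to every row of a nucleotide-by-sample matrix at once and extracts the t statistic for the covariate of interest.

```r
set.seed(42)
n_samples <- 10
n_bases <- 200
group <- rep(c(0, 1), each = n_samples / 2)  # covariate of interest
confounder <- rnorm(n_samples)               # hypothetical confounder

# Simulated nucleotide-by-sample coverage matrix (rows = bases, columns = samples)
coverage <- matrix(rpois(n_bases * n_samples, lambda = 20), nrow = n_bases)

# Spike in a stretch of bases with higher coverage in group 1
de_bases <- 81:120
coverage[de_bases, group == 1] <-
  rpois(length(de_bases) * sum(group == 1), lambda = 60)

y <- log2(coverage + 1)  # same transformation as on the coverage plots

# Design matrix: intercept, covariate of interest, confounder
design <- cbind(1, group, confounder)

# Least squares for every base at once: each row of coefs is (X'X)^-1 X'y_j
xtx_inv <- solve(crossprod(design))
coefs <- y %*% design %*% xtx_inv

# Residual variance per base, then the t statistic for the group coefficient
resid <- y - coefs %*% t(design)
df_resid <- n_samples - ncol(design)
sigma2 <- rowSums(resid^2) / df_resid
t_stat <- coefs[, 2] / sqrt(sigma2 * xtx_inv[2, 2])
```

Solving the normal equations once and reusing them for every row avoids calling `lm()` separately at each base, which matters when the matrix has one row per nucleotide.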

  15. segment genome into groups of nucleotides with similar signal
    [Figure: t statistics t_1, ..., t_5 at consecutive nucleotides, each labeled with a state (DE or not DE)]

  16. segment genome into groups of nucleotides with similar signal
    [Figure: t statistics t_1, ..., t_5 at consecutive nucleotides, each labeled with a state (DE or not DE)]
    Hidden Markov Model
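
One way to picture the segmentation step is Viterbi decoding of a two-state hidden Markov model over the t statistics. The sketch below is base R written for illustration only: the function name `viterbi_two_state`, the two states, the Gaussian emission parameters, and the transition probability are assumed values, not the ones DER Finder uses.

```r
# Viterbi decoding for a two-state HMM with Gaussian emissions.
# State 1 = "not DE", state 2 = "DE" (labels assumed for this example).
viterbi_two_state <- function(t_stat, means, sds, stay_prob = 0.99,
                              init = c(0.5, 0.5)) {
  n <- length(t_stat)
  # log_trans[from, to]
  log_trans <- log(matrix(c(stay_prob, 1 - stay_prob,
                            1 - stay_prob, stay_prob),
                          nrow = 2, byrow = TRUE))
  # Log emission density of each observation under each state
  log_emit <- cbind(dnorm(t_stat, means[1], sds[1], log = TRUE),
                    dnorm(t_stat, means[2], sds[2], log = TRUE))
  score <- matrix(-Inf, n, 2)  # best log probability ending in each state
  back <- matrix(0L, n, 2)     # best previous state
  score[1, ] <- log(init) + log_emit[1, ]
  for (pos in seq_len(n)[-1]) {
    for (s in 1:2) {
      cand <- score[pos - 1, ] + log_trans[, s]
      back[pos, s] <- which.max(cand)
      score[pos, s] <- max(cand) + log_emit[pos, s]
    }
  }
  # Trace back the most likely state path
  path <- integer(n)
  path[n] <- which.max(score[n, ])
  for (pos in rev(seq_len(n - 1))) {
    path[pos] <- back[pos + 1, path[pos + 1]]
  }
  path
}

# Toy signal: t statistics near 0, then near 4, then near 0 again
t_stat <- c(rep(0, 50), rep(4, 30), rep(0, 50)) + 0.5 * sin(seq_len(130))
path <- viterbi_two_state(t_stat, means = c(0, 4), sds = c(1, 1))

# Collapse the state path into contiguous segments
runs <- rle(path)
ends <- cumsum(runs$lengths)
segments <- data.frame(start = ends - runs$lengths + 1,
                       end = ends,
                       state = c("not DE", "DE")[runs$values],
                       stringsAsFactors = FALSE)
```

On this toy signal, `segments` has three rows: bases 1 to 50 ("not DE"), 51 to 80 ("DE"), and 81 to 130 ("not DE"). The high `stay_prob` is what keeps a single noisy base from splitting a segment.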

  17. [Figure: read coverage (log2(count+1)) for normal and tumor samples over chr22: 17684448−17684670, with t statistic, exons, and states tracks plotted against genomic position]
    linear models
    HMM (candidate DERs)

  18. [Figure: read coverage (log2(count+1)) for normal and tumor samples over chr22: 17684448−17684670, with t statistic, exons, and states tracks plotted against genomic position]
    linear models
    HMM
    permutation tests for statistical significance
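
A permutation test for one candidate region might look like the following base R sketch. It is a simplification under stated assumptions: the region statistic (mean per-base t statistic), the helper name `mean_t`, the number of permutations, and the simulated data are all chosen for illustration and are not taken from DER Finder.

```r
set.seed(7)
n_samples <- 12
group <- rep(c(0, 1), each = n_samples / 2)

# Simulated log2 coverage for one candidate region (rows = bases, columns = samples)
region <- matrix(rnorm(25 * n_samples, mean = 4, sd = 0.5), nrow = 25)
region[, group == 1] <- region[, group == 1] + 1  # simulated group difference

# Region statistic: equal-variance two-sample t statistic at each base, averaged
mean_t <- function(y, g) {
  a <- y[, g == 1, drop = FALSE]
  b <- y[, g == 0, drop = FALSE]
  na <- ncol(a)
  nb <- ncol(b)
  pooled_var <- (rowSums((a - rowMeans(a))^2) +
                 rowSums((b - rowMeans(b))^2)) / (na + nb - 2)
  t_stat <- (rowMeans(a) - rowMeans(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
  mean(t_stat)
}

observed <- mean_t(region, group)

# Null distribution: shuffle the group labels and recompute the statistic
n_perm <- 999
null_stats <- replicate(n_perm, mean_t(region, sample(group)))

# Two-sided permutation p-value; the +1 counts the observed labeling itself
p_value <- (1 + sum(abs(null_stats) >= abs(observed))) / (n_perm + 1)
```

Shuffling sample labels keeps the correlation between neighboring bases intact, which is why the labels are permuted rather than the bases.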

  19. [Figure: read coverage (log2(count+1)) for normal and tumor samples over chr22: 17684448−17684670, with t statistic, exons, and states tracks plotted against genomic position]
    match to annotation if desired: CECR1, “may play a role in regulating cell proliferation”

  20. engineering challenges
    •  creating and handling nucleotide-by-sample matrix
    •  efficient linear model fitting (solution: lmFit)
    •  efficient segmentation with HMM
    •  efficient p-value calculations
    Initial solution: https://github.com/alyssafrazee/derfinder

  21. Approach 2: improve the current software infrastructure for analysis of transcriptome assemblies

  22. Ballgown
    bioRxiv preprint: http://biorxiv.org/content/early/2014/03/30/003665
    Bioconductor package “ballgown”
    Ballgown as connecting framework between transcriptome assembly pipelines and R/Bioconductor DE analysis
    [Diagram: pipeline starting from paired-end RNA-seq reads]
    •  Align reads (e.g. TopHat)
    •  Assemble transcripts (e.g. Cufflinks)
    •  Estimate expression (Cufflinks via Tablemaker, RSEM)
    •  Differential expression tests (default Ballgown models, limma, edgeR, DESeq, …)

  23. S4 class for transcript assemblies
    [Diagram: ballgown object with three components: data (expr): exon, intron, transcript; structure: exon, intron, transcript; indexes: e2t, i2t, t2g, pData, bamfiles. Type labels: matrices, GRanges, data frames]

  24. easy exploration and DE analysis
    plotting functions
    [Figure: transcripts labeled 332 to 338 plotted by genomic position, 24376000 to 24384000]

  25. easy exploration and DE analysis
    statistical tests (drop-in replacement for Cuffdiff)
    stat_results = stattest(my_assembly, feature='transcript',
        meas='FPKM', covariate='group')
    head(stat_results)
    ## feature id pval qval
    ## transcript 10 0.01381576 0.105212332
    ## transcript 25 0.26773622 0.791149753
    ## transcript 35 0.01085070 0.089518254
    ## transcript 41 0.47108019 0.902537475
    ## transcript 45 0.08402948 0.489348136
    ## transcript 67 0.27317385 0.79114975

  26. easy exploration and DE analysis
    statistical tests: easily connects to existing DE packages
    [Diagram: ballgown object with three components: data (expr): exon, intron, transcript; structure: exon, intron, transcript; indexes: e2t, i2t, t2g, pData, bamfiles]

  27. easy exploration and DE analysis
    annotation functions: get corresponding gene names, match assembled and annotated transcripts, plot assembly alongside annotation
    [Figure: “Assembled and Annotated Transcripts”, annotated and assembled tracks plotted by genomic position, 24615000 to 24640000]

  28. highly flexible

  29. freely available!
    Bioconductor (devel):
    source("http://bioconductor.org/biocLite.R")
    biocLite("ballgown")
    GitHub:
    https://github.com/alyssafrazee/ballgown
    Cufflinks users will also need Tablemaker:
    https://github.com/alyssafrazee/tablemaker

  30. Thanks
    Collaborators: Jeff Leek (advisor), Steven Salzberg,
    Ben Langmead, Andrew Jaffe, Rafa Irizarry, Sarven
    Sabunciyan, Kasper Hansen, Geo Pertea, Leonardo
    Collado Torres
    Contact: alyssafrazee.com, [email protected],
    @acfrazee (Twitter)

  31. differential expression model
    For each transcript, compare the fits of the following models using an F-test. The null hypothesis is that the fits of model (a) and model (b) are equally good; the alternative is that (a) fits better.
    (a) $\text{expression}_i = \alpha + \beta_0 \, \text{group}_i + \sum_{p=1}^{P} \beta_p \, \text{confounder}_{ip} + \text{noise}_{ip}$
    (b) $\text{expression}_i = \alpha^* + \sum_{p=1}^{P} \beta^*_p \, \text{confounder}_{ip} + \text{noise}^*_{ip}$
    $\text{expression}_i = \alpha + \sum_{k=1}^{K} \gamma_k \, \text{spline}_k(\text{time}_i) + \sum_{p=1}^{P} \beta_p \, \text{confounder}_{ip} + \text{noise}_{ip}$
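
The nested-model comparison can be sketched with base R's `lm()` and `anova()`. This is an illustration on simulated data for a single hypothetical transcript, not the Ballgown implementation; the variable names, effect sizes, and the choice of `ns(time, df = 4)` for the spline terms are assumptions.

```r
library(splines)  # ships with base R; provides ns() for natural splines

set.seed(1)
n <- 20
group <- rep(c(0, 1), each = n / 2)   # covariate of interest
confounder <- rnorm(n)                # one hypothetical confounder
expression <- 5 + 2 * group + 0.5 * confounder + rnorm(n)

# Model (a): includes the group term; model (b): confounders only
fit_a <- lm(expression ~ group + confounder)
fit_b <- lm(expression ~ confounder)

# F-test of the nested models: does adding group improve the fit?
ftest <- anova(fit_b, fit_a)
f_stat <- ftest[["F"]][2]
p_val <- ftest[["Pr(>F)"]][2]

# Time-course variant: replace the group term with spline terms of time
time <- seq(0, 1, length.out = n)
fit_time <- lm(expression ~ ns(time, df = 4) + confounder)
p_val_time <- anova(fit_b, fit_time)[["Pr(>F)"]][2]
```

With a single added coefficient, as in model (a) versus model (b), the F statistic equals the square of that coefficient's t statistic, so the two tests agree.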
