Differential expression analysis tools

Alyssa Frazee
September 07, 2014

Talk given at the ECCB 2014 RNA-seq workshop (http://www.eccb14.org/program/workshops/rna-seq) and at the University of Zurich, on tools we've built for differential expression analysis that do not rely on existing gene/exon/isoform annotation.

Transcript

  1. Engineering annotation-agnostic tools for differential expression analysis

    Alyssa Frazee, Johns Hopkins University, @acfrazee
  2. Why annotation-agnostic?

  3. Why annotation-agnostic? (1) isoform-level analysis

  4. scientifically important: cell differentiation (Trapnell 2010), organism development (Graveley 2010), cancer (Govindan 2012)
  5. [Figure; x-axis: genomic position]

    but read-counting is challenging
  6. Why annotation-agnostic? (2) allows for new discoveries

  7. None
  8. One solution: assembly [Figure: genome and some possible assemblies]

  9. Approach 1: avoid assembly altogether

  10. DER Finder (PMID 24398039)

    idea: scan genome base-by-base, highlight segments showing differential expression signal
  11. DER Finder

    [Figure, chr22: 17684448-17684670. Panels: read coverage (log2(count+1), normal vs. tumor), DE signal (t statistic), states; x-axis: genomic position]
  12. DER Finder

    [Figure, chr22: 17684448-17684670. Panels: read coverage (log2(count+1), normal vs. tumor), DE signal (t statistic), states; x-axis: genomic position]
  13. find signal at each nucleotide

    [Equation: linear model for expression, with terms for the covariate of interest and the confounders; samples indexed by i, locations indexed by l_j, confounders indexed by k]
  14. find signal at each nucleotide

    [Equation: linear model for expression, with terms for the covariate of interest and the confounders; samples indexed by i, locations indexed by l_j, confounders indexed by k]
  15. segment genome into groups of nucleotides with similar signal

    [Figure: t statistics t_1, t_2, t_3, t_4, t_5 with underlying states DE / not DE]
  16. segment genome into groups of nucleotides with similar signal: Hidden Markov Model

    [Figure: t statistics t_1, t_2, t_3, t_4, t_5 with underlying states DE / not DE]
  17. linear models, HMM (candidate DERs)

    [Figure, chr22: 17684448-17684670. Panels: log2(count+1) coverage (normal vs. tumor), t statistic, exons, states; x-axis: genomic position]
  18. linear models, HMM, permutation tests for statistical significance

    [Figure, chr22: 17684448-17684670. Panels: log2(count+1) coverage (normal vs. tumor), t statistic, exons, states; x-axis: genomic position]
  19. match to annotation if desired: CECR1, “may play a role in regulating cell proliferation”

    [Figure, chr22: 17684448-17684670. Panels: log2(count+1) coverage (normal vs. tumor), t statistic, exons, states; x-axis: genomic position]
  20. engineering challenges

    •  creating and handling nucleotide-by-sample matrix
    •  efficient linear model fitting (solution: lmFit)
    •  efficient segmentation with HMM
    •  efficient p-value calculations

    Initial solution: https://github.com/alyssafrazee/derfinder
  21. Approach 2: improve the current software infrastructure for analysis of transcriptome assemblies
  22. Ballgown

    bioRxiv preprint: http://biorxiv.org/content/early/2014/03/30/003665, Bioconductor package “ballgown”

    Ballgown as connecting framework between transcriptome assembly pipelines and R/Bioconductor DE analysis pipelines:

    •  input: paired-end RNA-seq reads
    •  Align Reads (e.g. TopHat)
    •  Assemble Transcripts (e.g. Cufflinks)
    •  Estimate Expression (Cufflinks via Tablemaker, RSEM)
    •  Differential Expression Tests (default Ballgown models, limma, edgeR, DESeq, …)
  23. S4 class for transcript assemblies

    [Diagram: ballgown object data structure. Components: structure (exon, intron, transcript), expr (exon, intron, transcript), indexes (e2t, i2t, t2g, pData, bamfiles). Storage types shown: expr matrices, GRanges, data frames]
  24. easy exploration and DE analysis: plotting functions

    [Figure; x-axis: genomic position]
  25. easy exploration and DE analysis: statistical tests (drop-in replacement for Cuffdiff)

        stat_results = stattest(my_assembly, feature='transcript', meas='FPKM', covariate='group')
        head(stat_results)
        ##      feature id       pval        qval
        ##   transcript 10 0.01381576 0.105212332
        ##   transcript 25 0.26773622 0.791149753
        ##   transcript 35 0.01085070 0.089518254
        ##   transcript 41 0.47108019 0.902537475
        ##   transcript 45 0.08402948 0.489348136
        ##   transcript 67 0.27317385 0.79114975
  26. easy exploration and DE analysis: statistical tests, easily connects to existing DE packages

    [Diagram: ballgown object data structure. Components: structure (exon, intron, transcript), expr (exon, intron, transcript), indexes (e2t, i2t, t2g, pData, bamfiles)]
  27. easy exploration and DE analysis: annotation functions

    get corresponding gene names, match assembled and annotated transcripts, plot assembly alongside annotation

    [Figure: Assembled and Annotated Transcripts; tracks: annotated, assembled; x-axis: genomic position]
  28. highly flexible

  29. freely available!

    Bioconductor (devel):

        source("http://bioconductor.org/biocLite.R")
        biocLite("ballgown")

    GitHub: https://github.com/alyssafrazee/ballgown

    Cufflinks users will also need Tablemaker: https://github.com/alyssafrazee/tablemaker
  30. Thanks

    Collaborators: Jeff Leek (advisor), Steven Salzberg, Ben Langmead, Andrew Jaffe, Rafa Irizarry, Sarven Sabunciyan, Kasper Hansen, Geo Pertea, Leonardo Collado Torres

    Contact: alyssafrazee.com, alyssa.frazee@jhu.edu, @acfrazee (Twitter)
  31. differential expression model

    for each transcript, compare the fits of the following models using an F-test. Null hypothesis is that the fits of model (a) and model (b) are equally good; alternative is that (a) fits better.

    (a) $$\text{expression}_i = \alpha + \beta_0\,\text{group}_i + \sum_{p=1}^{P} \beta_p\,\text{confounder}_{ip} + \text{noise}_{ip}$$

    (b) $$\text{expression}_i = \alpha^* + \sum_{p=1}^{P} \beta^*_p\,\text{confounder}_{ip} + \text{noise}^*_{ip}$$

    $$\text{expression}_i = \alpha + \sum_{k=1}^{K} \gamma_k\,\text{spline}_k(\text{time}_i) + \sum_{p=1}^{P} \beta_p\,\text{confounder}_{ip} + \text{noise}_{ip}$$
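
The nested-model comparison on slide 31 can be sketched in base R for a single transcript. This is a minimal illustration on simulated data, not Ballgown's own implementation: the variable names, the effect sizes, and the use of a single confounder (P = 1) are all assumptions made for the example.

```r
# Simulated data for ONE transcript; every name and value here is illustrative.
set.seed(1)
n <- 20
group <- factor(rep(c("normal", "tumor"), each = n / 2))  # covariate of interest
confounder <- rnorm(n)                                    # a single confounder (P = 1)
expression <- 5 + 2 * (group == "tumor") + 0.5 * confounder + rnorm(n, sd = 0.5)

# Model (a): intercept + group + confounder
fit_a <- lm(expression ~ group + confounder)
# Model (b): intercept + confounder only (group term dropped)
fit_b <- lm(expression ~ confounder)

# F-test comparing the nested fits; row 2 of the table holds the comparison
f_test <- anova(fit_b, fit_a)
f_stat <- f_test$F[2]
p_val  <- f_test$`Pr(>F)`[2]
```

A small `p_val` means the group term improves the fit beyond what the confounder explains, which is the alternative hypothesis stated on the slide. Repeating this per transcript gives one p-value per feature, which would then need multiple-testing adjustment (for instance with `p.adjust`).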
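
The "find signal at each nucleotide" step on slides 13 and 14 can be illustrated with a small base R sketch: one linear model per genomic position, keeping the t statistic for the covariate of interest. The simulated coverage matrix, the planted region at positions 81 to 120, and the single confounder are assumptions for the example. Slide 20 names lmFit as the efficient solution; plain `lm` in a loop is used here only because it keeps the idea visible, and it would be far too slow genome-wide.

```r
# Simulated nucleotide-by-sample coverage matrix; all values are illustrative.
set.seed(2)
n_pos  <- 200                      # nucleotide positions (rows)
n_samp <- 10                       # samples (columns)
group  <- rep(c(0, 1), each = 5)   # covariate of interest (0 = normal, 1 = tumor)
conf   <- rnorm(n_samp)            # one hypothetical confounder

coverage <- matrix(rpois(n_pos * n_samp, lambda = 20), nrow = n_pos)
# Plant a differentially expressed region at positions 81-120
coverage[81:120, group == 1] <- rpois(40 * 5, lambda = 80)

y <- log2(coverage + 1)            # same transformation as the slide figures

# One linear model per nucleotide; keep the t statistic for the group effect
t_stat <- apply(y, 1, function(row) {
  fit <- lm(row ~ group + conf)
  coef(summary(fit))["group", "t value"]
})
```

Positions inside the planted region should show much larger t statistics than the background, which is the per-base signal the later segmentation step works from.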
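
Slides 15 and 16 segment the genome with a Hidden Markov Model over the per-base t statistics. The sketch below is a deliberately simplified two-state version (not DE / DE) decoded with the Viterbi algorithm in base R. The function name `viterbi_two_state`, the Gaussian emission means and standard deviations, and the transition probabilities are hypothetical choices for illustration; they are not the parameters or state set used by DER Finder.

```r
# Two-state Viterbi decoding over a vector of t statistics.
# State 1 = "not DE", state 2 = "DE"; all default parameters are assumed values.
viterbi_two_state <- function(t_stat, means = c(0, 4), sds = c(1, 1),
                              stay = 0.99, init = c(0.9, 0.1)) {
  n <- length(t_stat)
  stopifnot(n >= 2)
  # log transition matrix: rows = previous state, columns = current state
  log_trans <- log(matrix(c(stay, 1 - stay, 1 - stay, stay), nrow = 2))
  # log emission density of every observation under each state (n x 2)
  log_emit <- cbind(dnorm(t_stat, means[1], sds[1], log = TRUE),
                    dnorm(t_stat, means[2], sds[2], log = TRUE))
  score <- matrix(-Inf, n, 2)   # best log score of any path ending in each state
  back  <- matrix(0L, n, 2)     # back-pointers to the best previous state
  score[1, ] <- log(init) + log_emit[1, ]
  for (i in 2:n) {
    for (s in 1:2) {
      cand <- score[i - 1, ] + log_trans[, s]
      back[i, s]  <- which.max(cand)
      score[i, s] <- max(cand) + log_emit[i, s]
    }
  }
  # Trace back the most likely state path
  path <- integer(n)
  path[n] <- which.max(score[n, ])
  for (i in (n - 1):1) path[i] <- back[i + 1, path[i + 1]]
  c("not DE", "DE")[path]
}

# Usage on simulated t statistics with a DE stretch at positions 51-80
set.seed(3)
t_sim  <- c(rnorm(50, mean = 0), rnorm(30, mean = 4), rnorm(50, mean = 0))
states <- viterbi_two_state(t_sim)
runs   <- rle(states)   # consecutive runs of "DE" are the candidate regions
```

The high self-transition probability is what makes neighbouring bases share a state, so isolated noisy t statistics do not create one-base regions.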
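
Slide 18 mentions permutation tests for the statistical significance of candidate regions. One generic way to do this, shown here as a sketch and not necessarily the procedure DER Finder uses, is to summarize a region by its mean per-base t statistic and compare it with the same summary after shuffling the group labels. The helper `region_stat`, the Welch t-test, the region size, and the number of permutations are all assumptions for the example.

```r
# Simulated expression for one candidate region: 30 positions x 12 samples.
set.seed(4)
n_pos <- 30
group <- rep(c(0, 1), each = 6)
y <- matrix(rnorm(n_pos * 12), nrow = n_pos)
y[, group == 1] <- y[, group == 1] + 1.5   # whole region shifted in group 1

# Region summary: mean of per-position two-sample t statistics (group 1 vs 0)
region_stat <- function(y, group) {
  mean(apply(y, 1, function(row) {
    t.test(row[group == 1], row[group == 0])$statistic
  }))
}

observed <- region_stat(y, group)

# Null distribution: recompute the summary under shuffled group labels
n_perm <- 200
null_stats <- replicate(n_perm, region_stat(y, sample(group)))

# Two-sided permutation p-value with the usual +1 correction
p_value <- (1 + sum(abs(null_stats) >= abs(observed))) / (1 + n_perm)
```

The smallest attainable p-value is 1 / (1 + n_perm), so the number of permutations sets the resolution of the test.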