Differential expression analysis tools

Engineering annotation-agnostic tools for differential expression analysis
Alyssa Frazee, Johns Hopkins University, @acfrazee

Why annotation-agnostic?

Why annotation-agnostic? (1) isoform-level analysis

scientifically important:
•  cell differentiation (Trapnell 2010)
•  organism development (Graveley 2010)
•  cancer (Govindan 2012)

[Figure: tracks labeled 332 to 338 plotted against genomic position, 24376000 to 24384000.]
but read-counting is challenging

Why annotation-agnostic? (2) allows for new discoveries

One solution: assembly
[Figure: genome, with some possible assemblies.]

Approach 1: avoid assembly altogether

DER Finder (PMID 24398039)
idea: scan the genome base-by-base and highlight segments showing differential expression signal

DER Finder
[Figure: chr22: 17684448-17684670. Panels: read coverage as log2(count+1) for normal and tumor samples; DE signal (t statistic); states. x-axis: genomic position.]

find signal at each nucleotide
A linear model at each location relates expression to the covariate of interest and the confounders:
•  samples indexed by i
•  locations indexed by l_j
•  confounders indexed by k
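The per-nucleotide step can be sketched as follows. This is a minimal illustration in Python, not the DER Finder implementation: it assumes two groups and no confounders, in which case the linear-model t statistic for the group term reduces to an equal-variance two-sample t statistic. The function name `per_base_t_stats` and the input layout are hypothetical.

```python
import math
from statistics import mean, variance


def per_base_t_stats(coverage, groups):
    """Return one t statistic per genomic location.

    coverage[l][i] is the read count at location l for sample i
    (a nucleotide-by-sample matrix); groups[i] is 0 or 1.
    Assumes at least two samples per group and no confounders.
    """
    t_stats = []
    for counts in coverage:
        # Same transform as the coverage plots: log2(count + 1)
        y = [math.log2(c + 1) for c in counts]
        g0 = [v for v, g in zip(y, groups) if g == 0]
        g1 = [v for v, g in zip(y, groups) if g == 1]
        n0, n1 = len(g0), len(g1)
        pooled_var = ((n0 - 1) * variance(g0) + (n1 - 1) * variance(g1)) / (n0 + n1 - 2)
        se = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
        # A location with no variation at all carries no signal
        t_stats.append((mean(g1) - mean(g0)) / se if se > 0 else 0.0)
    return t_stats


if __name__ == "__main__":
    groups = [0, 0, 0, 1, 1, 1]  # e.g. 0 = normal, 1 = tumor
    coverage = [[3, 7, 3, 3, 7, 3], [1, 3, 1, 31, 63, 31]]
    print(per_base_t_stats(coverage, groups))
```

Adding confounders would require a full least-squares fit per location, which is where an efficient fitting routine matters.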

segment genome into groups of nucleotides with similar signal, using a Hidden Markov Model
[Figure: t statistics t_1 to t_5, each nucleotide assigned a state, DE or not DE.]
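The segmentation idea can be sketched with a two-state HMM decoded by the Viterbi algorithm. Everything numeric here is an assumption for illustration: Gaussian emissions on |t| with unit variance, a DE mean of 4, and a 0.9 probability of staying in the same state. The actual DER Finder state set and parameter estimation are not reproduced.

```python
import math

STATES = ("not DE", "DE")


def log_normal_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def viterbi_segment(t_stats, de_mean=4.0, stay_prob=0.9):
    """Most likely state path for a two-state HMM over |t| (non-empty input)."""
    means = {"not DE": 0.0, "DE": de_mean}  # hypothetical emission means
    log_stay, log_switch = math.log(stay_prob), math.log(1 - stay_prob)

    def trans(prev, cur):
        return log_stay if prev == cur else log_switch

    obs = [abs(t) for t in t_stats]
    # score[s]: best log probability of any path ending in state s
    score = {s: math.log(0.5) + log_normal_pdf(obs[0], means[s], 1.0) for s in STATES}
    back = []
    for x in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + trans(p, s))
            new_score[s] = score[prev] + trans(prev, s) + log_normal_pdf(x, means[s], 1.0)
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def candidate_regions(path):
    """Collapse a state path into (start, end, state) runs, end exclusive."""
    regions, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            regions.append((start, i, path[start]))
            start = i
    return regions
```

Runs labeled "DE" in the output of `candidate_regions` play the role of candidate DERs; the transition penalty keeps isolated noisy nucleotides from fragmenting a region.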

[Figure: chr22: 17684448-17684670. Panels: read coverage as log2(count+1) for normal and tumor samples; t statistic; exons; states. x-axis: genomic position.]
linear models, then HMM (candidate DERs)

[Figure: t statistic, exons, and states against genomic position, 17684041 to 17684754.]
linear models, then HMM, then permutation tests for statistical significance
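A permutation test for one candidate region can be sketched as below. This is an assumption-laden simplification rather than the DER Finder procedure: the region statistic is taken to be the mean |t| across the region's nucleotides, and all relabelings of the samples are enumerated exactly instead of being sampled. Function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance


def t_stat(y, groups):
    """Equal-variance two-sample t statistic (needs >= 2 samples per group)."""
    g0 = [v for v, g in zip(y, groups) if g == 0]
    g1 = [v for v, g in zip(y, groups) if g == 1]
    n0, n1 = len(g0), len(g1)
    pooled = ((n0 - 1) * variance(g0) + (n1 - 1) * variance(g1)) / (n0 + n1 - 2)
    se = math.sqrt(pooled * (1 / n0 + 1 / n1))
    return (mean(g1) - mean(g0)) / se if se > 0 else 0.0


def region_stat(log_cov, groups):
    """Mean absolute t statistic over the nucleotides of one region."""
    return mean([abs(t_stat(y, groups)) for y in log_cov])


def permutation_p_value(log_cov, groups):
    """Fraction of label assignments whose statistic reaches the observed one.

    log_cov[l][i] is log-scale coverage at nucleotide l of the region for
    sample i. Enumerating every assignment is only feasible for small
    sample sizes; larger studies would sample permutations at random.
    """
    observed = region_stat(log_cov, groups)
    n, n1 = len(groups), sum(groups)
    hits = total = 0
    for ones in combinations(range(n), n1):
        relabeled = [1 if i in ones else 0 for i in range(n)]
        total += 1
        # Small tolerance so the observed labeling always counts as a hit
        if region_stat(log_cov, relabeled) >= observed - 1e-12:
            hits += 1
    return hits / total
```

With three samples per group there are only 20 label assignments, so the smallest attainable p-value is 2/20 (the observed labeling and its mirror image), which illustrates why efficient p-value calculation over many samples and regions is an engineering concern.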

[Figure: t statistic, exons, and states against genomic position, 17684041 to 17684754.]
match to annotation if desired: CECR1, "may play a role in regulating cell proliferation"

engineering challenges
•  creating and handling nucleotide-by-sample matrix
•  efficient linear model fitting (solution: lmFit)
•  efficient segmentation with HMM
•  efficient p-value calculations
Initial solution: https://github.com/alyssafrazee/derfinder

Approach 2: improve the current software infrastructure for analysis of
transcriptome assemblies

Ballgown
bioRxiv preprint: http://biorxiv.org/content/early/2014/03/30/003665, Bioconductor package "ballgown"
Pipeline, starting from paired-end RNA-seq reads:
•  Align Reads (e.g. TopHat)
•  Assemble Transcripts (e.g. Cufflinks)
•  Estimate Expression (Cufflinks via Tablemaker, RSEM)
•  Differential Expression Tests (default Ballgown models, limma, edgeR, DESeq, …)
Ballgown as connecting framework between transcriptome assembly pipelines and R/Bioconductor DE analysis

S4 class for transcript assemblies
ballgown object, three components:
•  data: expr matrices for exon, intron, transcript
•  structure: exon, intron, transcript
•  indexes: e2t, i2t, t2g, pData, bamfiles
Storage types: expr matrices, GRanges, data frames

easy exploration and DE analysis: plotting functions
[Figure: tracks labeled 332 to 338 plotted against genomic position, 24376000 to 24384000.]

easy exploration and DE analysis: statistical tests (drop-in replacement for Cuffdiff)

    stat_results = stattest(my_assembly, feature='transcript', meas='FPKM', covariate='group')
    head(stat_results)
    ##      feature id       pval        qval
    ##   transcript 10 0.01381576 0.105212332
    ##   transcript 25 0.26773622 0.791149753
    ##   transcript 35 0.01085070 0.089518254
    ##   transcript 41 0.47108019 0.902537475
    ##   transcript 45 0.08402948 0.489348136
    ##   transcript 67 0.27317385 0.79114975

easy exploration and DE analysis: statistical tests
easily connects to existing DE packages
[Figure: ballgown object diagram (data, structure, indexes), with the expr component highlighted.]

easy exploration and DE analysis: annotation functions
•  get corresponding gene names
•  match assembled and annotated transcripts
•  plot assembly alongside annotation
[Figure: "Assembled and Annotated Transcripts", annotated and assembled tracks against genomic position, 24615000 to 24640000.]

highly flexible

freely available!
Bioconductor (devel):

    source("http://bioconductor.org/biocLite.R")
    biocLite("ballgown")

GitHub: https://github.com/alyssafrazee/ballgown
Cufflinks users will also need Tablemaker: https://github.com/alyssafrazee/tablemaker

Thanks
Collaborators: Jeff Leek (advisor), Steven Salzberg, Ben Langmead, Andrew Jaffe, Rafa Irizarry, Sarven Sabunciyan, Kasper Hansen, Geo Pertea, Leonardo Collado Torres
Contact: alyssafrazee.com, [email protected], @acfrazee (Twitter)

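differential expression model
For each transcript, compare the fits of the following models using an F-test. The null hypothesis is that the fits of model (a) and model (b) are equally good; the alternative is that (a) fits better.

(a) $\text{expression}_i = \alpha + \beta_0\,\text{group}_i + \sum_{p=1}^{P} \beta_p\,\text{confounder}_{ip} + \text{noise}_{ip}$

(b) $\text{expression}_i = \alpha^* + \sum_{p=1}^{P} \beta_p^*\,\text{confounder}_{ip} + \text{noise}^*_{ip}$

A variant with spline terms in time in place of the group term:

$\text{expression}_i = \alpha + \sum_{k=1}^{K} \gamma_k\,\text{spline}_k(\text{time}_i) + \sum_{p=1}^{P} \beta_p\,\text{confounder}_{ip} + \text{noise}_{ip}$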
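The nested-model comparison can be sketched in a few lines. This is an illustrative Python version, not Ballgown's code: both models are fit by ordinary least squares through the normal equations, and the F statistic is formed from the two residual sums of squares. The names `ols_rss` and `nested_f_statistic` are hypothetical, and the p-value lookup against the F distribution is left out (something like `scipy.stats.f.sf` could supply it).

```python
def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit of y on the columns of X.

    Solves the normal equations by Gauss-Jordan elimination; assumes the
    columns of X are linearly independent.
    """
    p = len(X[0])
    # Augmented system [X'X | X'y]
    A = [
        [sum(row[a] * row[b] for row in X) for b in range(p)]
        + [sum(row[a] * yi for row, yi in zip(X, y))]
        for a in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2 for row, yi in zip(X, y))


def nested_f_statistic(expression, group, confounders):
    """F statistic for model (a) against model (b), for one transcript.

    confounders[i] is a tuple of confounder values for sample i (may be empty).
    Returns (F, numerator df, denominator df).
    """
    n = len(expression)
    full = [[1.0, float(g)] + list(c) for g, c in zip(group, confounders)]  # model (a)
    null = [[1.0] + list(c) for c in confounders]                           # model (b)
    rss_full, rss_null = ols_rss(full, expression), ols_rss(null, expression)
    df1 = len(full[0]) - len(null[0])
    df2 = n - len(full[0])
    return ((rss_null - rss_full) / df1) / (rss_full / df2), df1, df2
```

With no confounders and two groups, the F statistic equals the square of the two-sample t statistic, which is a convenient sanity check on the implementation.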
