Slide 1

Slide 1 text

Engineering annotation-agnostic tools for differential expression analysis. Alyssa Frazee, Johns Hopkins University, @acfrazee

Slide 2

Slide 2 text

Why annotation-agnostic?

Slide 3

Slide 3 text

Why annotation-agnostic? (1) isoform-level analysis

Slide 4

Slide 4 text

scientifically important: cell differentiation (Trapnell 2010), organism development (Graveley 2010), cancer (Govindan 2012)

Slide 5

Slide 5 text

but read-counting is challenging [figure: rows labeled 332 to 338; x-axis: genomic position, 24376000 to 24384000]

Slide 6

Slide 6 text

Why annotation-agnostic? (2) allows for new discoveries

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

One solution: assembly [figure labels: genome; some possible assemblies]

Slide 9

Slide 9 text

Approach 1: avoid assembly altogether

Slide 10

Slide 10 text

DER Finder (PMID 24398039). Idea: scan the genome base-by-base and highlight segments showing differential expression signal

Slide 11

Slide 11 text

DER Finder [figure, chr22: 17684448-17684670; panels: read coverage (log2(count+1), normal vs. tumor), DE signal (t statistic), states; x-axis: genomic position]

Slide 12

Slide 12 text

DER Finder [figure, chr22: 17684448-17684670; panels: read coverage (log2(count+1), normal vs. tumor), DE signal (t statistic), states; x-axis: genomic position]

Slide 13

Slide 13 text

find signal at each nucleotide [equation labels: expression; covariate of interest; confounders; samples indexed by i; locations indexed by l_j; confounders indexed by k]

Slide 14

Slide 14 text

find signal at each nucleotide [equation labels: expression; covariate of interest; confounders; samples indexed by i; locations indexed by l_j; confounders indexed by k]
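The per-nucleotide step can be sketched in base R as follows. This is a minimal illustration on simulated data, not the DER Finder implementation: the matrix `coverage`, the variables `group` and `confounder`, and all sizes and effect values are assumptions made up for the example.

```r
# Simulated data; every name and value here is an illustrative assumption.
set.seed(1)
n_samples <- 10
n_bases <- 50
group <- rep(c(0, 1), each = n_samples / 2)   # covariate of interest
confounder <- rnorm(n_samples)                # one confounder per sample

# nucleotide-by-sample matrix of log2(count + 1) coverage
coverage <- matrix(rnorm(n_bases * n_samples, mean = 4),
                   nrow = n_bases, ncol = n_samples)
# make bases 21-30 differentially expressed between the two groups
coverage[21:30, group == 1] <- coverage[21:30, group == 1] + 3

# fit one linear model per base; keep the t statistic for the group term
t_stats <- apply(coverage, 1, function(y) {
  fit <- lm(y ~ group + confounder)
  summary(fit)$coefficients["group", "t value"]
})
```

Each entry of `t_stats` is the DE signal at one base, adjusted for the confounder. Looping over `lm()` like this is easy to read but slow at genome scale.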

Slide 15

Slide 15 text

segment genome into groups of nucleotides with similar signal [figure: t statistics t_1, t_2, t_3, t_4, t_5, each with a hidden state, DE or not DE]

Slide 16

Slide 16 text

segment genome into groups of nucleotides with similar signal: Hidden Markov Model [figure: t statistics t_1, t_2, t_3, t_4, t_5, each with a hidden state, DE or not DE]
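One way to picture the segmentation step is Viterbi decoding of a small HMM over the t statistics. The sketch below is a simplification under stated assumptions: two states only ("not DE", "DE"), Gaussian emissions with fixed means and standard deviations, and a fixed probability of staying in the same state. The function name `viterbi_two_state` and all parameter values are hypothetical; the published method may use more states and estimate its parameters from the data.

```r
# Viterbi decoding for a two-state HMM with Gaussian emissions (illustrative).
viterbi_two_state <- function(t_stats, means = c(0, 4), sds = c(1, 1),
                              stay_prob = 0.95, init = c(0.5, 0.5)) {
  n <- length(t_stats)
  log_trans <- log(matrix(c(stay_prob, 1 - stay_prob,
                            1 - stay_prob, stay_prob), nrow = 2, byrow = TRUE))
  # log emission density of every t statistic under each state (n x 2)
  log_emit <- cbind(dnorm(t_stats, means[1], sds[1], log = TRUE),
                    dnorm(t_stats, means[2], sds[2], log = TRUE))
  score <- matrix(-Inf, n, 2)   # best log probability ending in each state
  back <- matrix(0L, n, 2)      # best previous state, for backtracking
  score[1, ] <- log(init) + log_emit[1, ]
  for (pos in seq_len(n)[-1]) {
    for (s in 1:2) {
      cand <- score[pos - 1, ] + log_trans[, s]
      back[pos, s] <- which.max(cand)
      score[pos, s] <- max(cand) + log_emit[pos, s]
    }
  }
  # trace the most likely state path from the last base to the first
  path <- integer(n)
  path[n] <- which.max(score[n, ])
  for (pos in rev(seq_len(n - 1))) path[pos] <- back[pos + 1, path[pos + 1]]
  c("not DE", "DE")[path]
}

states <- viterbi_two_state(c(0, 0, 0, 4, 4, 4, 0, 0, 0))
segments <- rle(states)   # consecutive bases with the same state form a segment
```

Here `rle()` collapses the per-base states into contiguous segments, and the runs labeled "DE" play the role of candidate DERs.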

Slide 17

Slide 17 text

linear models, then HMM (candidate DERs) [figure, chr22: 17684448-17684670; panels: log2(count+1) coverage for normal vs. tumor, t statistic, exons, states; x-axis: genomic position, 17684041 to 17684754]

Slide 18

Slide 18 text

linear models, then HMM, then permutation tests for statistical significance [figure, chr22: 17684448-17684670; panels: log2(count+1) coverage for normal vs. tumor, t statistic, exons, states; x-axis: genomic position, 17684041 to 17684754]
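The permutation idea can be illustrated for a single candidate region. The sketch below is a simplified stand-in, not the DER Finder procedure: it assumes the region statistic is the mean per-base t statistic, shuffles the sample labels, and compares the observed statistic to the shuffled ones. The helper `region_stat`, the simulated `region` matrix and the number of permutations are all assumptions.

```r
set.seed(4)
n_samples <- 12
group <- rep(c("normal", "tumor"), each = 6)
# simulated coverage for one candidate region: bases x samples
region <- matrix(rnorm(30 * n_samples, mean = 4), nrow = 30)
region[, group == "tumor"] <- region[, group == "tumor"] + 1.5

# region-level statistic: mean per-base t statistic (an assumed summary)
region_stat <- function(cov_mat, labels) {
  is_tumor <- labels == "tumor"
  mean(apply(cov_mat, 1, function(y) {
    t.test(y[is_tumor], y[!is_tumor])$statistic
  }))
}

observed <- region_stat(region, group)
n_perm <- 200
# null distribution: recompute the statistic with shuffled sample labels
null_stats <- replicate(n_perm, region_stat(region, sample(group)))
# add one to numerator and denominator so the p-value is never exactly zero
p_value <- (1 + sum(abs(null_stats) >= abs(observed))) / (n_perm + 1)
```

A full analysis would likely rerun the whole pipeline on each permuted data set so that the null regions are found the same way as the observed ones; this sketch keeps the region fixed for brevity.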

Slide 19

Slide 19 text

match to annotation if desired: CECR1, “may play a role in regulating cell proliferation” [figure, chr22: 17684448-17684670; panels: log2(count+1) coverage for normal vs. tumor, t statistic, exons, states; x-axis: genomic position, 17684041 to 17684754]

Slide 20

Slide 20 text

engineering challenges
•  creating and handling nucleotide-by-sample matrix
•  efficient linear model fitting (solution: lmFit)
•  efficient segmentation with HMM
•  efficient p-value calculations
Initial solution: https://github.com/alyssafrazee/derfinder
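To see why fitting all the linear models at once helps, here is a base R sketch that solves every per-base regression in one call instead of looping over `lm()`. It is meant only to illustrate the idea behind a matrix-oriented fitter such as lmFit, not to reproduce limma's code; the simulated `coverage` matrix and the two-column `design` are assumptions.

```r
set.seed(2)
n_samples <- 8
n_bases <- 1000
group <- rep(c(0, 1), each = 4)
design <- cbind(intercept = 1, group = group)
coverage <- matrix(rnorm(n_bases * n_samples), nrow = n_bases)

# lm.fit accepts a matrix response: one column per base, all fitted together
fit <- lm.fit(x = design, y = t(coverage))
coef_all <- t(fit$coefficients)                 # n_bases x 2

# residual variance per base, then the t statistic for the group coefficient
df_resid <- n_samples - ncol(design)
sigma2 <- colSums(fit$residuals^2) / df_resid
xtx_inv <- solve(crossprod(design))
t_group <- coef_all[, 2] / sqrt(sigma2 * xtx_inv[2, 2])

# sanity check against an ordinary per-base lm() fit for the first base
check <- summary(lm(coverage[1, ] ~ group))$coefficients["group", "t value"]
```

The design matrix is decomposed once and reused for every base, which is where the savings come from when the matrix has millions of rows.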

Slide 21

Slide 21 text

Approach 2: improve the current software infrastructure for analysis of transcriptome assemblies

Slide 22

Slide 22 text

Ballgown
bioRxiv preprint: http://biorxiv.org/content/early/2014/03/30/003665, Bioconductor package “ballgown”
Pipeline: paired-end RNA-seq reads; Align Reads (e.g. TopHat); Assemble Transcripts (e.g. Cufflinks); Estimate Expression (Cufflinks via Tablemaker, RSEM); Differential Expression Tests (default Ballgown models, limma, edgeR, DESeq, …)
Ballgown as connecting framework between transcriptome assembly pipelines and R/Bioconductor DE analysis

Slide 23

Slide 23 text

S4 class for transcript assemblies [diagram of the ballgown object: data and structure components, each with exon, intron and transcript entries; indexes component with e2t, i2t, t2g, pData, bamfiles; expr; storage types: matrices, GRanges, data frames]

Slide 24

Slide 24 text

easy exploration and DE analysis: plotting functions [figure: rows labeled 332 to 338; x-axis: genomic position, 24376000 to 24384000]

Slide 25

Slide 25 text

easy exploration and DE analysis
statistical tests (drop-in replacement for Cuffdiff)
stat_results = stattest(my_assembly, feature='transcript', meas='FPKM', covariate='group')
head(stat_results)
##    feature id       pval        qval
## transcript 10 0.01381576 0.105212332
## transcript 25 0.26773622 0.791149753
## transcript 35 0.01085070 0.089518254
## transcript 41 0.47108019 0.902537475
## transcript 45 0.08402948 0.489348136
## transcript 67 0.27317385 0.79114975

Slide 26

Slide 26 text

easy exploration and DE analysis: statistical tests; easily connects to existing DE packages [diagram of the ballgown object: data and structure components, each with exon, intron and transcript entries; indexes component with e2t, i2t, t2g, pData, bamfiles; expr]

Slide 27

Slide 27 text

easy exploration and DE analysis: annotation functions. Get corresponding gene names, match assembled and annotated transcripts, plot assembly alongside annotation [figure: “Assembled and Annotated Transcripts”; tracks: annotated, assembled; x-axis: genomic position, 24615000 to 24640000]

Slide 28

Slide 28 text

highly flexible

Slide 29

Slide 29 text

freely available!
Bioconductor (devel):
source("http://bioconductor.org/biocLite.R")
biocLite("ballgown")
GitHub: https://github.com/alyssafrazee/ballgown
Cufflinks users will also need Tablemaker: https://github.com/alyssafrazee/tablemaker

Slide 30

Slide 30 text

Thanks
Collaborators: Jeff Leek (advisor), Steven Salzberg, Ben Langmead, Andrew Jaffe, Rafa Irizarry, Sarven Sabunciyan, Kasper Hansen, Geo Pertea, Leonardo Collado Torres
Contact: alyssafrazee.com, @acfrazee (Twitter)

Slide 31

Slide 31 text

differential expression model: for each transcript, compare the fits of the following models using an F-test. Null hypothesis is that the fits of model (a) and model (b) are equally good; alternative is that (a) fits better.
(a) $\text{expression}_i = \alpha + \beta_0\,\text{group}_i + \sum_{p=1}^{P} \beta_p\,\text{confounder}_{ip} + \text{noise}_{ip}$
(b) $\text{expression}_i = \alpha^* + \sum_{p=1}^{P} \beta^*_p\,\text{confounder}_{ip} + \text{noise}^*_{ip}$
$\text{expression}_i = \alpha + \sum_{k=1}^{K} \gamma_k\,\text{spline}_k(\text{time}_i) + \sum_{p=1}^{P} \beta_p\,\text{confounder}_{ip} + \text{noise}_{ip}$
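The comparison of models (a) and (b) can be sketched for a single transcript with base R's `lm()` and `anova()`. The data are simulated and the variable names (`expression`, `group`, `confounder`) simply mirror the terms in the models; this is an illustration of a nested-model F-test, not Ballgown's internal code.

```r
set.seed(5)
n <- 20
group <- factor(rep(c("control", "case"), each = n / 2))
confounder <- rnorm(n)
# simulated expression for one transcript, with a true group effect of 2
expression <- 5 + 2 * (group == "case") + 0.5 * confounder + rnorm(n)

fit_a <- lm(expression ~ group + confounder)   # model (a): includes group
fit_b <- lm(expression ~ confounder)           # model (b): confounders only

# F-test for nested models: does adding group improve the fit?
f_test <- anova(fit_b, fit_a)
f_stat <- f_test[["F"]][2]
p_value <- f_test[["Pr(>F)"]][2]
```

A small `p_value` favors model (a), meaning the transcript's expression is associated with `group` after accounting for the confounder. For the spline model, the group term could be swapped for something like `splines::ns(time, df = K)`, with `time` and `K` supplied by the analyst.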