user2013

Differential expression analysis of RNA-seq data at base-pair resolution in
multiple biological replicates Leonardo Collado-Torres tweet: @fellgernon blog: tinyurl.com/FellBit

@fellgernon #useR2013 Field overview Ultimate Goal: What is the biological
(genomic) cause, if any, of X disease? Currently: What is the difference between two groups? Tools –  Large number of biological samples –  High-throughput sequencing –  Reverse transcriptase –  Biostatistics –  Computers

@fellgernon #useR2013 High-throughput sequencing h"p://res.illumina.com/documents/products/ datasheets/datasheet_hiseq2000.pdf

@fellgernon #useR2013 RNA-seq puriﬁed RNA shear cDNA fragments
sequence fragment end(s) or paired-‐end: single-‐end: map reads to reference genome summarize mapped reads into feature-‐ level abundances (gene, exon, transcript…) normalize abundances test each feature for diﬀerenJal expression Oshlack et al. Genome Biology 2010 Adapted from @acfrazee

@fellgernon #useR2013 Mapping reads Trapnell et al. Nat. Biotech 2009

@fellgernon #useR2013 Mapping result: our initial data n samples à
3 billion nt Frazee et al. Biostatistics in review Adapted from @leekgroup

@fellgernon #useR2013 Approach 1: Annotate-then-Identify Genome Union model: Count by
taking union of all exons Bullard et al. BMC Bioinformatics 2010 Adapted from @leekgroup

@fellgernon #useR2013 Approach 2: assemble-then-identify Genome Fragments Transcripts Trapnell et
al. Nat. Biotech 2010 Adapted from @leekgroup

@fellgernon #useR2013 Approach 2: inherent ambiguity (assembly) Genome Alternative Assemblies
Adapted from @leekgroup

@fellgernon #useR2013 Methods overview annotate-identify 1.  align 2.  gene-model 3. 
abundances 4.  analyze Pros: •  analogous to microarray, •  processed data easy to handle Cons: •  incorrect/variable annotation •  gene model choices have a big impact assemble-identify 1.  align 2.  assemble 3.  abundances 4.  analyze Pros: •  alternative transcription •  (potentially) less annotation dependent Cons: •  ambiguity/variation in assembly Adapted from @leekgroup

@fellgernon #useR2013 derFinder: differentially expressed region finder g() = Transform
(Box-Cox, log(+32) etc.) Yi,j = coverage on sample i at base j lj = genomic location j α() = baseline coverage β() = change in coverage between groups Xi = covariate of interest for sample i γk ()= adjustment’s for confounders Wik = value of kth confounder on ith sample Frazee et al. Biostatistics in review Adapted from @leekgroup base-‐pair model (case/control)

@fellgernon #useR2013 Determining differential expressed regions •  Test that β(l)
= 0 s(l) = statistic for location l •  HMM with three states D(l) = hidden state at location l D(l) = 0 if α(l) = β(l) = 0 (no expression) D(l) = 1 if α(l) ≠ 0 and β(l) = 0 (expressed, but no group difference) D(l) = 2 if β(l) ≠ 0 (group difference!) •  Assume statistics conditional on state is Normal s(l)|D(l) = d ~ (µd ,σd 2) f0 : µ0 =0, σ0 2=ε f1 : µ1 =0, σ1 2 f2 : µ2 , σ2 2 •  Estimate parameters using a mixture model t ~ π0 f0 + π1 f1 + π2 f2 Frazee et al. Biostatistics in review Adapted from @leekgroup

@fellgernon #useR2013 Monte-Carlo p-value Observed Null Jaffe
et al. Biostatistics 2011 Frazee et al. Biostatistics in review Langmead et al. in prep Adapted from @leekgroup

@fellgernon #useR2013 Annotation incorrect? 4.5 5.0 5.5 6.0 6.5 7.0
log2(count+32) chrY: 15016699 - 15017219 female male 1 2 3 4 5 6 7 t statistic xaxinds exons states 15016742 15016842 15016942 15017119 15017219 genomic position Frazee et al. Biostatistics in review Adapted from @leekgroup

@fellgernon #useR2013 Annotation missing? 4.5 5.0 5.5 6.0 6.5 7.0
log2(count+32) chrY: 2715932-2716691 female male 2.0 2.5 3.0 3.5 4.0 4.5 t statistic xaxinds exons states 2715882 2716082 2716282 2716482 2716682 genomic position Frazee et al. Biostatistics in review Adapted from @leekgroup

@fellgernon #useR2013 derfinder R package •  Input: alignment files (TopHat,
…) •  Processing coverage by chr into R via Rsamtools •  Memory reduction via IRanges – 3 to 43 GB RAM per chr for 30 samples – Alternative: sql database •  Linear model fitting via limma •  HMM via HiddenMarkov •  Custom plotting functions •  To-do: reduce wallclock time

@fellgernon #useR2013 derfinder R package https://github.com/alyssafrazee/derfinder

@fellgernon #useR2013 Acknowledgements Leek Group Jeffrey Leek Alyssa Frazee Hopkins
Sarven Sabunciyan Ben Langmead Leiber Institute Andrew Jaffe Harvard Rafa Irizarry Funding NIH CONACyT México

user2013

user2013

Leonardo Collado-Torres

More Decks by Leonardo Collado-Torres

Other Decks in Science

Featured

Transcript

Differential expression analysis of RNA-seq data at base-pair resolution in

@fellgernon #useR2013 Field overview Ultimate Goal: What is the biological

@fellgernon #useR2013 High-throughput sequencing h"p://res.illumina.com/documents/products/ datasheets/datasheet_hiseq2000.pdf

@fellgernon #useR2013 RNA-seq puriﬁed RNA shear cDNA fragments

@fellgernon #useR2013 Mapping reads Trapnell et al. Nat. Biotech 2009

@fellgernon #useR2013 Mapping result: our initial data n samples à

@fellgernon #useR2013 Approach 1: Annotate-then-Identify Genome Union model: Count by

@fellgernon #useR2013 Approach 2: assemble-then-identify Genome Fragments Transcripts Trapnell et

@fellgernon #useR2013 Approach 2: inherent ambiguity (assembly) Genome Alternative Assemblies

@fellgernon #useR2013 Methods overview annotate-identify 1.  align 2.  gene-model 3.

@fellgernon #useR2013 derFinder: differentially expressed region finder g() = Transform

@fellgernon #useR2013 Determining differential expressed regions •  Test that β(l)

@fellgernon #useR2013 Monte-Carlo p-value Observed Null Jaffe

@fellgernon #useR2013 Annotation incorrect? 4.5 5.0 5.5 6.0 6.5 7.0

@fellgernon #useR2013 Annotation missing? 4.5 5.0 5.5 6.0 6.5 7.0

@fellgernon #useR2013 derfinder R package •  Input: alignment files (TopHat,

@fellgernon #useR2013 derfinder R package https://github.com/alyssafrazee/derfinder

@fellgernon #useR2013 Acknowledgements Leek Group Jeffrey Leek Alyssa Frazee Hopkins