jgm2013

Fast differential expression analysis annotation-agnostic across groups with biological replicates
Leonardo Collado-Torres tweet: @fellgernon blog: tinyurl.com/FellBit

@fellgernon #JGM2013 Field overview Ultimate Goal What is the biological
(genomic) cause, if any, of X disease? Currently What are the most likely genomic difference(s) between two+ groups?

@fellgernon #JGM2013 Tools • Molecular biology: reverse transcriptase • High-throughput
sequencing • $$ and – > Large number of biological replicates • Computers • Biostatistics Image: http://bit.ly/15MVhSU

@fellgernon #JGM2013 Mapping reads from mRNAs Trapnell et al. Nat.
Biotech 2009

@fellgernon #JGM2013 Mapping result: our initial data n samples à
3 billion nt Adapted from @leekgroup

@fellgernon #JGM2013 Split by chromosome and filter n samples à
~760 million nt Rows with at least 1 sample with coverage > 5

@fellgernon #JGM2013 Test for base-level DE Adapted from @andrewejaffe

@fellgernon #JGM2013 F-statistic at each base-pair • Null model •
Alternative Model

@fellgernon #JGM2013 Threshold on F-statistics F-statistic corresponding to p-value <
10-8 (F5,30 ) Adapted from @andrewejaffe

@fellgernon #JGM2013 Q-values: qvalue::qvalue Permute model matrices and find null
regions for all chromosomes.

@fellgernon #JGM2013 How can we make it fast? • Avoid
Input/Output as much as possible • Work by chromosome • Reduce memory – Run Length Encoding (IRanges::Rle) 0000111111222 = (0, 1, 2) (4, 6, 3) • Use multiple cores (parallel::mclapply) – Split data to use cores efficiently • Calculate F-stats using Rcpp (Has + and -)

@fellgernon #JGM2013 Finding candidate DERs: example dataRegions 450 500 550
600 segs 450 500 550 600 pieces 450 500 550 600 ders 450 500 550 600 450 500 550 600 0.0 1.0 2.0 Index f

@fellgernon #JGM2013 Remember to map back to genomic coords dataRegions
450 500 550 600

@fellgernon #JGM2013 Calculate F-stats and threshold 450 500 550 600
0.0 0.5 1.0 1.5 2.0 2.5 Index f

@fellgernon #JGM2013 Thresholding identifies segments segs 450 500 550 600

@fellgernon #JGM2013 Disjoin data regions & segments pieces 450 500
550 600

@fellgernon #JGM2013 Identify pieces that overlap segments ders 450 500
550 600

@fellgernon #JGM2013 Example: re-cap dataRegions 450 500 550 600 segs
450 500 550 600 pieces 450 500 550 600 ders 450 500 550 600 450 500 550 600 0.0 1.0 2.0 Index f

@fellgernon #JGM2013 Example: result Cluster for region with name COL6A1
and q-value 0.8256 chr21 chr21 Coverage 1 2 group CEU YRI Mean coverage 0.125 0.500 group CEU YRI Regions significantQval TRUE FALSE tx_name (gene_id) tx_name(gene_id) 47411000 47411200 47411400 47411600

@fellgernon #JGM2013 Public datasets • derfinderExample: – Blood CEU vs
YRI non-related individuals • derHippo: – Brain hippocampus from cocaine addicts, alcohol addicts, and controls • derSnyder: – Michael Snyder time course (~1 year): 2 x diseases, recovery & healthy periods • derStem: – 5 stem cell types, 2 replicates per group

@fellgernon #JGM2013 Time and memory needed: derSnyder • Load &
filter data: 10 cores with mclapply 1hr 15min, 177 GB • Make models: 20 min, 52 GB • Analysis: 10 permutations, 4 cores each chr, total 59 mins – chr1 41 min, 46 GB • Merging: 30 min, 22 GB • Report: 27 min, 17 GB • Total wallclock time: 3 hr 46 min 20 samples

@fellgernon #JGM2013 A richer data set: 69 samples • Load
raw data: each chr, total 1hr 28 min – chr1 1hr 28 min, 18 GB – Merge 1hr 7 min, 67 GB • Filter data: each chr, total 12 min – chr1 12 min, 10 GB – Merge 1hr, 62 GB • Make models: 1 hr 49 min, 234 GB • Analysis: 0 permutations, 8 cores each chr, 52 min (1 hr 41 min) – chr1 49 min, 258 GB, had to run twice • Merging: 1 hr 6 min, 46 GB • Report: 1hr 29 min, 45 GB • Total wallclock time: 9 hr 3 min (9 hr 52 min)

@fellgernon #JGM2013 Counts: derSnyder • Load & filter data: 10
cores with mclapply 1hr 15min, 177 GB • Create count table: 26 min, 24 GB • Total wallclock time: 1 hr 41 min 20 samples

@fellgernon #JGM2013 Counts with richer data set: 69 samples •
Load raw data: each chr, total 1hr 28 min – chr1 1hr 28 min, 18 GB – Merge 1hr 7 min, 67 GB • Count tables: 53 min, 53 GB • Total wallclock time: 3 hr 3 min

@fellgernon #JGM2013 P-values: derfinderExample

@fellgernon #JGM2013 P-values: derStem

@fellgernon #JGM2013 derHippo Genome overview of regions with q-value <
0.1

@fellgernon #JGM2013 derHippo: cluster #2

@fellgernon #JGM2013 derfinderExample: cluster #6

@fellgernon #JGM2013 derSnyder: cluster #3

@fellgernon #JGM2013 Coverage adjustment? • • • • • •
• • • • • • • • • • • • • • 6.0e+07 8.0e+07 1.0e+08 1.2e+08 1.4e+08 1.6e+08 1.8e+08 6.0e+08 8.0e+08 1.0e+09 1.2e+09 1.4e+09 1.6e+09 chr 1 total Cov vs Cov < quantile 0.9 Coverage at bases < quantile 0.9 Total coverage 0to186 186to294 294to322 322to400 • • • • • • • • • • • • • • • • • • • • 1.5e+07 2.0e+07 2.5e+07 3.0e+07 3.5e+07 4.0e+07 6.0e+08 8.0e+08 1.0e+09 1.2e+09 1.4e+09 1.6e+09 chr 1 total Cov vs # bases with data Number bases with Cov > 0 Total coverage 0to186 186to294 294to322 322to400 derSnyder Similar to metagenomeSeq::cumNorm

@fellgernon #JGM2013 derfinder R package https://github.com/lcolladotor/derfinder https://github.com/lcolladotor/derfinderReport https://github.com/lcolladotor/derfinderExample

@fellgernon #JGM2013 Acknowledgements Leek Group Jeffrey Leek Alyssa Frazee Hopkins
Sarven Sabunciyan Ben Langmead Lieber Institute (LIBD) Andrew Jaffe Harvard Rafael Irizarry Funding NIH (Aug 2012- July 2013) LIBD (Aug 2013 - now) CONACyT México

jgm2013

jgm2013

Leonardo Collado-Torres

More Decks by Leonardo Collado-Torres

Other Decks in Science

Featured

Transcript

Fast differential expression analysis annotation-agnostic across groups with biological replicates

@fellgernon #JGM2013 Field overview Ultimate Goal What is the biological

@fellgernon #JGM2013 Tools • Molecular biology: reverse transcriptase • High-throughput

@fellgernon #JGM2013 Mapping reads from mRNAs Trapnell et al. Nat.

@fellgernon #JGM2013 Mapping result: our initial data n samples à

@fellgernon #JGM2013 Split by chromosome and filter n samples à

@fellgernon #JGM2013 Test for base-level DE Adapted from @andrewejaffe

@fellgernon #JGM2013 F-statistic at each base-pair • Null model •

@fellgernon #JGM2013 Threshold on F-statistics F-statistic corresponding to p-value <

@fellgernon #JGM2013 Q-values: qvalue::qvalue Permute model matrices and find null

@fellgernon #JGM2013 How can we make it fast? • Avoid

@fellgernon #JGM2013 Finding candidate DERs: example dataRegions 450 500 550

@fellgernon #JGM2013 Remember to map back to genomic coords dataRegions

@fellgernon #JGM2013 Calculate F-stats and threshold 450 500 550 600

@fellgernon #JGM2013 Thresholding identifies segments segs 450 500 550 600

@fellgernon #JGM2013 Disjoin data regions & segments pieces 450 500

@fellgernon #JGM2013 Identify pieces that overlap segments ders 450 500

@fellgernon #JGM2013 Example: re-cap dataRegions 450 500 550 600 segs

@fellgernon #JGM2013 Example: result Cluster for region with name COL6A1

@fellgernon #JGM2013 Public datasets • derfinderExample: – Blood CEU vs

@fellgernon #JGM2013 Time and memory needed: derSnyder • Load &

@fellgernon #JGM2013 A richer data set: 69 samples • Load

@fellgernon #JGM2013 Counts: derSnyder • Load & filter data: 10

@fellgernon #JGM2013 Counts with richer data set: 69 samples •

@fellgernon #JGM2013 P-values: derfinderExample

@fellgernon #JGM2013 P-values: derStem

@fellgernon #JGM2013 derHippo Genome overview of regions with q-value <

@fellgernon #JGM2013 derHippo: cluster #2

@fellgernon #JGM2013 derfinderExample: cluster #6

@fellgernon #JGM2013 derSnyder: cluster #3

@fellgernon #JGM2013 Coverage adjustment? • • • • • •

@fellgernon #JGM2013 derfinder R package https://github.com/lcolladotor/derfinder https://github.com/lcolladotor/derfinderReport https://github.com/lcolladotor/derfinderExample

@fellgernon #JGM2013 Acknowledgements Leek Group Jeffrey Leek Alyssa Frazee Hopkins