lcg2014

Fast differential expression analysis annotation-agnostic across groups with biological replicates
Leonardo Collado-Torres tweet: @fellgernon blog: bit.ly/FellBit

@fellgernon #LCG2014
Field overview
Ultimate goal: What is the biological (genomic) cause, if any, of X disease?
Currently: What are the most likely genomic difference(s) between two or more groups?

Tools
• Molecular biology: reverse transcriptase
• High-throughput sequencing
• $$ and → Large number of biological replicates
• Computers
• Biostatistics
Image: http://bit.ly/15MVhSU

Mapping reads from mRNAs
Trapnell et al., Nat. Biotech. 2009

Mapping result: our initial data
n samples → 3 billion nt
Adapted from @leekgroup

Split by chromosome and filter
n samples → ~760 million nt
Rows with at least 1 sample with coverage > 5
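
The filter rule above (keep a row if at least one sample has coverage > 5) can be sketched in a few lines. This is a toy Python illustration, not derfinder's R code: the function name `filter_rows`, the list-of-rows layout, and the example numbers are all assumptions made for the example.

```python
def filter_rows(coverage, cutoff=5):
    """Keep positions where at least one sample has coverage > cutoff.

    coverage: list of rows, one row per base position and one column
    per sample (a hypothetical in-memory layout for illustration).
    Returns (kept_positions, kept_rows).
    """
    kept_positions = []
    kept_rows = []
    for position, row in enumerate(coverage):
        if any(value > cutoff for value in row):
            kept_positions.append(position)
            kept_rows.append(row)
    return kept_positions, kept_rows


# Toy data: 5 base positions (rows) x 3 samples (columns).
toy_coverage = [
    [0, 0, 0],
    [2, 5, 1],  # highest value is exactly 5, so the row is dropped
    [0, 6, 3],
    [9, 8, 7],
    [5, 5, 5],
]
positions, rows = filter_rows(toy_coverage)
```

Keeping the original positions next to the surviving rows matters because later steps need to know which bases are adjacent on the chromosome.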

Test for base-level differential expression (DE)
Adapted from @andrewejaffe

F-statistic at each base-pair
• Null model
• Alternative model
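
The model formulas did not survive in this transcript, so the sketch below assumes the simplest nested pair: a null model with a single overall mean and an alternative model with one mean per group. Under that assumption the F-statistic at one base-pair compares the two residual sums of squares. The function name `f_statistic` and the toy values are hypothetical, and the talk computes this step with Rcpp rather than Python.

```python
def f_statistic(values, groups):
    """F-statistic for one base-pair: intercept-only null model versus
    an alternative with one mean per group (assumed models)."""
    n = len(values)
    grand_mean = sum(values) / n
    # Residual sum of squares under the null (one overall mean).
    rss_null = sum((v - grand_mean) ** 2 for v in values)

    # Residual sum of squares under the alternative (one mean per group).
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    rss_alt = 0.0
    for members in by_group.values():
        group_mean = sum(members) / len(members)
        rss_alt += sum((v - group_mean) ** 2 for v in members)

    df_num = len(by_group) - 1  # extra parameters in the alternative
    df_den = n - len(by_group)  # residual degrees of freedom
    # Note: a perfect fit (rss_alt == 0) would divide by zero here.
    return ((rss_null - rss_alt) / df_num) / (rss_alt / df_den)


# Toy coverage at one base for 6 samples in two groups.
f_value = f_statistic([1, 2, 3, 5, 6, 7], ["a", "a", "a", "b", "b", "b"])
```

In the toy data the null residual sum of squares is 28 and the alternative is 4, with 1 and 4 degrees of freedom.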

Threshold on F-statistics
F-statistic corresponding to p-value < 10^-8 (F_{5,30})
Adapted from @andrewejaffe
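
One way to picture the thresholding step: once a cutoff is fixed, candidate regions are the runs of consecutive bases whose F-statistic exceeds it. The sketch below assumes exactly that definition and takes the cutoff as a given number; it does not compute the F-distribution quantile for p < 10^-8. The name `candidate_regions` and the toy values are hypothetical.

```python
def candidate_regions(f_stats, cutoff):
    """Return (start, end) index pairs, end inclusive, for each run of
    consecutive positions whose F-statistic exceeds the cutoff."""
    regions = []
    start = None
    for i, f in enumerate(f_stats):
        if f > cutoff:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            regions.append((start, i - 1))  # the run ended at i - 1
            start = None
    if start is not None:
        # The last run reaches the end of the vector.
        regions.append((start, len(f_stats) - 1))
    return regions


toy_f = [1.0, 12.0, 15.0, 2.0, 0.5, 20.0, 30.0, 25.0]
regions = candidate_regions(toy_f, cutoff=10.0)
```

The sketch treats neighbouring indices as neighbouring bases. After filtering, a real implementation would also have to check that positions are genuinely adjacent on the chromosome.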

Q-values: qvalue::qvalue
Permute model matrices and find null regions for all chromosomes.
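
A rough sketch of the permutation idea, under two assumptions the slide does not spell out: permuting the model matrix is approximated here by shuffling group labels, and each observed region is scored against the pooled statistics of the null regions with a "+1" empirical p-value. The names `permute_labels` and `empirical_p_values` are hypothetical. The final conversion of p-values to q-values is left to qvalue::qvalue and is not reproduced.

```python
import bisect
import random


def permute_labels(groups, rng):
    """Return a shuffled copy of the group labels (input is untouched)."""
    shuffled = list(groups)
    rng.shuffle(shuffled)
    return shuffled


def empirical_p_values(observed, null):
    """For each observed region statistic, estimate a p-value as
    (number of null statistics >= it, plus 1) / (number of nulls + 1)."""
    sorted_null = sorted(null)
    n = len(sorted_null)
    p_values = []
    for stat in observed:
        # bisect_left counts null statistics strictly below stat.
        n_at_least = n - bisect.bisect_left(sorted_null, stat)
        p_values.append((n_at_least + 1) / (n + 1))
    return p_values


rng = random.Random(2014)  # fixed seed so the shuffle is repeatable
labels = ["case", "case", "case", "control", "control", "control"]
shuffled = permute_labels(labels, rng)

# Toy region statistics: three observed regions, four null regions.
p_values = empirical_p_values([5.0, 2.0, 9.0], [1.0, 2.0, 3.0, 4.0])
```

The "+1" correction keeps p-values away from exactly zero when few permutations are run, which matters with as few as 10 permutations.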

How can we make it fast?
• Avoid Input/Output as much as possible
• Work by chromosome
• Reduce memory
  – Run Length Encoding (IRanges::Rle): 0000111111222 = values (0, 1, 2), lengths (4, 6, 3)
• Use multiple cores (parallel::mclapply)
  – Split data to use cores efficiently
• Calculate F-stats using Rcpp
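
The run-length encoding example from the list above can be reproduced with a short sketch. This is plain Python for illustration only, not the IRanges::Rle implementation, and the helper names `rle_encode` and `rle_decode` are hypothetical.

```python
from itertools import groupby


def rle_encode(sequence):
    """Collapse consecutive repeats into parallel lists of run values
    and run lengths."""
    values = []
    lengths = []
    for value, run in groupby(sequence):
        values.append(value)
        lengths.append(sum(1 for _ in run))
    return values, lengths


def rle_decode(values, lengths):
    """Expand run values and run lengths back into the full sequence."""
    expanded = []
    for value, length in zip(values, lengths):
        expanded.extend([value] * length)
    return expanded


# The coverage vector from the slide: 0000111111222
coverage = [int(ch) for ch in "0000111111222"]
values, lengths = rle_encode(coverage)
```

Coverage vectors tend to contain long stretches of identical values, so storing 3 runs instead of 13 numbers is where the memory saving comes from.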

Public datasets
• derfinderExample:
  – Blood, CEU vs YRI unrelated individuals
• derHippo:
  – Brain hippocampus from cocaine addicts, alcohol addicts, and controls
• derSnyder:
  – Michael Snyder time course (~1 year): 2 x diseases, recovery & healthy periods
• derStem:
  – 5 stem cell types, 2 replicates per group

derSnyder: cluster #1

derHippo: cluster #4

derHippo: vs original paper

Time and memory needed: derSnyder (20 samples)
• Load & filter data: 10 cores with mclapply, 1 hr 15 min, 177 GB
• Make models: 20 min, 52 GB
• Analysis: 10 permutations, 4 cores each chr, total 59 min
  – chr1: 41 min, 46 GB
• Merging: 30 min, 22 GB
• Report: 27 min, 17 GB
• Total wallclock time: 3 hr 46 min

A richer data set: 69 samples
• Load raw data: each chr, total 1 hr 28 min
  – chr1: 1 hr 28 min, 18 GB
  – Merge: 1 hr 7 min, 67 GB
• Filter data: each chr, total 12 min
  – chr1: 12 min, 10 GB
  – Merge: 1 hr, 62 GB
• Make models: 1 hr 49 min, 234 GB
• Analysis: 0 permutations, 8 cores each chr, 52 min (1 hr 41 min)
  – chr1: 49 min, 258 GB, had to run twice
• Merging: 1 hr 6 min, 46 GB
• Report: 1 hr 29 min, 45 GB
• Total wallclock time: 9 hr 3 min (9 hr 52 min)

derfinder R package
https://github.com/lcolladotor/derfinder
https://github.com/lcolladotor/derfinderReport
https://github.com/lcolladotor/derfinderExample

Acknowledgements
Leek Group: Jeffrey Leek, Alyssa Frazee
Hopkins: Sarven Sabunciyan, Ben Langmead
Lieber Institute (LIBD): Andrew Jaffe
Harvard: Rafael Irizarry
Funding: NIH, LIBD, CONACyT México
