lcg2014

Leonardo Collado-Torres

January 07, 2014
Transcript

  1. Fast differential expression analysis
    annotation-agnostic across groups with
    biological replicates
    Leonardo Collado-Torres
    tweet: @fellgernon blog: bit.ly/FellBit

  2. @fellgernon
    #LCG2014
    Field overview
    Ultimate Goal
    What is the biological (genomic) cause, if any,
    of X disease?
    Currently
    What are the most likely genomic difference(s)
    between two+ groups?

  3. Tools
    •  Molecular biology: reverse transcriptase
    •  High-throughput sequencing
    •  $$ and
    → Large number of biological replicates
    •  Computers
    •  Biostatistics
    Image: http://bit.ly/15MVhSU

  4. Mapping reads from mRNAs
    Trapnell et al., Nat. Biotech. 2009

  5. Mapping result: our initial data
    n samples →
    3 billion nt
    Adapted from @leekgroup

  6. Split by chromosome and filter
    n samples →
    ~760 million nt
    Rows with at least 1 sample with coverage > 5
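
The filtering rule above can be sketched in a few lines. This is a minimal Python illustration, not the derfinder R code: the function name `filter_bases` and the toy bases-by-samples matrix are assumptions made up for the example.

```python
from typing import List, Sequence, Tuple


def filter_bases(
    coverage: Sequence[Sequence[float]], cutoff: float = 5
) -> Tuple[List[int], List[Sequence[float]]]:
    """Keep rows (bases) where at least one sample has coverage > cutoff.

    `coverage` is a hypothetical bases-by-samples matrix: one row per
    base, one column per sample.
    """
    kept_index = []
    kept_rows = []
    for i, row in enumerate(coverage):
        if any(c > cutoff for c in row):
            kept_index.append(i)
            kept_rows.append(row)
    return kept_index, kept_rows


coverage = [
    [0, 0, 0],   # no coverage in any sample: dropped
    [5, 5, 4],   # never strictly above 5: dropped
    [0, 6, 1],   # one sample above 5: kept
    [9, 12, 7],  # all samples above 5: kept
]
kept_index, kept_rows = filter_bases(coverage)
print(kept_index)  # [2, 3]
```

Keeping the original row indices alongside the filtered rows lets later steps map results back to genomic positions.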

  7. Test for base-level DE
    Adapted from @andrewejaffe

  8. F-statistic at each base-pair
    •  Null model
    •  Alternative Model
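
The models on this slide were shown as images and are not in the transcript. As a hedged illustration only, the sketch below assumes the simplest nested pair: a null model with a single overall mean and an alternative model with one mean per group. The F-statistic at one base is then ((RSS0 - RSS1) / (k - 1)) / (RSS1 / (n - k)), where RSS0 and RSS1 are the residual sums of squares under each model, k is the number of groups, and n the number of samples. The function name and toy data are hypothetical; the models used in the talk may include additional covariates.

```python
from typing import Dict, List, Tuple


def f_statistic(values_by_group: Dict[str, List[float]]) -> Tuple[float, int, int]:
    """F-statistic for one base: intercept-only null model versus an
    alternative model with one mean per group.

    Returns (F, numerator df, denominator df).
    """
    all_values = [v for vals in values_by_group.values() for v in vals]
    n = len(all_values)
    k = len(values_by_group)

    # Null model: every sample shares one overall mean.
    grand_mean = sum(all_values) / n
    rss_null = sum((v - grand_mean) ** 2 for v in all_values)

    # Alternative model: each group has its own mean.
    rss_alt = 0.0
    for vals in values_by_group.values():
        group_mean = sum(vals) / len(vals)
        rss_alt += sum((v - group_mean) ** 2 for v in vals)

    df_num = k - 1
    df_den = n - k
    if rss_alt == 0:
        # Degenerate case: no residual variation under the alternative.
        return float("inf"), df_num, df_den
    f = ((rss_null - rss_alt) / df_num) / (rss_alt / df_den)
    return f, df_num, df_den


# Toy coverage values at a single base for three hypothetical groups.
base_coverage = {"A": [1, 2, 3], "B": [2, 3, 4], "C": [5, 6, 7]}
f, df_num, df_den = f_statistic(base_coverage)
print(round(f, 6), df_num, df_den)  # 13.0 2 6
```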

  9. Threshold on F-statistics
    F-statistic corresponding to
    p-value < 10^-8 (F_{5,30})
    Adapted from @andrewejaffe
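
One way to read the thresholding step is that consecutive bases whose F-statistic exceeds the cutoff are grouped into candidate regions. The sketch below assumes that reading; `find_regions` is a hypothetical name, and the cutoff of 20 is a placeholder rather than the actual F_{5,30} quantile for p < 10^-8.

```python
from typing import List, Sequence, Tuple


def find_regions(f_stats: Sequence[float], cutoff: float) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, inclusive, for each run of
    consecutive positions whose F-statistic is strictly above cutoff."""
    regions = []
    start = None
    for pos, f in enumerate(f_stats):
        if f > cutoff:
            if start is None:
                start = pos  # a new run begins here
        elif start is not None:
            regions.append((start, pos - 1))  # the run ended at the previous base
            start = None
    if start is not None:
        # The last run extends to the end of the vector.
        regions.append((start, len(f_stats) - 1))
    return regions


f_stats = [1.0, 2.0, 30.0, 45.0, 3.0, 50.0, 60.0, 70.0]
regions = find_regions(f_stats, cutoff=20.0)
print(regions)  # [(2, 3), (5, 7)]
```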

  10. Q-values: qvalue::qvalue
    Permute model matrices and find null
    regions for all chromosomes.

  11. How can we make it fast?
    •  Avoid Input/Output as much as possible
    •  Work by chromosome
    •  Reduce memory
    –  Run Length Encoding (IRanges::Rle)
    0000111111222 = values (0, 1, 2),
    lengths (4, 6, 3)
    •  Use multiple cores (parallel::mclapply)
    –  Split data to use cores efficiently
    •  Calculate F-stats using Rcpp
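
The run-length encoding example on this slide can be reproduced with a short sketch. This is plain Python for illustration, not the IRanges::Rle implementation; the function names are made up for the example.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def rle_encode(seq: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Collapse consecutive repeats into parallel lists of values and run lengths."""
    values, lengths = [], []
    for value, run in groupby(seq):
        values.append(value)
        lengths.append(sum(1 for _ in run))
    return values, lengths


def rle_decode(values: Sequence[int], lengths: Sequence[int]) -> List[int]:
    """Expand values and run lengths back into the full vector."""
    out = []
    for value, length in zip(values, lengths):
        out.extend([value] * length)
    return out


coverage = [int(c) for c in "0000111111222"]
values, lengths = rle_encode(coverage)
print(values, lengths)  # [0, 1, 2] [4, 6, 3]
```

Coverage vectors tend to contain long runs of identical values (especially zeros), which is why storing 13 numbers as 3 values plus 3 lengths generalizes into a large memory saving.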

  12. Public datasets
    •  derfinderExample:
    –  Blood CEU vs YRI non-related individuals
    •  derHippo:
    –  Brain hippocampus from cocaine addicts, alcohol
    addicts, and controls
    •  derSnyder:
    –  Michael Snyder time course (~1 year):
    2 x diseases, recovery & healthy periods
    •  derStem:
    –  5 stem cell types, 2 replicates per group

  13. derSnyder: cluster #1

  14. derHippo: cluster #4

  15. derHippo: vs original paper

  16. Time and memory needed: derSnyder
    •  Load & filter data: 10 cores with mclapply
    1 hr 15 min, 177 GB
    •  Make models: 20 min, 52 GB
    •  Analysis: 10 permutations, 4 cores each chr,
    total 59 min
    –  chr1 41 min, 46 GB
    •  Merging: 30 min, 22 GB
    •  Report: 27 min, 17 GB
    •  Total wallclock time: 3 hr 46 min
    20 samples

  17. A richer data set: 69 samples
    •  Load raw data: each chr, total 1 hr 28 min
    –  chr1 1 hr 28 min, 18 GB
    –  Merge 1 hr 7 min, 67 GB
    •  Filter data: each chr, total 12 min
    –  chr1 12 min, 10 GB
    –  Merge 1 hr, 62 GB
    •  Make models: 1 hr 49 min, 234 GB
    •  Analysis: 0 permutations, 8 cores each chr, 52 min
    (1 hr 41 min)
    –  chr1 49 min, 258 GB, had to run twice
    •  Merging: 1 hr 6 min, 46 GB
    •  Report: 1 hr 29 min, 45 GB
    •  Total wallclock time: 9 hr 3 min (9 hr 52 min)

  18. derfinder R package
    https://github.com/lcolladotor/derfinder
    https://github.com/lcolladotor/derfinderReport
    https://github.com/lcolladotor/derfinderExample

  19. Acknowledgements
    Leek Group
    Jeffrey Leek
    Alyssa Frazee
    Hopkins
    Sarven Sabunciyan
    Ben Langmead
    Lieber Institute (LIBD)
    Andrew Jaffe
    Harvard
    Rafael Irizarry
    Funding
    NIH
    LIBD
    CONACyT México
