$30 off During Our Annual Pro Sale. View Details »

jgm2013

 jgm2013

Leonardo Collado-Torres

October 07, 2013
Tweet

More Decks by Leonardo Collado-Torres

Other Decks in Science

Transcript

  1. Fast differential expression analysis
    annotation-agnostic across groups with
    biological replicates
    Leonardo Collado-Torres
    tweet: @fellgernon blog: tinyurl.com/FellBit

    View Slide

  2. @fellgernon
    #JGM2013
    Field overview
    Ultimate Goal
    What is the biological (genomic) cause, if any,
    of X disease?
    Currently
    What are the most likely genomic difference(s)
    between two+ groups?

    View Slide

  3. @fellgernon
    #JGM2013
    Tools
    • Molecular biology: reverse transcriptase
    • High-throughput sequencing
    • $$ and
    – > Large number of biological replicates
    • Computers
    • Biostatistics
    Image: http://bit.ly/15MVhSU

    View Slide

  4. @fellgernon
    #JGM2013
    Mapping reads from mRNAs
    Trapnell et al. Nat. Biotech 2009

    View Slide

  5. @fellgernon
    #JGM2013
    Mapping result: our initial data
    n samples à
    3 billion nt
    Adapted from @leekgroup

    View Slide

  6. @fellgernon
    #JGM2013
    Split by chromosome and filter
    n samples à
    ~760 million nt
    Rows with at least 1 sample with coverage > 5

    View Slide

  7. @fellgernon
    #JGM2013
    Test for base-level DE
    Adapted from @andrewejaffe

    View Slide

  8. @fellgernon
    #JGM2013
    F-statistic at each base-pair
    • Null model
    • Alternative Model

    View Slide

  9. @fellgernon
    #JGM2013
    Threshold on F-statistics
    F-statistic corresponding to
    p-value < 10-8 (F5,30
    )
    Adapted from @andrewejaffe

    View Slide

  10. @fellgernon
    #JGM2013
    Q-values: qvalue::qvalue
    Permute model matrices and find null
    regions for all chromosomes.

    View Slide

  11. @fellgernon
    #JGM2013
    How can we make it fast?
    • Avoid Input/Output as much as possible
    • Work by chromosome
    • Reduce memory
    – Run Length Encoding (IRanges::Rle)
    0000111111222 = (0, 1, 2)
    (4, 6, 3)
    • Use multiple cores (parallel::mclapply)
    – Split data to use cores efficiently
    • Calculate F-stats using Rcpp (Has + and -)

    View Slide

  12. @fellgernon
    #JGM2013
    Finding candidate DERs: example
    dataRegions
    450 500 550 600
    segs
    450 500 550 600
    pieces
    450 500 550 600
    ders
    450 500 550 600
    450 500 550 600
    0.0 1.0 2.0
    Index
    f

    View Slide

  13. @fellgernon
    #JGM2013
    Remember to map back to genomic coords
    dataRegions
    450 500 550 600

    View Slide

  14. @fellgernon
    #JGM2013
    Calculate F-stats and threshold
    450 500 550 600
    0.0 0.5 1.0 1.5 2.0 2.5
    Index
    f

    View Slide

  15. @fellgernon
    #JGM2013
    Thresholding identifies segments
    segs
    450 500 550 600

    View Slide

  16. @fellgernon
    #JGM2013
    Disjoin data regions & segments
    pieces
    450 500 550 600

    View Slide

  17. @fellgernon
    #JGM2013
    Identify pieces that overlap segments
    ders
    450 500 550 600

    View Slide

  18. @fellgernon
    #JGM2013
    Example: re-cap
    dataRegions
    450 500 550 600
    segs
    450 500 550 600
    pieces
    450 500 550 600
    ders
    450 500 550 600
    450 500 550 600
    0.0 1.0 2.0
    Index
    f

    View Slide

  19. @fellgernon
    #JGM2013
    Example: result
    Cluster for region with name COL6A1 and q-value 0.8256
    chr21
    chr21
    Coverage
    1
    2 group
    CEU
    YRI
    Mean coverage
    0.125
    0.500
    group
    CEU
    YRI
    Regions
    significantQval
    TRUE
    FALSE
    tx_name
    (gene_id)
    tx_name(gene_id)
    47411000 47411200 47411400 47411600

    View Slide

  20. @fellgernon
    #JGM2013
    Public datasets
    • derfinderExample:
    – Blood CEU vs YRI non-related individuals
    • derHippo:
    – Brain hippocampus from cocaine addicts, alcohol
    addicts, and controls
    • derSnyder:
    – Michael Snyder time course (~1 year):
    2 x diseases, recovery & healthy periods
    • derStem:
    – 5 stem cell types, 2 replicates per group

    View Slide

  21. @fellgernon
    #JGM2013
    Time and memory needed: derSnyder
    • Load & filter data: 10 cores with mclapply
    1hr 15min, 177 GB
    • Make models: 20 min, 52 GB
    • Analysis: 10 permutations, 4 cores each chr,
    total 59 mins
    – chr1 41 min, 46 GB
    • Merging: 30 min, 22 GB
    • Report: 27 min, 17 GB
    • Total wallclock time: 3 hr 46 min
    20 samples

    View Slide

  22. @fellgernon
    #JGM2013
    A richer data set: 69 samples
    • Load raw data: each chr, total 1hr 28 min
    – chr1 1hr 28 min, 18 GB
    – Merge 1hr 7 min, 67 GB
    • Filter data: each chr, total 12 min
    – chr1 12 min, 10 GB
    – Merge 1hr, 62 GB
    • Make models: 1 hr 49 min, 234 GB
    • Analysis: 0 permutations, 8 cores each chr, 52 min
    (1 hr 41 min)
    – chr1 49 min, 258 GB, had to run twice
    • Merging: 1 hr 6 min, 46 GB
    • Report: 1hr 29 min, 45 GB
    • Total wallclock time: 9 hr 3 min (9 hr 52 min)

    View Slide

  23. @fellgernon
    #JGM2013
    Counts: derSnyder
    • Load & filter data: 10 cores with
    mclapply 1hr 15min, 177 GB
    • Create count table: 26 min, 24 GB
    • Total wallclock time: 1 hr 41 min
    20 samples

    View Slide

  24. @fellgernon
    #JGM2013
    Counts with richer data set: 69 samples
    • Load raw data: each chr, total 1hr 28 min
    – chr1 1hr 28 min, 18 GB
    – Merge 1hr 7 min, 67 GB
    • Count tables: 53 min, 53 GB
    • Total wallclock time: 3 hr 3 min

    View Slide

  25. @fellgernon
    #JGM2013
    P-values: derfinderExample

    View Slide

  26. @fellgernon
    #JGM2013
    P-values: derStem

    View Slide

  27. @fellgernon
    #JGM2013
    derHippo
    Genome
    overview of
    regions with
    q-value < 0.1

    View Slide

  28. @fellgernon
    #JGM2013
    derHippo: cluster #2

    View Slide

  29. @fellgernon
    #JGM2013
    derfinderExample: cluster #6

    View Slide

  30. @fellgernon
    #JGM2013
    derSnyder: cluster #3

    View Slide

  31. @fellgernon
    #JGM2013
    Coverage adjustment?




















    6.0e+07 8.0e+07 1.0e+08 1.2e+08 1.4e+08 1.6e+08 1.8e+08
    6.0e+08 8.0e+08 1.0e+09 1.2e+09 1.4e+09 1.6e+09
    chr 1 total Cov vs Cov < quantile 0.9
    Coverage at bases < quantile 0.9
    Total coverage
    0to186 186to294 294to322 322to400
















    ● ●


    1.5e+07 2.0e+07 2.5e+07 3.0e+07 3.5e+07 4.0e+07
    6.0e+08 8.0e+08 1.0e+09 1.2e+09 1.4e+09 1.6e+09
    chr 1 total Cov vs # bases with data
    Number bases with Cov > 0
    Total coverage
    0to186 186to294 294to322 322to400
    derSnyder
    Similar to metagenomeSeq::cumNorm

    View Slide

  32. @fellgernon
    #JGM2013
    derfinder R package
    https://github.com/lcolladotor/derfinder
    https://github.com/lcolladotor/derfinderReport
    https://github.com/lcolladotor/derfinderExample

    View Slide

  33. @fellgernon
    #JGM2013
    Acknowledgements
    Leek Group
    Jeffrey Leek
    Alyssa Frazee
    Hopkins
    Sarven Sabunciyan
    Ben Langmead
    Lieber Institute (LIBD)
    Andrew Jaffe
    Harvard
    Rafael Irizarry
    Funding
    NIH (Aug 2012- July 2013)
    LIBD (Aug 2013 - now)
    CONACyT México

    View Slide