$30 off During Our Annual Pro Sale. View Details »

jc2013

 jc2013

Leonardo Collado-Torres

October 03, 2013
Tweet

More Decks by Leonardo Collado-Torres

Other Decks in Science

Transcript

  1. Fast differential expression analysis
    annotation-agnostic across groups with
    biological replicates
    Leonardo Collado-Torres
    tweet: @fellgernon blog: tinyurl.com/FellBit

    View Slide

  2. @fellgernon
    #biostatJC2013
    Field overview
    Ultimate Goal
    What is the biological (genomic) cause, if any,
    of X disease?
    Currently
    What are the most likely genomic difference(s)
    between two+ groups?

    View Slide

  3. @fellgernon
    #biostatJC2013
    Tools
    • Molecular biology: reverse transcriptase
    • High-throughput sequencing
    • $$ and
    – > Large number of biological replicates
    • Computers
    • Biostatistics
    Image: http://bit.ly/15MVhSU

    View Slide

  4. @fellgernon
    #biostatJC2013
    High-throughput sequencing: DNA
    Mardis. Ann Rev of Genomics and
    Human Genetics 2008

    View Slide

  5. @fellgernon
    #biostatJC2013
    High-throughput sequencing
    Mardis. Ann Rev of Genomics and
    Human Genetics 2008

    View Slide

  6. @fellgernon
    #biostatJC2013
    High-throughput sequencing
    Mardis. Ann Rev of Genomics and
    Human Genetics 2008

    View Slide

  7. @fellgernon
    #biostatJC2013
    Not as pretty in real life
    Whiteford et al. Bioinformatics
    2009

    View Slide

  8. @fellgernon
    #biostatJC2013
    Many things can go wrong, like phasing
    Whiteford et al. Bioinformatics
    2009

    View Slide

  9. @fellgernon
    #biostatJC2013
    So why do we use HTS?
    http://cdn.memegenerator.net/
    instances/400x/33773502.jpg

    View Slide

  10. @fellgernon
    #biostatJC2013
    Transcription -> Translation
    Alberts et al. Molecular Biology of the Cell,
    4th edition 2002 Fig 6-21

    View Slide

  11. @fellgernon
    #biostatJC2013
    Mapping reads from mRNAs
    Trapnell et al. Nat. Biotech 2009

    View Slide

  12. @fellgernon
    #biostatJC2013
    Mapping result: our initial data
    n samples à
    3 billion nt
    Adapted from @leekgroup

    View Slide

  13. @fellgernon
    #biostatJC2013
    Split by chromosome and filter
    n samples à
    ~760 million nt
    Rows with at least 1 sample with coverage > 5

    View Slide

  14. @fellgernon
    #biostatJC2013
    Test for base-level DE
    Adapted from @andrewejaffe

    View Slide

  15. @fellgernon
    #biostatJC2013
    F-statistic at each base-pair
    • Null model
    • Alternative Model

    View Slide

  16. @fellgernon
    #biostatJC2013
    Threshold on F-statistics
    F-statistic corresponding to
    p-value < 10-8 (F5,30
    )
    Adapted from @andrewejaffe

    View Slide

  17. @fellgernon
    #biostatJC2013
    Q-values: qvalue::qvalue
    Permute model matrices and find null
    regions for all chromosomes.

    View Slide

  18. @fellgernon
    #biostatJC2013
    How can we make it fast?
    • Avoid Input/Output as much as possible
    • Work by chromosome
    • Reduce memory
    – Run Length Encoding (IRanges::Rle)
    0000111111222 = (0, 1, 2)
    (4, 6, 3)
    • Use multiple cores (parallel::mclapply)
    – Split data to use cores efficiently
    • Calculate F-stats using Rcpp (Has + and -)

    View Slide

  19. @fellgernon
    #biostatJC2013
    Finding candidate DERs: example
    dataRegions
    450 500 550 600
    segs
    450 500 550 600
    pieces
    450 500 550 600
    ders
    450 500 550 600
    450 500 550 600
    0.0 1.0 2.0
    Index
    f

    View Slide

  20. @fellgernon
    #biostatJC2013
    Remember to map back to genomic coords
    dataRegions
    450 500 550 600

    View Slide

  21. @fellgernon
    #biostatJC2013
    Calculate F-stats and threshold
    450 500 550 600
    0.0 0.5 1.0 1.5 2.0 2.5
    Index
    f

    View Slide

  22. @fellgernon
    #biostatJC2013
    Thresholding identifies segments
    segs
    450 500 550 600

    View Slide

  23. @fellgernon
    #biostatJC2013
    Disjoin data regions & segments
    pieces
    450 500 550 600

    View Slide

  24. @fellgernon
    #biostatJC2013
    Identify pieces that overlap segments
    ders
    450 500 550 600

    View Slide

  25. @fellgernon
    #biostatJC2013
    Example: re-cap
    dataRegions
    450 500 550 600
    segs
    450 500 550 600
    pieces
    450 500 550 600
    ders
    450 500 550 600
    450 500 550 600
    0.0 1.0 2.0
    Index
    f

    View Slide

  26. @fellgernon
    #biostatJC2013
    Example: result
    Cluster for region with name COL6A1 and q-value 0.8256
    chr21
    chr21
    Coverage
    1
    2 group
    CEU
    YRI
    Mean coverage
    0.125
    0.500
    group
    CEU
    YRI
    Regions
    significantQval
    TRUE
    FALSE
    tx_name
    (gene_id)
    tx_name(gene_id)
    47411000 47411200 47411400 47411600

    View Slide

  27. @fellgernon
    #biostatJC2013
    Public datasets
    • derfinderExample:
    – Blood CEU vs YRI non-related individuals
    • derHippo:
    – Brain hippocampus from cocaine addicts, alcohol
    addicts, and controls
    • derSnyder:
    – Michael Snyder time course (~1 year):
    2 x diseases, recovery & healthy periods
    • derStem:
    – 5 stem cell types, 2 replicates per group

    View Slide

  28. @fellgernon
    #biostatJC2013
    P-values: derfinderExample

    View Slide

  29. @fellgernon
    #biostatJC2013
    P-values: derStem

    View Slide

  30. @fellgernon
    #biostatJC2013
    MA-style plots: derfinderExample

    View Slide

  31. @fellgernon
    #biostatJC2013
    derHippo: CO vs CT

    View Slide

  32. @fellgernon
    #biostatJC2013
    derHippo: ETOH vs CT

    View Slide

  33. @fellgernon
    #biostatJC2013
    derHippo
    Genome
    overview of
    regions with
    q-value < 0.1

    View Slide

  34. @fellgernon
    #biostatJC2013
    derHippo: cluster #2

    View Slide

  35. @fellgernon
    #biostatJC2013
    derfinderExample: cluster #6

    View Slide

  36. @fellgernon
    #biostatJC2013
    derSnyder: cluster #3

    View Slide

  37. @fellgernon
    #biostatJC2013
    Coverage adjustment?




















    6.0e+07 8.0e+07 1.0e+08 1.2e+08 1.4e+08 1.6e+08 1.8e+08
    6.0e+08 8.0e+08 1.0e+09 1.2e+09 1.4e+09 1.6e+09
    chr 1 total Cov vs Cov < quantile 0.9
    Coverage at bases < quantile 0.9
    Total coverage
    0to186 186to294 294to322 322to400
















    ● ●


    1.5e+07 2.0e+07 2.5e+07 3.0e+07 3.5e+07 4.0e+07
    6.0e+08 8.0e+08 1.0e+09 1.2e+09 1.4e+09 1.6e+09
    chr 1 total Cov vs # bases with data
    Number bases with Cov > 0
    Total coverage
    0to186 186to294 294to322 322to400
    derSnyder
    Similar to metagenomeSeq::cumNorm

    View Slide

  38. @fellgernon
    #biostatJC2013
    derfinder R package
    https://github.com/lcolladotor/derfinder
    https://github.com/lcolladotor/derfinderReport
    https://github.com/lcolladotor/derfinderExample

    View Slide

  39. @fellgernon
    #biostatJC2013
    Acknowledgements
    Leek Group
    Jeffrey Leek
    Alyssa Frazee
    Hopkins
    Sarven Sabunciyan
    Ben Langmead
    Lieber Institute (LIBD)
    Andrew Jaffe
    Harvard
    Rafa Irizarry
    Funding
    NIH (Aug 2012- July 2013)
    LIBD (Aug 2013 - now)
    CONACyT México

    View Slide