$30 off During Our Annual Pro Sale. View Details »

user2013

 user2013

Leonardo Collado-Torres

July 10, 2013
Tweet

More Decks by Leonardo Collado-Torres

Other Decks in Science

Transcript

  1. Differential expression analysis of RNA-seq
    data at base-pair resolution in multiple
    biological replicates
    Leonardo Collado-Torres
    tweet: @fellgernon blog: tinyurl.com/FellBit

    View Slide

  2. @fellgernon
    #useR2013
    Field overview
    Ultimate Goal: What is the biological
    (genomic) cause, if any, of X disease?
    Currently: What is the difference between two
    groups?
    Tools
    –  Large number of biological samples
    –  High-throughput sequencing
    –  Reverse transcriptase
    –  Biostatistics
    –  Computers

    View Slide

  3. @fellgernon
    #useR2013
    High-throughput sequencing
    h"p://res.illumina.com/documents/products/
    datasheets/datasheet_hiseq2000.pdf  

    View Slide

  4. @fellgernon
    #useR2013
    RNA-seq
    purified  RNA  
    shear  
    cDNA  fragments  
    sequence  fragment  
    end(s)  
    or  paired-­‐end:  
    single-­‐end:  
    map  reads  to  
    reference  genome  
    summarize  mapped  
    reads  into  feature-­‐
    level  abundances  
    (gene,  exon,  
    transcript…)  
    normalize  abundances  
    test  each  
    feature  for  
    differenJal  
    expression  
    Oshlack  et  al.  Genome  Biology  2010  
    Adapted  from  @acfrazee  

    View Slide

  5. @fellgernon
    #useR2013
    Mapping reads
    Trapnell  et  al.  Nat.  Biotech  2009  

    View Slide

  6. @fellgernon
    #useR2013
    Mapping result: our initial data
    n  samples  à  
    3  billion  nt    
    Frazee et al. Biostatistics in review
    Adapted from @leekgroup

    View Slide

  7. @fellgernon
    #useR2013
    Approach 1: Annotate-then-Identify
    Genome
    Union model:
    Count by taking
    union of all exons
    Bullard et al. BMC Bioinformatics 2010
    Adapted from @leekgroup

    View Slide

  8. @fellgernon
    #useR2013
    Approach 2: assemble-then-identify
    Genome
    Fragments
    Transcripts
    Trapnell et al. Nat. Biotech 2010
    Adapted from @leekgroup

    View Slide

  9. @fellgernon
    #useR2013
    Approach 2: inherent ambiguity (assembly)
    Genome
    Alternative
    Assemblies
    Adapted from @leekgroup

    View Slide

  10. @fellgernon
    #useR2013
    Methods overview
    annotate-identify
    1.  align
    2.  gene-model
    3.  abundances
    4.  analyze
    Pros:
    •  analogous to microarray,
    •  processed data easy to handle
    Cons:
    •  incorrect/variable annotation
    •  gene model choices have a big
    impact
    assemble-identify
    1.  align
    2.  assemble
    3.  abundances
    4.  analyze
    Pros:
    •  alternative transcription
    •  (potentially) less annotation
    dependent
    Cons:
    •  ambiguity/variation in
    assembly
    Adapted from @leekgroup

    View Slide

  11. @fellgernon
    #useR2013
    derFinder: differentially expressed region
    finder
    g() = Transform (Box-Cox, log(+32) etc.)
    Yi,j
    = coverage on sample i at base j
    lj
    = genomic location j
    α() = baseline coverage
    β() = change in coverage between groups
    Xi
    = covariate of interest for sample i
    γk
    ()= adjustment’s for confounders
    Wik
    = value of kth confounder on ith sample
    Frazee et al. Biostatistics in review
    Adapted from @leekgroup
    base-­‐pair  model  (case/control)  

    View Slide

  12. @fellgernon
    #useR2013
    Determining differential expressed regions
    •  Test that β(l) = 0
    s(l) = statistic for location l
    •  HMM with three states
    D(l) = hidden state at location l
    D(l) = 0 if α(l) = β(l) = 0 (no expression)
    D(l) = 1 if α(l) ≠ 0 and β(l) = 0 (expressed, but no group difference)
    D(l) = 2 if β(l) ≠ 0 (group difference!)
    •  Assume statistics conditional on state is Normal
    s(l)|D(l) = d ~ (µd
    ,σd
    2)
    f0
    : µ0
    =0, σ0
    2=ε
    f1
    : µ1
    =0, σ1
    2
    f2
    : µ2
    , σ2
    2
    •  Estimate parameters using a mixture model
    t ~ π0
    f0
    + π1
    f1
    + π2
    f2
    Frazee et al. Biostatistics in review
    Adapted from @leekgroup

    View Slide

  13. @fellgernon
    #useR2013
    Monte-Carlo p-value
    Observed   Null    
    Jaffe et al. Biostatistics 2011
    Frazee et al. Biostatistics in review
    Langmead et al. in prep
    Adapted from @leekgroup

    View Slide

  14. @fellgernon
    #useR2013
    Annotation incorrect?
    4.5 5.0 5.5 6.0 6.5 7.0
    log2(count+32)
    chrY: 15016699 - 15017219
    female
    male
    1 2 3 4 5 6 7
    t statistic
    xaxinds
    exons states
    15016742 15016842 15016942 15017119 15017219
    genomic position
    Frazee et al. Biostatistics in review
    Adapted from @leekgroup

    View Slide

  15. @fellgernon
    #useR2013
    Annotation missing?
    4.5 5.0 5.5 6.0 6.5 7.0
    log2(count+32)
    chrY: 2715932-2716691
    female
    male
    2.0 2.5 3.0 3.5 4.0 4.5
    t statistic
    xaxinds
    exons states
    2715882 2716082 2716282 2716482 2716682
    genomic position
    Frazee et al. Biostatistics in review
    Adapted from @leekgroup

    View Slide

  16. @fellgernon
    #useR2013
    derfinder R package
    •  Input: alignment files (TopHat, …)
    •  Processing coverage by chr into R via
    Rsamtools
    •  Memory reduction via IRanges
    – 3 to 43 GB RAM per chr for 30 samples
    – Alternative: sql database
    •  Linear model fitting via limma
    •  HMM via HiddenMarkov
    •  Custom plotting functions
    •  To-do: reduce wallclock time

    View Slide

  17. @fellgernon
    #useR2013
    derfinder R package
    https://github.com/alyssafrazee/derfinder

    View Slide

  18. @fellgernon
    #useR2013
    Acknowledgements
    Leek Group
    Jeffrey Leek
    Alyssa Frazee
    Hopkins
    Sarven Sabunciyan
    Ben Langmead
    Leiber Institute
    Andrew Jaffe
    Harvard
    Rafa Irizarry
    Funding
    NIH
    CONACyT México

    View Slide