Upgrade to Pro — share decks privately, control downloads, hide ads and more …

user2013

 user2013

7382f7fe30561274624635116513ca37?s=128

Leonardo Collado-Torres

July 10, 2013
Tweet

Transcript

  1. Differential expression analysis of RNA-seq data at base-pair resolution in

    multiple biological replicates Leonardo Collado-Torres tweet: @fellgernon blog: tinyurl.com/FellBit
  2. @fellgernon #useR2013 Field overview Ultimate Goal: What is the biological

    (genomic) cause, if any, of X disease? Currently: What is the difference between two groups? Tools –  Large number of biological samples –  High-throughput sequencing –  Reverse transcriptase –  Biostatistics –  Computers
  3. @fellgernon #useR2013 High-throughput sequencing h"p://res.illumina.com/documents/products/ datasheets/datasheet_hiseq2000.pdf  

  4. @fellgernon #useR2013 RNA-seq purified  RNA   shear   cDNA  fragments

      sequence  fragment   end(s)   or  paired-­‐end:   single-­‐end:   map  reads  to   reference  genome   summarize  mapped   reads  into  feature-­‐ level  abundances   (gene,  exon,   transcript…)   normalize  abundances   test  each   feature  for   differenJal   expression   Oshlack  et  al.  Genome  Biology  2010   Adapted  from  @acfrazee  
  5. @fellgernon #useR2013 Mapping reads Trapnell  et  al.  Nat.  Biotech  2009

     
  6. @fellgernon #useR2013 Mapping result: our initial data n  samples  à

      3  billion  nt     Frazee et al. Biostatistics in review Adapted from @leekgroup
  7. @fellgernon #useR2013 Approach 1: Annotate-then-Identify Genome Union model: Count by

    taking union of all exons Bullard et al. BMC Bioinformatics 2010 Adapted from @leekgroup
  8. @fellgernon #useR2013 Approach 2: assemble-then-identify Genome Fragments Transcripts Trapnell et

    al. Nat. Biotech 2010 Adapted from @leekgroup
  9. @fellgernon #useR2013 Approach 2: inherent ambiguity (assembly) Genome Alternative Assemblies

    Adapted from @leekgroup
  10. @fellgernon #useR2013 Methods overview annotate-identify 1.  align 2.  gene-model 3. 

    abundances 4.  analyze Pros: •  analogous to microarray, •  processed data easy to handle Cons: •  incorrect/variable annotation •  gene model choices have a big impact assemble-identify 1.  align 2.  assemble 3.  abundances 4.  analyze Pros: •  alternative transcription •  (potentially) less annotation dependent Cons: •  ambiguity/variation in assembly Adapted from @leekgroup
  11. @fellgernon #useR2013 derFinder: differentially expressed region finder g() = Transform

    (Box-Cox, log(+32) etc.) Yi,j = coverage on sample i at base j lj = genomic location j α() = baseline coverage β() = change in coverage between groups Xi = covariate of interest for sample i γk ()= adjustment’s for confounders Wik = value of kth confounder on ith sample Frazee et al. Biostatistics in review Adapted from @leekgroup base-­‐pair  model  (case/control)  
  12. @fellgernon #useR2013 Determining differential expressed regions •  Test that β(l)

    = 0 s(l) = statistic for location l •  HMM with three states D(l) = hidden state at location l D(l) = 0 if α(l) = β(l) = 0 (no expression) D(l) = 1 if α(l) ≠ 0 and β(l) = 0 (expressed, but no group difference) D(l) = 2 if β(l) ≠ 0 (group difference!) •  Assume statistics conditional on state is Normal s(l)|D(l) = d ~ (µd ,σd 2) f0 : µ0 =0, σ0 2=ε f1 : µ1 =0, σ1 2 f2 : µ2 , σ2 2 •  Estimate parameters using a mixture model t ~ π0 f0 + π1 f1 + π2 f2 Frazee et al. Biostatistics in review Adapted from @leekgroup
  13. @fellgernon #useR2013 Monte-Carlo p-value Observed   Null     Jaffe

    et al. Biostatistics 2011 Frazee et al. Biostatistics in review Langmead et al. in prep Adapted from @leekgroup
  14. @fellgernon #useR2013 Annotation incorrect? 4.5 5.0 5.5 6.0 6.5 7.0

    log2(count+32) chrY: 15016699 - 15017219 female male 1 2 3 4 5 6 7 t statistic xaxinds exons states 15016742 15016842 15016942 15017119 15017219 genomic position Frazee et al. Biostatistics in review Adapted from @leekgroup
  15. @fellgernon #useR2013 Annotation missing? 4.5 5.0 5.5 6.0 6.5 7.0

    log2(count+32) chrY: 2715932-2716691 female male 2.0 2.5 3.0 3.5 4.0 4.5 t statistic xaxinds exons states 2715882 2716082 2716282 2716482 2716682 genomic position Frazee et al. Biostatistics in review Adapted from @leekgroup
  16. @fellgernon #useR2013 derfinder R package •  Input: alignment files (TopHat,

    …) •  Processing coverage by chr into R via Rsamtools •  Memory reduction via IRanges – 3 to 43 GB RAM per chr for 30 samples – Alternative: sql database •  Linear model fitting via limma •  HMM via HiddenMarkov •  Custom plotting functions •  To-do: reduce wallclock time
  17. @fellgernon #useR2013 derfinder R package https://github.com/alyssafrazee/derfinder

  18. @fellgernon #useR2013 Acknowledgements Leek Group Jeffrey Leek Alyssa Frazee Hopkins

    Sarven Sabunciyan Ben Langmead Leiber Institute Andrew Jaffe Harvard Rafa Irizarry Funding NIH CONACyT México