RNA-seq analysis beyond gene counting

Alyssa Frazee
December 15, 2014

Talk I gave for my interview for a tenure-track position in a biostatistics department.

Transcript

  1. Case studies in genomic data science: RNA-seq analysis beyond gene counting. Alyssa Frazee, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health. December 15, 2014
  2. Philosophy: answer scientific questions using data-driven methods with interpretable results. Research: RNA-seq methods + software. Teaching: biostatistics for medical professionals. Software: R, Bioconductor, Python. Blog: alyssafrazee.com
  3. Research goal: Find genes that behave differently between populations

  4. None
  5. Gene expression. DNA (ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT) → transcription → RNA (AUCAGUCGAUCACCGAU). A gene’s “expression level” = amount of RNA in cell that was transcribed from that gene
  6. Measuring gene expression: RNA-seq. Diagram labels: RNA-seq reads, genome (DNA), RNA transcripts
  7. Measuring gene expression: RNA-seq. Diagram labels: RNA-seq reads, genome (DNA), RNA transcripts, gene, exons, introns, junctions
  8. (FASTQ records)
    @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
    @22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2
    GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT
    +
    @=ABBBBIIIIIIIIHHGGGGIIDBDIIIIIIGIIIIHIIIIHFDD@BBDBGGFIDEE8DCC/29>BGFCGHHHGF
    @22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1
    TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT
    +
    DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>DEGHHIFDDHH8@BEDDI
    @22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2
    AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC
    +
    HIGHIHFHEGE4111:.;8@?@HDIIIIIIIEGGIHHHIIGA?=:FIIIDD8.02506A8=AC#############
    @22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1
    AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC
    +
    B9?@8=42:E@GDEDIIIIIGGHIIIFBEEAGIIDIIDHHGGHIIEGEIIIIIHIHFHFFEEFGGGGGB88>:DGH
    @22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2
    GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA
    +
    IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>D@@GGGIIHIIHGGDDHGBA=ABEG@@DFCCAA<:=>8
    @22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1
    TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC
    +
    HHIIIHIDGG@;=@GIIIIIDDGBBBEDB@8>5554,/':9B@@C?==@1:2@?=GG=;<HHHHGIIHHEC-;;3?
  9. Goal: Find genes that behave differently between populations. 1. Can we discover previously unknown gene activity? 2. Can we discover expression differences at the transcript level?
  10. Goal: Find genes that behave differently between populations. 1. Can we discover previously unknown gene activity? 2. Can we discover expression differences at the transcript level?
  11. Measuring expression using counts. Diagram labels: genome (DNA); expression = 24
  12. High information loss: many possible RNA variants. Diagram label: genome (DNA). Plus: cannot detect expression outside annotated genes, incorrect annotation causes problems, difficult to study non-canonical genomes (e.g., cancer)
  13. Our solution: DER Finder (Frazee et al, Biostatistics 2014). Concept: scan genome base-by-base, highlight regions showing differential expression signal
  14. Read coverage. Diagram labels: genome (DNA); coverage vector: 2 6 0 11 6
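
A minimal sketch of how a coverage vector like the one on slide 14 could be computed, assuming reads have already been aligned and reduced to half-open (start, end) intervals. The function name `coverage_vector` and the toy reads are hypothetical and are not taken from DER Finder.

```python
from typing import List, Tuple


def coverage_vector(
    reads: List[Tuple[int, int]], region_start: int, region_end: int
) -> List[int]:
    """Per-base read coverage over [region_start, region_end).

    Each read is a half-open interval (start, end) in genome coordinates.
    """
    length = region_end - region_start
    # Difference array: +1 where a read begins, -1 just past where it ends.
    diff = [0] * (length + 1)
    for start, end in reads:
        lo = max(start, region_start) - region_start
        hi = min(end, region_end) - region_start
        if lo < hi:  # skip reads that fall entirely outside the region
            diff[lo] += 1
            diff[hi] -= 1
    # A running sum of the difference array gives the depth at each base.
    coverage = []
    depth = 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage


# Toy example (hypothetical reads, not the data behind the slide).
toy_reads = [(0, 3), (1, 4), (1, 2), (3, 5)]
print(coverage_vector(toy_reads, 0, 5))
```

The difference-array approach touches each read once and each base once, which matters when the vector spans a whole chromosome.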
  15. DER Finder. Plot labels: DE signal, read coverage, genomic position. Idea similar to Jaffe et al, Int J Epidemiol 2012
  16. DER Finder. Plot labels: DE signal, read coverage, genomic position
  17. Nucleotide-level signal. Samples indexed by i; locations indexed by l_j; confounders indexed by k. Equation labels: expression, confounders, covariate of interest
  18. Nucleotide-level signal. Samples indexed by i; locations indexed by l_j; confounders indexed by k. Equation labels: expression, confounders, covariate of interest
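
The equation on slides 17 and 18 did not survive in this transcript. A per-nucleotide linear model consistent with the labels (expression, covariate of interest, confounders; samples i, locations l_j, confounders k) might be written as below. The symbols g, X, W, K and the Greek letters are assumed notation, not recovered from the slide.

```latex
g(Y_{ij}) = \alpha(l_j) + \beta(l_j)\, X_i + \sum_{k=1}^{K} \gamma_k(l_j)\, W_{ik} + \varepsilon_{ij}
```

Under this reading, Y_{ij} is the coverage (expression) of sample i at location l_j, g is a transformation such as a log, X_i is the covariate of interest, and W_{ik} are the confounders. A test of \beta(l_j) = 0 at each location would then give one differential expression statistic per base.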
  19. Segmentation: Hidden Markov Model. Hidden states (unknown truth): DE, DE, not DE, DE, not DE. Emissions (observed): moderated t-statistics t_1, t_2, t_3, t_4, t_5 (Smyth 2004)
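
A minimal sketch of the segmentation idea, assuming a two-state HMM ("not DE", "DE") with Gaussian emissions and decoding by the Viterbi algorithm. The transition probabilities, emission parameters and t-statistics below are hypothetical values chosen for illustration, and the published method may use a richer state space.

```python
import math
from typing import Dict, List, Sequence, Tuple


def gaussian_logpdf(x: float, mean: float, sd: float) -> float:
    """Log density of a normal distribution with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(
    observations: Sequence[float],
    states: Sequence[str],
    log_start: Dict[str, float],
    log_trans: Dict[str, Dict[str, float]],
    emission_params: Dict[str, Tuple[float, float]],
) -> List[str]:
    """Most likely hidden-state path for a Gaussian-emission HMM."""
    # best[s] = log-probability of the best path that ends in state s.
    best = {
        s: log_start[s] + gaussian_logpdf(observations[0], *emission_params[s])
        for s in states
    }
    backpointers = []
    for x in observations[1:]:
        step, new_best = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            step[s] = prev
            new_best[s] = (
                best[prev] + log_trans[prev][s]
                + gaussian_logpdf(x, *emission_params[s])
            )
        backpointers.append(step)
        best = new_best
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return path


# Hypothetical parameters: "not DE" statistics cluster near 0, while "DE"
# statistics are modelled as widely spread (sd = 4) around 0.
states = ["not DE", "DE"]
log_start = {s: math.log(0.5) for s in states}
log_trans = {
    "not DE": {"not DE": math.log(0.9), "DE": math.log(0.1)},
    "DE": {"not DE": math.log(0.1), "DE": math.log(0.9)},
}
emission_params = {"not DE": (0.0, 1.0), "DE": (0.0, 4.0)}
t_stats = [0.2, -0.5, 4.8, 5.5, 6.1, 0.1, -0.3]
print(viterbi(t_stats, states, log_start, log_trans, emission_params))
```

Contiguous runs of "DE" in the decoded path would correspond to candidate differentially expressed regions.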
  20. Emission distribution parameter estimation (Efron, Statistical Science 2008)
  21. Emission distribution parameter estimation
  22. Pipeline: linear models → HMM → (candidate DERs)
  23. Pipeline: linear models → HMM → permutation tests for statistical significance
  24. Match to annotation if desired: CECR1, “may play a role in regulating cell proliferation”
  25. Results: Y chromosome (Frazee et al, Biostatistics 2014)
  26. Goal: Find genes that behave differently between populations. 1. Can we discover previously unknown gene activity? (DER Finder) 2. Can we discover expression differences at the transcript level?
  27. Ideal solution: full reconstruction. Diagram labels: reads, estimated transcripts, genome (DNA)
  28. Abundance estimation: expression ≈ 12 for both assembled transcripts. Diagram labels: genome, estimated transcripts
  29. But: assembly is hard (Bernard et al, Bioinformatics 2014). Simulated data
  30. But: assembly is hard (Bernard et al, Bioinformatics 2014). Real data
  31. P-values behaving badly: Cuffdiff 2 (Trapnell et al, Nature Biotechnology 2013) on tumor/normal data (Kim et al, PLoS One 2013), downloaded from InSilico DB (Coletta et al, Genome Biology 2012)
  32. Inherently ambiguous: some possible assemblies. Diagram label: genome
  33. Counting strategy not appropriate. Diagram label: genome
  34. Our solution: Ballgown (Frazee et al, Nature Biotechnology, accepted). Concept: software infrastructure and simple, robust statistical techniques improve inference for assemblies
  35. Ballgown (Frazee et al, Nature Biotechnology, accepted). Diagram labels: transcriptome assembly pipelines; R/Bioconductor DE analysis
  36. Defines R data structure for assemblies: expr (matrices), GRanges, data frames
  37. Defines R data structure for assemblies: expr, GRanges, data frames. Canonical format for differential expression analysis
  38. Facilitates exploratory analysis
  39. Facilitates exploratory analysis. Includes functions to get corresponding gene names, match assembled and annotated transcripts, plot assembly alongside annotation, etc.
  40. Differential expression analysis: drop-in replacement for Cuffdiff; F-tests comparing nested models
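
A minimal sketch of an F-test comparing nested models, assuming the simplest case: a null model with an intercept only against an alternative with one mean per group. The function names and expression values are hypothetical, and this is not Ballgown's own code. Turning the statistic into a p-value would additionally need the F distribution with (df1, df2) degrees of freedom.

```python
from typing import Dict, List, Sequence, Tuple


def residual_sum_of_squares(values: Sequence[float]) -> float:
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def nested_f_statistic(
    expression: Sequence[float], groups: Sequence[str]
) -> Tuple[float, int, int]:
    """F statistic for intercept-only (null) vs per-group means (alternative).

    Returns (F, df1, df2). Assumes at least two groups, more samples than
    groups, and some within-group variation (otherwise the denominator is 0).
    """
    n = len(expression)
    rss_null = residual_sum_of_squares(expression)

    by_group: Dict[str, List[float]] = {}
    for y, g in zip(expression, groups):
        by_group.setdefault(g, []).append(y)
    rss_alt = sum(residual_sum_of_squares(v) for v in by_group.values())

    p_null, p_alt = 1, len(by_group)  # number of fitted parameters per model
    df1, df2 = p_alt - p_null, n - p_alt
    f_stat = ((rss_null - rss_alt) / df1) / (rss_alt / df2)
    return f_stat, df1, df2


# Hypothetical log-scale expression for one transcript in two groups.
expression = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
groups = ["a", "a", "a", "b", "b", "b"]
print(nested_f_statistic(expression, groups))
```

The same ratio extends to confounders, timecourses and multi-group designs by fitting the larger and smaller models with least squares and comparing their residual sums of squares.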
  41. Improved accuracy
  42. Flexible and fast • Suitable for transcripts (not count-based) • Enables timecourse and multi-group analyses • Can adjust for confounders or batch effects • Runs in seconds: on cancer data set, 0.7 sec (Cuffdiff: 10 hours; EBSeq: 6 hours). EBSeq: Leng et al, Bioinformatics 2013
  43. Application: Processing the GEUVADIS dataset (Genetic EUropean VAriation in health and DISease). Lappalainen et al, Nature 2013; ’t Hoen et al, Nature Biotechnology 2013
  44. turning “big” data into small data. “Make big data as small as possible as quick as is possible” -Robert Gentleman
  45. turning “big” data into small data
    raw reads (~3 Tb):
    @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
    aligned reads (~1.5 Tb):
    300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAGGCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQO[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0
    data-driven transcriptome assembly (~150 Mb):
    chr1 Cufflinks exon 14765 16672 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.9.1"; tss_id "TSS1";
    chr1 Cufflinks exon 566984 569564 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 569902 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 567008 568410 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "1"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 569017 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "2"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 567066 567843 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "1"; oId "CUFF.14.3"; tss_id "TSS2";
    chr1 Cufflinks exon 568627 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
  46. turning “big” data into small data
    aligned reads (~1.5 Tb):
    300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAGGCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQO[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0
    data-driven transcriptome assembly (~150 Mb):
    chr1 Cufflinks exon 14765 16672 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.9.1"; tss_id "TSS1";
    chr1 Cufflinks exon 566984 569564 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 569902 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 567008 568410 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "1"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 569017 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "2"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 567066 567843 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "1"; oId "CUFF.14.3"; tss_id "TSS2";
    chr1 Cufflinks exon 568627 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    ballgown objects (200-600 Mb)
    http://figshare.com/articles/GEUVADIS_Processed_Data/1130849
  47. Reproducible; freely available
  48. Download my processed data; save time! 3653 hours, 999 hours, 651 hours; 5299 total hours, assuming 4 cores available
  49. Results: Continuous covariates (Frazee et al, Nature Biotechnology, accepted)
  50. Results: Timecourse analysis
  51. Future Work • Statistical inference and uncertainty measures for transcript assemblies
  52. None
  53. Future Work • Statistical inference and uncertainty measures for transcript assemblies • Reproducibility strategies for time-consuming analyses
  54. Future Work • Statistical inference and uncertainty measures for transcript assemblies • Reproducibility strategies for time-consuming analyses • Continue applying and developing methods to solve scientific problems
  55. Thank you! Collaborators: Jeff Leek (advisor), Sarven Sabunciyan, Kasper Hansen, Rafa Irizarry, Steven Salzberg, Ben Langmead, Andrew Jaffe, Geo Pertea, Leonardo Collado Torres
  56. References
    Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT (2014). “Differential expression analysis of RNA-seq data at single-base resolution.” Biostatistics 15(3): 413-426.
    Frazee AC, Pertea G, Jaffe AE, Salzberg SL, Leek JT (2015). “Ballgown bridges the gap between transcriptome assembly and expression analysis.” Nature Biotechnology, to appear.
    ’t Hoen PAC et al (2013). “Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories.” Nature Biotechnology 31(11): 1015-22.
    Anders S and Huber W (2010). “Differential expression analysis for sequence count data.” Genome Biology 11(10): R106.
    Bernard E, Jacob L, Mairal J, Vert J (2014). “Efficient RNA isoform identification and quantification from RNA-seq data with network flows.” Bioinformatics 30(17): 2447-2455.
    Efron B (2008). “Microarrays, empirical Bayes, and the two-groups model.” Statistical Science 23(1): 1-22.
    Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, Irizarry RA (2012). “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies.” International Journal of Epidemiology 41(1): 200-209.
    Lappalainen T et al (2013). “Transcriptome and genome sequencing uncovers functional variation in humans.” Nature 501(7468): 506-11.
    Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, Haag JD, Gould MN, Steward RM, Kendziorski C (2013). “EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments.” Bioinformatics 29(8): 1035-1043.
    Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26(1): 139-40.
    Smyth GK (2004). “Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.” Statistical Applications in Genetics and Molecular Biology 3(1): 3.
    Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L (2013). “Differential analysis of gene regulation at transcript resolution with RNA-seq.” Nature Biotechnology 31(1): 46-53.
  57. Where to find my work. Papers • DER Finder: Biostatistics [link] • Ballgown: Nature Biotechnology [preprint link] • Polyester: bioRxiv [preprint link] • ReCount: BMC Bioinformatics [link] • RNA-seq book chapter: [link, Chapter 6]. Software (https://github.com/alyssafrazee) • Ballgown: [link][dev link] • Polyester: [link][dev link] • DER Finder (Leonardo Collado Torres): [link]. These links, plus side projects and blog posts: alyssafrazee.com
  58. Image Credits • human diversity: Simon Abrams, CC BY-SA 2.0 [link] • tumor cells: cnicholsonpath (via Flickr), CC BY-SA 2.0 [link] • awesome cast: Jennifer Carole, CC BY-NA 2.0 [link] • cell differentiation: Rasback (via Wikipedia), CC BY-SA 2.5 [link] (I cropped it)
  59. But: no genes annotated here; annotation here does not match data (Frazee et al, Biostatistics 2014)