RNA-seq analysis beyond gene counting

567d15666cd2891a4e6c49e007f30a08?s=47 Alyssa Frazee
December 15, 2014

RNA-seq analysis beyond gene counting

Talk I gave for my interview for a tenure-track position in a biostatistics department.

567d15666cd2891a4e6c49e007f30a08?s=128

Alyssa Frazee

December 15, 2014
Tweet

Transcript

  1. 1.

    Case studies in genomic data science: RNA-seq analysis beyond gene

    counting Alyssa Frazee Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health December 15, 2014
  2. 2.

    research teaching software blog philosophy answer scientific questions using data-driven

    methods with interpretable results RNA-seq: methods + software biostatistics for medical professionals R, Bioconductor, Python alyssafrazee.com
  3. 4.
  4. 5.

    Gene expression AUCAGUCGAUCACCGAU transcription DNA RNA gene’s “expression level” =

    amount of RNA in cell that was transcribed from that gene ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
  5. 8.

    @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2 CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA + GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD @22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2 GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT + @=ABBBBIIIIIIIIHHGGGGIIDBDIIIIIIGIIIIHIIIIHFDD@BBDBGGFIDEE8DCC/29>BGFCGHHHGF @22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1 TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT

    + DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>DEGHHIFDDHH8@BEDDI @22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2 AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC + HIGHIHFHEGE4111:.;8@?@HDIIIIIIIEGGIHHHIIGA?=:FIIIDD8.02506A8=AC############# @22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1 AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC + B9?@8=42:E@GDEDIIIIIGGHIIIFBEEAGIIDIIDHHGGHIIEGEIIIIIHIHFHFFEEFGGGGGB88>:DGH @22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2 GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA + IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>D@@GGGIIHIIHGGDDHGBA=ABEG@@DFCCAA<:=>8 @22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1 TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC + HHIIIHIDGG@;=@GIIIIIDDGBBBEDB@8>5554,/':9B@@C?==@1:2@?=GG=;<HHHHGIIHHEC-;;3?
  6. 9.

    Goal: Find genes that behave differently between populations 1. Can

    we discover previously unknown gene activity? 2. Can we discover expression differences at the transcript level?
  7. 10.

    Goal: Find genes that behave differently between populations 1. Can

    we discover previously unknown gene activity? 2. Can we discover expression differences at the transcript level?
  8. 12.

    High information loss Many possible RNA variants Genome (DNA) Plus:

    cannot detect expression outside annotated genes, incorrect annotation causes problems, difficult to study non-canonical genomes (e.g., cancer)
  9. 13.

    Our solution: DER Finder Frazee et al, Biostatistics 2014 Concept:

    scan genome base- by-base, highlight regions showing differential expression signal
  10. 15.
  11. 17.

    Nucleotide-level signal samples indexed by i locations indexed by l

    j confounders indexed by k expression confounders covariate of interest
  12. 18.

    samples indexed by i locations indexed by l j confounders

    indexed by k expression confounders covariate of interest v Nucleotide-level signal
  13. 19.

    hidden states (unknown truth) DE DE not DE t 1

    t 2 t 3 t 4 t 5 DE not DE emissions (observed): moderated t-statistics (Smyth 2004) Segmentation: Hidden Markov Model
  14. 24.

    match to annotation if desired: CECR1, “may play a role

    in regulating cell proliferation”
  15. 26.

    Goal: Find genes that behave differently between populations 1. Can

    we discover previously unknown gene activity? (DER Finder) 2. Can we discover expression differences at the transcript level?
  16. 31.

    P-values behaving badly Cuffdiff 2 (Trapnell et al, Nature Biotechnology

    2013) on tumor/normal data (Kim et al, PloS One 2013), downloaded from InSilico DB (Coletta et al, Genome Biology 2012)
  17. 34.

    Concept: software infrastructure and simple, robust statistical techniques improve inference

    for assemblies Our solution: Ballgown Frazee et al, Nature Biotechnology (accepted)
  18. 37.

    Defines R data structure for assemblies expr GRanges data frames

    Canonical format for differential expression analysis
  19. 39.

    Facilitates exploratory analysis Includes functions to get corresponding gene names,

    match assembled and annotated transcripts, plot assembly alongside annotation, etc.
  20. 42.

    Flexible and fast • Suitable for transcripts (not count-based) •

    Enables timecourse and multi-group analyses • Can adjust for confounders or batch effects • Runs in seconds: on cancer data set, 0.7 sec; Cuffdiff: 10 hours and EBSeq: 6 hours EBSeq: Leng et al, Bioinformatics 2013
  21. 43.

    Application: Processing the GEUVADIS dataset Genetic EUropean VAriation in health

    and DISease Lappalainen et al, Nature 2013; AC’t Hoen et al, Nature Biotechnology 2013
  22. 44.

    turning “big” data into small data “Make big data as

    small as possible as quick as is possible” -Robert Gentleman
  23. 45.

    turning “big” data into small data @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2 CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA + GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD

    @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2 CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA + GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2 CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA + GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2 CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA + GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2 CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA + GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2 CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA + GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2 CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA + GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2 CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA + GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2 CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA + GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2 CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA + GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2 CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA + GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2 CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA + GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD raw reads (~3 Tb) 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 aligned reads (~1.5 Tb) chr1 Cufflinks exon 14765 16672 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.9.1"; tss_id "TSS1"; chr1 Cufflinks exon 566984 569564 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.14.1"; tss_id "TSS2"; chr1 Cufflinks exon 569902 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.14.1"; tss_id "TSS2"; chr1 Cufflinks exon 567008 568410 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "1"; oId "CUFF.14.2"; tss_id "TSS2"; chr1 Cufflinks exon 569017 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "2"; oId "CUFF.14.2"; tss_id "TSS2"; chr1 Cufflinks exon 567066 567843 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "1"; oId "CUFF.14.3"; tss_id "TSS2"; chr1 Cufflinks exon 568627 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number data-driven transcriptome assembly (~150 Mb)
  24. 46.

    300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG

    GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0 aligned reads (~1.5 Tb) chr1 Cufflinks exon 14765 16672 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.9.1"; tss_id "TSS1"; chr1 Cufflinks exon 566984 569564 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.14.1"; tss_id "TSS2"; chr1 Cufflinks exon 569902 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.14.1"; tss_id "TSS2"; chr1 Cufflinks exon 567008 568410 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "1"; oId "CUFF.14.2"; tss_id "TSS2"; chr1 Cufflinks exon 569017 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "2"; oId "CUFF.14.2"; tss_id "TSS2"; chr1 Cufflinks exon 567066 567843 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "1"; oId "CUFF.14.3"; tss_id "TSS2"; chr1 Cufflinks exon 568627 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number data-driven transcriptome assembly (~150 Mb) ballgown objects (200-600 Mb) turning “big” data into small data http://figshare. com/articles/GEUVADIS_Pr ocessed_Data/1130849
  25. 48.

    Download my processed data; save time! 3653 hours 999 hours

    651 hours 5299 total hours, assuming 4 cores available
  26. 52.
  27. 53.

    Future Work • Statistical inference and uncertainty measures for transcript

    assemblies • Reproducibility strategies for time- consuming analyses
  28. 54.

    Future Work • Statistical inference and uncertainty measures for transcript

    assemblies • Reproducibility strategies for time- consuming analyses • Continue applying and developing methods to solve scientific problems
  29. 55.

    Thank you! Collaborators: Jeff Leek (advisor), Sarven Sabunciyan, Kasper Hansen,

    Rafa Irizarry, Steven Salzberg, Ben Langmead, Andrew Jaffe, Geo Pertea, Leonardo Collado Torres
  30. 56.

    Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT

    (2014). “Differential expression analysis of RNA-seq data at single-base resolution.” Biostatistics 15(3): 413-426 Frazee AC, Pertea G, Jaffe AE, Salzberg SL, Leek JT (2015). “Ballgown bridges the gap between transcriptome assembly and expression analysis.” Nature Biotechnology, to appear. AC’t Hoen P et al (2013): “Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories.” Nature Biotechnology 31(11): 1015-22. Anders S and Huber W (2010). “Differential expression analysis for sequence count data.” Genome Biology 11(10): R106. Bernard E, Jacob L, Mairal J, Vert J (2014). “Efficient RNA isoform identification and quantification from RNA-seq data with network flows.” Bioinformatics 30(17): 2447-2455. Efron B (2008): “Microarrays, empirical Bayes, and the two-groups model.” Statistical Science 23(1): 1-22. Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, Irizarry RA (2012): “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies.” International Journal of Epidemiology 41(1): 200-209. Lappalainen T et al (2013). “Transcriptome and genome sequencing uncovers functional variation in humans.” Nature 501(7468): 506-11. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, Haag JD, Gould MN, Steward RM, Kendziorski C (2013). “EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments.” Bioinformatics 29(8): 1035-1043. Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26(1): 139-40. Smyth GK (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3(1):3. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L (2013). “Differential analysis of gene regulation at transcript resolution with RNA-seq.” Nature Biotechnology 31(1): 46-53. References
  31. 57.

    Papers • DER Finder: Biostatistics [link] • Ballgown: Nature Biotechnology

    [preprint link] • Polyester: BiorXiv [preprint link] • ReCount: BMC Bioinformatics [link] • RNA-seq book chapter: [link, Chapter 6] Software https://github.com/alyssafrazee • Ballgown: [link][dev link] • Polyester: [link][dev link] • DER Finder (Leonardo Collado Torres): [link] Where to find my work These links, plus side projects and blog posts: alyssafrazee.com
  32. 58.

    • human diversity: Simon Abrams, CC BY-SA 2.0 [link] •

    tumor cells: cnicholsonpath (via Flickr), CC BY- SA 2.0 [link] • awesome cast: Jennifer Carole, CC BY-NA 2.0 [link] • cell differentiation: Rasback (via Wikipedia), CC BY-SA 2.5 [link] (I cropped it) Image Credits
  33. 59.

    No genes annotated here Annotation here does not match data

    But: Frazee et al, Biostatistics 2014