Slide 1

Slide 1 text

Case studies in genomic data science: RNA-seq analysis beyond gene counting
Alyssa Frazee
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
December 15, 2014

Slide 2

Slide 2 text

philosophy: answer scientific questions using data-driven methods with interpretable results
research: RNA-seq methods + software
teaching: biostatistics for medical professionals
software: R, Bioconductor, Python
blog: alyssafrazee.com

Slide 3

Slide 3 text

Research goal: Find genes that behave differently between populations

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Gene expression
DNA: ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
transcription
RNA: AUCAGUCGAUCACCGAU
gene’s “expression level” = amount of RNA in cell that was transcribed from that gene

Slide 6

Slide 6 text

Measuring gene expression: RNA-seq RNA-seq reads Genome (DNA) RNA transcripts

Slide 7

Slide 7 text

Measuring gene expression: RNA-seq RNA-seq reads Genome (DNA) RNA transcripts gene exons introns junctions

Slide 8

Slide 8 text

@22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
+
GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
@22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2
GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT
+
@=ABBBBIIIIIIIIHHGGGGIIDBDIIIIIIGIIIIHIIIIHFDD@BBDBGGFIDEE8DCC/29>BGFCGHHHGF
@22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1
TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT
+
DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>DEGHHIFDDHH8@BEDDI
@22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2
AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC
+
HIGHIHFHEGE4111:.;8@?@HDIIIIIIIEGGIHHHIIGA?=:FIIIDD8.02506A8=AC#############
@22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1
AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC
+
B9?@8=42:E@GDEDIIIIIGGHIIIFBEEAGIIDIIDHHGGHIIEGEIIIIIHIHFHFFEEFGGGGGB88>:DGH
@22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2
GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA
+
IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>D@@GGGIIHIIHGGDDHGBA=ABEG@@DFCCAA<:=>8
@22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1
TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC
+
HHIIIHIDGG@;=@GIIIIIDDGBBBEDB@8>5554,/':9B@@C?==@1:2@?=GG=;

Slide 9

Slide 9 text

Goal: Find genes that behave differently between populations 1. Can we discover previously unknown gene activity? 2. Can we discover expression differences at the transcript level?

Slide 10

Slide 10 text

Goal: Find genes that behave differently between populations 1. Can we discover previously unknown gene activity? 2. Can we discover expression differences at the transcript level?

Slide 11

Slide 11 text

expression = 24 Genome (DNA) Measuring expression using counts
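The counting idea on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions: reads are represented as hypothetical (start, end) alignment intervals, and a read counts toward a gene if it overlaps the gene's annotated interval at all. Real counting tools make more careful choices about ambiguous and spliced reads.

```python
def count_reads_in_gene(reads, gene_start, gene_end):
    """Count reads whose alignment overlaps the closed interval [gene_start, gene_end]."""
    return sum(1 for start, end in reads if start <= gene_end and end >= gene_start)


# Hypothetical alignments as (start, end) genomic coordinates, both inclusive.
reads = [(100, 175), (150, 225), (300, 375), (900, 975)]

# The gene's "expression" under this strategy is just the number of overlapping reads.
gene_count = count_reads_in_gene(reads, gene_start=120, gene_end=400)
```

Everything about where inside the gene the reads fall is discarded, which is the information loss the next slide points to.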

Slide 12

Slide 12 text

High information loss Many possible RNA variants Genome (DNA) Plus: cannot detect expression outside annotated genes, incorrect annotation causes problems, difficult to study non-canonical genomes (e.g., cancer)

Slide 13

Slide 13 text

Our solution: DER Finder (Frazee et al, Biostatistics 2014)
Concept: scan genome base-by-base, highlight regions showing differential expression signal

Slide 14

Slide 14 text

Read coverage coverage vector 2 6 0 11 6 Genome (DNA)
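A coverage vector of this kind can be computed from aligned read intervals. The sketch below is a minimal illustration, assuming reads are given as hypothetical (start, end) pairs with inclusive coordinates; it is not the code used in DER Finder.

```python
def coverage_vector(reads, region_start, region_end):
    """Per-base read depth over [region_start, region_end], inclusive coordinates."""
    length = region_end - region_start + 1
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (length + 1)
    for start, end in reads:
        lo = max(start, region_start)
        hi = min(end, region_end)
        if lo > hi:
            continue  # read lies entirely outside the region
        diff[lo - region_start] += 1
        diff[hi - region_start + 1] -= 1
    # A running sum of the difference array gives the depth at each base.
    coverage, depth = [], 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage


# Hypothetical reads on a tiny 8-base region.
reads = [(1, 3), (2, 5), (2, 2), (7, 8)]
cov = coverage_vector(reads, 1, 8)
```

Each entry of `cov` is the number of reads covering that base, so one sample contributes one such vector along the genome.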

Slide 15

Slide 15 text

DER Finder genomic position DE signal read coverage idea similar to Jaffe et al, Int J Epidemiol 2012

Slide 16

Slide 16 text

DER Finder DE signal read coverage genomic position

Slide 17

Slide 17 text

Nucleotide-level signal
samples indexed by i; locations indexed by l_j; confounders indexed by k
equation terms labeled: expression, confounders, covariate of interest
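The equation itself did not survive in this transcript. A form consistent with the labels on the slide and with the published DER Finder description is sketched below. The specific symbols (the transformation g, the coefficients α, β, γ_k, and the error term ε) are assumptions, not text recovered from the slide.

```latex
g(Y_{ij}) = \alpha(l_j) + \beta(l_j)\, X_i + \sum_{k=1}^{K} \gamma_k(l_j)\, W_{ik} + \varepsilon_{ij}
```

Here Y_{ij} would be the expression (coverage) for sample i at location l_j, X_i the covariate of interest, and W_{ik} the k-th confounder. A test of β(l_j) = 0 at each location then gives the base-level differential expression signal.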

Slide 18

Slide 18 text

Nucleotide-level signal
samples indexed by i; locations indexed by l_j; confounders indexed by k
equation terms labeled: expression, confounders, covariate of interest

Slide 19

Slide 19 text

Segmentation: Hidden Markov Model
hidden states (unknown truth): DE, DE, not DE, DE, not DE
emissions (observed): moderated t-statistics t_1, t_2, t_3, t_4, t_5 (Smyth 2004)
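The segmentation step can be illustrated with a small Viterbi decoder. This is a sketch under stated assumptions, not the DER Finder implementation: it uses only two states, Gaussian emissions with made-up parameters (`p_stay`, `p_de`, `de_mean` are all hypothetical), and a short list of invented t-statistics.

```python
import math


def viterbi_two_state(t_stats, p_stay=0.9, p_de=0.1, de_mean=4.0):
    """Most likely 'DE' / 'not DE' path for a sequence of t-statistics.

    Assumed emissions: N(0, 1) for 'not DE'; an equal mixture of
    N(+de_mean, 1) and N(-de_mean, 1) for 'DE'. All parameters are illustrative.
    """
    def log_norm(x, mu):
        return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)

    def log_emit(state, x):
        if state == 0:
            return log_norm(x, 0.0)
        # log of 0.5 * N(x; +m, 1) + 0.5 * N(x; -m, 1), computed stably
        a, b = log_norm(x, de_mean), log_norm(x, -de_mean)
        m = max(a, b)
        return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))

    log_init = [math.log(1 - p_de), math.log(p_de)]
    log_trans = [[math.log(p_stay), math.log(1 - p_stay)],
                 [math.log(1 - p_stay), math.log(p_stay)]]

    # Forward pass: best log-score of any path ending in each state.
    score = [log_init[s] + log_emit(s, t_stats[0]) for s in (0, 1)]
    back = []
    for x in t_stats[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            best_prev = max((0, 1), key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emit(s, x))
        score = new_score
        back.append(pointers)

    # Backward pass: follow the stored pointers from the best final state.
    state = max((0, 1), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return ["DE" if s == 1 else "not DE" for s in path]


labels = viterbi_two_state([0.2, -0.5, 4.1, 5.0, 3.8, 0.1, -0.3])
```

Consecutive bases with the same decoded label form one region, so a run of "DE" labels corresponds to one candidate DER.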

Slide 20

Slide 20 text

Emission distribution parameter estimation Efron, Statistical Science 2008

Slide 21

Slide 21 text

Emission distribution parameter estimation

Slide 22

Slide 22 text

(candidate DERs) linear models HMM

Slide 23

Slide 23 text

linear models HMM permutation tests for statistical significance
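The permutation idea can be sketched generically: shuffle the sample labels, recompute a statistic, and see how often the shuffled statistic is at least as extreme as the observed one. The example below is a simplified stand-in, assuming one hypothetical summary value per sample for a single candidate region and a difference in group means as the statistic. The permutation scheme actually used for DERs is not reproduced here.

```python
import random


def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    def abs_mean_diff(lab):
        g1 = [v for v, g in zip(values, lab) if g == 1]
        g0 = [v for v, g in zip(values, lab) if g == 0]
        return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

    rng = random.Random(seed)
    observed = abs_mean_diff(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and values
        if abs_mean_diff(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical average coverage in one candidate region, 4 samples per group.
region_means = [10.1, 9.8, 10.4, 10.0, 14.9, 15.3, 15.1, 14.7]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
p = permutation_p_value(region_means, groups)
```

With only eight samples there are just 70 distinct label assignments, so the smallest achievable p-value is limited by the sample size no matter how many permutations are drawn.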

Slide 24

Slide 24 text

match to annotation if desired: CECR1, “may play a role in regulating cell proliferation”

Slide 25

Slide 25 text

Results: Y chromosome Frazee et al, Biostatistics 2014

Slide 26

Slide 26 text

Goal: Find genes that behave differently between populations 1. Can we discover previously unknown gene activity? (DER Finder) 2. Can we discover expression differences at the transcript level?

Slide 27

Slide 27 text

Ideal solution: full reconstruction Reads Estimated Transcripts Genome (DNA)

Slide 28

Slide 28 text

Abundance estimation expression ≈ 12 for both assembled transcripts Genome Estimated Transcripts

Slide 29

Slide 29 text

But: assembly is hard Bernard et al, Bioinformatics 2014 Simulated Data

Slide 30

Slide 30 text

But: assembly is hard Bernard et al, Bioinformatics 2014 Real Data

Slide 31

Slide 31 text

P-values behaving badly Cuffdiff 2 (Trapnell et al, Nature Biotechnology 2013) on tumor/normal data (Kim et al, PLoS One 2013), downloaded from InSilico DB (Coletta et al, Genome Biology 2012)

Slide 32

Slide 32 text

some possible assemblies Inherently ambiguous Genome

Slide 33

Slide 33 text

Counting strategy not appropriate Genome

Slide 34

Slide 34 text

Concept: software infrastructure and simple, robust statistical techniques improve inference for assemblies Our solution: Ballgown Frazee et al, Nature Biotechnology (accepted)

Slide 35

Slide 35 text

Ballgown Frazee et al, Nature Biotechnology (accepted) transcriptome assembly pipelines R/Bioconductor DE analysis

Slide 36

Slide 36 text

Defines R data structure for assemblies expr matrices GRanges data frames

Slide 37

Slide 37 text

Defines R data structure for assemblies expr GRanges data frames Canonical format for differential expression analysis

Slide 38

Slide 38 text

Facilitates exploratory analysis

Slide 39

Slide 39 text

Facilitates exploratory analysis Includes functions to get corresponding gene names, match assembled and annotated transcripts, plot assembly alongside annotation, etc.

Slide 40

Slide 40 text

Differential expression analysis drop-in replacement for Cuffdiff F-tests comparing nested models
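To make "F-tests comparing nested models" concrete, here is a minimal sketch for the simplest case: an intercept-only null model against a model with one group covariate, fitted to a single transcript's expression values. The function name and the data are hypothetical, and this is not Ballgown's implementation, which handles general design matrices with confounders.

```python
def nested_f_statistic(expression, groups):
    """F statistic comparing an intercept-only model with intercept + group."""
    n = len(expression)

    # Null model: one overall mean.
    grand_mean = sum(expression) / n
    rss_null = sum((y - grand_mean) ** 2 for y in expression)

    # Full model: one mean per group level.
    rss_full = 0.0
    levels = sorted(set(groups))
    for level in levels:
        ys = [y for y, g in zip(expression, groups) if g == level]
        mean = sum(ys) / len(ys)
        rss_full += sum((y - mean) ** 2 for y in ys)

    df_extra = len(levels) - 1   # parameters added by the full model
    df_resid = n - len(levels)   # residual degrees of freedom of the full model
    f_stat = ((rss_null - rss_full) / df_extra) / (rss_full / df_resid)
    return f_stat, df_extra, df_resid


# Hypothetical expression values for one transcript in two groups.
fpkm = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
group = ["a", "a", "a", "b", "b", "b"]
f_stat, df1, df2 = nested_f_statistic(fpkm, group)
```

The p-value would come from the upper tail of an F distribution with (df1, df2) degrees of freedom. It is left out here because the Python standard library has no F distribution.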

Slide 41

Slide 41 text

Improved accuracy

Slide 42

Slide 42 text

Flexible and fast
● Suitable for transcripts (not count-based)
● Enables timecourse and multi-group analyses
● Can adjust for confounders or batch effects
● Runs in seconds: on the cancer data set, Ballgown took 0.7 sec, versus 10 hours for Cuffdiff and 6 hours for EBSeq
EBSeq: Leng et al, Bioinformatics 2013

Slide 43

Slide 43 text

Application: Processing the GEUVADIS dataset
Genetic EUropean VAriation in health and DISease
Lappalainen et al, Nature 2013; ’t Hoen et al, Nature Biotechnology 2013

Slide 44

Slide 44 text

turning “big” data into small data “Make big data as small as possible as quick as is possible” -Robert Gentleman

Slide 45

Slide 45 text

turning “big” data into small data
raw reads (~3 Tb):
@22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
+
GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
aligned reads (~1.5 Tb):
300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAGGCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQO[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0
data-driven transcriptome assembly (~150 Mb):
chr1 Cufflinks exon 14765 16672 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.9.1"; tss_id "TSS1";
chr1 Cufflinks exon 566984 569564 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.14.1"; tss_id "TSS2";
chr1 Cufflinks exon 569902 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.14.1"; tss_id "TSS2";
chr1 Cufflinks exon 567008 568410 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "1"; oId "CUFF.14.2"; tss_id "TSS2";
chr1 Cufflinks exon 569017 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "2"; oId "CUFF.14.2"; tss_id "TSS2";
chr1 Cufflinks exon 567066 567843 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "1"; oId "CUFF.14.3"; tss_id "TSS2";
chr1 Cufflinks exon 568627 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number

Slide 46

Slide 46 text

turning “big” data into small data
aligned reads (~1.5 Tb)
data-driven transcriptome assembly (~150 Mb)
ballgown objects (200-600 Mb)
http://figshare.com/articles/GEUVADIS_Processed_Data/1130849

Slide 47

Slide 47 text

Reproducible; freely available

Slide 48

Slide 48 text

Download my processed data; save time! 3653 hours 999 hours 651 hours 5299 total hours, assuming 4 cores available

Slide 49

Slide 49 text

Results: Continuous covariates Frazee et al, Nature Biotechnology (accepted)

Slide 50

Slide 50 text

Results: Timecourse analysis

Slide 51

Slide 51 text

Future Work
● Statistical inference and uncertainty measures for transcript assemblies

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

Future Work
● Statistical inference and uncertainty measures for transcript assemblies
● Reproducibility strategies for time-consuming analyses

Slide 54

Slide 54 text

Future Work
● Statistical inference and uncertainty measures for transcript assemblies
● Reproducibility strategies for time-consuming analyses
● Continue applying and developing methods to solve scientific problems

Slide 55

Slide 55 text

Thank you!
Collaborators: Jeff Leek (advisor), Sarven Sabunciyan, Kasper Hansen, Rafa Irizarry, Steven Salzberg, Ben Langmead, Andrew Jaffe, Geo Pertea, Leonardo Collado Torres

Slide 56

Slide 56 text

References
Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT (2014). “Differential expression analysis of RNA-seq data at single-base resolution.” Biostatistics 15(3): 413-426.
Frazee AC, Pertea G, Jaffe AE, Salzberg SL, Leek JT (2015). “Ballgown bridges the gap between transcriptome assembly and expression analysis.” Nature Biotechnology, to appear.
’t Hoen PAC et al (2013). “Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories.” Nature Biotechnology 31(11): 1015-1022.
Anders S and Huber W (2010). “Differential expression analysis for sequence count data.” Genome Biology 11(10): R106.
Bernard E, Jacob L, Mairal J, Vert J (2014). “Efficient RNA isoform identification and quantification from RNA-seq data with network flows.” Bioinformatics 30(17): 2447-2455.
Efron B (2008). “Microarrays, empirical Bayes, and the two-groups model.” Statistical Science 23(1): 1-22.
Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, Irizarry RA (2012). “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies.” International Journal of Epidemiology 41(1): 200-209.
Lappalainen T et al (2013). “Transcriptome and genome sequencing uncovers functional variation in humans.” Nature 501(7468): 506-511.
Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, Haag JD, Gould MN, Stewart RM, Kendziorski C (2013). “EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments.” Bioinformatics 29(8): 1035-1043.
Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26(1): 139-140.
Smyth GK (2004). “Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.” Statistical Applications in Genetics and Molecular Biology 3(1): 3.
Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L (2013). “Differential analysis of gene regulation at transcript resolution with RNA-seq.” Nature Biotechnology 31(1): 46-53.

Slide 57

Slide 57 text

Where to find my work
Papers
● DER Finder: Biostatistics [link]
● Ballgown: Nature Biotechnology [preprint link]
● Polyester: bioRxiv [preprint link]
● ReCount: BMC Bioinformatics [link]
● RNA-seq book chapter: [link, Chapter 6]
Software: https://github.com/alyssafrazee
● Ballgown: [link][dev link]
● Polyester: [link][dev link]
● DER Finder (Leonardo Collado Torres): [link]
These links, plus side projects and blog posts: alyssafrazee.com

Slide 58

Slide 58 text

Image Credits
● human diversity: Simon Abrams, CC BY-SA 2.0 [link]
● tumor cells: cnicholsonpath (via Flickr), CC BY-SA 2.0 [link]
● awesome cast: Jennifer Carole, CC BY-NA 2.0 [link]
● cell differentiation: Rasback (via Wikipedia), CC BY-SA 2.5 [link] (I cropped it)

Slide 59

Slide 59 text

No genes annotated here Annotation here does not match data But: Frazee et al, Biostatistics 2014