Tackling big questions with gene expression data

Alyssa Frazee
October 23, 2014

My data science job talk! I gave the full version of this talk at one company, and a 15-minute version at another.

Transcript

  1. Alyssa Frazee, Oct 23, 2014: Tackling big questions with gene expression data
  2. What makes humans diverse?

  3. How does a healthy cell turn into a cancer cell?

  4. How do our bodies deal with injuries? What gives a cell its identity? How do we grow and develop?
  5. In a cell:

  6. Stories in DNA data Novembre et al 2008, PMID 18758442

  7. What is gene expression?

  8. What is gene expression? DNA (ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT) → transcription → RNA (AUCAGUCGAUCACCGAU) → translation → protein
  9. What is gene expression? DNA (ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT) → transcription → RNA (AUCAGUCGAUCACCGAU). A gene’s “expression level” = amount of RNA in the cell that was transcribed from that gene.
  10. How do we measure gene expression? [Figure: genome, RNA transcripts, RNA-seq reads (50-100 bp long)]
  11. What does gene expression data look like?
        @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
        CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
        +
        GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
        @22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2
        GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT
        +
        @=ABBBBIIIIIIIIHHGGGGIIDBDIIIIIIGIIIIHIIIIHFDD@BBDBGGFIDEE8DCC/29>BGFCGHHHGF
        @22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1
        TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT
        +
        DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>DEGHHIFDDHH8@BEDDI
        @22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2
        AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC
        +
        HIGHIHFHEGE4111:.;8@?@HDIIIIIIIEGGIHHHIIGA?=:FIIIDD8.02506A8=AC#############
        @22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1
        AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC
        +
        B9?@8=42:E@GDEDIIIIIGGHIIIFBEEAGIIDIIDHHGGHIIEGEIIIIIHIHFHFFEEFGGGGGB88>:DGH
        @22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2
        GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA
        +
        IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>D@@GGGIIHIIHGGDDHGBA=ABEG@@DFCCAA<:=>8
        @22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1
        TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC
        +
        HHIIIHIDGG@;=@GIIIIIDDGBBBEDB@8>5554,/':9B@@C?==@1:2@?=GG=;<HHHHGIIHHEC-;;3?
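The records on slide 11 follow the four-line FASTQ layout: a header starting with `@`, the bases, a `+` separator, and one quality character per base. As a rough illustration of how such text could be read, here is a minimal Python sketch. The names `Read` and `parse_fastq`, the toy record, and the Phred+33 quality offset are all assumptions made for this example; the slides do not say which quality encoding these reads use.

```python
from typing import Iterator, List, NamedTuple


class Read(NamedTuple):
    name: str         # header line without the leading "@"
    bases: str        # nucleotide sequence
    quals: List[int]  # per-base quality scores


def parse_fastq(text: str, offset: int = 33) -> Iterator[Read]:
    """Yield reads from FASTQ text (assumes 4 lines per record, no wrapping)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of 4 non-empty lines")
    for i in range(0, len(lines), 4):
        header, bases, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(bases) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Assumed encoding: quality = ASCII code minus the offset (Phred+33).
        yield Read(header[1:], bases, [ord(c) - offset for c in qual])


# Hypothetical toy record, not taken from the slides.
toy = "@toy_read_1\nACGTN\n+\nII5#!\n"
reads = list(parse_fastq(toy))
for read in reads:
    print(read.name, read.bases, read.quals)
```

Real files are usually gzip-compressed and far too large to hold in memory as one string, so a production parser would stream lines instead.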
  12. How do we measure gene expression? [Figure: genome, RNA transcripts (**unknown**), RNA-seq reads (50-100 bp long)]
  13. Data processing (1) put reads back where they came from

  14. Data processing (1) put reads back where they came from (2) decide how to measure expression. One strategy: count number of reads originating from each gene (expression = 24)
  15. Data processing (1) put reads back where they came from (2) decide how to measure expression. One strategy: count number of reads originating from each gene (expression = 24) (lots of complexity here!)
  16. Example of a complication: overlapping transcripts from same gene. DNA: ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT; transcripts: CGAUCACCGAU, AUCAGUCGAUCACCGAU, AUCAGUCGAUC
  17. Data processing (1) put reads back where they came from (2) decide how to measure expression. One strategy: count number of reads originating from each gene (expression = 24) (lots of complexity here!)
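The counting strategy on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: reads are taken to be already aligned and reduced to inclusive (start, end) intervals on one chromosome, and the gene names, coordinates, and the function `count_reads_per_gene` are hypothetical. Reads that overlap more than one gene are tallied separately, which hints at the "lots of complexity" the slides mention.

```python
from collections import Counter
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) on a single chromosome


def count_reads_per_gene(
    genes: Dict[str, Interval], reads: List[Interval]
) -> Tuple[Counter, int, int]:
    """Count reads overlapping exactly one gene; tally the rest separately."""
    counts = Counter({name: 0 for name in genes})
    ambiguous = 0   # reads overlapping two or more genes
    unassigned = 0  # reads overlapping no gene
    for r_start, r_end in reads:
        hits = [
            name
            for name, (g_start, g_end) in genes.items()
            if r_start <= g_end and g_start <= r_end  # intervals overlap
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif hits:
            ambiguous += 1
        else:
            unassigned += 1
    return counts, ambiguous, unassigned


# Hypothetical toy annotation and alignments.
genes = {"geneA": (100, 500), "geneB": (450, 900)}
reads = [(120, 195), (300, 375), (460, 535), (800, 875), (1000, 1075)]
counts, ambiguous, unassigned = count_reads_per_gene(genes, reads)
print(dict(counts), ambiguous, unassigned)
```

How to treat the ambiguous reads (discard, split, or assign by a model) is a design choice this sketch deliberately leaves open.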
  18. None
  19. Is population structure also present in gene expression data? It makes sense that populations share DNA sequences, but do the genes themselves behave differently in different populations?
  20. GEUVADIS gene expression data. GEUVADIS: Genetic EUropean VAriation in health and DISease. Raw gene expression data is publicly available for hundreds of individuals, so of course we couldn’t resist analyzing it ourselves. Two papers were published along with the data release: one on scientific/biology findings (Lappalainen et al 2013, PMID 24037378) and one on technical patterns in gene expression data (’t Hoen et al 2013, PMID 24037425)
  21. turning “big” data into small data
    raw data (~3 Tb):
        @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
        CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
        +
        GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
    aligned reads (~1.5 Tb):
        300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAGGCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQO[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0
    all transcripts in data set (~150 Mb):
        chr1 Cufflinks exon 14765 16672 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.9.1"; tss_id "TSS1";
        chr1 Cufflinks exon 566984 569564 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.14.1"; tss_id "TSS2";
        chr1 Cufflinks exon 569902 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.14.1"; tss_id "TSS2";
        chr1 Cufflinks exon 567008 568410 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "1"; oId "CUFF.14.2"; tss_id "TSS2";
        chr1 Cufflinks exon 569017 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "2"; oId "CUFF.14.2"; tss_id "TSS2";
        chr1 Cufflinks exon 567066 567843 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "1"; oId "CUFF.14.3"; tss_id "TSS2";
        chr1 Cufflinks exon 568627 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
  22. turning “big” data into small data
    aligned reads (~1.5 Tb):
        300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAGGCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQO[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0
    all transcripts in data set (~150 Mb):
        chr1 Cufflinks exon 14765 16672 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.9.1"; tss_id "TSS1";
        chr1 Cufflinks exon 566984 569564 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.14.1"; tss_id "TSS2";
        chr1 Cufflinks exon 569902 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.14.1"; tss_id "TSS2";
        chr1 Cufflinks exon 567008 568410 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "1"; oId "CUFF.14.2"; tss_id "TSS2";
        chr1 Cufflinks exon 569017 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number "2"; oId "CUFF.14.2"; tss_id "TSS2";
        chr1 Cufflinks exon 567066 567843 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "1"; oId "CUFF.14.3"; tss_id "TSS2";
        chr1 Cufflinks exon 568627 570307 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    combine to get expression matrix
  23. GEUVADIS gene expression data: 464 people, 16,808 transcripts. Matrix entries: normalized expression measurement for person i, transcript j. My software: https://github.com/alyssafrazee/ballgown & code for processing this data: https://github.com/alyssafrazee/ballgown_code/tree/master/GEUVADIS_preprocessing
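To make the "person i, transcript j" layout concrete, here is a small Python sketch that combines per-person expression tables into one matrix. It is not the ballgown code linked above; the function `build_expression_matrix`, the sample and transcript names, and the choice to fill missing transcripts with 0.0 are assumptions for illustration only.

```python
from typing import Dict, List, Tuple


def build_expression_matrix(
    per_person: Dict[str, Dict[str, float]]
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Return (people, transcripts, matrix) with matrix[i][j] for person i, transcript j."""
    people = sorted(per_person)
    transcripts = sorted({t for values in per_person.values() for t in values})
    # Assumption: a transcript absent from a person's table has expression 0.0.
    matrix = [
        [per_person[person].get(t, 0.0) for t in transcripts] for person in people
    ]
    return people, transcripts, matrix


# Hypothetical normalized expression values for two people.
per_person = {
    "person_2": {"tx_A": 3.5, "tx_B": 0.8},
    "person_1": {"tx_A": 1.2, "tx_C": 7.0},
}
people, transcripts, matrix = build_expression_matrix(per_person)
for person, row in zip(people, matrix):
    print(person, row)
```

At the scale on the slide (464 people by 16,808 transcripts) a numeric array library would be the more practical container.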
  24. Is population structure also present in gene expression data? populations??

  25. Is population structure also present in gene expression data? No :(
  26. Is population structure also present in gene expression data?

  27. Lab effects [Figure labels: labs 2, 3, 5; labs 1, 4, 6, 7]
  28. Data quality effects

  29. Maybe some population structure

  30. Maybe some population structure

  31. Percent variance explained 16.7% variance explained by PC1 (lab!)
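"Percent variance explained by PC1" is the largest eigenvalue of the data's covariance matrix divided by the sum of all eigenvalues (the total variance). The Python sketch below shows that calculation on a toy matrix with rows as people and columns as transcripts. It uses plain power iteration from a fixed starting vector, which is an assumption made to keep the example dependency-free; the talk does not say how its PCA was computed, and a real analysis would use a linear algebra library.

```python
import math
from typing import List

Matrix = List[List[float]]


def covariance(X: Matrix) -> Matrix:
    """Sample covariance of the columns of X (rows are observations)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    return [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]


def top_eigenvalue(C: Matrix, iters: int = 200) -> float:
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    p = len(C)
    v = [1.0 / math.sqrt(p)] * p  # fixed start; fails if orthogonal to PC1
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    Cv = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
    return sum(v[a] * Cv[a] for a in range(p))  # Rayleigh quotient


def pct_variance_pc1(X: Matrix) -> float:
    C = covariance(X)
    total = sum(C[a][a] for a in range(len(C)))  # trace = total variance
    if total == 0.0:
        raise ValueError("data has no variance")
    return 100.0 * top_eigenvalue(C) / total


# Hypothetical toy data: 4 people, 2 transcripts.
toy = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
pct = pct_variance_pc1(toy)
print(f"PC1 explains {pct:.1f}% of the variance")
```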

  32. Rich Data: more to explore
    • genotype information
    • some individual-level characteristics (e.g. sex)
    • lots of technical information
    • several individuals sequenced in replicate
    • microRNA data
    • raw & processed data available!
  33. Which SNPs affect expression levels of other genes? [Figure axes: expression level for transcript 5000 nucleotides away from Chr1, location 64,122,505; copies of non-canonical nucleotide at Chr1, location 64,122,505]
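One common way to ask whether a SNP is associated with a nearby transcript's expression is to regress the expression level on the number of non-canonical allele copies (0, 1, or 2) and look at the slope. The talk does not state which model was used, so the simple least-squares fit below is an assumption, and the function `fit_line` and the toy numbers are hypothetical.

```python
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length inputs with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0.0:
        raise ValueError("genotype does not vary across people")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data: allele copies and expression for six people.
allele_copies = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope, intercept = fit_line(allele_copies, expression)
print(f"expression changes by about {slope:.2f} per allele copy")
```

A slope near zero would suggest no association; an actual analysis would also need a significance test and adjustment for confounders such as lab and population.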
  34. None
  35. cancer data challenges

  36. cancer data challenges: (1) put reads back where they came from (2) decide how to measure expression. One strategy: count number of reads originating from each gene (expression = 24). ALL BETS ARE OFF
  37. a new strategy Frazee et al 2014, PMID 24398039

  38. Thank you!
    Collaborators: Jeff Leek (advisor), Sarven Sabunciyan, Ben Langmead, Andrew Jaffe, Steven Salzberg, Rafa Irizarry, Kasper Hansen, Geo Pertea, Leonardo Collado Torres
    Code/data:
    This analysis: https://gist.github.com/alyssafrazee/5eb6b13f282a030c471f
    Processed GEUVADIS data: http://files.figshare.com/1625419/fpkm.rda
    Software for analyzing transcript assemblies in R: https://github.com/alyssafrazee/ballgown
  39. Image Credits
    • human diversity: Simon Abrams, CC BY-SA 2.0
    • tumor cells: cnicholsonpath (via Flickr), CC BY-SA 2.0
    • awesome cast: Jennifer Carole, CC BY-NA 2.0
    • transcription: National Library of Medicine, public domain
    • microarrays: Caricato da Schutz, CC BY 2.5
    • cell differentiation: Rasback (via Wikipedia), CC BY-SA 2.5 (I cropped it)
    • fly life cycle: Image Editor (via Flickr), CC BY 2.0
  40. Metzker 2010, PMID 19997069

  41. Metzker 2010, PMID 19997069

  42. [extra] finding per-nt signal [Equation annotations: samples indexed by i; locations indexed by l; j confounders indexed by k; terms: expression, confounders, covariate of interest]
  43. Population structure: DNA data. 1387 people, 197,146 DNA locations where nucleotide varies. Matrix entries: how many copies of a “non-canonical” nucleotide at position j does person i have? (0, 1, or 2)
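The 0/1/2 coding described on slide 43 can be illustrated with a short Python sketch. It assumes each person's genotype at a position is given as a pair of alleles and that the canonical nucleotide at each position is known; the function names, people, and positions below are hypothetical.

```python
from typing import Dict, List, Tuple

Genotype = Tuple[str, str]  # the two alleles a person carries at one position


def allele_count(genotype: Genotype, canonical: str) -> int:
    """Number of copies (0, 1, or 2) that differ from the canonical nucleotide."""
    return sum(1 for allele in genotype if allele != canonical)


def genotype_matrix(
    genotypes: Dict[str, List[Genotype]], canonical: List[str]
) -> List[List[int]]:
    """matrix[i][j] = non-canonical copies for person i at position j."""
    return [
        [allele_count(g, ref) for g, ref in zip(genotypes[person], canonical)]
        for person in sorted(genotypes)
    ]


# Hypothetical toy data: two people, three variable positions.
canonical = ["A", "C", "G"]
genotypes = {
    "person_1": [("A", "A"), ("C", "T"), ("T", "T")],
    "person_2": [("A", "G"), ("C", "C"), ("G", "T")],
}
dna_matrix = genotype_matrix(genotypes, canonical)
print(dna_matrix)
```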