Tackling big questions with gene expression data

Alyssa Frazee
October 23, 2014

My data science job talk! I gave the full version of this talk at one company, and a 15-minute version at another.

Transcript

  1. Alyssa Frazee
    Oct 23, 2014
    Tackling big questions with gene expression data

  2. What makes humans diverse?

  3. How does a healthy cell turn into a cancer cell?

  4. How do our bodies deal with injuries?
    What gives a cell its identity?
    How do we grow and develop?

  5. Stories in DNA data
    Novembre et al 2008, PMID 18758442

  6. What is gene expression?

  7. What is gene expression?
    DNA: ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    transcription
    RNA: AUCAGUCGAUCACCGAU
    translation
    protein

  8. What is gene expression?
    DNA: ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    transcription
    RNA: AUCAGUCGAUCACCGAU
    gene’s “expression level” = amount of RNA in cell that was transcribed from that gene

  9. How do we measure gene expression?
    RNA-seq
    reads (50-100 bp long)
    Genome
    RNA transcripts

  10. What does gene expression data look like?
    @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
    @22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2
    GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT
    +
    @=ABBBBIIIIIIIIHHGGGGIIDBDIIIIIIGIIIIHIIIIHFDD@BBDBGGFIDEE8DCC/29>BGFCGHHHGF
    @22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1
    TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT
    +
    DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>DEGHHIFDDHH8@BEDDI
    @22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2
    AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC
    +
    HIGHIHFHEGE4111:.;8@?@HDIIIIIIIEGGIHHHIIGA?=:FIIIDD8.02506A8=AC#############
    @22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1
    AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC
    +
    B9?@8=42:E@GDEDIIIIIGGHIIIFBEEAGIIDIIDHHGGHIIEGEIIIIIHIHFHFFEEFGGGGGB88>:DGH
    @22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2
    GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA
    +
    IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>D@@GGGIIHIIHGGDDHGBA=ABEG@@DFCCAA<:=>8
    @22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1
    TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC
    +
    HHIIIHIDGG@;=@GIIIIIDDGBBBEDB@8>5554,/':9B@@C?==@1:2@?=GG=;
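
The records above follow the four-line FASTQ layout: an `@` header, the read sequence, a `+` separator, and one quality character per base. Below is a minimal Python sketch of reading such records. The names `parse_fastq` and `phred_scores` are made up for illustration, the toy record is invented, and the Phred+33 quality offset is an assumption rather than something the slides state.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines."""
    # Drop blank lines so each record occupies exactly four entries.
    stripped = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at entry {i}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert quality characters to integer scores (offset is an assumption)."""
    return [ord(char) - offset for char in quality]


# Toy record, invented for this example (not taken from the GEUVADIS data).
toy_lines = ["@read1", "ACGTN", "+", "IIH5#"]
records = list(parse_fastq(toy_lines))
for read_id, sequence, quality in records:
    print(read_id, sequence, phred_scores(quality))
```

Real files hold millions of reads, so in practice one would stream from the file handle instead of building a list first.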

  11. How do we measure gene expression?
    RNA-seq
    reads (50-100 bp long)
    Genome
    RNA transcripts (**unknown**)

  12. Data processing
    (1) put reads back where they came from

  13. Data processing
    (1) put reads back where they came from
    (2) decide how to measure expression. One strategy: count number of reads originating from each gene
    expression = 24

  14. Data processing
    (1) put reads back where they came from
    (2) decide how to measure expression. One strategy: count number of reads originating from each gene
    expression = 24
    (lots of complexity here!)
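
The counting strategy in step (2) can be sketched in a few lines of Python. This is an illustration only: the function name `count_reads_per_gene`, the tuple layout, and the coordinates are all invented here, and intervals are treated as closed (both ends included).

```python
def count_reads_per_gene(alignments, genes):
    """Count aligned reads that overlap each gene.

    alignments: iterable of (chrom, start, end) for each aligned read
    genes: dict mapping gene name -> (chrom, start, end)
    """
    counts = {name: 0 for name in genes}
    for chrom, start, end in alignments:
        for name, (gene_chrom, gene_start, gene_end) in genes.items():
            # Two closed intervals overlap when each starts before the other ends.
            if chrom == gene_chrom and start <= gene_end and end >= gene_start:
                counts[name] += 1
    return counts


# Hypothetical genes and read alignments.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 300, 400)}
alignments = [
    ("chr1", 90, 110),   # overlaps geneA
    ("chr1", 150, 180),  # inside geneA
    ("chr1", 250, 260),  # between genes, counted for neither
    ("chr1", 390, 420),  # overlaps geneB
    ("chr2", 100, 200),  # different chromosome
]
counts = count_reads_per_gene(alignments, genes)
print(counts)
```

Part of the complexity the slide mentions shows up even here: a read that overlaps two genes is counted for both, and nothing distinguishes between overlapping transcripts of the same gene.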

  15. Example of a complication: overlapping transcripts from same gene
    DNA: ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    transcripts:
    CGAUCACCGAU
    AUCAGUCGAUCACCGAU
    AUCAGUCGAUC

  16. Data processing
    (1) put reads back where they came from
    (2) decide how to measure expression. One strategy: count number of reads originating from each gene
    expression = 24
    (lots of complexity here!)

  17. Is population structure also present in gene expression data?
    Makes sense that populations share DNA sequences, but do the genes themselves behave differently in different populations?

  18. GEUVADIS gene expression data
    Genetic EUropean VAriation in health and DISease
    Raw gene expression data is publicly available for hundreds of individuals, so of course we couldn’t resist analyzing it ourselves.
    Two papers were published along with the data release: one on scientific/biology findings (Lappalainen et al 2013, PMID 24037378) and one on technical patterns in gene expression data (’t Hoen et al 2013, PMID 24037425)

  19. turning “big” data into small data
    @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
    raw data (~3 Tb)
    300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0
    101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG
    GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG
    [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ
    O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0
    MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971
    HI:i:0
    aligned reads (~1.5 Tb)
    chr1 Cufflinks exon 14765 16672 . + .
    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number
    "1"; oId "CUFF.9.1"; tss_id "TSS1";
    chr1 Cufflinks exon 566984 569564 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "1"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 569902 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "2"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 567008 568410 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "1"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 569017 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "2"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 567066 567843 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    "1"; oId "CUFF.14.3"; tss_id "TSS2";
    chr1 Cufflinks exon 568627 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    all transcripts in data set (~150 Mb)
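
The assembled-transcript lines above are in GTF format: nine tab-separated fields, the last of which holds `key "value";` pairs (the tabs appear as spaces in the slide text). Here is a minimal parsing sketch, assuming well-formed lines. The function name `parse_gtf_line` and the returned dictionary layout are choices made for this example, not part of any library.

```python
import re

# Matches one `key "value";` pair in the GTF attributes column.
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Split one GTF line into its fields and an attribute dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    chrom, source, feature, start, end, _score, strand, _frame, attributes = fields
    return {
        "chrom": chrom,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": dict(ATTRIBUTE.findall(attributes)),
    }


# First exon from the slide, with tabs restored between the nine fields.
line = "\t".join([
    "chr1", "Cufflinks", "exon", "14765", "16672", ".", "+", ".",
    'gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; '
    'exon_number "1"; oId "CUFF.9.1"; tss_id "TSS1";',
])
exon = parse_gtf_line(line)
print(exon["attributes"]["transcript_id"], exon["start"], exon["end"])
```

Grouping parsed exons by `transcript_id` would then give the exon structure of each assembled transcript.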

  20. turning “big” data into small data
    300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0
    101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG
    GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG
    [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ
    O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0
    MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971
    HI:i:0
    aligned reads (~1.5 Tb)
    chr1 Cufflinks exon 14765 16672 . + .
    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number
    "1"; oId "CUFF.9.1"; tss_id "TSS1";
    chr1 Cufflinks exon 566984 569564 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "1"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 569902 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "2"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 567008 568410 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "1"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 569017 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "2"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 567066 567843 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    "1"; oId "CUFF.14.3"; tss_id "TSS2";
    chr1 Cufflinks exon 568627 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    all transcripts in data set (~150 Mb)
    combine to get expression matrix

  21. GEUVADIS gene expression data
    464 people × 16,808 transcripts
    matrix entries: normalized expression measurement for person i, transcript j
    My software: https://github.com/alyssafrazee/ballgown & code: https://github.com/alyssafrazee/ballgown_code/tree/master/GEUVADIS_preprocessing for processing this data

  22. Is population structure also present in gene expression data?
    populations??

  23. Is population structure also present in gene expression data?
    No :(

  24. Is population structure also present in gene expression data?

  25. Lab effects
    labs 2, 3, 5
    labs 1, 4, 6, 7

  26. Data quality effects

  27. Maybe some population structure

  28. Maybe some population structure

  29. Percent variance explained
    16.7% variance explained by PC1 (lab!)
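
For readers unfamiliar with the phrase, "percent variance explained by PC1" is the largest eigenvalue of the centered data's covariance divided by the sum of all eigenvalues. Below is a dependency-free Python sketch of that calculation on a toy matrix. It is not the code behind the slide: the function name `pc1_variance_fraction`, the use of power iteration, and the toy numbers are all assumptions made for illustration.

```python
def pc1_variance_fraction(matrix, iterations=200):
    """Fraction of total variance captured by the first principal component.

    matrix: list of rows (samples), each a list of feature values.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Center every column (feature) at zero.
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    # Sample-by-sample Gram matrix. Its nonzero eigenvalues are proportional to
    # those of the feature covariance matrix, and it stays small when there are
    # far more features than samples.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]
    total = sum(gram[i][i] for i in range(n_rows))
    if total == 0:
        return 0.0
    # Power iteration to approximate the leading eigenvector.
    vec = [float(i + 1) for i in range(n_rows)]
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0:
            return 0.0
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the leading eigenvalue for the unit vector.
    top = sum(v * sum(g * w for g, w in zip(row, vec)) for v, row in zip(vec, gram))
    return top / total


# Toy data: four samples, two features, most spread along the first feature.
toy = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.5], [0.0, -0.5]]
fraction = pc1_variance_fraction(toy)
print(f"PC1 explains about {100 * fraction:.1f}% of the variance")
```

On a real expression matrix one would normally reach for a linear algebra library's SVD; the pure-Python version here only serves to show what the number on the slide measures.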

  30. Rich Data: more to explore
    ● genotype information
    ● some individual-level characteristics (e.g. sex)
    ● lots of technical information
    ● several individuals sequenced in replicate
    ● microRNA data
    ● raw & processed data available!

  31. Which SNPs affect expression levels of other genes?
    Plot axis labels:
    expression level for transcript 5000 nucleotides away from Chr1, location 64,122,505
    copies of non-canonical nucleotide at Chr1, location 64,122,505
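
One simple way to ask whether a SNP is associated with a transcript's expression is to fit a straight line of expression against the number of non-canonical copies (0, 1, or 2) and look at the slope. The sketch below uses ordinary least squares on invented numbers. The function name `fit_line` and the data are hypothetical, and this is not presented as the analysis behind the slide.

```python
def fit_line(genotypes, expression):
    """Least-squares slope and intercept of expression against genotype."""
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    if sxx == 0:
        raise ValueError("genotype does not vary across people")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example: six people, copies of the non-canonical nucleotide and
# the expression level of a nearby transcript.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope, intercept = fit_line(genotypes, expression)
print(f"each extra copy shifts expression by about {slope:.2f}")
```

A real analysis would also adjust for confounders such as the lab effects seen earlier and correct for testing many SNP-transcript pairs.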

  32. cancer data challenges

  33. cancer data challenges
    (1) put reads back where they came from
    (2) decide how to measure expression. One strategy: count number of reads originating from each gene
    expression = 24
    ALL BETS ARE OFF

  34. a new strategy
    Frazee et al 2014, PMID 24398039

  35. Thank you!
    Collaborators: Jeff Leek (advisor), Sarven Sabunciyan, Ben Langmead, Andrew Jaffe, Steven Salzberg, Rafa Irizarry, Kasper Hansen, Geo Pertea, Leonardo Collado Torres
    Code/data:
    This analysis: https://gist.github.com/alyssafrazee/5eb6b13f282a030c471f
    Processed GEUVADIS data: http://files.figshare.com/1625419/fpkm.rda
    Software for analyzing transcript assemblies in R: https://github.com/alyssafrazee/ballgown

  36. Image Credits
    ● human diversity: Simon Abrams, CC BY-SA 2.0 [link]
    ● tumor cells: cnicholsonpath (via Flickr), CC BY-SA 2.0 [link]
    ● awesome cast: Jennifer Carole, CC BY-NA 2.0 [link]
    ● transcription: National Library of Medicine, public domain [link]
    ● microarrays: Caricato da Schutz, CC BY 2.5 [link]
    ● cell differentiation: Rasback (via Wikipedia), CC BY-SA 2.5 [link] (I cropped it)
    ● fly life cycle: Image Editor (via Flickr), CC BY 2.0 [link]

  37. Metzker 2010, PMID 19997069

  38. Metzker 2010, PMID 19997069

  39. [extra] finding per-nt signal
    [model equation not transcribed; annotations: samples indexed by i, locations indexed by l (j), confounders indexed by k; labeled terms: expression, confounders, covariate of interest]

  40. Population structure: DNA data
    1387 people × 197,146 DNA locations where nucleotide varies
    matrix entries: how many copies of a “non-canonical” nucleotide at position j does person i have? (0, 1, or 2)
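
The 0/1/2 coding described above can be sketched as follows. Each person carries two copies of every position, so the matrix entry is the number of those copies that differ from the canonical nucleotide. The names `dosage` and `dosage_matrix`, the two-letter genotype strings, and the sample data are assumptions made for this illustration.

```python
def dosage(genotype, canonical):
    """Number of non-canonical copies (0, 1, or 2) in a two-letter genotype."""
    if len(genotype) != 2:
        raise ValueError("expected exactly two alleles, e.g. 'AG'")
    return sum(1 for allele in genotype if allele != canonical)


def dosage_matrix(genotypes, canonical_alleles):
    """Rows are people (i), columns are variable positions (j)."""
    return [
        [dosage(genotype, canonical) for genotype, canonical in zip(person, canonical_alleles)]
        for person in genotypes
    ]


# Hypothetical data: three people, two variable positions.
canonical_alleles = ["A", "C"]
people = [
    ["AA", "CT"],
    ["AG", "TT"],
    ["GG", "CC"],
]
matrix = dosage_matrix(people, canonical_alleles)
for row in matrix:
    print(row)
```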
