Tackling big questions with gene expression data

Alyssa Frazee
October 23, 2014

My data science job talk! I gave the full version of this talk at one company, and a 15-minute version at another.

Transcript

  1. Alyssa Frazee
    Oct 23, 2014
    Tackling big questions
    with gene expression data

  2. What makes humans diverse?

  3. How does a healthy cell turn into a cancer cell?

  4. How do our bodies deal with injuries?
    What gives a cell its identity?
    How do we grow and develop?

  5. In a cell:

  6. Stories in DNA data
    Novembre et al 2008, PMID 18758442

  7. What is gene expression?

  8. What is gene expression?
    DNA: ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    transcription
    RNA: AUCAGUCGAUCACCGAU
    translation
    protein

  9. What is gene expression?
    DNA: ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    transcription
    RNA: AUCAGUCGAUCACCGAU
    gene’s “expression level” = amount of RNA in cell that was transcribed from that gene
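
The transcription step in the diagram can be sketched in a few lines. This is a minimal illustration, assuming the DNA string shown is the coding strand, so the RNA carries the same letters with U in place of T. The function name `transcribe` is made up for this example.

```python
def transcribe(coding_strand: str) -> str:
    """Return the RNA sequence for a DNA coding strand (T becomes U)."""
    dna = coding_strand.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return dna.replace("T", "U")


# A short stretch taken from the DNA sequence shown on the slide.
rna = transcribe("ATCAGTCGATC")
print(rna)
```

A gene's expression level would then be the number of such RNA copies present in the cell, which is what the rest of the talk sets out to measure.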

  10. How do we measure gene expression?
    RNA-seq
    reads (50-100 bp long)
    Genome
    RNA transcripts

  11. What does gene expression data look like?
    @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGG[…]@DDHHD
    @22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2
    GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT
    +
    @[…]/29>BGFCGHHHGF
    @22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1
    TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT
    +
    DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>[…]
    @22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2
    AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC
    +
    HIGHIHFHEGE4111:.;[…]?=:FIIIDD8.02506A8=AC#############
    @22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1
    AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC
    +
    […]=42:[…]>:DGH
    @22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2
    GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA
    +
    IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>[…]@[…]@DFCCAA8
    @22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1
    TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC
    +
    […];[…]@8>5554,/':[…]@[…]:[…]?=GG=;
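
The reads above are in FASTQ format: each record is four lines (an `@` header, the sequence, a `+` separator, and one quality character per base). A minimal parsing sketch follows. The record in the example is made up, and the Phred+33 quality offset is an assumption, since the slide does not say which encoding was used.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (s.rstrip("\n") for s in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        yield header[1:], seq, qual


def phred_scores(qual: str, offset: int = 33) -> List[int]:
    """Decode a quality string; offset 33 is an assumption about the encoding."""
    return [ord(c) - offset for c in qual]


# Hypothetical record, not taken from the slide.
example = ["@read1", "ACGTN", "+", "IIH5#"]
for read_id, seq, qual in parse_fastq(example):
    print(read_id, seq, phred_scores(qual))
```

The length check matters in practice: a quality string that is shorter than its sequence usually means the file was damaged somewhere between the sequencer and the analysis.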

  12. How do we measure gene expression?
    RNA-seq
    reads (50-100 bp long)
    Genome
    RNA transcripts (**unknown**)

  13. Data processing
    (1) put reads back where they came from

  14. Data processing
    (1) put reads back where they came from
    (2) decide how to measure expression. One strategy: count number of reads originating from each gene
    expression = 24

  15. Data processing
    (1) put reads back where they came from
    (2) decide how to measure expression. One strategy: count number of reads originating from each gene
    expression = 24
    (lots of complexity here!)
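
The counting strategy in step (2) can be sketched as follows, assuming step (1) has already produced a chromosome and leftmost position for every read. The gene coordinates, the read positions, and the rule that a read must fall entirely inside a gene are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical gene annotation: name -> (chromosome, start, end), 1-based inclusive.
GENES: Dict[str, Tuple[str, int, int]] = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}


def count_reads(alignments: List[Tuple[str, int]], read_length: int) -> Counter:
    """Count reads that fall entirely inside each gene."""
    counts: Counter = Counter({name: 0 for name in GENES})
    for chrom, start in alignments:
        end = start + read_length - 1
        for name, (g_chrom, g_start, g_end) in GENES.items():
            if chrom == g_chrom and g_start <= start and end <= g_end:
                counts[name] += 1
    return counts


# Hypothetical aligned reads: (chromosome, leftmost position).
reads = [("chr1", 120), ("chr1", 450), ("chr1", 460), ("chr1", 900), ("chr2", 150)]
print(count_reads(reads, read_length=50))
```

The read starting at position 460 runs past the end of `geneA` and is silently dropped under this rule, which is one small instance of the complexity the slide mentions.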

  16. Example of a complication: overlapping transcripts from same gene
    DNA: ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    RNA transcripts:
    AUCAGUCGAUCACCGAU
    AUCAGUCGAUC
    CGAUCACCGAU
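
To see why overlapping transcripts complicate counting, the sketch below checks which of the three transcripts on the slide could have produced a given read. The transcript labels (`long`, `left`, `right`) and the short reads are made up for illustration. Real reads are 50-100 bp and are usually compared as DNA letters rather than RNA letters.

```python
from typing import List

# The three RNA transcripts shown on the slide, with made-up labels.
TRANSCRIPTS = {
    "long": "AUCAGUCGAUCACCGAU",
    "left": "AUCAGUCGAUC",
    "right": "CGAUCACCGAU",
}


def compatible_transcripts(read: str) -> List[str]:
    """Return the labels of all transcripts that contain the read."""
    return [name for name, seq in TRANSCRIPTS.items() if read in seq]


# Hypothetical short reads.
for read in ["CAGUCG", "CGAUC", "CACCGA"]:
    print(read, compatible_transcripts(read))
```

Every read here is compatible with more than one transcript, so a simple count cannot say which transcript it came from.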

  17. Data processing
    (1) put reads back where they came from
    (2) decide how to measure expression. One strategy: count number of reads originating from each gene
    expression = 24
    (lots of complexity here!)

  18. (no text on this slide)

  19. Is population structure also present in gene expression data?
    Makes sense that populations share DNA sequences, but do the genes themselves behave differently in different populations?

  20. GEUVADIS gene expression data
    Genetic EUropean VAriation in health and DISease
    Raw gene expression data is publicly available for hundreds of individuals, so of course we couldn’t resist analyzing it ourselves.
    Two papers were published along with the data release: one on scientific/biology findings (Lappalainen et al 2013, PMID 24037378) and one on technical patterns in gene expression data (’t Hoen et al 2013, PMID 24037425)

  21. turning “big” data into small data
    @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGG[…]@DDHHD
    [record repeated 12 times on the slide]
    raw data (~3 Tb)
    300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0
    101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG
    GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG
    [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ
    O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0
    MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971
    HI:i:0
    [record repeated 8 times on the slide]
    aligned reads (~1.5 Tb)
    chr1 Cufflinks exon 14765 16672 . + .
    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number
    "1"; oId "CUFF.9.1"; tss_id "TSS1";
    chr1 Cufflinks exon 566984 569564 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "1"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 569902 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "2"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 567008 568410 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "1"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 569017 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "2"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 567066 567843 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    "1"; oId "CUFF.14.3"; tss_id "TSS2";
    chr1 Cufflinks exon 568627 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    all transcripts in data set (~150 Mb)
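
The aligned reads on this slide are SAM records: 11 mandatory tab-separated fields followed by optional `TAG:TYPE:VALUE` entries. A minimal parsing sketch follows, using a shortened, hypothetical record with the same field layout as the slide. The function name and the returned dictionary keys are made up for illustration.

```python
from typing import Dict

# Shortened, hypothetical record using the same field layout as the slide
# (the 101-base sequence and quality string are cut to 8 characters here).
SAM_LINE = "read1\t256\t1\t19282\t0\t8M\t*\t0\t0\tGCGAGCCT\t[_J\\ccce\tNM:i:0\tNH:i:5"


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one SAM alignment line into selected mandatory fields plus tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    tags = {}
    for tag in fields[11:]:
        name, kind, value = tag.split(":", 2)
        tags[name] = int(value) if kind == "i" else value
    flag = int(fields[1])
    return {
        "read": fields[0],
        "chrom": fields[2],
        "pos": int(fields[3]),            # 1-based leftmost position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "secondary": bool(flag & 0x100),  # flag bit 256: secondary alignment
        "tags": tags,
    }


record = parse_sam_line(SAM_LINE)
print(record["chrom"], record["pos"], record["secondary"], record["tags"]["NH"])
```

In the slide's record, flag 256 marks a secondary alignment and `NH:i:5` says the read aligned to five places, which is another reminder that putting reads back where they came from is not always unambiguous.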

  22. turning “big” data into small data
    300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0
    101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG
    GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG
    [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ
    O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0
    MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971
    HI:i:0
    [record repeated 8 times on the slide]
    aligned reads (~1.5 Tb)
    chr1 Cufflinks exon 14765 16672 . + .
    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number
    "1"; oId "CUFF.9.1"; tss_id "TSS1";
    chr1 Cufflinks exon 566984 569564 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "1"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 569902 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "2"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 567008 568410 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "1"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 569017 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "2"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 567066 567843 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    "1"; oId "CUFF.14.3"; tss_id "TSS2";
    chr1 Cufflinks exon 568627 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    all transcripts in data set (~150 Mb)
    combine
    to get
    expression
    matrix
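
Combining the two inputs starts with reading the assembled transcript records, which are in GTF format: nine tab-separated columns, the last holding `key "value";` attributes. A minimal sketch using the first record from the slide follows. The function name and return shape are made up for illustration.

```python
import re
from typing import Dict, Tuple

# First annotation record from the slide, written with the tab separators GTF uses.
GTF_LINE = (
    "chr1\tCufflinks\texon\t14765\t16672\t.\t+\t.\t"
    'gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; '
    'exon_number "1"; oId "CUFF.9.1"; tss_id "TSS1";'
)


def parse_gtf_line(line: str) -> Tuple[str, int, int, str, Dict[str, str]]:
    """Return (chromosome, start, end, strand, attributes) for one GTF line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns")
    attributes = dict(re.findall(r'(\S+) "([^"]*)";', cols[8]))
    return cols[0], int(cols[3]), int(cols[4]), cols[6], attributes


chrom, start, end, strand, attrs = parse_gtf_line(GTF_LINE)
# GTF coordinates are 1-based and inclusive, hence the + 1 for exon length.
print(attrs["transcript_id"], chrom, start, end, end - start + 1)
```

Grouping exons by `transcript_id` gives the transcript structures, and the aligned reads that overlap them supply the numbers that go into the expression matrix.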

  23. GEUVADIS gene expression data
    464 people
    16,808 transcripts
    matrix entries: normalized expression measurement for person i, transcript j
    My software: https://github.com/alyssafrazee/ballgown & code for processing this data: https://github.com/alyssafrazee/ballgown_code/tree/master/GEUVADIS_preprocessing
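
A sketch of how one matrix entry could be computed, assuming the normalized measurement is FPKM (fragments per kilobase of transcript per million mapped fragments), which matches the name of the processed data file linked at the end of the talk. The counts and transcript lengths are made up, and using the row total as the number of mapped fragments is a simplification.

```python
from typing import List


def fpkm(count: float, length_bp: int, total_mapped: float) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return count * 1e9 / (length_bp * total_mapped)


def expression_matrix(counts: List[List[float]], lengths: List[int]) -> List[List[float]]:
    """Rows are people (i), columns are transcripts (j)."""
    matrix = []
    for person_counts in counts:
        total = sum(person_counts)  # simplification: total over the listed transcripts only
        matrix.append([fpkm(c, n, total) for c, n in zip(person_counts, lengths)])
    return matrix


# Hypothetical fragment counts for 2 people and 3 transcripts.
counts = [[100, 300, 600], [50, 50, 900]]
lengths = [1000, 2000, 3000]
for row in expression_matrix(counts, lengths):
    print([round(x, 1) for x in row])
```

Dividing by transcript length and by sequencing depth is what makes entries comparable across transcripts and across people.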

  24. Is population structure also present in
    gene expression data?
    populations??

  25. Is population structure also present in
    gene expression data?
    No :(

  26. Is population structure also present in
    gene expression data?

  27. Lab effects
    labs 2, 3, 5
    labs 1, 4, 6, 7

  28. Data quality effects

  29. Maybe some population structure

  30. Maybe some population structure

  31. Percent variance explained
    16.7% variance explained by PC1 (lab!)
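
The figure quoted here is the largest eigenvalue of the covariance matrix divided by the sum of all eigenvalues. A standard-library sketch follows, using power iteration as a stand-in for a proper SVD routine. The four-sample data set is made up, and this is not the code used for the talk.

```python
import math
from typing import List


def pc1_variance_explained(rows: List[List[float]], iterations: int = 500) -> float:
    """Fraction of total variance captured by the first principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the columns (p x p).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    total_variance = sum(cov[a][a] for a in range(p))
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: the eigenvalue that goes with the unit vector v.
    top = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    return top / total_variance


# Hypothetical data: 4 samples, 2 features, most of the spread along the first feature.
samples = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(f"{pc1_variance_explained(samples):.1%}")
```

With thousands of transcripts a numerical library's SVD would be the practical choice. The point of the sketch is only what the percentage means.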

  32. Rich Data: more to explore
    ● genotype information
    ● some individual-level characteristics (e.g. sex)
    ● lots of technical information
    ● several individuals sequenced in replicate
    ● microRNA data
    ● raw & processed data available!

  33. Which SNPs affect expression levels of other genes?
    expression level for transcript 5000 nucleotides away from Chr1, location 64,122,505
    copies of non-canonical nucleotide at Chr1, location 64,122,505
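
One simple way to ask whether a SNP affects a nearby transcript is to regress expression on the number of non-canonical copies (0, 1, or 2) and look at the slope. The sketch below uses ordinary least squares on made-up data. It is an illustration of the idea, not necessarily the analysis behind the slide.

```python
from typing import List, Tuple


def fit_line(genotypes: List[int], expression: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit: expression = intercept + slope * genotype."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    slope = sxy / sxx
    return mean_e - slope * mean_g, slope


# Hypothetical data: copies of the non-canonical nucleotide (0, 1, or 2)
# and expression of a nearby transcript for six people.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 3.0, 4.0, 6.0, 7.0, 9.0]
intercept, slope = fit_line(genotypes, expression)
print(f"intercept={intercept:.2f} slope={slope:.2f}")
```

A slope far from zero, relative to its uncertainty, would suggest that each extra copy of the variant shifts expression of that transcript.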

  34. (no text on this slide)

  35. cancer data challenges

  36. cancer data challenges
    (1) put reads back where they came from
    (2) decide how to measure expression. One strategy: count number of reads originating from each gene
    expression = 24
    ALL BETS ARE OFF

  37. a new strategy
    Frazee et al 2014, PMID 24398039

  38. Thank you!
    Collaborators: Jeff Leek (advisor), Sarven Sabunciyan, Ben Langmead, Andrew Jaffe, Steven Salzberg, Rafa Irizarry, Kasper Hansen, Geo Pertea, Leonardo Collado Torres
    Code/data:
    This analysis: https://gist.github.com/alyssafrazee/5eb6b13f282a030c471f
    Processed GEUVADIS data: http://files.figshare.com/1625419/fpkm.rda
    Software for analyzing transcript assemblies in R: https://github.com/alyssafrazee/ballgown

  39. Image Credits
    ● human diversity: Simon Abrams, CC BY-SA 2.0 [link]
    ● tumor cells: cnicholsonpath (via Flickr), CC BY-SA 2.0 [link]
    ● awesome cast: Jennifer Carole, CC BY-NA 2.0 [link]
    ● transcription: National Library of Medicine, public domain [link]
    ● microarrays: Caricato da Schutz, CC BY 2.5 [link]
    ● cell differentiation: Rasback (via Wikipedia), CC BY-SA 2.5 [link] (I cropped it)
    ● fly life cycle: Image Editor (via Flickr), CC BY 2.0 [link]

  40. Metzker 2010, PMID 19997069

  41. Metzker 2010, PMID 19997069

  42. [extra] finding per-nt signal
    samples indexed by i
    locations indexed by l_j
    confounders indexed by k
    equation term labels: expression, covariate of interest, confounders
  43. Population structure: DNA data
    1387 people
    197,146 DNA locations where nucleotide varies
    matrix entries: how many copies of a “non-canonical” nucleotide at position j does person i have? (0, 1, or 2)
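
A small sketch of how such a matrix could be filled in, assuming each person's genotype at a position is given as the two nucleotides they carry and that one nucleotide per position is designated canonical. All names and data below are hypothetical.

```python
from typing import Dict, List

# Hypothetical canonical nucleotide at three variable positions.
CANONICAL = ["A", "C", "G"]

# Hypothetical genotypes: for each person, the two nucleotides carried at each position.
GENOTYPES: Dict[str, List[str]] = {
    "person1": ["AA", "CT", "GG"],
    "person2": ["AG", "TT", "GG"],
}


def non_canonical_counts(calls: List[str]) -> List[int]:
    """Count copies (0, 1, or 2) of a non-canonical nucleotide at each position."""
    return [sum(base != ref for base in call) for call, ref in zip(calls, CANONICAL)]


# Rows are people (i), columns are positions (j).
matrix = [non_canonical_counts(calls) for calls in GENOTYPES.values()]
for person, row in zip(GENOTYPES, matrix):
    print(person, row)
```

The real matrix has the same shape, with 1387 rows and 197,146 columns.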
