RNA-seq analysis beyond gene counting

Alyssa Frazee
December 15, 2014

Talk I gave for my interview for a tenure-track position in a biostatistics department.

Transcript

  1. Case studies in genomic data science: RNA-seq analysis beyond gene counting
    Alyssa Frazee
    Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
    December 15, 2014

  2. philosophy: answer scientific questions using data-driven methods with interpretable results
    research: RNA-seq: methods + software
    teaching: biostatistics for medical professionals
    software: R, Bioconductor, Python
    blog: alyssafrazee.com

  3. Research goal:
    Find genes that behave differently
    between populations

  4. Gene expression
    DNA: ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    transcription
    RNA: AUCAGUCGAUCACCGAU
    gene’s “expression level” = amount of RNA in cell that was transcribed from that gene

  5. Measuring gene expression: RNA-seq
    RNA-seq
    reads
    Genome
    (DNA)
    RNA
    transcripts

  6. Measuring gene expression: RNA-seq
    RNA-seq
    reads
    Genome
    (DNA)
    RNA
    transcripts
    gene
    exons
    introns
    junctions

  7. @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
    @22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2
    GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT
    +
    @=ABBBBIIIIIIIIHHGGGGIIDBDIIIIIIGIIIIHIIIIHFDD@BBDBGGFIDEE8DCC/29>BGFCGHHHGF
    @22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1
    TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT
    +
    DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>DEGHHIFDDHH8@BEDDI
    @22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2
    AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC
    +
    HIGHIHFHEGE4111:.;8@?@HDIIIIIIIEGGIHHHIIGA?=:FIIIDD8.02506A8=AC#############
    @22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1
    AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC
    +
    B9?@8=42:E@GDEDIIIIIGGHIIIFBEEAGIIDIIDHHGGHIIEGEIIIIIHIHFHFFEEFGGGGGB88>:DGH
    @22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2
    GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA
    +
    IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>D@@GGGIIHIIHGGDDHGBA=ABEG@@DFCCAA<:=>8
    @22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1
    TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC
    +
    HHIIIHIDGG@;=@GIIIIIDDGBBBEDB@8>5554,/':9B@@C?==@1:2@?=GG=;
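
The reads on this slide are in FASTQ format: four lines per record (a header starting with `@`, the sequence, a `+` separator, and a quality string). A minimal sketch of reading such records follows. The helper names `parse_fastq` and `phred_scores` are hypothetical, the example record is a toy one, and the Phred+33 quality offset is an assumption (the slide does not state the encoding).

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ block, skipping blank lines."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, sep, qual = stripped[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record at non-blank line {i + 1}")
        yield FastqRecord(header[1:], seq, qual)


def phred_scores(quality: str, offset: int = 33):
    """Convert a quality string to integer scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality]


# Toy record, not taken from the slide.
example = ["@read1", "ACGT", "+", "II#5"]
records = list(parse_fastq(example))
scores = phred_scores(records[0].quality)
```

Under the assumed offset, `I` maps to a score of 40 and `#` to 2, so low-quality bases stand out as small numbers.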

  8. Goal:
    Find genes that behave differently
    between populations
    1. Can we discover previously unknown
    gene activity?
    2. Can we discover expression differences
    at the transcript level?

  9. Goal:
    Find genes that behave differently
    between populations
    1. Can we discover previously unknown
    gene activity?
    2. Can we discover expression differences
    at the transcript level?

  10. Measuring expression using counts
    expression = 24
    Genome (DNA)
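
Count-based expression assigns each gene the number of aligned reads that overlap it. A minimal sketch with toy coordinates: the intervals and the function name `count_overlapping_reads` are illustrative assumptions, and closed 1-based intervals are assumed.

```python
def count_overlapping_reads(reads, gene_start, gene_end):
    """Count reads whose closed interval [start, end] overlaps the gene."""
    return sum(
        1 for start, end in reads
        if start <= gene_end and end >= gene_start
    )


# Toy data: (start, end) positions of aligned reads on one chromosome.
toy_reads = [(5, 14), (10, 19), (18, 27), (40, 49)]
gene_count = count_overlapping_reads(toy_reads, gene_start=12, gene_end=30)
```

Three of the four toy reads overlap the gene, so its expression under this scheme is a single number, 3, regardless of where inside the gene the reads fall.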

  11. High information loss
    Many possible RNA variants
    Genome (DNA)
    Plus: cannot detect expression outside annotated genes, incorrect annotation causes problems, difficult to study non-canonical genomes (e.g., cancer)

  12. Our solution: DER Finder
    Frazee et al, Biostatistics 2014
    Concept: scan genome base-by-base, highlight regions showing differential expression signal

  13. Read coverage
    coverage
    vector
    2 6 0 11 6
    Genome
    (DNA)
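
A coverage vector stores, for every base, the number of reads aligned over that base. A minimal sketch of computing one from read intervals: the function name `coverage_vector` and the toy reads are assumptions, and closed 1-based intervals are assumed.

```python
def coverage_vector(reads, region_start, region_end):
    """Per-base read depth over the closed region [region_start, region_end]."""
    length = region_end - region_start + 1
    diff = [0] * (length + 1)  # difference array: +1 at read start, -1 past read end
    for start, end in reads:
        lo = max(start, region_start)
        hi = min(end, region_end)
        if lo > hi:
            continue  # read lies entirely outside the region
        diff[lo - region_start] += 1
        diff[hi - region_start + 1] -= 1
    coverage, depth = [], 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage


# Toy reads, not taken from the slide.
toy_reads = [(1, 3), (2, 5), (2, 4), (7, 8)]
toy_coverage = coverage_vector(toy_reads, 1, 8)
```

The difference-array approach touches each read once and each base once, which matters when the region is a whole chromosome.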

  14. DER Finder
    genomic position
    DE signal
    read
    coverage
    idea similar to Jaffe et al, Int J Epidemiol 2012

  15. DER Finder
    DE signal
    read
    coverage
    genomic position

  16. Nucleotide-level signal
    samples indexed by i
    locations indexed by l_j
    confounders indexed by k
    equation term labels: expression, confounders, covariate of interest
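
Only the labels of this slide's equation survive in the transcript. A plausible reconstruction from those labels, offered as an assumption rather than a quotation of the slide: let $Y_{ij}$ be the coverage (expression) for sample $i$ at location $l_j$, $X_i$ the covariate of interest, and $W_{ik}$ the $k$-th confounder. A per-nucleotide linear model then takes the form

```latex
g(Y_{ij}) = \alpha(l_j) + \beta(l_j)\, X_i + \sum_{k} \gamma_k(l_j)\, W_{ik} + \varepsilon_{ij}
```

Here $g$ is a transformation of the coverage (for example a log transform, also an assumption), and the differential expression signal at $l_j$ comes from testing whether $\beta(l_j) = 0$.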

  17. Nucleotide-level signal
    samples indexed by i
    locations indexed by l_j
    confounders indexed by k
    equation term labels: expression, confounders, covariate of interest

  18. Segmentation: Hidden Markov Model
    hidden states (unknown truth): DE, DE, not DE, DE, not DE
    emissions (observed): moderated t-statistics (Smyth 2004): t_1, t_2, t_3, t_4, t_5
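
One common way to recover the most likely sequence of hidden states from observed statistics is the Viterbi algorithm. The sketch below assumes two states, Gaussian emissions, and made-up parameter values. None of these specifics come from the slide, which does not say which decoding method is used, and this toy version only models positive DE signal.

```python
import math


def normal_logpdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely hidden-state path for a sequence of observations."""
    # best[s] = log-probability of the best path ending in state s
    best = {s: log_start[s] + log_emit(s, observations[0]) for s in states}
    backpointers = []
    for obs in observations[1:]:
        pointers, new_best = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            pointers[s] = prev
            new_best[s] = best[prev] + log_trans[prev][s] + log_emit(s, obs)
        backpointers.append(pointers)
        best = new_best
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# All parameter values below are illustrative assumptions.
states = ["not DE", "DE"]
log_start = {"not DE": math.log(0.9), "DE": math.log(0.1)}
log_trans = {
    "not DE": {"not DE": math.log(0.9), "DE": math.log(0.1)},
    "DE": {"not DE": math.log(0.1), "DE": math.log(0.9)},
}
emission_params = {"not DE": (0.0, 1.0), "DE": (4.0, 1.0)}  # (mean, sd)


def log_emit(state, t):
    mean, sd = emission_params[state]
    return normal_logpdf(t, mean, sd)


t_stats = [0.1, -0.3, 4.2, 3.8, 4.5, 0.2]
segmentation = viterbi(t_stats, states, log_start, log_trans, log_emit)
```

Contiguous runs of the `DE` state in `segmentation` play the role of candidate regions.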

  19. Emission distribution
    parameter estimation
    Efron, Statistical Science 2008
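
For reference, the two-groups model of Efron (2008) treats the observed statistics as a mixture of a null and a non-null component. In generic notation (the symbols are not taken from the slide):

```latex
f(t) = \pi_0 f_0(t) + \pi_1 f_1(t), \qquad \pi_0 + \pi_1 = 1
```

where $f_0$ is the density of statistics from null (not DE) positions and $f_1$ the density from non-null (DE) positions. Estimating $\pi_0$, $f_0$, and $f_1$ from the pooled statistics is presumably how the emission distributions for the hidden states are obtained.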

  20. Emission distribution
    parameter estimation

  21. (candidate DERs)
    linear models
    HMM

  22. linear models
    HMM
    permutation tests for statistical significance
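
A permutation test builds a null distribution by shuffling the sample labels and recomputing the statistic. The sketch below uses a difference in mean coverage for a single region as the statistic, which is a simplification chosen for illustration. The function names and data are assumptions, not the procedure from the paper.

```python
import random


def mean_difference(values, labels):
    group1 = [v for v, g in zip(values, labels) if g == 1]
    group0 = [v for v, g in zip(values, labels) if g == 0]
    return sum(group1) / len(group1) - sum(group0) / len(group0)


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(mean_difference(values, shuffled)) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (extreme + 1) / (n_permutations + 1)


# Toy data: average coverage of one candidate region in eight samples.
region_coverage = [10.1, 9.8, 10.4, 9.9, 15.2, 14.8, 15.5, 15.1]
group_labels = [0, 0, 0, 0, 1, 1, 1, 1]
p_value = permutation_p_value(region_coverage, group_labels)
```

With only eight samples there are 70 distinct ways to split them into two groups of four, so the smallest attainable p-value is bounded below by the sample size, not by the number of shuffles.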

  23. match to annotation if desired: CECR1, “may play a role in regulating cell proliferation”

  24. Results: Y chromosome
    Frazee et al, Biostatistics 2014

  25. Goal:
    Find genes that behave differently
    between populations
    1. Can we discover previously unknown
    gene activity? (DER Finder)
    2. Can we discover expression
    differences at the transcript level?

  26. Ideal solution: full reconstruction
    Reads
    Estimated
    Transcripts
    Genome
    (DNA)

  27. Abundance estimation
    expression ≈ 12 for both assembled transcripts
    Genome
    Estimated Transcripts

  28. But: assembly is hard
    Bernard et al, Bioinformatics 2014
    Simulated Data

  29. But: assembly is hard
    Bernard et al, Bioinformatics 2014
    Real Data

  30. P-values behaving badly
    Cuffdiff 2 (Trapnell et al, Nature Biotechnology 2013) on tumor/normal data (Kim et al, PLoS ONE 2013),
    downloaded from InSilico DB (Coletta et al, Genome Biology 2012)

  31. Inherently ambiguous
    some possible assemblies
    Genome

  32. Counting strategy not appropriate
    Genome

  33. Our solution: Ballgown
    Frazee et al, Nature Biotechnology (accepted)
    Concept: software infrastructure and simple, robust statistical techniques improve inference for assemblies

  34. Ballgown
    Frazee et al, Nature Biotechnology (accepted)
    transcriptome assembly pipelines
    R/Bioconductor DE analysis

  35. Defines R data structure for assemblies
    expr
    matrices
    GRanges
    data frames

  36. Defines R data structure for assemblies
    expr
    GRanges
    data frames
    Canonical format for differential expression analysis

  37. Facilitates exploratory analysis

  38. Facilitates exploratory analysis
    Includes functions to get corresponding gene names, match assembled and annotated transcripts, plot assembly alongside annotation, etc.

  39. Differential expression analysis
    drop-in replacement for Cuffdiff
    F-tests comparing nested models
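
An F-test for nested linear models compares the residual sum of squares of a smaller model with that of a larger model containing it. The sketch below handles the simplest case, an intercept-only model against a model with one mean per group. The function names and toy data are assumptions, and this is a generic illustration, not Ballgown's implementation.

```python
def residual_sum_of_squares(values, fitted):
    return sum((v - f) ** 2 for v, f in zip(values, fitted))


def nested_f_statistic(values, groups):
    """F statistic: intercept-only model vs. model with one mean per group."""
    n = len(values)
    grand_mean = sum(values) / n
    rss_null = residual_sum_of_squares(values, [grand_mean] * n)

    group_means = {}
    for g in set(groups):
        members = [v for v, label in zip(values, groups) if label == g]
        group_means[g] = sum(members) / len(members)
    rss_full = residual_sum_of_squares(values, [group_means[g] for g in groups])

    df_extra = len(group_means) - 1  # parameters added by the larger model
    df_resid = n - len(group_means)  # residual degrees of freedom
    f_stat = ((rss_null - rss_full) / df_extra) / (rss_full / df_resid)
    return f_stat, df_extra, df_resid


# Toy expression values for one transcript in two groups of three samples.
expression = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
condition = ["a", "a", "a", "b", "b", "b"]
f_stat, df1, df2 = nested_f_statistic(expression, condition)
```

A p-value would come from the upper tail of the F distribution with `(df1, df2)` degrees of freedom. The same comparison extends to timecourse or confounder-adjusted designs by changing which terms the two models contain.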

  40. Improved accuracy

  41. Flexible and fast
    ● Suitable for transcripts (not count-based)
    ● Enables timecourse and multi-group analyses
    ● Can adjust for confounders or batch effects
    ● Runs in seconds: 0.7 sec on the cancer data set, versus 10 hours for Cuffdiff and 6 hours for EBSeq
    EBSeq: Leng et al, Bioinformatics 2013

  42. Application: Processing the GEUVADIS dataset
    Genetic EUropean VAriation in health and DISease
    Lappalainen et al, Nature 2013; ’t Hoen et al, Nature Biotechnology 2013

  43. turning “big” data into small data
    “Make big data as small as possible as quick as is possible”
    -Robert Gentleman

  44. turning “big” data into small data
    @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
    raw reads (~3 Tb)
    300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0
    101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG
    GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG
    [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ
    O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0
    MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971
    HI:i:0
    aligned reads (~1.5 Tb)
    chr1 Cufflinks exon 14765 16672 . + .
    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number
    "1"; oId "CUFF.9.1"; tss_id "TSS1";
    chr1 Cufflinks exon 566984 569564 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "1"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 569902 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "2"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 567008 568410 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "1"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 569017 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "2"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 567066 567843 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    "1"; oId "CUFF.14.3"; tss_id "TSS2";
    chr1 Cufflinks exon 568627 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    data-driven transcriptome assembly (~150 Mb)
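
The assembly above is stored in GTF format: nine columns per feature, with the last column holding `key "value";` attribute pairs. A minimal sketch of parsing one such line follows. The function name `parse_gtf_line` is hypothetical, tab separators are assumed (the transcript shows spaces), and attribute values containing semicolons are not handled.

```python
def parse_gtf_line(line):
    """Split one GTF line into its nine columns and parse the attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    parsed = {}
    for item in attributes.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        parsed[key] = value.strip().strip('"')
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end), "strand": strand,
        "attributes": parsed,
    }


# First exon shown on the slide, with tabs restored between columns.
example_line = (
    "chr1\tCufflinks\texon\t14765\t16672\t.\t+\t.\t"
    'gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; '
    'exon_number "1"; oId "CUFF.9.1"; tss_id "TSS1";'
)
exon = parse_gtf_line(example_line)
exon_length = exon["end"] - exon["start"] + 1
```

Grouping parsed exons by `transcript_id` recovers the structure of each assembled transcript.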

  45. 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0
    101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG
    GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG
    [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ
    O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0
    MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971
    HI:i:0
    aligned reads (~1.5 Tb)
    chr1 Cufflinks exon 14765 16672 . + .
    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number
    "1"; oId "CUFF.9.1"; tss_id "TSS1";
    chr1 Cufflinks exon 566984 569564 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "1"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 569902 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "2"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 567008 568410 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "1"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 569017 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "2"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 567066 567843 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    "1"; oId "CUFF.14.3"; tss_id "TSS2";
    chr1 Cufflinks exon 568627 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    data-driven transcriptome assembly (~150 Mb)
    ballgown objects (200-600 Mb)
    turning “big” data into small data
    http://figshare.com/articles/GEUVADIS_Processed_Data/1130849

  46. Reproducible; freely available

  47. Download my processed data; save time!
    3653 hours 999 hours 651 hours
    5299 total hours, assuming 4
    cores available

  48. Results: Continuous covariates
    Frazee et al, Nature Biotechnology (accepted)

  49. Results: Timecourse analysis

  50. Future Work
    ● Statistical inference and uncertainty measures for transcript assemblies

  51. Future Work
    ● Statistical inference and uncertainty measures for transcript assemblies
    ● Reproducibility strategies for time-consuming analyses

  52. Future Work
    ● Statistical inference and uncertainty measures for transcript assemblies
    ● Reproducibility strategies for time-consuming analyses
    ● Continue applying and developing methods to solve scientific problems

  53. Thank you!
    Collaborators: Jeff Leek (advisor), Sarven Sabunciyan, Kasper Hansen, Rafa Irizarry, Steven Salzberg, Ben Langmead, Andrew Jaffe, Geo Pertea, Leonardo Collado Torres

  54. References
    Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT (2014). “Differential expression analysis of RNA-seq data at single-base resolution.” Biostatistics 15(3): 413-426.
    Frazee AC, Pertea G, Jaffe AE, Salzberg SL, Leek JT (2015). “Ballgown bridges the gap between transcriptome assembly and expression analysis.” Nature Biotechnology, to appear.
    ’t Hoen PAC et al (2013). “Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories.” Nature Biotechnology 31(11): 1015-22.
    Anders S and Huber W (2010). “Differential expression analysis for sequence count data.” Genome Biology 11(10): R106.
    Bernard E, Jacob L, Mairal J, Vert J (2014). “Efficient RNA isoform identification and quantification from RNA-seq data with network flows.” Bioinformatics 30(17): 2447-2455.
    Efron B (2008). “Microarrays, empirical Bayes, and the two-groups model.” Statistical Science 23(1): 1-22.
    Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, Irizarry RA (2012). “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies.” International Journal of Epidemiology 41(1): 200-209.
    Lappalainen T et al (2013). “Transcriptome and genome sequencing uncovers functional variation in humans.” Nature 501(7468): 506-11.
    Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, Haag JD, Gould MN, Stewart RM, Kendziorski C (2013). “EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments.” Bioinformatics 29(8): 1035-1043.
    Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26(1): 139-40.
    Smyth GK (2004). “Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.” Statistical Applications in Genetics and Molecular Biology 3(1): 3.
    Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L (2013). “Differential analysis of gene regulation at transcript resolution with RNA-seq.” Nature Biotechnology 31(1): 46-53.

  55. Where to find my work
    Papers
    ● DER Finder: Biostatistics [link]
    ● Ballgown: Nature Biotechnology [preprint link]
    ● Polyester: bioRxiv [preprint link]
    ● ReCount: BMC Bioinformatics [link]
    ● RNA-seq book chapter: [link, Chapter 6]
    Software
    https://github.com/alyssafrazee
    ● Ballgown: [link][dev link]
    ● Polyester: [link][dev link]
    ● DER Finder (Leonardo Collado Torres): [link]
    These links, plus side projects and blog posts: alyssafrazee.com

  56. Image Credits
    ● human diversity: Simon Abrams, CC BY-SA 2.0 [link]
    ● tumor cells: cnicholsonpath (via Flickr), CC BY-SA 2.0 [link]
    ● awesome cast: Jennifer Carole, CC BY-NA 2.0 [link]
    ● cell differentiation: Rasback (via Wikipedia), CC BY-SA 2.5 [link] (I cropped it)

  57. But:
    No genes annotated here
    Annotation here does not match data
    Frazee et al, Biostatistics 2014
