RNA-seq analysis beyond gene counting

Alyssa Frazee
December 15, 2014

Talk I gave for my interview for a tenure-track position in a biostatistics department.

Transcript

  1. Case studies in genomic
    data science:
    RNA-seq analysis beyond
    gene counting
    Alyssa Frazee
    Department of Biostatistics, Johns Hopkins
    Bloomberg School of Public Health
    December 15, 2014

  2. philosophy: answer scientific questions using data-driven methods with interpretable results
    research: RNA-seq: methods + software
    teaching: biostatistics for medical professionals
    software: R, Bioconductor, Python
    blog: alyssafrazee.com

  3. Research goal:
    Find genes that behave differently
    between populations

  4. (no text on slide)

  5. Gene expression
    DNA: ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    transcription
    RNA: AUCAGUCGAUCACCGAU
    gene’s “expression level” = amount of RNA in cell that was transcribed from that gene

  6. Measuring gene expression: RNA-seq
    RNA-seq
    reads
    Genome
    (DNA)
    RNA
    transcripts

  7. Measuring gene expression: RNA-seq
    RNA-seq
    reads
    Genome
    (DNA)
    RNA
    transcripts
    gene
    exons
    introns
    junctions

  8. @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:[email protected]@DDHHD
    @22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2
    GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT
    +
    @[email protected]/29>BGFCGHHHGF
    @22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1
    TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT
    +
    DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>[email protected]
    @22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2
    AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC
    +
    HIGHIHFHEGE4111:.;[email protected][email protected]?=:FIIIDD8.02506A8=AC#############
    @22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1
    AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC
    +
    [email protected]=42:[email protected]>:DGH
    @22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2
    GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA
    +
    IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>[email protected]@[email protected]@DFCCAA<:=>8
    @22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1
    TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC
    +
    [email protected];[email protected]@8>5554,/':[email protected]@[email protected]:[email protected]?=GG=;
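
The records on this slide follow the four-line FASTQ layout (header starting with `@`, sequence, `+` separator, per-base quality string). A minimal sketch of reading such records, assuming Phred+33 quality encoding; the function name `read_fastq` and the toy record are hypothetical and not taken from the talk:

```python
def read_fastq(lines):
    """Yield (name, sequence, qualities) from an iterable of FASTQ lines."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]


# Toy record for illustration only (not one of the reads on the slide).
toy = ["@read1/1", "ACGTN", "+", "II5#!"]
records = list(read_fastq(toy))
```

The length check matters in practice: a quality string that does not match its sequence length usually signals a truncated or corrupted file.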

  9. Goal:
    Find genes that behave differently
    between populations
    1. Can we discover previously unknown
    gene activity?
    2. Can we discover expression differences
    at the transcript level?

  10. Goal:
    Find genes that behave differently
    between populations
    1. Can we discover previously unknown
    gene activity?
    2. Can we discover expression differences
    at the transcript level?

  11. Measuring expression using counts
    Genome (DNA)
    expression = 24
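
Count-based expression can be sketched as the number of aligned reads that overlap an annotated gene. The following is an illustration under assumptions: the coordinates, the read list, and the function name `count_overlapping_reads` are hypothetical, and real counting tools handle strand, multi-mapping reads, and exon structure, which this sketch ignores.

```python
def count_overlapping_reads(reads, gene_start, gene_end):
    """Count reads that overlap the closed interval [gene_start, gene_end].

    Each read is a (start, end) pair of 1-based inclusive coordinates.
    """
    count = 0
    for read_start, read_end in reads:
        # Two closed intervals overlap unless one ends before the other starts.
        if read_start <= gene_end and read_end >= gene_start:
            count += 1
    return count


# Hypothetical aligned reads and gene coordinates, for illustration only.
reads = [(5, 80), (90, 165), (150, 225), (400, 475), (990, 1065)]
gene = (100, 1000)
expression = count_overlapping_reads(reads, *gene)
```

With these toy reads, four of the five overlap the gene, so the gene's count is 4. Every read contributes a single unit regardless of which transcript produced it.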

  12. High information loss
    Many
    possible RNA
    variants
    Genome
    (DNA)
    Plus: cannot detect expression outside annotated genes, incorrect
    annotation causes problems, difficult to study non-canonical genomes
    (e.g., cancer)

  13. Our solution: DER Finder
    Frazee et al, Biostatistics 2014
    Concept: scan genome base-by-base, highlight regions showing differential expression signal

  14. Read coverage
    coverage vector: 2 6 0 11 6
    Genome (DNA)
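
A coverage vector records, for each base, how many aligned reads span it. A minimal sketch, assuming reads are given as 1-based inclusive (start, end) pairs; the function name `coverage_vector` and the toy reads are hypothetical:

```python
def coverage_vector(reads, region_start, region_end):
    """Return per-base read coverage over [region_start, region_end] (1-based, inclusive)."""
    length = region_end - region_start + 1
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (length + 1)
    for read_start, read_end in reads:
        lo = max(read_start, region_start)
        hi = min(read_end, region_end)
        if lo > hi:
            continue  # read lies entirely outside the region
        diff[lo - region_start] += 1
        diff[hi - region_start + 1] -= 1
    # A running sum of the differences gives the coverage at each base.
    coverage, running = [], 0
    for delta in diff[:length]:
        running += delta
        coverage.append(running)
    return coverage


# Toy reads for illustration only.
reads = [(1, 3), (2, 5), (2, 2), (5, 6)]
cov = coverage_vector(reads, 1, 6)
```

For these toy reads the result is `[1, 3, 2, 1, 2, 1]`. The difference-array approach does constant work per read plus one pass over the region, which matters when a region holds millions of reads.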

  15. DER Finder
    genomic position
    DE signal
    read
    coverage
    idea similar to Jaffe et al, Int J Epidemiol 2012

  16. DER Finder
    DE signal
    read
    coverage
    genomic position

  17. Nucleotide-level signal
    samples indexed by i
    locations indexed by l_j
    confounders indexed by k
    model terms: expression, covariate of interest, confounders

  18. Nucleotide-level signal
    samples indexed by i
    locations indexed by l_j
    confounders indexed by k
    model terms: expression, covariate of interest, confounders
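
The labels on this slide match the terms of a per-base linear model. The following is a reconstruction under assumptions, since the equation itself is not in the transcript: the symbols $Y$, $X$, $W$, $g$, $\alpha$, $\beta$, $\gamma$ and $\varepsilon$ are chosen here for illustration, and only the indices $i$, $l_j$ and $k$ come from the slide.

```latex
g(Y_{ij}) = \alpha(l_j) + \beta(l_j)\, X_i + \sum_{k=1}^{K} \gamma_k(l_j)\, W_{ik} + \varepsilon_{ij}
```

Under this reading, $Y_{ij}$ is the read coverage (expression) of sample $i$ at location $l_j$, $g$ is a transformation such as a log, $X_i$ is the covariate of interest, and $W_{ik}$ are the confounders. A test of $\beta(l_j) = 0$ at each base would then give the DE signal at that position.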

  19. Segmentation: Hidden Markov Model
    hidden states (unknown truth): DE, DE, not DE, DE, not DE
    emissions (observed): moderated t-statistics t_1, t_2, t_3, t_4, t_5 (Smyth 2004)
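
Segmenting a sequence of per-base t-statistics into DE and not-DE runs can be sketched with the Viterbi algorithm. This is a simplified illustration, not the DER Finder implementation: it assumes two states with Gaussian emissions, a DE state that models positive t-statistics only, and made-up transition and emission parameters. All names (`viterbi`, `EMISSION`, `t_stats`) are hypothetical.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state path for a sequence of observations."""
    # best[t][s]: log-probability of the best path ending in state s at step t.
    best = [{s: log_start[s] + log_emit(s, observations[0]) for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit(s, obs)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Illustrative parameters (assumptions, not estimates from data).
STATES = ("not DE", "DE")
EMISSION = {"not DE": (0.0, 1.0), "DE": (5.0, 1.5)}  # (mean, sd) of the t-statistic
log_start = {s: math.log(0.5) for s in STATES}
log_trans = {
    "not DE": {"not DE": math.log(0.9), "DE": math.log(0.1)},
    "DE": {"not DE": math.log(0.1), "DE": math.log(0.9)},
}


def log_emit(state, t_stat):
    mean, sd = EMISSION[state]
    return normal_logpdf(t_stat, mean, sd)


t_stats = [4.8, 5.5, 0.3, -0.4, 0.1, 6.1, 5.2, 0.2]
path = viterbi(t_stats, STATES, log_start, log_trans, log_emit)
```

Consecutive bases that share the "DE" label form a candidate region. The high self-transition probability discourages the path from switching state on a single noisy base.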

  20. Emission distribution
    parameter estimation
    Efron, Statistical Science 2008

  21. Emission distribution
    parameter estimation

  22. (candidate DERs)
    linear
    models
    HMM

  23. linear
    models
    HMM
    permutation tests
    for statistical
    significance
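
A permutation test for one candidate region can be sketched as follows. This is a simplified illustration under assumptions: it uses the absolute difference in group means as the region statistic and enumerates every relabeling of the samples, whereas the full pipeline would re-run the linear-model and HMM steps on permuted labels. The function name and data are hypothetical.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(values, group_a_indices):
    """Exact two-sided permutation p-value for a difference in group means."""
    n = len(values)
    k = len(group_a_indices)

    def abs_mean_diff(a_indices):
        a_set = set(a_indices)
        a = [values[i] for i in a_set]
        b = [values[i] for i in range(n) if i not in a_set]
        return abs(mean(a) - mean(b))

    observed = abs_mean_diff(group_a_indices)
    # Enumerate every way of labeling k of the n samples as group A.
    null_stats = [abs_mean_diff(c) for c in combinations(range(n), k)]
    # Small tolerance so the observed labeling always counts itself.
    extreme = sum(1 for s in null_stats if s >= observed - 1e-12)
    return extreme / len(null_stats)


# Hypothetical average coverage of one candidate region in 8 samples.
region_coverage = [10.1, 11.3, 9.8, 10.6, 3.2, 2.9, 4.1, 3.5]
p = permutation_p_value(region_coverage, group_a_indices=[0, 1, 2, 3])
```

With 4 samples per group there are 70 relabelings, and only the observed split and its mirror image separate the groups completely, so the p-value here is 2/70. Exact enumeration is feasible only for small studies; larger ones would sample permutations at random.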

  24. match to
    annotation if
    desired:
    CECR1, “may
    play a role in
    regulating cell
    proliferation”

  25. Results: Y chromosome
    Frazee et al, Biostatistics 2014

  26. Goal:
    Find genes that behave differently
    between populations
    1. Can we discover previously unknown
    gene activity? (DER Finder)
    2. Can we discover expression
    differences at the transcript level?

  27. Ideal solution: full reconstruction
    Reads
    Estimated
    Transcripts
    Genome
    (DNA)

  28. Abundance estimation
    expression ≈ 12 for both
    assembled transcripts
    Genome
    Estimated
    Transcripts

  29. But: assembly is hard
    Bernard et al, Bioinformatics 2014
    Simulated Data

  30. But: assembly is hard
    Bernard et al, Bioinformatics 2014
    Real Data

  31. P-values behaving badly
    Cuffdiff 2 (Trapnell et al, Nature Biotechnology 2013) on tumor/normal data (Kim et al, PLoS ONE 2013),
    downloaded from InSilico DB (Coletta et al, Genome Biology 2012)

  32. Inherently ambiguous
    some possible assemblies
    Genome

  33. Counting strategy not appropriate
    Genome

  34. Our solution: Ballgown
    Frazee et al, Nature Biotechnology (accepted)
    Concept: software infrastructure and simple, robust statistical techniques improve inference for assemblies

  35. Ballgown
    Frazee et al, Nature Biotechnology (accepted)
    transcriptome
    assembly
    pipelines
    R/Bioconductor
    DE analysis

  36. Defines R data structure for assemblies
    expr
    matrices
    GRanges
    data frames

  37. Defines R data structure for assemblies
    expr
    GRanges
    data frames
    Canonical format
    for differential
    expression analysis

  38. Facilitates exploratory analysis

  39. Facilitates exploratory analysis
    Includes functions to get
    corresponding gene
    names, match assembled
    and annotated transcripts,
    plot assembly alongside
    annotation, etc.

  40. Differential expression analysis
    drop-in replacement for Cuffdiff
    F-tests comparing nested models
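
An F-test for nested models compares how much the residual sum of squares drops when the covariate of interest is added. A minimal sketch for the simplest case, an intercept-only null model against a model with one covariate; the function name `f_statistic_nested` and the data are hypothetical, and this is not Ballgown's code:

```python
def f_statistic_nested(y, x):
    """F statistic comparing y ~ 1 (null) against y ~ 1 + x (alternative)."""
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    # Null model: intercept only.
    rss_null = sum((yi - y_bar) ** 2 for yi in y)
    # Alternative model: least-squares fit of y on x.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss_alt = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    df_num = 1      # parameters added by the alternative model
    df_den = n - 2  # residual degrees of freedom of the alternative model
    return ((rss_null - rss_alt) / df_num) / (rss_alt / df_den)


# Hypothetical log-scale expression of one transcript; group coded as 0/1.
expression = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
group = [0, 0, 0, 1, 1, 1]
f_stat = f_statistic_nested(expression, group)
```

Here the statistic is 24 on (1, 4) degrees of freedom; a p-value would come from the upper tail of the corresponding F distribution. Adding confounders to both models, or several columns for a timecourse or multi-group design, changes only the two design matrices and the degrees of freedom.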

  41. Improved accuracy

  42. Flexible and fast
    ● Suitable for transcripts (not count-based)
    ● Enables timecourse and multi-group
    analyses
    ● Can adjust for confounders or batch
    effects
    ● Runs in seconds: on the cancer data set, Ballgown took 0.7
    sec, versus 10 hours for Cuffdiff and 6 hours for EBSeq
    EBSeq: Leng et al, Bioinformatics 2013

  43. Application: Processing the
    GEUVADIS dataset
    Genetic EUropean VAriation in health and DISease
    Lappalainen et al, Nature 2013; ’t Hoen et al, Nature Biotechnology 2013

  44. turning “big” data into small data
    “Make big data as small as
    possible as quick as is possible”
    -Robert Gentleman

  45. turning “big” data into small data
    @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:[email protected]@DDHHD
    raw reads (~3 Tb)
    300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0
    101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG
    GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG
    [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ
    O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0
    MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971
    HI:i:0
    aligned reads (~1.5 Tb)
    chr1 Cufflinks exon 14765 16672 . + .
    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number
    "1"; oId "CUFF.9.1"; tss_id "TSS1";
    chr1 Cufflinks exon 566984 569564 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "1"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 569902 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "2"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 567008 568410 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "1"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 569017 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "2"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 567066 567843 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    "1"; oId "CUFF.14.3"; tss_id "TSS2";
    chr1 Cufflinks exon 568627 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    data-driven transcriptome assembly (~150 Mb)
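
The assembly records above are in GTF format: nine tab-separated columns, the last of which holds `key "value";` attribute pairs (the transcript shows spaces where the file has tabs). A minimal sketch of parsing one such line; the function name `parse_gtf_line` is hypothetical, and the example line is the first record above with tabs restored:

```python
import re

# Matches one attribute of the form: key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_line(line):
    """Split one GTF line into its fixed columns and an attribute dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": dict(ATTRIBUTE_PATTERN.findall(attributes)),
    }


line = "\t".join([
    "chr1", "Cufflinks", "exon", "14765", "16672", ".", "+", ".",
    'gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; '
    'oId "CUFF.9.1"; tss_id "TSS1";',
])
exon = parse_gtf_line(line)
```

Grouping the parsed exons by `transcript_id` recovers each assembled transcript's structure, which is the kind of information a downstream expression analysis needs.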

  46. 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0
    101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG
    GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG
    [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ
    O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0
    MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971
    HI:i:0
    aligned reads (~1.5 Tb)
    chr1 Cufflinks exon 14765 16672 . + .
    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number
    "1"; oId "CUFF.9.1"; tss_id "TSS1";
    chr1 Cufflinks exon 566984 569564 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "1"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 569902 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "2"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 567008 568410 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "1"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 569017 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "2"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 567066 567843 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    "1"; oId "CUFF.14.3"; tss_id "TSS2";
    chr1 Cufflinks exon 568627 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    data-driven transcriptome assembly (~150 Mb)
    ballgown objects (200-600 Mb)
    turning “big” data into small data
    http://figshare.com/articles/GEUVADIS_Processed_Data/1130849

  47. Reproducible; freely available

  48. Download my processed data; save time!
    3653 hours 999 hours 651 hours
    5299 total hours, assuming 4
    cores available

  49. Results: Continuous covariates
    Frazee et al, Nature Biotechnology (accepted)

  50. Results: Timecourse analysis

  51. Future Work
    ● Statistical inference and uncertainty
    measures for transcript assemblies

  52. (no text on slide)

  53. Future Work
    ● Statistical inference and uncertainty
    measures for transcript assemblies
    ● Reproducibility strategies for time-consuming analyses

  54. Future Work
    ● Statistical inference and uncertainty
    measures for transcript assemblies
    ● Reproducibility strategies for time-consuming analyses
    ● Continue applying and developing
    methods to solve scientific problems

  55. Thank you!
    Collaborators: Jeff Leek (advisor), Sarven
    Sabunciyan, Kasper Hansen, Rafa Irizarry, Steven
    Salzberg, Ben Langmead, Andrew Jaffe, Geo Pertea,
    Leonardo Collado Torres

  56. References
    Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT (2014). “Differential expression analysis of RNA-seq data at single-base resolution.” Biostatistics 15(3): 413-426.
    Frazee AC, Pertea G, Jaffe AE, Salzberg SL, Leek JT (2015). “Ballgown bridges the gap between transcriptome assembly and expression analysis.” Nature Biotechnology, to appear.
    ’t Hoen PAC et al (2013). “Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories.” Nature Biotechnology 31(11): 1015-1022.
    Anders S, Huber W (2010). “Differential expression analysis for sequence count data.” Genome Biology 11(10): R106.
    Bernard E, Jacob L, Mairal J, Vert J (2014). “Efficient RNA isoform identification and quantification from RNA-seq data with network flows.” Bioinformatics 30(17): 2447-2455.
    Efron B (2008). “Microarrays, empirical Bayes, and the two-groups model.” Statistical Science 23(1): 1-22.
    Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, Irizarry RA (2012). “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies.” International Journal of Epidemiology 41(1): 200-209.
    Lappalainen T et al (2013). “Transcriptome and genome sequencing uncovers functional variation in humans.” Nature 501(7468): 506-511.
    Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, Haag JD, Gould MN, Stewart RM, Kendziorski C (2013). “EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments.” Bioinformatics 29(8): 1035-1043.
    Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26(1): 139-140.
    Smyth GK (2004). “Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.” Statistical Applications in Genetics and Molecular Biology 3(1): 3.
    Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L (2013). “Differential analysis of gene regulation at transcript resolution with RNA-seq.” Nature Biotechnology 31(1): 46-53.

  57. Where to find my work
    Papers
    ● DER Finder: Biostatistics [link]
    ● Ballgown: Nature Biotechnology [preprint link]
    ● Polyester: bioRxiv [preprint link]
    ● ReCount: BMC Bioinformatics [link]
    ● RNA-seq book chapter: [link, Chapter 6]
    Software
    https://github.com/alyssafrazee
    ● Ballgown: [link][dev link]
    ● Polyester: [link][dev link]
    ● DER Finder (Leonardo Collado Torres): [link]
    These links, plus side projects and blog posts: alyssafrazee.com

  58. Image Credits
    ● human diversity: Simon Abrams, CC BY-SA 2.0 [link]
    ● tumor cells: cnicholsonpath (via Flickr), CC BY-SA 2.0 [link]
    ● awesome cast: Jennifer Carole, CC BY-NA 2.0 [link]
    ● cell differentiation: Rasback (via Wikipedia), CC BY-SA 2.5 [link] (I cropped it)

  59. But:
    No genes annotated here
    Annotation here does not match data
    Frazee et al, Biostatistics 2014
