
High-resolution gene expression analysis

Alyssa Frazee
February 17, 2015


My PhD thesis defense seminar, given at the Johns Hopkins Biostatistics Department on 2/17/15.


Transcript

  1. High-resolution gene
    expression analysis
    Alyssa Frazee
    Department of Biostatistics
    Thesis Defense Seminar
    February 17, 2015


  2. Research goal:
    Find genes that behave differently
    between populations


  3. View Slide

  4. Gene expression
    AUCAGUCGAUCACCGAU
    transcription
    DNA
    RNA
    ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT


  5. Gene expression
    AUCAGUCGAUCACCGAU
    transcription
    DNA
    RNA
    ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    gene


  6. Gene expression
    AUCAGUCGAUCACCGAU
    transcription
    DNA
    RNA
    ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    exons


  7. Gene expression
    AUCAGUCGAUCACCGAU
    transcription
    DNA
    RNA
    ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    introns


  8. Gene expression
    AUCAGUCGAUCACCGAU
    transcription
    DNA
    RNA
    ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    transcript or
    isoform


  9. Gene expression
    AUCAGUCGAUCACCGAU
    transcription
    DNA
    RNA
    ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    junctions


  10. Gene expression
    AUCAGUCGAUCACCGAU
    transcription
    DNA
    RNA
    ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    gene’s “expression level” = amount of
    RNA in cell that was transcribed from
    that gene


  11. Measuring gene expression: RNA-seq
    RNA-seq
    reads
    Genome
    (DNA)
    RNA transcripts
    (many possible
    variants)


  12. sequencing machine


  13. @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
    @22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2
    GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT
    +
    @=ABBBBIIIIIIIIHHGGGGIIDBDIIIIIIGIIIIHIIIIHFDD@BBDBGGFIDEE8DCC/29>BGFCGHHHGF
    @22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1
    TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT
    +
    DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>DEGHHIFDDHH8@BEDDI
    @22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2
    AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC
    +
    HIGHIHFHEGE4111:.;8@?@HDIIIIIIIEGGIHHHIIGA?=:FIIIDD8.02506A8=AC#############
    @22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1
    AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC
    +
    B9?@8=42:E@GDEDIIIIIGGHIIIFBEEAGIIDIIDHHGGHIIEGEIIIIIHIHFHFFEEFGGGGGB88>:DGH
    @22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2
    GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA
    +
    IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>D@@GGGIIHIIHGGDDHGBA=ABEG@@DFCCAA<:=>8
    @22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1
    TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC
    +
    HHIIIHIDGG@;=@GIIIIIDDGBBBEDB@8>5554,/':9B@@C?==@1:2@?=GG=;
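The reads above are in FASTQ format: each record is four lines (an `@` header, the base sequence, a `+` separator, and a per-base quality string). As a rough sketch of how such records could be read, the Python below parses them and decodes qualities. It assumes Phred+33 quality encoding, which the slide does not state, and the toy records and function names are hypothetical.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield one record per four non-empty lines: @name, sequence, '+', quality."""
    cleaned = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(cleaned) - 3, 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        yield FastqRecord(header[1:], seq, qual)


def phred33(quality: str) -> List[int]:
    """Decode a quality string, assuming Phred+33 encoding (an assumption)."""
    return [ord(ch) - 33 for ch in quality]


# Hypothetical toy records, not taken from the slide.
toy_lines = ["@read1", "ACGTN", "+", "IIHG#",
             "@read2", "TTGCA", "+", "DDBB?"]
records = list(parse_fastq(toy_lines))
```

Under this encoding a `#` decodes to quality 2, which would fit the trailing run of `#` after the `N` base call in the fourth read.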


  14. Measuring expression using counts
    expression = 24
    Genome
    (DNA)
    edgeR (Robinson et al, Bioinformatics 2010)
    DESeq (Anders and Huber, Genome Biology 2010)
    voom (Law et al, Genome Biology 2014)
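Count-based expression reduces each gene to a single number: how many reads overlap its annotated coordinates. A minimal sketch of that idea follows. The function name, the fixed read length, and the coordinates are all illustrative assumptions; it is not how edgeR, DESeq, or voom obtain their input counts.

```python
def count_reads_in_gene(read_starts, read_length, gene_start, gene_end):
    """Count reads overlapping the closed interval [gene_start, gene_end]."""
    count = 0
    for start in read_starts:
        end = start + read_length - 1
        # Two closed intervals overlap when each starts before the other ends.
        if start <= gene_end and end >= gene_start:
            count += 1
    return count


# Hypothetical reads of length 4 against a gene spanning positions 6..12.
example_count = count_reads_in_gene([1, 5, 10, 20], 4, 6, 12)
```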


  15. High information loss
    RNA
    transcripts
    Genome
    (DNA)
    Plus: cannot detect expression outside annotated genes, incorrect
    annotation causes problems, difficult to study non-canonical genomes
    (e.g., cancer)


  16. Research goal:
    Find genes that behave differently
    between populations
    1. Discover previously unknown gene
    activity
    2. Find expression differences at the
    transcript level


  17. Contributions
    1. DER Finder: Novel method to discover
    previously unknown gene activity
    2. Ballgown: Tools for expression
    analysis, including transcript-level
    differential expression analysis
    3. Polyester: simulator for evaluating
    statistical properties of new DE
    methods


  18. DER Finder
    Frazee, Sabunciyan, Hansen,
    Irizarry, and Leek, Biostatistics 2014
    Concept: scan genome base-by-base,
    highlight regions showing differential
    expression signal


  19. Read coverage
    coverage
    vector
    2 6 0 11 6
    Genome
    (DNA)
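A coverage vector records, for every genomic position, how many aligned reads overlap it. The sketch below computes one from read intervals using a difference array. The interval representation (inclusive start and end) and the example reads are assumptions for illustration.

```python
def coverage_vector(reads, region_start, region_end):
    """Per-base read coverage over [region_start, region_end].

    reads: iterable of (start, end) pairs with inclusive coordinates.
    """
    length = region_end - region_start + 1
    diff = [0] * (length + 1)
    for start, end in reads:
        lo = max(start, region_start)
        hi = min(end, region_end)
        if lo > hi:
            continue  # read lies entirely outside the region
        diff[lo - region_start] += 1
        diff[hi - region_start + 1] -= 1
    coverage, running = [], 0
    for delta in diff[:length]:
        running += delta
        coverage.append(running)
    return coverage


# Hypothetical reads over positions 1..6.
example_coverage = coverage_vector([(1, 3), (2, 5), (5, 6)], 1, 6)
```

The difference array keeps the cost proportional to the number of reads plus the region length, rather than reads times read length.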


  20. DER Finder
    genomic position
    read
    coverage


  21. Nucleotide-level signal
    samples indexed by i
    locations indexed by l_j
    confounders indexed by k
    expression
    covariate of interest
    confounders


  22. Nucleotide-level signal
    samples indexed by i
    locations indexed by l_j
    confounders indexed by k
    expression
    covariate of interest
    confounders
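The equation on this slide did not survive in the transcript; only its labels did. A per-nucleotide linear model consistent with those labels (expression, covariate of interest, confounders, with samples i, locations l_j, and confounders k) could be written as below. The symbols, the transformation g, and the restriction to a single covariate of interest are assumptions, not a copy of the slide.

```latex
% Y_{ij}: coverage for sample i at location l_j (the "expression" term)
% X_i:    covariate of interest for sample i
% W_{ik}: k-th confounder for sample i
% g:      a transformation of coverage, e.g. a log with an offset (assumed)
g(Y_{ij}) = \alpha(l_j) + \beta(l_j)\, X_i
          + \sum_{k=1}^{K} \gamma_k(l_j)\, W_{ik} + \varepsilon_{ij}
```

Under this form, testing whether \beta(l_j) = 0 at each location gives the base-level differential expression signal.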


  23. DER Finder
    genomic position
    DE signal
    read
    coverage
    “bump hunting” idea: Jaffe et al, Int J Epidemiol 2012


  24. Segmentation: Hidden Markov Model
    hidden states (unknown truth): DE or not DE at each position
    emissions (observed): moderated t-statistics t_1, ..., t_5 (Smyth 2004)
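To make the segmentation step concrete, here is a small two-state Viterbi decoder over a sequence of t-statistics. It is only a sketch: the two states mirror the slide's "DE" and "not DE" labels, but the emission distributions, transition probabilities, and start probabilities are made-up values for illustration, not the parameters estimated in DER Finder.

```python
import math

STATES = ("not DE", "DE")
START = {"not DE": 0.9, "DE": 0.1}              # assumed values
TRANS = {"not DE": {"not DE": 0.95, "DE": 0.05},  # assumed values
         "DE": {"not DE": 0.05, "DE": 0.95}}


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def log_emission(state, t):
    """Log density of a t-statistic given the hidden state (assumed forms)."""
    if state == "not DE":
        p = normal_pdf(t, 0.0, 1.0)
    else:
        # Symmetric mixture so both over- and under-expression count as DE.
        p = 0.5 * normal_pdf(t, 4.0, 1.0) + 0.5 * normal_pdf(t, -4.0, 1.0)
    return math.log(max(p, 1e-300))  # guard against log(0) on underflow


def viterbi(t_stats):
    """Most likely hidden state sequence for the observed t-statistics."""
    if not t_stats:
        return []
    score = {s: math.log(START[s]) + log_emission(s, t_stats[0]) for s in STATES}
    backpointers = []
    for t in t_stats[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + log_emission(s, t)
        backpointers.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical t-statistics with a run of large values in the middle.
example_path = viterbi([0.1, -0.3, 4.2, 5.0, 3.8, 0.2])
```

Contiguous runs of "DE" in the decoded path would then be the candidate regions.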


  25. candidate DERs
    region-level statistics
    linear
    models
    HMM


  26. linear
    models
    HMM
    permutation tests
    for statistical
    significance
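A permutation test assesses significance by shuffling the sample labels and checking how often the shuffled data produce a statistic at least as extreme as the observed one. The sketch below does this for a single region, using the absolute difference in group means as a stand-in statistic. That statistic, the function name, and the data are assumptions; the actual procedure re-runs the region-finding pipeline on permuted labels.

```python
import random


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Permutation p-value for a two-group difference in means.

    values: one region-level summary per sample.
    labels: 0/1 group membership per sample.
    """
    def statistic(lbls):
        group1 = [v for v, l in zip(values, lbls) if l == 1]
        group0 = [v for v, l in zip(values, lbls) if l == 0]
        return abs(sum(group1) / len(group1) - sum(group0) / len(group0))

    observed = statistic(labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly above zero.
    return (exceed + 1) / (n_permutations + 1)


# Hypothetical region summaries for three samples per group.
p_separated = permutation_p_value([10, 11, 12, 1, 2, 3], [1, 1, 1, 0, 0, 0])
p_identical = permutation_p_value([1, 1, 1, 1], [1, 1, 0, 0])
```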


  27. match to
    annotation if
    desired:
    CECR1, “may
    play a role in
    regulating cell
    proliferation”


  28. Check performance
    ● Data: Y chromosome expression for 9
    males and 6 females
    ● Question: which transcripts are
    differentially expressed between males
    and females?
    ● Expected answer: all
    ● Expected p-value distribution: most near
    0, uniformly distributed away from 0


  29. Results: Y chromosome
    Frazee et al, Biostatistics 2014
    (a) (b)
    (d)
    (c)


  30. Results:
    Frazee et al, Biostatistics 2014
    No genes annotated here
    Annotation here does not match data


  31. Research goal:
    Find genes that behave differently
    between populations
    1. Can we discover previously unknown
    gene activity? (DER Finder)
    2. Can we discover expression
    differences at the transcript level?
    (Ballgown)


  32. Ideal solution: full reconstruction
    Reads
    Estimated
    Transcripts
    Genome
    (DNA)


  33. Abundance estimation
    expression ≈ 12 for both
    assembled transcripts
    Genome
    Estimated
    Transcripts


  34. Abundance estimation
    expression ≈ 12 for both
    assembled transcripts
    Genome
    Estimated
    Transcripts
    FPKM
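FPKM stands for fragments per kilobase of transcript per million mapped fragments: it normalizes a raw fragment count by both transcript length and sequencing depth. A minimal sketch of the standard formula, with hypothetical numbers:

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical: 100 fragments on a 2,000 bp transcript, 1 million mapped fragments.
example_fpkm = fpkm(100, 2000, 1_000_000)
```

Because FPKM values are continuous rather than integer counts, they do not fit count-based models directly.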


  35. But: assembly is hard
    Bernard et al, Bioinformatics 2014
    Simulated Data


  36. But: assembly is hard
    Bernard et al, Bioinformatics 2014
    Real Data


  37. Check performance of current
    assembly-based DE method
    ● Data: RNA-seq from 12 normal samples
    and 12 tumor samples (Kim et al, PLoS One 2013)
    ● Question: which transcripts are
    differentially expressed between tumor
    and normal conditions?
    ● Expected answer: most
    ● Expected p-value distribution: most near
    0, uniformly distributed away from 0


  38. Check performance of current
    assembly-based DE method
    Cuffdiff 2 (Trapnell et al, Nature Biotechnology 2013) on tumor/normal data (Kim et al, PLoS One 2013),
    downloaded from InSilico DB (Coletta et al, Genome Biology 2012)


  39. some
    possible
    assemblies
    Inherently ambiguous
    Genome


  40. Count models not appropriate
    Genome


  41. Ballgown
    Frazee, Pertea, Jaffe, Langmead, Salzberg, and Leek.
    Nature Biotechnology (accepted)
    Concept: software infrastructure and simple,
    robust statistical techniques improve
    inference for assemblies


  42. Ballgown
    Frazee et al, Nature Biotechnology (accepted)
    transcriptome
    assembly
    pipelines
    R/Bioconductor
    DE analysis


  43. Defines R data structure for assemblies
    expr
    GRanges
    data frames
    Canonical format
    for differential
    expression analysis


  44. Facilitates exploratory analysis


  45. Facilitates exploratory analysis


  46. Differential expression analysis
    drop-in replacement for Cuffdiff
    F-tests comparing nested models
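An F-test on nested models asks whether adding the covariate of interest reduces the residual sum of squares more than chance would. The sketch below computes the F-statistic for the simplest case: an intercept-only null model against a model with one mean per group, for a single transcript. It is a hand-rolled illustration with hypothetical data, not Ballgown's implementation, and it stops at the statistic (a p-value would need the F distribution).

```python
def f_statistic_nested(expression, groups):
    """F-statistic comparing an intercept-only model to a per-group-mean model."""
    n = len(expression)
    levels = sorted(set(groups))
    grand_mean = sum(expression) / n
    rss_null = sum((y - grand_mean) ** 2 for y in expression)

    rss_alt = 0.0
    for level in levels:
        ys = [y for y, g in zip(expression, groups) if g == level]
        group_mean = sum(ys) / len(ys)
        rss_alt += sum((y - group_mean) ** 2 for y in ys)

    p_null, p_alt = 1, len(levels)  # number of fitted mean parameters
    if p_alt <= p_null or n <= p_alt or rss_alt == 0:
        raise ValueError("F-statistic undefined for this input")
    return ((rss_null - rss_alt) / (p_alt - p_null)) / (rss_alt / (n - p_alt))


# Hypothetical expression values for one transcript, three samples per group.
example_f = f_statistic_nested([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```

The same comparison extends to confounders by including them in both models, so that only the covariate of interest differs.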


  47. Improved accuracy


  48. Results: Timecourse analysis


  49. Ballgown: flexible, fast, accurate
    ● Suitable for transcripts (not count-based)
    ● Enables timecourse and multi-group
    analyses
    ● Can adjust for confounders or batch
    effects
    ● Runs in seconds: on cancer data set, 0.7
    sec; Cuffdiff: 10 hours and EBSeq: 6 hours
    ● Correctly identifies known differential
    expression
    EBSeq: Leng et al, Bioinformatics 2013


  50. How is accuracy assessed?
    sequence
    RNA
    align
    reads
    estimate
    transcript
    abundances
    test for
    differential
    expression
    DE pipeline:
    assemble
    transcripts
    simulate
    abundances from
    expression model


  51. How is accuracy assessed?
    sequence
    RNA
    align
    reads
    estimate
    transcript
    abundances
    test for
    differential
    expression
    DE pipeline:
    assemble
    transcripts
    spike-in
    experiment


  52. How is accuracy assessed?
    sequence
    RNA
    align
    reads
    estimate
    transcript
    abundances
    test for
    differential
    expression
    DE pipeline:
    assemble
    transcripts
    simulate reads


  53. How is accuracy assessed?
    sequence
    RNA
    align
    reads
    estimate
    transcript
    abundances
    test for
    differential
    expression
    DE pipeline:
    assemble
    transcripts
    simulate reads
    Existing read simulation software did not
    simulate differential expression


  54. Polyester
    annotated
    transcript
    sequences
    $ R
    > library(polyester)
    > simulate_experiment(fasta,
    baseline_counts, fold_changes, ...)
    Frazee, Jaffe, Langmead, and Leek. Manuscript under revision.


  55. Polyester
    $ R
    > library(polyester)
    > simulate_experiment(fasta,
    baseline_counts, fold_changes, ...)
    read counts: drawn
    from negative binomial
    distribution across
    replicates
    Frazee, Jaffe, Langmead, and Leek. Manuscript under revision.


  56. Statistical model for read counts
    samples indexed by i
    transcripts indexed by j
    groups indexed by k
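The slide's model equation is not in the transcript, but the earlier slide says read counts are drawn from a negative binomial distribution across replicates. The Python sketch below illustrates that idea with a gamma-Poisson mixture: each transcript's mean is a baseline scaled by a group-specific fold change. It is not the polyester API. The function names, the two-group layout, and setting the size parameter to one third of the mean are all assumptions for illustration.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate (Knuth's method; suitable for modest lam)."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def negative_binomial(rng, mean, size):
    """Gamma-Poisson mixture with variance = mean + mean**2 / size."""
    lam = rng.gammavariate(size, mean / size)
    return poisson(rng, lam)


def simulate_counts(baseline_means, fold_changes, reps_per_group, seed=0):
    """Return counts[group][replicate][transcript] for two groups.

    fold_changes: one (group0, group1) multiplier pair per transcript.
    """
    rng = random.Random(seed)
    counts = []
    for group in range(2):
        group_counts = []
        for _ in range(reps_per_group):
            sample = []
            for mean, fcs in zip(baseline_means, fold_changes):
                mu = mean * fcs[group]
                sample.append(negative_binomial(rng, mu, size=mu / 3))  # assumed size
            group_counts.append(sample)
        counts.append(group_counts)
    return counts


# Hypothetical: transcript 2 has a 4-fold increase in the second group.
sim = simulate_counts([50.0, 20.0], [(1.0, 1.0), (1.0, 4.0)], reps_per_group=3)
```

Because the differential expression is set by the user, the true status of every transcript is known when evaluating a DE method on the simulated data.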


  57. Polyester
    $ R
    > library(polyester)
    > simulate_experiment(fasta,
    baseline_counts, fold_changes, ...)
    user-set
    differential
    expression
    Frazee, Jaffe, Langmead, and Leek. Manuscript under revision.


  58. Additional features
    ● GC expression bias
    ● Positional sequencing bias
    ● Empirical error models
    ● Empirical fragment length distribution
    ● Exact specification of number of reads per
    sample per transcript


  59. Compare to real data


  60. Assess differential expression
    methods
    Frazee et al, Nature Biotechnology (accepted)


  61. Thank you!
    Co-authors / collaborators:
    Jeff Leek
    Sarven Sabunciyan
    Kasper Hansen
    Rafael Irizarry
    Steven Salzberg
    Ben Langmead
    Andrew Jaffe
    Geo Pertea
    Leonardo Collado Torres


  62. Thank you!
    Committee Members:
    Jeff Leek
    Kasper Hansen
    Steven Salzberg
    Anthony Leung
    Dan Arking


  63. Thank you!
    Biostatistics Department
    Karen Bandeen-Roche
    Marie Diener-West, John McGready
    Mary Joy Argo, Ashley Johnson, Marti Gilbert
    Marvin Newhouse, Mark Miller, Fernando Pineda,
    Jiong Yang
    Classmates, officemates, friends (!!!)
    Genomics Working Group
    Hopkins Sommer Scholars Program


  64. Thank you!
    My parents Shelley and Dave and my sister Kayla


  65. Thank you!
    Jeff Leek
    Thanks for believing in me, exemplifying
    fearlessness for me, continually pushing me to
    improve, constantly supporting my career goals, and
    relentlessly encouraging me.


  66. References
    Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT (2014). “Differential expression analysis of RNA-seq data at single-base resolution.” Biostatistics 15(3): 413-426.
    Frazee AC, Pertea G, Jaffe AE, Langmead B, Salzberg SL, Leek JT (2015). “Ballgown bridges the gap between transcriptome assembly and expression analysis.” Nature Biotechnology, to appear.
    Frazee AC, Jaffe AE, Langmead B, Leek JT (2014). “Polyester: simulating RNA-seq datasets with differential transcript expression.” Under revision at Bioinformatics.
    ’t Hoen PAC et al (2013). “Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories.” Nature Biotechnology 31(11): 1015-22.
    Anders S and Huber W (2010). “Differential expression analysis for sequence count data.” Genome Biology 11(10): R106.
    Bernard E, Jacob L, Mairal J, Vert J (2014). “Efficient RNA isoform identification and quantification from RNA-seq data with network flows.” Bioinformatics 30(17): 2447-2455.
    Efron B (2008). “Microarrays, empirical Bayes, and the two-groups model.” Statistical Science 23(1): 1-22.
    Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, Irizarry RA (2012). “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies.” International Journal of Epidemiology 41(1): 200-209.
    Law CW, Chen Y, Shi W, Smyth GK (2014). “Voom: precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology 15(2): R29.
    Lappalainen T et al (2013). “Transcriptome and genome sequencing uncovers functional variation in humans.” Nature 501(7468): 506-11.
    Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, Haag JD, Gould MN, Stewart RM, Kendziorski C (2013). “EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments.” Bioinformatics 29(8): 1035-1043.
    Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26(1): 139-40.
    Smyth GK (2004). “Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.” Statistical Applications in Genetics and Molecular Biology 3(1): 3.
    Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L (2013). “Differential analysis of gene regulation at transcript resolution with RNA-seq.” Nature Biotechnology 31(1): 46-53.


  67. Image Credits
    ● human diversity: Simon Abrams (via Flickr), CC BY-SA 2.0 [link]
    ● tumor cells: cnicholsonpath (via Flickr), CC BY-SA 2.0 [link]
    ● awesome cast: Jennifer Carole, CC BY-NA 2.0 [link]
    ● cell differentiation: Rasback (via Wikipedia), CC BY-SA 2.5 [link] (I cropped it)
    ● sequencer: Kinghorn Centre for Clinical Genomics (via Flickr), CC-BY-ND 2.0 [link]


  68. Emission distribution
    parameter estimation
    Efron, Statistical Science 2008


  69. Emission distribution
    parameter estimation


  70. Processing the GEUVADIS
    dataset
    Genetic EUropean VAriation in health and DISease
    Lappalainen et al, Nature 2013; ’t Hoen et al, Nature Biotechnology 2013


  71. turning “big” data into small data
    “Make big data as small as
    possible as quick as is possible”
    -Robert Gentleman


  72. turning “big” data into small data
    @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
    raw reads (~3 Tb)
    300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0
    101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG
    GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG
    [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ
    O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0
    MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971
    HI:i:0
    aligned reads (~1.5 Tb)
    chr1 Cufflinks exon 14765 16672 . + .
    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number
    "1"; oId "CUFF.9.1"; tss_id "TSS1";
    chr1 Cufflinks exon 566984 569564 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "1"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 569902 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "2"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 567008 568410 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "1"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 569017 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "2"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 567066 567843 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    "1"; oId "CUFF.14.3"; tss_id "TSS2";
    chr1 Cufflinks exon 568627 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    data-driven transcriptome assembly (~150 Mb)
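The assembly output shown above is in GTF format: nine columns (sequence name, source, feature, start, end, score, strand, frame, attributes), with the last column holding `key "value";` pairs. A small parsing sketch follows. The function name and the returned dictionary layout are hypothetical, and it splits on any whitespace because the tabs were lost in this transcript.

```python
import re

ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Parse one GTF record into a dict; attributes become a nested dict."""
    parts = line.strip().split(None, 8)
    if len(parts) != 9:
        raise ValueError("expected 9 GTF columns")
    seqname, source, feature, start, end, score, strand, frame, attrs = parts
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": dict(ATTRIBUTE.findall(attrs)),
    }


# First record from the slide, rejoined onto one line.
first_exon = parse_gtf_line(
    'chr1 Cufflinks exon 14765 16672 . + . gene_id "XLOC_000001"; '
    'transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.9.1"; tss_id "TSS1";'
)
```

Grouping parsed exons by `transcript_id` would recover each assembled transcript's structure.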


  73. 300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0
    101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG
    GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG
    [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ
    O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0
    MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971
    HI:i:0
    aligned reads (~1.5 Tb)
    chr1 Cufflinks exon 14765 16672 . + .
    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number
    "1"; oId "CUFF.9.1"; tss_id "TSS1";
    chr1 Cufflinks exon 566984 569564 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "1"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 569902 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number
    "2"; oId "CUFF.14.1"; tss_id "TSS2";
    chr1 Cufflinks exon 567008 568410 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "1"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 569017 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000003"; exon_number
    "2"; oId "CUFF.14.2"; tss_id "TSS2";
    chr1 Cufflinks exon 567066 567843 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    "1"; oId "CUFF.14.3"; tss_id "TSS2";
    chr1 Cufflinks exon 568627 570307 . + .
    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number
    data-driven transcriptome assembly
    (~150 Mb)
    ballgown
    objects
    (200-600 Mb)
    turning “big” data into small data
    http://figshare.com/articles/GEUVADIS_Processed_Data/1130849


  74. Reproducible; freely available


  75. Download my processed data; save time!
    3653 hours 999 hours 651 hours
    5299 total hours, assuming 4
    cores available
