High-resolution gene expression analysis

Alyssa Frazee
February 17, 2015

My PhD thesis defense seminar, given at the Johns Hopkins Biostatistics Department 2/17/15

Transcript

  1. High-resolution gene expression analysis. Alyssa Frazee, Department of Biostatistics. Thesis Defense Seminar, February 17, 2015
  2. Research goal: Find genes that behave differently between populations

  3. None
  4. Gene expression AUCAGUCGAUCACCGAU transcription DNA RNA ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT

  5. Gene expression AUCAGUCGAUCACCGAU transcription DNA RNA ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT gene

  6. Gene expression AUCAGUCGAUCACCGAU transcription DNA RNA ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT exons

  7. Gene expression AUCAGUCGAUCACCGAU transcription DNA RNA ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT introns

  8. Gene expression AUCAGUCGAUCACCGAU transcription DNA RNA ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT transcript or isoform

  9. Gene expression AUCAGUCGAUCACCGAU transcription DNA RNA ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT junctions

  10. Gene expression AUCAGUCGAUCACCGAU transcription DNA RNA ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
      A gene’s “expression level” = the amount of RNA in the cell that was transcribed from that gene
  11. Measuring gene expression: RNA-seq. Diagram labels: RNA-seq reads; Genome (DNA); RNA transcripts (many possible variants)
  12. sequencing machine

  13. FASTQ records:
      @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
      CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
      +
      GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
      @22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2
      GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT
      +
      @=ABBBBIIIIIIIIHHGGGGIIDBDIIIIIIGIIIIHIIIIHFDD@BBDBGGFIDEE8DCC/29>BGFCGHHHGF
      @22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1
      TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT
      +
      DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>DEGHHIFDDHH8@BEDDI
      @22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2
      AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC
      +
      HIGHIHFHEGE4111:.;8@?@HDIIIIIIIEGGIHHHIIGA?=:FIIIDD8.02506A8=AC#############
      @22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1
      AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC
      +
      B9?@8=42:E@GDEDIIIIIGGHIIIFBEEAGIIDIIDHHGGHIIEGEIIIIIHIHFHFFEEFGGGGGB88>:DGH
      @22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2
      GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA
      +
      IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>D@@GGGIIHIIHGGDDHGBA=ABEG@@DFCCAA<:=>8
      @22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1
      TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC
      +
      HHIIIHIDGG@;=@GIIIIIDDGBBBEDB@8>5554,/':9B@@C?==@1:2@?=GG=;<HHHHGIIHHEC-;;3?
  14. Measuring expression using counts: expression = 24. Genome (DNA). edgeR (Robinson et al, Bioinformatics 2010), DESeq (Anders and Huber, Genome Biology 2010), voom (Law et al, Genome Biology 2014)
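
As a rough illustration of count-based measurement (not the internals of edgeR, DESeq, or voom), expression for an annotated gene can be taken as the number of reads whose alignment overlaps the gene. The half-open interval convention and the function name below are assumptions made for this sketch.

```python
def count_overlapping_reads(reads, gene_start, gene_end):
    """Count reads overlapping the gene interval [gene_start, gene_end).

    `reads` is a list of (start, end) half-open alignment intervals on the
    same chromosome as the gene (hypothetical input format).
    """
    return sum(1 for start, end in reads if start < gene_end and end > gene_start)


# Hypothetical alignments: only the second and third reads overlap the gene.
example_reads = [(0, 5), (4, 9), (10, 15), (20, 25)]
print(count_overlapping_reads(example_reads, 5, 12))  # prints 2
```

Everything outside the annotated interval is ignored, which is the information loss the next slide points out.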
  15. High information loss (RNA transcripts, Genome (DNA)). Plus: cannot detect expression outside annotated genes; incorrect annotation causes problems; difficult to study non-canonical genomes (e.g., cancer)
  16. Research goal: Find genes that behave differently between populations
      1. Discover previously unknown gene activity
      2. Find expression differences at the transcript level
  17. Contributions
      1. DER Finder: Novel method to discover previously unknown gene activity
      2. Ballgown: Tools for expression analysis, including transcript-level differential expression analysis
      3. Polyester: Simulator for evaluating statistical properties of new DE methods
  18. DER Finder (Frazee, Sabunciyan, Hansen, Irizarry, and Leek, Biostatistics 2014). Concept: scan genome base-by-base, highlight regions showing differential expression signal
  19. Read coverage. Coverage vector: 2 6 0 11 6. Genome (DNA)
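
A coverage vector records, for each genomic position, how many aligned reads span it. The sketch below is one minimal way to compute it, assuming reads are given as half-open (start, end) intervals; the function name and input format are illustrative, not taken from DER Finder.

```python
def coverage_vector(reads, region_start, region_end):
    """Per-base read coverage over the half-open region [region_start, region_end).

    `reads` is a list of (start, end) half-open alignment intervals.
    A difference array keeps the work linear in reads plus region length.
    """
    length = region_end - region_start
    diff = [0] * (length + 1)
    for start, end in reads:
        lo = max(start, region_start) - region_start
        hi = min(end, region_end) - region_start
        if lo < hi:  # the read overlaps the region
            diff[lo] += 1
            diff[hi] -= 1
    coverage = []
    running = 0
    for delta in diff[:length]:
        running += delta
        coverage.append(running)
    return coverage


# Hypothetical reads on a 10-base region.
print(coverage_vector([(0, 4), (2, 6), (2, 3), (8, 10)], 0, 10))
# prints [1, 1, 3, 2, 1, 1, 0, 0, 1, 1]
```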
  20. DER Finder genomic position read coverage

  21. Nucleotide-level signal. Equation labels: samples indexed by i; locations indexed by l_j; confounders indexed by k; terms: expression, confounders, covariate of interest

  22. Nucleotide-level signal. Equation labels: samples indexed by i; locations indexed by l_j; confounders indexed by k; terms: expression, confounders, covariate of interest
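
The equation itself did not survive in this transcript; only its labels did. One form consistent with those labels is a linear model fit separately at each nucleotide. The notation below is an assumption for illustration, not a quotation of the slide: Y_{ij} is the expression (coverage) for sample i at location l_j, X_i is the covariate of interest, W_{ik} are the confounders, and g is some transformation of coverage (for example a log with an offset).

```latex
g(Y_{ij}) = \alpha(l_j) + \beta(l_j)\, X_i + \sum_{k=1}^{K} \gamma_k(l_j)\, W_{ik} + \varepsilon_{ij}
```

Under this reading, the differential expression signal at location l_j is a test statistic for \beta(l_j) = 0.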
  23. DER Finder: genomic position, DE signal, read coverage. “Bump hunting” idea: Jaffe et al, Int J Epidemiol 2012
  24. Segmentation: Hidden Markov Model. Hidden states (unknown truth): DE, DE, not DE, DE, not DE. Emissions (observed): t_1, t_2, t_3, t_4, t_5, moderated t-statistics (Smyth 2004)
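
To make the segmentation step concrete, here is a minimal two-state Viterbi sketch that labels each position as "DE" or "not DE" from a sequence of t-statistics. Every number in it is a hypothetical placeholder: the emission densities (a standard normal for "not DE", an equal mixture of normals centered at +4 and -4 for "DE"), the stay probability, and the prior. In the thesis the emission parameters are estimated from data, so this shows only the decoding idea.

```python
import math

STATES = ["not DE", "DE"]


def normal_logpdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_emission(t, state):
    """Log density of a t-statistic under each hidden state (assumed values)."""
    if state == "not DE":
        return normal_logpdf(t, 0.0, 1.0)
    # DE: equal mixture of over- and under-expression components.
    up = normal_logpdf(t, 4.0, 1.0)
    down = normal_logpdf(t, -4.0, 1.0)
    m = max(up, down)
    return math.log(0.5) + m + math.log(math.exp(up - m) + math.exp(down - m))


def viterbi(t_stats, stay_prob=0.95, prior_de=0.1):
    """Most likely hidden state path for a sequence of t-statistics."""
    if not t_stats:
        return []
    log_init = {"not DE": math.log(1 - prior_de), "DE": math.log(prior_de)}
    log_trans = {
        (a, b): math.log(stay_prob if a == b else 1 - stay_prob)
        for a in STATES
        for b in STATES
    }
    best = {s: log_init[s] + log_emission(t_stats[0], s) for s in STATES}
    back = []
    for t in t_stats[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + log_trans[(p, s)])
            pointers[s] = prev
            new_best[s] = best[prev] + log_trans[(prev, s)] + log_emission(t, s)
        back.append(pointers)
        best = new_best
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi([0.1, -0.3, 4.2, 3.8, 4.5, 0.2, -0.1]))
```

Consecutive positions decoded as "DE" form the candidate regions that the next slides pass on to region-level statistics.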
  25. candidate DERs region-level statistics linear models HMM

  26. linear models HMM permutation tests for statistical significance
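
The permutation idea can be sketched as follows: shuffle the group labels, recompute a region-level statistic, and see how often the shuffled statistic is at least as extreme as the observed one. This is a simplified stand-in, assuming one summary value per sample for a single region and an absolute difference in group means as the statistic. It is not DER Finder's exact procedure, and all names are illustrative.

```python
import random


def mean(values):
    return sum(values) / len(values)


def region_statistic(region_means, labels):
    """Absolute difference in mean coverage between the two groups (labels 0/1)."""
    group_a = [v for v, g in zip(region_means, labels) if g == 0]
    group_b = [v for v, g in zip(region_means, labels) if g == 1]
    return abs(mean(group_a) - mean(group_b))


def permutation_p_value(region_means, labels, n_perm=2000, seed=0):
    """Permutation p-value with the usual +1 correction."""
    rng = random.Random(seed)
    observed = region_statistic(region_means, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if region_statistic(region_means, shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Hypothetical per-sample mean coverage in one candidate region.
coverage = [10.1, 9.8, 10.4, 10.0, 9.9, 2.1, 1.8, 2.3, 2.0, 1.9]
groups = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(permutation_p_value(coverage, groups))
```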

  27. match to annotation if desired: CECR1, “may play a role

    in regulating cell proliferation”
  28. Check performance
      • Data: Y chromosome expression for 9 males and 6 females
      • Question: which transcripts are differentially expressed between males and females?
      • Expected answer: all
      • Expected p-value distribution: most near 0, uniformly distributed away from 0
  29. Results: Y chromosome (figure panels a-d; Frazee et al, Biostatistics 2014)
  30. Results (Frazee et al, Biostatistics 2014). Figure annotations: No genes annotated here; Annotation here does not match data
  31. Research goal: Find genes that behave differently between populations
      1. Can we discover previously unknown gene activity? (DER Finder)
      2. Can we discover expression differences at the transcript level? (Ballgown)
  32. Ideal solution: full reconstruction Reads Estimated Transcripts Genome (DNA)

  33. Abundance estimation: expression ≈ 12 for both assembled transcripts. Genome; Estimated Transcripts

  34. Abundance estimation: expression ≈ 12 for both assembled transcripts. Genome; Estimated Transcripts; FPKM
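
FPKM (fragments per kilobase of transcript per million mapped fragments) normalizes a fragment count by transcript length and sequencing depth. The sketch below uses the standard definition. The input numbers are hypothetical and chosen only so the arithmetic is easy to follow; they are not the values behind the slide.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical transcript: 600 fragments, 2,000 bp long, 25 million mapped fragments.
print(fpkm(600, 2000, 25_000_000))  # prints 12.0
```

Because the value is a normalized, non-integer abundance, it does not behave like a raw read count, which matters for the choice of statistical model later in the talk.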
  35. But: assembly is hard (Bernard et al, Bioinformatics 2014). Simulated Data

  36. But: assembly is hard (Bernard et al, Bioinformatics 2014). Real Data
  37. Check performance of current assembly-based DE method
      • Data: RNA-seq from 12 normal samples and 12 tumor samples (Kim et al, PLoS One 2013)
      • Question: which transcripts are differentially expressed between tumor and normal conditions?
      • Expected answer: most
      • Expected p-value distribution: most near 0, uniformly distributed away from 0
  38. Check performance of current assembly-based DE method: Cuffdiff 2 (Trapnell et al, Nature Biotechnology 2013) on tumor/normal data (Kim et al, PLoS One 2013), downloaded from InSilico DB (Coletta et al, Genome Biology 2012)
  39. some possible assemblies Inherently ambiguous Genome

  40. Count models not appropriate Genome

  41. Ballgown (Frazee, Pertea, Jaffe, Langmead, Salzberg, and Leek. Nature Biotechnology (accepted)). Concept: software infrastructure and simple, robust statistical techniques improve inference for assemblies
  42. Ballgown (Frazee et al, Nature Biotechnology (accepted)). Diagram labels: transcriptome assembly pipelines; R/Bioconductor DE analysis
  43. Defines R data structure for assemblies (expr, GRanges, data frames). Canonical format for differential expression analysis
  44. Facilitates exploratory analysis

  45. Facilitates exploratory analysis

  46. Differential expression analysis: drop-in replacement for Cuffdiff. F-tests comparing nested models
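
An F-test on nested models compares the residual sum of squares of a smaller model with that of a larger model that adds the covariate of interest. The sketch below is a minimal version for one transcript, assuming the null model is intercept-only and the alternative has one mean per group. It leaves out confounders and is not Ballgown's implementation; names are illustrative.

```python
def nested_f_statistic(expression, groups):
    """F statistic comparing an intercept-only model to a group-means model.

    Returns (F, numerator df, denominator df). A p-value would come from the
    upper tail of the F distribution with those degrees of freedom.
    """
    n = len(expression)
    grand_mean = sum(expression) / n
    rss_null = sum((y - grand_mean) ** 2 for y in expression)

    levels = sorted(set(groups))
    rss_alt = 0.0
    for level in levels:
        ys = [y for y, g in zip(expression, groups) if g == level]
        group_mean = sum(ys) / len(ys)
        rss_alt += sum((y - group_mean) ** 2 for y in ys)

    df_num = len(levels) - 1   # extra parameters in the larger model
    df_den = n - len(levels)   # residual degrees of freedom of the larger model
    f_stat = ((rss_null - rss_alt) / df_num) / (rss_alt / df_den)
    return f_stat, df_num, df_den


# Hypothetical expression values for one transcript in two groups of three samples.
print(nested_f_statistic([1, 2, 3, 5, 6, 7], [0, 0, 0, 1, 1, 1]))  # prints (24.0, 1, 4)
```

The same comparison extends to timecourse or multi-group designs by changing which terms the larger model adds.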
  47. Improved accuracy

  48. Results: Timecourse analysis

  49. Ballgown: flexible, fast, accurate
      • Suitable for transcripts (not count-based)
      • Enables timecourse and multi-group analyses
      • Can adjust for confounders or batch effects
      • Runs in seconds: on cancer data set, 0.7 sec; Cuffdiff: 10 hours and EBSeq: 6 hours
      • Correctly identifies known differential expression
      EBSeq: Leng et al, Bioinformatics 2013
  50. How is accuracy assessed? DE pipeline: sequence RNA; align reads; assemble transcripts; estimate transcript abundances; test for differential expression. Option: simulate abundances from expression model
  51. How is accuracy assessed? DE pipeline: sequence RNA; align reads; assemble transcripts; estimate transcript abundances; test for differential expression. Option: spike-in experiment
  52. How is accuracy assessed? DE pipeline: sequence RNA; align reads; assemble transcripts; estimate transcript abundances; test for differential expression. Option: simulate reads
  53. How is accuracy assessed? DE pipeline: sequence RNA; align reads; assemble transcripts; estimate transcript abundances; test for differential expression. Option: simulate reads. Existing read simulation software did not simulate differential expression
  54. Polyester
      annotated transcript sequences
      $ R
      > library(polyester)
      > simulate_experiment(fasta, baseline_counts, fold_changes, ...)
      Frazee, Jaffe, Langmead, and Leek. Manuscript under revision.
  55. Polyester
      $ R
      > library(polyester)
      > simulate_experiment(fasta, baseline_counts, fold_changes, ...)
      read counts: drawn from negative binomial distribution across replicates
      Frazee, Jaffe, Langmead, and Leek. Manuscript under revision.
  56. Statistical model for read counts: samples indexed by i, transcripts indexed by j, groups indexed by k
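
To illustrate drawing replicate read counts from a negative binomial distribution with a user-set fold change, here is a small standard-library sketch. It is written in Python rather than R and does not call Polyester. It samples the negative binomial as a gamma-Poisson mixture with mean `mean` and variance `mean + mean**2 / size`. The function names, the `size` value, and the rule that the fold change multiplies the second group's mean are all assumptions made for this example.

```python
import random


def poisson(rng, lam):
    """Poisson draw: count rate-1 exponential arrivals falling in [0, lam]."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed <= lam:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def negative_binomial(rng, mean, size):
    """Negative binomial draw via a gamma-Poisson mixture (mean must be > 0)."""
    lam = rng.gammavariate(size, mean / size)  # shape=size, scale=mean/size
    return poisson(rng, lam)


def simulate_counts(baseline_means, fold_changes, reps_per_group, size=10.0, seed=1):
    """Return, per transcript, (group 1 counts, group 2 counts).

    Group 2's mean is the baseline mean times that transcript's fold change.
    """
    rng = random.Random(seed)
    counts = []
    for base, fold in zip(baseline_means, fold_changes):
        group1 = [negative_binomial(rng, base, size) for _ in range(reps_per_group)]
        group2 = [negative_binomial(rng, base * fold, size) for _ in range(reps_per_group)]
        counts.append((group1, group2))
    return counts


# Two hypothetical transcripts: the second is 4-fold higher in group 2.
for group1, group2 in simulate_counts([50.0, 50.0], [1.0, 4.0], reps_per_group=5):
    print(group1, group2)
```

Because the fold changes are set by the user, the truly differentially expressed transcripts are known, which is what makes such simulated data usable for checking a DE method.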
  57. Polyester
      $ R
      > library(polyester)
      > simulate_experiment(fasta, baseline_counts, fold_changes, ...)
      user-set differential expression
      Frazee, Jaffe, Langmead, and Leek. Manuscript under revision.
  58. Additional features
      • GC expression bias
      • Positional sequencing bias
      • Empirical error models
      • Empirical fragment length distribution
      • Exact specification of number of reads per sample per transcript
  59. Compare to real data

  60. Assess differential expression methods Frazee et al, Nature Biotechnology (accepted)

  61. Thank you! Co-authors / collaborators: Jeff Leek, Sarven Sabunciyan, Kasper Hansen, Rafael Irizarry, Steven Salzberg, Ben Langmead, Andrew Jaffe, Geo Pertea, Leonardo Collado Torres
  62. Thank you! Committee Members: Jeff Leek, Kasper Hansen, Steven Salzberg, Anthony Leung, Dan Arking
  63. Thank you! Biostatistics Department: Karen Bandeen-Roche; Marie Diener-West, John McGready; Mary Joy Argo, Ashley Johnson, Marti Gilbert; Marvin Newhouse, Mark Miller, Fernando Pineda, Jiong Yang; Classmates, officemates, friends (!!!); Genomics Working Group; Hopkins Sommer Scholars Program
  64. Thank you! My parents Shelley and Dave and my sister Kayla
  65. Thank you! Jeff Leek: Thanks for believing in me, exemplifying fearlessness for me, continually pushing me to improve, constantly supporting my career goals, and relentlessly encouraging me.
  66. References
      Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT (2014). “Differential expression analysis of RNA-seq data at single-base resolution.” Biostatistics 15(3): 413-426.
      Frazee AC, Pertea G, Jaffe AE, Langmead B, Salzberg SL, Leek JT (2015). “Ballgown bridges the gap between transcriptome assembly and expression analysis.” Nature Biotechnology, to appear.
      Frazee AC, Jaffe AE, Langmead B, Leek JT (2014). “Polyester: simulating RNA-seq datasets with differential transcript expression.” Under revision at Bioinformatics.
      ’t Hoen PAC et al (2013). “Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories.” Nature Biotechnology 31(11): 1015-22.
      Anders S and Huber W (2010). “Differential expression analysis for sequence count data.” Genome Biology 11(10): R106.
      Bernard E, Jacob L, Mairal J, Vert J (2014). “Efficient RNA isoform identification and quantification from RNA-seq data with network flows.” Bioinformatics 30(17): 2447-2455.
      Efron B (2008). “Microarrays, empirical Bayes, and the two-groups model.” Statistical Science 23(1): 1-22.
      Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, Irizarry RA (2012). “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies.” International Journal of Epidemiology 41(1): 200-209.
      Law CW, Chen Y, Shi W, Smyth GK (2014). “Voom: precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology 15(2): R29.
      Lappalainen T et al (2013). “Transcriptome and genome sequencing uncovers functional variation in humans.” Nature 501(7468): 506-11.
      Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, Haag JD, Gould MN, Stewart RM, Kendziorski C (2013). “EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments.” Bioinformatics 29(8): 1035-1043.
      Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26(1): 139-40.
      Smyth GK (2004). “Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.” Statistical Applications in Genetics and Molecular Biology 3(1): 3.
      Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L (2013). “Differential analysis of gene regulation at transcript resolution with RNA-seq.” Nature Biotechnology 31(1): 46-53.
  67. Image Credits
      • human diversity: Simon Abrams (via Flickr), CC BY-SA 2.0 [link]
      • tumor cells: cnicholsonpath (via Flickr), CC BY-SA 2.0 [link]
      • awesome cast: Jennifer Carole, CC BY-NA 2.0 [link]
      • cell differentiation: Rasback (via Wikipedia), CC BY-SA 2.5 [link] (I cropped it)
      • sequencer: Kinghorn Centre for Clinical Genomics (via Flickr), CC BY-ND 2.0 [link]
  68. Emission distribution parameter estimation Efron, Statistical Science 2008

  69. Emission distribution parameter estimation

  70. Processing the GEUVADIS dataset: Genetic EUropean VAriation in health and DISease (Lappalainen et al, Nature 2013; ’t Hoen et al, Nature Biotechnology 2013)
  71. turning “big” data into small data: “Make big data as small as possible as quick as is possible” -Robert Gentleman
  72. turning “big” data into small data
      raw reads (~3 Tb), example record:
      @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
      CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
      +
      GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
      aligned reads (~1.5 Tb), example record:
      300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0
      data-driven transcriptome assembly (~150 Mb), example record:
      chr1 Cufflinks exon 14765 16672 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.9.1"; tss_id "TSS1";
  73. turning “big” data into small data
      aligned reads (~1.5 Tb), example record:
      300811_fcB_:1:2207:14419:123617:0:1 256 1 19282 0 101M * 0 0 GCGAGCCTGTGTGGTGCGCAGGGATGAGAAG GCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGAGGGTGTGGGATGGTGGAGGGGTTTGAGAAG [_J\cccegggegh^efghiiihg`fhfhiiifihiiiiiifgecccccZaccdccccccccccaccccc^aacQQ O[_cacca]]_[`^acT]^abcacc AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:5 CC:Z:16 CP:i:68971 HI:i:0
      data-driven transcriptome assembly (~150 Mb), example record:
      chr1 Cufflinks exon 14765 16672 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.9.1"; tss_id "TSS1";
      ballgown objects (200-600 Mb)
      http://figshare.com/articles/GEUVADIS_Processed_Data/1130849
  74. Reproducible; freely available

  75. Download my processed data; save time! 3653 hours, 999 hours, 651 hours; 5299 total hours, assuming 4 cores available