RNA-seq QC and Data Analysis using the Tuxedo Suite

RNA-seq QC and Data Analysis using the Tuxedo Suite

Overview of RNA-seq quality control and data analysis using TopHat, Cufflinks, and CummeRbund. Given at the CSHL DNA Learning Center RNA-seq for the Next Generation (http://www.rnaseqforthenextgeneration.org/)

8c8cb9d49f0ff8139e459414aeb4c055?s=128

Stephen Turner

June 10, 2015
Tweet

Transcript

  1. RNA-SEQ quality control and analysis of DIFFERENTIAL GENE EXPRESSION Stephen

    D. Turner, Ph.D.! University of Virginia Bioinformatics Core Director bioinformatics.virginia.edu Slides: stephenturner.us/slides using the TUXEDO software suite
  2. Outline 1. RNA-seq overview & FAQs 2. Motivation for quantifying

    transcripts 3. In-depth look at alignment, assembly, quant, & DE using Tuxedo tools 4. Newer tools & alternative approaches 2
  3. Hi. I’m Stephen Turner. • Asst. Prof. Public Health Sciences

    • Founder/Director UVA Bioinformatics Core - Established Oct 2011 - Mission: help collaborators publish and fund their work by providing service, training, & infrastructure. - bioinformatics.virginia.edu 3
  4. Service • Gene expression: - RNA-seq! - Microarrays (Affy, Illumina)

    - Other (Nanostring, ...) • DNA Variation: - NGS (exome, custom amplicons, WGS, …) - Array-based (GWAS, ...) • DNA Methylation: - NGS (MeDIP, RRBS, …) - Array-based (Illumina Infinium chips). • DNA Binding / ChIP-Seq • “Pathway Analysis” • Metagenomics - Microbiome studies (16S, whole-genome) - Pathogen detection/ characterization, - Phylogenetic / compositional analysis - Functional Metagenomics • Acquisition / analysis of publicly available data (GEO, dbGaP), & deposition. • Grant / Manuscript support • Custom development 4
  5. Service 5

  6. Education • Bioinformatics: sub-discipline of molecular biology - Critical for

    molecular biologists to understand computational biology. - Same brain considers both biology and bioinformatics. • Lots of biologists approach genomics as if rules of experimental design don't matter. - N=1 - No controls or improper controls - False comfort provided by P-values, etc. 6
  7. Education • Workshops / short courses: - Browsing Genes &

    Genomes with Ensembl - Using Galaxy for data intensive biology - Introduction to R for Life Sciences - RNA-seq data analysis bootcamp • Software Carpentry: - Software Carpentry Software Skills Bootcamp (Unix, Python, automation, version control) - Software Carpentry Instructor Training 7
  8. Education 8 All courseware is open-source ! Source code: github.com/bioconnector/workshops/

    ! Rendered course materials: bioconnector.org/workshops ! RNA-seq course: • bioconnector.org/workshops/ws-rnaseq-1day/ • PDF: Turner, Stephen (2014): RNA-seq workshop course materials 11/10/2014. figshare. http:// dx.doi.org/10.6084/m9.figshare.1247658
  9. Infrastructure 9 rstudio.bioconnector.virginia.edu galaxy.bioconnector.virginia.edu

  10. RNA-seq Overview 10 Prep samples Prep libraries Sequence Bioinformatics:! QA/QC,

    Analysis, … Step 1: Bioinformatics! Study design and planning. Consult your bioinformatician!
  11. RNA-seq common question #1: ! Depth • Question: how much

    sequence do I need? • Answer: it’s complicated. • Oversimplified answer: 20-50 million PE reads / sample. • Depends on: - Size & complexity of transcriptome. - Application: differential gene expression, transcript discovery. - Tissue type, RNA quality, library preparation. - Sequencing type: length, paired-end vs single-end, etc. • Find a publication in your field with similar goals. • Good news: A fraction of a HiSeq lane good enough. 11
  12. RNA-seq common question #2: sample size • Question: How many

    samples should I sequence? • Oversimplified Answer: Never, less than 3 biological replicates per condition. • Depends on: - Application - Goals (prioritization, biomarker discovery, etc.) - Effect size, desired power, statistical significance • Find a publication with similar goals 12
  13. RNA-seq common question #3: Workflow 13 “How do I analyze

    the data?” Perception: ACACTCGCATCCGCACATCGCACTA GGTCAGCATACGCCGACTCCGACCG GCGCTATCGCCAGCGGAAATCGCAA Sequence Data
  14. RNA-seq common question #3:! Workflow 14 Eyras et al. Methods

    to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993 Turner SD. RNA-seq Workflows and Tools. http://dx.doi.org/10.6084/m9.figshare.662782 Reality: a bit more complicated…
  15. RNA-seq workflow #1:! Differential Gene Expression 15 Turner, Stephen D.

    (2015): RNA-seq workflows. http://dx.doi.org/10.6084/m9.figshare.1430386
  16. RNA-seq workflow #2:! Differential Isoform Expression, Exon Usage 16 Turner,

    Stephen D. (2015): RNA-seq workflows. http://dx.doi.org/10.6084/m9.figshare.1430386
  17. Beware of Pipelineitis • “Pipelines” can kill your creativity and

    force you to think too rigidly. • Don’t “pipeline” too early, if at all. • Does it even need to be pipeline-ified? • Who’s running it? - You, once: don’t pipeline-ify. Document, move along. - You, 2-5 times: documented script? - You, 10+ times: consider pipeline-ifying. - Others: create sharable pipeline (GreenLine, VM, Galaxy, containerize, makefiles, …) • See: Loman & Watson. "So you want to be a computational biologist?" Nat Biotechnol 31 (2013): 996-998. 17
  18. Gene Counts vs. ! Transcript Abundance Quantification 18 Gene counting

    is easy* Isoform deconvolution is hard(er) *=relatively.
  19. Gene Counts vs. ! Transcript Abundance Quantification 19 Trapnell, Cole,

    et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2013): 46-53.
  20. Gene Counts vs. ! Transcript Abundance Quantification 20 Trapnell, Cole,

    et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2013): 46-53. How often does this happen in reality? A lot, maybe.
  21. Gene Counts vs. ! Transcript Abundance Quantification • What are

    you actually measuring? - Genes are not expressed. Isoforms are. • Count-based models bias differential expression estimates. • But assigning fragments to isoforms is hard. • You don’t get this: ! ! ! ! • Average human protein-coding transcript: - ~ 2000-bp long, - ~10 exons • Illumina RNA-seq library: 100-bp fragments 21
  22. Alignment • Ideally, start estimating abundance by measuring % reads

    originating from known transcripts. • Problem: transcriptome annotation incomplete. • Solution: map to genome instead. • Problem: - Genome has introns. - Transcriptome doesn’t (and neither do reads). - If avg txp is 2000 bp, and reads randomly distributed along a transcript, then ~1/3 100-bp reads will span a splice junction. • Solution: splice-aware alignment. 22
  23. Alignment with TopHat2 23 Kim, Daehwan, et al. "TopHat2: accurate

    alignment of transcriptomes in the presence of insertions, deletions and gene fusions." Genome Biol 14.4 (2013): R36.
  24. Assembly with Cufflinks • Could do completely de-novo. • But

    let’s use what we already know from reference. • Identify minimum set of transcripts that explain our reads. 24 Roberts, Adam, et al. "Identification of novel transcripts in annotated genomes using RNA-Seq." Bioinformatics 27.17 (2011): 2325-2329.
  25. Abundance Estimation • Assume you have a set of reads

    mapping to a gene region (alignments from TopHat), • Assume you have a set of isoforms for that gene (assembly from Cufflinks). • Assuming you know already know the abundance of each isoform, you could probabilistically distribute each read across the isoforms that it could possibly have come from using the known abundance of those isoforms. 25 Mapped reads, unknown transcript Mapped reads, known transcript Reference transcriptome “Known” abundances
  26. Abundance Estimation • But you don’t know the abundances a

    priori! • And you don’t know in advance the probability that a read originated from an isoform, so you can’t estimate it either! • But, you can figure it out with EM! 26 Mapped reads, unknown transcript Unknown abundances
  27. Abundance Estimation 1. Guess probability of each read coming from

    each isoform based on the abundances of the isoforms that are compatible with the read (using spliced / paired-end alignments). 2. Recompute isoform abundance with the updated read-to-isoform assignment 3. Reassign read assignment based on the newly updated abundance estimates 4. Rinse, wash, repeat, until convergence. 27 Reference transcriptome Initialization Convergence Iteration
  28. Differential Expression • Now have FPKM (read count normalized by

    length and read depth) for each sample, for each transcript. • Want to know DE genes, isoforms, TSS groups, … • Cuffdiff 2 model accounts for both sources of noise: - Biological variation between replicates of same condition (negative binomial distribution) - Uncertainty in assigning reads to isoforms (beta distribution) • Resulting beta-negative binomial mixture distribution used to model over dispersion in hypothesis testing framework. 28 Differential isoform expression ••• Differential gene expression • Differential splicing •• Differential TSS usage ••, •
  29. Tuxedo Output • Input: FASTQ files (sequence & quality) •

    TopHat: - accepted_hits.bam: List of read alignments (the original read sequence, where it mapped, “how” it mapped). One for each sample. - junctions.bed: Tracks of junctions reported by TopHat. Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction. - insertions.bed and deletions.bed: Tracks of insertions and deletions reported by TopHat. 29
  30. Tuxedo Output • Cufflinks: - transcripts.gtf: Cufflinks' assembled isoforms (start/end

    positions, gene/transcript name, feature name, FPKMs, confidence intervals for abundance estimation, etc.). One for each sample. - Tracking files (more details about abundance estimation, uncertainty, etc). • Cuffmerge: - merged.gtf: an assembly merged from all individual assemblies. • Cuffdiff - diff files (tab delimited file lists the results of differential expression testing between samples for spliced transcripts, primary transcripts, genes, and coding sequences) ‣ isoform_exp.diff! ‣ gene_exp.diff! ‣ tss_group_exp.diff ‣ cds_exp.diff - Tracking files (more details about abundance estimation, testing, uncertainty, etc). 30
  31. Tuxedo Output • CummeRbund: R package for querying & visualizing

    cuffdiff output. 31
  32. RNA-seq data analysis: Then and Now • TopHat first published

    over 6 years ago. • Cufflinks first published over 5 years ago. • Lots of newer tools exist: - Better & faster spliced alignment than TopHat - Better & faster assembly than Cufflinks - Better & faster quantification than Cufflinks - Better statistical modeling than Cuffdiff - Ultrafast “alignment-free” transcript quantification - Full-length isoform sequencing 32
  33. Alignment: STAR 33 Dobin, Alexander, et al. "STAR: ultrafast universal

    RNA-seq aligner." Bioinformatics 29.1 (2013): 15-21. FAST! Accurate!
  34. Alignment: HISAT 34 FAST! Accurate! Kim, Daehwan, Ben Langmead, and

    Steven L. Salzberg. "HISAT: a fast spliced aligner with low memory requirements." Nature methods 12.4 (2015): 357-360. HISAT to be underlying engine for TopHat3
  35. Assembly: StringTie 35 Pertea, Mihaela, et al. "StringTie enables improved

    reconstruction of a transcriptome from RNA-seq reads." Nature biotechnology 33.3 (2015): 290-295. Faster! Less RAM! Better sensitivity & precision!
  36. Differential Expression: Ballgown • Testing frameworks: - Cufflinks: ‣ Two-group

    comparison (e.g., WT vs Mut) - Ballgown: ‣ Two-group comparison ‣ eQTL analysis ‣ Time-course experiments ‣ Continuous covariates ‣ Confounded experimental designs ‣ Testing at the exon, gene or transcript level. • Performance: - Better sensitivity, better precision demonstrated in simulation. - 18 seconds on 8GB MacBook vs 69 hours on 148GB cluster! 36 Frazee, Alyssa C., et al. "Flexible isoform-level differential expression analysis with Ballgown." bioRxiv (2014): 003665.
  37. Quantification: eXpress 37 Equal or better accuracy Fast! Low-resource! Time

    RAM
  38. “Alignment-free” quantification: Sailfish 38 25x Faster! Accurate! Sailfish skips the

    alignment step altogether, uses kmer counting. Patro, Rob, Stephen M. Mount, and Carl Kingsford. "Sailfish enables alignment- free isoform quantification from RNA-seq reads using lightweight algorithms." Nature biotechnology 32.5 (2014): 462-464.
  39. “Pseudo-alignment” quantification: kallisto 39 Blazing fast! Runtime for processing 20

    datasets of 30M PE reads each using 20 CPUs. Accuracy: near-optimal Q: So wait a sec, you mean to tell me I can quantify my RNA-seq data to isoform level resolution in 5 minutes on my laptop with better accuracy than I can running other tools on a supercomputing cluster? A: Yeah, that’s right. Bray, Nicolas, et al. "Near-optimal RNA-Seq quantification." arXiv preprint arXiv:1505.02710 (2015).
  40. “Alignment-free” quantification redux:! Salmon 40 Sailfish: 25x faster than anything

    that came before it. Accuracy just as good. ! http://www.cs.cmu.edu/~ckingsf/ software/sailfish/ Kallisto: 10x faster than Salmon, more accurate. ! http://pachterlab.github.io/kallisto/ Salmon: Sailfish’s unpublished successor. Borrows some techniques from kallisto. Better/ faster? Wait and see. ! http://combine-lab.github.io/salmon/
  41. Full-length Isoform Sequencing • From earlier: - Average protein-coding transcript

    is ~2kb long, ~10 exons, alternatively spliced. - Illumina reads are only ~100bp long. • Full-length isoform sequencing, no assembly required: the future of RNA- seq? • pacificbiosciences.com/applications/isoseq • Sequence single molecules 1kb+. • See: Sharon, Donald, et al. "A single- molecule long-read survey of the human transcriptome." Nature biotechnology 31.11 (2013): 1009-1014. 41
  42. Annotated References • Original TopHat paper: Trapnell, Cole, Lior Pachter,

    and Steven L. Salzberg. "TopHat: discovering splice junctions with RNA-Seq." Bioinformatics 25.9 (2009): 1105-1111. • TopHat2: Kim, Daehwan, et al. "TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions." Genome Biol 14.4 (2013): R36. • Original Cufflinks paper: Trapnell, Cole, et al. "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation." Nature biotechnology 28.5 (2010): 511-515. • Cufflinks bias correction: Roberts, Adam, et al. "Improving RNA-Seq expression estimates by correcting for fragment bias." Genome Biol 12.3 (2011): R22. • Cufflinks RABT: Roberts, Adam, et al. "Identification of novel transcripts in annotated genomes using RNA-Seq." Bioinformatics 27.17 (2011): 2325-2329. • Cuffdiff2: Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2013): 46-53. • TopHat / Cufflinks Protocol: Trapnell, Cole, et al. "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks." Nature protocols 7.3 (2012): 562-578. • cummeRbund (unpublished): bioconductor.org/packages/cummeRbund 42
  43. Annotated References • STAR: Dobin, Alexander, et al. "STAR: ultrafast

    universal RNA-seq aligner." Bioinformatics 29.1 (2013): 15-21. • HISAT: Kim, Daehwan, Ben Langmead, and Steven L. Salzberg. "HISAT: a fast spliced aligner with low memory requirements." Nature methods 12.4 (2015): 357-360. • StringTie: Pertea, Mihaela, et al. "StringTie enables improved reconstruction of a transcriptome from RNA-seq reads." Nature biotechnology 33.3 (2015): 290-295. • Ballgown: Frazee, Alyssa C., et al. "Ballgown bridges the gap between transcriptome assembly and expression analysis." Nature biotechnology 33.3 (2015): 243-246. • eXpress: Roberts, Adam, and Lior Pachter. "Streaming fragment assignment for real-time analysis of sequencing experiments." Nature methods 10.1 (2013): 71-73. • Sailfish: Patro, Rob, Stephen M. Mount, and Carl Kingsford. "Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms." Nature biotechnology 32.5 (2014): 462-464. • Kallisto: Bray, Nicolas, et al. "Near-optimal RNA-Seq quantification." arXiv: 1505.02710 (2015). 43
  44. Resources: where to get data • Gene Expression Omnibus (GEO)

    • www.ncbi.nlm.nih.gov/geo • As of June 2015: - ~4,000 datasets - ~1,500,000 samples • Free data, no registration required. Go play! 44
  45. Resources: where to get help • Seqanswers: - SEQanswers.com -

    Twitter: @SEQquestions - Format: Forum - Li et al. SEQanswers : An open access community for collaboratively decoding genomes. Bioinformatics (2012). • BioStar: - biostars.org - Twitter: @BioStarQuestion - Format: Q&A - Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011) 7:e1002216. 45
  46. 46 Web: bioinformatics.virginia.edu E-Mail: bioinformatics@virginia.edu Blog: GettingGeneticsDone.com Twitter: @genetics_blog Facebook:

    facebook.com/UVABioinformaticsCore Slides: stephenturner.us/slides