RNA-seq QC and Data Analysis using the Tuxedo Suite
Overview of RNA-seq quality control and data analysis using TopHat, Cufflinks, and CummeRbund. Given at the CSHL DNA Learning Center RNA-seq for the Next Generation (http://www.rnaseqforthenextgeneration.org/)
D. Turner, Ph.D.! University of Virginia Bioinformatics Core Director bioinformatics.virginia.edu Slides: stephenturner.us/slides using the TUXEDO software suite
• Founder/Director UVA Bioinformatics Core - Established Oct 2011 - Mission: help collaborators publish and fund their work by providing service, training, & infrastructure. - bioinformatics.virginia.edu 3
molecular biologists to understand computational biology. - Same brain considers both biology and bioinformatics. • Lots of biologists approach genomics as if rules of experimental design don't matter. - N=1 - No controls or improper controls - False comfort provided by P-values, etc. 6
Genomes with Ensembl - Using Galaxy for data intensive biology - Introduction to R for Life Sciences - RNA-seq data analysis bootcamp • Software Carpentry: - Software Carpentry Software Skills Bootcamp (Unix, Python, automation, version control) - Software Carpentry Instructor Training 7
sequence do I need? • Answer: it’s complicated. • Oversimplified answer: 20-50 million PE reads / sample. • Depends on: - Size & complexity of transcriptome. - Application: differential gene expression, transcript discovery. - Tissue type, RNA quality, library preparation. - Sequencing type: length, paired-end vs single-end, etc. • Find a publication in your field with similar goals. • Good news: A fraction of a HiSeq lane good enough. 11
samples should I sequence? • Oversimplified Answer: Never, less than 3 biological replicates per condition. • Depends on: - Application - Goals (prioritization, biomarker discovery, etc.) - Effect size, desired power, statistical significance • Find a publication with similar goals 12
to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993 Turner SD. RNA-seq Workflows and Tools. http://dx.doi.org/10.6084/m9.figshare.662782 Reality: a bit more complicated…
force you to think too rigidly. • Don’t “pipeline” too early, if at all. • Does it even need to be pipeline-ified? • Who’s running it? - You, once: don’t pipeline-ify. Document, move along. - You, 2-5 times: documented script? - You, 10+ times: consider pipeline-ifying. - Others: create sharable pipeline (GreenLine, VM, Galaxy, containerize, makefiles, …) • See: Loman & Watson. "So you want to be a computational biologist?" Nat Biotechnol 31 (2013): 996-998. 17
et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2013): 46-53. How often does this happen in reality? A lot, maybe.
originating from known transcripts. • Problem: transcriptome annotation incomplete. • Solution: map to genome instead. • Problem: - Genome has introns. - Transcriptome doesn’t (and neither do reads). - If avg txp is 2000 bp, and reads randomly distributed along a transcript, then ~1/3 100-bp reads will span a splice junction. • Solution: splice-aware alignment. 22
let’s use what we already know from reference. • Identify minimum set of transcripts that explain our reads. 24 Roberts, Adam, et al. "Identification of novel transcripts in annotated genomes using RNA-Seq." Bioinformatics 27.17 (2011): 2325-2329.
mapping to a gene region (alignments from TopHat), • Assume you have a set of isoforms for that gene (assembly from Cufflinks). • Assuming you know already know the abundance of each isoform, you could probabilistically distribute each read across the isoforms that it could possibly have come from using the known abundance of those isoforms. 25 Mapped reads, unknown transcript Mapped reads, known transcript Reference transcriptome “Known” abundances
priori! • And you don’t know in advance the probability that a read originated from an isoform, so you can’t estimate it either! • But, you can figure it out with EM! 26 Mapped reads, unknown transcript Unknown abundances
each isoform based on the abundances of the isoforms that are compatible with the read (using spliced / paired-end alignments). 2. Recompute isoform abundance with the updated read-to-isoform assignment 3. Reassign read assignment based on the newly updated abundance estimates 4. Rinse, wash, repeat, until convergence. 27 Reference transcriptome Initialization Convergence Iteration
length and read depth) for each sample, for each transcript. • Want to know DE genes, isoforms, TSS groups, … • Cuffdiff 2 model accounts for both sources of noise: - Biological variation between replicates of same condition (negative binomial distribution) - Uncertainty in assigning reads to isoforms (beta distribution) • Resulting beta-negative binomial mixture distribution used to model over dispersion in hypothesis testing framework. 28 Differential isoform expression ••• Differential gene expression • Differential splicing •• Differential TSS usage ••, •
TopHat: - accepted_hits.bam: List of read alignments (the original read sequence, where it mapped, “how” it mapped). One for each sample. - junctions.bed: Tracks of junctions reported by TopHat. Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction. - insertions.bed and deletions.bed: Tracks of insertions and deletions reported by TopHat. 29
Steven L. Salzberg. "HISAT: a fast spliced aligner with low memory requirements." Nature methods 12.4 (2015): 357-360. HISAT to be underlying engine for TopHat3
datasets of 30M PE reads each using 20 CPUs. Accuracy: near-optimal Q: So wait a sec, you mean to tell me I can quantify my RNA-seq data to isoform level resolution in 5 minutes on my laptop with better accuracy than I can running other tools on a supercomputing cluster? A: Yeah, that’s right. Bray, Nicolas, et al. "Near-optimal RNA-Seq quantification." arXiv preprint arXiv:1505.02710 (2015).
that came before it. Accuracy just as good. ! http://www.cs.cmu.edu/~ckingsf/ software/sailfish/ Kallisto: 10x faster than Salmon, more accurate. ! http://pachterlab.github.io/kallisto/ Salmon: Sailfish’s unpublished successor. Borrows some techniques from kallisto. Better/ faster? Wait and see. ! http://combine-lab.github.io/salmon/
is ~2kb long, ~10 exons, alternatively spliced. - Illumina reads are only ~100bp long. • Full-length isoform sequencing, no assembly required: the future of RNA- seq? • pacificbiosciences.com/applications/isoseq • Sequence single molecules 1kb+. • See: Sharon, Donald, et al. "A single- molecule long-read survey of the human transcriptome." Nature biotechnology 31.11 (2013): 1009-1014. 41
and Steven L. Salzberg. "TopHat: discovering splice junctions with RNA-Seq." Bioinformatics 25.9 (2009): 1105-1111. • TopHat2: Kim, Daehwan, et al. "TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions." Genome Biol 14.4 (2013): R36. • Original Cufflinks paper: Trapnell, Cole, et al. "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation." Nature biotechnology 28.5 (2010): 511-515. • Cufflinks bias correction: Roberts, Adam, et al. "Improving RNA-Seq expression estimates by correcting for fragment bias." Genome Biol 12.3 (2011): R22. • Cufflinks RABT: Roberts, Adam, et al. "Identification of novel transcripts in annotated genomes using RNA-Seq." Bioinformatics 27.17 (2011): 2325-2329. • Cuffdiff2: Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2013): 46-53. • TopHat / Cufflinks Protocol: Trapnell, Cole, et al. "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks." Nature protocols 7.3 (2012): 562-578. • cummeRbund (unpublished): bioconductor.org/packages/cummeRbund 42
universal RNA-seq aligner." Bioinformatics 29.1 (2013): 15-21. • HISAT: Kim, Daehwan, Ben Langmead, and Steven L. Salzberg. "HISAT: a fast spliced aligner with low memory requirements." Nature methods 12.4 (2015): 357-360. • StringTie: Pertea, Mihaela, et al. "StringTie enables improved reconstruction of a transcriptome from RNA-seq reads." Nature biotechnology 33.3 (2015): 290-295. • Ballgown: Frazee, Alyssa C., et al. "Ballgown bridges the gap between transcriptome assembly and expression analysis." Nature biotechnology 33.3 (2015): 243-246. • eXpress: Roberts, Adam, and Lior Pachter. "Streaming fragment assignment for real-time analysis of sequencing experiments." Nature methods 10.1 (2013): 71-73. • Sailfish: Patro, Rob, Stephen M. Mount, and Carl Kingsford. "Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms." Nature biotechnology 32.5 (2014): 462-464. • Kallisto: Bray, Nicolas, et al. "Near-optimal RNA-Seq quantification." arXiv: 1505.02710 (2015). 43
Twitter: @SEQquestions - Format: Forum - Li et al. SEQanswers : An open access community for collaboratively decoding genomes. Bioinformatics (2012). • BioStar: - biostars.org - Twitter: @BioStarQuestion - Format: Q&A - Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011) 7:e1002216. 45