RNA-seq QC and Data Analysis using the Tuxedo Suite

RNA-SEQ quality control and analysis of DIFFERENTIAL GENE EXPRESSION Stephen
D. Turner, Ph.D.! University of Virginia Bioinformatics Core Director bioinformatics.virginia.edu Slides: stephenturner.us/slides using the TUXEDO software suite

Outline 1. RNA-seq overview & FAQs 2. Motivation for quantifying
transcripts 3. In-depth look at alignment, assembly, quant, & DE using Tuxedo tools 4. Newer tools & alternative approaches 2

Hi. I’m Stephen Turner. • Asst. Prof. Public Health Sciences
• Founder/Director UVA Bioinformatics Core - Established Oct 2011 - Mission: help collaborators publish and fund their work by providing service, training, & infrastructure. - bioinformatics.virginia.edu 3

Service • Gene expression: - RNA-seq! - Microarrays (Affy, Illumina)
- Other (Nanostring, ...) • DNA Variation: - NGS (exome, custom amplicons, WGS, …) - Array-based (GWAS, ...) • DNA Methylation: - NGS (MeDIP, RRBS, …) - Array-based (Illumina Inﬁnium chips). • DNA Binding / ChIP-Seq • “Pathway Analysis” • Metagenomics - Microbiome studies (16S, whole-genome) - Pathogen detection/ characterization, - Phylogenetic / compositional analysis - Functional Metagenomics • Acquisition / analysis of publicly available data (GEO, dbGaP), & deposition. • Grant / Manuscript support • Custom development 4

Service 5

Education • Bioinformatics: sub-discipline of molecular biology - Critical for
molecular biologists to understand computational biology. - Same brain considers both biology and bioinformatics. • Lots of biologists approach genomics as if rules of experimental design don't matter. - N=1 - No controls or improper controls - False comfort provided by P-values, etc. 6

Education • Workshops / short courses: - Browsing Genes &
Genomes with Ensembl - Using Galaxy for data intensive biology - Introduction to R for Life Sciences - RNA-seq data analysis bootcamp • Software Carpentry: - Software Carpentry Software Skills Bootcamp (Unix, Python, automation, version control) - Software Carpentry Instructor Training 7

Education 8 All courseware is open-source ! Source code: github.com/bioconnector/workshops/
! Rendered course materials: bioconnector.org/workshops ! RNA-seq course: • bioconnector.org/workshops/ws-rnaseq-1day/ • PDF: Turner, Stephen (2014): RNA-seq workshop course materials 11/10/2014. ﬁgshare. http:// dx.doi.org/10.6084/m9.ﬁgshare.1247658

Infrastructure 9 rstudio.bioconnector.virginia.edu galaxy.bioconnector.virginia.edu

RNA-seq Overview 10 Prep samples Prep libraries Sequence Bioinformatics:! QA/QC,
Analysis, … Step 1: Bioinformatics! Study design and planning. Consult your bioinformatician!

RNA-seq common question #1: ! Depth • Question: how much
sequence do I need? • Answer: it’s complicated. • Oversimpliﬁed answer: 20-50 million PE reads / sample. • Depends on: - Size & complexity of transcriptome. - Application: differential gene expression, transcript discovery. - Tissue type, RNA quality, library preparation. - Sequencing type: length, paired-end vs single-end, etc. • Find a publication in your ﬁeld with similar goals. • Good news: A fraction of a HiSeq lane good enough. 11

RNA-seq common question #2: sample size • Question: How many
samples should I sequence? • Oversimpliﬁed Answer: Never, less than 3 biological replicates per condition. • Depends on: - Application - Goals (prioritization, biomarker discovery, etc.) - Effect size, desired power, statistical signiﬁcance • Find a publication with similar goals 12

RNA-seq common question #3: Workflow 13 “How do I analyze
the data?” Perception: ACACTCGCATCCGCACATCGCACTA GGTCAGCATACGCCGACTCCGACCG GCGCTATCGCCAGCGGAAATCGCAA Sequence Data

RNA-seq common question #3:! Workflow 14 Eyras et al. Methods
to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993 Turner SD. RNA-seq Workflows and Tools. http://dx.doi.org/10.6084/m9.figshare.662782 Reality: a bit more complicated…

RNA-seq workflow #1:! Differential Gene Expression 15 Turner, Stephen D.
(2015): RNA-seq workﬂows. http://dx.doi.org/10.6084/m9.ﬁgshare.1430386

RNA-seq workflow #2:! Differential Isoform Expression, Exon Usage 16 Turner,
Stephen D. (2015): RNA-seq workﬂows. http://dx.doi.org/10.6084/m9.ﬁgshare.1430386

Beware of Pipelineitis • “Pipelines” can kill your creativity and
force you to think too rigidly. • Don’t “pipeline” too early, if at all. • Does it even need to be pipeline-iﬁed? • Who’s running it? - You, once: don’t pipeline-ify. Document, move along. - You, 2-5 times: documented script? - You, 10+ times: consider pipeline-ifying. - Others: create sharable pipeline (GreenLine, VM, Galaxy, containerize, makeﬁles, …) • See: Loman & Watson. "So you want to be a computational biologist?" Nat Biotechnol 31 (2013): 996-998. 17

Gene Counts vs. ! Transcript Abundance Quantification 18 Gene counting
is easy* Isoform deconvolution is hard(er) *=relatively.

Gene Counts vs. ! Transcript Abundance Quantification 19 Trapnell, Cole,
et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2013): 46-53.

Gene Counts vs. ! Transcript Abundance Quantification 20 Trapnell, Cole,
et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2013): 46-53. How often does this happen in reality? A lot, maybe.

Gene Counts vs. ! Transcript Abundance Quantification • What are
you actually measuring? - Genes are not expressed. Isoforms are. • Count-based models bias differential expression estimates. • But assigning fragments to isoforms is hard. • You don’t get this: ! ! ! ! • Average human protein-coding transcript: - ~ 2000-bp long, - ~10 exons • Illumina RNA-seq library: 100-bp fragments 21

Alignment • Ideally, start estimating abundance by measuring % reads
originating from known transcripts. • Problem: transcriptome annotation incomplete. • Solution: map to genome instead. • Problem: - Genome has introns. - Transcriptome doesn’t (and neither do reads). - If avg txp is 2000 bp, and reads randomly distributed along a transcript, then ~1/3 100-bp reads will span a splice junction. • Solution: splice-aware alignment. 22

Alignment with TopHat2 23 Kim, Daehwan, et al. "TopHat2: accurate
alignment of transcriptomes in the presence of insertions, deletions and gene fusions." Genome Biol 14.4 (2013): R36.

Assembly with Cufflinks • Could do completely de-novo. • But
let’s use what we already know from reference. • Identify minimum set of transcripts that explain our reads. 24 Roberts, Adam, et al. "Identiﬁcation of novel transcripts in annotated genomes using RNA-Seq." Bioinformatics 27.17 (2011): 2325-2329.

Abundance Estimation • Assume you have a set of reads
mapping to a gene region (alignments from TopHat), • Assume you have a set of isoforms for that gene (assembly from Cufﬂinks). • Assuming you know already know the abundance of each isoform, you could probabilistically distribute each read across the isoforms that it could possibly have come from using the known abundance of those isoforms. 25 Mapped reads, unknown transcript Mapped reads, known transcript Reference transcriptome “Known” abundances

Abundance Estimation • But you don’t know the abundances a
priori! • And you don’t know in advance the probability that a read originated from an isoform, so you can’t estimate it either! • But, you can ﬁgure it out with EM! 26 Mapped reads, unknown transcript Unknown abundances

Abundance Estimation 1. Guess probability of each read coming from
each isoform based on the abundances of the isoforms that are compatible with the read (using spliced / paired-end alignments). 2. Recompute isoform abundance with the updated read-to-isoform assignment 3. Reassign read assignment based on the newly updated abundance estimates 4. Rinse, wash, repeat, until convergence. 27 Reference transcriptome Initialization Convergence Iteration

Differential Expression • Now have FPKM (read count normalized by
length and read depth) for each sample, for each transcript. • Want to know DE genes, isoforms, TSS groups, … • Cuffdiff 2 model accounts for both sources of noise: - Biological variation between replicates of same condition (negative binomial distribution) - Uncertainty in assigning reads to isoforms (beta distribution) • Resulting beta-negative binomial mixture distribution used to model over dispersion in hypothesis testing framework. 28 Differential isoform expression ••• Differential gene expression • Differential splicing •• Differential TSS usage ••, •

Tuxedo Output • Input: FASTQ ﬁles (sequence & quality) •
TopHat: - accepted_hits.bam: List of read alignments (the original read sequence, where it mapped, “how” it mapped). One for each sample. - junctions.bed: Tracks of junctions reported by TopHat. Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction. - insertions.bed and deletions.bed: Tracks of insertions and deletions reported by TopHat. 29

Tuxedo Output • Cufflinks: - transcripts.gtf: Cufflinks' assembled isoforms (start/end
positions, gene/transcript name, feature name, FPKMs, confidence intervals for abundance estimation, etc.). One for each sample. - Tracking files (more details about abundance estimation, uncertainty, etc). • Cuffmerge: - merged.gtf: an assembly merged from all individual assemblies. • Cuffdiff - diff files (tab delimited file lists the results of differential expression testing between samples for spliced transcripts, primary transcripts, genes, and coding sequences) ‣ isoform_exp.diff! ‣ gene_exp.diff! ‣ tss_group_exp.diff ‣ cds_exp.diff - Tracking files (more details about abundance estimation, testing, uncertainty, etc). 30

Tuxedo Output • CummeRbund: R package for querying & visualizing
cuffdiff output. 31

RNA-seq data analysis: Then and Now • TopHat first published
over 6 years ago. • Cufflinks first published over 5 years ago. • Lots of newer tools exist: - Better & faster spliced alignment than TopHat - Better & faster assembly than Cufflinks - Better & faster quantification than Cufflinks - Better statistical modeling than Cuffdiff - Ultrafast “alignment-free” transcript quantification - Full-length isoform sequencing 32

Alignment: STAR 33 Dobin, Alexander, et al. "STAR: ultrafast universal
RNA-seq aligner." Bioinformatics 29.1 (2013): 15-21. FAST! Accurate!

Alignment: HISAT 34 FAST! Accurate! Kim, Daehwan, Ben Langmead, and
Steven L. Salzberg. "HISAT: a fast spliced aligner with low memory requirements." Nature methods 12.4 (2015): 357-360. HISAT to be underlying engine for TopHat3

Assembly: StringTie 35 Pertea, Mihaela, et al. "StringTie enables improved
reconstruction of a transcriptome from RNA-seq reads." Nature biotechnology 33.3 (2015): 290-295. Faster! Less RAM! Better sensitivity & precision!

Differential Expression: Ballgown • Testing frameworks: - Cufﬂinks: ‣ Two-group
comparison (e.g., WT vs Mut) - Ballgown: ‣ Two-group comparison ‣ eQTL analysis ‣ Time-course experiments ‣ Continuous covariates ‣ Confounded experimental designs ‣ Testing at the exon, gene or transcript level. • Performance: - Better sensitivity, better precision demonstrated in simulation. - 18 seconds on 8GB MacBook vs 69 hours on 148GB cluster! 36 Frazee, Alyssa C., et al. "Flexible isoform-level differential expression analysis with Ballgown." bioRxiv (2014): 003665.

Quantification: eXpress 37 Equal or better accuracy Fast! Low-resource! Time
RAM

“Alignment-free” quantification: Sailfish 38 25x Faster! Accurate! Sailfish skips the
alignment step altogether, uses kmer counting. Patro, Rob, Stephen M. Mount, and Carl Kingsford. "Sailfish enables alignment- free isoform quantification from RNA-seq reads using lightweight algorithms." Nature biotechnology 32.5 (2014): 462-464.

“Pseudo-alignment” quantification: kallisto 39 Blazing fast! Runtime for processing 20
datasets of 30M PE reads each using 20 CPUs. Accuracy: near-optimal Q: So wait a sec, you mean to tell me I can quantify my RNA-seq data to isoform level resolution in 5 minutes on my laptop with better accuracy than I can running other tools on a supercomputing cluster? A: Yeah, that’s right. Bray, Nicolas, et al. "Near-optimal RNA-Seq quantiﬁcation." arXiv preprint arXiv:1505.02710 (2015).

“Alignment-free” quantification redux:! Salmon 40 Sailfish: 25x faster than anything
that came before it. Accuracy just as good. ! http://www.cs.cmu.edu/~ckingsf/ software/sailfish/ Kallisto: 10x faster than Salmon, more accurate. ! http://pachterlab.github.io/kallisto/ Salmon: Sailfish’s unpublished successor. Borrows some techniques from kallisto. Better/ faster? Wait and see. ! http://combine-lab.github.io/salmon/

Full-length Isoform Sequencing • From earlier: - Average protein-coding transcript
is ~2kb long, ~10 exons, alternatively spliced. - Illumina reads are only ~100bp long. • Full-length isoform sequencing, no assembly required: the future of RNA- seq? • paciﬁcbiosciences.com/applications/isoseq • Sequence single molecules 1kb+. • See: Sharon, Donald, et al. "A single- molecule long-read survey of the human transcriptome." Nature biotechnology 31.11 (2013): 1009-1014. 41

Annotated References • Original TopHat paper: Trapnell, Cole, Lior Pachter,
and Steven L. Salzberg. "TopHat: discovering splice junctions with RNA-Seq." Bioinformatics 25.9 (2009): 1105-1111. • TopHat2: Kim, Daehwan, et al. "TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions." Genome Biol 14.4 (2013): R36. • Original Cufflinks paper: Trapnell, Cole, et al. "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation." Nature biotechnology 28.5 (2010): 511-515. • Cufflinks bias correction: Roberts, Adam, et al. "Improving RNA-Seq expression estimates by correcting for fragment bias." Genome Biol 12.3 (2011): R22. • Cufflinks RABT: Roberts, Adam, et al. "Identification of novel transcripts in annotated genomes using RNA-Seq." Bioinformatics 27.17 (2011): 2325-2329. • Cuffdiff2: Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2013): 46-53. • TopHat / Cufflinks Protocol: Trapnell, Cole, et al. "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks." Nature protocols 7.3 (2012): 562-578. • cummeRbund (unpublished): bioconductor.org/packages/cummeRbund 42

Annotated References • STAR: Dobin, Alexander, et al. "STAR: ultrafast
universal RNA-seq aligner." Bioinformatics 29.1 (2013): 15-21. • HISAT: Kim, Daehwan, Ben Langmead, and Steven L. Salzberg. "HISAT: a fast spliced aligner with low memory requirements." Nature methods 12.4 (2015): 357-360. • StringTie: Pertea, Mihaela, et al. "StringTie enables improved reconstruction of a transcriptome from RNA-seq reads." Nature biotechnology 33.3 (2015): 290-295. • Ballgown: Frazee, Alyssa C., et al. "Ballgown bridges the gap between transcriptome assembly and expression analysis." Nature biotechnology 33.3 (2015): 243-246. • eXpress: Roberts, Adam, and Lior Pachter. "Streaming fragment assignment for real-time analysis of sequencing experiments." Nature methods 10.1 (2013): 71-73. • Sailfish: Patro, Rob, Stephen M. Mount, and Carl Kingsford. "Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms." Nature biotechnology 32.5 (2014): 462-464. • Kallisto: Bray, Nicolas, et al. "Near-optimal RNA-Seq quantification." arXiv: 1505.02710 (2015). 43

Resources: where to get data • Gene Expression Omnibus (GEO)
• www.ncbi.nlm.nih.gov/geo • As of June 2015: - ~4,000 datasets - ~1,500,000 samples • Free data, no registration required. Go play! 44

Resources: where to get help • Seqanswers: - SEQanswers.com -
Twitter: @SEQquestions - Format: Forum - Li et al. SEQanswers : An open access community for collaboratively decoding genomes. Bioinformatics (2012). • BioStar: - biostars.org - Twitter: @BioStarQuestion - Format: Q&A - Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011) 7:e1002216. 45

46 Web: bioinformatics.virginia.edu E-Mail: bioinformatics@virginia.edu Blog: GettingGeneticsDone.com Twitter: @genetics_blog Facebook:
facebook.com/UVABioinformaticsCore Slides: stephenturner.us/slides

RNA-seq QC and Data Analysis using the Tuxedo S...

RNA-seq QC and Data Analysis using the Tuxedo Suite

More Decks by Stephen Turner

Other Decks in Education

Featured

Transcript