Slide 1

Slide 1 text

RNA-SEQ quality control and analysis of DIFFERENTIAL GENE EXPRESSION Stephen D. Turner, Ph.D.! University of Virginia Bioinformatics Core Director bioinformatics.virginia.edu Slides: stephenturner.us/slides using the TUXEDO software suite

Slide 2

Slide 2 text

Outline 1. RNA-seq overview & FAQs 2. Motivation for quantifying transcripts 3. In-depth look at alignment, assembly, quant, & DE using Tuxedo tools 4. Newer tools & alternative approaches 2

Slide 3

Slide 3 text

Hi. I’m Stephen Turner. • Asst. Prof. Public Health Sciences • Founder/Director UVA Bioinformatics Core - Established Oct 2011 - Mission: help collaborators publish and fund their work by providing service, training, & infrastructure. - bioinformatics.virginia.edu 3

Slide 4

Slide 4 text

Service • Gene expression: - RNA-seq! - Microarrays (Affy, Illumina) - Other (Nanostring, ...) • DNA Variation: - NGS (exome, custom amplicons, WGS, …) - Array-based (GWAS, ...) • DNA Methylation: - NGS (MeDIP, RRBS, …) - Array-based (Illumina Infinium chips). • DNA Binding / ChIP-Seq • “Pathway Analysis” • Metagenomics - Microbiome studies (16S, whole-genome) - Pathogen detection/ characterization, - Phylogenetic / compositional analysis - Functional Metagenomics • Acquisition / analysis of publicly available data (GEO, dbGaP), & deposition. • Grant / Manuscript support • Custom development 4

Slide 5

Slide 5 text

Service 5

Slide 6

Slide 6 text

Education • Bioinformatics: sub-discipline of molecular biology - Critical for molecular biologists to understand computational biology. - Same brain considers both biology and bioinformatics. • Lots of biologists approach genomics as if rules of experimental design don't matter. - N=1 - No controls or improper controls - False comfort provided by P-values, etc. 6

Slide 7

Slide 7 text

Education • Workshops / short courses: - Browsing Genes & Genomes with Ensembl - Using Galaxy for data intensive biology - Introduction to R for Life Sciences - RNA-seq data analysis bootcamp • Software Carpentry: - Software Carpentry Software Skills Bootcamp (Unix, Python, automation, version control) - Software Carpentry Instructor Training 7

Slide 8

Slide 8 text

Education 8 All courseware is open-source ! Source code: github.com/bioconnector/workshops/ ! Rendered course materials: bioconnector.org/workshops ! RNA-seq course: • bioconnector.org/workshops/ws-rnaseq-1day/ • PDF: Turner, Stephen (2014): RNA-seq workshop course materials 11/10/2014. figshare. http:// dx.doi.org/10.6084/m9.figshare.1247658

Slide 9

Slide 9 text

Infrastructure 9 rstudio.bioconnector.virginia.edu galaxy.bioconnector.virginia.edu

Slide 10

Slide 10 text

RNA-seq Overview 10 Prep samples Prep libraries Sequence Bioinformatics:! QA/QC, Analysis, … Step 1: Bioinformatics! Study design and planning. Consult your bioinformatician!

Slide 11

Slide 11 text

RNA-seq common question #1: ! Depth • Question: how much sequence do I need? • Answer: it’s complicated. • Oversimplified answer: 20-50 million PE reads / sample. • Depends on: - Size & complexity of transcriptome. - Application: differential gene expression, transcript discovery. - Tissue type, RNA quality, library preparation. - Sequencing type: length, paired-end vs single-end, etc. • Find a publication in your field with similar goals. • Good news: A fraction of a HiSeq lane good enough. 11

Slide 12

Slide 12 text

RNA-seq common question #2: sample size • Question: How many samples should I sequence? • Oversimplified Answer: Never, less than 3 biological replicates per condition. • Depends on: - Application - Goals (prioritization, biomarker discovery, etc.) - Effect size, desired power, statistical significance • Find a publication with similar goals 12

Slide 13

Slide 13 text

RNA-seq common question #3: Workflow 13 “How do I analyze the data?” Perception: ACACTCGCATCCGCACATCGCACTA GGTCAGCATACGCCGACTCCGACCG GCGCTATCGCCAGCGGAAATCGCAA Sequence Data

Slide 14

Slide 14 text

RNA-seq common question #3:! Workflow 14 Eyras et al. Methods to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993 Turner SD. RNA-seq Workflows and Tools. http://dx.doi.org/10.6084/m9.figshare.662782 Reality: a bit more complicated…

Slide 15

Slide 15 text

RNA-seq workflow #1:! Differential Gene Expression 15 Turner, Stephen D. (2015): RNA-seq workflows. http://dx.doi.org/10.6084/m9.figshare.1430386

Slide 16

Slide 16 text

RNA-seq workflow #2:! Differential Isoform Expression, Exon Usage 16 Turner, Stephen D. (2015): RNA-seq workflows. http://dx.doi.org/10.6084/m9.figshare.1430386

Slide 17

Slide 17 text

Beware of Pipelineitis • “Pipelines” can kill your creativity and force you to think too rigidly. • Don’t “pipeline” too early, if at all. • Does it even need to be pipeline-ified? • Who’s running it? - You, once: don’t pipeline-ify. Document, move along. - You, 2-5 times: documented script? - You, 10+ times: consider pipeline-ifying. - Others: create sharable pipeline (GreenLine, VM, Galaxy, containerize, makefiles, …) • See: Loman & Watson. "So you want to be a computational biologist?" Nat Biotechnol 31 (2013): 996-998. 17

Slide 18

Slide 18 text

Gene Counts vs. ! Transcript Abundance Quantification 18 Gene counting is easy* Isoform deconvolution is hard(er) *=relatively.

Slide 19

Slide 19 text

Gene Counts vs. ! Transcript Abundance Quantification 19 Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2013): 46-53.

Slide 20

Slide 20 text

Gene Counts vs. ! Transcript Abundance Quantification 20 Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2013): 46-53. How often does this happen in reality? A lot, maybe.

Slide 21

Slide 21 text

Gene Counts vs. ! Transcript Abundance Quantification • What are you actually measuring? - Genes are not expressed. Isoforms are. • Count-based models bias differential expression estimates. • But assigning fragments to isoforms is hard. • You don’t get this: ! ! ! ! • Average human protein-coding transcript: - ~ 2000-bp long, - ~10 exons • Illumina RNA-seq library: 100-bp fragments 21

Slide 22

Slide 22 text

Alignment • Ideally, start estimating abundance by measuring % reads originating from known transcripts. • Problem: transcriptome annotation incomplete. • Solution: map to genome instead. • Problem: - Genome has introns. - Transcriptome doesn’t (and neither do reads). - If avg txp is 2000 bp, and reads randomly distributed along a transcript, then ~1/3 100-bp reads will span a splice junction. • Solution: splice-aware alignment. 22

Slide 23

Slide 23 text

Alignment with TopHat2 23 Kim, Daehwan, et al. "TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions." Genome Biol 14.4 (2013): R36.

Slide 24

Slide 24 text

Assembly with Cufflinks • Could do completely de-novo. • But let’s use what we already know from reference. • Identify minimum set of transcripts that explain our reads. 24 Roberts, Adam, et al. "Identification of novel transcripts in annotated genomes using RNA-Seq." Bioinformatics 27.17 (2011): 2325-2329.

Slide 25

Slide 25 text

Abundance Estimation • Assume you have a set of reads mapping to a gene region (alignments from TopHat), • Assume you have a set of isoforms for that gene (assembly from Cufflinks). • Assuming you know already know the abundance of each isoform, you could probabilistically distribute each read across the isoforms that it could possibly have come from using the known abundance of those isoforms. 25 Mapped reads, unknown transcript Mapped reads, known transcript Reference transcriptome “Known” abundances

Slide 26

Slide 26 text

Abundance Estimation • But you don’t know the abundances a priori! • And you don’t know in advance the probability that a read originated from an isoform, so you can’t estimate it either! • But, you can figure it out with EM! 26 Mapped reads, unknown transcript Unknown abundances

Slide 27

Slide 27 text

Abundance Estimation 1. Guess probability of each read coming from each isoform based on the abundances of the isoforms that are compatible with the read (using spliced / paired-end alignments). 2. Recompute isoform abundance with the updated read-to-isoform assignment 3. Reassign read assignment based on the newly updated abundance estimates 4. Rinse, wash, repeat, until convergence. 27 Reference transcriptome Initialization Convergence Iteration

Slide 28

Slide 28 text

Differential Expression • Now have FPKM (read count normalized by length and read depth) for each sample, for each transcript. • Want to know DE genes, isoforms, TSS groups, … • Cuffdiff 2 model accounts for both sources of noise: - Biological variation between replicates of same condition (negative binomial distribution) - Uncertainty in assigning reads to isoforms (beta distribution) • Resulting beta-negative binomial mixture distribution used to model over dispersion in hypothesis testing framework. 28 Differential isoform expression ••• Differential gene expression • Differential splicing •• Differential TSS usage ••, •

Slide 29

Slide 29 text

Tuxedo Output • Input: FASTQ files (sequence & quality) • TopHat: - accepted_hits.bam: List of read alignments (the original read sequence, where it mapped, “how” it mapped). One for each sample. - junctions.bed: Tracks of junctions reported by TopHat. Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction. - insertions.bed and deletions.bed: Tracks of insertions and deletions reported by TopHat. 29

Slide 30

Slide 30 text

Tuxedo Output • Cufflinks: - transcripts.gtf: Cufflinks' assembled isoforms (start/end positions, gene/transcript name, feature name, FPKMs, confidence intervals for abundance estimation, etc.). One for each sample. - Tracking files (more details about abundance estimation, uncertainty, etc). • Cuffmerge: - merged.gtf: an assembly merged from all individual assemblies. • Cuffdiff - diff files (tab delimited file lists the results of differential expression testing between samples for spliced transcripts, primary transcripts, genes, and coding sequences) ‣ isoform_exp.diff! ‣ gene_exp.diff! ‣ tss_group_exp.diff ‣ cds_exp.diff - Tracking files (more details about abundance estimation, testing, uncertainty, etc). 30

Slide 31

Slide 31 text

Tuxedo Output • CummeRbund: R package for querying & visualizing cuffdiff output. 31

Slide 32

Slide 32 text

RNA-seq data analysis: Then and Now • TopHat first published over 6 years ago. • Cufflinks first published over 5 years ago. • Lots of newer tools exist: - Better & faster spliced alignment than TopHat - Better & faster assembly than Cufflinks - Better & faster quantification than Cufflinks - Better statistical modeling than Cuffdiff - Ultrafast “alignment-free” transcript quantification - Full-length isoform sequencing 32

Slide 33

Slide 33 text

Alignment: STAR 33 Dobin, Alexander, et al. "STAR: ultrafast universal RNA-seq aligner." Bioinformatics 29.1 (2013): 15-21. FAST! Accurate!

Slide 34

Slide 34 text

Alignment: HISAT 34 FAST! Accurate! Kim, Daehwan, Ben Langmead, and Steven L. Salzberg. "HISAT: a fast spliced aligner with low memory requirements." Nature methods 12.4 (2015): 357-360. HISAT to be underlying engine for TopHat3

Slide 35

Slide 35 text

Assembly: StringTie 35 Pertea, Mihaela, et al. "StringTie enables improved reconstruction of a transcriptome from RNA-seq reads." Nature biotechnology 33.3 (2015): 290-295. Faster! Less RAM! Better sensitivity & precision!

Slide 36

Slide 36 text

Differential Expression: Ballgown • Testing frameworks: - Cufflinks: ‣ Two-group comparison (e.g., WT vs Mut) - Ballgown: ‣ Two-group comparison ‣ eQTL analysis ‣ Time-course experiments ‣ Continuous covariates ‣ Confounded experimental designs ‣ Testing at the exon, gene or transcript level. • Performance: - Better sensitivity, better precision demonstrated in simulation. - 18 seconds on 8GB MacBook vs 69 hours on 148GB cluster! 36 Frazee, Alyssa C., et al. "Flexible isoform-level differential expression analysis with Ballgown." bioRxiv (2014): 003665.

Slide 37

Slide 37 text

Quantification: eXpress 37 Equal or better accuracy Fast! Low-resource! Time RAM

Slide 38

Slide 38 text

“Alignment-free” quantification: Sailfish 38 25x Faster! Accurate! Sailfish skips the alignment step altogether, uses kmer counting. Patro, Rob, Stephen M. Mount, and Carl Kingsford. "Sailfish enables alignment- free isoform quantification from RNA-seq reads using lightweight algorithms." Nature biotechnology 32.5 (2014): 462-464.

Slide 39

Slide 39 text

“Pseudo-alignment” quantification: kallisto 39 Blazing fast! Runtime for processing 20 datasets of 30M PE reads each using 20 CPUs. Accuracy: near-optimal Q: So wait a sec, you mean to tell me I can quantify my RNA-seq data to isoform level resolution in 5 minutes on my laptop with better accuracy than I can running other tools on a supercomputing cluster? A: Yeah, that’s right. Bray, Nicolas, et al. "Near-optimal RNA-Seq quantification." arXiv preprint arXiv:1505.02710 (2015).

Slide 40

Slide 40 text

“Alignment-free” quantification redux:! Salmon 40 Sailfish: 25x faster than anything that came before it. Accuracy just as good. ! http://www.cs.cmu.edu/~ckingsf/ software/sailfish/ Kallisto: 10x faster than Salmon, more accurate. ! http://pachterlab.github.io/kallisto/ Salmon: Sailfish’s unpublished successor. Borrows some techniques from kallisto. Better/ faster? Wait and see. ! http://combine-lab.github.io/salmon/

Slide 41

Slide 41 text

Full-length Isoform Sequencing • From earlier: - Average protein-coding transcript is ~2kb long, ~10 exons, alternatively spliced. - Illumina reads are only ~100bp long. • Full-length isoform sequencing, no assembly required: the future of RNA- seq? • pacificbiosciences.com/applications/isoseq • Sequence single molecules 1kb+. • See: Sharon, Donald, et al. "A single- molecule long-read survey of the human transcriptome." Nature biotechnology 31.11 (2013): 1009-1014. 41

Slide 42

Slide 42 text

Annotated References • Original TopHat paper: Trapnell, Cole, Lior Pachter, and Steven L. Salzberg. "TopHat: discovering splice junctions with RNA-Seq." Bioinformatics 25.9 (2009): 1105-1111. • TopHat2: Kim, Daehwan, et al. "TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions." Genome Biol 14.4 (2013): R36. • Original Cufflinks paper: Trapnell, Cole, et al. "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation." Nature biotechnology 28.5 (2010): 511-515. • Cufflinks bias correction: Roberts, Adam, et al. "Improving RNA-Seq expression estimates by correcting for fragment bias." Genome Biol 12.3 (2011): R22. • Cufflinks RABT: Roberts, Adam, et al. "Identification of novel transcripts in annotated genomes using RNA-Seq." Bioinformatics 27.17 (2011): 2325-2329. • Cuffdiff2: Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2013): 46-53. • TopHat / Cufflinks Protocol: Trapnell, Cole, et al. "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks." Nature protocols 7.3 (2012): 562-578. • cummeRbund (unpublished): bioconductor.org/packages/cummeRbund 42

Slide 43

Slide 43 text

Annotated References • STAR: Dobin, Alexander, et al. "STAR: ultrafast universal RNA-seq aligner." Bioinformatics 29.1 (2013): 15-21. • HISAT: Kim, Daehwan, Ben Langmead, and Steven L. Salzberg. "HISAT: a fast spliced aligner with low memory requirements." Nature methods 12.4 (2015): 357-360. • StringTie: Pertea, Mihaela, et al. "StringTie enables improved reconstruction of a transcriptome from RNA-seq reads." Nature biotechnology 33.3 (2015): 290-295. • Ballgown: Frazee, Alyssa C., et al. "Ballgown bridges the gap between transcriptome assembly and expression analysis." Nature biotechnology 33.3 (2015): 243-246. • eXpress: Roberts, Adam, and Lior Pachter. "Streaming fragment assignment for real-time analysis of sequencing experiments." Nature methods 10.1 (2013): 71-73. • Sailfish: Patro, Rob, Stephen M. Mount, and Carl Kingsford. "Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms." Nature biotechnology 32.5 (2014): 462-464. • Kallisto: Bray, Nicolas, et al. "Near-optimal RNA-Seq quantification." arXiv: 1505.02710 (2015). 43

Slide 44

Slide 44 text

Resources: where to get data • Gene Expression Omnibus (GEO) • www.ncbi.nlm.nih.gov/geo • As of June 2015: - ~4,000 datasets - ~1,500,000 samples • Free data, no registration required. Go play! 44

Slide 45

Slide 45 text

Resources: where to get help • Seqanswers: - SEQanswers.com - Twitter: @SEQquestions - Format: Forum - Li et al. SEQanswers : An open access community for collaboratively decoding genomes. Bioinformatics (2012). • BioStar: - biostars.org - Twitter: @BioStarQuestion - Format: Q&A - Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011) 7:e1002216. 45

Slide 46

Slide 46 text

46 Web: bioinformatics.virginia.edu E-Mail: [email protected] Blog: GettingGeneticsDone.com Twitter: @genetics_blog Facebook: facebook.com/UVABioinformaticsCore Slides: stephenturner.us/slides