RNA-seq: From (good) experimental design to (accurate) gene expression abundance. Steve Munger Narayanan Raghupathy The Jackson Laboratory 21st Century Mouse GeneJcs 11 August 2016
Outline General overview of RNA-seq analysis. • IntroducJon to RNA-seq • The importance of a good experimental design • Quality Control • Read alignment • QuanJfying isoform and gene expression • NormalizaJon of expression esJmates
ApplicaJons of RNA-seq Technology Mom Dad TAGATGCTCA AGCTAATCCTAG TAGATGCTCA AGCTAATCCTAG A G A A A ATGCTCA TAGATGCTCA AGCTA ATGCTCA AGCTAATC ATGCTCA AGCTA AGCTA A G G G ATGCTCA AGCTATCC ATGCTCA AGCTATCCT ATGCTCA AGCTA A A A ATGCTCA TAGATGCTCA AGCTA GCTCA AGCTAAT TGCTCA AGCTAA AGCTA A Allele-Specific gene Expression (ASE) PreferenJal expression of one allele over the other.
Know your applicaJon – Design your experiment accordingly • How many reads? Read depth • Single-end or Paired-end sequencing? • Read length? • How many samples? N
RNA-seq Experimental design • DifferenJal expression of highly expressed and well annotated genes? – Smaller sample depth; more biological replicates – No need for paired end reads; shorter reads (50bp) may be sufficient. – Beder to have 20 million 50bp reads than 10 million 100bp reads. • Looking for novel genes/splicing/isoforms? – More read depth, paired-end reads from longer fragments. N
Two Illumina Lanes Bad Design RNA-Seq Experimental Design: RandomizaJon Experimental Group 2 Experimental Group 1 N Beder Design Mouse ENCODE reanalysis: hdp://f1000research.com/arJcles/4-121/v1
Index Sequence @HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1 X-Y Coordinate in flowcell Flowcell lane and Jle number Instrument: run/flowcell id The member of a pair Millions and millions of reads… @HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1 TCACCCGTAAGGTAACAAACCGAAAGTATCCAAAGCTAAAAGAAGTGGACGACGTGCTTGGTGGAGCAGCTGCATG + CCCFFFFFHHHHDHHJJJJJJJJIJJ?FGIIIJJJJJJIJJJJJJFHIJJJIJHHHFFFFD>AC?B??C?ACCAC>BB<<<>C@CCCACCCDCCIJ Phred Score: Q = -10 log10 P 10 indicates 1 in 10 chance of error 20 indicates 1 in 100, 30 indicates 1 in 1000, SN
• FASTX-Toolkit – hdp://hannonlab.cshl.edu/fastx_toolkit/ • FastQC – hdp://www.bioinformaJcs.babraham.ac.uk/projects/ fastqc/ NGS Data Preprocessing Quality Control: How to tell if your data is clean S RNA-seq Data: Zp://Zp.jax.org/dgau/MouseGen2016/ • B6-100K.fastq and Cast-100K.fastq
Quality Control: How to tell if your data is clean Good data § Consistent § High Quality Along the reads Bad data § High Variance § Quality Decrease with Length S RNA-seq Data: Zp://Zp.jax.org/dgau/MouseGen2016/ • B6-100K.fastq and Cast-100K.fastq
NGS Data Preprocessing K-mer content counts the enrichment of every 5-mer within the sequence library Bad: If k-mer enrichment >= 10 fold at any individual base posi'on
Tradeoffs to preprocessing data • Signal/noise -> Preprocessing can remove low- quality “noise”, but the cost is informaJon loss. – Some uniformly low-quality reads map uniquely to the genome. – Trimming reads to remove lower quality ends can adversely affect alignment, especially if aligning to the genome and the read spans a splice site. – Duplicated reads or just highly expressed genes? – Most aligners can take quality scores into consideraJon. – Currently, we do not recommend preprocessing reads aside from removing uniformly low quality samples. S
Some reads will align equally well to mulJple locaJons. “MulJreads” ACATGCTGCGGA ACATGCTGCGGA ACATGCTGCGGA ACATGCTGCGGA 100bp Read ✓ ✗ ✗ 1 read 3 valid alignments Only 1 alignment is correct S
Align to Genome or Transcriptome? Genome Transcriptome Advantages: Easy, Focused to the part of the genome that is known to be transcribed. Disadvantages: Reads that come from novel isoforms may not align at all or may be misadributed to a known isoform. Advantages: Can align novel isoforms. Disadvantages: Difficult, Spurious alignments, spliced alignment, gene families, pseudo genes N
Long Short 200 Medium 100 50 1 2 3 RelaJve abundance for these genes, f1 , f2 , f3 350 300 200 150 Unique MulJreads MulJreads: Reads Mapping to MulJple Genes/Transcripts N
Approach 1: Ignore MulJreads Long Short 200 Medium 100 50 1 2 3 RelaJve abundance for these genes, f1 , f2 , f3 350 300 200 150 Nagalakshmi et. al. Science. 2008 Marioni, et. al. Genome Research 2008 N
Approach 1: Ignore MulJreads Long Short 200 Medium 100 50 1 2 3 350 300 200 150 • Over-esJmates the abundance of genes with unique reads • Under-esJmates the abundance of genes with mulJreads • Not an opJon at all, if interested in isoform expression N
Approach 2: EM algorithm based allocaJon of MulJreads Long Short 200 Medium 100 50 1 2 3 RelaJve abundance for these genes, f1 , f2 , f3 350 300 200 150 RSEM, Cufflinks, isoEM, MMSEQ & eXpress N
Conclusions for quanJtaJon • EM approaches are currently the best opJon. • Isoform-level esJmates are sJlll challenging and will become easier as read length increases. • K-mer counJng methods (Salmon, Kallisto) are very fast – they can be run easily on your own PC – and are reasonably accurate. N
Expression Abundance: Counts, RPKM/FPKM, TPM Long Gene Short Gene Long Gene Short Gene Sample 1 Sample 2 FPKM Number of Fragments Matched to a Gene / Kilo base Total matched reads in Millions
Large pool, small sample problem • Typical RNA library esJmated to contain 2.4 x 1012 molecules. McIntyre et al 2011 • Typical sequencing run = 25 million reads/sample. • This means that only 0.00001 (1/1000th of a percent) of RNA molecules are sampled in a given run. • High abundance transcripts are sampled more frequently. Example: Albumin = 13% of all reads in liver RNA-seq samples. • Sampling errors affect low-abundance transcripts most. S
Alb Alb Low1 Low1 Sample 1 Sample 2 Highly expressed genes that are differenJally expressed can cause lowly expressed genes that are not actually differenJally expressed to appear that way. S
NormalizaJon of raw counts • Wrong way to normalize data – Normalizing to the total number of mapped reads (e.g. FPKM). Top 10 highly expressed genes soak up 20% of reads in the liver. FPKM is widely used, and problemaJc. • Beder ways to measure data – Normalize to upper quarJle (75th %) of non-zero counts, median of scaled counts (DESeq), or the weighted trimmed mean of the log expression raJos (EdgeR). S
DifferenJal Expression Analysis Over-esJmaJon of Under-esJmaJon of ˆ2 g ˆ2 g Too conservaJve Too sensiJve (Many false posiJves) tg = ˆ µg,1 ˆ µg,2 s ˆ2 g,1 N1 + ˆ2 g,2 N2 DESEQ2, edgeR, Voom, & CuffDiff T-test Normal Cancer Expression
MulJple TesJng CorrecJon and False Discovery rate XKCD Significant 2012 IgNobel prize in Neuroscience for “finding Brain acJvity signal in dead salmon using fMRI” N
Things to consider… • DifferenJal expression of highly expressed and well annotated genes? – Smaller sample depth; more biological replicates – No need for paired end reads; shorter reads (50bp) may be sufficient. – Beder to have 20 million 50bp reads than 10 million 100bp reads. • Looking for novel genes/splicing/isoforms? – More read depth, paired-end reads from longer fragments. N
Example 2 • How to quanJfy gene expression in a species that has not been sequenced or annotated? – MulJstep strategy using mulJple sequencing technologies.
Acknowledgements • KB Choi • Gary Churchill • Ron Korstanje/ Karen Svenson/ Elissa Chesler • Joel Graber • Doug Hinerfeld • Anuj Srivastava • Churchill Lab – Dan Gau • Al Simons and Mad Hibbs • Lisa Somes, Steve Ciciode, mouse room staff at JAX • Gene Expression Technologies group at JAX