Introduction to RNA-seq: From experimental design to gene quantitation

RNA-seq: From (good) experimental design to (accurate) gene expression abundance.
Steve Munger, Narayanan Raghupathy
The Jackson Laboratory
21st Century Mouse Genetics, 11 August 2016

Outline: General overview of RNA-seq analysis.
•  Introduction to RNA-seq
•  The importance of a good experimental design
•  Quality Control
•  Read alignment
•  Quantifying isoform and gene expression
•  Normalization of expression estimates

RNA-seq: Sequencing Transcriptomes
[Figure: short sequencing reads derived from mRNA]

Applications of RNA-seq Technology: Differential gene expression analysis
[Figure: heatmap of probe-set expression across 16 samples (GSM71019.CEL to GSM71035.CEL), Normal vs. Cancer]

Applications of RNA-seq Technology: Novel exon discovery
[Figure: annotated gene with evidence from RNA-seq]

Applications of RNA-seq Technology: Alternative splicing
[Figure: exon1, exon2, exon3; Isoform 1 and Isoform 2]

Applications of RNA-seq Technology: Allele-Specific gene Expression (ASE)
Preferential expression of one allele over the other.
[Figure: Mom and Dad alleles with reads carrying A or G at a variant site]

RNA-seq Work Flow
Study Design -> RNA isolation/Library Prep -> Sequencing Reads (SE or PE) -> Aligned Reads -> Quantified isoform and gene expression

RNA-Seq
Total RNA -> mRNA -> mRNA after fragmentation -> cDNA -> Adaptors ligated to cDNA -> Single/Paired End Sequencing

Know your application – Design your experiment accordingly
•  How many reads? Read depth
•  Single-end or Paired-end sequencing?
•  Read length?
•  How many samples?

RNA-seq Experimental design
•  Differential expression of highly expressed and well annotated genes?
–  Smaller sample depth; more biological replicates
–  No need for paired end reads; shorter reads (50bp) may be sufficient.
–  Better to have 20 million 50bp reads than 10 million 100bp reads.
•  Looking for novel genes/splicing/isoforms?
–  More read depth, paired-end reads from longer fragments.

Good Experimental Design: Multiplexing, Replication, Randomization
[Figure: Illumina flowcell]

RNA-Seq Experimental Design: Randomization
[Figure: samples from Experimental Group 1 and Experimental Group 2 assigned to two Illumina lanes, Bad Design vs. Better Design]
Mouse ENCODE reanalysis: http://f1000research.com/articles/4-121/v1
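
A minimal sketch of one way to randomize samples across lanes, assuming the "better" design means every lane receives samples from every experimental group. The function name, sample labels, and seed are hypothetical and not from the slides.

```python
import random

def assign_lanes(samples_by_group, n_lanes=2, seed=0):
    """Randomly spread each group's samples across all lanes.

    samples_by_group: dict mapping group name -> list of sample ids.
    Returns a dict mapping lane number -> list of (group, sample) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    lanes = {lane: [] for lane in range(1, n_lanes + 1)}
    for group, samples in samples_by_group.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # randomize the order within the group
        # Deal the shuffled samples round-robin so no lane holds a single group.
        for i, sample in enumerate(shuffled):
            lanes[i % n_lanes + 1].append((group, sample))
    return lanes

# Hypothetical samples: four per experimental group.
layout = assign_lanes({
    "group1": ["g1_a", "g1_b", "g1_c", "g1_d"],
    "group2": ["g2_a", "g2_b", "g2_c", "g2_d"],
})
```

With multiplexing, another option is to barcode every sample and run the whole pool on each lane, which removes lane as a confounder altogether.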



FASTQ read header: @HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1
•  Instrument: run/flowcell id (HISEQ2000_0074)
•  Flowcell lane and tile number (8:1101)
•  X-Y Coordinate in flowcell (7544:2225)
•  Index Sequence (TAGCTT)
•  The member of a pair (/1)
Millions and millions of reads…
  @HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1
  TCACCCGTAAGGTAACAAACCGAAAGTATCCAAAGCTAAAAGAAGTGGACGACGTGCTTGGTGGAGCAGCTGCATG
  +
  CCCFFFFFHHHHDHHJJJJJJJJIJJ?FGIIIJJJJJJIJJJJJJFHIJJJIJHHHFFFFD>AC?B??C?ACCAC>BB<<<>C@CCCACCCDCCIJ
Phred Score: Q = -10 log10 P
10 indicates 1 in 10 chance of error, 20 indicates 1 in 100, 30 indicates 1 in 1000.
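
The Phred relationship can be sketched in a few lines of Python. The ASCII offset of 33 is an assumption about how the quality line is encoded; the slide does not state the offset, and the function names are illustrative.

```python
def phred_to_error_prob(q):
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)

def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality line to Phred scores (offset 33 is assumed)."""
    return [ord(ch) - offset for ch in quality_string]

# First few quality characters of the read shown on the slide.
scores = decode_quality("CCCFFFFFHHHH")
error_probs = [phred_to_error_prob(q) for q in scores]
```

Under this assumption the character `J` decodes to Q41, so the many `J` characters in the example read mark high-confidence base calls.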

NGS Data Preprocessing
Quality Control: How to tell if your data is clean
•  FASTX-Toolkit
–  http://hannonlab.cshl.edu/fastx_toolkit/
•  FastQC
–  http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
RNA-seq Data: ftp://ftp.jax.org/dgatti/MouseGen2016/
•  B6-100K.fastq and Cast-100K.fastq

Quality Control: How to tell if your data is clean
Good data
§  Consistent
§  High Quality Along the reads
Bad data
§  High Variance
§  Quality Decrease with Length

NGS Data Preprocessing: Per sequence quality distribution
[Figure: X = mean sequence quality, Y = number of reads; curves for bad data, average data, and good data]

Quality Control: Sequence Content Across Bases

NGS Data Preprocessing: K-mer content
Counts the enrichment of every 5-mer within the sequence library.
Bad: if k-mer enrichment >= 10 fold at any individual base position.

K-mer content Most samples

NGS Data Preprocessing: Duplicated sequences
Good: non-unique sequences make up less than 20%
Bad: non-unique sequences make up more than 50%
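
As a rough illustration of the thresholds above, the sketch below computes the fraction of reads whose exact sequence occurs more than once. This is a simplified definition chosen for the example; QC tools such as FastQC may estimate duplication differently, and the function names are hypothetical.

```python
from collections import Counter

def non_unique_fraction(sequences):
    """Fraction of reads whose exact sequence appears more than once."""
    if not sequences:
        return 0.0
    counts = Counter(sequences)
    non_unique = sum(c for c in counts.values() if c > 1)
    return non_unique / len(sequences)

def duplication_flag(fraction):
    """Apply the slide's thresholds: <20% good, >50% bad."""
    if fraction < 0.2:
        return "good"
    if fraction > 0.5:
        return "bad"
    return "intermediate"
```

A high value does not by itself mean the library is poor: in RNA-seq, highly expressed genes naturally produce many identical reads.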

Tradeoffs to preprocessing data
•  Signal/noise -> Preprocessing can remove low-quality “noise”, but the cost is information loss.
–  Some uniformly low-quality reads map uniquely to the genome.
–  Trimming reads to remove lower quality ends can adversely affect alignment, especially if aligning to the genome and the read spans a splice site.
–  Duplicated reads or just highly expressed genes?
–  Most aligners can take quality scores into consideration.
–  Currently, we do not recommend preprocessing reads aside from removing uniformly low quality samples.


Alignment 101
[Figure: 100bp read ACATGCTGCGGA aligned against Chr 1, Chr 2, Chr 3]

The perfect read: 1 read = 1 unique alignment.
[Figure: 100bp read ACATGCTGCGGA with a single matching location (✓) among Chr 1, Chr 2, Chr 3]

Some reads will align equally well to multiple locations: “Multireads”
1 read, 3 valid alignments. Only 1 alignment is correct.
[Figure: 100bp read ACATGCTGCGGA matching three locations (✓ ✗ ✗)]

Aligning Millions of Short Sequence Reads
[Figure: reads aligned to Gene A and Gene B]
Aligners: Bowtie, GSNAP, STAR, BWA, BLAT, HISAT2, Bowtie2, Kallisto, Salmon

Align to Genome or Transcriptome?
Transcriptome
•  Advantages: Easy, Focused to the part of the genome that is known to be transcribed.
•  Disadvantages: Reads that come from novel isoforms may not align at all or may be misattributed to a known isoform.
Genome
•  Advantages: Can align novel isoforms.
•  Disadvantages: Difficult, Spurious alignments, spliced alignment, gene families, pseudo genes

Output of most aligners: BAM/SAM file of reads and genome positions

Visualization of alignment data (BAM/SAM)
Genome browsers – IGV and UCSC
Integrative Genomics Viewer (IGV): http://software.broadinstitute.org/software/igv/download
RNA-seq Data: ftp://ftp.jax.org/dgatti/MouseGen2016/
•  DO.chr1XY.sorted.bam and DO.chr1XY.sorted.bam.bai

IGV is your friend.
[Figure: IGV screenshot showing read color = strand, a SNP, and the coverage density plot]

Example genes to look at in IGV
1.  Tsn
2.  Gorab
3.  Fmo1, Fmo2, Fmo3, Fmo4, Fmo6
4.  Ids
5.  Zfx
6.  Ssty1, Ssty2

Aligned Reads to Gene Abundance
Total RNA -> 100bp Reads -> Aligned Reads -> Quantified isoform and gene expression

Aligned Reads to Gene Abundance: Challenges
Many approaches to quantify expression abundance
[Figure: Long and Short genes]

Aligned Reads to Gene Abundance: Challenges
[Figure: genes 1, 2, 3 labeled Long, Medium, Short with lengths 200, 100, 50; 1000 reads with counts 400, 400, 200]
Relative abundance for these genes: f1, f2, f3
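
A worked sketch of why gene length matters, assuming the read counts 400, 400, 200 on the slide belong to the long, medium, and short genes in that order (the extracted text does not confirm the pairing). Each count is divided by gene length before converting to relative abundance.

```python
def relative_abundance(read_counts, lengths):
    """Length-normalize read counts, then scale so the values sum to 1."""
    rates = [count / length for count, length in zip(read_counts, lengths)]
    total = sum(rates)
    return [rate / total for rate in rates]

lengths = [200, 100, 50]        # long, medium, short (from the slide)
read_counts = [400, 400, 200]   # assumed order: long, medium, short
f1, f2, f3 = relative_abundance(read_counts, lengths)
# f1, f2, f3 -> 0.2, 0.4, 0.4
```

Under this assumption the long gene collects as many reads as the medium gene even though it is half as abundant, which is why raw counts cannot be compared across genes directly.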

Multireads: Reads Mapping to Multiple Genes/Transcripts
[Figure: genes 1, 2, 3 labeled Long, Medium, Short with lengths 200, 100, 50; read counts 350, 300, 200, 150 shown as Unique and Multireads]
Relative abundance for these genes: f1, f2, f3

Approach 1: Ignore Multireads
[Figure: genes 1, 2, 3 labeled Long, Medium, Short with lengths 200, 100, 50; read counts 350, 300, 200, 150]
Nagalakshmi et al. Science 2008; Marioni et al. Genome Research 2008
•  Over-estimates the abundance of genes with unique reads
•  Under-estimates the abundance of genes with multireads
•  Not an option at all, if interested in isoform expression

Approach 2: EM algorithm based allocation of Multireads
[Figure: genes 1, 2, 3 labeled Long, Medium, Short with lengths 200, 100, 50; read counts 350, 300, 200, 150]
Relative abundance for these genes: f1, f2, f3
RSEM, Cufflinks, isoEM, MMSEQ & eXpress

Approach 2: EM algorithm based allocation of Multireads
[Figure: gene 1 and gene 2 with 9 reads and 1 read; multiread allocation weights 0.9 and 0.1]
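
The allocation idea can be sketched with a toy EM loop. This assumes gene 1 has 9 uniquely mapped reads, gene 2 has 1, both genes have equal length, and every multiread is compatible with both genes. The number of multireads (5) is made up for the example. Tools such as RSEM use much richer models; this only illustrates the iteration.

```python
def em_allocate(unique_counts, n_multireads, n_iter=100):
    """Toy EM: estimate gene proportions when multireads match every gene.

    unique_counts: dict gene -> number of uniquely mapped reads.
    n_multireads: number of reads compatible with all genes.
    """
    genes = list(unique_counts)
    theta = {g: 1.0 / len(genes) for g in genes}  # start from equal abundance
    total_reads = sum(unique_counts.values()) + n_multireads
    for _ in range(n_iter):
        # E-step: split the multireads in proportion to current abundances.
        expected = {g: unique_counts[g] + n_multireads * theta[g] for g in genes}
        # M-step: re-estimate abundances from the expected read counts.
        theta = {g: expected[g] / total_reads for g in genes}
    return theta

theta = em_allocate({"gene1": 9, "gene2": 1}, n_multireads=5)
# theta converges to roughly {"gene1": 0.9, "gene2": 0.1}
```

The multireads end up split 0.9 to 0.1, following the evidence from the unique reads rather than being discarded or divided equally.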

The rise of Pseudo-alignment, a.k.a. alignment-free methods
[Figure: 100bp read ACATGCTGCGGA, Chr 2, Transcriptome K-mers]
Sailfish, Salmon, and Kallisto

Kallisto: K-mer based pseudo-alignment
[Figure: 100bp read ACATGCTGCGGA; running time in minutes for expression quantitation of 30 Million Reads]
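
A toy sketch of the k-mer idea: index every k-mer of each transcript, then find the transcripts compatible with a read by intersecting the transcript sets of its k-mers. The transcript names and sequences are invented, and real tools use far more elaborate index structures, so this is not a description of how Kallisto is implemented.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def pseudoalign(read, index, k):
    """Return the transcripts compatible with all indexed k-mers of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # k-mer absent from the transcriptome: skip it
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()

# Hypothetical transcripts sharing a common prefix.
transcripts = {
    "t1": "ACATGCTGCGGATTT",
    "t2": "ACATGCTGCGGACCC",
    "t3": "GGGGGGGGGGGG",
}
index = build_kmer_index(transcripts, k=5)
shared = pseudoalign("ACATGCTGCGGA", index, k=5)   # compatible with t1 and t2
specific = pseudoalign("GCGGATTT", index, k=5)     # compatible with t1 only
```

No base-by-base alignment is computed: only the set of compatible transcripts is needed for quantitation, which is where the speed comes from.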

Conclusions for quantitation
•  EM approaches are currently the best option.
•  Isoform-level estimates are still challenging and will become easier as read length increases.
•  K-mer counting methods (Salmon, Kallisto) are very fast – they can be run easily on your own PC – and are reasonably accurate.

Expression Abundance: Counts, RPKM/FPKM, TPM
[Figure: Long Gene and Short Gene in Sample 1 and Sample 2]
FPKM = (Number of Fragments Matched to a Gene) / (Kilobases of gene length × Total matched reads in Millions)
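
The FPKM definition can be written out directly. The TPM function uses the commonly cited definition (length-normalized rates rescaled to sum to one million), which the slide names but does not spell out. The gene names and counts are hypothetical.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of gene per million matched fragments."""
    total_millions = sum(counts.values()) / 1e6
    return {g: counts[g] / (lengths_bp[g] / 1e3) / total_millions for g in counts}

def tpm(counts, lengths_bp):
    """Transcripts per million: per-base rates rescaled to sum to 1e6."""
    rates = {g: counts[g] / lengths_bp[g] for g in counts}
    scale = sum(rates.values())
    return {g: rates[g] / scale * 1e6 for g in counts}

# Hypothetical example: the long gene gets twice the fragments of the
# short gene only because it is twice as long.
counts = {"long_gene": 2000, "short_gene": 1000}
lengths_bp = {"long_gene": 2000, "short_gene": 1000}
fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

Both measures report the two genes as equally expressed once length is accounted for. TPM values always sum to one million within a sample, whereas FPKM totals vary from sample to sample.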

NORMALIZATION
A speed bump on the road from raw counts to differential expression.

Large pool, small sample problem
•  Typical RNA library estimated to contain 2.4 x 10^12 molecules. McIntyre et al 2011
•  Typical sequencing run = 25 million reads/sample.
•  This means that only 0.00001 (1/1000th of a percent) of RNA molecules are sampled in a given run.
•  High abundance transcripts are sampled more frequently. Example: Albumin = 13% of all reads in liver RNA-seq samples.
•  Sampling errors affect low-abundance transcripts most.
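
The arithmetic behind these bullets, as a short sketch. The Poisson model for read counts is an assumption introduced here to illustrate why low-abundance transcripts are noisier; the slides do not name a sampling model.

```python
import math

library_molecules = 2.4e12   # molecules in a typical library (from the slide)
reads_per_sample = 25e6      # reads in a typical run (from the slide)

# Fraction of molecules actually sequenced: about 1e-5, i.e. 0.001%.
sampled_fraction = reads_per_sample / library_molecules

# Reads expected from albumin if it takes up 13% of all reads.
albumin_reads = 0.13 * reads_per_sample

def poisson_cv(expected_count):
    """Relative sampling noise (sd / mean) of a Poisson-distributed count."""
    return 1.0 / math.sqrt(expected_count)
```

Under the Poisson assumption, a transcript expected to yield 10 reads has a relative sampling error of about 32%, while one expected to yield 10,000 reads has about 1%.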

A finite pool of reads.

Perfect world: All transcripts counted.
[Figure: Alb and Low1 in Sample 1 and Sample 2]

Real world: More reads taken up by highly expressed genes means fewer reads available for lowly expressed genes.
[Figure: Alb and Low1 in Sample 1]

Highly expressed genes that are differentially expressed can cause lowly expressed genes that are not actually differentially expressed to appear that way.
[Figure: Alb and Low1 in Sample 1 and Sample 2]

Normalization of raw counts
•  Wrong way to normalize data
–  Normalizing to the total number of mapped reads (e.g. FPKM). Top 10 highly expressed genes soak up 20% of reads in the liver. FPKM is widely used, and problematic.
•  Better ways to measure data
–  Normalize to upper quartile (75th percentile) of non-zero counts, median of scaled counts (DESeq), or the weighted trimmed mean of the log expression ratios (edgeR).
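
A small sketch comparing three ways to compute per-sample scaling factors on made-up counts where only Alb changes between samples. The median-of-ratios function follows the general idea behind DESeq's size factors but is not the library's implementation, and edgeR's trimmed mean is not shown.

```python
import math
import statistics

def total_count_factors(samples):
    """Scale by total mapped reads (the approach the slide warns against)."""
    totals = [sum(s.values()) for s in samples]
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]

def upper_quartile_factors(samples):
    """Scale by the 75th percentile of each sample's non-zero counts."""
    uq = []
    for s in samples:
        nonzero = sorted(c for c in s.values() if c > 0)
        uq.append(statistics.quantiles(nonzero, n=4, method="inclusive")[2])
    mean_uq = sum(uq) / len(uq)
    return [u / mean_uq for u in uq]

def median_of_ratios_factors(samples):
    """Median over genes of count / geometric mean across samples."""
    genes = [g for g in samples[0] if all(s[g] > 0 for s in samples)]
    log_geo_mean = {
        g: sum(math.log(s[g]) for s in samples) / len(samples) for g in genes
    }
    return [
        math.exp(statistics.median(math.log(s[g]) - log_geo_mean[g] for g in genes))
        for s in samples
    ]

# Hypothetical counts: only Alb differs between the two samples.
sample1 = {"Alb": 1000, "g1": 100, "g2": 100, "g3": 100, "g4": 100}
sample2 = {"Alb": 3000, "g1": 100, "g2": 100, "g3": 100, "g4": 100}
samples = [sample1, sample2]

total_factors = total_count_factors(samples)
uq_factors = upper_quartile_factors(samples)
mor_factors = median_of_ratios_factors(samples)
```

Total-count scaling gives sample 2 a larger factor, so the unchanged genes g1 to g4 would look down-regulated after dividing by it. The upper-quartile and median-of-ratios factors stay at 1 for both samples because they are not driven by the single dominant gene.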

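Differential Expression Analysis
[Figure: expression in Normal vs. Cancer]
T-test: $t_g = \frac{\hat{\mu}_{g,1} - \hat{\mu}_{g,2}}{\sqrt{\frac{\hat{\sigma}^2_{g,1}}{N_1} + \frac{\hat{\sigma}^2_{g,2}}{N_2}}}$
•  Over-estimation of $\hat{\sigma}^2_g$: Too conservative
•  Under-estimation of $\hat{\sigma}^2_g$: Too sensitive (Many false positives)
DESeq2, edgeR, Voom, & CuffDiff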
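
The per-gene t statistic can be computed as below, using the sample mean and sample variance of each group. The expression values are invented for illustration. With only a few replicates the variance estimate is unstable, which motivates the dedicated tools listed on the slide.

```python
import math
import statistics

def t_statistic(group1, group2):
    """Two-sample t statistic with separate variance estimates per group."""
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    standard_error = math.sqrt(var1 / len(group1) + var2 / len(group2))
    return (mean1 - mean2) / standard_error

# Hypothetical expression values for one gene in two groups.
normal = [10, 12, 14]
cancer = [20, 22, 24]
t_g = t_statistic(normal, cancer)
```

If the variance estimate in the denominator is too large, the statistic shrinks and real differences are missed; if it is too small, the statistic inflates and false positives follow.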

Multiple Testing Correction and False Discovery rate
[Figure: XKCD comic “Significant”]
2012 IgNobel prize in Neuroscience for “finding Brain activity signal in dead salmon using fMRI”
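
One widely used false discovery rate correction is the Benjamini-Hochberg procedure. The slide does not name a specific method, so this choice is an assumption, and the p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]
qvals = benjamini_hochberg(pvals)
```

The need for correction is easy to see: testing 20,000 genes at p < 0.05 with no true differences would still be expected to flag about 1,000 genes.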

Single Cell RNA-seq Technologies
•  Fluidigm C1 Chip: 96 cells / 800 Cells
•  DropSeq: 40,000 cells
•  10X Genomics: 48,000 cells

Summary
[Figure: RNA and sequencing reads; Experimental Design; RNA-seq analysis pipeline]
As sequences get longer, alignment and isoform quantitation become easier!

Resources
Aligner
–  Bowtie 2 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
–  GSNAP http://research-pub.gene.com/gmap/
Transcript Discovery/Annotation
–  STAR https://github.com/alexdobin/STAR/releases
–  Tophat http://tophat.cbcb.umd.edu/
Transcript Abundance
–  Kallisto http://pachterlab.github.io/kallisto/
–  RSEM http://deweylab.biostat.wisc.edu/rsem/
–  EMASE https://github.com/churchill-lab/emase
Differential Expression
–  DESeq http://www-huber.embl.de/users/anders/DESeq/
–  edgeR http://bioconductor.org/packages/release/bioc/html/edgeR.html
–  EBSeq https://www.biostat.wisc.edu/~kendzior/EBSEQ/

Example 1
Differential expression in my mutant mouse compared to wild type. What genes are up- or down-regulated?

Things to consider…
•  Differential expression of highly expressed and well annotated genes?
–  Smaller sample depth; more biological replicates
–  No need for paired end reads; shorter reads (50bp) may be sufficient.
–  Better to have 20 million 50bp reads than 10 million 100bp reads.
•  Looking for novel genes/splicing/isoforms?
–  More read depth, paired-end reads from longer fragments.

Example 2
•  How to quantify gene expression in a species that has not been sequenced or annotated?
–  Multistep strategy using multiple sequencing technologies.

Example 3
•  How to quantify single cell gene expression in a heterogeneous human tumor?

Any other applications you are interested in?
Steve Munger
Narayanan Raghupathy

Acknowledgements
•  KB Choi
•  Gary Churchill
•  Ron Korstanje/Karen Svenson/Elissa Chesler
•  Joel Graber
•  Doug Hinerfeld
•  Anuj Srivastava
•  Churchill Lab – Dan Gatti
•  Al Simons and Matt Hibbs
•  Lisa Somes, Steve Ciciotte, mouse room staff at JAX
•  Gene Expression Technologies group at JAX
