Introduction to RNA-seq: From experimental design to gene quantitation

Steve Munger
August 11, 2016

Transcript

  1. RNA-seq: From (good) experimental design
    to (accurate) gene expression abundance.
    Steve Munger
    Narayanan Raghupathy
    The Jackson Laboratory
    21st Century Mouse Genetics
    11 August 2016

  2. Outline
    General overview of RNA-seq analysis.
    •  Introduction to RNA-seq
    •  The importance of a good experimental design
    •  Quality Control
    •  Read alignment
    •  Quantifying isoform and gene expression
    •  Normalization of expression estimates

  3. RNA-seq: Sequencing Transcriptomes
    ATGCTCA AGCTA
    TAGATGCTCA AGCTA
    ATGCTCA AGCTAATC
    ATGCTCA AGCTA
    AGTAGATGCTCA AGCTA
    ATGCTCA AGCTA
    ATGCTCA AGCTA
    ATGCTCA AGCTA
    TAGATGCTCA AGCTAATC
    AGCTAATCCTAG
    CTCA
    mRNA

  4. Applications of RNA-seq Technology
    Differential gene expression analysis
    [Heatmap: expression of 1000 microarray probe sets (rows, e.g. 213087_s_at) across 16 samples (columns, GSM71019.CEL to GSM71035.CEL), grouped as Normal vs. Cancer]

  5. Applications of RNA-seq Technology
    Novel exon discovery
    Annotated gene
    Evidence from
    RNA-seq

  6. Applications of RNA-seq Technology
    Novel exon discovery

  7. Applications of RNA-seq Technology
    exon1 exon2 exon3
    Alternative splicing
    Isoform 1
    Isoform 2

  8. Applications of RNA-seq Technology
    Allele-specific gene expression (ASE)
    [Figure: RNA-seq reads covering an A/G site in the sequence TAGATGCTCA AGCTAATCCTAG, drawn below the "Mom" and "Dad" copies of the gene and labelled by the allele (A or G) each read carries]
    Preferential expression of one allele over the other.

  9. RNA-seq Work Flow
    Aligned Reads
    Quantified isoform and gene expression
    Sequencing Reads (SE or PE)
    RNA isolation / Library Prep
    Study Design
    N

  10. mRNA
    mRNA after fragmentation
    cDNA
    Adaptors ligated
    to cDNA
    Single/ Paired End
    Sequencing
    RNA-Seq
    Total RNA
    N

  11. Know your application – Design your
    experiment accordingly
    •  How many reads? Read depth
    •  Single-end or Paired-end sequencing?
    •  Read length?
    •  How many samples?
    N

  12. RNA-seq Experimental design
    •  Differential expression of highly expressed and
    well annotated genes?
    –  Less read depth per sample; more biological replicates
    –  No need for paired-end reads; shorter reads (50 bp)
    may be sufficient.
    –  Better to have 20 million 50 bp reads than 10 million
    100 bp reads.
    •  Looking for novel genes/splicing/isoforms?
    –  More read depth, paired-end reads from longer
    fragments.
    N

  13. Good Experimental Design
    Multiplexing
    Replication
    Randomization
    N
    Illumina flowcell

  14. Two Illumina Lanes
    Bad Design
    RNA-Seq Experimental Design: Randomization
    Experimental Group 2
    Experimental Group 1
    N

  15. Two Illumina Lanes
    Bad Design
    RNA-Seq Experimental Design: Randomization
    Experimental Group 2
    Experimental Group 1
    N
    Better Design
    Mouse ENCODE reanalysis: http://f1000research.com/articles/4-121/v1
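
One way to avoid the "bad design", where each experimental group sits on its own lane, is to shuffle the samples within each group and deal them out across lanes. The sketch below is only an illustration under stated assumptions: the function name `assign_lanes`, the sample names, and the choice of two lanes with four samples per group are all hypothetical and not taken from the slides.

```python
import random
from collections import Counter


def assign_lanes(groups, n_lanes=2, seed=0):
    """Spread every experimental group as evenly as possible across lanes."""
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    lanes = [[] for _ in range(n_lanes)]
    for group, samples in groups.items():
        shuffled = samples[:]          # do not modify the caller's list
        rng.shuffle(shuffled)          # randomize order within the group
        for i, sample in enumerate(shuffled):
            lanes[i % n_lanes].append((group, sample))  # deal round-robin
    return lanes


# Hypothetical sample sheet: two groups, four biological replicates each.
groups = {
    "group1": ["g1_s1", "g1_s2", "g1_s3", "g1_s4"],
    "group2": ["g2_s1", "g2_s2", "g2_s3", "g2_s4"],
}
lanes = assign_lanes(groups)
for lane_number, lane in enumerate(lanes, start=1):
    print(lane_number, Counter(group for group, _ in lane))
```

With this layout a lane effect hits both groups equally instead of being confounded with the group difference.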

  16. RNA-seq Work Flow
    Aligned Reads
    Quantified isoform and gene expression
    Sequencing Reads (SE or PE)
    RNA isolation / Library Prep
    Study Design
    N

  17. mRNA
    mRNA after fragmentation
    cDNA
    Adaptors ligated
    to cDNA
    Single/ Paired End
    Sequencing
    RNA-Seq
    Total RNA
    N

  18. Millions and millions of reads…
    Fields of the read header:
    •  HISEQ2000_0074: Instrument: run/flowcell id
    •  8:1101: Flowcell lane and tile number
    •  7544:2225: X-Y coordinate in flowcell
    •  #TAGCTT: Index sequence
    •  /1: The member of a pair
    @HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1
    TCACCCGTAAGGTAACAAACCGAAAGTATCCAAAGCTAAAAGAAGTGGACGACGTGCTTGGTGGAGCAGCTGCATG
    +
    CCCFFFFFHHHHDHHJJJJJJJJIJJ?FGIIIJJJJJJIJJJJJJFHIJJJIJHHHFFFFD>AC?B??C?ACCAC>BB<<<>C@CCCACCCDCCIJ
    Phred score: $Q = -10 \log_{10} P$, where $P$ is the probability that the base call is wrong.
    Q = 10 indicates a 1 in 10 chance of error,
    Q = 20 indicates 1 in 100,
    Q = 30 indicates 1 in 1000.
    SN
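
The Phred relationship can be checked directly on a quality string. This is a minimal sketch that assumes the Phred+33 character encoding (the offset is an assumption, although the quality line shown on this slide contains characters that only make sense with that offset); the helper names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Convert FASTQ quality characters to Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Invert Q = -10 * log10(P) to get the per-base error probability P."""
    return 10 ** (-q / 10)


# First 20 quality characters of the read shown on the slide.
quality = "CCCFFFFFHHHHDHHJJJJJ"
scores = phred_scores(quality)
mean_quality = sum(scores) / len(scores)

print(scores)
print(mean_quality)
print(error_probability(30))
```

The per-read mean computed here is the quantity that a per-sequence quality distribution plot summarizes across all reads.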

  19. NGS Data Preprocessing
    Quality Control: How to tell if your data is clean
    •  FASTX-Toolkit
    –  http://hannonlab.cshl.edu/fastx_toolkit/
    •  FastQC
    –  http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
    S
    RNA-seq Data: ftp://ftp.jax.org/dgatti/MouseGen2016/
    •  B6-100K.fastq and Cast-100K.fastq

  20. Quality Control: How to tell if your data is clean
    Good data
    –  Consistent
    –  High quality along the reads
    Bad data
    –  High variance
    –  Quality decreases with length
    S
    RNA-seq Data: ftp://ftp.jax.org/dgatti/MouseGen2016/
    •  B6-100K.fastq and Cast-100K.fastq

  21. NGS Data Preprocessing
    Per sequence quality distribution
    Y= number of reads
    X= Mean sequence quality
    bad data
    Average data
    S

  22. NGS Data Preprocessing
    Per sequence quality distribution
    Y= number of reads
    X= Mean sequence quality
    bad data
    Average data
    Good data
    S

  23. Quality Control: Sequence Content Across Bases
    S

  24. NGS Data Preprocessing
    K-mer content
    Counts the enrichment of every 5-mer
    within the sequence library
    Bad: k-mer enrichment >= 10-fold at any
    individual base position

  25. K-mer content
    Most samples

  26. NGS Data Preprocessing
    Duplicated sequences
    Good: non-unique sequences make
    up less than 20%
    Bad: non-unique sequences make
    up more than 50%
    S

  27. Tradeoffs to preprocessing data
    •  Signal/noise: preprocessing can remove low-quality
    “noise”, but the cost is information loss.
    –  Some uniformly low-quality reads map uniquely to the
    genome.
    –  Trimming reads to remove lower-quality ends can
    adversely affect alignment, especially if aligning to the
    genome and the read spans a splice site.
    –  Duplicated reads or just highly expressed genes?
    –  Most aligners can take quality scores into
    consideration.
    –  Currently, we do not recommend preprocessing reads
    aside from removing uniformly low-quality samples.
    S

  28. RNA-seq Work Flow
    Aligned Reads
    Quantified isoform and gene expression
    Sequencing Reads (SE or PE)
    RNA isolation / Library Prep
    Study Design
    S

  29. Alignment 101
    ACATGCTGCGGA
    ACATGCTGCGGA
    100bp Read
    Chr 1
    Chr 2
    Chr 3
    S

  30. The perfect read: 1 read = 1 unique alignment.
    ACATGCTGCGGA
    ACATGCTGCGGA
    100bp Read

    Chr 1
    Chr 2
    Chr 3
    S

  31. Some reads will align equally well to multiple
    locations: “multireads”
    ACATGCTGCGGA
    ACATGCTGCGGA
    ACATGCTGCGGA
    ACATGCTGCGGA
    100bp Read



    1 read
    3 valid alignments
    Only 1 alignment is correct
    S

  32. Aligning Millions of Short Sequence Reads
    Gene A Gene B
    Aligners: Bowtie, GSNAP, STAR, BWA, BLAT,
    HISAT2, Bowtie2, Kallisto, Salmon
    N

  33. Align to Genome or Transcriptome?
    Genome
    Advantages: Can align novel isoforms.
    Disadvantages: Difficult; spurious alignments, spliced alignment, gene families, pseudogenes.
    Transcriptome
    N

  34. Align to Genome or Transcriptome?
    Genome
    Advantages: Can align novel isoforms.
    Disadvantages: Difficult; spurious alignments, spliced alignment, gene families, pseudogenes.
    Transcriptome
    Advantages: Easy; focused on the part of the genome that is known to be transcribed.
    Disadvantages: Reads that come from novel isoforms may not align at all or may be
    misattributed to a known isoform.
    N

  35. Output of most aligners: BAM/SAM file
    of reads and genome positions
    N
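
To make "reads and genome positions" concrete, the sketch below splits one SAM alignment line into its eleven mandatory tab-separated columns and decodes two flag bits. The example record is invented for illustration (it is not from the workshop data), and `parse_sam_line` is a hypothetical helper rather than part of any aligner or library.

```python
# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line) into a dict."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)   # flag bit 4: read unmapped
    record["is_reverse"] = bool(record["FLAG"] & 0x10)   # flag bit 16: reverse strand
    record["tags"] = parts[11:]                          # optional TAG:TYPE:VALUE fields
    return record


# Hypothetical record: a 12 bp read on the reverse strand of chr1.
example = "read1\t16\tchr1\t12345\t60\t12M\t*\t0\t0\tACATGCTGCGGA\tIIIIIIIIIIII\tNH:i:1"
rec = parse_sam_line(example)
print(rec["RNAME"], rec["POS"], rec["CIGAR"], rec["is_reverse"])
```

For real BAM files a dedicated library or samtools is the usual route; this only shows what each column holds.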

  36. Visualization of alignment data (BAM/SAM)
    Genome browsers – IGV and UCSC
    Integrative Genomics Viewer (IGV)
    http://software.broadinstitute.org/software/igv/download
    RNA-seq Data: ftp://ftp.jax.org/dgatti/MouseGen2016/
    •  DO.chr1XY.sorted.bam and DO.chr1XY.sorted.bam.bai

  37. IGV is your friend.
    Read color = strand
    SNP Coverage density plot

  38. Example genes to look at in IGV
    1.  Tsn
    2.  Gorab
    3.  Fmo1, Fmo2, Fmo3, Fmo4, Fmo6
    4.  Ids
    5.  Zfx
    6.  Ssty1, Ssty2

  39. Aligned Reads to Gene Abundance
    Aligned Reads
    Quantified isoform and gene expression
    100bp Reads
    Total RNA
    N

  40. Aligned Reads to Gene Abundance: Challenges
    Long Short
    Many approaches to quantify expression abundance
    N

  41. Aligned Reads to Gene Abundance: Challenges
    [Figure: 1000 reads drawn from three genes (Long, Medium, Short; numbered 1, 2, 3), with labels 200, 100, 50]
    Relative abundance for these genes: f1, f2, f3
    N

  42. Aligned Reads to Gene Abundance: Challenges
    [Figure: three genes (Long, Medium, Short; numbered 1, 2, 3), with labels 200, 100, 50 and values 400, 400, 200]
    Relative abundance for these genes: f1, f2, f3
    N

  43. Multireads: Reads Mapping to Multiple Genes/Transcripts
    [Figure: three genes (Long, Medium, Short; numbered 1, 2, 3), with labels 200, 100, 50; reads marked as Unique or Multireads, with values 350, 300, 200, 150]
    Relative abundance for these genes: f1, f2, f3
    N

  44. Approach 1: Ignore Multireads
    [Figure: three genes (Long, Medium, Short; numbered 1, 2, 3), with labels 200, 100, 50 and values 350, 300, 200, 150]
    Relative abundance for these genes: f1, f2, f3
    Nagalakshmi et al., Science 2008
    Marioni et al., Genome Research 2008
    N

  45. Approach 1: Ignore Multireads
    [Figure: three genes (Long, Medium, Short; numbered 1, 2, 3), with labels 200, 100, 50 and values 350, 300, 200, 150]
    •  Over-estimates the abundance of genes with unique reads
    •  Under-estimates the abundance of genes with multireads
    •  Not an option at all if you are interested in isoform expression
    N

  46. Approach 2: EM algorithm based allocation of multireads
    [Figure: three genes (Long, Medium, Short; numbered 1, 2, 3), with labels 200, 100, 50 and values 350, 300, 200, 150]
    Relative abundance for these genes: f1, f2, f3
    RSEM, Cufflinks, IsoEM, MMSEQ & eXpress
    N

  47. Long gene 2
    N
    Approach 2: EM algorithm based allocation of multireads
    gene 1
    9 reads 1 read

  48. Long gene 2
    N
    Approach 2: EM algorithm based allocation of multireads
    gene 1
    9 reads 1 read
    0.9 0.1
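
The 0.9 / 0.1 split can be reproduced with a toy EM loop. This is a sketch under strong assumptions that are not on the slide: both genes have the same length, every multiread is compatible with both genes, and there are 10 multireads (a made-up number). It illustrates the idea only and is not how RSEM or any other listed tool is implemented.

```python
def em_allocate(unique_counts, n_multireads, n_iter=100):
    """Estimate relative abundances when multireads are shared by all genes.

    unique_counts: reads that map uniquely to each gene.
    n_multireads:  reads that map equally well to every gene (toy assumption).
    """
    n_genes = len(unique_counts)
    total = sum(unique_counts) + n_multireads
    theta = [1.0 / n_genes] * n_genes  # start from equal abundances
    for _ in range(n_iter):
        # E-step: give each gene its expected share of the multireads,
        # in proportion to the current abundance estimates.
        expected = [u + n_multireads * t for u, t in zip(unique_counts, theta)]
        # M-step: re-estimate abundances from the expected read counts.
        theta = [e / total for e in expected]
    return theta


# 9 unique reads for gene 1 and 1 for gene 2, as on the slide;
# the 10 shared multireads are hypothetical.
theta = em_allocate([9, 1], n_multireads=10)
print(theta)
```

Starting from a 50/50 guess, each iteration moves the multiread split closer to the ratio implied by the unique reads.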

  49. The rise of pseudo-alignment, a.k.a. alignment-free
    methods
    ACATGCTGCGGA
    100bp Read
    Chr 2
    Transcriptome
    K-mers
    Sailfish, Salmon, and Kallisto

  50. Kallisto: k-mer based pseudo-alignment
    ACATGCTGCGGA
    100bp Read
    Running time in minutes
    Expression quantitation for
    30 Million Reads
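
The core idea behind k-mer based pseudo-alignment can be sketched in a few lines: index which transcripts contain each k-mer, then intersect those sets over the k-mers of a read to find the transcripts the read is compatible with. Everything below is a toy illustration; the transcript sequences, names, and k = 5 are invented, and real tools use far more efficient data structures than a plain dictionary.

```python
def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts compatible with every k-mer in the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = hits if compatible is None else compatible & hits
        if not compatible:
            break  # no transcript contains all k-mers seen so far
    return compatible or set()


# Hypothetical transcriptome: t1 and t2 share their first 12 bases.
transcripts = {
    "t1": "ACATGCTGCGGATTACA",
    "t2": "ACATGCTGCGGACCGTA",
    "t3": "TTTTGGGGCCCCAAAAT",
}
K = 5
index = build_index(transcripts, K)
shared = pseudoalign("ACATGCTGCGGA", index, K)   # read from the shared region
specific = pseudoalign("GCGGATTACA", index, K)   # read reaching into t1 only
print(sorted(shared), sorted(specific))
```

A read from the shared region stays ambiguous, which is exactly the multiread case that the EM step then has to resolve.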

  51. Conclusions for quantitation
    •  EM approaches are currently the best option.
    •  Isoform-level estimates are still challenging and will
    become easier as read length increases.
    •  K-mer counting methods (Salmon, Kallisto) are very
    fast – they can be run easily on your own PC – and
    are reasonably accurate.
    N

  52. Expression Abundance: Counts, RPKM/FPKM, TPM
    [Figure: reads on a Long Gene and a Short Gene in Sample 1 and Sample 2]
    $\mathrm{FPKM} = \dfrac{\text{Number of fragments matched to a gene} \,/\, \text{gene length in kilobases}}{\text{Total matched reads in millions}}$
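
The three units in the slide title differ only in how raw counts are scaled. The sketch below computes FPKM as defined above and TPM using its usual definition (length-normalized rates rescaled to sum to one million, which is not spelled out on the slide). The counts and gene lengths are hypothetical.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of gene per million matched fragments."""
    total_millions = sum(counts) / 1e6
    return [c / (length / 1e3) / total_millions
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: per-base rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


# Hypothetical example: three genes, 1000 matched fragments in total.
counts = [400, 400, 200]
lengths_bp = [4000, 2000, 1000]

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(fpkm_values)
print(tpm_values)
```

The long gene has as many fragments as the medium one but half the expression after length normalization, and TPM values always sum to one million while FPKM values do not.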

  53. NORMALIZATION
    A speed bump on the road from raw counts to differential expression.
    S

  54. Large pool, small sample problem
    •  A typical RNA library is estimated to contain $2.4 \times 10^{12}$
    molecules (McIntyre et al. 2011).
    •  Typical sequencing run = 25 million reads/sample.
    •  This means that only about 0.00001 (1/1000th of a percent)
    of RNA molecules are sampled in a given run.
    •  High-abundance transcripts are sampled more
    frequently.
    Example: Albumin = 13% of all reads in liver RNA-seq
    samples.
    •  Sampling errors affect low-abundance transcripts
    most.
    S
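
The arithmetic behind these bullets is short enough to check. The sketch below recomputes the sampled fraction from the two numbers on the slide and then, assuming read counts are approximately Poisson distributed (an assumption, not a claim from the slide), compares the relative sampling noise for albumin with that of a hypothetical transcript making up one in ten million reads.

```python
import math

library_molecules = 2.4e12   # from the slide
reads_per_sample = 25e6      # from the slide

fraction_sampled = reads_per_sample / library_molecules


def expected_reads_and_cv(transcript_fraction, n_reads=reads_per_sample):
    """Expected read count and coefficient of variation under a Poisson model."""
    expected = n_reads * transcript_fraction
    cv = 1 / math.sqrt(expected)  # Poisson: sd / mean = 1 / sqrt(mean)
    return expected, cv


albumin = expected_reads_and_cv(0.13)   # 13% of reads, as on the slide
rare = expected_reads_and_cv(1e-7)      # hypothetical low-abundance transcript

print(fraction_sampled)
print(albumin)
print(rare)
```

Under this model the rare transcript is expected to yield only a couple of reads, so its count fluctuates by a large fraction of its mean from run to run, while the albumin count barely moves.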

  55. A finite pool of reads.
    S

  56. Alb
    Low1
    Sample 1
    Alb
    Low1
    Sample 2
    Perfect world:
    All transcripts
    counted.
    S

  57. Alb
    Low1
    Sample 1
    Real world: More reads
    taken up by highly
    expressed genes means fewer
    reads available for lowly
    expressed genes.
    S

  58. Alb Alb
    Low1 Low1
    Sample 1 Sample 2
    Highly expressed genes that
    are differentially expressed can
    cause lowly expressed genes
    that are not actually
    differentially expressed to
    appear that way.
    S

  59. Normalization of raw counts
    •  Wrong way to normalize data
    –  Normalizing to the total number of mapped reads
    (e.g. FPKM). The top 10 highly expressed genes soak
    up 20% of reads in the liver. FPKM is widely used,
    and problematic.
    •  Better ways to normalize data
    –  Normalize to the upper quartile (75th percentile) of non-zero
    counts, the median of scaled counts (DESeq), or the
    weighted trimmed mean of the log expression
    ratios (edgeR).
    S
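
The difference between total-count scaling and a median-based scaling shows up clearly on a toy table. The sketch below is a simplified, pure-Python rendering of the median-of-ratios idea behind the "median of scaled counts" approach; it is not the DESeq package's code, and the count table (one strongly changed gene called Alb plus three unchanged genes) is invented for illustration.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """Size factor per sample: median ratio of each gene to its geometric mean."""
    n_samples = len(next(iter(counts.values())))
    log_geomeans = {}
    for gene, row in counts.items():
        if all(c > 0 for c in row):  # genes with a zero count are skipped
            log_geomeans[gene] = sum(math.log(c) for c in row) / n_samples
    factors = []
    for j in range(n_samples):
        ratios = [counts[g][j] / math.exp(lg) for g, lg in log_geomeans.items()]
        factors.append(median(ratios))
    return factors


def total_count_factors(counts):
    """Scale each sample by its total count relative to the mean total."""
    n_samples = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    mean_total = sum(totals) / n_samples
    return [t / mean_total for t in totals]


# Hypothetical counts for two samples: only Alb truly differs.
counts = {
    "Alb":  [1000, 4000],
    "Low1": [100, 100],
    "Low2": [100, 100],
    "Low3": [100, 100],
}
mor = median_of_ratios_size_factors(counts)
tot = total_count_factors(counts)
print(mor)
print(tot)
print([counts["Low1"][j] / tot[j] for j in range(2)])
```

Dividing by the total-count factors makes the unchanged gene Low1 look several-fold different between samples, whereas the median-based factors leave it alone because the median ignores the single extreme gene.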

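  60. Differential Expression Analysis
    T-test:
    $t_g = \dfrac{\hat{\mu}_{g,1} - \hat{\mu}_{g,2}}{\sqrt{\dfrac{\hat{\sigma}^2_{g,1}}{N_1} + \dfrac{\hat{\sigma}^2_{g,2}}{N_2}}}$
    Over-estimation of $\hat{\sigma}^2_g$: too conservative
    Under-estimation of $\hat{\sigma}^2_g$: too sensitive (many false positives)
    DESeq2, edgeR, voom, & Cuffdiff
    [Figure: expression of a gene in Normal vs. Cancer samples]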
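
For a single gene, the t statistic above can be computed from the two groups' sample means and variances. This is a minimal sketch with made-up expression values; the function name `welch_t` is hypothetical, and the tools named on the slide use more elaborate variance models than this plain per-gene estimate.

```python
import math
from statistics import mean, variance


def welch_t(group1, group2):
    """t statistic with separate per-group variance estimates."""
    n1, n2 = len(group1), len(group2)
    standard_error = math.sqrt(variance(group1) / n1 + variance(group2) / n2)
    return (mean(group1) - mean(group2)) / standard_error


# Hypothetical expression values for one gene, three samples per group.
normal = [1.0, 2.0, 3.0]
cancer = [2.0, 4.0, 6.0]
t_value = welch_t(normal, cancer)
print(t_value)
```

With only a few replicates the variance estimates in the denominator are themselves noisy, which is why an over- or under-estimated variance makes the test too conservative or too sensitive.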

  61. Multiple Testing Correction and False Discovery Rate
    XKCD “Significant”
    2012 Ig Nobel Prize in
    Neuroscience for “finding
    brain activity signal in dead salmon using fMRI”
    N
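
The slide names the false discovery rate without giving a procedure. One widely used procedure is Benjamini-Hochberg, sketched below; choosing it here is an assumption, since the slide does not say which correction the presenters had in mind, and the p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
pvalues = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(pvalues)
print(adjusted)

# Why correction matters: chance of at least one false positive
# when running 20 independent tests at the 0.05 level.
print(1 - 0.95 ** 20)
```

Genes are then called significant when their adjusted p-value falls below the chosen false discovery rate, rather than when the raw p-value falls below 0.05.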

  62. Single Cell RNA-seq Technologies
    Fluidigm C1 Chip
    96 cells / 800 Cells
    DropSeq: 40,000 cells
    10X Genomics:
    48,000 cells

  63. Summary
    ATGCTCA AGCTA
    TAGATGCTCA AGCTA
    ATGCTCA AGCTAATC
    ATGCTCA AGCTA
    AGTAGATGCTCA AGCTA
    ATGCTCA AGCTA
    ATGCTCA AGCTA
    ATGCTCA AGCTA
    TAGATGCTCA AGCTAATC
    AGCTAATCCTAG
    CTCA
    RNA

  64. Summary
    ATGCTCA AGCTA
    TAGATGCTCA AGCTA
    ATGCTCA AGCTAATC
    ATGCTCA AGCTA
    AGTAGATGCTCA AGCTA
    ATGCTCA AGCTA
    ATGCTCA AGCTA
    ATGCTCA AGCTA
    TAGATGCTCA AGCTAATC
    AGCTAATCCTAG
    CTCA
    RNA
    Experimental Design
    RNA-seq analysis pipeline
    As sequences get longer, alignment and isoform
    quantitation become easier!

  65. Resources
    Aligner
    –  Bowtie 2 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
    –  GSNAP http://research-pub.gene.com/gmap/
    Transcript Discovery/Annotation
    –  STAR https://github.com/alexdobin/STAR/releases
    –  TopHat http://tophat.cbcb.umd.edu/
    Transcript Abundance
    –  Kallisto http://pachterlab.github.io/kallisto/
    –  RSEM http://deweylab.biostat.wisc.edu/rsem/
    –  EMASE https://github.com/churchill-lab/emase
    Differential Expression
    –  DESeq http://www-huber.embl.de/users/anders/DESeq/
    –  edgeR http://bioconductor.org/packages/release/bioc/html/edgeR.html
    –  EBSeq https://www.biostat.wisc.edu/~kendzior/EBSEQ/

  66. Example 1
    Differential expression in my mutant mouse
    compared to wild type. What genes are up- or
    down-regulated?

  67. Things to consider…
    •  Differential expression of highly expressed and
    well annotated genes?
    –  Less read depth per sample; more biological replicates
    –  No need for paired-end reads; shorter reads (50 bp)
    may be sufficient.
    –  Better to have 20 million 50 bp reads than 10 million
    100 bp reads.
    •  Looking for novel genes/splicing/isoforms?
    –  More read depth, paired-end reads from longer
    fragments.
    N

  68. Example 2
    •  How to quantify gene expression in a species
    that has not been sequenced or annotated?
    –  Multistep strategy using multiple sequencing
    technologies.

  69. Example 3
    •  How to quantify single cell gene expression in
    a heterogeneous human tumor?

  70. Any other applications you are
    interested in?
    Steve Munger
    [email protected]
    Narayanan Raghupathy
    [email protected]

  71. Acknowledgements
    •  KB Choi
    •  Gary Churchill
    •  Ron Korstanje/ Karen Svenson/ Elissa Chesler
    •  Joel Graber
    •  Doug Hinerfeld
    •  Anuj Srivastava
    •  Churchill Lab – Dan Gatti
    •  Al Simons and Matt Hibbs
    •  Lisa Somes, Steve Ciciotte, mouse room staff at JAX
    •  Gene Expression Technologies group at JAX
