Slide 1

Slide 1 text

RNAseq I: Read mapping and transcript reconstruction

Slide 2

Slide 2 text

the RNA world Transcriptome

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Why genes in pieces? Licatalosi and Darnell 2010

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

RNA-Seq bioinformatics. How RNA-seq data is generated: Isolate transcript RNA → Reverse transcription → Fragment cDNA → Size selection → Illumina sequencing of each end. Based on the Illumina approach. *Strand-specific RNA-seq protocols exist for both Illumina and SOLiD. Slide compliments of Andrew McPherson

Slide 7

Slide 7 text

RNA isolation ‣ Treat your samples well ‣ Solubilization ‣ Recovery ‣ Normalization and Enrichment Guanidinium thiocyanate

Slide 8

Slide 8 text

Normalization & Enrichment Adiconis et al. 2013

Slide 9

Slide 9 text

Library preparation ‣ 1st strand synthesis ‣ 2nd strand synthesis ‣ Stranded libraries

Slide 10

Slide 10 text

oligo-dT vs. random priming

Slide 11

Slide 11 text

2nd strand synthesis ‣ RNA displacement ‣ NuGen ‣ SMART: oligo-dG/strand switching

Slide 12

Slide 12 text

Stranded libraries

Slide 13

Slide 13 text

GATC Biotech

Slide 14

Slide 14 text

DSN-normalization

Slide 15

Slide 15 text

Single Cell RNAseq Saliba et al. 2014

Slide 16

Slide 16 text

RNA-seq data analysis • Can be analyzed in many different ways depending on goals of the experiment, what other data is available, et cetera

Slide 17

Slide 17 text

Align-then-assemble or de novo? [Figure, captioned "RNA-Seq data enable de novo reconstruction of the transcriptome": RNA-Seq reads (more abundant to less abundant) are either aligned to the genome and then assembled into transcripts from the spliced alignments, or assembled into transcripts de novo and then aligned to the genome.]

Slide 18

Slide 18 text

Align-then-assemble or de novo? [Figure, captioned "RNA-Seq data enable de novo reconstruction of the transcriptome": RNA-Seq reads (more abundant to less abundant) are either aligned to the genome and then assembled into transcripts from the spliced alignments, or assembled into transcripts de novo and then aligned to the genome.]

Slide 19

Slide 19 text

• Align-then-assemble: potentially more sensitive, but requires a reference genome, confounded by structural variation • de novo: likely to only capture highly expressed transcripts, but does not require a reference genome, robust to variation

Slide 20

Slide 20 text

Aligning RNA-seq reads to a genome [Figure: reads in RNA-seq drawn from a transcript (Exon A, Exon B, Exon C, Exon D), with question marks over their placement on the chromosome.]

Slide 21

Slide 21 text

Spliced mapping [Figure: panels a, b and c; caption text not recoverable.] Exon-first is more efficient, likely more sensitive for shorter reads, but can produce erroneous alignments for duplicates and pseudogenes.

Slide 22

Slide 22 text

BIOINFORMATICS, ORIGINAL PAPER, Vol. 25 no. 9 2009, pages 1105–1111, doi:10.1093/bioinformatics/btp120. Sequence analysis.
TopHat: discovering splice junctions with RNA-Seq
Cole Trapnell (1,*), Lior Pachter (2) and Steven L. Salzberg (1)
(1) Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742 and (2) Department of Mathematics, University of California, Berkeley, CA 94720, USA
Received on October 23, 2008; revised on February 24, 2009; accepted on February 26, 2009. Advance Access publication March 16, 2009. Associate Editor: Ivo Hofacker
ABSTRACT
Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or ‘reads’, can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.
Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20 000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm [...]
[...] measurements of expression at comparable cost (Marioni et al., 2008). The major drawback of RNA-Seq over conventional EST sequencing is that the sequences themselves are much shorter, typically 25–50 nt versus several hundred nucleotides with older technologies. One of the critical steps in an RNA-Seq experiment is that of mapping the NGS ‘reads’ to the reference transcriptome. However, because the transcriptomes are incomplete even for well-studied species such as human and mouse, RNA-Seq analyses are forced to map to the reference genome as a proxy for the transcriptome. Mapping to the genome achieves two major objectives of RNA-Seq experiments:
(1) Identification of novel transcripts from the locations of regions covered in the mapping.
(2) Estimation of the abundance of the transcripts from their depth of coverage in the mapping.
Because RNA-Seq reads are short, the first task is challenging.
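The two objectives in the excerpt above (covered regions and depth of coverage) can be illustrated with a toy per-base coverage calculation. This is only a sketch: the function names, the `(start, length)` alignment tuples, and the `min_depth` threshold are assumptions made for illustration, not part of TopHat or any other tool.

```python
def coverage_depth(genome_length, alignments):
    """Count how many aligned reads cover each genome position.

    alignments: list of (start, length) tuples for ungapped read alignments
    (a hypothetical, simplified representation).
    """
    depth = [0] * genome_length
    for start, length in alignments:
        for pos in range(start, min(start + length, genome_length)):
            depth[pos] += 1
    return depth


def covered_regions(depth, min_depth=1):
    """Return half-open (start, end) intervals where depth >= min_depth."""
    regions = []
    start = None
    for pos, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = pos                    # a covered region begins
        elif d < min_depth and start is not None:
            regions.append((start, pos))   # the region ends just before pos
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


# Three toy reads on a 15-base genome.
toy_alignments = [(0, 4), (2, 4), (10, 3)]
toy_depth = coverage_depth(15, toy_alignments)
print(toy_depth)
print(covered_regions(toy_depth))
```

The covered regions stand in for candidate expressed regions (objective 1), and the depth values within each region are the raw signal for abundance (objective 2).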

Slide 23

Slide 23 text

Bowtie mapping does not allow gaps, so reads spanning splice junctions won’t map
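A minimal sketch of why an ungapped aligner misses junction reads. Exact substring matching is used here as a stand-in for ungapped alignment; it is not Bowtie's algorithm, and the toy sequences and function name are invented for illustration.

```python
# Toy genome: exon A, an intron (starts with GT, ends with AG), exon B.
EXON_A = "ATGGCCTTAGCA"
INTRON = "GTAAGTCCCTTTTCTCAG"
EXON_B = "GGATCCGATTAC"
GENOME = EXON_A + INTRON + EXON_B
TRANSCRIPT = EXON_A + EXON_B  # spliced mRNA: the intron has been removed


def ungapped_hits(read, genome):
    """Return every genome offset where the read matches exactly, without gaps."""
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        start = genome.find(read, start + 1)
    return hits


exonic_read = TRANSCRIPT[2:10]    # lies entirely inside exon A
junction_read = TRANSCRIPT[8:16]  # last 4 bases of exon A + first 4 of exon B

print(ungapped_hits(exonic_read, GENOME))    # the exonic read is placed
print(ungapped_hits(junction_read, GENOME))  # the junction read finds no hit
```

The junction read is a perfectly valid piece of the transcript, but on the genome its two halves are separated by the intron, so no gap-free placement exists.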

Slide 24

Slide 24 text

Extend islands ~50 bp and identify GT-AG pairing sites between neighboring islands (within 20 kb)
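The pairing step can be sketched as follows. This is a simplified illustration, not TopHat's implementation: the function names are hypothetical, `flank` and `max_intron` mirror the ~50 bp and 20 kb figures on the slide, and only the forward strand is handled (introns of reverse-strand genes would appear as CT...AC on the reference).

```python
def find_all(seq, motif, lo, hi):
    """Yield start offsets of motif occurrences lying fully inside seq[lo:hi]."""
    lo = max(lo, 0)
    hi = min(hi, len(seq))
    pos = seq.find(motif, lo, hi)
    while pos != -1:
        yield pos
        pos = seq.find(motif, pos + 1, hi)


def candidate_introns(genome, islands, flank=50, max_intron=20_000):
    """Pair GT donors near the end of one island with AG acceptors near the
    start of the next island.

    islands: sorted list of half-open (start, end) covered intervals.
    Returns half-open (intron_start, intron_end) candidates.
    """
    candidates = []
    for (s1, e1), (s2, e2) in zip(islands, islands[1:]):
        if s2 - e1 > max_intron:
            continue  # neighboring islands are too far apart to pair
        donors = list(find_all(genome, "GT", e1 - flank, e1 + flank))
        acceptors = list(find_all(genome, "AG", s2 - flank, s2 + flank))
        for d in donors:
            for a in acceptors:
                if a >= d + 2:  # the AG must lie downstream of the GT
                    candidates.append((d, a + 2))
    return candidates


# Toy genome: 12-base exon, 18-base intron (GT...AG), 12-base exon.
TOY_GENOME = "ATGGCCTTAGCA" + "GTAAGTCCCTTTTCTCAG" + "GGATCCGATTAC"
TOY_ISLANDS = [(0, 12), (30, 42)]

# A small flank is used because the toy genome is tiny.
print(candidate_introns(TOY_GENOME, TOY_ISLANDS, flank=3))
```

On real data each island pair yields many GT-AG candidates, which is why the candidates still have to be confirmed by reads that actually span them.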

Slide 25

Slide 25 text

TopHat [Figure: genome browser view of B3gat1 on chr9, with tracks for brain RNA coverage, RefSeq genes and mouse mRNAs from GenBank.] Fig. 2. An intron entirely overlapped by the 5′-UTR of another transcript. Both isoforms are present in the brain tissue RNA sample. The top track is the normalized uniquely mappable read coverage reported by ERANGE for this region (Mortazavi et al., 2008). The lack of a large coverage gap causes TopHat to report a single island containing both exons. TopHat looks for introns within single islands in order to detect this junction. Use a coverage statistic to identify pairing sites within single islands

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Fig. 3. The seed and extend alignment used to match reads to possible splice sites. For each possible splice site, a seed is formed by combining a small [...] Use putative splice junctions as seeds and search for matching unmapped reads
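A rough sketch of the seed-and-extend idea, under stated assumptions: the seed is built from `k` bases on each side of a candidate intron, the extension requires an exact match (a real aligner would tolerate mismatches), and the function names and toy reads are invented for illustration rather than taken from TopHat.

```python
def junction_seed(genome, intron_start, intron_end, k):
    """Join the k bases upstream of the intron to the k bases downstream of it."""
    return genome[intron_start - k:intron_start] + genome[intron_end:intron_end + k]


def reads_spanning(genome, intron, unmapped_reads, k=4):
    """Return the unmapped reads that align across the candidate junction.

    intron: half-open (intron_start, intron_end) interval on the genome.
    """
    intron_start, intron_end = intron
    seed = junction_seed(genome, intron_start, intron_end, k)
    spliced = genome[:intron_start] + genome[intron_end:]  # intron removed
    hits = []
    for read in unmapped_reads:
        pos = read.find(seed)
        if pos == -1:
            continue  # seed step: the read does not contain the junction seed
        # Extend step: the seed position fixes where the whole read must sit
        # on the spliced sequence; require the full read to match there.
        offset = (intron_start - k) - pos
        if offset >= 0 and spliced[offset:offset + len(read)] == read:
            hits.append(read)
    return hits


SE_GENOME = "ATGGCCTTAGCA" + "GTAAGTCCCTTTTCTCAG" + "GGATCCGATTAC"
SE_READS = [
    "TTAGCAGGATCC",  # spans the junction and matches on both sides
    "AGCAGGATAA",    # contains the seed, but the extension fails
    "CCCCCCCC",      # unrelated read
]
print(junction_seed(SE_GENOME, 12, 30, 4))
print(reads_spanning(SE_GENOME, (12, 30), SE_READS))
```

Only reads that both contain the seed and extend cleanly on either side are counted as support for the junction.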

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

Kim et al. 2015

Slide 31

Slide 31 text

Kim et al. 2015

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Transcriptome reconstruction

Slide 35

Slide 35 text

• We now have • Predicted exons expressed in the sample • Predicted splice junctions expressed in the sample • Does this tell us what isoforms are present?
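One way to see why exons and junctions alone do not settle the isoform question is to enumerate every transcript consistent with them. The sketch below treats exons as nodes and junctions as directed edges; the exon names and the function are hypothetical, chosen only to illustrate the ambiguity.

```python
def enumerate_isoforms(junctions, first_exon, last_exon):
    """List every exon path from first_exon to last_exon that uses only
    the observed junctions (given as (upstream_exon, downstream_exon) pairs)."""
    graph = {}
    for upstream, downstream in junctions:
        graph.setdefault(upstream, []).append(downstream)

    paths = []

    def walk(exon, path):
        if exon == last_exon:
            paths.append(path)
            return
        for nxt in graph.get(exon, []):
            walk(nxt, path + [nxt])

    walk(first_exon, [first_exon])
    return paths


# Two independent alternative exons: E2a/E2b and E4a/E4b.
toy_junctions = [
    ("E1", "E2a"), ("E1", "E2b"),
    ("E2a", "E3"), ("E2b", "E3"),
    ("E3", "E4a"), ("E3", "E4b"),
    ("E4a", "E5"), ("E4b", "E5"),
]
for isoform in enumerate_isoforms(toy_junctions, "E1", "E5"):
    print("-".join(isoform))
```

The same eight junctions are compatible with four candidate isoforms, yet a sample containing only two of them (for example E1-E2a-E3-E4a-E5 and E1-E2b-E3-E4b-E5) would produce exactly the same exons and junctions.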

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

LETTERS
Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation
Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold & Lior Pachter
High-throughput mRNA sequencing (RNA-Seq) promises simultaneous transcript discovery and abundance estimation [1–3]. However, this would require algorithms that are not restricted by prior gene annotations and that account for alternative transcription and splicing. Here we introduce such algorithms in an open-source software program called Cufflinks. To test Cufflinks, we sequenced and analyzed >430 million paired 75-bp RNA-Seq reads from a mouse myoblast cell line over a differentiation time series. We detected 13,692 known transcripts and 3,724 previously unannotated ones, 62% of which are supported by independent expression data or by homologous genes in other species. Over the time series, 330 genes showed complete switches in the dominant transcription [...]
[...] (75 bp in this work versus 25 bp in our previous work) and pairs of reads from both ends of each RNA fragment can reduce uncertainty in assigning reads to alternative splice variants [12]. To produce useful transcript-level abundance estimates from paired-end RNA-Seq data, we developed a new algorithm that can identify complete novel transcripts and probabilistically assign reads to isoforms. For our initial demonstration of Cufflinks, we performed a time course of paired-end 75-bp RNA-Seq on a well-studied model of skeletal muscle development, the C2C12 mouse myoblast cell line [13] (see Online Methods). Regulated RNA expression of key transcription factors drives myogenesis, and the execution of the differentiation process involves changes in expression of hundreds of genes [14,15]. Previous studies have not measured global transcript isoform expres[...]

Slide 38

Slide 38 text

[Figure: Cufflinks workflow. Map paired cDNA fragment sequences to the genome (TopHat) → spliced fragment alignments → Cufflinks assembly (mutually incompatible fragments) and abundance estimation.]

Slide 39

Slide 39 text

Pertea et al. 2015

Slide 40

Slide 40 text

[Figure: Cufflinks. Assembly: mutually incompatible fragments → overlap graph → minimum path cover → transcripts. Abundance estimation: transcript coverage and compatibility, fragment length distribution → log-likelihood → maximum likelihood abundances.] Trapnell et al. 2010
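The "minimum path cover" label in the figure refers to a classic graph problem: cover every node of a directed acyclic graph with as few paths as possible. Below is a generic textbook sketch that solves it through bipartite matching (number of paths = number of nodes minus matching size). It is not Cufflinks' code; the node numbering, the edge list, and the reading of an edge as "fragment u can be followed by compatible fragment v" are assumptions for illustration.

```python
def min_path_cover(n, edges):
    """Vertex-disjoint minimum path cover of a DAG with nodes 0..n-1.

    edges: list of (u, v) pairs meaning v may directly follow u on a path.
    Returns a list of paths, each a list of nodes.
    """
    succ = {u: [] for u in range(n)}
    for u, v in edges:
        succ[u].append(v)

    pred_in_path = {}  # v -> u: v directly follows u in the chosen cover

    def try_augment(u, seen):
        """Try to give u a successor, re-routing earlier choices if needed."""
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in pred_in_path or try_augment(pred_in_path[v], seen):
                pred_in_path[v] = u
                return True
        return False

    for u in range(n):
        try_augment(u, set())

    # Each node without a predecessor starts a path; follow successors.
    next_of = {u: v for v, u in pred_in_path.items()}
    paths = []
    for start in range(n):
        if start in pred_in_path:
            continue
        path = [start]
        while path[-1] in next_of:
            path.append(next_of[path[-1]])
        paths.append(path)
    return paths


# Fragments 1 and 2 are mutually incompatible (no edge joins them),
# so no single path can contain both.
toy_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
toy_paths = min_path_cover(5, toy_edges)
print(toy_paths)
```

Because fragments 1 and 2 cannot share a path, at least two paths are needed, which matches the intuition that mutually incompatible fragments must come from different transcripts. To let several paths share a fragment, the same routine can be run on the transitive closure of the graph, which gives a minimum chain cover.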

Slide 41

Slide 41 text

Pertea et al. 2015

Slide 42

Slide 42 text

Long read RNAseq Tilgner et al. 2014

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content