PSU RNA-seq/Structure Workshop

RNA-seq/Structure A PSU Galaxy Workshop May 15, 2015 @galaxyproject /
#usegalaxy http://www.galaxyproject.org

RNA-Seq bioinformatics
How RNA-seq data is generated: isolate transcript RNA, reverse transcription, fragment cDNA, size selection, Illumina sequencing of each end.
Based on the Illumina approach. *Strand-specific RNA-seq protocols exist for both Illumina and SOLiD.
Slide compliments of Andrew McPherson

Stranded libraries

Stranded libraries Levin et al. 2010

RNA-seq data analysis
• Can be analyzed in many different ways depending on the goals of the experiment, what other data is available, et cetera

Align-then-assemble or de novo?
RNA-Seq data enable de novo reconstruction of the transcriptome.
[Figure: RNA-Seq reads from more abundant and less abundant transcripts follow one of two routes. Either align reads to genome, then assemble transcripts from spliced alignments; or assemble transcripts de novo, then align transcripts to genome.]

• Align-then-assemble: potentially more sensitive, but requires a reference genome and is confounded by structural variation
• De novo: likely to capture only highly expressed transcripts, but does not require a reference genome and is robust to variation

Aligning RNA-seq reads to a genome
[Figure labels: reads in RNA-seq; transcript (Exon A, Exon B, Exon C, Exon D); chromosome (Exon A, Exon B, Exon C, Exon D), with question marks over the reads whose placement is unclear.]

Spliced mapping
Exon-first is more efficient, likely more sensitive for shorter reads, but can produce erroneous alignments for duplicates and pseudogenes.

BIOINFORMATICS ORIGINAL PAPER, Vol. 25 no. 9 2009, pages 1105–1111, doi:10.1093/bioinformatics/btp120
Sequence analysis
TopHat: discovering splice junctions with RNA-Seq
Cole Trapnell, Lior Pachter and Steven L. Salzberg
Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742 and Department of Mathematics, University of California, Berkeley, CA 94720, USA
Received on October 23, 2008; revised on February 24, 2009; accepted on February 26, 2009. Advance Access publication March 16, 2009. Associate Editor: Ivo Hofacker

ABSTRACT
Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or 'reads', can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.
Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20 000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm [...]

[...] measurements of expression at comparable cost (Marioni et al., 2008). The major drawback of RNA-Seq over conventional EST sequencing is that the sequences themselves are much shorter, typically 25–50 nt versus several hundred nucleotides with older technologies.
One of the critical steps in an RNA-Seq experiment is that of mapping the NGS 'reads' to the reference transcriptome. However, because the transcriptomes are incomplete even for well-studied species such as human and mouse, RNA-Seq analyses are forced to map to the reference genome as a proxy for the transcriptome. Mapping to the genome achieves two major objectives of RNA-Seq experiments:
(1) Identification of novel transcripts from the locations of regions covered in the mapping.
(2) Estimation of the abundance of the transcripts from their depth of coverage in the mapping.
Because RNA-Seq reads are short, the first task is challenging.

Bowtie mapping does not allow gaps, so reads spanning splice junctions won't map

Extend islands ~50bp and identify GT-AG pairing sites between neighboring islands (within 20kb)
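The island step on this slide can be illustrated with a minimal sketch. It is not TopHat's actual code: the function names, the toy genome and the exact scanning rules are assumptions made for illustration. Only the ~50bp flank and the 20kb limit come from the slide. It finds covered islands, widens each by a flank, and pairs GT donors near one island with AG acceptors near a later island.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based

def find_islands(coverage: List[int], min_depth: int = 1) -> List[Interval]:
    """Return contiguous runs of positions with depth >= min_depth."""
    islands, start = [], None
    for pos, depth in enumerate(coverage):
        if depth >= min_depth and start is None:
            start = pos
        elif depth < min_depth and start is not None:
            islands.append((start, pos))
            start = None
    if start is not None:
        islands.append((start, len(coverage)))
    return islands

def motif_positions(genome: str, motif: str, start: int, end: int) -> List[int]:
    """Positions p where genome[p:p+len(motif)] == motif, fully inside [start, end)."""
    lo, hi = max(0, start), min(end, len(genome))
    return [p for p in range(lo, hi - len(motif) + 1) if genome.startswith(motif, p)]

def candidate_introns(genome: str, islands: List[Interval],
                      flank: int = 50, max_span: int = 20000) -> List[Interval]:
    """Pair GT donors near an island with AG acceptors near a later island.

    Hypothetical simplification: every GT/AG pair within max_span is reported
    as a candidate intron [donor, acceptor + 2).
    """
    found = set()
    for i, (l_start, l_end) in enumerate(islands):
        for r_start, r_end in islands[i + 1:]:
            if r_start - l_end > max_span:
                break  # islands are sorted, so later ones are farther away
            donors = motif_positions(genome, "GT", l_start - flank, l_end + flank)
            acceptors = motif_positions(genome, "AG", r_start - flank, r_end + flank)
            for d in donors:
                for a in acceptors:
                    if d + 2 <= a and (a + 2) - d <= max_span:
                        found.add((d, a + 2))
    return sorted(found)

# Toy data (invented): two 10 bp exons separated by a 12 bp GT...AG intron.
exon1, intron, exon2 = "ACCTTCACGC", "GTCCCCCCCCAG", "CTCACCATCC"
genome = exon1 + intron + exon2
coverage = [5] * len(exon1) + [0] * len(intron) + [5] * len(exon2)

islands = find_islands(coverage)
introns = candidate_introns(genome, islands, flank=3)
```

On the toy data the two exons come out as islands and the single GT...AG span between them is the only candidate. Real data would yield many spurious pairs, which is why candidates still have to be confirmed by reads.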

TopHat
[Figure: genome browser view of chr9, positions 26559200 to 26559500, showing a brain RNA coverage track, STS markers, UCSC gene predictions, RefSeq genes (B3gat1) and mouse mRNAs from GenBank.]
Fig. 2. An intron entirely overlapped by the 5′-UTR of another transcript. Both isoforms are present in the brain tissue RNA sample. The top track is the normalized uniquely mappable read coverage reported by ERANGE for this region (Mortazavi et al., 2008). The lack of a large coverage gap causes TopHat to report a single island containing both exons. TopHat looks for introns within single islands in order to detect this junction.
Use a coverage statistic to identify pairing sites within single islands


Fig. 3. The seed and extend alignment used to match reads to possible splice sites. For each possible splice site, a seed is formed by combining a small [...]
Use putative splice junctions as seeds and search for matching unmapped reads
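The seed-and-extend idea on this slide can be sketched roughly as follows. This is an illustrative toy and not TopHat's implementation: the function names, the seed half-length `k`, the exact-match rule and the toy sequences are all assumptions. A seed joins the bases just upstream of the donor to the bases just downstream of the acceptor. An unmapped read that contains the seed is then extended in both directions against the flanking exonic sequence.

```python
def junction_seed(genome: str, donor: int, acceptor_end: int, k: int) -> str:
    """Join the k bases before the intron with the k bases after it.

    The intron is the half-open interval [donor, acceptor_end).
    """
    if donor < k or acceptor_end + k > len(genome):
        raise ValueError("not enough flanking sequence for a seed of this size")
    return genome[donor - k:donor] + genome[acceptor_end:acceptor_end + k]

def read_spans_junction(read: str, genome: str, donor: int,
                        acceptor_end: int, k: int = 4) -> bool:
    """True if the read matches the spliced sequence across the junction exactly."""
    seed = junction_seed(genome, donor, acceptor_end, k)
    hit = read.find(seed)
    while hit != -1:
        split = hit + k                      # first read base after the junction
        left, right = read[:split], read[split:]
        left_start = donor - len(left)
        # Extend: both halves of the read must match the flanking exons.
        if (left_start >= 0
                and genome[left_start:donor] == left
                and genome[acceptor_end:acceptor_end + len(right)] == right):
            return True
        hit = read.find(seed, hit + 1)
    return False

# Toy data (invented): two 10 bp exons around a 12 bp intron at [10, 22).
exon1, intron, exon2 = "ACCTTCACGC", "GTCCCCCCCCAG", "CTCACCATCC"
genome = exon1 + intron + exon2

spliced_read = exon1[4:] + exon2[:6]     # crosses the exon1/exon2 junction
unspliced_read = genome[8:20]            # lies partly inside the intron
mismatched_read = "GG" + spliced_read[2:]  # contains the seed but fails extension

spans = read_spans_junction(spliced_read, genome, donor=10, acceptor_end=22)
no_seed = read_spans_junction(unspliced_read, genome, donor=10, acceptor_end=22)
bad_extension = read_spans_junction(mismatched_read, genome, donor=10, acceptor_end=22)
```

Only the first read is accepted. The second never contains the seed, and the third contains it but does not match the upstream exon when extended.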

Kim et al. 2015

Transcriptome reconstruction

• We now have
  • Predicted exons expressed in the sample
  • Predicted splice junctions expressed in the sample
• Does this tell us what isoforms are present?

LETTERS
Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation
Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold & Lior Pachter

High-throughput mRNA sequencing (RNA-Seq) promises simultaneous transcript discovery and abundance estimation. However, this would require algorithms that are not restricted by prior gene annotations and that account for alternative transcription and splicing. Here we introduce such algorithms in an open-source software program called Cufflinks. To test Cufflinks, we sequenced and analyzed >430 million paired 75-bp RNA-Seq reads from a mouse myoblast cell line over a differentiation time series. We detected 13,692 known transcripts and 3,724 previously unannotated ones, 62% of which are supported by independent expression data or by homologous genes in other species. Over the time series, 330 genes showed complete switches in the dominant transcription [...]

[...] (75 bp in this work versus 25 bp in our previous work) and pairs of reads from both ends of each RNA fragment can reduce uncertainty in assigning reads to alternative splice variants. To produce useful transcript-level abundance estimates from paired-end RNA-Seq data, we developed a new algorithm that can identify complete novel transcripts and probabilistically assign reads to isoforms. For our initial demonstration of Cufflinks, we performed a time course of paired-end 75-bp RNA-Seq on a well-studied model of skeletal muscle development, the C2C12 mouse myoblast cell line (see Online Methods). Regulated RNA expression of key transcription factors drives myogenesis, and the execution of the differentiation process involves changes in expression of hundreds of genes. Previous studies have not measured global transcript isoform expres[...]

[Figure labels: map paired cDNA fragment sequences to genome (TopHat); spliced fragment alignments; Cufflinks: assembly (mutually incompatible fragments), abundance estimation.]

Pertea et al. 2015

[Figure labels: Cufflinks. Assembly: mutually incompatible fragments, overlap graph, minimum path cover, transcripts. Abundance estimation: transcript coverage and compatibility, fragment length distribution, maximum likelihood abundances (log-likelihood).]
Trapnell et al. 2010

Pertea et al. 2015

Estimating transcript levels

Pertea et al. 2015

• Expression values can be tabulated for individual gene loci, transcripts, exons and splice junctions
• Gene expression values are typically reported in RPKM/FPKM
  • Number of reads (fragments, for paired reads) per kb of exonic bases per million reads in the library
  • Compensates for variable library size and transcript length
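The RPKM/FPKM definition above reduces to a short calculation. The sketch below is only an illustration: the function name and the example numbers are invented, and the denominator is the library size as stated on the slide.

```python
def fpkm(fragments: float, exonic_length_bp: float, library_fragments: float) -> float:
    """Fragments per kilobase of exonic bases per million fragments in the library.

    For single-end data, pass read counts instead and the result is RPKM.
    """
    if exonic_length_bp <= 0 or library_fragments <= 0:
        raise ValueError("length and library size must be positive")
    per_kb = fragments / (exonic_length_bp / 1_000)
    return per_kb / (library_fragments / 1_000_000)

# Invented example: the same gene measured in two libraries of different depth.
shallow = fpkm(fragments=500, exonic_length_bp=2_000, library_fragments=10_000_000)
deep = fpkm(fragments=1_000, exonic_length_bp=2_000, library_fragments=20_000_000)

# A gene twice as long with twice the fragments has the same expression value.
longer = fpkm(fragments=1_000, exonic_length_bp=4_000, library_fragments=10_000_000)
```

All three values are equal. Doubling the sequencing depth or doubling the transcript length doubles the raw count but leaves the normalized value unchanged, which is the compensation the slide refers to.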

• Cufflinks: Maximum likelihood
• RSEM: Expectation maximization, works in the absence of a reference genome
• rQuant: Models biases in read distribution along the length of the transcript
• StringTie: Flow network optimization
• ...many others
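To make the expectation-maximization idea concrete, here is a deliberately simplified sketch. It does not reproduce RSEM or Cufflinks: the function name and the toy read counts are assumptions, and it ignores transcript length, fragment length distribution and sequencing error, all of which real tools model. Each read lists the isoforms it is compatible with. The E-step splits every read across those isoforms in proportion to the current abundances, and the M-step re-estimates abundances from the split counts.

```python
from typing import List, Sequence

def em_read_fractions(compatibility: Sequence[Sequence[int]],
                      n_isoforms: int, n_iter: int = 200) -> List[float]:
    """Estimate the fraction of reads coming from each isoform.

    compatibility[i] holds the isoform indices that read i could come from.
    """
    theta = [1.0 / n_isoforms] * n_isoforms       # start from uniform abundances
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        for isoforms in compatibility:
            total = sum(theta[t] for t in isoforms)
            if total == 0.0:
                continue                          # read explained by no isoform
            for t in isoforms:
                counts[t] += theta[t] / total     # E-step: fractional assignment
        n = sum(counts)
        theta = [c / n for c in counts]           # M-step: renormalize
    return theta

# Invented toy data: 30 reads unique to isoform 0, 10 unique to isoform 1,
# and 60 reads from shared exons that are compatible with both.
reads = [[0]] * 30 + [[1]] * 10 + [[0, 1]] * 60
theta = em_read_fractions(reads, n_isoforms=2)
```

The estimate converges to about 0.75 and 0.25. The ambiguous reads end up divided in the 3:1 ratio set by the uniquely assignable reads, which is the maximum likelihood solution for this toy model.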

Experimental design?
Naomi Altman, Francesca Chiaromonte
Love your statistician!

The essence of reference-free RNAseq

Trinity

Secondary structure prediction with NGS

Dimethyl sulphate Wells et al. 2000

structure-Seq Ding et al. 2014

Ding et al. 2014

PARS-seq Wan et al. 2013

Wan et al. 2013 FRAG-seq


More Decks by Anton Nekrutenko
