2015 BioSB RNA-seq Alignment Lecture

RNA-Seq Alignment & Quantification Wibowo Arindrarto BioSB RNA-Seq Data Analysis
Course 7 September 2015

Introduction Problem: How can we measure expression? 2 BioSB RNA-Seq
Data Analysis Course 2015 Image by OpenStax College / CC BY

RNA-seq: measure by sequencing transcripts Sequencing data also enables: ◦
allele-specific expression measurement ◦ novel transcript discovery ◦ variant calling Introduction 3 BioSB RNA-Seq Data Analysis Course 2015 AAACGAGCTATGCGATGCTAGCGAGCTAGGCGAGCGAGCGATCGTAGCTATCGATGCTGATCGTACTAGCTGA AGCUAUGCGAUGCUAGGCGAGCGAGCGAUCGUAGCAUCGUACUAGC AGCUAUGCGAUGCUAAUCGUACUAGC Genome Transcripts Reads (single-end) Reads (paired-end) Reads (long)

◦ Genome-wide but biased, amongst others: ◦ Amplification / PCR
bias ◦ Sampling bias ◦ Quantifications are relative ◦ Fixed pool of possible reads for a sequencing run ◦ Amount of input material may be different ◦ Read counts are raw data ◦ Different between sequencing runs ◦ Different between different transcripts Disclaimer 4 BioSB RNA-Seq Data Analysis Course 2015

Revised problem statement: ◦ Where did the reads originate from?
Introduction 5 BioSB RNA-Seq Data Analysis Course 2015 AGCUAUGCGAUGCUAGGCGAGCGAGCGAUCGUAGCAUCGUACUAGC AGCUAUGCGAUGCUAAUCGUACUAGC AAACGAGCTATGCGATGCTAGCGAGCTAGGCGAGCGAGCGATCGTAGCTATCGATGCTGATCGTACTAGCTGA

Revised problem statement: ◦ Where did the reads originate from?
Introduction 6 BioSB RNA-Seq Data Analysis Course 2015 AGCUAUGCGAUGCUAGGCGAGCGAGCGAUCGUAGCAUCGUACUAGC AGCUAUGCGAUGCUAAUCGUACUAGC AAACGAGCTATGCGATGCTAGCGAGCTAGGCGAGCGAGCGATCGTAGCTATCGATGCTGATCGTACTAGCTGA

◦ Genome ◦ Novel transcript discovery ◦ Need to split
reads across exon-exon junction ◦ Transcriptome ◦ Easier to resolve transcript variants ◦ Restricted to known transcripts ◦ Different mapping challenge: exons used multiple times In practice not so clear cut: ◦ Genome information can be augmented with transcript / splice junction annotations during alignment and/or transcript assembly Choosing the Reference 7 BioSB RNA-Seq Data Analysis Course 2015

Alignment Problem == String Search 8 BioSB RNA-Seq Data Analysis
Course 2015

Additional Complexities 9 BioSB RNA-Seq Data Analysis Course 2015 ◦
Scale ◦ Reads are small but many (millions - hundreds of milions) ◦ One huge reference (MBs ~ GBs) ◦ Mismatches + indels ◦ Technical variation ◦ Population / organism-specific variation ◦ Exon-exon junctions (when aligning to genome)

◦ Data structure to speed up search ◦ Mostly done
once per reference Using Indices 10 BioSB RNA-Seq Data Analysis Course 2015 Image by dcmaster / CC BY

Several approaches in constructing indices: ◦ Hash table-based ◦ Suffix
array ◦ Burrows-Wheeler transform The secret sauce is in the details: ◦ How to map over splice junctions? ◦ How to score for mismatches + indels? The (Split) Read Aligners 11 BioSB RNA-Seq Data Analysis Course 2015

Idea: break reference into k-mers & store each word in
a hash table Hash Table 12 BioSB RNA-Seq Data Analysis Course 2015

Idea: break reference into k-mers & store each word in
a hash table Hash Table 13 BioSB RNA-Seq Data Analysis Course 2015 AAACGAG AAAC AACG ACGA CGAG Genome 4-mer (every 1 bp) 8-mer (every 3 bp) AGCTTTAGCTTTAGCAC AGCTTTAG TTTAGCTT AGCTTTAG TTTAGCAC

Hash Table 14 BioSB RNA-Seq Data Analysis Course 2015 AGCTTTAG
TTTAGCTT TTTAGCAC 0,6 3 9 k-mer & offset hash table h1 h2 h3 3 0,6 9 hash func

Hash Table 15 BioSB RNA-Seq Data Analysis Course 2015 AGCTTTAG
TTTAGCTT TTTAGCAC 0,6 3 9 k-mer & offset hash table h1 h2 h3 3 0,6 9 hash func

GSNAP ◦ Indexes 15-mers every 3 bp ◦ Breaks reads
into k-mers and compare locations of each to build potential spliced reads ◦ Merges and filters k-mers back into alignments ◦ Novel splice junctions: probabilistic model ◦ Known splice junctions: user input Hash Table-based Aligners 16 BioSB RNA-Seq Data Analysis Course 2015 Publication: Wu (2010) - doi:10.1093/bioinformatics/btq057 Website: http://research-pub.gene.com/gmap/

Idea: break reference into suffixes and store the sorted suffix
indices in an array Suffix Array 17 BioSB RNA-Seq Data Analysis Course 2015

Idea: break reference into suffixes and store the sorted suffix
indices in an array Suffix Array 18 BioSB RNA-Seq Data Analysis Course 2015 0 ATGCACTGTC$ 1 TGCACTGTC$ 2 GCACTGTC$ 3 CACTGTC$ 4 ACTGTC$ 5 CTGTC$ 6 TGTC$ 7 GTC$ 8 TC$ 9 C$ 10 $ ATGCACTGTC 10 $ 4 ACTGTC$ 0 ATGCACTGTC$ 9 C$ 3 CACTGTC$ 5 CTGTC$ 2 GCACTGTC$ 7 GTC$ 8 TC$ 1 TGCACTGTC$ 6 TGTC$ genome generate all possible suffixes sort

Suffix Array 19 BioSB RNA-Seq Data Analysis Course 2015 suffix
array [ 10, 4, 0, 9, 3, 5, 2, 7, 8, 1, 6 ] TGT T 10 $ 4 ACTGTC$ 0 ATGCACTGTC$ 9 C$ 3 CACTGTC$ 5 CTGTC$ 2 GCACTGTC$ 7 GTC$ 8 TC$ 1 TGCACTGTC$ 6 TGTC$ TG TGT query index

STAR ◦ Maps as long as possible, then maps remaining
◦ Clusters and scores all split segments ◦ Also accepts user-supplied splice locations ◦ Fast but prohibitive memory requirement … and GSNAP ◦ Optional additional index (since 2013) ◦ Compresses the suffix array Suffix Array-based Aligners 20 BioSB RNA-Seq Data Analysis Course 2015 Publication: Dobin (2012) - doi:10.1093/bioinformatics/bts635 Website: https://github.com/alexdobin/STAR

Idea: permute reference into a compact, searchable, and compressible index
Burrows-Wheeler Transform + FM Index 21 BioSB RNA-Seq Data Analysis Course 2015

Idea: permute reference into a compact, searchable, and compressible index
Burrows-Wheeler Transform + FM Index 22 BioSB RNA-Seq Data Analysis Course 2015 CELLULAR CELLULAR$ ELLULAR$C LLULAR$CE LULAR$CEL ULAR$CELL LAR$CELLU AR$CELLUL R$CELLULA $CELLULAR $CELLULAR AR$CELLUL CELLULAR$ ELLULAR$C LAR$CELLU LLULAR$CE LULAR$CEL R$CELLULA ULAR$CELL $R AL C$ EC LU LE LL RA UL rotate sort first-last genome Burrows-Wheeler matrix * compression * aux. arrays + FM Index

Tophat2 ◦ Uses bowtie to map single-exon reads ◦ Tries
to split unmapped reads across exons ◦ Maps reads from full exons ◦ Can optionally align to a transcriptome reference first BWT-based Aligners 23 BioSB RNA-Seq Data Analysis Course 2015 Publication: Kim (2013) - doi:10.1186/gb-2013-14-4-r36 Website: https://ccb.jhu.edu/software/tophat/index.shtml

Tophat2 ◦ Uses bowtie to map single-exon reads ◦ Tries
to split unmapped reads across exons ◦ Maps reads from full exons ◦ Can optionally align to a transcriptome reference first HISAT ◦ Main FM index + overlapping indices across reference ◦ Maps spliced reads like STAR, utilizes hierarchical index ◦ Fast + low space requirements BWT-based Aligners 24 BioSB RNA-Seq Data Analysis Course 2015 Publication: Kim (2013) - doi:10.1186/gb-2013-14-4-r36 Website: https://ccb.jhu.edu/software/tophat/index.shtml Publication: Kim (2015) - doi:10.1038/nmeth.3317 Website: https://ccb.jhu.edu/software/hisat/index.shtml

The Aligner Zoo 25 BioSB RNA-Seq Data Analysis Course 2015

◦ Technical advantages ◦ Recent review: Engström (2013) - doi:10.1038/nmeth.2722
Some Considerations 26 BioSB RNA-Seq Data Analysis Course 2015

◦ Developer support ◦ Not just one-off publication item ◦ There will be bugs that need fixing Some Considerations 27 BioSB RNA-Seq Data Analysis Course 2015

◦ Developer support ◦ Not just one-off publication item ◦ There will be bugs that need fixing ◦ Community + documentation ◦ Easier to help yourself ◦ Easier to find people who can also help ◦ Higher chance to spot & fix bugs Some Considerations 28 BioSB RNA-Seq Data Analysis Course 2015

◦ Developer support ◦ Not just one-off publication item ◦ There will be bugs that need fixing ◦ Community + documentation ◦ Easier to help yourself ◦ Easier to find people who can also help ◦ Higher chance to spot & fix bugs ◦ Organism annotation status ◦ Less-annotated species should perhaps rely on aligners which are less reliant user-supplied annotation Some Considerations 29 BioSB RNA-Seq Data Analysis Course 2015

◦ Developer support ◦ Not just one-off publication item ◦ There will be bugs that need fixing ◦ Community + documentation ◦ Easier to help yourself ◦ Easier to find people who can also help ◦ Higher chance to spot & fix bugs ◦ Organism annotation status ◦ Less-annotated species should perhaps rely on aligners which are less reliant user-supplied annotation ◦ In-house computing resources ◦ Memory, HDD space, cores, etc. Some Considerations 30 BioSB RNA-Seq Data Analysis Course 2015

Problem: How to reach quantification from sequence data? Quantification 31
BioSB RNA-Seq Data Analysis Course 2015 sequence data genome alignment transcriptome alignment transcript assembly quantification de-novo assembly

◦ How to resolve relative abundance from read data? via
Transcriptome Alignment 32 BioSB RNA-Seq Data Analysis Course 2015 Probability of read from transcript Relative transcript abundance Transcript A Transcript B

Expectation-Maximization (EM) 33 BioSB RNA-Seq Data Analysis Course 2015 Image
adapted from Pachter (2011) - arXiv:1104.3889v2 transcript abundance (second step) = (P a,red + P c,red + P d,red + P e,red ) / (P red + P blue + P green ) = (0.33 + 0.5 + 1 + 0.5) / (2.33 + 1.33 + 1.33) = 0.47 ◦ Alignment to transcriptome: same underlying principle ◦ Iteration until values converge ◦ In example: all transcripts have the same length

RSEM ◦ Aligns to transcriptome using bowtie or STAR ◦
Can visualize counts on transcripts eXpress ◦ Streaming implementation of EM ◦ Can accept results from bowtie as it aligns EM-based Tools 34 BioSB RNA-Seq Data Analysis Course 2015 Publication: Li (2011) - doi:10.1186/1471-2105-12-323 Website: http://deweylab.biostat.wisc.edu/rsem/ Publication: Roberts (2013) - doi:10.1038/nmeth.2251 Website: http://bio.math.berkeley.edu/eXpress/overview.html

via Genome Alignment 35 BioSB RNA-Seq Data Analysis Course 2015
Cufflinks: build transcript model first Image adapted from Trapnell (2010) Graph model: ◦ Nodes: sequenced fragments ◦ Edge: compatible splice patterns ◦ Goal: find minimum number of transcripts to explain graph ◦ Can use known annotations to assist assembly Publication: Trapnell (2010) - doi:10.1038 /nbt.1621 Website: https://cole-trapnell-lab.github. io/cufflinks/

via Genome Alignment 36 BioSB RNA-Seq Data Analysis Course 2015
StringTie: build model while estimating abundance Image adapted from Pertea (2015) - doi:10.1038/nbt.3122 Publication: Pertea (2015) - doi:10.1038/nbt.3122 Website: http://ccb.jhu. edu/software/stringtie/

via Genome Alignment - Counts Only 37 BioSB RNA-Seq Data
Analysis Course 2015 No abundance estimation - straight to comparisons ◦ Count fragments mapping to features (genes, exons, etc.) ◦ Model distribution feature-wise & compare across samples Image adapted from Finotello (2014) - doi:10.1093/bfgp/elu035

DEXSeq ◦ Negative binomial model ◦ Specialized for differential exon
usage analysis edgeR ◦ Negative binomial model ◦ Estimates biological and technical variation limma ◦ Linear model via voom transformation ◦ Around since microarray, still relevant for RNA-seq Fragment Count-based Tools 38 BioSB RNA-Seq Data Analysis Course 2015 Publication: Robinson (2009) - doi:10.1093 /bioinformatics/btp616 Website: http://deweylab.biostat.wisc.edu/rsem/ Publication: Ritchie (2015) - doi:10.1093/nar/gkv007 Website: http://bioconductor.org/packages/release/bioc/html/limma.html Publication: Anders (2012) - doi:10.1101/gr. 133744.111 Website: http://bioconductor. org/packages/release/bioc/html/DEXSeq.html

Without Alignment 39 BioSB RNA-Seq Data Analysis Course 2015 Idea:
Estimate transcript coverage using k-mer counts in reads and compare to index ◦ From hours to minutes ◦ Limited to known transcripts ◦ Claim: surprisingly comparable

Without Alignment 40 BioSB RNA-Seq Data Analysis Course 2015 Idea:
Estimate transcript coverage using k-mer counts in reads and compare to index ◦ From hours to minutes ◦ Limited to known transcripts ◦ Claim: surprisingly comparable Kallisto ◦ de Bruijn graphs for index ◦ Claims higher accuracy than Sailfish Sailfish ◦ First paper with this approach ◦ k-mer index + aux. arrays Publication: Patro (2014) - doi:10.1038/nbt.2862 Website: https://www.cs.cmu. edu/~ckingsf/software/sailfish Publication: Bray (2015) - arXiv:1505.02710v2 Website: https://github.com/pachterlab/kallisto

Acknowledgments Peter-Bram ‘t Hoen Jeroen Laros Sander van der Zeeuw

2015 BioSB RNA-seq Alignment Lecture

2015 BioSB RNA-seq Alignment Lecture

More Decks by Wibowo Arindrarto

Other Decks in Science

Featured

Transcript