Upgrade to Pro — share decks privately, control downloads, hide ads and more …

2015 BioSB RNA-seq Alignment Lecture

2015 BioSB RNA-seq Alignment Lecture

Wibowo Arindrarto

September 07, 2015
Tweet

More Decks by Wibowo Arindrarto

Other Decks in Science

Transcript

  1. Introduction Problem: How can we measure expression? 2 BioSB RNA-Seq

    Data Analysis Course 2015 Image by OpenStax College / CC BY
  2. RNA-seq: measure by sequencing transcripts Sequencing data also enables: ◦

    allele-specific expression measurement ◦ novel transcript discovery ◦ variant calling Introduction 3 BioSB RNA-Seq Data Analysis Course 2015 AAACGAGCTATGCGATGCTAGCGAGCTAGGCGAGCGAGCGATCGTAGCTATCGATGCTGATCGTACTAGCTGA AGCUAUGCGAUGCUAGGCGAGCGAGCGAUCGUAGCAUCGUACUAGC AGCUAUGCGAUGCUAAUCGUACUAGC Genome Transcripts Reads (single-end) Reads (paired-end) Reads (long)
  3. ◦ Genome-wide but biased, amongst others: ◦ Amplification / PCR

    bias ◦ Sampling bias ◦ Quantifications are relative ◦ Fixed pool of possible reads for a sequencing run ◦ Amount of input material may be different ◦ Read counts are raw data ◦ Different between sequencing runs ◦ Different between different transcripts Disclaimer 4 BioSB RNA-Seq Data Analysis Course 2015
  4. Revised problem statement: ◦ Where did the reads originate from?

    Introduction 5 BioSB RNA-Seq Data Analysis Course 2015 AGCUAUGCGAUGCUAGGCGAGCGAGCGAUCGUAGCAUCGUACUAGC AGCUAUGCGAUGCUAAUCGUACUAGC AAACGAGCTATGCGATGCTAGCGAGCTAGGCGAGCGAGCGATCGTAGCTATCGATGCTGATCGTACTAGCTGA
  5. Revised problem statement: ◦ Where did the reads originate from?

    Introduction 6 BioSB RNA-Seq Data Analysis Course 2015 AGCUAUGCGAUGCUAGGCGAGCGAGCGAUCGUAGCAUCGUACUAGC AGCUAUGCGAUGCUAAUCGUACUAGC AAACGAGCTATGCGATGCTAGCGAGCTAGGCGAGCGAGCGATCGTAGCTATCGATGCTGATCGTACTAGCTGA
  6. ◦ Genome ◦ Novel transcript discovery ◦ Need to split

    reads across exon-exon junction ◦ Transcriptome ◦ Easier to resolve transcript variants ◦ Restricted to known transcripts ◦ Different mapping challenge: exons used multiple times In practice not so clear cut: ◦ Genome information can be augmented with transcript / splice junction annotations during alignment and/or transcript assembly Choosing the Reference 7 BioSB RNA-Seq Data Analysis Course 2015
  7. Additional Complexities 9 BioSB RNA-Seq Data Analysis Course 2015 ◦

    Scale ◦ Reads are small but many (millions - hundreds of milions) ◦ One huge reference (MBs ~ GBs) ◦ Mismatches + indels ◦ Technical variation ◦ Population / organism-specific variation ◦ Exon-exon junctions (when aligning to genome)
  8. ◦ Data structure to speed up search ◦ Mostly done

    once per reference Using Indices 10 BioSB RNA-Seq Data Analysis Course 2015 Image by dcmaster / CC BY
  9. Several approaches in constructing indices: ◦ Hash table-based ◦ Suffix

    array ◦ Burrows-Wheeler transform The secret sauce is in the details: ◦ How to map over splice junctions? ◦ How to score for mismatches + indels? The (Split) Read Aligners 11 BioSB RNA-Seq Data Analysis Course 2015
  10. Idea: break reference into k-mers & store each word in

    a hash table Hash Table 12 BioSB RNA-Seq Data Analysis Course 2015
  11. Idea: break reference into k-mers & store each word in

    a hash table Hash Table 13 BioSB RNA-Seq Data Analysis Course 2015 AAACGAG AAAC AACG ACGA CGAG Genome 4-mer (every 1 bp) 8-mer (every 3 bp) AGCTTTAGCTTTAGCAC AGCTTTAG TTTAGCTT AGCTTTAG TTTAGCAC
  12. Hash Table 14 BioSB RNA-Seq Data Analysis Course 2015 AGCTTTAG

    TTTAGCTT TTTAGCAC 0,6 3 9 k-mer & offset hash table h1 h2 h3 3 0,6 9 hash func
  13. Hash Table 15 BioSB RNA-Seq Data Analysis Course 2015 AGCTTTAG

    TTTAGCTT TTTAGCAC 0,6 3 9 k-mer & offset hash table h1 h2 h3 3 0,6 9 hash func
  14. GSNAP ◦ Indexes 15-mers every 3 bp ◦ Breaks reads

    into k-mers and compare locations of each to build potential spliced reads ◦ Merges and filters k-mers back into alignments ◦ Novel splice junctions: probabilistic model ◦ Known splice junctions: user input Hash Table-based Aligners 16 BioSB RNA-Seq Data Analysis Course 2015 Publication: Wu (2010) - doi:10.1093/bioinformatics/btq057 Website: http://research-pub.gene.com/gmap/
  15. Idea: break reference into suffixes and store the sorted suffix

    indices in an array Suffix Array 17 BioSB RNA-Seq Data Analysis Course 2015
  16. Idea: break reference into suffixes and store the sorted suffix

    indices in an array Suffix Array 18 BioSB RNA-Seq Data Analysis Course 2015 0 ATGCACTGTC$ 1 TGCACTGTC$ 2 GCACTGTC$ 3 CACTGTC$ 4 ACTGTC$ 5 CTGTC$ 6 TGTC$ 7 GTC$ 8 TC$ 9 C$ 10 $ ATGCACTGTC 10 $ 4 ACTGTC$ 0 ATGCACTGTC$ 9 C$ 3 CACTGTC$ 5 CTGTC$ 2 GCACTGTC$ 7 GTC$ 8 TC$ 1 TGCACTGTC$ 6 TGTC$ genome generate all possible suffixes sort
  17. Suffix Array 19 BioSB RNA-Seq Data Analysis Course 2015 suffix

    array [ 10, 4, 0, 9, 3, 5, 2, 7, 8, 1, 6 ] TGT T 10 $ 4 ACTGTC$ 0 ATGCACTGTC$ 9 C$ 3 CACTGTC$ 5 CTGTC$ 2 GCACTGTC$ 7 GTC$ 8 TC$ 1 TGCACTGTC$ 6 TGTC$ TG TGT query index
  18. STAR ◦ Maps as long as possible, then maps remaining

    ◦ Clusters and scores all split segments ◦ Also accepts user-supplied splice locations ◦ Fast but prohibitive memory requirement … and GSNAP ◦ Optional additional index (since 2013) ◦ Compresses the suffix array Suffix Array-based Aligners 20 BioSB RNA-Seq Data Analysis Course 2015 Publication: Dobin (2012) - doi:10.1093/bioinformatics/bts635 Website: https://github.com/alexdobin/STAR
  19. Idea: permute reference into a compact, searchable, and compressible index

    Burrows-Wheeler Transform + FM Index 21 BioSB RNA-Seq Data Analysis Course 2015
  20. Idea: permute reference into a compact, searchable, and compressible index

    Burrows-Wheeler Transform + FM Index 22 BioSB RNA-Seq Data Analysis Course 2015 CELLULAR CELLULAR$ ELLULAR$C LLULAR$CE LULAR$CEL ULAR$CELL LAR$CELLU AR$CELLUL R$CELLULA $CELLULAR $CELLULAR AR$CELLUL CELLULAR$ ELLULAR$C LAR$CELLU LLULAR$CE LULAR$CEL R$CELLULA ULAR$CELL $R AL C$ EC LU LE LL RA UL rotate sort first-last genome Burrows-Wheeler matrix * compression * aux. arrays + FM Index
  21. Tophat2 ◦ Uses bowtie to map single-exon reads ◦ Tries

    to split unmapped reads across exons ◦ Maps reads from full exons ◦ Can optionally align to a transcriptome reference first BWT-based Aligners 23 BioSB RNA-Seq Data Analysis Course 2015 Publication: Kim (2013) - doi:10.1186/gb-2013-14-4-r36 Website: https://ccb.jhu.edu/software/tophat/index.shtml
  22. Tophat2 ◦ Uses bowtie to map single-exon reads ◦ Tries

    to split unmapped reads across exons ◦ Maps reads from full exons ◦ Can optionally align to a transcriptome reference first HISAT ◦ Main FM index + overlapping indices across reference ◦ Maps spliced reads like STAR, utilizes hierarchical index ◦ Fast + low space requirements BWT-based Aligners 24 BioSB RNA-Seq Data Analysis Course 2015 Publication: Kim (2013) - doi:10.1186/gb-2013-14-4-r36 Website: https://ccb.jhu.edu/software/tophat/index.shtml Publication: Kim (2015) - doi:10.1038/nmeth.3317 Website: https://ccb.jhu.edu/software/hisat/index.shtml
  23. ◦ Technical advantages ◦ Recent review: Engström (2013) - doi:10.1038/nmeth.2722

    Some Considerations 26 BioSB RNA-Seq Data Analysis Course 2015
  24. ◦ Technical advantages ◦ Recent review: Engström (2013) - doi:10.1038/nmeth.2722

    ◦ Developer support ◦ Not just one-off publication item ◦ There will be bugs that need fixing Some Considerations 27 BioSB RNA-Seq Data Analysis Course 2015
  25. ◦ Technical advantages ◦ Recent review: Engström (2013) - doi:10.1038/nmeth.2722

    ◦ Developer support ◦ Not just one-off publication item ◦ There will be bugs that need fixing ◦ Community + documentation ◦ Easier to help yourself ◦ Easier to find people who can also help ◦ Higher chance to spot & fix bugs Some Considerations 28 BioSB RNA-Seq Data Analysis Course 2015
  26. ◦ Technical advantages ◦ Recent review: Engström (2013) - doi:10.1038/nmeth.2722

    ◦ Developer support ◦ Not just one-off publication item ◦ There will be bugs that need fixing ◦ Community + documentation ◦ Easier to help yourself ◦ Easier to find people who can also help ◦ Higher chance to spot & fix bugs ◦ Organism annotation status ◦ Less-annotated species should perhaps rely on aligners which are less reliant user-supplied annotation Some Considerations 29 BioSB RNA-Seq Data Analysis Course 2015
  27. ◦ Technical advantages ◦ Recent review: Engström (2013) - doi:10.1038/nmeth.2722

    ◦ Developer support ◦ Not just one-off publication item ◦ There will be bugs that need fixing ◦ Community + documentation ◦ Easier to help yourself ◦ Easier to find people who can also help ◦ Higher chance to spot & fix bugs ◦ Organism annotation status ◦ Less-annotated species should perhaps rely on aligners which are less reliant user-supplied annotation ◦ In-house computing resources ◦ Memory, HDD space, cores, etc. Some Considerations 30 BioSB RNA-Seq Data Analysis Course 2015
  28. Problem: How to reach quantification from sequence data? Quantification 31

    BioSB RNA-Seq Data Analysis Course 2015 sequence data genome alignment transcriptome alignment transcript assembly quantification de-novo assembly
  29. ◦ How to resolve relative abundance from read data? via

    Transcriptome Alignment 32 BioSB RNA-Seq Data Analysis Course 2015 Probability of read from transcript Relative transcript abundance Transcript A Transcript B
  30. Expectation-Maximization (EM) 33 BioSB RNA-Seq Data Analysis Course 2015 Image

    adapted from Pachter (2011) - arXiv:1104.3889v2 transcript abundance (second step) = (P a,red + P c,red + P d,red + P e,red ) / (P red + P blue + P green ) = (0.33 + 0.5 + 1 + 0.5) / (2.33 + 1.33 + 1.33) = 0.47 ◦ Alignment to transcriptome: same underlying principle ◦ Iteration until values converge ◦ In example: all transcripts have the same length
  31. RSEM ◦ Aligns to transcriptome using bowtie or STAR ◦

    Can visualize counts on transcripts eXpress ◦ Streaming implementation of EM ◦ Can accept results from bowtie as it aligns EM-based Tools 34 BioSB RNA-Seq Data Analysis Course 2015 Publication: Li (2011) - doi:10.1186/1471-2105-12-323 Website: http://deweylab.biostat.wisc.edu/rsem/ Publication: Roberts (2013) - doi:10.1038/nmeth.2251 Website: http://bio.math.berkeley.edu/eXpress/overview.html
  32. via Genome Alignment 35 BioSB RNA-Seq Data Analysis Course 2015

    Cufflinks: build transcript model first Image adapted from Trapnell (2010) Graph model: ◦ Nodes: sequenced fragments ◦ Edge: compatible splice patterns ◦ Goal: find minimum number of transcripts to explain graph ◦ Can use known annotations to assist assembly Publication: Trapnell (2010) - doi:10.1038 /nbt.1621 Website: https://cole-trapnell-lab.github. io/cufflinks/
  33. via Genome Alignment 36 BioSB RNA-Seq Data Analysis Course 2015

    StringTie: build model while estimating abundance Image adapted from Pertea (2015) - doi:10.1038/nbt.3122 Publication: Pertea (2015) - doi:10.1038/nbt.3122 Website: http://ccb.jhu. edu/software/stringtie/
  34. via Genome Alignment - Counts Only 37 BioSB RNA-Seq Data

    Analysis Course 2015 No abundance estimation - straight to comparisons ◦ Count fragments mapping to features (genes, exons, etc.) ◦ Model distribution feature-wise & compare across samples Image adapted from Finotello (2014) - doi:10.1093/bfgp/elu035
  35. DEXSeq ◦ Negative binomial model ◦ Specialized for differential exon

    usage analysis edgeR ◦ Negative binomial model ◦ Estimates biological and technical variation limma ◦ Linear model via voom transformation ◦ Around since microarray, still relevant for RNA-seq Fragment Count-based Tools 38 BioSB RNA-Seq Data Analysis Course 2015 Publication: Robinson (2009) - doi:10.1093 /bioinformatics/btp616 Website: http://deweylab.biostat.wisc.edu/rsem/ Publication: Ritchie (2015) - doi:10.1093/nar/gkv007 Website: http://bioconductor.org/packages/release/bioc/html/limma.html Publication: Anders (2012) - doi:10.1101/gr. 133744.111 Website: http://bioconductor. org/packages/release/bioc/html/DEXSeq.html
  36. Without Alignment 39 BioSB RNA-Seq Data Analysis Course 2015 Idea:

    Estimate transcript coverage using k-mer counts in reads and compare to index ◦ From hours to minutes ◦ Limited to known transcripts ◦ Claim: surprisingly comparable
  37. Without Alignment 40 BioSB RNA-Seq Data Analysis Course 2015 Idea:

    Estimate transcript coverage using k-mer counts in reads and compare to index ◦ From hours to minutes ◦ Limited to known transcripts ◦ Claim: surprisingly comparable Kallisto ◦ de Bruijn graphs for index ◦ Claims higher accuracy than Sailfish Sailfish ◦ First paper with this approach ◦ k-mer index + aux. arrays Publication: Patro (2014) - doi:10.1038/nbt.2862 Website: https://www.cs.cmu. edu/~ckingsf/software/sailfish Publication: Bray (2015) - arXiv:1505.02710v2 Website: https://github.com/pachterlab/kallisto