Slide 1

Slide 1 text

Alignment Methods
NGS Data Analysis Course
29 August 2016
Wibowo Arindrarto
Sequencing Analysis Support Core
Leiden University Medical Center

Slide 4

Slide 4 text

Introduction
Motivation

Sanger Sequencing:
• tens of sequences, 500 - 1300 bp
• known reference sequence region (usually a region in the genome)

Next Generation Sequencing:
• short read: tens of millions of sequences, 200 - 300 bp
• long read: tens of thousands of sequences, 2,000 - 8,000 bp
• location of origin of the reads is unknown

Where do these reads come from?

Slide 7

Slide 7 text

String Matching Problem
Insight

Nucleotide bases can be represented by characters: Adenine (A), Thymine (T), Guanine (G), Cytosine (C).

Our sequencing reads become short strings; our reference sequence becomes one long string.

Where can the reads be found within the reference? In other words, we want to solve a string matching problem.
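
The string-matching view can be sketched in a few lines of Python. This is a minimal illustration only: the function name `find_exact` and the toy sequences are assumptions made for this example, and it handles exact matches on one strand only.

```python
def find_exact(reference: str, read: str) -> list[int]:
    """Return every 0-based position where `read` occurs in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step by one base so that overlapping occurrences are reported too.
        start = reference.find(read, start + 1)
    return positions


reference = "ACGTACGTTAGC"  # toy reference, made up for illustration
print(find_exact(reference, "ACGT"))  # [0, 4]
print(find_exact(reference, "TAGC"))  # [8]
print(find_exact(reference, "GGGG"))  # []
```

Note that the read "ACGT" already occurs twice in this tiny reference, which hints at the multi-mapping problem discussed later.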

Slide 8

Slide 8 text

String Matching Problem
Insight

Not so far from your daily lives ...

Slide 11

Slide 11 text

Complications

Sequencing-related
• Errors occur during sequencing: misread bases, missing regions

Reference-related
• Our best reference genome still has unknown regions.
• Genetic variations exist between organisms.

We need to find approximate instead of exact matches.
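
One simple form of approximate matching tolerates a limited number of mismatched bases. The sketch below is an assumption-laden illustration: `find_approximate` and `max_mismatches` are names invented here, and the approach only covers substitutions (misread bases), not insertions or deletions.

```python
def find_approximate(reference: str, read: str, max_mismatches: int) -> list[tuple[int, int]]:
    """Return (position, mismatch_count) for every window within the mismatch limit."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        # Count positions where the read disagrees with the reference window.
        mismatches = sum(1 for ref_base, read_base in zip(window, read) if ref_base != read_base)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


reference = "ACGTACGTTAGC"  # toy reference, made up for illustration
print(find_approximate(reference, "ACCT", 0))  # []
print(find_approximate(reference, "ACCT", 1))  # [(0, 1), (4, 1)]
```

Handling gaps as well requires an alignment algorithm such as Smith-Waterman.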

Slide 15

Slide 15 text

Smith-Waterman Algorithm

Basic ideas
• Given two strings and a scoring scheme, find the most similar region.
• Reward matches; penalize mismatches and gaps.
• Strategy: compare optimal alignments of all substrings and pick the highest-scoring one.

Characteristics
• Guaranteed to find the optimal alignment under the given scoring scheme.
• Finds local regions of similarity.
• A variation of Needleman-Wunsch, which finds global similarities.
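
The ideas above can be sketched as a small dynamic-programming routine. This is a teaching sketch, not an optimized implementation: the scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for illustration, and ties in the traceback are resolved in favour of the diagonal move.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill step: each cell holds the best score of a local alignment ending there.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(0, diag, up, left)  # 0 lets an alignment restart
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback: walk back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("MAHGPSTYRWSKR", "MGPSTYVKR"))  # (10, 'GPSTY', 'GPSTY')
```

The matrix has (len(a) + 1) x (len(b) + 1) cells, which is what makes the plain algorithm expensive for long references.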

Slide 16

Slide 16 text

Smith-Waterman Algorithm

Adapted from http://www.langmead-lab.org/teaching-materials/

Slide 17

Slide 17 text

Local vs Global Alignment

    Sequence 1: MAHGPSTYRWSKR
    Sequence 2: MGPSTYVKR

    ----------------
    Global Alignment
    ----------------
    5' MAHGPSTYRWSKR 3'
       |  |||||   ||
    5' M--GPSTY--VKR 3'

    ---------------
    Local Alignment
    ---------------
    5' GPSTY 3'
       |||||
    5' GPSTY 3'

Listing 1: Alignments
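
For contrast with local alignment, a global (Needleman-Wunsch style) score can be sketched as follows. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions for illustration, and only the score is computed, not the alignment itself. The key differences from the local case are that the first row and column carry gap penalties and that scores are never reset to zero, so both sequences must be aligned end to end.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the optimal global alignment score of a and b."""
    # First row: aligning an empty prefix of a against j bases of b costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: i bases of a against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]  # bottom-right cell: both sequences fully consumed


# Sequences from the listing: 8 matches, 1 mismatch and 4 gap positions.
print(needleman_wunsch_score("MAHGPSTYRWSKR", "MGPSTYVKR"))  # 7
```

Under the same assumed scores, the local alignment GPSTY / GPSTY alone scores 10, because it is free to ignore the poorly matching flanks.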

Slide 19

Slide 19 text

More Complications

Reference sequences are long.
• The scoring matrix will be huge, but most cells will be unused for traceback.
• Remember this needs to be done for all reads in both orientations.

How do we reduce the space requirement and speed up computation?
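
A quick back-of-the-envelope sketch shows the scale of the problem. The numbers used here (a reference of 3 billion bases, roughly the size of the human genome, and a 100 bp read) are illustrative assumptions, as are the helper names.

```python
# Translation table for complementing bases; N (unknown base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the sequence as it would read on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def matrix_cells(reference_length: int, read_length: int) -> int:
    """Number of cells in a full dynamic-programming scoring matrix."""
    return (reference_length + 1) * (read_length + 1)


print(reverse_complement("AACG"))  # CGTT

# One read, one orientation, against the whole reference:
cells = matrix_cells(3_000_000_000, 100)
print(cells)      # 303000000101
# Both orientations double the work, and this is repeated for every read.
print(2 * cells)  # 606000000202
```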

Slide 22

Slide 22 text

Aligning with Indices
Insight

Our reference sequence can be transformed into another structure more suitable for alignment of sequencing reads.

Analogy: book index
• Information is more or less preserved.
• Looking up words becomes much quicker.

Modern aligners utilize various index data structures:
• Hash table
• Suffix array
• FM index
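
Of the three structures listed, the hash table is the easiest to sketch. The example below maps every k-mer (substring of length k) of the reference to the positions where it occurs. The function name `build_kmer_index` and the choice k = 4 are assumptions for illustration; suffix arrays and FM indices are more memory-efficient but are not shown here.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer in the reference to the list of its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


index = build_kmer_index("ACGTACGTTAGC", 4)  # toy reference, made up for illustration
print(index["ACGT"])          # [0, 4]
print(index.get("GGGG", []))  # []
```

The index is built once per reference; after that, each lookup is a dictionary access instead of a scan over the whole sequence.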

Slide 23

Slide 23 text

Aligning with Indices
• Use index to find candidates
• Use Smith-Waterman on candidate locations
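
The two steps above can be sketched as follows. This is a simplified illustration, not how any particular aligner works: all names, the seed length `k`, the `padding` around each candidate and the scoring values are assumptions, and only the local alignment score is computed (no traceback).

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash-table index: k-mer -> list of start positions in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score only, keeping a single matrix row in memory."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def seed_and_extend(reference, read, k=4, padding=2):
    """Return (candidate_start, score) of the best-scoring candidate, or None."""
    index = build_kmer_index(reference, k)
    # Step 1: every k-mer of the read that hits the index proposes a read start.
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            candidates.add(pos - offset)
    # Step 2: run Smith-Waterman only on a small window around each candidate.
    results = []
    for start in sorted(candidates):
        lo = max(0, start - padding)
        hi = min(len(reference), start + len(read) + padding)
        results.append((start, local_score(reference[lo:hi], read)))
    return max(results, key=lambda r: r[1]) if results else None


reference = "TTGACCGTAGGCTAACGTTAGCAT"  # toy reference, made up for illustration
read = "GTAGGATAAC"                     # reference[6:16] with one substituted base
print(seed_and_extend(reference, read))  # (6, 17)
```

The scoring matrix now covers only the read and a short window of the reference, rather than the whole reference.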

Slide 25

Slide 25 text

More Complications?
Highly Similar Regions

Some reads can still be mapped to multiple locations. How do we ensure we map to the correct location?
• Use paired-end reads
• Use longer reads
• Discard them / define a maximum multi-alignment limit
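
The last option, a maximum multi-alignment limit, can be sketched like this. The names `filter_multimappers` and `max_hits` are assumptions for illustration, and exact matching stands in for a real aligner. The example also shows how a longer read resolves an ambiguity that a shorter one cannot.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based exact-match position of read in reference."""
    positions, start = [], reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def filter_multimappers(reference: str, reads: list[str], max_hits: int = 1):
    """Split mapped reads into kept (within the limit) and discarded (too many hits)."""
    kept, discarded = {}, []
    for read in reads:
        hits = find_all(reference, read)
        if not hits:
            continue  # unmapped reads are outside the scope of this sketch
        if len(hits) <= max_hits:
            kept[read] = hits
        else:
            discarded.append(read)
    return kept, discarded


reference = "ACGTACGTTAGC"  # toy reference, made up for illustration
kept, discarded = filter_multimappers(reference, ["ACGT", "TAGC", "ACGTTAGC"])
print(kept)       # {'TAGC': [8], 'ACGTTAGC': [4]}
print(discarded)  # ['ACGT']
```

The 4-base read "ACGT" maps twice and is discarded, while the 8-base read "ACGTTAGC" contains the same prefix but maps uniquely.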

Slide 26

Slide 26 text

More Complications?
Highly Similar Regions

Slide 28

Slide 28 text

More Complications?
Splicing (RNA-seq)

Eukaryotic genes contain introns, but RNA-seq reads (from cDNAs) derive from spliced transcripts, so a single read can span an exon-exon junction whose two sides are separated by an intron in the genome. How do we know the correct intron locations?
• Align to the transcriptome (at the cost of ignoring novel exons)
• Use split read aligners when mapping to the genome
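
The idea behind split read alignment can be sketched with a toy example: try every way of cutting the read in two, and look for the left part and the right part at separate genomic locations. Everything here is an illustrative assumption (the name `split_align`, the `min_anchor` parameter, exact matching, and considering only the first occurrence of each part); real split read aligners use indices, allow mismatches and take splice-site signals into account.

```python
def split_align(genome: str, read: str, min_anchor: int = 3) -> list[tuple[int, int, int]]:
    """Return (read_start, intron_start, intron_end) for each valid split of the read.

    intron_start is inclusive and intron_end is exclusive, both 0-based.
    """
    results = []
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)  # simplification: first occurrence only
        if left_pos == -1:
            continue
        left_end = left_pos + len(left)
        # The right part must start at least one base further on,
        # so that a non-empty intron lies between the two parts.
        right_pos = genome.find(right, left_end + 1)
        if right_pos != -1:
            results.append((left_pos, left_end, right_pos))
    return results


# Toy gene, made up for illustration: exon 1 + intron + exon 2.
exon1, intron, exon2 = "ACGGTCA", "GTAAGTTTTCAG", "CTTGACC"
genome = exon1 + intron + exon2
read = "GTCACTTG"  # last 4 bases of exon 1 joined to the first 4 bases of exon 2

print(genome.find(read))         # -1: the read is not contiguous in the genome
print(split_align(genome, read))  # [(3, 7, 19)]
print(genome[7:19] == intron)     # True
```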

Slide 29

Slide 29 text

More Complications?
Splicing (RNA-seq)

Slide 30

Slide 30 text

More Complications?
Circular Genomes

Bacterial and mitochondrial genomes are almost always circular. How do we make alignment work with circular references?
• Trick: extend the reference by appending a copy of its first N bases to the end
• Use an aligner that can handle circular references (e.g. GMAP)
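
The extension trick can be sketched as below, reading "N bases" as a copy of the first N bases of the reference (an interpretation assumed here). The function names are made up for this example, exact matching stands in for a real aligner, and N is set to the read length minus one so that every read crossing the origin fits while no match is reported twice.

```python
def extend_circular(reference: str, n: int) -> str:
    """Linearize a circular reference by appending a copy of its first n bases."""
    return reference + reference[:n]


def find_on_circle(reference: str, read: str) -> list[int]:
    """Return 0-based start positions of read on the circular reference."""
    extended = extend_circular(reference, len(read) - 1)
    positions = []
    start = extended.find(read)
    while start != -1:
        # With an extension of len(read) - 1, every start lies on the original reference.
        positions.append(start)
        start = extended.find(read, start + 1)
    return positions


reference = "ACGTTAGC"  # toy circular genome, made up for illustration
read = "GCAC"           # last 2 bases followed by the first 2 bases

print(reference.find(read))             # -1: missed on the linear reference
print(find_on_circle(reference, read))  # [6]
```

Positions reported in the appended region of a real alignment would need to be mapped back onto the original coordinates.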

Slide 31

Slide 31 text

Which Aligner?

Regular read aligners
• BWA (MEM)
• Bowtie2
• BLASR

Split read aligners
• HISAT2
• STAR
• GSNAP

Slide 32

Slide 32 text

Which Aligner?
Practical Considerations
• Technical advantages
• Use tools with good developer support
• Use tools with good documentation

Slide 33

Slide 33 text

Acknowledgements
Martijn Vermaat
Jeroen Laros