2016 Alignment Methods - Speaker Deck

Slide 1

Slide 1 text

Alignment Methods
NGS Data Analysis Course, 29 August 2016
Wibowo Arindrarto
Sequencing Analysis Support Core, Leiden University Medical Center

Slide 2

Slide 2 text

Introduction
Motivation
Sanger Sequencing:
• tens of sequences, 500 - 1300 bp
• reference sequence region (usually a region in the genome)
1/18 NGS Data Analysis Course 29-08-2016

Slide 3

Slide 3 text

Introduction
Motivation
Sanger Sequencing:
• tens of sequences, 500 - 1300 bp
• reference sequence region (usually a region in the genome)
Next Generation Sequencing:
• short read: tens of millions of sequences, 200 - 300 bp
• long read: tens of thousands of sequences, 2,000 - 8,000 bp
• genomic origin of each read is unknown

Slide 4

Slide 4 text

Slide 5

Slide 5 text

String Matching Problem
Insight
Nucleotide bases can be represented by characters: Adenine (A), Thymine (T), Guanine (G), Cytosine (C).
Our sequencing reads become short strings; our reference sequence becomes one long string.
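Once reads and reference are plain strings, the simplest conceivable aligner is an exact substring search. The following is a minimal sketch, not something taken from the slides: the helper name `find_exact` and the toy sequences are made up for illustration, and it ignores sequencing errors and the reverse strand.

```python
def find_exact(reference: str, read: str) -> list:
    """Return every 0-based position where `read` occurs exactly in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Move one base further so overlapping occurrences are found too.
        start = reference.find(read, start + 1)
    return positions


reference = "GATTACAGATTACA"   # toy reference, not real data
print(find_exact(reference, "TTAC"))
```

With this toy reference the read occurs twice, at positions 2 and 9.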

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Slide 8

Slide 8 text

String Matching Problem
Insight
Not so far from your daily lives ...

Slide 9

Slide 9 text

Complications
Sequencing-related
• Errors occur during sequencing: misread bases, missing regions

Slide 10

Slide 10 text

Complications
Sequencing-related
• Errors occur during sequencing: misread bases, missing regions
Reference-related
• Our best reference genome still has unknown regions.
• Genetic variations exist between organisms.
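Because of these complications, exact string matching is not enough: a single misread base makes a read disappear from an exact search. As a rough illustration, the sketch below tolerates a fixed number of substitutions using the Hamming distance. The function names and sequences are hypothetical, and it handles mismatches only, not missing or inserted bases.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_with_mismatches(reference: str, read: str, max_mismatches: int) -> list:
    """Return (position, mismatches) for every window within the mismatch budget."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        distance = hamming(reference[pos:pos + len(read)], read)
        if distance <= max_mismatches:
            hits.append((pos, distance))
    return hits


reference = "GATTACAGATTACA"   # toy reference
read = "TTGC"                  # "TTAC" with one misread base
print(find_with_mismatches(reference, read, 0))   # exact search finds nothing
print(find_with_mismatches(reference, read, 1))   # one mismatch allowed
```

Allowing one mismatch recovers both original locations (positions 2 and 9), each with a distance of 1.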

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Smith-Waterman Algorithm
Basic ideas
• Given two strings and a scoring scheme, find the most similar region.

Slide 13

Slide 13 text

Smith-Waterman Algorithm
Basic ideas
• Given two strings and a scoring scheme, find the most similar region.
• Score for matches, penalize mismatches and gaps.

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Smith-Waterman Algorithm
Basic ideas
• Given two strings and a scoring scheme, find the most similar region.
• Score for matches, penalize mismatches and gaps.
• Strategy: compare optimal alignments of all substrings and pick the highest-scoring one.
Characteristics
• Guaranteed to find the optimal alignment under the given scoring scheme.
• Finds local regions of similarity.
• Variation of Needleman-Wunsch, which finds global similarities.
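The ideas above can be sketched in a few lines of Python. This is an illustrative implementation, not code from the course: it assumes a linear gap penalty, and the default scores (match +2, mismatch -1, gap -2) and the function name `smith_waterman` are arbitrary choices.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1] and b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # a[i-1] aligned to a gap in b
            left = score[i][j - 1] + gap    # b[j-1] aligned to a gap in a
            # The 0 lets an alignment restart anywhere, which makes it local.
            score[i][j] = max(0, diagonal, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        substitution = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + substitution:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("AAGATTACATT", "CCGATTACAGG"))
```

The two toy inputs share only the region `GATTACA`, so the local alignment reports that region and ignores the unrelated flanks.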

Slide 16

Slide 16 text

Smith-Waterman Algorithm
Adapted from http://www.langmead-lab.org/teaching-materials/

Slide 17

Slide 17 text

Local vs Global Alignment
    Sequence 1: MAHGPSTYRWSKR
    Sequence 2: MGPSTYVKR

    ----------------
    Global Alignment
    ----------------
    5’ MAHGPSTYRWSKR 3’
       |  |||||   ||
    5’ M--GPSTY--VKR 3’

    ---------------
    Local Alignment
    ---------------
    5’ GPSTY 3’
       |||||
    5’ GPSTY 3’
Listing 1: Alignments
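The difference between the two modes comes down to two details of the scoring matrix: how the borders are initialised and where the final score is read. The sketch below computes both scores for the two sequences in the listing. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions made for illustration; the slides do not state which scheme produced the listing.

```python
def alignment_score(a: str, b: str, local: bool,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Score of the best global (Needleman-Wunsch) or local (Smith-Waterman) alignment."""
    # Global alignment pays for leading gaps; local alignment starts for free.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(0, cell)       # local: a score never drops below zero
                best = max(best, cell)    # local: take the best cell anywhere
            curr.append(cell)
        prev = curr
    # Global: the answer is the bottom-right cell, covering both full sequences.
    return best if local else prev[-1]


seq1, seq2 = "MAHGPSTYRWSKR", "MGPSTYVKR"
print("global:", alignment_score(seq1, seq2, local=False))
print("local: ", alignment_score(seq1, seq2, local=True))
```

Under this assumed scheme the global score is 3 (eight matches, one mismatch, four gap positions) and the local score is 5 (the five matches of `GPSTY`), which agrees with the two alignments shown in the listing.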

Slide 18

Slide 18 text

More Complications
Reference sequences are long.
• Scoring matrix size will be huge but most cells will be unused for traceback.
• Remember this needs to be done for all reads in both orientations.
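A back-of-the-envelope calculation shows how quickly this gets out of hand. All numbers below are assumptions chosen for illustration: a 200 bp read, a reference of roughly human-genome size (about 3.1 billion bp), and ten million reads. The helper names are hypothetical.

```python
def matrix_cells(read_length: int, reference_length: int) -> int:
    """Cells in one scoring matrix, ignoring the extra boundary row and column."""
    return read_length * reference_length


def reverse_complement(sequence: str) -> str:
    """The same read as seen from the opposite strand."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]


READ_LENGTH = 200                   # assumed short-read length in bp
REFERENCE_LENGTH = 3_100_000_000    # assumed; roughly the size of the human genome
READ_COUNT = 10_000_000             # assumed; "tens of millions" of reads
ORIENTATIONS = 2                    # each read and its reverse complement

per_matrix = matrix_cells(READ_LENGTH, REFERENCE_LENGTH)
total = per_matrix * READ_COUNT * ORIENTATIONS
print(f"cells per matrix:    {per_matrix:.2e}")
print(f"cells for all reads: {total:.2e}")
print(reverse_complement("GATTACA"))
```

Under these assumptions a single read already needs about 6.2e11 cells, and the full data set about 1.2e19, which is why a plain Smith-Waterman scan of the whole reference is impractical.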

Slide 19

Slide 19 text

Slide 20

Slide 20 text

Aligning with Indices
Insight
Our reference sequence can be transformed into another structure more suitable for alignment of sequencing reads.

Slide 21

Slide 21 text

Aligning with Indices
Insight
Our reference sequence can be transformed into another structure more suitable for alignment of sequencing reads.
Analogy: book index
• Information more or less preserved.
• Looking up words becomes much quicker.
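To make the book-index analogy concrete, here is a minimal sketch of one possible index: a dictionary from every k-mer (substring of length k) to the positions where it occurs. This is a deliberately simple stand-in. Aligners such as BWA and Bowtie2 use a far more compact FM-index based on the Burrows-Wheeler transform. The function name and the toy reference are made up.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


reference = "GATTACAGATTACA"           # toy reference
index = build_kmer_index(reference, k=4)
# One dictionary lookup replaces a scan over the whole reference.
print(index.get("TTAC", []))
```

The index is built once per reference and reused for every read, just as a book index is compiled once and consulted many times.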

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Aligning with Indices
• Use index to find candidates
• Use Smith-Waterman on candidate locations
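These two steps are often called seed-and-extend. The sketch below is one hypothetical way to combine them: short exact seeds from the read are looked up in a k-mer index, each hit proposes a candidate start position, and Smith-Waterman scoring is run only on a small window around each candidate. The function names, the seed length `k`, the `padding` and the scoring values are all assumptions. Real aligners choose seeds and windows far more carefully.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score only (no traceback), keeping one matrix row at a time."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(0, diagonal, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def seed_and_extend(reference, read, k=4, padding=2):
    """Return (score, window_start) of the best-scoring candidate window, or None."""
    index = build_kmer_index(reference, k)
    # Step 1: every seed hit proposes the position where the read would start.
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            candidates.add(hit - offset)
    # Step 2: run Smith-Waterman only on a small window around each candidate.
    results = []
    for start in candidates:
        lo = max(0, start - padding)
        hi = min(len(reference), start + len(read) + padding)
        results.append((local_score(read, reference[lo:hi]), lo))
    return max(results) if results else None


reference = "TTGACCGTAGGCTAACGTTCAGGA"   # toy reference
read = "AGGCTGACGT"                      # reference[8:18] with one misread base
print(seed_and_extend(reference, read))
```

Only two small windows are scored here instead of the full reference. The returned position is the start of the scored window, not the exact alignment start; a traceback would be needed for that.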

Slide 24

Slide 24 text

More Complications?
Highly Similar Regions
Some reads can still be mapped to multiple locations. How do we ensure we map to the correct location?

Slide 25

Slide 25 text

More Complications?
Highly Similar Regions
Some reads can still be mapped to multiple locations. How do we ensure we map to the correct location?
• Use paired-end reads
• Use longer reads
• Discard them / define a maximum multi-alignment limit
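Two of these options, longer reads and a maximum multi-alignment limit, can be illustrated with a toy reference that contains a repeat. The helper names and the `max_hits` parameter are hypothetical and do not correspond to any particular aligner's options.

```python
def find_all(reference: str, read: str) -> list:
    """All 0-based exact-match positions of `read`, including overlapping ones."""
    hits, start = [], reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


def place_read(reference: str, read: str, max_hits: int = 1):
    """Return the hit list, or None when the read is unmapped or maps too often."""
    hits = find_all(reference, read)
    if not hits or len(hits) > max_hits:
        return None      # unmapped, or discarded as a multi-mapper
    return hits


# Toy reference with the repeat "GATTACA" at positions 0 and 10.
reference = "GATTACACCCGATTACATTT"
print(place_read(reference, "GATTACA"))      # short read that lies inside the repeat
print(place_read(reference, "GATTACATT"))    # longer read that reaches unique sequence
```

The short read hits both copies of the repeat and is discarded under a limit of one, while the longer read extends into unique sequence and maps only to position 10.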

Slide 26

Slide 26 text

More Complications?
Highly Similar Regions

Slide 27

Slide 27 text

More Complications?
Splicing (RNA-seq)
Eukaryotic genes contain introns. RNA-seq reads (from cDNAs) come from spliced transcripts, so a read can span an intron when mapped to the genome. How do we know the correct intron locations?

Slide 28

Slide 28 text

More Complications?
Splicing (RNA-seq)
Eukaryotic genes contain introns. RNA-seq reads (from cDNAs) come from spliced transcripts, so a read can span an intron when mapped to the genome. How do we know the correct intron locations?
• Align to the transcriptome (at the cost of ignoring novel exons)
• Use split read aligners when mapping to the genome
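Both options can be demonstrated on a toy gene with two exons and one intron. The sketch below is a deliberately naive illustration: `align_spliced` is a hypothetical helper that tries every split point and requires exact matches on both sides, which is far simpler than what split read aligners such as HISAT2 or STAR actually do. All sequences are made up.

```python
def align_spliced(genome: str, read: str, min_anchor: int = 3):
    """Return (left_pos, split, right_pos) for a read split over one intron, else None."""
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        if left_pos == -1:
            continue
        # The right part must start after the left part ends, leaving room for an intron.
        right_pos = genome.find(right, left_pos + split + 1)
        if right_pos != -1:
            return left_pos, split, right_pos
    return None


exon1, intron, exon2 = "ATGGCCAAC", "GTAAGTTTTCAG", "TTGACCTGA"   # toy sequences
genome = exon1 + intron + exon2
transcript = exon1 + exon2       # spliced transcript: intron removed
read = transcript[5:13]          # read spanning the exon-exon junction

print(genome.find(read))         # the read is not contiguous in the genome
print(transcript.find(read))     # aligning to the transcriptome works directly
print(align_spliced(genome, read))
```

The contiguous genome search fails (-1), the transcriptome search succeeds at position 5, and the split search returns `(5, 4, 21)`: the first four bases align at genome position 5 and the rest at position 21, directly after the 12-base intron.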

Slide 29

Slide 29 text

More Complications?
Splicing (RNA-seq)

Slide 30

Slide 30 text

More Complications?
Circular Genomes
Bacterial and mitochondrial genomes are almost always circular. How do we make alignment work with circular references?
• Trick: extend the reference by appending its first N bases to the end
• Use an aligner that can handle circular references (e.g. GMAP)
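The extension trick can be sketched as follows, assuming it means copying the first bases of the reference onto its end so that reads crossing the origin become contiguous. That reading of the slide is an interpretation, and the function name and toy sequences are hypothetical.

```python
def find_on_circular(reference: str, read: str) -> list:
    """Exact-match positions (0-based) of `read` on a circular reference."""
    # Append the first len(read) - 1 bases so reads crossing the origin fit linearly.
    extended = reference + reference[:len(read) - 1]
    hits, start = [], extended.find(read)
    while start != -1:
        hits.append(start)
        start = extended.find(read, start + 1)
    return hits


plasmid = "GATTACACCGT"            # toy circular sequence
read = "CGTGAT"                    # spans the end and the start of the sequence

print(plasmid.find(read))          # not found on the unmodified linear reference
print(find_on_circular(plasmid, read))
```

Extending by `len(read) - 1` bases is enough to place any read of that length, and every reported start position still falls inside the original reference. Here the read is found at position 8.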

Slide 31

Slide 31 text

Which Aligner?
Regular read aligners
• BWA (MEM)
• Bowtie2
• BLASR
Split read aligners
• HISAT2
• STAR
• GSNAP

Slide 32

Slide 32 text

Which Aligner?
Practical Considerations
• Technical advantages
• Use tools with good developer support
• Use tools with good documentation

Slide 33

Slide 33 text

Acknowledgements
Martijn Vermaat
Jeroen Laros