2016 Alignment Methods

Alignment Methods
NGS Data Analysis Course
29 August 2016
Wibowo Arindrarto
Sequencing Analysis Support Core
Leiden University Medical Center

Introduction

Motivation

Sanger Sequencing:
• tens of sequences, 500 - 1300 bp
• reference sequence region (usually a region in the genome)

Next Generation Sequencing:
• short read: tens of millions of sequences, 200 - 300 bp
• long read: tens of thousands of sequences, 2,000 - 8,000 bp
• unknown location of origin for the reads

Where do these reads come from?

String Matching Problem

Insight

Nucleotide bases can be represented by characters: Adenine (A), Thymine (T), Guanine (G), Cytosine (C). Our sequencing reads become short strings and our reference sequence becomes one long string.

Where can the reads be found within the reference? In other words, we want to solve a string matching problem.
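To make the string matching view concrete, here is a minimal Python sketch of exact matching. The function name `find_exact_matches` and the toy sequences are illustrative assumptions, not part of the course material.

```python
def find_exact_matches(reference, read):
    """Return every 0-based position where `read` occurs in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step one base forward so overlapping occurrences are found too.
        start = reference.find(read, start + 1)
    return positions


# Toy data (assumed for illustration): a short "reference" and a 4-base "read".
reference = "ACGTTGCAACGTAC"
read = "ACGT"
print(find_exact_matches(reference, read))  # [0, 8]
```

A read that differs from the reference by even one base is not found this way, which is the limitation the next slides address.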

String Matching Problem

Insight

Not so far from your daily lives ...

Complications

Sequencing-related
• Errors occur during sequencing: misread bases, missing regions

Reference-related
• Our best reference genome still has unknown regions.
• Genetic variations exist between organisms.

We need to find approximate instead of exact matches.
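One simple form of approximate matching is to tolerate a limited number of mismatched bases. The sketch below slides the read along the reference and counts mismatches (Hamming distance). It handles substitutions only, not insertions or deletions, and the names and the `max_mismatches` parameter are assumptions for illustration.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_approximate_matches(reference, read, max_mismatches):
    """Return (position, mismatches) for every window within the mismatch limit."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        distance = hamming(window, read)
        if distance <= max_mismatches:
            hits.append((start, distance))
    return hits


# The read "ACGA" has one misread base compared to "ACGT" in the reference.
print(find_approximate_matches("ACGTTGCAACGTAC", "ACGA", max_mismatches=1))
# [(0, 1), (8, 1)]
```

Gaps (missing or extra bases) need a more general method such as the dynamic programming approach on the next slides.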

Smith-Waterman Algorithm

Basic ideas
• Given two strings and a scoring scheme, find the most similar region.
• Score for matches, penalize mismatches and gaps.
• Strategy: compare optimal alignments of all substrings and pick the highest-scoring one.

Characteristics
• Guaranteed to find the optimal alignment under the given scoring scheme.
• Finds local regions of similarity.
• Variation of Needleman-Wunsch, which finds global similarities.
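The basic ideas above can be sketched in a few lines of Python. This is a minimal version with a linear gap penalty; the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration, and real aligners typically use affine gap penalties and, for proteins, substitution matrices.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("MAHGPSTYRWSKR", "MGPSTYVKR"))  # (10, 'GPSTY', 'GPSTY')
```

The matrix has (len(a) + 1) x (len(b) + 1) cells, so both time and memory grow with the product of the two sequence lengths.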

Smith-Waterman Algorithm

Adapted from http://www.langmead-lab.org/teaching-materials/

Local vs Global Alignment

    Sequence 1: MAHGPSTYRWSKR
    Sequence 2: MGPSTYVKR

    ----------------
    Global Alignment
    ----------------
    5' MAHGPSTYRWSKR 3'
       |  |||||   ||
    5' M--GPSTY--VKR 3'

    ---------------
    Local Alignment
    ---------------
    5' GPSTY 3'
       |||||
    5' GPSTY 3'

Listing 1: Alignments

More Complications

Reference sequences are long.
• The scoring matrix will be huge, but most cells will be unused for traceback.
• Remember this needs to be done for all reads in both orientations.

How do we reduce the space requirement and speed up computation?
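A rough back-of-the-envelope calculation shows why this does not scale. The numbers below (a reference of about 3.1 billion bases, 250 bp reads, ten million reads) are assumptions picked for illustration. The helper `reverse_complement` shows what "both orientations" means in practice: each read is also tried as its reverse complement.

```python
# Translation table for complementing bases; N (unknown base) stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the sequence as it would read on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def matrix_cells(read_length, reference_length):
    """Number of cells in a full dynamic programming scoring matrix."""
    return (read_length + 1) * (reference_length + 1)


print(reverse_complement("ACGTT"))  # AACGT

# Assumed, illustrative sizes.
reference_length = 3_100_000_000
read_length = 250
n_reads = 10_000_000

cells_per_read = matrix_cells(read_length, reference_length)
# Each read is aligned twice: as sequenced and as its reverse complement.
total_cells = cells_per_read * n_reads * 2
print(f"cells per read: {cells_per_read:.2e}")
print(f"cells for all reads, both orientations: {total_cells:.2e}")
```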

Aligning with Indices

Insight

Our reference sequence can be transformed into another structure more suitable for alignment of sequencing reads.

Analogy: book index
• Information more or less preserved.
• Looking up words becomes much quicker.

Modern aligners utilize various index data structures:
• Hash table
• Suffix array
• FM index
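Of the three structures listed, the hash table is the simplest to sketch. The example below maps every k-letter substring (k-mer) of the reference to the positions where it occurs. The function name and the choice of k = 4 are assumptions for illustration; real aligners use longer k-mers and far more compact encodings.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer in `reference` to the list of its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


index = build_kmer_index("ACGTTGCAACGTAC", k=4)
# Looking up a k-mer is now a dictionary access instead of a scan of the reference.
print(index["ACGT"])  # [0, 8]
```

Like a book index, the table is built once per reference and then reused for every read.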

Aligning with Indices
• Use index to find candidates
• Use Smith-Waterman on candidate locations
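The two steps can be sketched together as a toy "seed and extend" routine: k-mers from the read are looked up in a hash index to propose candidate positions, and a Smith-Waterman score is computed only in a small window around each candidate. All names, the scoring values, and the `k` and `pad` parameters are assumptions for illustration, not how any particular aligner is implemented.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score only (no traceback), keeping one matrix row at a time."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def seed_and_extend(reference, read, k=4, pad=2):
    """Return (score, candidate start) pairs, best score first."""
    index = build_kmer_index(reference, k)
    candidates = set()
    # Step 1: every k-mer of the read that occurs in the index votes for a start position.
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            candidates.add(max(0, pos - offset))
    # Step 2: score the read against a small reference window around each candidate.
    results = []
    for start in candidates:
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        results.append((local_score(read, reference[lo:hi]), start))
    return sorted(results, key=lambda r: (-r[0], r[1]))


# The read differs from reference[6:16] by a single base.
print(seed_and_extend("GGGTTTACGATTACAGCCC", "ACGATAACAG"))  # [(17, 6)]
```

The dynamic programming step now runs on windows roughly the size of the read instead of on the whole reference.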

More Complications?

Highly Similar Regions

Some reads can still be mapped to multiple locations. How do we ensure we map to the correct location?
• Use paired-end reads
• Use longer reads
• Discard them / define a maximum multi-alignment limit
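The last option can be expressed as a small filter over a read's candidate alignments. This is a hypothetical sketch: the `(score, position)` tuple format, the labels, and the default limit of 5 are assumptions, and real aligners expose this behaviour through their own options and mapping quality values.

```python
def classify_read(hits, max_multi=5):
    """Classify a read by how many equally good locations it has.

    `hits` is a list of (score, position) tuples for one read.
    Returns a (label, positions) pair.
    """
    if not hits:
        return "unmapped", []
    best = max(score for score, _ in hits)
    top = sorted(pos for score, pos in hits if score == best)
    if len(top) == 1:
        return "unique", top
    if len(top) <= max_multi:
        return "multi", top
    # Too many equally good locations: treat the read as uninformative.
    return "discarded", []


print(classify_read([(40, 1200), (31, 98000)]))       # ('unique', [1200])
print(classify_read([(40, 1200), (40, 98000)]))       # ('multi', [1200, 98000])
print(classify_read([(40, p) for p in range(0, 70, 10)]))  # ('discarded', [])
```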

More Complications?

Highly Similar Regions

More Complications?

Splicing (RNA-seq)

RNA-seq reads (from cDNAs) from eukaryotes come from spliced transcripts, so a read can span an exon-exon junction and skip the intron in between when mapped to the genome. How do we know the correct intron locations?
• Align to the transcriptome (at the cost of ignoring novel exons)
• Use split read aligners when mapping to the genome
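The idea behind split read alignment can be shown with a toy example: try each way of cutting the read in two, and look for both parts in the genome with a gap (the intron) in between. This sketch uses exact matching only, and the function name and the `min_anchor` and `max_intron` parameters are assumptions; real split read aligners also use splice site signals and tolerate mismatches.

```python
def find_split_alignment(genome, read, min_anchor=3, max_intron=1000):
    """Return (left_start, left_end, right_start) for the first spliced match, or None.

    genome[left_end:right_start] is the inferred intron.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        while left_start != -1:
            left_end = left_start + len(left)
            # The right part must start downstream, after an intron of at least one base.
            right_start = genome.find(right, left_end + 1)
            if right_start != -1 and right_start - left_end <= max_intron:
                return left_start, left_end, right_start
            left_start = genome.find(left, left_start + 1)
    return None


# Toy gene (assumed): two exons separated by one intron.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "TTCGAA"
genome = "CC" + exon1 + intron + exon2 + "GG"
read = "GGCC" + "TTCG"  # end of exon 1 joined to start of exon 2

left_start, left_end, right_start = find_split_alignment(genome, read)
print(left_start, left_end, right_start)   # 4 8 20
print(genome[left_end:right_start])        # GTAAGTTTTCAG
```

When bases at the exon and intron boundaries happen to be identical, more than one split point can fit; this sketch simply returns the first one it finds.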

More Complications?

Splicing (RNA-seq)

More Complications?

Circular Genomes

Bacterial and mitochondrial genomes are almost always circular. How do we make alignment work with circular references?
• Trick: extend the reference by appending its first N bases to the end
• Use an aligner that can handle circular references (e.g. GMAP)
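The extension trick can be sketched as follows, assuming it means copying the first N bases of the reference onto its end so that reads spanning the artificial start/end point still find a contiguous match. The function names are illustrative, and choosing N as the read length minus one is an assumption that keeps every reported start position inside the original coordinates.

```python
def extend_circular(reference, n):
    """Append the first n bases of a circular reference to its end."""
    return reference + reference[:n]


def map_on_circle(reference, read):
    """Return 0-based start positions of exact matches, including origin-spanning ones."""
    # len(read) - 1 extra bases are enough for a read starting at the last base.
    extended = extend_circular(reference, len(read) - 1)
    positions = []
    start = extended.find(read)
    while start != -1:
        positions.append(start)
        start = extended.find(read, start + 1)
    return positions


genome = "GATTACACCG"   # toy circular genome, written starting at an arbitrary origin
read = "CCGGAT"         # spans the origin: ends with the first bases of the genome

print(genome.find(read))            # -1 (missed on the linear reference)
print(map_on_circle(genome, read))  # [7]
```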

Which Aligner?

Regular read aligners
• BWA (MEM)
• Bowtie2
• BLASR

Split read aligners
• HISAT2
• STAR
• GSNAP

Which Aligner?

Practical Considerations
• Technical advantages
• Use tools with good developer support
• Use tools with good documentation

Acknowledgements
• Martijn Vermaat
• Jeroen Laros
