2016 Alignment Methods

Lecture about sequence alignment methods given in a 2016 next-generation sequencing course

Wibowo Arindrarto

August 29, 2016

Transcript

  1. Alignment Methods

    NGS Data Analysis Course, 29 August 2016
    Wibowo Arindrarto
    Sequencing Analysis Support Core, Leiden University Medical Center
  4. Introduction

    Motivation

    Sanger sequencing:
    • tens of sequences, 500 - 1300 bp
    • reference sequence region (usually a region in the genome)

    Next-generation sequencing:
    • short reads: tens of millions of sequences, 200 - 300 bp
    • long reads: tens of thousands of sequences, 2,000 - 8,000 bp
    • the location each read originates from is unknown

    Where do these reads come from?
  7. String Matching Problem

    Insight

    Nucleotide bases can be represented by characters: Adenine (A), Thymine (T), Guanine (G), Cytosine (C).

    Our sequencing reads become short strings, and our reference sequence becomes one long string.

    Where can the reads be found within the reference?

    In other words, we want to solve a string matching problem.
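
The string matching view on this slide can be tried out directly. The following is a minimal sketch, assuming exact matches only and a toy reference; the function name `find_all` and the sequences are made up for illustration.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based position where `read` occurs exactly in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping occurrences are also reported.
        start = reference.find(read, start + 1)
    return positions


reference = "ACGTACGTTAGC"  # toy reference, not real data
print(find_all(reference, "ACGT"))  # [0, 4]
print(find_all(reference, "TAGC"))  # [8]
print(find_all(reference, "GGGG"))  # []
```

A read that occurs more than once, like `ACGT` here, already hints at the multi-mapping problem raised later in the deck.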
  8. String Matching Problem

    Insight

    Not so far from your daily lives ...
  10. Complications

    Sequencing-related:
    • Errors occur during sequencing: misread bases, missing regions

    Reference-related:
    • Our best reference genome still has unknown regions.
    • Genetic variations exist between organisms.

    We need to find approximate instead of exact matches.
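
One simple way to make matching approximate is to tolerate a fixed number of mismatched bases. The sketch below assumes substitutions only (no insertions or deletions, which need the gapped alignment discussed next); `hamming`, `find_approximate`, and the sequences are illustrative names and data, not from the lecture.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_approximate(reference: str, read: str, max_mismatches: int):
    """Return (position, mismatches) for every window within the mismatch budget."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = hamming(window, read)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


reference = "ACGTACGTTAGC"  # toy reference
# The read differs from the reference by one misread base (T read as A).
print(find_approximate(reference, "ACGA", max_mismatches=1))  # [(0, 1), (4, 1)]
print(find_approximate(reference, "ACGA", max_mismatches=0))  # []
```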
  14. Smith-Waterman Algorithm

    Basic ideas:
    • Given two strings and a scoring scheme, find the most similar region.
    • Score for matches, penalize mismatches and gaps.
    • Strategy: compare optimal alignments of all substrings and pick the highest-scoring one.

    Characteristics:
    • Guaranteed to find the optimal alignment under the given scoring scheme.
    • Finds local regions of similarity.
    • Local-alignment variation of Needleman-Wunsch, which finds global similarities.
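
The basic ideas above can be sketched as a small dynamic-programming routine. This is a minimal illustration, assuming a linear gap penalty and arbitrary scores (match +2, mismatch -1, gap -2); the slides do not specify a scoring scheme, and real aligners typically use affine gap penalties and heavily optimized implementations.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which is what makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# The two sequences from the "Local vs Global Alignment" slide.
print(smith_waterman("MAHGPSTYRWSKR", "MGPSTYVKR"))  # (10, 'GPSTY', 'GPSTY')
```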
  15. Local vs Global Alignment

        Sequence 1: MAHGPSTYRWSKR
        Sequence 2: MGPSTYVKR

        ----------------
        Global Alignment
        ----------------
        5’ MAHGPSTYRWSKR 3’
           |  |||||   ||
        5’ M--GPSTY--VKR 3’

        ---------------
        Local Alignment
        ---------------
        5’ GPSTY 3’
           |||||
        5’ GPSTY 3’

    Listing 1: Alignments
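
For contrast with the local case, a global alignment score can be computed with Needleman-Wunsch. The sketch below returns only the score and assumes the same illustrative scoring as before (match +2, mismatch -1, linear gap -2); these values are assumptions, not taken from the lecture.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the optimal global alignment score, keeping only two matrix rows in memory."""
    # First row: aligning an empty prefix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: a[:i] against an empty prefix of `b`
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Global score for the two sequences in Listing 1: every residue must be
# aligned or gapped, so the four gaps and one mismatch pull the score down.
print(needleman_wunsch_score("MAHGPSTYRWSKR", "MGPSTYVKR"))  # 7
print(needleman_wunsch_score("GPSTY", "GPSTY"))              # 10
```

Unlike the local variant, scores here are allowed to go negative and the result is read from the last cell rather than from the maximum cell.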
  17. More Complications

    Reference sequences are long.
    • The scoring matrix will be huge, but most cells will be unused for traceback.
    • Remember this needs to be done for all reads in both orientations.

    How do we reduce the space requirement and speed up computation?
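
Two small helpers make these points concrete: the reverse complement gives the second orientation of a read, and a cell count shows how large a full scoring matrix would be. This is a rough sketch; the 3-billion-base reference and 250 bp read are assumed round numbers for illustration only.

```python
# Translation table for complementing bases; N (unknown base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the sequence as it would read on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def matrix_cells(reference_length: int, read_length: int) -> int:
    """Number of cells in a full dynamic-programming matrix for one read."""
    return (reference_length + 1) * (read_length + 1)


print(reverse_complement("ACGTTAGC"))  # GCTAACGT

# Assumed sizes: a 3 Gbp reference and a single 250 bp read.
cells = matrix_cells(3_000_000_000, 250)
print(f"{cells:,} cells for one read in one orientation")
```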
  20. Aligning with Indices

    Insight

    Our reference sequence can be transformed into another structure more suitable for alignment of sequencing reads.

    Analogy: book index
    • Information is more or less preserved.
    • Looking up words becomes much quicker.

    Modern aligners utilize various index data structures:
    • Hash table
    • Suffix array
    • FM index
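
Two of the index structures listed above are small enough to sketch directly. The code below builds a hash table of k-mers and a naive suffix array over a toy reference; the function names are illustrative, and the suffix array construction shown here is far too slow and memory-hungry for a real genome (production tools use specialized construction algorithms and compressed structures such as the FM index).

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Hash table: map every k-mer to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def build_suffix_array(reference: str) -> list[int]:
    """Suffix array: start positions of all suffixes, in lexicographic order (naive build)."""
    return sorted(range(len(reference)), key=lambda i: reference[i:])


def suffix_array_search(reference: str, sa: list[int], pattern: str) -> list[int]:
    """Binary-search the suffix array for all occurrences of `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if reference[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if reference[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


reference = "ACGTACGTTAGC"  # toy reference
kmer_index = build_kmer_index(reference, k=4)
print(kmer_index.get("ACGT", []))  # [0, 4]

sa = build_suffix_array(reference)
print(suffix_array_search(reference, sa, "CGT"))  # [1, 5]
```

The hash table answers lookups only for the fixed k it was built with, while the suffix array supports patterns of any length at the cost of a logarithmic search.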
  21. Aligning with Indices

    • Use the index to find candidates
    • Use Smith-Waterman on candidate locations
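
The two steps on this slide are often described as seed-and-extend. The following is a toy sketch, assuming a k-mer hash table as the index, a small padding window around each candidate, and the same illustrative scoring as earlier (match +2, mismatch -1, gap -2); all names and parameters are hypothetical and not those of any particular aligner.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def local_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman score only (no traceback), using two rows of memory."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


def seed_and_extend(reference: str, read: str, k: int = 4, pad: int = 2):
    """Return (candidate start, score) of the best-scoring candidate, or None."""
    index = build_kmer_index(reference, k)
    # Step 1: each seed hit votes for the read start position it implies.
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    # Step 2: run Smith-Waterman only on a small window around each candidate.
    results = []
    for start in votes:
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        results.append((local_score(reference[lo:hi], read), start))
    if not results:
        return None
    score, start = max(results, key=lambda r: (r[0], -r[1]))
    return start, score


reference = "TTGACCATGCGTAACGGTCA"  # toy reference
read = "ATGCGTTACG"                 # reference[6:16] with one substituted base
print(seed_and_extend(reference, read))  # (6, 17)
```

The dynamic-programming matrix is now only as large as the read times a short window, instead of the read times the whole reference.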
  23. More Complications?

    Highly Similar Regions

    Some reads can still be mapped to multiple locations. How do we ensure we map to the correct location?
    • Use paired-end reads
    • Use longer reads
    • Discard them, or define a maximum number of alignments allowed per read
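
The last option, a cap on the number of alignments per read, amounts to a simple filter. The sketch below assumes candidate positions have already been computed for each read; the function name, the `max_alignments` parameter, and the read identifiers are hypothetical and do not correspond to any specific aligner option.

```python
def filter_multimappers(hits_per_read: dict, max_alignments: int = 1):
    """Split reads into those kept and those discarded (unmapped or over the limit)."""
    kept, discarded = {}, []
    for read_id, positions in hits_per_read.items():
        if 0 < len(positions) <= max_alignments:
            kept[read_id] = positions
        else:
            discarded.append(read_id)
    return kept, discarded


# Made-up candidate positions for three reads.
hits = {"read1": [120], "read2": [40, 980, 2310], "read3": []}

print(filter_multimappers(hits, max_alignments=1))
# ({'read1': [120]}, ['read2', 'read3'])
print(filter_multimappers(hits, max_alignments=3))
# ({'read1': [120], 'read2': [40, 980, 2310]}, ['read3'])
```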
  25. More Complications?

    Splicing (RNA-seq)

    Eukaryotic genes contain introns, and RNA-seq reads (from cDNAs) can span the junctions where introns were spliced out. How do we know the correct intron locations?
    • Align to the transcriptome (at the cost of ignoring novel exons)
    • Use split read aligners when mapping to the genome
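
The idea behind split read alignment can be illustrated with a toy search that cuts a read in two and places the halves at separate genomic locations. This is only a sketch, assuming exact matches, a single junction, and the first occurrence of the left part; the function name, `min_anchor` parameter, and sequences are made up, and real split read aligners use indexes, scoring, and splice-site models instead.

```python
def split_align(genome: str, read: str, min_anchor: int = 3):
    """Return (left_start, left_end, right_start, right_end) for the first
    split at which both parts of the read match the genome exactly, else None."""
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        if left_start == -1:
            continue
        left_end = left_start + len(left)
        # The right part must lie downstream of the left part.
        right_start = genome.find(right, left_end)
        if right_start != -1:
            return left_start, left_end, right_start, right_start + len(right)
    return None


# Toy gene: two exons separated by an intron.
exon1, intron, exon2 = "ACGTTGCA", "GTAAGTCCTTTCAG", "TTGGCCAA"
genome = exon1 + intron + exon2
read = "TTGCA" + "TTGGC"  # spans the exon1/exon2 junction of the spliced transcript

print(split_align(genome, read))  # (3, 8, 22, 27)
```

The gap between the two aligned parts (positions 8 to 22 here) corresponds to the inferred intron.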
  26. More Complications?

    Circular Genomes

    Bacterial and mitochondrial genomes are almost always circular. How do we make alignment work with circular references?
    • Trick: extend reference by adding N bases to the end
    • Use an aligner that can handle circular references (e.g. GMAP)
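
The extension trick can be sketched as follows. This assumes that "adding N bases" means copying the first N bases of the linearized reference onto its end, with N set to the read length minus one so that any read spanning the junction fits; that reading, the function name, and the sequences are assumptions made for illustration.

```python
def find_on_circular(reference: str, read: str) -> list[int]:
    """Return 0-based start positions of exact matches on a circular reference."""
    if not read:
        return []
    # Copy the first len(read) - 1 bases to the end so that reads crossing
    # the artificial start/end junction can still match.
    extended = reference + reference[:len(read) - 1]
    positions = []
    start = extended.find(read)
    while start != -1:
        # Every match starts within the original sequence, so no wrap-around
        # correction of the start coordinate is needed.
        positions.append(start)
        start = extended.find(read, start + 1)
    return positions


reference = "ACGTTAGC"  # toy circular genome, linearized at an arbitrary point
print(find_on_circular(reference, "GTTA"))  # [2]
print(find_on_circular(reference, "GCAC"))  # [6], spans the junction
```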
  27. Which Aligner?

    Regular read aligners:
    • BWA (MEM)
    • Bowtie2
    • BLASR

    Split read aligners:
    • HISAT2
    • STAR
    • GSNAP
  28. Which Aligner?

    Practical considerations:
    • Technical advantages
    • Use tools with good developer support
    • Use tools with good documentation