Genome Informatics (overview)

A proposed upper level course for Biology majors

Barry Grant

March 08, 2017

Transcript

  1. Genome Informatics: A proposed upper level course for Biology majors

    Barry Grant University of Michigan http://thegrantlab.org
  2. Motivation: Why this course and why it is interesting & worthwhile for students.

    Learning Objectives: What the students need to learn to succeed in this course.
    Course Structure: Proposed lecture topics and specific learning goals.
    Teaching Methods: How the course will be conducted and why.
    Teaching Philosophy: My style and how it contributes to the learning environment.
  4. Why this course?

    • Offers an introduction to genomics and supporting bioinformatics concepts and resources.
    • Covers modern hot topics and the intimate coupling of informatics with biology, highlighting the impact of genomics on science and society!
    • Designed for biology majors with no programming experience.
    • Provides a hook for increasing computational and data science competencies in the biosciences: valuable, high-demand translational skills!
  5. At the end of this course students will:

    • Understand the process by which genomes are currently sequenced and the bioinformatics processing and analysis required for their interpretation.
    • Be familiar with the research objectives of genomics-related sub-disciplines, including Transcriptomics, Genome-wide association studies (GWAS) and Comparative genomics*.
    • Be able to use online bioinformatics resources including major genomic databases, genome browsers and select quality control and analysis tools.
  6. In short, students will develop a solid foundational knowledge of genomics and be able to evaluate genomic information using online bioinformatic tools and resources.
  7. Specific Learning Goals

    Teaching toward the 14 specific learning goals below is expected to occupy 60%-70% of class time. The remaining course content is at the discretion of the instructor with student body input. This includes student-selected topics for peer presentation as well as two student-selected guest lectures from industry-based genomic scientists.

    All students who receive a passing grade should be able to:
      1. Appreciate and describe in general terms the rapid advances in sequencing technologies and the new areas of investigation that these advances have made accessible.
      2. Productively use major bioinformatics resources for human and model organism genomic data at NCBI, EBI and UCSC.
      3. Understand that sequence alignment is the most fundamental operation underlying genome informatics and indeed much of bioinformatics.
      4. Be able to describe how dynamic programming works for pairwise sequence alignment and appreciate the differences between global and local alignment along with their major application areas.
      5. Understand how genomes are annotated and genes predicted using bioinformatics approaches.
      6. Employ bioinformatics methods via the Galaxy server framework and use online tools to interpret gene lists and annotate potential gene functions.
      7. Understand how single-nucleotide polymorphisms (SNPs) and indels are identified. Appreciate the existence of structural variants and copy number variations along with their potential significance.
      8. Be able to describe how transcripts are analyzed and their abundance quantified by RNA-Seq and related approaches.
      9. Appreciate how transcription factor binding and histone modifications can be studied by ChIP-Seq.
      10. Justify the value of studying many genomes and comparing the genomes of different organisms.

    http://thegrantlab.org/ucsd/
  8. Course Structure: Derived from specific learning goals

    http://thegrantlab.org/ucsd/
    Winter 2018, BIMM 104: Genome Informatics. Lectures (TuTh) 10:30 - 11:50 am

    Tu, 01/09  Welcome to Genome Informatics (Course introduction, instructional approach, learning goals & expectations)  1
    Th, 01/11  What is a genome? (Primer on key concepts and vocabulary including genome replication, genes, exons/introns/splicing, transcription, nucleosomes and repetitive sequences)  2
    Tu, 01/16  Sequencing technologies past, present and future (Sanger, Shotgun, PacBio, Illumina, toward the $500 human genome)  3
    Th, 01/18  Major bioinformatics resources for genomics (Databases, tools and visualization resources from NCBI, EBI & UCSC)  4
    Tu, 01/23  Alignment method foundations of genomic analysis 1 (Classic Needleman-Wunsch, Smith-Waterman and BLAST heuristic approaches)  5
    Th, 01/25  Alignment method foundations of genomic analysis 2 (Short read aligners, indexing and working with high-throughput sequencing data)  6
    Tu, 01/30  Genomic analysis workflows (The Galaxy platform for quality control and analysis; FASTQ, SAM and BAM file formats; Sample workflow with FASTQC and bowtie2)  7
    Th, 02/01  Genome assembly (The Genome Reference Consortium; De novo genome sequencing and genome assembly)  8
    Tu, 02/06  Genome annotation (Genome annotation, gene finding and functional annotation)  9
    Th, 02/08  Genome re-sequencing and variation (Aligning to reference genomes; SNP and indel calling; Structural and Copy Number Variations; Germline vs Somatic variants; Population vs Personal variants)  10
    Tu, 02/13  Mid Term: Find a gene project assignment (Principles of database searching and sequence analysis)  11
  9. Each unit has pre-class screencasts, learning goals, common misconceptions and assessment rubrics

    http://thegrantlab.org/ucsd/
    Unit title: Alignment method foundations of genomic analysis (2 Lectures)
    Lecture number(s): 4 & 5
    Pre-class material:
      Screencast: Alignment fundamentals: Why, how and where?
      Reading assignment: Sean Eddy’s “What is dynamic programming?” Nature Biotechnology 22(7) 2004, 909-910.
    Unit learning goals: At the end of this unit students will:
      1. Understand that sequence alignment is the most fundamental operation underlying genome informatics and indeed much of bioinformatics.
      2. Be able to describe in general terms how dynamic programming works for pairwise sequence alignment.
      3. Appreciate the differences between global and local alignment along with their major application areas.
      4. Understand why heuristic approaches become necessary for large database searches and many genomic applications.
      5. Appreciate that even when optimal solutions can be obtained they are not necessarily unique or reflective of the biologically correct alignment.
    Terminology:
      Algorithm            Gaps               Ortholog/Paralog
      BLAST                Global alignment   Percent identity
      BWA                  Heuristic          PSI-BLAST
      Cufflinks            HMMER              Scoring scheme
      Database searching   Homologue          Sequence identity
  10. This makes expectations explicit and demonstrates what success looks like

    http://thegrantlab.org/ucsd/
    Potential student misconceptions:
      1. Alignment tools are all the same in terms of performance, output and underlying assumptions.
      2. There is one unique highest scoring alignment answer independent of the chosen scoring scheme.
      3. Alignment output should not be inspected manually (i.e. trust the black box).
    Homework grading: This unit’s homework consists of both (1) an online knowledge assessment quiz and (2) a Needleman-Wunsch dynamic programming assessment exercise. Each component contributes 50% to this unit’s grade.
    Muddy Point Assessment
    Scoring Rubric for homework #2:
      Step  Fulfilled Assessment Criteria                                Points
      1     Setup labeled alignment matrix                               10
      2     Include initial column and row for GAPs                      10
      3     All alignment matrix elements filled in                      10
      4     Evidence for correct use of scoring scheme                   10
      5     Direction arrows drawn between all cells                     10
      6     Evidence of multiple arrows to a given cell if appropriate   10      D
      7     Correct optimal score position in matrix used                10      C
      8     Correct optimal score obtained for given scoring scheme      10      B
      9     Traceback path(s) clearly highlighted                        10      A
      10    Correct alignment(s) yielding optimal score listed           10      A+
      (100 Total points)
  11. Winter 2018, BIMM 104: Genome Informatics. Lectures (TuTh) 10:30 - 11:50 am

    http://thegrantlab.org/ucsd/

    Tu, 01/09  Welcome to Genome Informatics (Course introduction, instructional approach, learning goals & expectations)  1
    Th, 01/11  What is a genome? (Primer on key concepts and vocabulary including genome replication, genes, exons/introns/splicing, transcription, nucleosomes and repetitive sequences)  2
    Tu, 01/16  Sequencing technologies past, present and future (Sanger, Shotgun, PacBio, Illumina, toward the $500 human genome)  3
    Th, 01/18  Major bioinformatics resources for genomics (Databases, tools and visualization resources from NCBI, EBI & UCSC)  4
    Tu, 01/23  Alignment method foundations of genomic analysis 1 (Classic Needleman-Wunsch, Smith-Waterman and BLAST heuristic approaches)  5
    Th, 01/25  Alignment method foundations of genomic analysis 2 (Short read aligners, indexing and working with high-throughput sequencing data)  6
    Tu, 01/30  Genomic analysis workflows (The Galaxy platform for quality control and analysis; FASTQ, SAM and BAM file formats; Sample workflow with FASTQC and bowtie2)  7
    Th, 02/01  Genome assembly (The Genome Reference Consortium; De novo genome sequencing and genome assembly)  8
    Tu, 02/06  Genome annotation (Genome annotation, gene finding and functional annotation)  9
    Th, 02/08  Genome re-sequencing and variation (Aligning to reference genomes; SNP and indel calling; Structural and Copy Number Variations; Germline vs Somatic variants; Population vs Personal variants)  10
    Tu, 02/13  Mid Term: Find a gene project assignment (Principles of database searching and sequence analysis)  11
    Th, 02/15  Genome-wide association studies (GWAS) (Odds ratios, Manhattan plots, SNP arrays and imputation)  12
    Tu, 02/20  Transcriptomics (RNA-Seq aligners, counts and FPKMs, differential expression tests)  13

    Why?
  12. ALIGNMENT FOUNDATIONS

    • Why…
      ‣ Why compare biological sequences?
    • What…
      ‣ Alignment view of sequence changes during evolution (matches, mismatches and gaps)
    • How…
      ‣ Dot matrices
      ‣ Dynamic programming
        - Global alignment
        - Local alignment
      ‣ BLAST heuristic approach
      ‣ Substitution matrices
      ‣ PSI-BLAST & HMMER
  14. Basic Idea: Display one sequence above another with spaces (termed gaps) inserted in both to reveal similarity of nucleotides or amino acids.

    [Figure: Seq1 and Seq2 shown one above the other, unaligned] [Screencast Material]
  15. Two types of character correspondence

    [Figure: Seq1 above Seq2 with aligned positions marked as match (|) or mismatch]
  16. Add gaps to increase number of matches

    [Figure: Seq1 and Seq2 with gaps (-) inserted; positions labeled match, mismatch and gaps]
  17. Gaps represent ‘indels’, mismatches represent mutations

    [Figure: gapped alignment of Seq1 and Seq2; gaps labeled insertion and deletion (indels), mismatch labeled mutation]
  18. Why compare biological sequences?

    • To obtain functional or mechanistic insight about a sequence by inference from another potentially better characterized sequence
    • To find whether two (or more) genes or proteins are evolutionarily related
    • To find structurally or functionally similar regions within sequences (e.g. catalytic sites, binding sites for other molecules, etc.)
    • Many practical bioinformatics applications…
  19. Practical applications include...

    • Similarity searching of databases
      – Protein structure prediction, annotation, etc...
    • Assembly of sequence reads into a longer construct such as a genomic sequence
    • Mapping sequencing reads to a known genome
      – "Resequencing", looking for differences from reference genome: SNPs, indels (insertions or deletions)
      – Mapping transcription factor binding sites via ChIP-Seq (chromatin immuno-precipitation sequencing)
      – Pretty much all next-gen sequencing data analysis
  20. Practical applications include...

    • Similarity searching of databases
      – Protein structure prediction, annotation, etc...
    • Assembly of sequence reads into a longer construct such as a bacterial genome
    • Mapping sequencing reads to a known genome
      – "Resequencing", looking for differences from reference genome: SNPs, indels (insertions or deletions)
      – Mapping transcription factor binding sites via ChIP-Seq (chromatin immuno-precipitation sequencing)
      – Pretty much all next-gen sequencing data analysis

    N.B. Pairwise sequence alignment is arguably the most fundamental operation of bioinformatics!
  22. Sequence changes during evolution

    There are three major types of sequence change that can occur during evolution.
      – Mutations/Substitutions
      – Deletions
      – Insertions
    [Figure: tree over Time from a Common Ancestor to three Recent Species, CTCGTTA, CATGTTA and CACTGTA, labeled (A), (B) and (C)]
  23. Mutations, deletions and insertions

    There are three major types of sequence change that can occur during evolution.
      – Mutations/Substitutions
      – Deletions
      – Insertions
    [Figure: Mutation, CTCGTTA -> CACGTTA, likely occurred prior to speciation; recent species CTCGTTA, CATGTTA, CACTGTA]
  24. Mutations, deletions and insertions

    [Figure: Mutation, CTCGTTA -> CACGTTA, followed by (speciation); CACGTTA shown on the descendant branches; recent species CTCGTTA, CATGTTA, CACTGTA]
  25. Mutations, deletions and insertions

    [Figure: Mutation, CTCGTTA -> CACGTTA; Deletion (marked X), CACGTTA -> CACTTA; recent species CTCGTTA, CATGTTA, CACTGTA]
  26. Mutations, deletions and insertions

    [Figure: Mutation, CTCGTTA -> CACGTTA; Deletion (marked X), CACGTTA -> CACTTA; Insertion, CACTTA -> CACTGTA; recent species CTCGTTA, CATGTTA, CACTGTA]
  27. Mutations, deletions and insertions

    [Figure: Mutation, CTCGTTA -> CACGTTA; second Mutation, CACGTTA -> CATGTTA; recent species CTCGTTA, CATGTTA, CACTGTA]
  28. Alignment view

    Alignments are great tools to visualize sequence similarity and evolutionary changes in homologous sequences.
      – Mismatches represent mutations/substitutions
      – Gaps represent insertions and deletions (indels)

    (A) CATGT-TA
        ||: | ||
    (B) CAC-TGTA

    Match 5, Mismatch 1 (Substitution), Gap 2 (Indels)
    [Figure: species sequences CTCGTTA, CATGTTA and CACTGTA, labeled (A), (B) and (C)]
  29. Alternative alignments

    • Unfortunately, finding the correct alignment is difficult if we do not know the evolutionary history of the two sequences

    Q. Which of these 3 possible alignments is best?

    1. CATGTTA     2. CA-TGTTA     3. CATGT-TA
       ||:::||        || ||| |        ||: | ||
       CACTGTA        CACTGT-A        CAC-TGTA
  30. Alternative alignments

    • One way to judge alignments is to compare their number of matches, insertions, deletions and mutations

    1. CATGTTA     2. CA-TGTTA     3. CATGT-TA
       ||:::||        || ||| |        ||: | ||
       CACTGTA        CACTGT-A        CAC-TGTA

    1: 4 matches, 3 mismatches, 0 gaps
    2: 6 matches, 0 mismatches, 2 gaps
    3: 5 matches, 1 mismatch, 2 gaps
  31. Scoring alignments

    • We can assign a score for each match (+3), mismatch (+1) and indel (-1) to identify the optimal alignment for this scoring scheme

    1. CATGTTA     2. CA-TGTTA     3. CATGT-TA
       ||:::||        || ||| |        ||: | ||
       CACTGTA        CACTGT-A        CAC-TGTA

    1: 4 (+3) + 3 (+1) + 0 (-1) = 15
    2: 6 (+3) + 0 (+1) + 2 (-1) = 16
    3: 5 (+3) + 1 (+1) + 2 (-1) = 14
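    The scoring arithmetic on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original deck: the function name `score_alignment` is hypothetical, and it assumes gaps are written as `-` and that both aligned strings have equal length.

```python
def score_alignment(top, bottom, match=3, mismatch=1, indel=-1):
    """Count matches, mismatches and gaps in an aligned pair, then score them.

    Default scores follow the slide's scheme: match +3, mismatch +1, indel -1.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1          # a gap in either row is an indel
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    score = matches * match + mismatches * mismatch + gaps * indel
    return matches, mismatches, gaps, score


# The three candidate alignments shown on the slide.
candidates = [
    ("CATGTTA", "CACTGTA"),
    ("CA-TGTTA", "CACTGT-A"),
    ("CATGT-TA", "CAC-TGTA"),
]
for top, bottom in candidates:
    print(top, bottom, score_alignment(top, bottom))
```

    Changing the `match`, `mismatch` or `indel` arguments is a quick way to see that the preferred alignment depends on the chosen scoring scheme.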
  32. Optimal alignments

    • Biologists often prefer parsimonious alignments, where the number of postulated sequence changes is minimized.

    1. CATGTTA     2. CA-TGTTA     3. CATGT-TA
       ||:::||        || ||| |        ||: | ||
       CACTGTA        CACTGT-A        CAC-TGTA

    1: 4 matches, 3 mismatches, 0 gaps
    2: 6 matches, 0 mismatches, 2 gaps
    3: 5 matches, 1 mismatch, 2 gaps
  34. Optimal alignments

    • Biologists often prefer parsimonious alignments, where the number of postulated sequence changes is minimized.

    1. CATGTTA     2. CA-TGTTA     3. CATGT-TA
       ||:::||        || || ||        ||: | ||
       CACTGTA        CACTG-TA        CAC-TGTA

    1: 4 matches, 3 mismatches, 0 gaps
    2: 6 matches, 0 mismatches, 2 gaps
    3: 5 matches, 1 mismatch, 2 gaps
  35. Optimal alignments

    • Biologists often prefer parsimonious alignments, where the number of postulated sequence changes is minimized.

    1. CATGTTA     2. CA-TGTTA     3. CATGT-TA
       ||:::||        || ||| |        ||: | ||
       CACTGTA        CACTGT-A        CAC-TGTA

    1: 4 matches, 3 mismatches, 0 gaps
    2: 6 matches, 0 mismatches, 2 gaps
    3: 5 matches, 1 mismatch, 2 gaps

    Warning: There may be more than one optimal alignment and these may not reflect the true evolutionary history of our sequences!
  37. ALIGNMENT FOUNDATIONS (Menu for Today)

    • Why…
      • Why compare biological sequences?
    • What…
      • Alignment view of sequence changes during evolution (matches, mismatches and gaps)
    • How…
      ‣ Dot matrices
      ‣ Dynamic programming
        - Global alignment
        - Local alignment
      ‣ BLAST heuristic approach
      ‣ Substitution matrices
      ‣ PSI-BLAST & HMMER

    How do we compute the optimal alignment between two sequences?
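  38. Dot plots: simple graphical approach

    • Place one sequence on the vertical axis of a 2D grid (or matrix) and the other on the horizontal
    [Figure: empty dot-plot grid; axis sequence letters: A C C G G A C A C G]
  39. Dot plots: simple graphical approach

    • Now simply put dots where the horizontal and vertical sequence values match
    [Figure: dot-plot grid with dots at matching positions; axis sequence letters: A C C G G A C A C G]
  40. Dot plots: simple graphical approach

    • Diagonal runs of dots indicate matched segments of sequence
    [Figure: dot-plot grid with diagonal runs highlighted; axis sequence letters: A C C G G A C A C G]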
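    A plain dot plot is small enough to sketch in Python. The block below is an illustration only: the helper names `dot_matrix` and `render` are hypothetical, and the two example sequences are assumed for demonstration because the axis letters on the slide could not be recovered unambiguously from the transcript.

```python
def dot_matrix(seq_rows, seq_cols):
    """Return a grid of booleans: True where the row and column characters match."""
    return [[a == b for b in seq_cols] for a in seq_rows]


def render(seq_rows, seq_cols, matrix):
    """Draw the grid as text, with '*' for a dot and '.' for an empty cell."""
    lines = ["  " + " ".join(seq_cols)]
    for a, row in zip(seq_rows, matrix):
        lines.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Assumed example sequences (vertical axis, horizontal axis).
rows, cols = "ACACG", "ACCGG"
print(render(rows, cols, dot_matrix(rows, cols)))
```

    Passing the same sequence for both axes fills the whole main diagonal, which is one way to explore the identical-sequences case.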
  41. Dot plots: simple graphical approach

    Q. What would the dot matrix of two identical sequences look like?
    [Figure: dot-plot grid; axis sequence letters: A C C G G A C G C G]
  42. Dot plots: window size and match stringency

    Solution: use a window and a threshold
      – compare character by character within a window
      – require a certain fraction of matches within the window in order to display it with a dot.
    • You have to choose window size and stringency

    Window = 3, Stringency = 3
    [Figure: dot-plot grid before and after the Filter; axis sequence letters: A C C G G A C A C G]
  43. Dot plots: window size and match stringency

    Solution: use a window and a threshold
      – compare character by character within a window
      – require a certain fraction of matches within the window in order to display it with a dot.
    • You have to choose window size and stringency

    Window = 3, Stringency = 2
    [Figure: dot-plot grid before and after the Filter; axis sequence letters: A C C G G A C A C G]
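    The window-and-threshold filter can be sketched as follows. This is a minimal illustration under stated assumptions: `windowed_dots` is a hypothetical name, a dot is reported at the start position of each window, and the window is compared along the diagonal only. Real dot-plot tools may anchor or center windows differently.

```python
def windowed_dots(seq_rows, seq_cols, window=3, stringency=2):
    """Return (i, j) window start positions that pass the filter.

    A position passes when at least `stringency` of the `window`
    characters compared along the diagonal are identical.
    """
    hits = []
    for i in range(len(seq_rows) - window + 1):
        for j in range(len(seq_cols) - window + 1):
            matches = sum(
                seq_rows[i + k] == seq_cols[j + k] for k in range(window)
            )
            if matches >= stringency:
                hits.append((i, j))
    return hits


# Assumed example sequences; compare the two stringency settings from the slides.
strict = windowed_dots("ACACG", "ACCGG", window=3, stringency=3)
loose = windowed_dots("ACACG", "ACCGG", window=3, stringency=2)
print(len(strict), len(loose))
```

    Raising the stringency can only remove dots, never add them, which is why the filtered plot looks cleaner.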
  44. Window size = 5 bases

    A dot plot simply puts a dot where two sequences match. In this example, dots are placed in the plot if 5 bases in a row match perfectly. Requiring a 5 base perfect match is a heuristic: only look at regions that have a certain degree of identity. Do you expect evolutionarily related sequences to have more word matches (matches in a row over a certain length) than random or unrelated sequences?
  45. Window size = 7 bases

    This is a dot plot of the same sequence pair. Now 7 bases in a row must match for a dot to be placed. Noise is reduced. Using windows of a certain length is very similar to using words (k-mers) of N characters in the heuristic alignment search tools. Bigger window (k-mer): fewer matches to consider.
    Web site used: http://www.vivo.colostate.edu/molkit/dnadot/
  46. Only diagonals can be followed. Downward or rightward paths represent insertions or deletions (gaps in one sequence or the other).

    [Figure labels: Ungapped alignments; indels]
    Web site used: http://www.vivo.colostate.edu/molkit/dnadot/
  47. Uses for dot matrices

    • Visually assessing the similarity of two protein or two nucleic acid sequences
    • Finding local repeat sequences within a larger sequence by comparing a sequence to itself
      – Repeats appear as a set of diagonal runs stacked vertically and/or horizontally
  48. Human LDL receptor protein sequence (Genbank P01130)

    W = 1, S = 1
    [Figure: dot plot with Repeats marked (Figure from Mount, “Bioinformatics sequence and genome analysis”)]
  49. Human LDL receptor protein sequence (Genbank P01130)

    W = 23, S = 7
    [Figure: dot plot with Repeats marked (Figure from Mount, “Bioinformatics sequence and genome analysis”)]
  51. The Dynamic Programming Algorithm

    • The dynamic programming algorithm can be thought of as an extension to the dot plot approach
      – One sequence is placed down the side of a grid and another across the top
      – Instead of placing a dot in the grid, we compute a score for each position
      – Finding the optimal alignment corresponds to finding the path through the grid with the best possible score

    [Figure: panels (1), (2) and (3) showing a grid with D P L E across the top and D P M E down the side; the scored panels contain:]

            D   P   L   E
    D   6  -1  -4   2
    P  -1   7  -3  -1
    M  -3  -2   2  -2
    E  -2  -1  -3   5

    Needleman, S.B. & Wunsch, C.D. (1970) “A general method applicable to the search for similarities in the amino acid sequences of two proteins.” J. Mol. Biol. 48:443-453.
  52. Algorithm of Needleman and Wunsch

    • The Needleman–Wunsch approach to global sequence alignment has three basic steps:
      (1) setting up a 2D-grid (or alignment matrix),
      (2) scoring the matrix, and
      (3) identifying the optimal path through the matrix

    [Figure: panels (1), (2) and (3) showing a grid with D P L E across the top and D P M E down the side; the scored panels contain:]

            D   P   L   E
    D   6  -1  -4   2
    P  -1   7  -3  -1
    M  -3  -2   2  -2
    E  -2  -1  -3   5

    Needleman, S.B. & Wunsch, C.D. (1970) “A general method applicable to the search for similarities in the amino acid sequences of two proteins.” J. Mol. Biol. 48:443-453.
  53. Scoring the alignment matrix

    • Start by filling in the first row and column: these are all indels (gaps).
      – Each step you take you will add the gap penalty to the score S(i, j) accumulated in the previous cell

            -   D   P   L   E
    -   0  -2  -4  -6  -8
    D  -2
    P  -4
    M  -6
    E  -8

    [Axis labels: i, j; Sequence 1, Sequence 2]
    Scores: match = +1, mismatch = -1, gap = -2
  54. Scoring the alignment matrix

    • Start by filling in the first row and column: these are all indels (gaps).
      – Each step you take you will add the gap penalty to the score S(i, j) accumulated in the previous cell

            -   D   P   L   E
    -   0  -2  -4  -6  -8
    D  -2
    P  -4
    M  -6
    E  -8

    [Axis labels: i, j; Sequence 1, Sequence 2]
    Seq1: DPME
    Seq2: ----
    Si+4 = (-2) + (-2) + (-2) + (-2)
    Scores: match = +1, mismatch = -1, gap = -2
  55. Scoring the alignment matrix

    • Then go to the empty corner cell (upper left). It has filled in values in up, left and diagonal directions
      – Now can ask which of the three directions gives the highest score?
      – keep track of this score and direction

            -   D   P   L   E
    -   0  -2  -4  -6  -8
    D  -2   ?
    P  -4
    M  -6
    E  -8

    Neighboring cells (the three directions are numbered 1, 2 and 3 on the slide):

              j-1            j
    i-1   S(i-1, j-1)    S(i-1, j)
    i     S(i, j-1)      S(i, j)

    Scores: match = +1, mismatch = -1, gap = -2
  56. Scoring the alignment matrix

    • Then go to the empty corner cell (upper left). It has filled in values in up, left and diagonal directions
      – Now can ask which of the three directions gives the highest score?
      – keep track of this score and direction

            -   D   P   L   E
    -   0  -2  -4  -6  -8
    D  -2   ?
    P  -4
    M  -6
    E  -8

    Scores: match = +1, mismatch = -1, gap = -2

    \[
    S(i,j) = \max \begin{cases}
      S(i-1,j-1) + \text{(mis)match} \\
      S(i-1,j) + \text{gap penalty} \\
      S(i,j-1) + \text{gap penalty}
    \end{cases}
    \]

    (The three options are numbered 1, 2 and 3 on the slide.)
  57. Scoring the alignment matrix

    • Then go to the empty corner cell (upper left). It has filled in values in up, left and diagonal directions
      – Now can ask which direction gives the highest score
      – keep track of direction and score

            -   D   P   L   E
    -   0  -2  -4  -6  -8
    D  -2   1
    P  -4
    M  -6
    E  -8

    Scores: match = +1, mismatch = -1, gap = -2
    Diagonal: (0)+(+1) = +1   <= (D-D) match!
    Up:       (-2)+(-2) = -4
    Left:     (-2)+(-2) = -4
    Alignment:
      D
      D
  58. Scoring the alignment matrix

    • At each step, the score in the current cell is determined by the scores in the neighboring cells
      – The maximal score and the direction that gave that score is stored (we will use these later to determine the optimal alignment)

            -   D   P   L   E
    -   0  -2  -4  -6  -8
    D  -2   1  -1
    P  -4
    M  -6
    E  -8

    Scores: match = +1, mismatch = -1, gap = -2
    Diagonal: (-2)+(-1) = -3   <= (D-P) mismatch!
    Up:       (-4)+(-2) = -6
    Left:     (1)+(-2) = -1
    Alignment:
      D-
      DP
  59. Scoring the alignment matrix

    • We will continue to store the alignment score S(i, j) for all possible alignments in the alignment matrix.

            -   D   P   L   E
    -   0  -2  -4  -6  -8
    D  -2   1  -1  -3
    P  -4
    M  -6
    E  -8

    Scores: match = +1, mismatch = -1, gap = -2
    Diagonal: (-4)+(-1) = -5   <= (D-L) mismatch
    Up:       (-6)+(-2) = -8
    Left:     (-1)+(-2) = -3
    Alignment:
      D--
      DPL
  60. Scoring the alignment matrix

    • For the highlighted cell, the corresponding score S(i, j) refers to the score of the optimal alignment of the first i characters from sequence1, and the first j characters from sequence2.

            -   D   P   L   E
    -   0  -2  -4  -6  -8
    D  -2   1  -1  -3  -5
    P  -4  -1   2   0
    M  -6
    E  -8

    Scores: match = +1, mismatch = -1, indel = -2
    Diagonal: (-1)+(-1) = -2
    Up:       (-3)+(-2) = -5
    Left:     (2)+(-2) = 0
    Alignment:
      DP-
      DPL
  61. Scoring the alignment matrix

    • At each step, the score in the current cell is determined by the scores in the neighboring cells
      – The maximal score and the direction that gave that score is stored

            -   D   P   L   E
    -   0  -2  -4  -6  -8
    D  -2   1  -1  -3  -5
    P  -4  -1   2   0  -2
    M  -6  -3   0   1
    E  -8

    Scores: match = +1, mismatch = -1, indel = -2
    Diagonal: (2)+(-1) = +1   <= mismatch
    Up:       (0)+(-2) = -2
    Left:     (0)+(-2) = -2
    Alignment:
      DPM
      DPL
  62. Scoring the alignment matrix

    • The score of the best alignment of the entire sequences corresponds to S(n, m)
      – (where n and m are the lengths of the sequences)

            -   D   P   L   E
    -   0  -2  -4  -6  -8
    D  -2   1  -1  -3  -5
    P  -4  -1   2   0  -2
    M  -6  -3   0   1  -1
    E  -8  -5  -2  -1   2

    Scores: match = +1, mismatch = -1, indel = -2
    Diagonal: (+1)+(+1) = +2   (cell i = n, j = m)
    Up:       (-1)+(-2) = -3
    Left:     (-1)+(-2) = -3
    Alignment:
      DPME
      DPLE
  63. Scoring the alignment matrix

    • To find the best alignment, we retrace the arrows starting from the bottom right cell
      – N.B. The optimal alignment score and alignment are dependent on the chosen scoring system

            -   D   P   L   E
    -   0  -2  -4  -6  -8
    D  -2   1  -1  -3  -5
    P  -4  -1   2   0  -2
    M  -6  -3   0   1  -1
    E  -8  -5  -2  -1   2

    Scores: match = +1, mismatch = -1, indel = -2
    Alignment:
      DPME
      DPLE
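    The fill and traceback steps walked through on the preceding slides can be sketched in Python. This is a minimal illustration, not the course's reference implementation: the function name is hypothetical, it assumes a single linear gap penalty, and when several directions tie it follows only one of them (diagonal first, then up, then left), so it returns just one of possibly several optimal alignments.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Global alignment; seq1 indexes rows (i) and seq2 indexes columns (j)."""
    n, m = len(seq1), len(seq2)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row are all gaps.
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap

    def sub(i, j):
        # (mis)match score for aligning seq1[i-1] with seq2[j-1]
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    # Each cell takes the best of the diagonal, up and left options.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + sub(i, j),
                S[i - 1][j] + gap,
                S[i][j - 1] + gap,
            )

    # Traceback from the bottom right cell to the top left cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(i, j):
            top.append(seq1[i - 1])
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return S, "".join(reversed(top)), "".join(reversed(bottom))


S, top, bottom = needleman_wunsch("DPME", "DPLE")
for row in S:
    print(row)
print(top)
print(bottom)
```

    With the slide's scoring scheme, the printed rows can be compared cell by cell against the alignment matrix above.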
  64. Questions:

    • What is the optimal score for the alignment of these sequences and how do we find the optimal alignment?

            -   C   A   T   G   T   T   A
    -   0  -2  -4  -6  -8 -10 -12 -14
    C  -2   1  -1  -3  -5  -7  -9 -11
    A  -4  -1   2   0  -2  -4  -6  -8
    C  -6  -3   0   1  -1  -3  -5  -7
    T  -8  -5  -2   1   0   0  -2  -4
    G -10  -7  -4  -1   2   0  -1  -3
    T -12  -9  -6  -3   0   3   1  -1
    A -14 -11  -8  -5  -2   1   2   2
  66. Questions:

    • To find the best alignment we retrace the arrows starting from the bottom right cell

            -   C   A   T   G   T   T   A
    -   0  -2  -4  -6  -8 -10 -12 -14
    C  -2   1  -1  -3  -5  -7  -9 -11
    A  -4  -1   2   0  -2  -4  -6  -8
    C  -6  -3   0   1  -1  -3  -5  -7
    T  -8  -5  -2   1   0   0  -2  -4
    G -10  -7  -4  -1   2   0  -1  -3
    T -12  -9  -6  -3   0   3   1  -1
    A -14 -11  -8  -5  -2   1   2   2
  67. More than one alignment possible

    • Sometimes more than one alignment can result in the same optimal score

            -   C   A   T   G   T   T   A
    -   0  -2  -4  -6  -8 -10 -12 -14
    C  -2   1  -1  -3  -5  -7  -9 -11
    A  -4  -1   2   0  -2  -4  -6  -8
    C  -6  -3   0   1  -1  -3  -5  -7
    T  -8  -5  -2   1   0   0  -2  -4
    G -10  -7  -4  -1   2   0  -1  -3
    T -12  -9  -6  -3   0   3   1  -1
    A -14 -11  -8  -5  -2   1   2   2

    Alignment:
      CACTGT-A      CACTG-TA
      CA-TGTTA      CA-TGTTA
  68. The alignment and score are dependent on the scoring system

    • Here we increase the gap penalty from -2 to -3

            -   C   A   T   G   T   T   A
    -   0  -3  -6  -9 -12 -15 -18 -21
    C  -3   1  -2  -5  -8 -11 -14 -17
    A  -6  -2   2  -1  -4  -7 -10 -13
    C  -9  -5  -1   1  -2  -5  -8 -11
    T -12  -8  -4   0   0  -1  -4  -7
    G -15 -11  -7  -3   1  -1  -2  -5
    T -18 -14 -10  -6  -2   2   0  -3
    A -21 -17 -13  -9  -5  -1   1   1

    Alignment:
      CACTGT-A      CACTG-TA      CACTGTA
      CA-TGTTA      CA-TGTTA      CATGTTA

    With a gap penalty of -3, each gapped alignment scores 6(+1) + 2(-3) = 0, so the ungapped alignment, 4(+1) + 3(-1) = 1, is now optimal.
  69. The alignment and score are dependent on the scoring system

    • Here we increase the gap penalty from -2 to -3

            -   C   A   T   G   T   T   A
    -   0  -3  -6  -9 -12 -15 -18 -21
    C  -3   1  -2  -5  -8 -11 -14 -17
    A  -6  -2   2  -1  -4  -7 -10 -13
    C  -9  -5  -1   1  -2  -5  -8 -11
    T -12  -8  -4   0   0  -1  -4  -7
    G -15 -11  -7  -3   1  -1  -2  -5
    T -18 -14 -10  -6  -2   2   0  -3
    A -21 -17 -13  -9  -5  -1   1   1

    Alignment:
      CACTGT-A      CACTG-TA      CACTGTA
      CA-TGTTA      CA-TGTTA      CATGTTA

    Key point: Optimal alignment solutions and their scores are not necessarily unique and depend on the scoring system!
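    To see both effects, ties between optimal alignments and the dependence on the gap penalty, the traceback can follow every arrow instead of just one. The sketch below is illustrative only: `all_optimal_alignments` is a hypothetical name, the recursion is fine for short teaching examples, and the number of co-optimal alignments can grow very quickly for longer sequences.

```python
def all_optimal_alignments(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Return (optimal score, sorted list of all optimal global alignments)."""
    n, m = len(seq1), len(seq2)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap

    def sub(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + sub(i, j),
                S[i - 1][j] + gap,
                S[i][j - 1] + gap,
            )

    results = []

    def walk(i, j, top, bottom):
        # Follow every direction that could have produced S[i][j].
        if i == 0 and j == 0:
            results.append((top, bottom))
            return
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(i, j):
            walk(i - 1, j - 1, seq1[i - 1] + top, seq2[j - 1] + bottom)
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            walk(i - 1, j, seq1[i - 1] + top, "-" + bottom)
        if j > 0 and S[i][j] == S[i][j - 1] + gap:
            walk(i, j - 1, "-" + top, seq2[j - 1] + bottom)

    walk(n, m, "", "")
    return S[n][m], sorted(results)


for gap in (-2, -3):
    score, alignments = all_optimal_alignments("CACTGTA", "CATGTTA", gap=gap)
    print("gap", gap, "score", score, alignments)
```

    Here the row sequence is CACTGTA and the column sequence is CATGTTA, matching the matrices on the slides.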
  70. Global vs local alignments

    • Needleman-Wunsch is a global alignment algorithm
      – Resulting alignment spans the complete sequences end to end
      – This is appropriate for closely related sequences that are similar in length
    • For many practical applications we require local alignments
      – Local alignments highlight sub-regions (e.g. protein domains) in the two sequences that align well
    [Figure: schematic comparing Global and Local alignment]
  71. Smith-Waterman local alignment algorithm

    • Three main modifications to Needleman-Wunsch:
      – Allow a node to start at 0
      – The score for a particular cell cannot be negative
        • if all other score options produce a negative value, then a zero must be inserted in the cell
      – Record the highest-scoring node, and trace back from there

    \[
    S(i,j) = \max \begin{cases}
      S(i-1,j-1) + \text{(mis)match} \\
      S(i-1,j) + \text{gap penalty} \\
      S(i,j-1) + \text{gap penalty} \\
      0
    \end{cases}
    \]

    As in the Needleman-Wunsch examples, the gap penalty is a negative score, so it is added. (The four options are numbered 1 to 4 on the slide.)

    Neighboring cells:

              j-1            j
    i-1   S(i-1, j-1)    S(i-1, j)
    i     S(i, j-1)      S(i, j)
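    The three modifications can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name and the example sequences are hypothetical, a linear gap penalty is used, only the first highest-scoring cell is traced back, and ties between directions are broken in the order diagonal, up, left.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Local alignment: return (best score, aligned seq1 part, aligned seq2 part)."""
    n, m = len(seq1), len(seq2)
    # First row and column start at 0 instead of accumulating gap penalties.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # A cell can never be negative: 0 is the fourth option.
            S[i][j] = max(
                0,
                S[i - 1][j - 1] + sub(i, j),
                S[i - 1][j] + gap,
                S[i][j - 1] + gap,
            )
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + sub(i, j):
            top.append(seq1[i - 1])
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Assumed example: two sequences that share only the sub-region ACGT.
print(smith_waterman("TTTACGTAAA", "GGACGTCC"))
```

    Unlike the global case, the flanking regions that do not align well are simply left out of the reported alignment.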
  72. Summary of key points

    • Sequence alignment is a fundamental operation underlying much of bioinformatics.
    • Even when optimal solutions can be obtained they are not necessarily unique or reflective of the biologically correct alignment.
    • Dynamic programming is a classic approach for solving the pairwise alignment problem.
    • Global and local alignment, and their major application areas.
    • Heuristic approaches are necessary for large database searches and many genomic applications.
  73. FOR NEXT CLASS…

    Check out the online:
    • Screencast: “Alignment fundamentals”
    • Reading: Sean Eddy’s “What is dynamic programming?”
    • Unit learning goals document
    • Homework: (1) Quiz, (2) Alignment Exercise.
  74. Homework Grading

    Both (1) quiz questions and (2) alignment exercise carry equal weights (i.e. 50% each).

    (Homework 2)
      Assessment Criteria                                          Points
      Setup labeled alignment matrix                               1
      Include initial column and row for GAPs                      1
      All alignment matrix elements scored (i.e. filled in)        1
      Evidence for correct use of scoring scheme                   1
      Direction arrows drawn between all cells                     1
      Evidence of multiple arrows to a given cell if appropriate   1       D
      Correct optimal score position in matrix used                1       C
      Correct optimal score obtained for given scoring scheme      1       B
      Traceback path(s) clearly highlighted                        1       A
      Correct alignment(s) yielding optimal score listed           1       A+
  75. Approach

    Who are our students & what are their motivations?
    What should our students learn?
    What are students learning?
    Which approaches increase learning?

    Adopt what works:
    • Blended learning
    • Flipped classroom
    • Peer instruction
    • Hands-on approach
    • Rubric grading
    • Concept inventories
    • Muddy point assessment
    • Group activities & presentations
    • Video capture & pre-class screencasts
    • Creating publicly available resources
    • etc…

    Collaboratively develop the design, then incrementally add new approaches over course iterations!