Lecture 15 Short Read Alignment

L15: SHORT READ ALIGNMENT Foundations in Data Driven Life Sciences
BMMB 554

Today’s learning objectives • Understand the Burrows Wheeler Transform as
applied to short read alignment

Short read alignment • Problem: • We have sequenced a
genomic sample using a high-throughput sequencer, resulting in millions of short reads (30bp – 150bp). • We want to map these reads back to the reference genome (millions to billions of bp). • The reads contain some errors (1 miscalled base per 100bp) • The sequenced genome is different from the reference genome (perhaps 1% single nucleotide polymorphism rate). • Required: efficient algorithm to quickly map millions of reads against a >3 billion bp genome.

Genotyping application …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC… GCGCCCTA GCCCTATCG GCCCTATCG CCTATCGGA CTATCGGAAA AAATTTGC AAATTTGC
TTTGCGGT TTGCGGTA GCGGTATA GTATAC… TCGGAAATT CGGAAATTT CGGTATAC TAGGCTATA AGGCTATAT AGGCTATAT AGGCTATAT GGCTATATG CTATATGCG …CC …CC …CCA …CCA …CCAT ATAC… C… C… …CCAT …CCATAG TATGCGCCC GGTATAC… CGGTATAC Goal: identify variations Sequencing reads Reference genome http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt

Burrows-Wheeler Transform • Very space-efficient way of indexing the genome
• Basis of popular Bowtie & BWA short-read aligners • Reversible permutation: • Once BWT(T) is built, discard everything else • Full matrix shown in following slides for illustration Burrows Wheeler Matrix Last column BWT(T) T http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt Matrix: all cyclic rotations of T, sorted alphabetically

Burrows-Wheeler Transform • BWT(T) is reversible due to “LF Mapping”
• ith occurrence of a character X in Last column is same text occurrence as the ith occurrence in First column T BWT(T) Burrows Wheeler Matrix 2nd ‘a’ in column http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt 2nd ‘a’ in column

Burrows-Wheeler Transform • BWT(T) is reversible due to “LF Mapping”
• ith occurrence of a character X in Last column is same text occurrence as the ith occurrence in First column T BWT(T) Burrows Wheeler Matrix http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt 1st ‘c’ in column 1st ‘c’ in column

Burrows-Wheeler Transform Genome: ACAACG$ Rotations: ACAACG$ $ACAACG G$ACAAC CG$ACAA ACG$ACA
AACG$AC CAACG$A Sort: $ACAACG AACG$AC ACAACG$ ACG$ACA CAACG$A CG$ACAA G$ACAAC BWT: GC$AAAC

Unpermuting the BWT G → 6 C → 4 $
→ 0 A → 1 A → 2 A → 3 C → 5 BWT: 0 1 2 3 4 5 6 $ .. A .. A .. A .. C .. C .. G ..

BWT: Unpermute operation • To recreate T from BWT(T), repeatedly
apply rule: T = BWT[ LF(i) ] + T; i = LF(i) • Where LF(i) maps row i to row whose first character corresponds to i’s last (per LF mapping) • Called “unpermute” or “walk-left” algorithm Final T http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt

BWT: Exact Matching • To match Q in T using
BWT(T), repeatedly apply rule: top = LF(top, qc); bot = LF(bot, qc) • Where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt

BWT: Exact Matching • In progressive rounds, top & bot
delimit the range of rows beginning with progressively longer suffixes of Q http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt

BWT: Exact Matching • If range becomes empty (top =
bot) the query suffix (and therefore the query) does not occur in the text http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt

BWT: Exact Matching G → 6 C → 4 $
→ 0 A → 1 A → 2 A → 3 C → 5 BWT: Query: AAC 0 1 2 3 4 5 6 Rows that end in C G → 6 C → 4 $ → 0 A → 1 A → 2 A → 3 C → 5 Rows that end in AC G → 6 C → 4 $ → 0 A → 1 A → 2 A → 3 C → 5 Rows that end in AAC G → 6 C → 4 $ → 0 A → 1 A → 2 A → 3 C → 5

BWT: Rows to Reference Positions • Once we know a
row contains a legal alignment, how do we determine its position in the reference? Where am I? http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt

BWT: Rows to Reference Positions • Naïve solution: Use “walk-left”
to walk back to the beginning of the text; number of steps = offset of hit • Linear in length of text in general – too slow 2 steps, so hit offset = 2 http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt

• Hybrid solution: • Label the offsets of some rows
• “walk left” to next marked row to the left • Bowtie marks every 32nd row by default (configurable) Rows to Reference Positions 1 step offset = 1 Hit offset = 1 + 1 = 2 http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt

• Algorithm concludes: “aac” occurs at offset 2 in “acaacg”
BWT exact alignment http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt

BWT: Handling mismatches • Consider an attempt to find Q
= “agc” in T = “acaacg”: • Instead of giving up, try to “backtrack” to a previous position and try a different base “gc” does not occur in the text “g” “c” http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt

BWT: Backtracking • Backtracking attempt for Q = “agc”, T
= “acaacg”: Found this alignment: acaacg agc “g” “a” “a” “c” “c” “gc” does not occur in the text Substitution http://www.cbcb.umd.edu/~langmead/NCBI_Nov2008.ppt

Another example • Given the following BWT: CCGTACAACT$ • Map
the following reads: • CGAC • TTGA • Reconstruct the genome

BWT: CCGTACAACT$ C C G T A C A A
C T $ Last-First Mapping

C T $ L-F Row à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 0 1 2 3 4 5 6 7 8 9 10 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 Map: CGAC

C T $ L-F Row à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 0 1 2 3 4 5 6 7 8 9 10 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 Map: TTGA

C T $ L-F Row à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 0 1 2 3 4 5 6 7 8 9 10 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 Reconstruct Genome C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 C C G T A C A A C T $ L-F à 4 à 5 à 8 à 9 à 1 à 6 à 2 à 3 à 7 à 10 à 0 Row 0 1 2 3 4 5 6 7 8 9 10

Compressing genomes • Run length encoding: TTTTTGGGAAAACCCCCCA à 5T3G4A6C1A •
Sort & run length encoding? TACGTAACGATACGAT à AAAAAACCCGGGTTTT à 6A3C3G4T Compressible, but not reversible! • Genomes don’t contain many runs, but they do contain repeats. Can we convert a repeat-rich genome into a (reversible) string with many runs?

Connecting compression & the BWT panamabananas$ $panamabananas s$panamabanana as$panamabanan nas$panamabana
anas$panamaban nanas$panamaba ananas$panamab bananas$panama abananas$panam mabananas$pana amabananas$pan namabananas$pa anamabananas$p Cyclic rotations: $panamabananas abananas$panam amabananas$pan anamabananas$p ananas$panamab anas$panamaban as$panamabanan bananas$panama mabananas$pana namabananas$pa nanas$panamaba nas$panamabana panamabananas$ s$panamabanana BWT sort

Complexity of searching a large BWT(N) for a word (W)
index BWT(N) LF 0 A 10291 1 A 17279 2 C 00102 3 T 00557 4 G 40300 5 A 21209 … … … … … … … … … … … … 100000 A 77182 100001 T 99293 100002 C 89921 100003 G 11890 index BWT(N) LF 00557 G 56671 00558 G 78881 00559 T 33120 00560 A 66054 00561 T 35002 00562 C 14672 … … … … … … … … … … … … 99290 C 88991 99291 A 89902 99292 A 99679 99293 T 78930 O(|W|) instead of O(N)

Summary • The Burrows Wheeler Transform produces a compressible, reversible
transformation of a string. • BWT enables fast exact matching of short strings in a longer string. • See Chapter 5, Bioinformatics & Functional Genomics (p196-7) • See Chapter 7, Compeau & Pevzner (p325 - 350)

Lecture 15 Short Read Alignment

Lecture 15 Short Read Alignment

shaunmahony

More Decks by shaunmahony

Featured

Transcript

L15: SHORT READ ALIGNMENT Foundations in Data Driven Life Sciences

Today’s learning objectives • Understand the Burrows Wheeler Transform as

Short read alignment • Problem: • We have sequenced a

Genotyping application …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC… GCGCCCTA GCCCTATCG GCCCTATCG CCTATCGGA CTATCGGAAA AAATTTGC AAATTTGC

Burrows-Wheeler Transform • Very space-efficient way of indexing the genome

Burrows-Wheeler Transform • BWT(T) is reversible due to “LF Mapping”

Burrows-Wheeler Transform • BWT(T) is reversible due to “LF Mapping”

Burrows-Wheeler Transform Genome: ACAACG$ Rotations: ACAACG$ $ACAACG G$ACAAC CG$ACAA ACG$ACA

Unpermuting the BWT G → 6 C → 4 $

BWT: Unpermute operation • To recreate T from BWT(T), repeatedly

BWT: Exact Matching • To match Q in T using

BWT: Exact Matching • In progressive rounds, top & bot

BWT: Exact Matching • If range becomes empty (top =

BWT: Exact Matching G → 6 C → 4 $

BWT: Rows to Reference Positions • Once we know a

BWT: Rows to Reference Positions • Naïve solution: Use “walk-left”

• Hybrid solution: • Label the offsets of some rows

• Algorithm concludes: “aac” occurs at offset 2 in “acaacg”

BWT: Handling mismatches • Consider an attempt to find Q

BWT: Backtracking • Backtracking attempt for Q = “agc”, T

Another example • Given the following BWT: CCGTACAACT$ • Map

BWT: CCGTACAACT$ C C G T A C A A

BWT: CCGTACAACT$ C C G T A C A A

BWT: CCGTACAACT$ C C G T A C A A

BWT: CCGTACAACT$ C C G T A C A A

Compressing genomes • Run length encoding: TTTTTGGGAAAACCCCCCA à 5T3G4A6C1A •

Connecting compression & the BWT panamabananas$ $panamabananas s$panamabanana as$panamabanan nas$panamabana

Complexity of searching a large BWT(N) for a word (W)

Summary • The Burrows Wheeler Transform produces a compressible, reversible