Upgrade to Pro — share decks privately, control downloads, hide ads and more …

JHU EN.600.649 Guest lecture, February 16 2017

Ben Langmead
February 16, 2017

JHU EN.600.649 Guest lecture, February 16 2017

Index-assisted approximate matching (part 2), co-traversal, double indexing, Bowtie, BWA, SAM/BAM

Ben Langmead

February 16, 2017
Tweet

More Decks by Ben Langmead

Other Decks in Science

Transcript

  1. Department of Computer Science Applied Comparative Genomics EN.600.649 Spring 2017

    Guest lecture, Prof Langmead langmea@cs.jhu.edu www.langmead-lab.org February 16, 2017
  2. Suffix indexes Suffix indexes are much smaller than the suffix

    trie, but enable suffix trie traversal A C G T A T A G C G T A T A G A G A G 6 5 3 1 0 4 2 $ A$ ANA$ ANANA$ BANANA$ NA$ NANA$ $ B A NA NA A $ B A N A N A NA $ B A N A NA NA $ B B A NA N A $ NA $ B A NA NA NA $ B A
  3. Neighborhood search 1 + n(a - 1) + n +

    (n + 1)a Delete each char in P * n + 1 positions where we can insert any of the a characters * Say | P | = n, | ∑ | = a. How many strings with edit distance ≤1? Within edit distance k? O(na) O(nkak) Hamming dist = 1
  4. Two approaches How much work to query suffix tree with

    all strings within distance k? O(n) for each of O(aknk) strings, so O(naknk) O(naknk) O(nm) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 15 0 0 15 15 0 0 0 0 0 0 0 0 0 0 0 15 15 15 0 0 30 15 0 0 0 15 0 15 0 0 15 15 15 0 0 0 0 0 30 15 0 0 30 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 45 5 0 15 45 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 15 0 0 5 60 20 0 0 0 15 0 0 0 15 0 15 15 15 0 0 20 30 0 0 20 75 35 15 0 0 30 15 15 0 0 0 0 0 0 0 0 0 0 0 0 35 90 50 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 50 60 20 45 5 0 0 15 0 15 15 15 0 0 15 15 0 0 0 30 10 65 30 5 60 20 15 0 - c c c t t c c t t a c g c g a c c c a - t c t t a c g a c Reference Read length of reference # k-edit neighbors
  5. Two approaches O(naknk) O(nm) 0 0 0 0 0 0

    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 15 0 0 15 15 0 0 0 0 0 0 0 0 0 0 0 15 15 15 0 0 30 15 0 0 0 15 0 15 0 0 15 15 15 0 0 0 0 0 30 15 0 0 30 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 45 5 0 15 45 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 15 0 0 5 60 20 0 0 0 15 0 0 0 15 0 15 15 15 0 0 20 30 0 0 20 75 35 15 0 0 30 15 15 0 0 0 0 0 0 0 0 0 0 0 0 35 90 50 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 50 60 20 45 5 0 0 15 0 15 15 15 0 0 15 15 0 0 0 30 10 65 30 5 60 20 15 0 - c c c t t c c t t a c g c g a c c c a - t c t t a c g a c Reference Read length of reference # k-edit neighbors say n = 100, k = 3 edits, a = 4 aknk = 43·1003 = 64·106 ≈ 6·107 say m = 3·109 (human genome) now say k = 4 aknk = 44·1004 = 256·108 ≈ 3·1010
  6. Two approaches O(naknk) # k-edit neighbors We’ll generally keep k

    small. E.g. recent versions of Bowtie and BWA use this approach for seed finding; both n and k can be controlled and kept small Further benefit comes from pruning and double indexing…
  7. Neighborhood search “Neighborhood trie” of a pattern P = CAA

    and Hamming distance ≤ 1 A C G T A A A A CG T C G T A A A A A A A # leaves = # neighbors = 1 + n(a - 1) = 1 + 3(4 -1) = 10
  8. Co-traversal Co-traversal: act of traversing all paths in both tries

    A C G T A A A A CG T C G T A A A A A A A Suffix trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Neighborhood trie, P = CAA Hamming dist ≤ 1
  9. Co-traversal A C G T A A A A CG

    T C G T A A A A A A A Suffix trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Neighborhood trie, P = CAA Hamming dist ≤ 1 Try AAA: we fall off suffix trie after A, before AA Co-traversal: act of traversing all paths in both tries
  10. Co-traversal Co-traversal: act of traversing all paths in both tries

    A C G T A A A A CG T C G T A A A A A A A Suffix trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Neighborhood trie, P = CAA Hamming dist ≤ 1 Next try CAA: we fall off suffix trie after CA, before CAA
  11. Co-traversal Co-traversal: act of traversing all paths in both tries

    A C G T A A A A CG T C G T A A A A A A A Suffix trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Neighborhood trie, P = CAA Hamming dist ≤ 1 CAC and CAG also fail. Next try CAT: success.
  12. Co-traversal A C G T A A A A CG

    T C G T A A A A A A A A C G T A T A G C G T A T A G A G A G Suffix trie of T = ACATAG Co-traversal: act of traversing all paths in both tries A C A T A G | | C A A T: P: Common path ending in a neighbor leaf corresponds to an alignment of a neighbor of P to a substring of T Neighborhood trie, P = CAA Hamming dist ≤ 1
  13. Co-traversal We can find all such alignments with co-traversal: explore

    all paths that are present in both trees and end in a neighbor leaf A C G T A A A A CG T C G T A A A A A A A Suffix trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Trie for neighborhood within 1 mismatch of P = CAA 2 2 3 3 4 4 5 5 6 6 7 7 8 8 Lexicographical depth-first co-traversal visits node pairs in this order: Only visited 8 nodes in these 20- and 22-node tries 1 1
  14. Co-traversal: pruning We can think of the tree we’re exploring

    as being the intersection of these two trees... A C G T A A A A CG T C G T A A A A A A A A C G T A T A G C G T A T A G A G A G ∩
  15. Co-traversal: pruning Suffix trie of T = ACATAG A C

    G T A T A Trie for neighborhood within 1 mismatch of P = CAA 2 3 4 5 6 7 8 1 ∩
  16.          

                                                                                                                       Read: Reference: Lambda phage (~50,000 nt) Policy: Up to 1 mismatch Pruning GTCCTTTCCGGTGATCCGACAGGTTACG
  17. Double-index pruning Case 1: 1 mismatch allowed in left half,

    0 in right, use Forward FM Case 2: 1 mismatch allowed in right half, 0 in left, use Reverse FM Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009;10(3):R25 Forward FM Index w/ BWT(T) T A C Reverse FM Index w/ BWT(reverse(T)) T A C Halfway
  18. Double-index pruning Langmead B, Trapnell C, Pop M, Salzberg SL.

    Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009;10(3):R25 A C G T C C A C G T C A T T T G A A A C G T A A T T T G T T T A C G T T T T T T G A T T G T T G T T T A C G T T T T A C G T G G G A C G T A A A C G A C G C A C A G C C T A A C G G T G T G A T G T T T G A T T T G A T T T G G C A T T T G A C G G A T C C G A C A G G T T A C G Without double index With double index Case 1 Case 2
  19. Bowtie & BWA Both do pruned co-traversal; Bowtie and later

    versions of BWA do tricks like double-indexing and bi-directional BWT (not discussed) Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60.
  20. Bowtie: alignment 0 0 35 30 35 30 0 0

    35 30 35 30 Read Referenc Ref string 1 Ref string 3 Ref substring Ref string 1 Ref string 1 Alignment Read Read substring Ref string 1 Ref string 3 Ref substring ∅ Read substring 0 0 35 30 35 30 0 0 35 30 35 30 0 0 35 30 35 30 Alignment Read Referenc Read substring x 0 0 35 30 35 30 Ref string 1 Ref string 1 Ref string 1 Ref string 1 Hit Read Ref string 1 Ref string 1 Alignment Bowtie 2 Bowtie 1 FM index + approximate matching
  21. Bowtie: alignment 0 0 35 30 35 30 0 0

    35 30 35 30 Read Referenc Ref string 1 Ref string 3 Ref substring Ref string 1 Ref string 1 Alignment Read Read substring Ref string 1 Ref string 3 Ref substring ∅ Read substring 0 0 35 30 35 30 0 0 35 30 35 30 0 0 35 30 35 30 Alignment Read Referenc Read substring x 0 0 35 30 35 30 Ref string 1 Ref string 1 Ref string 1 Ref string 1 Hit Read Ref string 1 Ref string 1 Alignment Bowtie 2 Bowtie 1 Dynamic programming
  22. Bowtie: index building bowtie-build [options]* <reference_in> <ebwt_base> $ B A

    N A N A A $ B A N A N A N A $ B A N A N A N A $ B B A N A N A $ N A $ B A N A N A N A $ B A genome.1.bt2 genome.2.bt2 genome.3.bt2 genome.4.bt2 genome.rev.1.bt2 genome.rev.2.bt2 Genome in FASTA Bowtie index, forward & reverse bowtie2-build [options]* <reference_in> <bt2_base>
  23. Bowtie: alignment bowtie [options]* <ebwt> {-1 <m1> -2 <m2> |

    <s>} [-S <sam>] $ B A N A N A A $ B A N A N A N A $ B A N A N A N A $ B B A N A N A $ N A $ B A N A N A N A $ B A bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>] Bowtie index FASTQ SAM/BAM (more on this later)
  24. BWA: index building bwa index [options] <in.fasta> Genome in FASTA

    BWA index, forward & reverse $ B A N A N A A $ B A N A N A N A $ B A N A N A N A $ B B A N A N A $ N A $ B A N A N A N A $ B A in.fasta.amb in.fasta.ann in.fasta.bwt in.fasta.pac in.fasta.sa
  25. BWA: alignment bwa mem [options] <idxbase> <in1.fq> [in2.fq] > out.sam

    $ B A N A N A A $ B A N A N A N A $ B A N A N A N A $ B B A N A N A $ N A $ B A N A N A N A $ B A BWA index FASTQ SAM/BAM (more on this later)
  26. SAM/BAM https://samtools.github.io/hts-specs/SAMv1.pdf SAM: Sequence Alignment/Map format BAM: Binary version of

    SAM Header lines (starting with @) describe alignment job Reference genome, aligner, aligner version, parameters, … Record lines (after header): each line describes one alignment for one read Name of read, where it aligned, relationship to larger paired- end fragment, alignment “shape,” mapping quality, …
  27. SAM/BAM Fields on a record line: “query” = read “mapping

    quality” = confidence that alignment is correct “mate” = one end / other end of a paired-end read “template” = paired-end fragment “base quality” = confidence that base call is correct Image from https://samtools.github.io/hts-specs/SAMv1.pdf
  28. SAM/BAM FLAG field packs a lot of info in 1

    number “Explain SAM FLAGs” app: https://broadinstitute.github.io/picard/explain-flags.html Least significant bit Most significant bit 0x1 0x2 0x4 0x8 0x10 decimal, base 10 0x20 … 1 2 4 8 16 32 … hexadecimal, base 16 Each bit is true or false. FLAG = bitwise OR of true flags (equivalently, sum of true flags) Image from https://samtools.github.io/hts-specs/SAMv1.pdf
  29. SAM/BAM read paired = 1 proper pair = 2 reverse

    strand = 16 first in pair = 64 1 + 2 + 16 + 64 = 83 “Explain SAM FLAGs” app: https://broadinstitute.github.io/picard/explain-flags.html
  30. SAM/BAM CIGAR string describes alignment “shape” What does the traceback

    look like, from top to bottom? Read Ref (portion) Read CIGAR = 3M2D2M2I3M CIGAR = 1S3M2I2M2S “soft clipping” M = match or mismatch Ref (portion)
  31. SAM/BAM • Foes conspire to create alignment error: • Alignment

    error: aligner fails to map a read to its true point of origin • Contamination • Genetic variation • Sequencing error • Repetitive genomes
  32. SAM/BAM Like base quality, mapping quality is a confidence level

    p cor ⌘ Probability correct q = 10 · log10(1 p cor) Also like base quality, encoding using Phred (-10 log) scale:
  33. SAM/BAM ERR050083.28 0 5 10643 1 100M * 0 0

    TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT ERR050083.24 16 1 10023 11 100M * 0 0 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTACC ERR050083.26 0 X 156030515 11 15M1D85M * 0 0 GGGTTAGGGTTAGGGTAGGGTTAGGGTTAGGGTTAGGGT ERR050083.21 0 12 10063 11 100M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.14 16 5 11575 1 21S79M * 0 0 CCCCTCACCCCCACTCCCTCTACCCCAAACCCTAACCCTAACC ERR050083.19 0 15 101981086 11 76M1D24M * 0 0 TAGGGTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG ERR050083.33 16 3 10591 11 7S93M * 0 0 ACCCCTACCTCCCCCCACCCCTCCCCCTACCCCTAACCCTAAC ERR050083.30 16 1 10108 1 39M1D61M * 0 0 CACCCCTAACCCTATCCCTAACCCTACCCCTAACCCTAA ERR050083.34 16 1 10052 11 98M2S * 0 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC ERR050083.32 16 10 10275 11 57M1D5M1D38M * 0 0 CCCTAACCCTAACCCTAACCCTAACCCTACCCCTA ERR050083.39 0 1 10013 11 100M * 0 0 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT ERR050083.36 0 5 10216 1 94M1D6M * 0 0 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ERR050083.38 16 5 10308 1 100M * 0 0 AACCCCAACCCTAACCCCAACCCTAACCCTAACCCTAACCCTA ERR050083.35 16 5 11274 1 100M * 0 0 AACCCTACACCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.29 0 1 248945997 1 60M1D40M * 0 0 AGGGTTAGGGTTAGGGTTAGGGTAGGGTTAGGGTTAGGG ERR050083.43 0 1 10012 11 100M * 0 0 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ERR050083.41 0 X 156030454 1 85M2I13M * 0 0 AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG ERR050083.49 0 3 10740 14 73M1D12M1D14M1S * 0 0 AACCCTAACCCTTAACCCTACCCCTAACCCTAACC ERR050083.44 16 17 83247157 11 4S7M1D6M1D6M1D45M1D9M1I22M * 0 0 GTTTAGGGTTAGGGTTAGGGTTA ERR050083.40 0 1 10014 14 100M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.45 16 1 10185 11 57M1I42M * 0 0 CTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.50 16 11 191803 2 43S8M1I47M1S * 0 0 CAACCCCACCCCCACCCCCACCCCCAACCCCAACC ERR050083.42 16 7 10001 11 1S65M3D34M * 0 0 CCTACCCCTACCCCTAACCCTACCCCTAACCCTAACCCT ERR050083.48 16 1 10134 11 44M2I54M * 0 0 ACCCCAACCCTAACCCCTATCCCTACCCCTAACCCTAAC ERR050083.37 16 X 156030581 11 3S12M1D85M * 0 0 GTGGGGTGAGGGTTAGGTTAGGGTTAGGGTTAGGGTGAG ERR050083.52 16 1 10295 11 58M1I40M1S * 0 0 CCCAACCCCAACCCCAACCCCTACCCCAACCCCAACCCC ERR050083.46 16 X 156030594 14 100M * 0 0 GGTTGGGGTTAGGGTTAGGGTGAGGGTTAGGGTTAGGGTTAGG round( q = 10 · log10(1 p cor)) SAM requires rounding to nearest integer
  34. SAMtools Command-line toolkit to: Convert between SAM, BAM, FASTQ, FASTA

    Order BAM records, e.g. along the genome Retrieve alignments in given interval Index BAM & FASTA files for faster querying $ samtools Program: samtools (Tools for alignments in the SAM format) Version: 1.3.1 (using htslib 1.3.1) Usage: samtools <command> [options] Commands: -- Indexing: dict, faidx, index -- Editing: calmd, fixmate, reheader, rmdup, targetcut, addreplacerg -- File operations: collate, cat, merge, mpileup, sort, split quickcheck, fastq, fasta -- Statistics: bedcov, depth, flagstat, idxstats, phase, stats -- Viewing: flags, tview, view, depad Summarize BAM, e.g. in preparation for variant calling etc
  35. File format notebooks Jupyter notebooks detailing FASTA, FASTQ and SAM

    formats, including some Python code for parsing them: FASTA: http://bit.ly/fasta_nb FASTQ: http://bit.ly/fastq_nb SAM: http://bit.ly/sam_nb