trie, but enable suffix trie traversal A C G T A T A G C G T A T A G A G A G 6 5 3 1 0 4 2 $ A$ ANA$ ANANA$ BANANA$ NA$ NANA$ $ B A NA NA A $ B A N A N A NA $ B A N A NA NA $ B B A NA N A $ NA $ B A NA NA NA $ B A
(n + 1)a Delete each char in P * n + 1 positions where we can insert any of the a characters * Say | P | = n, | ∑ | = a. How many strings with edit distance ≤1? Within edit distance k? O(na) O(nkak) Hamming dist = 1
small. E.g. recent versions of Bowtie and BWA use this approach for seed finding; both n and k can be controlled and kept small Further benefit comes from pruning and double indexing…
T C G T A A A A A A A Suffix trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Neighborhood trie, P = CAA Hamming dist ≤ 1 Try AAA: we fall off suffix trie after A, before AA Co-traversal: act of traversing all paths in both tries
A C G T A A A A CG T C G T A A A A A A A Suffix trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Neighborhood trie, P = CAA Hamming dist ≤ 1 Next try CAA: we fall off suffix trie after CA, before CAA
A C G T A A A A CG T C G T A A A A A A A Suffix trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Neighborhood trie, P = CAA Hamming dist ≤ 1 CAC and CAG also fail. Next try CAT: success.
T C G T A A A A A A A A C G T A T A G C G T A T A G A G A G Suffix trie of T = ACATAG Co-traversal: act of traversing all paths in both tries A C A T A G | | C A A T: P: Common path ending in a neighbor leaf corresponds to an alignment of a neighbor of P to a substring of T Neighborhood trie, P = CAA Hamming dist ≤ 1
all paths that are present in both trees and end in a neighbor leaf A C G T A A A A CG T C G T A A A A A A A Suffix trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Trie for neighborhood within 1 mismatch of P = CAA 2 2 3 3 4 4 5 5 6 6 7 7 8 8 Lexicographical depth-first co-traversal visits node pairs in this order: Only visited 8 nodes in these 20- and 22-node tries 1 1
0 in right, use Forward FM Case 2: 1 mismatch allowed in right half, 0 in left, use Reverse FM Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009;10(3):R25 Forward FM Index w/ BWT(T) T A C Reverse FM Index w/ BWT(reverse(T)) T A C Halfway
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009;10(3):R25 A C G T C C A C G T C A T T T G A A A C G T A A T T T G T T T A C G T T T T T T G A T T G T T G T T T A C G T T T T A C G T G G G A C G T A A A C G A C G C A C A G C C T A A C G G T G T G A T G T T T G A T T T G A T T T G G C A T T T G A C G G A T C C G A C A G G T T A C G Without double index With double index Case 1 Case 2
versions of BWA do tricks like double-indexing and bi-directional BWT (not discussed) Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60.
N A N A A $ B A N A N A N A $ B A N A N A N A $ B B A N A N A $ N A $ B A N A N A N A $ B A genome.1.bt2 genome.2.bt2 genome.3.bt2 genome.4.bt2 genome.rev.1.bt2 genome.rev.2.bt2 Genome in FASTA Bowtie index, forward & reverse bowtie2-build [options]* <reference_in> <bt2_base>
<s>} [-S <sam>] $ B A N A N A A $ B A N A N A N A $ B A N A N A N A $ B B A N A N A $ N A $ B A N A N A N A $ B A bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>] Bowtie index FASTQ SAM/BAM (more on this later)
BWA index, forward & reverse $ B A N A N A A $ B A N A N A N A $ B A N A N A N A $ B B A N A N A $ N A $ B A N A N A N A $ B A in.fasta.amb in.fasta.ann in.fasta.bwt in.fasta.pac in.fasta.sa
SAM Header lines (starting with @) describe alignment job Reference genome, aligner, aligner version, parameters, … Record lines (after header): each line describes one alignment for one read Name of read, where it aligned, relationship to larger paired- end fragment, alignment “shape,” mapping quality, …
quality” = confidence that alignment is correct “mate” = one end / other end of a paired-end read “template” = paired-end fragment “base quality” = confidence that base call is correct Image from https://samtools.github.io/hts-specs/SAMv1.pdf
number “Explain SAM FLAGs” app: https://broadinstitute.github.io/picard/explain-flags.html Least significant bit Most significant bit 0x1 0x2 0x4 0x8 0x10 decimal, base 10 0x20 … 1 2 4 8 16 32 … hexadecimal, base 16 Each bit is true or false. FLAG = bitwise OR of true flags (equivalently, sum of true flags) Image from https://samtools.github.io/hts-specs/SAMv1.pdf
Order BAM records, e.g. along the genome Retrieve alignments in given interval Index BAM & FASTA files for faster querying $ samtools Program: samtools (Tools for alignments in the SAM format) Version: 1.3.1 (using htslib 1.3.1) Usage: samtools <command> [options] Commands: -- Indexing: dict, faidx, index -- Editing: calmd, fixmate, reheader, rmdup, targetcut, addreplacerg -- File operations: collate, cat, merge, mpileup, sort, split quickcheck, fastq, fasta -- Statistics: bedcov, depth, flagstat, idxstats, phase, stats -- Viewing: flags, tview, view, depad Summarize BAM, e.g. in preparation for variant calling etc