JHU EN.600.649 Guest lecture, February 16 2017

Department of Computer Science Applied Comparative Genomics EN.600.649 Spring 2017
Guest lecture, Prof Langmead [email protected] www.langmead-lab.org February 16, 2017

Suffix indexes Suffix indexes are much smaller than the suffix
trie, but enable suffix trie traversal A C G T A T A G C G T A T A G A G A G 6 5 3 1 0 4 2 $ A$ ANA$ ANANA$ BANANA$ NA$ NANA$ $ B A NA NA A $ B A N A N A NA $ B A N A NA NA $ B B A NA N A $ NA $ B A NA NA NA $ B A

Neighborhood search 1 + n(a - 1) + n +
(n + 1)a Delete each char in P * n + 1 positions where we can insert any of the a characters * Say | P | = n, | ∑ | = a. How many strings with edit distance ≤1? Within edit distance k? O(na) O(nkak) Hamming dist = 1

Two approaches How much work to query suﬃx tree with
all strings within distance k? O(n) for each of O(aknk) strings, so O(naknk) O(naknk) O(nm) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 15 0 0 15 15 0 0 0 0 0 0 0 0 0 0 0 15 15 15 0 0 30 15 0 0 0 15 0 15 0 0 15 15 15 0 0 0 0 0 30 15 0 0 30 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 45 5 0 15 45 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 15 0 0 5 60 20 0 0 0 15 0 0 0 15 0 15 15 15 0 0 20 30 0 0 20 75 35 15 0 0 30 15 15 0 0 0 0 0 0 0 0 0 0 0 0 35 90 50 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 50 60 20 45 5 0 0 15 0 15 15 15 0 0 15 15 0 0 0 30 10 65 30 5 60 20 15 0 - c c c t t c c t t a c g c g a c c c a - t c t t a c g a c Reference Read length of reference # k-edit neighbors

Two approaches O(naknk) O(nm) 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 15 0 0 15 15 0 0 0 0 0 0 0 0 0 0 0 15 15 15 0 0 30 15 0 0 0 15 0 15 0 0 15 15 15 0 0 0 0 0 30 15 0 0 30 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 45 5 0 15 45 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 15 0 0 5 60 20 0 0 0 15 0 0 0 15 0 15 15 15 0 0 20 30 0 0 20 75 35 15 0 0 30 15 15 0 0 0 0 0 0 0 0 0 0 0 0 35 90 50 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 50 60 20 45 5 0 0 15 0 15 15 15 0 0 15 15 0 0 0 30 10 65 30 5 60 20 15 0 - c c c t t c c t t a c g c g a c c c a - t c t t a c g a c Reference Read length of reference # k-edit neighbors say n = 100, k = 3 edits, a = 4 aknk = 43·1003 = 64·106 ≈ 6·107 say m = 3·109 (human genome) now say k = 4 aknk = 44·1004 = 256·108 ≈ 3·1010

Two approaches O(naknk) # k-edit neighbors We’ll generally keep k
small. E.g. recent versions of Bowtie and BWA use this approach for seed ﬁnding; both n and k can be controlled and kept small Further beneﬁt comes from pruning and double indexing…

Neighborhood search “Neighborhood trie” of a pattern P = CAA
and Hamming distance ≤ 1 A C G T A A A A CG T C G T A A A A A A A # leaves = # neighbors = 1 + n(a - 1) = 1 + 3(4 -1) = 10

Co-traversal Co-traversal: act of traversing all paths in both tries
A C G T A A A A CG T C G T A A A A A A A Suﬃx trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Neighborhood trie, P = CAA Hamming dist ≤ 1

Co-traversal A C G T A A A A CG
T C G T A A A A A A A Suffix trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Neighborhood trie, P = CAA Hamming dist ≤ 1 Try AAA: we fall off suffix trie after A, before AA Co-traversal: act of traversing all paths in both tries

A C G T A A A A CG T C G T A A A A A A A Suffix trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Neighborhood trie, P = CAA Hamming dist ≤ 1 Next try CAA: we fall off suffix trie after CA, before CAA

A C G T A A A A CG T C G T A A A A A A A Suﬃx trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Neighborhood trie, P = CAA Hamming dist ≤ 1 CAC and CAG also fail. Next try CAT: success.

Co-traversal A C G T A A A A CG
T C G T A A A A A A A A C G T A T A G C G T A T A G A G A G Suﬃx trie of T = ACATAG Co-traversal: act of traversing all paths in both tries A C A T A G | | C A A T: P: Common path ending in a neighbor leaf corresponds to an alignment of a neighbor of P to a substring of T Neighborhood trie, P = CAA Hamming dist ≤ 1

Co-traversal We can find all such alignments with co-traversal: explore
all paths that are present in both trees and end in a neighbor leaf A C G T A A A A CG T C G T A A A A A A A Suffix trie of T = ACATAG A C G T A T A G C G T A T A G A G A G Trie for neighborhood within 1 mismatch of P = CAA 2 2 3 3 4 4 5 5 6 6 7 7 8 8 Lexicographical depth-first co-traversal visits node pairs in this order: Only visited 8 nodes in these 20- and 22-node tries 1 1

Co-traversal: pruning We can think of the tree we’re exploring
as being the intersection of these two trees... A C G T A A A A CG T C G T A A A A A A A A C G T A T A G C G T A T A G A G A G ∩

Co-traversal: pruning Suﬃx trie of T = ACATAG A C
G T A T A Trie for neighborhood within 1 mismatch of P = CAA 2 3 4 5 6 7 8 1 ∩

Read: Reference: Lambda phage (~50,000 nt) Policy: Up to 1 mismatch Pruning GTCCTTTCCGGTGATCCGACAGGTTACG

Double-index pruning Case 1: 1 mismatch allowed in left half,
0 in right, use Forward FM Case 2: 1 mismatch allowed in right half, 0 in left, use Reverse FM Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efﬁcient alignment of short DNA sequences to the human genome. Genome Biology. 2009;10(3):R25 Forward FM Index w/ BWT(T) T A C Reverse FM Index w/ BWT(reverse(T)) T A C Halfway

Double-index pruning Langmead B, Trapnell C, Pop M, Salzberg SL.
Ultrafast and memory-efﬁcient alignment of short DNA sequences to the human genome. Genome Biology. 2009;10(3):R25 A C G T C C A C G T C A T T T G A A A C G T A A T T T G T T T A C G T T T T T T G A T T G T T G T T T A C G T T T T A C G T G G G A C G T A A A C G A C G C A C A G C C T A A C G G T G T G A T G T T T G A T T T G A T T T G G C A T T T G A C G G A T C C G A C A G G T T A C G Without double index With double index Case 1 Case 2

Bowtie & BWA Both do pruned co-traversal; Bowtie and later
versions of BWA do tricks like double-indexing and bi-directional BWT (not discussed) Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-eﬃcient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60.

Bowtie: alignment 0 0 35 30 35 30 0 0
35 30 35 30 Read Referenc Ref string 1 Ref string 3 Ref substring Ref string 1 Ref string 1 Alignment Read Read substring Ref string 1 Ref string 3 Ref substring ∅ Read substring 0 0 35 30 35 30 0 0 35 30 35 30 0 0 35 30 35 30 Alignment Read Referenc Read substring x 0 0 35 30 35 30 Ref string 1 Ref string 1 Ref string 1 Ref string 1 Hit Read Ref string 1 Ref string 1 Alignment Bowtie 2 Bowtie 1 FM index + approximate matching

Bowtie: alignment 0 0 35 30 35 30 0 0
35 30 35 30 Read Referenc Ref string 1 Ref string 3 Ref substring Ref string 1 Ref string 1 Alignment Read Read substring Ref string 1 Ref string 3 Ref substring ∅ Read substring 0 0 35 30 35 30 0 0 35 30 35 30 0 0 35 30 35 30 Alignment Read Referenc Read substring x 0 0 35 30 35 30 Ref string 1 Ref string 1 Ref string 1 Ref string 1 Hit Read Ref string 1 Ref string 1 Alignment Bowtie 2 Bowtie 1 Dynamic programming

Bowtie: index building bowtie-build [options]* <reference_in> <ebwt_base> $ B A
N A N A A $ B A N A N A N A $ B A N A N A N A $ B B A N A N A $ N A $ B A N A N A N A $ B A genome.1.bt2 genome.2.bt2 genome.3.bt2 genome.4.bt2 genome.rev.1.bt2 genome.rev.2.bt2 Genome in FASTA Bowtie index, forward & reverse bowtie2-build [options]* <reference_in> <bt2_base>

Bowtie: alignment bowtie [options]* <ebwt> {-1 <m1> -2 <m2> |
<s>} [-S <sam>] $ B A N A N A A $ B A N A N A N A $ B A N A N A N A $ B B A N A N A $ N A $ B A N A N A N A $ B A bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>] Bowtie index FASTQ SAM/BAM (more on this later)

BWA: index building bwa index [options] <in.fasta> Genome in FASTA
BWA index, forward & reverse $ B A N A N A A $ B A N A N A N A $ B A N A N A N A $ B B A N A N A $ N A $ B A N A N A N A $ B A in.fasta.amb in.fasta.ann in.fasta.bwt in.fasta.pac in.fasta.sa

BWA: alignment bwa mem [options] <idxbase> <in1.fq> [in2.fq] > out.sam
$ B A N A N A A $ B A N A N A N A $ B A N A N A N A $ B B A N A N A $ N A $ B A N A N A N A $ B A BWA index FASTQ SAM/BAM (more on this later)

SAM/BAM https://samtools.github.io/hts-specs/SAMv1.pdf SAM: Sequence Alignment/Map format BAM: Binary version of
SAM Header lines (starting with @) describe alignment job Reference genome, aligner, aligner version, parameters, … Record lines (after header): each line describes one alignment for one read Name of read, where it aligned, relationship to larger paired- end fragment, alignment “shape,” mapping quality, …

SAM/BAM Fields on a record line: “query” = read “mapping
quality” = conﬁdence that alignment is correct “mate” = one end / other end of a paired-end read “template” = paired-end fragment “base quality” = conﬁdence that base call is correct Image from https://samtools.github.io/hts-specs/SAMv1.pdf

SAM/BAM FLAG field packs a lot of info in 1
number “Explain SAM FLAGs” app: https://broadinstitute.github.io/picard/explain-flags.html Least significant bit Most significant bit 0x1 0x2 0x4 0x8 0x10 decimal, base 10 0x20 … 1 2 4 8 16 32 … hexadecimal, base 16 Each bit is true or false. FLAG = bitwise OR of true flags (equivalently, sum of true flags) Image from https://samtools.github.io/hts-specs/SAMv1.pdf

SAM/BAM read paired = 1 proper pair = 2 reverse
strand = 16 ﬁrst in pair = 64 1 + 2 + 16 + 64 = 83 “Explain SAM FLAGs” app: https://broadinstitute.github.io/picard/explain-ﬂags.html

SAM/BAM CIGAR string describes alignment “shape” What does the traceback
look like, from top to bottom? Read Ref (portion) Read CIGAR = 3M2D2M2I3M CIGAR = 1S3M2I2M2S “soft clipping” M = match or mismatch Ref (portion)

SAM/BAM • Foes conspire to create alignment error: • Alignment
error: aligner fails to map a read to its true point of origin • Contamination • Genetic variation • Sequencing error • Repetitive genomes

SAM/BAM Like base quality, mapping quality is a conﬁdence level
p cor ⌘ Probability correct q = 10 · log10(1 p cor) Also like base quality, encoding using Phred (-10 log) scale:

SAM/BAM ERR050083.28 0 5 10643 1 100M * 0 0
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT ERR050083.24 16 1 10023 11 100M * 0 0 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTACC ERR050083.26 0 X 156030515 11 15M1D85M * 0 0 GGGTTAGGGTTAGGGTAGGGTTAGGGTTAGGGTTAGGGT ERR050083.21 0 12 10063 11 100M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.14 16 5 11575 1 21S79M * 0 0 CCCCTCACCCCCACTCCCTCTACCCCAAACCCTAACCCTAACC ERR050083.19 0 15 101981086 11 76M1D24M * 0 0 TAGGGTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG ERR050083.33 16 3 10591 11 7S93M * 0 0 ACCCCTACCTCCCCCCACCCCTCCCCCTACCCCTAACCCTAAC ERR050083.30 16 1 10108 1 39M1D61M * 0 0 CACCCCTAACCCTATCCCTAACCCTACCCCTAACCCTAA ERR050083.34 16 1 10052 11 98M2S * 0 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC ERR050083.32 16 10 10275 11 57M1D5M1D38M * 0 0 CCCTAACCCTAACCCTAACCCTAACCCTACCCCTA ERR050083.39 0 1 10013 11 100M * 0 0 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT ERR050083.36 0 5 10216 1 94M1D6M * 0 0 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ERR050083.38 16 5 10308 1 100M * 0 0 AACCCCAACCCTAACCCCAACCCTAACCCTAACCCTAACCCTA ERR050083.35 16 5 11274 1 100M * 0 0 AACCCTACACCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.29 0 1 248945997 1 60M1D40M * 0 0 AGGGTTAGGGTTAGGGTTAGGGTAGGGTTAGGGTTAGGG ERR050083.43 0 1 10012 11 100M * 0 0 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ERR050083.41 0 X 156030454 1 85M2I13M * 0 0 AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG ERR050083.49 0 3 10740 14 73M1D12M1D14M1S * 0 0 AACCCTAACCCTTAACCCTACCCCTAACCCTAACC ERR050083.44 16 17 83247157 11 4S7M1D6M1D6M1D45M1D9M1I22M * 0 0 GTTTAGGGTTAGGGTTAGGGTTA ERR050083.40 0 1 10014 14 100M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.45 16 1 10185 11 57M1I42M * 0 0 CTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.50 16 11 191803 2 43S8M1I47M1S * 0 0 CAACCCCACCCCCACCCCCACCCCCAACCCCAACC ERR050083.42 16 7 10001 11 1S65M3D34M * 0 0 CCTACCCCTACCCCTAACCCTACCCCTAACCCTAACCCT ERR050083.48 16 1 10134 11 44M2I54M * 0 0 ACCCCAACCCTAACCCCTATCCCTACCCCTAACCCTAAC ERR050083.37 16 X 156030581 11 3S12M1D85M * 0 0 GTGGGGTGAGGGTTAGGTTAGGGTTAGGGTTAGGGTGAG ERR050083.52 16 1 10295 11 58M1I40M1S * 0 0 CCCAACCCCAACCCCAACCCCTACCCCAACCCCAACCCC ERR050083.46 16 X 156030594 14 100M * 0 0 GGTTGGGGTTAGGGTTAGGGTGAGGGTTAGGGTTAGGGTTAGG round( q = 10 · log10(1 p cor)) SAM requires rounding to nearest integer

SAMtools Command-line toolkit to: Convert between SAM, BAM, FASTQ, FASTA
Order BAM records, e.g. along the genome Retrieve alignments in given interval Index BAM & FASTA ﬁles for faster querying $ samtools Program: samtools (Tools for alignments in the SAM format) Version: 1.3.1 (using htslib 1.3.1) Usage: samtools <command> [options] Commands: -- Indexing: dict, faidx, index -- Editing: calmd, fixmate, reheader, rmdup, targetcut, addreplacerg -- File operations: collate, cat, merge, mpileup, sort, split quickcheck, fastq, fasta -- Statistics: bedcov, depth, flagstat, idxstats, phase, stats -- Viewing: flags, tview, view, depad Summarize BAM, e.g. in preparation for variant calling etc

File format notebooks Jupyter notebooks detailing FASTA, FASTQ and SAM
formats, including some Python code for parsing them: FASTA: http://bit.ly/fasta_nb FASTQ: http://bit.ly/fastq_nb SAM: http://bit.ly/sam_nb

JHU EN.600.649 Guest lecture, February 16 2017

JHU EN.600.649 Guest lecture, February 16 2017

More Decks by Ben Langmead

Other Decks in Science

Featured

Transcript