Genomic sketching with HyperLogLog

Ben Langmead, JHU Computer Science
[email protected], langmead-lab.org, @BenLangmead
Genome Informatics, Cold Spring Harbor Laboratory, November 2019

Sketching
How to sift and summarize huge datasets (genomes, sequencing data) so we can answer similarity questions later?
Images: doi:10.1038/nbt.4229, doi:10.1038/s41576-018-0088-9

Sketching
FASTA / FASTQ → k-merize (shingle) → Sample: capture extreme or informative items
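As a rough illustration of the k-merize (shingle) step, the sketch below collects every length-k substring of a sequence into a set. The function name `kmers` and the toy sequence are assumptions made for this example; real tools typically also canonicalize reverse complements, which is left out here.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers, or shingles) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Toy input; a real pipeline would stream records from FASTA/FASTQ files.
shingles = kmers("AGGCCACAGTG", 4)
print(sorted(shingles))
```

The sampling step then keeps only a few of these k-mers, chosen by a rule such as "smallest hash value".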

Cardinality
Biological relatedness:
AGGCCACAGTGTATTATGACTG
|||||||||||  |||||||||
AGGCCACAGTGAGTTATGACTG
AAAAAAAAAAAGATGT-AAGTA
|||||||||||||||| |||||
AAAAAAAAAAAGATGTAAAGTA
GAGG--TCAGATTCACAGCCAC
||||  ||||||||||||||||
GAGGGGTCAGATTCACAGCCAC
Set similarities: J = |A ∩ B| / |A ∪ B| (Jaccard), C = |A ∩ B| / |A| (containment)
Cardinalities: |A|, |B|, |A ∪ B|, |A ∩ B|
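For reference, the two set similarities can be computed exactly when the sets are small enough to hold in memory. The sketch below does that on the k-mer sets of the first pair of sequences above; the helper names and the choice k = 5 are assumptions for illustration only.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """J = |A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b)

def containment(a, b):
    """C = |A ∩ B| / |A|"""
    return len(a & b) / len(a)

A = kmers("AGGCCACAGTGTATTATGACTG", 5)
B = kmers("AGGCCACAGTGAGTTATGACTG", 5)
print(jaccard(A, B), containment(A, B))
```

Sketching aims to approximate these quantities without storing the full sets.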

Hat problem
I take cards labeled 1 to 1,000 and choose a random subset of size n to hide in my hat (for example 342, 830, 017, 332, 525, 092, 709). You would like to estimate n. You may see one representative from the cards in the hat; which to pick?

Hat problem
Pick the minimum. Card values span 0 to 999; if min = 40, then 40 ≈ 1000/(n + 1), so n ≈ 24. What if the minimum was 500? ...10? ...4? The estimate should grow as the minimum shrinks. Easy to compute, fits in 10 bits.
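The arithmetic behind the estimate can be sketched as below. It simply inverts min ≈ 1000/(n + 1); the function names are made up for this example, and a single minimum is a noisy estimator, so the simulated draw is only there to play with.

```python
import random

def estimate_n(minimum, universe=1000):
    """Invert min ≈ universe / (n + 1) to get n ≈ universe / min - 1."""
    return universe / minimum - 1

def draw_minimum(n, rng, universe=1000):
    """Hide n random cards labeled 1..universe in the hat and reveal the smallest."""
    return min(rng.sample(range(1, universe + 1), n))

print(estimate_n(40))    # 24.0, the slide's example
print(estimate_n(500))   # 1.0
print(estimate_n(10))    # 99.0
print(estimate_n(4))     # 249.0

rng = random.Random(0)   # fixed seed so the run is repeatable
print(estimate_n(draw_minimum(24, rng)))
```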

Hat problem - hash function
Hat analogy seems contrived... ...but matches the situation where we hash items up front: h(x) maps each item to a value such as 0x39AD49CC, 0x9FAA176B, 0xDF89114B, ...
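To connect the hat to hashing: if each item is hashed to a 32-bit value, the hash values behave like cards drawn from 0 to 2^32 − 1, and repeated items land on the same card. The sketch below assumes BLAKE2b truncated to 4 bytes as the hash function h(x); that choice, and the helper names, are illustrative and not taken from the talk.

```python
import hashlib

def h32(item: str) -> int:
    """Hash a string to a 32-bit integer (BLAKE2b truncated to 4 bytes; an assumption)."""
    return int.from_bytes(hashlib.blake2b(item.encode(), digest_size=4).digest(), "big")

def min_hash(items):
    """The single representative: the smallest hash value seen."""
    return min(h32(x) for x in items)

def estimate_distinct(items):
    """Same inversion as the hat problem, with a universe of size 2**32."""
    return 2**32 / max(min_hash(items), 1) - 1

items = [f"kmer{i}" for i in range(1000)]
print(hex(min_hash(items)), estimate_distinct(items))
```

Because duplicates hash identically, the minimum depends only on the distinct items, which is what makes it a cardinality sketch.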

Two-hat problem
Two hats, A and B. Can we estimate cardinality of unions? Intersections? About coincidences.
Cardinalities: |A|, |B|, |A ∪ B|, |A ∩ B|

Two-hat problem
Space of coincidences is large. Need multiple representatives per set.
Image inspired by: Ondov B, Starrett G, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM. Mash Screen: high-throughput sequence containment estimation for genome discovery. Genome Biol 20, 232 (2019).

Bottom k
Instead of minimum only, consider "bottom 3". A "bottom-k sketch" can estimate cardinalities of unions and intersections. Larger k averages more, improving the estimate.
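A minimal bottom-k sketch might look like the following. It assumes a 64-bit BLAKE2b hash, the common (k − 1)/(k-th smallest normalized hash) cardinality estimator, and the merge-and-count Jaccard estimator; all function names are invented for this example rather than taken from Mash.

```python
import hashlib

def h64(item: str) -> int:
    """64-bit hash (BLAKE2b truncated to 8 bytes; an assumption)."""
    return int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")

def bottom_k(items, k):
    """Sorted list of the k smallest distinct hash values."""
    return sorted({h64(x) for x in items})[:k]

def cardinality_estimate(sketch, k):
    """(k - 1) / (k-th smallest hash, scaled to (0, 1)); exact if fewer than k values."""
    if len(sketch) < k:
        return len(sketch)
    return (k - 1) * 2**64 / sketch[k - 1]

def jaccard_estimate(sa, sb, k):
    """Take the bottom k of the merged sketches; count values present in both."""
    merged = sorted(set(sa) | set(sb))[:k]
    shared = set(sa) & set(sb)
    return sum(1 for h in merged if h in shared) / len(merged)

k = 256
A = [f"x{i}" for i in range(0, 1000)]
B = [f"x{i}" for i in range(500, 1500)]   # true J = 500 / 1500
sa, sb = bottom_k(A, k), bottom_k(B, k)
print(jaccard_estimate(sa, sb, k), cardinality_estimate(sa, k))
```

The bottom k of the merged sketches is itself a bottom-k sketch of A ∪ B, which is why union and intersection sizes become estimable.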

k-partition
Instead of bottom-3, consider the minimum in each of 3 partitions. Accomplishes something similar to bottom-k.
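One way to picture the k-partition idea: use the top p bits of each hash to pick a partition and keep the minimum of the remaining bits in that partition. The sketch below assumes 2^p partitions (the slide uses 3) and estimates Jaccard as the fraction of occupied partitions whose minima agree; names and parameters are illustrative.

```python
import hashlib

def h64(item: str) -> int:
    """64-bit hash (BLAKE2b truncated to 8 bytes; an assumption)."""
    return int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")

def partition_sketch(items, p=4):
    """One minimum per partition; None marks a partition that received no item."""
    q = 64 - p
    regs = [None] * (1 << p)
    for x in items:
        h = h64(x)
        idx = h >> q                    # top p bits choose the partition
        rest = h & ((1 << q) - 1)       # remaining q bits compete for the minimum
        if regs[idx] is None or rest < regs[idx]:
            regs[idx] = rest
    return regs

def jaccard_estimate(ra, rb):
    """Fraction of partitions (occupied in either sketch) whose minima match."""
    occupied = [(a, b) for a, b in zip(ra, rb) if a is not None or b is not None]
    return sum(1 for a, b in occupied if a == b) / len(occupied)

A = [f"x{i}" for i in range(0, 1000)]
B = [f"x{i}" for i in range(500, 1500)]
print(jaccard_estimate(partition_sketch(A), partition_sketch(B)))
```

Each item touches exactly one register, so updates are cheap compared with maintaining a sorted bottom-k list.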

Mash
Bottom-k MinHash sketch (image: ref 1)
1. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016 Jun 20;17(1):132.
2. Broder AZ. On the resemblance and containment of documents. Compression and Complexity of Sequences 1997 - Proceedings 1998:21–29.

Mash
Representative fits in ⌈log2 U⌉ bits; ⌈log2 U⌉ is often 32 or 64 (image: ref 1)
1. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016 Jun 20;17(1):132.

Log vs LogLog
Instead of minimum, say we use log-minimum: min = 0x030F6556, ⌊log2 min⌋ = 25.
Pro: Representatives take log log U rather than log U bits: 32 bits vs. 5 bits (6x).
Con: Estimate is of ⌊log2 n⌋; can re-exponentiate later, but with added variance & bias.
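The slide's numbers can be checked with integer arithmetic. The sketch below computes the log-minimum via bit length (equivalently, via a leading-zero count); the helper names are assumptions for illustration.

```python
def floor_log2(x: int) -> int:
    """floor(log2 x) for x >= 1, using the position of the highest set bit."""
    return x.bit_length() - 1

def leading_zeros(x: int, width: int = 32) -> int:
    """Leading-zero count (LZC) of x viewed as a width-bit word."""
    return width - x.bit_length()

m = 0x030F6556
print(floor_log2(m))                   # 25
print(leading_zeros(m))                # 6, and 31 - 6 == 25
print(floor_log2(m).bit_length())      # 5 bits are enough to store 25
print(2 ** floor_log2(m))              # re-exponentiated value; only a lower bound on m
```

Re-exponentiation recovers the minimum only to within a factor of two, which is where the added variance and bias come from.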

Dashing
Daniel Baker
Baker DN, Langmead B. Dashing: fast and accurate genomic distances with HyperLogLog. In press, Genome Biology. http://bit.ly/dash_pre
Software: https://github.com/dnbaker/dashing
Library: https://github.com/dnbaker/sketch
Many methods implemented, not just HLL: MinHash, Bloom filters, Mod sketch, bBit Minwise, CountMin, Count sketch, HeavyKeeper, ...

HyperLogLog (HLL)
Diagram: input items are hashed; the p-bit prefix of each hash value selects a register (Register 000, 001, ..., 111) and the remaining q bits determine that register's value; each register yields a cardinality estimate (e.g. ~2^2, ~2^3), and these are combined into an overall estimate.
1. k-partition
2. ⌊log2 n⌋
3. Re-exponentiation
4. Averaging, bias correction
Baker DN, Langmead B. Dashing: fast and accurate genomic distances with HyperLogLog. In press, Genome Biology.
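The four steps can be put together in a small, unoptimized HyperLogLog. This sketch follows the textbook convention of storing leading-zero count + 1 per register and keeping the maximum, which carries the same information as the log-minimum. It uses the classic harmonic-mean estimator with the small-range correction, not the estimators implemented in Dashing; the hash function, class name and default p = 10 are assumptions.

```python
import hashlib
import math

def h64(item: str) -> int:
    """64-bit hash (BLAKE2b truncated to 8 bytes; an assumption)."""
    return int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")

class HLL:
    def __init__(self, p: int = 10):
        self.p, self.q = p, 64 - p
        self.m = 1 << p                 # number of registers (step 1: k-partition)
        self.regs = [0] * self.m        # 0 means "no item seen yet"

    def add(self, item: str) -> None:
        h = h64(item)
        idx = h >> self.q                              # p-bit prefix picks the register
        suffix = h & ((1 << self.q) - 1)               # remaining q bits
        rank = self.q - suffix.bit_length() + 1        # step 2: leading zeros + 1
        if rank > self.regs[idx]:
            self.regs[idx] = rank

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)               # bias constant, valid for m >= 128
        # Steps 3 and 4: re-exponentiate each register and take a harmonic mean.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.regs)
        zeros = self.regs.count(0)
        if raw <= 2.5 * m and zeros:                   # small-range (linear counting) fix
            return m * math.log(m / zeros)
        return raw

sketch = HLL(p=10)
for i in range(10000):
    sketch.add(f"kmer{i}")
print(sketch.estimate())
```

With m = 1024 registers the expected relative error is roughly 1.04/√m, about 3%.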

Trick 1: SIMD instructions
8-bit registers are well suited to vectorized (SIMD) instructions. Union equals elementwise min: HLL(A ∪ B) = min(HLL(A), HLL(B)).
SSE2: PMINUB; AVX2: VPMINUB; AVX512-BW: VPMINUB
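The union property is easy to check in plain Python, with one byte per register standing in for the packed 8-bit lanes. The sketch below follows the slide's convention that a register holds the log-minimum, so merging is an elementwise min; implementations that store leading-zero count + 1 merge with an elementwise max instead. Function names, the hash and p = 4 are assumptions.

```python
import hashlib

def h64(item: str) -> int:
    """64-bit hash (BLAKE2b truncated to 8 bytes; an assumption)."""
    return int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")

def logmin_sketch(items, p: int = 4) -> bytes:
    """One 8-bit register per partition holding floor(log2(min suffix))."""
    q = 64 - p
    regs = bytearray([q] * (1 << p))          # q marks "empty": above any real value
    for x in items:
        h = h64(x)
        idx = h >> q
        suffix = h & ((1 << q) - 1)
        value = max(suffix.bit_length() - 1, 0)   # floor(log2), clamped for suffix 0
        if value < regs[idx]:
            regs[idx] = value
    return bytes(regs)

def union(a: bytes, b: bytes) -> bytes:
    """Elementwise min; a SIMD min instruction handles many registers per step."""
    return bytes(map(min, a, b))

A = [f"x{i}" for i in range(300)]
B = [f"y{i}" for i in range(500)]
print(union(logmin_sketch(A), logmin_sketch(B)) == logmin_sketch(A + B))
```

The merged sketch is identical to the sketch of the combined input, so unions never require revisiting the original data.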

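Trick 2: Changing log base
64-bit hash minimum, e.g. 0x030F65A691DD9010. ⌊log2 min⌋ (via LZC; slide example value 25) needs only 6 bits, so part of every 8-bit register in the register array goes to waste. ⌊log1.19 min⌋ (slide example value 101) uses all 8 bits.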
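A quick check of why base 1.19 suits 8-bit registers: 1.19^255 is very close to 2^64, so the floor of a base-1.19 logarithm of any 64-bit value lands in 0..255. The sketch below uses floating-point logarithms purely for illustration; the function names are hypothetical, and the lookup or bit tricks an optimized implementation might use are not shown.

```python
import math

def floor_log(x: int, base: float) -> int:
    """floor(log_base x) for x >= 1, computed with floating-point logs."""
    return math.floor(math.log(x) / math.log(base))

def floor_log2(x: int) -> int:
    """floor(log2 x) via the highest set bit (what an LZC instruction gives)."""
    return x.bit_length() - 1

largest = 2**64 - 1
print(floor_log2(largest), floor_log2(largest).bit_length())            # 63, 6 bits
print(floor_log(largest, 1.19), floor_log(largest, 1.19).bit_length())  # 255, 8 bits
```

A smaller base gives finer-grained registers, so the same 8 bits carry more information about the minimum.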

Dashing
HLL handles lopsided sets better than bottom-k MinHash (refs 1, 2).
Figure: |J − Ĵ| versus log2(sketch bytes) for MinHash and HLL, J = 0.111.
1. Koslicki, David, and Hooman Zabeti. Improving MinHash via the containment index with applications to metagenomic analysis. Applied Mathematics and Computation 354 (2019): 206-215.
2. Ondov B, Starrett G, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM. Mash Screen: high-throughput sequence containment estimation for genome discovery. Genome Biol 20, 232 (2019).

Dashing
More accurate than Mash comparing real genome pairs at various similarities; BinDash* is competitive.
Figure: Est J − True J, binned by True J from [0, 0.1) to [0.9, 1), for Mash, BinDash and Dashing (MLE). Panels: k = 16 and k = 21, each with log2(sketch bytes) = 10 and 14.
* Zhao X. BinDash, software for fast genome distance estimation on a typical personal laptop. Bioinformatics. 2019 Feb 15;35(4):671-673.

Dashing
Dashing sketches and performs all-pairs distance calculations for 87,113 bacterial genomes in ~6m, versus ~20m for BinDash and ~54m for Mash. Fastest at sketching in general; fastest overall (incl. distance estimation) for small sketches.
k = 31, 1 KB sketch, 100 threads
           Sketching                All-pairs distances
           Wall clock   Peak mem    Wall clock   Peak mem
Mash       22m25s       17 GB       31m41s       1.1 GB
BinDash    19m17s       141 MB      1m14s        409
Dashing    4m31s        13 GB       1m40s        116

Future work
Multi-k sketching; Weighted Jaccard; Indexing public data with HLLs.
How to synthesize various k-mer lengths? How to restore multiplicity information? Does this require a new sketch, or just an "adapter" for HLL?

Funding:
• NSF: IIS-1349906
• NIH: R01GM118568
• XSEDE: TG-CIE170020
Looking for Ph.D. students; write to [email protected] or chat with me
Daniel Baker
JHU:
• Brad Solomon
Software: https://github.com/dnbaker/dashing
Library: https://github.com/dnbaker/sketch
Preprint: http://bit.ly/dash_pre
MinHash: representatives come from min; store representatives in log U bits; bottom-k for averaging (in Mash)
HyperLogLog: representatives come from log-min (LZC); store representatives in log log U bits; k-partition for averaging

References
MinHash: Broder AZ. On the resemblance and containment of documents. Compression and Complexity of Sequences 1997 - Proceedings 1998:21–29.
Containment MinHash: Koslicki D, Zabeti H. Improving MinHash via the containment index with applications to metagenomic analysis. Applied Mathematics and Computation 354 (2019): 206-215.
Mash: Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016 Jun 20;17(1):132.
BinDash: Zhao X. BinDash, software for fast genome distance estimation on a typical personal laptop. Bioinformatics. 2019 Feb 15;35(4):671-673.
HLL card. estimation: Ertl O. New cardinality estimation algorithms for HyperLogLog sketches. CoRR abs/1702.01284 (2017).
Dashing: Baker DN, Langmead B. Dashing: fast and accurate genomic distances with HyperLogLog. In press, Genome Biology. Preprint: http://bit.ly/dash_pre
