Hash Functions FTW

Hash Functions FTW*
Fast Hashing, Bloom Filters & Hash-Oriented Storage
Sunny Gleason
* For the win (see urbandictionary FTW[1]); this expression has nothing to do with hash functions

What’s in this Presentation
• Hash Function Survey
• Hash Performance
• Bloom Filters
• HashFile: Hash Storage

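Hash Functions

    int getIntHash(byte[] data);   // 32-bit
    long getLongHash(byte[] data); // 64-bit

    int v1 = hash("foo".getBytes());
    int v2 = hash("goo".getBytes());

    int hash(byte[] value) { // a simple hash
        int h = 0;
        for (byte b : value) {
            // rotate-style mix; >>> avoids sign extension on negative h
            h = (h << 5) ^ (h >>> 27) ^ b;
        }
        // floorMod keeps the result non-negative even when h is negative
        return Math.floorMod(h, PRIME);
    }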

Hash Functions
• Goal: v1 has many bit differences from v2
• Desirable Properties:
  • Uniform Distribution: few collisions
  • Very Fast Computation
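The "many bit differences" goal can be measured directly by XOR-ing two hash values and counting the set bits. The sketch below is only an illustration: `BitDiffDemo`, `bitDifferences`, and the value of `PRIME` are assumptions (the slides do not give the prime), and `simpleHash` is a variant of the simple hash from the earlier slide.

```java
import java.nio.charset.StandardCharsets;

final class BitDiffDemo {
    // Hypothetical modulus; the slides do not state the value of PRIME.
    static final int PRIME = 1_000_003;

    // Variant of the simple shift/XOR hash from the slides.
    static int simpleHash(byte[] value) {
        int h = 0;
        for (byte b : value) {
            h = (h << 5) ^ (h >>> 27) ^ b;
        }
        return Math.floorMod(h, PRIME);
    }

    // Number of bit positions in which two 32-bit hash values differ.
    static int bitDifferences(int v1, int v2) {
        return Integer.bitCount(v1 ^ v2);
    }

    public static void main(String[] args) {
        int v1 = simpleHash("foo".getBytes(StandardCharsets.UTF_8));
        int v2 = simpleHash("goo".getBytes(StandardCharsets.UTF_8));
        System.out.println("bits differing: " + bitDifferences(v1, v2));
    }
}
```

With this particular `PRIME`, "foo" and "goo" differ in a single output bit, which suggests why such a simple hash falls short of the goal and why the functions surveyed next exist.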

Hash Applications
Goal: O(1) access
• Hash Table
• Hash Set
• Bloom Filter

Popular Hash Functions
• FNV Hash
• DJB Hash
• Jenkins Hash
• Murmur2
• New (Promising?): CrapWow
• Awesome & Slow: SHA-1, MD5, etc.
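As a concrete example of one function from this list, here is a minimal sketch of 32-bit FNV-1a, using the published FNV offset basis and prime. The class and method names are hypothetical, and this is not taken from the g414-hash project.

```java
final class Fnv1a {
    private static final int OFFSET_BASIS = 0x811C9DC5; // FNV 32-bit offset basis
    private static final int FNV_PRIME = 0x01000193;    // FNV 32-bit prime

    // FNV-1a: XOR each byte into the hash, then multiply by the prime.
    static int hash32(byte[] data) {
        int h = OFFSET_BASIS;
        for (byte b : data) {
            h ^= (b & 0xFF); // treat the byte as unsigned
            h *= FNV_PRIME;  // int overflow wraps, i.e. arithmetic mod 2^32
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] key = "foo".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(hash32(key)));
    }
}
```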

Evaluating Hash Functions
• Hash Function “Zoo”
• Quality of: CRC32, DJB, Jenkins, FNV, Murmur2, SHA1
• Performance (MM ops/s): [bar chart; values not recoverable from the transcript]

A Strawman “Set”
• N keys, K bytes per key
• Allocate array of size K * N bytes
• Utilize array storage as:
  • a heap or tree: O(lg N) insert/lookup/remove
  • a hash: O(1) insert/lookup/remove
• What if we don’t have room for K * N bytes?

Bloom Filter
• Key Point: give up on storing all the keys
• Store r bits per key instead of K bytes
• Allocate bit vector of size M = r * N, where N is the expected number of entries
• Use multiple hash functions of the key to determine which bits to set
• Premise: if hash functions are well-distributed, few collisions, high accuracy
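The bullets above can be sketched in a few lines of Java. This is an illustrative sketch, not the g414-hash implementation: the class name is hypothetical, and deriving the k bit positions from two FNV-style hashes (double hashing, `h1 + i * h2`) is an assumption, since the slides do not say how the multiple hash functions are obtained.

```java
import java.util.BitSet;

final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;   // M = r * N
    private final int numHashes; // k, roughly 0.7 * r

    SimpleBloomFilter(int expectedEntries, int bitsPerKey) {
        // Sketch only: assumes expectedEntries * bitsPerKey fits in an int.
        this.numBits = expectedEntries * bitsPerKey;
        this.numHashes = Math.max(1, (int) Math.round(0.7 * bitsPerKey));
        this.bits = new BitSet(numBits);
    }

    void add(byte[] key) {
        int h1 = fnv1a(key), h2 = fnv1(key);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    // false means "definitely absent"; true means "probably present".
    boolean mightContain(byte[] key) {
        int h1 = fnv1a(key), h2 = fnv1(key);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }

    // i-th bit position, derived from two base hashes (assumed scheme).
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    private static int fnv1a(byte[] data) {
        int h = 0x811C9DC5;
        for (byte b : data) { h ^= (b & 0xFF); h *= 0x01000193; }
        return h;
    }

    private static int fnv1(byte[] data) {
        int h = 0x811C9DC5;
        for (byte b : data) { h *= 0x01000193; h ^= (b & 0xFF); }
        return h;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1000, 8);
        filter.add("foo".getBytes());
        System.out.println(filter.mightContain("foo".getBytes()));
    }
}
```

Note that the filter never stores the keys themselves, so an added key always tests positive, while an absent key tests positive only when all k of its bits happen to be set by other keys.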

Bloom Filter

Tuning Bloom Filters
Let r = M bits / N keys (r: num bits/key)
Let k = 0.7 * r (k: num hashes to use)
Let p = 0.6185 ** r (p: probability of false positives)

Working backwards, we can use the desired false positive rate p to tune the data structure's space consumption:
r = 8, p = 2.1e-2
r = 16, p = 4.5e-4
r = 24, p = 9.8e-6
r = 32, p = 2.1e-7
r = 40, p = 4.5e-9
r = 48, p = 9.6e-11
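The tuning rules above translate into a few one-line helpers. This is a small sketch with hypothetical names; it uses the slide's constants directly (0.7 approximates ln 2, and 0.6185 is approximately 0.5 raised to ln 2).

```java
final class BloomTuning {
    // p = 0.6185 ** r: expected false positive rate for r bits per key.
    static double falsePositiveRate(int bitsPerKey) {
        return Math.pow(0.6185, bitsPerKey);
    }

    // k = 0.7 * r, rounded to a whole number of hash functions (at least 1).
    static int numHashes(int bitsPerKey) {
        return Math.max(1, (int) Math.round(0.7 * bitsPerKey));
    }

    // Working backwards: smallest whole r whose false positive rate is <= targetP.
    static int bitsPerKeyFor(double targetP) {
        return (int) Math.ceil(Math.log(targetP) / Math.log(0.6185));
    }

    public static void main(String[] args) {
        for (int r = 8; r <= 48; r += 8) {
            System.out.printf("r = %d, k = %d, p = %.1e%n",
                    r, numHashes(r), falsePositiveRate(r));
        }
    }
}
```

For example, a 1% target false positive rate works out to 10 bits per key under these formulas.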

Bloom Filter Performance
100MM entries, 8 bits/key: 833k ops/s
100MM entries, 32 bits/key: 256k ops/s
1BN entries, 8 bits/key: 714k ops/s
1BN entries, 32 bits/key: 185k ops/s
Hypothesis: the difference between 100MM and 1BN is due to locality of memory access in the smaller bit vector

Hash-Oriented Storage
• HashFile: 64-bit clone of djb’s constant db “CDB”
• Plain ol’ Key/Value storage: add(byte[] k, byte[] v), byte[] lookup(byte[] k)
• Constant aka “Immutable” Data Store: create(), add(k, v) ..., build() ... before lookup(k)
• Use properties of hash table to achieve O(1) disk seeks per lookup

HashFile Structure
• Header (fixed width): table pointers; contains offsets of hash tables and count of elements per table
• Body (variable width): contains concatenation of all keys and values (with data lengths)
• Footer (fixed width): hash “tables” containing long hash values of keys alongside long offsets into body

HashFile Diagram
• Create: initialize empty header, start appending keys/values while recording offsets and hash values of keys
• Build: take list of hash values and offsets and turn them into hash tables, backfill header with values
• Lookup: compute hash(key), compute offset into table (hash modulo size of table), use table to find offset into body, return the value from body

HEADER: p1 s3 p2 s4 p3 s2 p4 s1
BODY: k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6 k7 v7
FOOTER: hk7 o7 hk3 o3 hk4 o4 hk1 o1
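The create / build / lookup flow can be sketched in memory. This is a simplified illustration, not the actual HashFile format: it keeps a single hash table instead of several, holds the body in a byte array rather than on disk, omits the header, and assumes 64-bit FNV-1a as the key hash and linear probing for collisions. The class name `MiniHashFile` is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MiniHashFile {
    private final ByteArrayOutputStream bodyOut = new ByteArrayOutputStream();
    private final List<long[]> entries = new ArrayList<>(); // {hash, body offset}
    private byte[] body;      // "BODY": keyLen, valueLen, key, value records
    private long[] slotHash;  // "FOOTER": hash value per slot
    private int[] slotOffset; // "FOOTER": body offset per slot, -1 = empty

    // Create phase: append the record to the body, remember hash and offset.
    void add(byte[] k, byte[] v) {
        entries.add(new long[] {hash(k), bodyOut.size()});
        byte[] lens = ByteBuffer.allocate(8).putInt(k.length).putInt(v.length).array();
        bodyOut.write(lens, 0, lens.length);
        bodyOut.write(k, 0, k.length);
        bodyOut.write(v, 0, v.length);
    }

    // Build phase: turn the recorded (hash, offset) pairs into a hash table.
    void build() {
        body = bodyOut.toByteArray();
        int size = Math.max(1, entries.size() * 2); // half-full, so probing ends
        slotHash = new long[size];
        slotOffset = new int[size];
        Arrays.fill(slotOffset, -1);
        for (long[] e : entries) {
            int slot = (int) Long.remainderUnsigned(e[0], size);
            while (slotOffset[slot] != -1) slot = (slot + 1) % size;
            slotHash[slot] = e[0];
            slotOffset[slot] = (int) e[1];
        }
    }

    // Lookup: hash modulo table size, probe, then verify the key in the body.
    byte[] lookup(byte[] k) {
        long h = hash(k);
        int size = slotOffset.length;
        int slot = (int) Long.remainderUnsigned(h, size);
        while (slotOffset[slot] != -1) {
            if (slotHash[slot] == h) {
                int off = slotOffset[slot];
                ByteBuffer buf = ByteBuffer.wrap(body);
                int kLen = buf.getInt(off), vLen = buf.getInt(off + 4);
                int kStart = off + 8, vStart = kStart + kLen;
                if (Arrays.equals(k, Arrays.copyOfRange(body, kStart, vStart))) {
                    return Arrays.copyOfRange(body, vStart, vStart + vLen);
                }
            }
            slot = (slot + 1) % size;
        }
        return null; // key not present
    }

    // 64-bit FNV-1a (assumed; the slides do not name HashFile's hash).
    private static long hash(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) { h ^= (b & 0xFF); h *= 0x100000001b3L; }
        return h;
    }
}
```

A typical call sequence is `add(k, v)` for every pair, one `build()`, then any number of `lookup(k)` calls; calling `lookup` before `build` is not supported in this sketch.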

HashFile Performance
• Spec: ≤ 2 disk seeks per lookup
• Number of seeks independent of number of entries
• X25E SSD: 1BN 8-byte keys, values (41GB): 650μs lookup w/ cold cache, up to 700x faster as filesystem cache warms, 0.9μs when in-memory
• With 100MM entries (4GB), cold cache is ~600μs (from locality), 0.6μs warm

Conclusions
• Be aware of different Hash Functions and their collision / performance tradeoffs
• Bloom Filters are extremely useful for fast, large-scale set membership
• HashFile provides excellent performance in cases where a static K/V store suffices

Future Work
• Implement cWow hash in Java
• Extend HashFile with configurable hash, pointer, and key/value lengths to conserve space (reduce 24 bytes-per-KV overhead)
• Implement a read-write (non-constant) version of HashFile
• Bloom Filter that spills to SSD

Thank You! ...Any questions? :)

References
• GitHub Project: g414-hash (hash function, bloom filter, HashFile implementations)
• Wikipedia: Hash Function, Bloom Filter
• Non-Cryptographic Hash Function Zoo
• DJB CDB, sg-cdb (Java implementation)
