
Hash Functions FTW

Sunny Gleason
November 12, 2010

Transcript

  1. Hash Functions FTW*
     Fast Hashing, Bloom Filters & Hash-Oriented Storage
     Sunny Gleason
     * For the win (see urbandictionary FTW [1]); this expression has nothing to do with hash functions
  2. What’s in this Presentation
     • Hash Function Survey
     • Hash Performance
     • Bloom Filters
     • HashFile: Hash Storage
  3. Hash Functions

     int getIntHash(byte[] data);   // 32-bit
     long getLongHash(byte[] data); // 64-bit

     int v1 = hash("foo");
     int v2 = hash("goo");

     int hash(byte[] value) { // a simple hash
       int h = 0;
       for (byte b : value) {
         h = (h << 5) ^ (h >> 27) ^ b;
       }
       return h % PRIME;
     }
  4. Hash Functions
     • Goal: v1 has many bit differences from v2
     • Desirable properties:
       • Uniform distribution (few collisions)
       • Very fast computation
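The goal above can be checked directly: hash two similar keys with the toy function from slide 3 and count the differing output bits with Integer.bitCount. Two details are assumptions of this sketch: PRIME's value (the deck does not give one), and the use of the unsigned shift >>> in place of the slide's signed >> so the mix stays well-behaved if intermediate values go negative.

```java
// Sketch: the slide's simple shift-xor hash, plus a Hamming-distance check
// of how many output bits differ between hashes of two similar keys.
public class HashDemo {
    static final int PRIME = 2147483629; // assumed value; any large prime works here

    static int hash(byte[] value) {
        int h = 0;
        for (byte b : value) {
            // mix: shift left 5, unsigned shift right 27, xor in the byte
            h = (h << 5) ^ (h >>> 27) ^ b;
        }
        return h % PRIME;
    }

    public static void main(String[] args) {
        int v1 = hash("foo".getBytes());
        int v2 = hash("goo".getBytes());
        // bitCount of the xor counts the differing bits (Hamming distance)
        int diff = Integer.bitCount(v1 ^ v2);
        System.out.println("differing bits: " + diff); // prints "differing bits: 1"
    }
}
```

For this pair the toy hash flips only a single output bit, which is exactly the failure mode slide 4 warns about and why better-mixed functions like Murmur2 are preferred.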
  5. Popular Hash Functions
     • FNV Hash
     • DJB Hash
     • Jenkins Hash
     • Murmur2
     • New (promising?): CrapWow
     • Awesome & slow: SHA-1, MD5, etc.
  6. Evaluating Hash Functions
     • Hash Function “Zoo”
     • Quality of: CRC32, DJB, Jenkins, FNV, Murmur2, SHA1
     • Performance (MM ops/s): [bar chart comparing throughput of Jenkins, Murmur2, and FNV across key sizes]
  7. A Strawman “Set”
     • N keys, K bytes per key
     • Allocate array of size K * N bytes
     • Utilize array storage as:
       • a heap or tree: O(lg N) insert/delete/remove
       • a hash: O(1) insert/delete/remove
     • What if we don’t have room for K * N bytes?
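The space question at the end of the slide is easy to make concrete with back-of-envelope arithmetic: full-key storage costs K * N bytes, versus r bits per key for the bit-vector approach the next slide introduces. The concrete N and K values below are illustrative assumptions, not numbers from the deck.

```java
// Sketch: memory cost of the strawman set (all keys stored) versus an
// r-bits-per-key bit vector. N = 100MM and K = 16 are assumed for illustration.
public class StrawmanSizing {
    static long fullKeyBytes(long n, long k) { return k * n; }

    public static void main(String[] args) {
        long n = 100_000_000L;                // 100MM keys (assumed)
        long k = 16;                          // 16 bytes per key (assumed)
        long full = fullKeyBytes(n, k);       // all keys in memory: 1.6e9 bytes
        long bitVector = n;                   // r = 8 bits/key = 1 byte/key
        System.out.println("full keys : " + full + " bytes");      // ~1.5 GiB
        System.out.println("bit vector: " + bitVector + " bytes"); // ~95 MiB
    }
}
```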
  8. Bloom Filter
     • Key point: give up on storing all the keys
     • Store r bits per key instead of K bytes
     • Allocate bit vector of size M = r * N, where N is the expected number of entries
     • Use multiple hash functions of the key to determine which bits to set
     • Premise: if hash functions are well-distributed, few collisions, high accuracy
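The mechanics above fit in a few lines of Java. This is a minimal sketch, not the deck's implementation: the two base hashes are derived from String.hashCode for brevity (the deck would use a fast function like Murmur2), and the double-hashing trick g_i(x) = h1(x) + i * h2(x) is a standard way to get k hash functions from two.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: an M = r * N bit vector and k hash
// functions choosing which bits to set per key.
public class TinyBloom {
    private final BitSet bits;
    private final int m;   // number of bits (M = r * N)
    private final int k;   // number of hash functions

    TinyBloom(int expectedN, int bitsPerKey, int numHashes) {
        this.m = expectedN * bitsPerKey;
        this.k = numHashes;
        this.bits = new BitSet(m);
    }

    // i-th derived hash via double hashing, mapped into [0, m)
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, m);
    }

    private int h2(int h1) {
        return Integer.rotateLeft(h1, 16) ^ 0x9e3779b9; // assumed second hash
    }

    void add(String key) {
        int h1 = key.hashCode();
        for (int i = 0; i < k; i++) bits.set(index(h1, h2(h1), i));
    }

    boolean mightContain(String key) {
        int h1 = key.hashCode();
        for (int i = 0; i < k; i++)
            if (!bits.get(index(h1, h2(h1), i))) return false; // definitely absent
        return true; // possibly present: false positives are allowed
    }

    public static void main(String[] args) {
        TinyBloom bloom = new TinyBloom(1000, 8, 6); // r = 8, k ≈ 0.7 * r
        bloom.add("foo");
        bloom.add("bar");
        System.out.println(bloom.mightContain("foo")); // true: was added
        System.out.println(bloom.mightContain("baz")); // false with high probability
    }
}
```

Note the asymmetry: a negative answer is always correct, while a positive answer is only probably correct, which is what the tuning math on the next slide quantifies.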
  9. Tuning Bloom Filters
     Let r = M bits / N keys (r: num bits per key)
     Let k = 0.7 * r (k: num hashes to use)
     Let p = 0.6185 ** r (p: probability of false positives)
     Working backwards, we can use the desired false-positive rate p to tune the data structure’s space consumption:
       r = 8,  p = 2.1e-2
       r = 16, p = 4.5e-4
       r = 24, p = 9.8e-6
       r = 32, p = 2.1e-7
       r = 40, p = 4.5e-9
       r = 48, p = 9.6e-11
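The table above follows directly from the rule of thumb p = 0.6185 ** r, so it can be reproduced in a couple of lines:

```java
// Sketch: regenerate the slide's false-positive table from p = 0.6185^r.
public class BloomTuning {
    static double falsePositiveRate(int bitsPerKey) {
        return Math.pow(0.6185, bitsPerKey);
    }

    public static void main(String[] args) {
        for (int r = 8; r <= 48; r += 8) {
            System.out.printf("r = %d, p = %.1e%n", r, falsePositiveRate(r));
        }
    }
}
```

Each additional 8 bits per key buys roughly two more decimal orders of magnitude of accuracy, which is why the slide's r column steps by 8.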
  10. Bloom Filter Performance
      100MM entries, 8 bits/key:  833k ops/s
      100MM entries, 32 bits/key: 256k ops/s
      1BN entries,   8 bits/key:  714k ops/s
      1BN entries,   32 bits/key: 185k ops/s
      Hypothesis: the difference between 100MM and 1BN is due to locality of memory access in the smaller bit vector
  11. Hash-Oriented Storage
      • HashFile: 64-bit clone of djb’s constant database “CDB”
      • Plain ol’ key/value storage: add(byte[] k, byte[] v), byte[] lookup(byte[] k)
      • Constant aka “immutable” data store: create(), add(k, v) ..., build() ... before lookup(k)
      • Use properties of the hash table to achieve O(1) disk seeks per lookup
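The write-once contract above can be sketched as a small class. The method names follow the slide; the in-memory backing and linear scan are assumptions for brevity, where the real HashFile writes to disk and looks up in O(1) via its hash tables.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the constant-store usage contract: all add() calls happen
// before build(), and lookup() is only valid after build().
public class ConstantStoreSketch {
    private final List<byte[]> keys = new ArrayList<>();
    private final List<byte[]> values = new ArrayList<>();
    private boolean built = false;

    public void add(byte[] k, byte[] v) {
        if (built) throw new IllegalStateException("immutable after build()");
        keys.add(k);
        values.add(v);
    }

    public void build() {
        built = true; // real impl: write the footer hash tables here
    }

    public byte[] lookup(byte[] k) {
        if (!built) throw new IllegalStateException("call build() first");
        for (int i = 0; i < keys.size(); i++)        // real impl: O(1) hash lookup
            if (Arrays.equals(keys.get(i), k)) return values.get(i);
        return null; // key absent
    }
}
```

Giving up mutability is what makes the on-disk layout so simple: once build() has run, the table offsets never move, so a lookup needs no locking and a bounded number of seeks.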
  12. HashFile Structure
      • Header (fixed width): table pointers; contains offsets of hash tables and count of elements per table
      • Body (variable width): contains the concatenation of all keys and values (with data lengths)
      • Footer (fixed width): hash “tables” containing long hash values of keys alongside long offsets into the body
  13. HashFile Diagram
      • Create: initialize empty header, start appending keys/values while recording offsets and hash values of keys
      • Build: take the list of hash values and offsets, turn them into hash tables, and backfill the header with values
      • Lookup: compute hash(key), compute the offset into the table (hash modulo table size), use the table to find the offset into the body, return the value from the body

      HEADER p1s3p2s4p3s2p4s1
      BODY   k1v1k2v2k3v3k4v4k5v5k6v6k7v7
      FOOTER hk7o7hk3o3hk4o4hk1o1
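The lookup path above can be sketched with the footer modeled as parallel arrays of (hash, offset) pairs. This is an assumption-laden simplification: one flat table instead of HashFile's multiple tables, and no comparison of the actual key bytes, which the real store must also do to reject hash collisions.

```java
// Sketch of the lookup arithmetic: hash mod table size picks a slot,
// linear probing finds the matching stored hash, and the paired offset
// points into the body.
public class HashFileLookupSketch {
    final long[] tableHashes;   // hk1..hkN from the footer
    final long[] tableOffsets;  // o1..oN: offsets into the body
    final byte[][] body;        // stand-in for the concatenated key/value body

    HashFileLookupSketch(long[] hashes, long[] offsets, byte[][] body) {
        this.tableHashes = hashes;
        this.tableOffsets = offsets;
        this.body = body;
    }

    byte[] lookup(long keyHash) {
        int size = tableHashes.length;
        int slot = (int) Math.floorMod(keyHash, (long) size); // hash mod table size
        for (int i = 0; i < size; i++) {
            int probe = (slot + i) % size;                    // linear probe
            if (tableHashes[probe] == keyHash)
                return body[(int) tableOffsets[probe]];       // follow body offset
        }
        return null; // probed the whole table: key absent
    }
}
```

Because the footer tables are fixed width and sized at build() time, this whole path touches at most one table region and one body region, which is where the two-seek bound on the next slide comes from.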
  14. HashFile Performance
      • Spec: ≤ 2 disk seeks per lookup
      • Number of seeks independent of number of entries
      • X25-E SSD, 1BN 8-byte keys and values (41GB): 650μs lookup with a cold cache, up to 700x faster as the filesystem cache warms, 0.9μs when in-memory
      • With 100MM entries (4GB), cold cache is ~600μs (from locality), 0.6μs warm
  15. Conclusions
      • Be aware of different hash functions and their collision/performance tradeoffs
      • Bloom filters are extremely useful for fast, large-scale set membership
      • HashFile provides excellent performance in cases where a static K/V store suffices
  16. Future Work
      • Implement the CrapWow (cWow) hash in Java
      • Extend HashFile with configurable hash, pointer, and key/value lengths to conserve space (reduce the 24-bytes-per-KV overhead)
      • Implement a read-write (non-constant) version of HashFile
      • A Bloom filter that spills to SSD
  17. References
      • GitHub project: g414-hash (hash function, Bloom filter, and HashFile implementations)
      • Wikipedia: Hash Function, Bloom Filter
      • Non-Cryptographic Hash Function Zoo
      • DJB’s CDB; sg-cdb (Java implementation)