Amusing Algorithms and Data Structures

Elastic Co
March 10, 2015

This talk was presented at the inaugural Elastic{ON} conference, http://elasticon.com

Session Abstract:

When you want to make search fast, 80% of the job involves organizing your data so that it can be accessed with as little work as possible. This is the exact reason why Elasticsearch is based on an inverted index.

But there are some very interesting algorithms and data structures involved in that last 20% of the job. In this talk, you will gain insights into some internals of Elasticsearch and see how priority queues, finite state machines, bit twiddling hacks and several other algorithms and data structures power Elasticsearch.

Transcript

  1. { } CC-BY-ND 4.0 Agenda
    • conjunctions
    • regexp queries
    • numeric doc values compression
    • cardinality aggregation
  2. Inverted index
    Terms dictionary: index, lucene, elastic, shard
    Postings lists: 2 10 49 1 5 2 5 49 2 9 10 50 52
  3. Inverted index
    next, next: step through a postings list one entry at a time
  4. Inverted index
    • advance(30) on a postings list: uses skip lists
    • advance(“search”) on the terms dictionary: uses a tiny in-memory terms index
  5. Conjunctions
    Postings lists to intersect:
      A: 1 3 13 20 35 80 98
      B: 2 13 17 20 98
      C: 1 13 22 35 98 99
    1. Sort by cost
  6. Conjunctions
    1. Sort by cost
    2. Leap frog!
  7-18. Conjunctions (one trace, built up step by step across these slides)
    B: next → 2
    C: advance(2) → 13 TOO FAR
    B: advance(13) → 13
    C: already on 13
    A: advance(13) → 13 MATCH
    B: next → 17
    C: advance(17) → 22 TOO FAR
    B: advance(22) → 98
    C: advance(98) → 98
    A: advance(98) → 98 MATCH
    B: next → ∞ END
    Matches: 13, 98
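
The leapfrog trace above can be sketched in a few lines of Python. This is only an illustration, not Lucene's implementation: the `Postings` class and `conjunction` function are hypothetical names, the cost is assumed to be the list length, and a binary search stands in for the skip lists mentioned on the earlier slide.

```python
from bisect import bisect_left

END = float("inf")  # sentinel: the iterator is exhausted


class Postings:
    """Cursor over one sorted postings list (hypothetical helper)."""

    def __init__(self, doc_ids):
        self.doc_ids = doc_ids
        self.pos = -1  # not positioned on any document yet

    def doc(self):
        if self.pos < 0:
            return -1
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else END

    def next(self):
        self.pos += 1
        return self.doc()

    def advance(self, target):
        # Move to the first doc ID >= target, never moving backwards.
        self.pos = bisect_left(self.doc_ids, target, max(self.pos, 0))
        return self.doc()


def conjunction(lists):
    # 1. Sort by cost: the cheapest (shortest) list leads the iteration.
    iters = sorted((Postings(l) for l in lists), key=lambda it: len(it.doc_ids))
    lead, others = iters[0], iters[1:]
    matches = []
    doc = lead.next()
    # 2. Leap frog: each overshoot becomes the lead's next target.
    while doc != END:
        for other in others:
            if other.doc() < doc:
                other.advance(doc)
            if other.doc() > doc:  # TOO FAR: the lead must catch up
                doc = lead.advance(other.doc())
                break
        else:  # every iterator sits on the same document
            matches.append(doc)
            doc = lead.next()
    return matches


print(conjunction([[1, 3, 13, 20, 35, 80, 98],
                   [2, 13, 17, 20, 98],
                   [1, 13, 22, 35, 98, 99]]))
```

Leading with the shortest list bounds the number of outer steps by the size of the rarest term's postings, which is why the lists are sorted by cost first.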
  19. Regexp queries
    Terms dictionary: index, lucene, elastic, search, shard
    Postings lists: 2 10 49 1 5 2 5 49 5 10 50 2 9 10
    Challenge: find matching terms and merge postings lists
    Naive way:
    - iterate over terms
    - evaluate regexp against every term
    SLOWWWWWW
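
For reference, the naive approach described on this slide might look like the sketch below. The function name, the index layout (a dict from term to sorted doc IDs), and the postings values in the example are all assumptions made for illustration; they are not taken from Elasticsearch.

```python
import re
from heapq import merge


def naive_regexp_query(terms_to_postings, pattern):
    """Evaluate the regexp against every term, then merge the matching postings."""
    regex = re.compile(pattern)
    matching = [postings
                for term, postings in sorted(terms_to_postings.items())
                if regex.fullmatch(term)]  # one regexp evaluation per term
    result = []
    for doc in merge(*matching):  # k-way merge of sorted postings lists
        if not result or result[-1] != doc:  # drop duplicate doc IDs
            result.append(doc)
    return result


# Hypothetical index contents, for illustration only.
index = {
    "elastic": [2, 10, 49],
    "index": [1, 5],
    "lucene": [2, 5, 49],
    "search": [5, 10, 50],
    "shard": [2, 9, 10],
}
print(naive_regexp_query(index, r"s.*"))
```

The cost grows with the number of terms in the dictionary rather than with the number of matching terms, which is what makes this approach slow on large indices.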
  20. Regexp queries
    • Not limited to regexps
    • Fuzzy queries too!
      – example: es~1
  21. How are numeric doc values compressed?
    a column-stride, on-disk, un-inverted index
  22. Aggregation Execution
    What is the average price of green docs?
    Inverted index (“color” → Doc IDs):
      blue  → 0
      green → 5, 20, 22
      red   → 12
  23. Aggregation Execution
    Field data (Doc ID → “price”):
      0 → 10, 5 → 20, 10 → 20, 12 → 60, 20 → 60, 22 → 20
    Average price of green docs: (20 + 60 + 20) / 3 = 33.33
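
The two lookups on these slides can be sketched as follows, using the slide's own data. The variable and function names are hypothetical: the inverted index answers "which docs are green?", and the per-document column answers "what is the price of doc N?".

```python
# Inverted index: term -> sorted doc IDs (data from the slide).
color_index = {"blue": [0], "green": [5, 20, 22], "red": [12]}

# Un-inverted column: doc ID -> value (data from the slide).
price_column = {0: 10, 5: 20, 10: 20, 12: 60, 20: 60, 22: 20}


def average_price(color):
    doc_ids = color_index[color]                 # step 1: find matching docs
    prices = [price_column[d] for d in doc_ids]  # step 2: per-doc value lookup
    return sum(prices) / len(prices)


print(round(average_price("green"), 2))
```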
  24. Field Data and Doc Values
    Field Data
    • In-memory, lives on JVM Heap
    • All-or-nothing
    • Lazily constructed at query-time
    Doc Values
    • Disk-based, leverages OS FS cache
    • Pages in/out of FS cache
    • Precomputed at index-time
    • Allows better compression
  25. Numerics: one unique value (Constant Encoding)
    Doc ID:  0  5  10 12 20 22
    “price”: 10 10 10 10 10 10
    • Easy :)
    • Write the constant value and set a flag
    • 4 bytes to represent n values
  26. Numerics: < 256 unique values (Table Encoding)
    • Write a table of all possible values, then encode data as bit-packed ordinals
    • Great compression when few unique values
    • Better when num_docs >> num_values
    • Best case is 1 bit/doc, worst case is 1 byte/doc
  27. Numerics: < 256 unique values (Table Encoding)
    Doc ID:  0  5  10 12 20 22
    “price”: 50 20 20 20 10 50
    Only three unique values (10, 20, 50)
  28. Table Encoding
    De-dupe, sort: [10, 20, 50]
  29. Table Encoding
    Write Longs: 00 00 00 0A  00 00 00 14  00 00 00 32
  30. Table Encoding
    Encode with min bits
  31-32. Table Encoding
    min_bits = msb( table.size() - 1 ); 3 values = 00000011, most significant bit → 2 bits
    Encode with min bits: 11 10 10 10 01 11
  33. Table Encoding
    Pack Bytes: 0E 0A 07
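
A minimal sketch of table encoding, under a few assumptions: ordinals are zero-based positions in the sorted table (so the bit patterns need not match the slide's), the packed result is kept as a single Python integer rather than a byte stream, and the function names are hypothetical.

```python
def table_encode(values):
    table = sorted(set(values))  # de-dupe, sort
    if len(table) >= 256:
        raise ValueError("table encoding is meant for < 256 unique values")
    # Minimum bits needed for the largest ordinal; at least 1 bit per doc.
    bits_per_value = max(1, (len(table) - 1).bit_length())
    ordinal_of = {value: i for i, value in enumerate(table)}
    ordinals = [ordinal_of[v] for v in values]
    packed = 0
    for ordinal in ordinals:  # bit-pack, first document in the highest bits
        packed = (packed << bits_per_value) | ordinal
    return table, bits_per_value, ordinals, packed


def table_decode(table, bits_per_value, count, packed):
    mask = (1 << bits_per_value) - 1
    return [table[(packed >> ((count - 1 - i) * bits_per_value)) & mask]
            for i in range(count)]


prices = [50, 20, 20, 20, 10, 50]
table, bits, ordinals, packed = table_encode(prices)
print(table, bits, ordinals, format(packed, "012b"))
```

Six documents need 12 bits here instead of six full-width numbers, and the saving grows as the number of documents outpaces the number of unique values.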
  34. Numerics: Common Denominator (GCD Encoding)
    • Certain types of data share common denominators
    • E.g. timestamps without ms precision
      • 142542454000
      • 142542455000
      • 142542456000
    • If a gcd is found, can encode multiples of the gcd interval in fewer bits
    Here the gcd is 1000: the values are 142542454000, 142542454000 + 1 × gcd and 142542454000 + 2 × gcd
  35. Numerics: Common Denominator (GCD Encoding)
    Doc ID:  0  5  10 12 20 22
    “price”: 50 20 20 20 10 50
    Share 10 as common divisor
  36. GCD Encoding
    GCD Encode = (value - minValue) / gcd = (50 - 10) / 10 = 4
  37. GCD Encoding
    GCD Encode: [4, 1, 1, 1, 0, 4]
  38. GCD Encoding
    Encode with min bits: 100 001 001 001 000 100
  39. GCD Encoding
    Pack Bytes: 02 01 02 04 04
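
The formula (value - minValue) / gcd can be sketched as below. It assumes the gcd is taken over the offsets from the minimum value, which gives 10 for the price example and 1000 for the timestamps; the function names are hypothetical.

```python
from functools import reduce
from math import gcd


def gcd_encode(values):
    min_value = min(values)
    # gcd of all offsets from the minimum; 0 only if every value is equal.
    divisor = reduce(gcd, (v - min_value for v in values), 0) or 1
    encoded = [(v - min_value) // divisor for v in values]
    return min_value, divisor, encoded


def gcd_decode(min_value, divisor, encoded):
    return [min_value + e * divisor for e in encoded]


# gcd_encode([50, 20, 20, 20, 10, 50]) -> (10, 10, [4, 1, 1, 1, 0, 4])
print(gcd_encode([50, 20, 20, 20, 10, 50]))
print(gcd_encode([142542454000, 142542455000, 142542456000]))
```

The largest encoded price is 4, so 3 bits per document are enough, and the second-precision timestamps shrink to the small integers 0, 1 and 2.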
  40. Numerics: If all else fails (Delta Encoding)
    • If we can’t use any “tricks”, delta encode
    • Encode everything as an offset from the minValue
    Diagram labels: Offset; minimum value 199872
    Basically less-good GCD encoding!
  41. Numerics: If all else fails (Delta Encoding)
    Doc ID:  0  5  10 12 20 22
    “price”: 3  2  2  4  5  6
    Delta Encode = (value - minValue) = (3 - 2) = 1
  42. Numerics: If all else fails (Delta Encoding)
    Delta Encode: [1, 0, 0, 2, 3, 4]
    Encode with min bits: 001 000 000 010 011 100
    Pack Bytes: 08 00 09 0C
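
A short sketch of the delta step, using the slide's values. The function name is hypothetical, and the bit patterns are rendered as strings purely for display.

```python
def delta_encode(values):
    min_value = min(values)
    offsets = [v - min_value for v in values]  # value - minValue
    # Minimum bits needed for the largest offset; at least 1 bit per doc.
    bits_per_value = max(1, max(offsets).bit_length())
    return min_value, bits_per_value, offsets


min_value, bits, offsets = delta_encode([3, 2, 2, 4, 5, 6])
print(offsets)
print(" ".join(format(o, f"0{bits}b") for o in offsets))
```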
  43. No time to talk about strings, sorry! Slides at end of presentation if you’re curious.
  44. How does the Cardinality agg work?
    bit-pattern observable magic
  45. Distinct Counts: Naive Solution
    Maintain a map of all values (blue, green, red, …)
    • Cardinality == map.size()
    • map.size() == n
    • Memory usage == n * size of each term (ignoring map overhead)
  46. Distinct Counts: Naive Solution
    Gets worse in distributed environment: Node 1, Node 2 and Node 3 each hold their own map of values (blue, green, red, …; abc, xyz, …; abc, green, xyz, …)
  47. Distinct Counts: Naive Solution
    Gets worse in distributed environment: the per-node maps from Node 1 to Node 4 are merged into one
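
The naive solution amounts to the sketch below, with Python sets standing in for the maps on the slides and a hypothetical function name. Every node keeps every distinct term it has seen, and the merge step has to combine all of them.

```python
def naive_cardinality(values_per_node):
    # Each node holds one entry per distinct term: memory grows with n.
    node_sets = [set(values) for values in values_per_node]
    # The merging node needs the union of all of them.
    merged = set().union(*node_sets)
    return len(merged)


nodes = [
    ["blue", "green", "red"],
    ["abc", "xyz"],
    ["abc", "green", "xyz"],
]
print(naive_cardinality(nodes))
```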
  48. Distinct Counts: HyperLogLog++
    Cardinality agg uses HyperLogLog++ instead
    • Approximates cardinality
    • Uses only a few KB of memory for billions of distinct values
    • < 5% error (adjustable)
    • Fast!
    • Lossless unions
  49. Bit-Observable Patterns
    Let’s flip some coins…
    Probability of a “run” of length n: 1/2^n
  50. Bit-Observable Patterns
    5 heads in a row: 1/32
  51. Bit-Observable Patterns
    20 heads in a row: 1/1048576
  52. Bit-Observable Patterns
    5 heads in a row: could do this in one sitting
    20 heads in a row: might take all day
  53. Key Insight: Length of the run ~= duration of coin flipping
  54. Bit-Observable Patterns
    Let’s hash values, instead of flipping coins…
    v = 12345, h(v) = cbf5a = 11001011111101011010
    Run of 1 zero
  55. Bit-Observable Patterns
    Set “register” to 1
  56. Bit-Observable Patterns
    v = 3456, h(v) = 8D338 = 10001101001100111000
    Run of 3 zeros (register holds 1)
  57. Bit-Observable Patterns
    Set “register” to 3
  58. Bit-Observable Patterns
    v = 948, h(v) = 47D34 = 01000111110100110100
    Run of 2 zeros: don’t update register (register stays 3)
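
The single-register walkthrough can be replayed with the sketch below. The hash function itself is not shown in the slides, so the three 20-bit hash values are fed in directly; the function names are hypothetical, and reading the register as roughly 2^register distinct values is only the crude, single-counter estimate.

```python
def trailing_zero_run(hash_value, width):
    """Length of the run of zero bits at the low end of a `width`-bit hash."""
    if hash_value == 0:
        return width
    # hash_value & -hash_value isolates the lowest set bit.
    return (hash_value & -hash_value).bit_length() - 1


def observe(register, hash_value, width=20):
    # Keep only the longest run seen so far.
    return max(register, trailing_zero_run(hash_value, width))


register = 0
for h in (0xCBF5A, 0x8D338, 0x47D34):  # hashes of 12345, 3456, 948 per the slides
    register = observe(register, h)

rough_estimate = 2 ** register  # a run of n zeros suggests about 2^n values
print(register, rough_estimate)
```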
  59. Key Insight: Length of the run ~= cardinality (replacing “duration of coin flipping”)
    Probability of a “run” of length n: 1/2^n
    • 5 zeros in a row: 1/32, so ~32 distinct values
    • 20 zeros in a row: 1/1048576, so ~1048576 distinct values
  60. What if you get unlucky on first value?
    v = 938, h(v) = 0400 = 0000010000000000
    Run of 10 zeros: oops :(
  61. Solution: keep multiple counters (Stochastic Averaging)
    v = 938, h(v) = 0400 = 0000010000000000
  62. Solution: keep multiple counters (Stochastic Averaging)
    Run of 10 zeros
    Use first 3 bits as register index
  63. Solution: keep multiple counters (Stochastic Averaging)
    Set register[0] to 10
  64. Solution: keep multiple counters (Stochastic Averaging)
    v = 7482, h(v) = 9D3A = 1001110100111010
    Run of 1 zero
    Use first 3 bits as register index (register[0] holds 10)
  65. Solution: keep multiple counters (Stochastic Averaging)
    Set register[4] to 1
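
A sketch of the multi-register update, following the slides literally: 16-bit hashes, the first 3 bits select one of 8 registers, and the run of zeros is counted at the low end of the whole hash. The constants and function names are assumptions for illustration; production HyperLogLog implementations differ in how they split the hash.

```python
HASH_BITS = 16
INDEX_BITS = 3  # first 3 bits -> 2**3 = 8 registers


def trailing_zero_run(hash_value, width=HASH_BITS):
    if hash_value == 0:
        return width
    return (hash_value & -hash_value).bit_length() - 1


def update(registers, hash_value):
    index = hash_value >> (HASH_BITS - INDEX_BITS)  # top 3 bits of the hash
    run = trailing_zero_run(hash_value)
    registers[index] = max(registers[index], run)


registers = [0] * (1 << INDEX_BITS)
update(registers, 0x0400)  # h(938):  index 000 -> register[0], run of 10
update(registers, 0x9D3A)  # h(7482): index 100 -> register[4], run of 1
print(registers)
```

An unlucky long run now affects only one register, so its influence on the combined estimate is limited.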
  66. Cardinality is the Harmonic Mean of the registers (and some empirical constants)
    Stochastic Averaging
    Registers: 10 1 1 8 3
    Harmonic Mean = 1.9544
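
The harmonic mean quoted on this slide can be checked directly. This sketch covers only the mean; the "empirical constants" that turn it into a cardinality estimate are not given in the slides and are left out.

```python
def harmonic_mean(values):
    # n divided by the sum of reciprocals; dominated by the small values,
    # so one unusually long run (like 10) cannot inflate the result much.
    return len(values) / sum(1 / v for v in values)


registers = [10, 1, 1, 8, 3]
print(round(harmonic_mean(registers), 4))
```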
  67. Other neat attributes: Unions are lossless!
    [10 1 1 8 3] U [4 5 1 2 7] = [10 5 1 8 7]
    Take max of each register
  68. Other neat attributes: Which is perfect for distributed environments
    Node 1, Node 2, Node 3, Node 4 → Merge
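
Merging sketches is an element-wise maximum, as in the example below. The function names are hypothetical, and the third node's registers are made-up values added only to show a multi-node merge.

```python
from functools import reduce


def union(registers_a, registers_b):
    # Take the max of each register pair.
    return [max(a, b) for a, b in zip(registers_a, registers_b)]


def merge_nodes(per_node_registers):
    return reduce(union, per_node_registers)


print(union([10, 1, 1, 8, 3], [4, 5, 1, 2, 7]))
# The third row is a made-up node, for illustration only.
print(merge_nodes([[10, 1, 1, 8, 3], [4, 5, 1, 2, 7], [0, 0, 9, 0, 0]]))
```

Because each register already stores a maximum, merging per-node registers gives the same registers as feeding every value through a single node, and only the small fixed-size arrays travel over the network.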
  69. In closing… Stop worrying and learn to love approximate algorithms
  70. This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License (CC-BY-ND 4.0). To view a copy of this license, visit: http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA
  71. Strings: Just a big ol' blob
    • Strings are simpler
    • Basic idea is to:
      • Encode a term dictionary in a binary blob
      • Compress ordinal using numeric compression
    • Three schemes to encode the blob:
      • Fixed, Variable, Prefix
    Term Dictionary (Ord ID → Term): 0 → aaa, 1 → bbb, 2 → abc
    Ordinal Map (Doc ID → “widget” ordinal): 0 → 0, 5 → 1, 10 → 2
  72. Strings: Equal-sized terms (Fixed-width Encoding)
    Term Dictionary (Ord ID → Term): 0 → aaa, 1 → bbb, 2 → abc
    Ordinal Map (Doc ID → “widget” ordinal): 0 → 0, 5 → 1, 10 → 2
    Serialize bytes: 61 61 61 (‘aaa’) 62 62 62 (‘bbb’) 61 62 63 (‘abc’) …
  73. Strings: Equal-sized terms (Fixed-width Encoding)
    Compress as Numeric data
  74. Strings: Variable-sized, < 1024 terms (Variable-width Encoding)
    Term Dictionary (Ord ID → Term): 0 → a, 1 → ab, 2 → abc
    Ordinal Map (Doc ID → “widget” ordinal): 0 → 0, 5 → 1, 10 → 2
    Serialize bytes: 61 (‘a’) 61 62 (‘ab’) 61 62 63 (‘abc’) …
  75. Strings: Variable-sized, < 1024 terms (Variable-width Encoding)
    Pack lengths in VarInts: 06
  76. Strings: Variable-sized, < 1024 terms (Variable-width Encoding)
    Compress as Numeric data
  77. Strings: Everything Else (Prefix Encoding)
    Term Dictionary (Ord ID → Term): 0 → aa, 1 → aaa, 2 → abc
    Serialize first: 61 61 (‘aa’)
  78. Strings: Everything Else (Prefix Encoding)
    Write prefix length: 02
  79. Strings: Everything Else (Prefix Encoding)
    Write remaining bytes: 61 (‘a’)
  80. Strings: Everything Else (Prefix Encoding)
    Write prefix length: 01
    Write remaining bytes: 62 63 (‘bc’)
  81. Strings: Everything Else (Prefix Encoding)
    Re-serialize prefix start point every 16 terms
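
The prefix-encoding steps above can be sketched as follows. The function names and the (prefix length, remaining bytes) tuple layout are hypothetical; the real on-disk format also packs these numbers as VarInts, which is omitted here.

```python
def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(sorted_terms, block_size=16):
    entries = []
    previous = b""
    for i, term in enumerate(sorted_terms):
        data = term.encode("utf-8")
        if i % block_size == 0:
            # Start of a block: serialize the whole term again.
            entries.append((0, data))
        else:
            n = shared_prefix_len(previous, data)
            entries.append((n, data[n:]))  # prefix length + remaining bytes
        previous = data
    return entries


def prefix_decode(entries):
    terms, previous = [], b""
    for prefix_len, suffix in entries:
        previous = previous[:prefix_len] + suffix
        terms.append(previous.decode("utf-8"))
    return terms


print(prefix_encode(["aa", "aaa", "abc"]))
```

Restarting with a full term every 16 entries bounds how many entries must be replayed to rebuild any single term.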
  82. Strings: Everything Else (Prefix Encoding)
    When done, write a ReverseTermIndex every 1024 terms
    Position → Term: 0 → aa, 1024 → gef, 2048 → xyz
    Pack as VarInts
  83. Strings: Everything Else (Prefix Encoding)
    Finally, write out Ordinal Map (Doc ID → “widget” ordinal): 0 → 0, 5 → 1, 10 → 2
    Compress as Numeric data