{ } CC-BY-ND 4.0
Conjunctions
12
1
3
13
20
35
80
2
13
17
20
98
98
1
13
22
35
98
99
next → 2
advance(2) → 13 TOO FAR
advance(13) → 13
already on 13
Slide 13
Conjunctions
1
3
13
20
35
80
2
13
17
20
98
98
1
13
22
35
98
99
next → 2
advance(2) → 13 TOO FAR
advance(13) → 13
already on 13
advance(13) → 13 MATCH
13
Slide 14
Conjunctions
1
3
13
20
35
80
2
13
17
20
98
98
1
13
22
35
98
99
next → 2
advance(2) → 13 TOO FAR
advance(13) → 13
already on 13
advance(13) → 13 MATCH
next → 17
13
Slide 15
Conjunctions
1
3
13
20
35
80
2
13
17
20
98
98
1
13
22
35
98
99
next → 2
advance(2) → 13 TOO FAR
advance(13) → 13
already on 13
advance(13) → 13 MATCH
next → 17
advance(17) → 22 TOO FAR
13
Slide 17
Conjunctions
1
3
13
20
35
80
2
13
17
20
98
98
1
13
22
35
98
99
next → 2
advance(2) → 13 TOO FAR
advance(13) → 13
already on 13
advance(13) → 13 MATCH
next → 17
advance(17) → 22 TOO FAR
advance(22) → 98
13
Slide 18
Conjunctions
1
3
13
20
35
80
2
13
17
20
98
98
1
13
22
35
98
99
next → 2
advance(2) → 13 TOO FAR
advance(13) → 13
already on 13
advance(13) → 13 MATCH
next → 17
advance(17) → 22 TOO FAR
advance(22) → 98
advance(98) → 98
13
Slide 19
Conjunctions
1
3
13
20
35
80
2
13
17
20
98
98
1
13
22
35
98
99
next → 2
advance(2) → 13 TOO FAR
advance(13) → 13
already on 13
advance(13) → 13 MATCH
next → 17
advance(17) → 22 TOO FAR
advance(22) → 98
advance(98) → 98
advance(98) → 98 MATCH
13
98
Slide 20
Conjunctions
1
3
13
20
35
80
2
13
17
20
98
98
1
13
22
35
98
99
next → 2
advance(2) → 13 TOO FAR
advance(13) → 13
already on 13
advance(13) → 13 MATCH
next → 17
advance(17) → 22 TOO FAR
advance(22) → 98
advance(98) → 98
advance(98) → 98 MATCH
next → ∞ END
13
98
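The trace above can be replayed with a small leapfrog intersection. This is only a sketch: the `Postings` class and `conjunction` function are hypothetical names, and the three lists are an assumed reading of the figure (the extracted numbers are ambiguous), chosen so that the lead list starts at 2 and all three lists contain 13 and 98.

```python
from bisect import bisect_left

END = float("inf")  # sentinel doc ID: the iterator is exhausted


class Postings:
    """Cursor over a sorted list of doc IDs (stand-in for a postings list)."""

    def __init__(self, doc_ids):
        self.doc_ids = sorted(doc_ids)
        self.pos = -1   # not positioned yet
        self.doc = -1

    def _settle(self):
        self.doc = self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else END
        return self.doc

    def next(self):
        self.pos += 1
        return self._settle()

    def advance(self, target):
        # Jump to the first doc ID >= target, never moving backwards.
        # Binary search stands in for whatever skipping the index provides.
        self.pos = bisect_left(self.doc_ids, target, max(self.pos, 0))
        return self._settle()


def conjunction(lists):
    """Return the doc IDs present in every list; the first list is the lead."""
    iters = [Postings(doc_ids) for doc_ids in lists]
    lead, others = iters[0], iters[1:]
    matches = []
    candidate = lead.next()
    while candidate != END:
        for it in others:
            if it.doc < candidate:
                it.advance(candidate)
            if it.doc > candidate:
                # TOO FAR: catch the lead up to this doc ID and start over
                candidate = lead.advance(it.doc)
                break
        else:
            matches.append(candidate)  # MATCH: every iterator sits on candidate
            candidate = lead.next()
    return matches


lead_list = [2, 13, 17, 20, 98]
second = [1, 13, 22, 35, 98, 99]
third = [1, 3, 13, 20, 35, 80, 98]
print(conjunction([lead_list, second, third]))
```

With these assumed lists the calls follow the same order as the slides: next gives 2, advance(2) overshoots to 13, the lead catches up to 13, and so on until the lead runs out after 98.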
Slide 21
How do regexp queries work?
Slide 22
Regexp queries
[Figure: terms dictionary with the terms index, lucene, elastic, search and shard, each pointing to its postings list of doc IDs]
Challenge: find matching terms and merge postings lists
Naive way:
- iterate over terms
- evaluate regexp against every term
SLOWWWWWW
Slide 23
Regexp queries
Ela[Ss]tic.*
[Figure: automaton for Ela[Ss]tic.*, with transitions E, l, a, then S or s, then t, i, c, then any characters (*)]
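One way to picture why an automaton beats the naive scan: walk the automaton and the terms together and abandon a branch as soon as the automaton has no transition for it. The sketch below is an illustration only. The DFA is written by hand for Ela[Ss]tic.*, a plain trie stands in for the terms dictionary, and the term list is made up.

```python
ANY = None  # wildcard key: matches any character

# Hand-built DFA for Ela[Ss]tic.* : state -> {character: next state}
DFA = {
    0: {"E": 1},
    1: {"l": 2},
    2: {"a": 3},
    3: {"S": 4, "s": 4},
    4: {"t": 5},
    5: {"i": 6},
    6: {"c": 7},
    7: {ANY: 7},
}
ACCEPT = {7}
END_OF_TERM = ""  # marker key for "a term ends at this trie node"


def build_trie(terms):
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[END_OF_TERM] = True
    return root


def step(state, ch):
    """Next DFA state for ch, or None when the DFA rejects it."""
    edges = DFA[state]
    return edges.get(ch, edges.get(ANY))


def intersect(node, state=0, prefix=""):
    """Yield matching terms in sorted order, pruning rejected trie branches."""
    if END_OF_TERM in node and state in ACCEPT:
        yield prefix
    for ch in sorted(k for k in node if k != END_OF_TERM):
        nxt = step(state, ch)
        if nxt is not None:  # otherwise the whole subtree is skipped
            yield from intersect(node[ch], nxt, prefix + ch)


terms = ["Elastic", "ElaStic", "Elasticsearch", "Elbow", "Lucene", "elastic"]
print(list(intersect(build_trie(terms))))
```

Terms such as Lucene are discarded after a single character, because state 0 only has a transition for E.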
Slide 24
Regexp queries
• Not limited to regexps
• Fuzzy queries too!
– example: es~1
Slide 25
How are numeric doc values compressed?
a column-stride, on-disk, un-inverted index
Slide 26
Aggregation Execution
“color”   Doc IDs
blue      0
green     5, 20, 22
red       12
(inverted index)
What is average price of green docs?
Slide 27
Aggregation Execution
“color”   Doc IDs
blue      0
green     5, 20, 22
red       12
Doc ID   “price”
0        10
5        20
10       20
12       60
20       60
22       20
(field data)
What is average price of green docs?
(20 + 60 + 20) / 3 = 33.33
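The two lookups on this slide can be sketched in a few lines. The dictionaries below simply copy the slide's tables; `average_price` is a hypothetical helper name.

```python
# Inverted index: term -> doc IDs (answers "which docs are green?")
inverted_index = {"blue": [0], "green": [5, 20, 22], "red": [12]}

# Un-inverted column: doc ID -> value (answers "what is the price of doc 20?")
price = {0: 10, 5: 20, 10: 20, 12: 60, 20: 60, 22: 20}


def average_price(color):
    doc_ids = inverted_index[color]         # step 1: find the matching docs
    values = [price[d] for d in doc_ids]    # step 2: look up each doc's value
    return sum(values) / len(values)


print(round(average_price("green"), 2))
```

The second step is the reason a per-document column is needed at all: the inverted index alone cannot answer it without scanning every term.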
Slide 28
Field Data and Doc Values
Field Data
• In-memory, lives on JVM Heap
• All-or-nothing
• Lazily constructed at query-time
Doc Values
• Disk-based, leverages OS FS cache
• Pages in/out of FS cache
• Precomputed at index-time
• Allows better compression
Slide 30
Numerics: one unique value
Constant Encoding
Doc ID   “price”
0        10
5        10
10       10
12       10
20       10
22       10
• Easy :)
• Write the constant value and set a flag
• 4 bytes to represent n values
Slide 31
Numerics: < 256 unique values
Table Encoding
• Write a table of all possible values, then encode data as bit-packed ordinals
• Great compression when few unique values
• Better when num_docs >> num_values
• Best case is 1 bit/doc, worst case is 1 byte/doc
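A minimal sketch of the idea, assuming a sorted table and plain Python lists in place of real bit-packing. The function names and sample values are illustrative and not taken from the talk.

```python
def table_encode(values):
    """Replace each value by its position (ordinal) in a table of unique values."""
    table = sorted(set(values))
    ordinal = {v: i for i, v in enumerate(table)}
    # Bits needed to store the largest ordinal, with a floor of 1 bit per doc.
    bits_per_doc = max(1, (len(table) - 1).bit_length())
    return table, [ordinal[v] for v in values], bits_per_doc


def table_decode(table, ordinals):
    return [table[o] for o in ordinals]


values = [10, 20, 20, 60, 60, 20]
table, ordinals, bits = table_encode(values)
print(table, ordinals, bits)
```

With 2 unique values each document costs 1 bit, and with 256 unique values it costs 8 bits, which matches the best and worst cases on the slide.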
Slide 39
Numerics: Common Denominator
GCD Encoding
• Certain types of data share common denominators
• E.g. timestamps without ms precision
  • 142542454000
  • 142542455000
  • 142542456000
• If a GCD is found, can encode values as multiples of the GCD, using fewer bits
[Figure: with a GCD of 1000, the timestamps are stored as x1 GCD and x2 GCD on top of 142542454000]
Slide 40
Numerics: Common Denominator
GCD Encoding
Doc ID   “price”
0        50
5        20
10       20
12       20
20       10
22       50
Share 10 as common divisor
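A rough sketch of GCD encoding using the numbers from these two slides. Subtracting the minimum before taking the GCD is an assumption here, chosen because it reproduces the x1 and x2 multiples shown for the timestamps. The function names are hypothetical.

```python
from functools import reduce
from math import gcd


def gcd_encode(values):
    """Store each value as a small multiple of a shared divisor above a base."""
    base = min(values)
    divisor = reduce(gcd, (v - base for v in values), 0)
    if divisor == 0:      # every value equals the base
        divisor = 1
    return base, divisor, [(v - base) // divisor for v in values]


def gcd_decode(base, divisor, quotients):
    return [base + q * divisor for q in quotients]


timestamps = [142542454000, 142542455000, 142542456000]
print(gcd_encode(timestamps))

prices = [50, 20, 20, 20, 10, 50]
print(gcd_encode(prices))
```

The stored quotients are tiny compared with the original 12-digit timestamps, so they can be bit-packed in far fewer bits.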
Slide 45
Numerics: If all else fails
Delta Encoding
• If we can’t use any “tricks”, delta encode
• Encode everything as an offset from the minValue
[Figure: a value shown as the minimum value plus an offset; the number shown is 199872]
Basically less-good GCD encoding!
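As a sketch, delta encoding is the same idea with the divisor fixed at 1. The sample values below are made up for illustration (only 199872 appears on the slide, and treating it as the minimum is an assumption).

```python
def delta_encode(values):
    """Store the minimum once, then each value as an offset from it."""
    min_value = min(values)
    offsets = [v - min_value for v in values]
    bits_per_value = max(1, max(offsets).bit_length())
    return min_value, offsets, bits_per_value


def delta_decode(min_value, offsets):
    return [min_value + o for o in offsets]


values = [199872, 199875, 199999]   # hypothetical column values
print(delta_encode(values))
```

The saving comes from the offsets: here the largest is 127, which fits in 7 bits, whereas the raw values need 18 bits each.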
Slide 51
Distinct Counts: Naive Solution
Maintain a map of all values
• Cardinality == map.size()
• map.size() == n
• Memory usage == n * size of each term
(Ignoring map overhead)
[Figure: a map holding blue, green, red, …]
Slide 52
Distinct Counts: Naive Solution
Gets worse in distributed environment
[Figure: Node 1, Node 2 and Node 3 each hold their own map of values, such as blue, green, red, …; abc, xyz, …; abc, green, xyz, …]
Slide 53
Distinct Counts: Naive Solution
Gets worse in distributed environment
[Figure: Node 1, Node 2 and Node 3 each send their whole map of values (blue, green, red, …; abc, xyz, …; abc, green, xyz, …) to Node 4, which merges them]
Slide 54
Distinct Counts: HyperLogLog++
Cardinality agg uses HyperLogLog++ instead
• Approximates cardinality
• Uses only a few KB of memory for billions of distinct values
• < 5% error (adjustable)
• Fast!
• Lossless unions
Slide 55
Bit-Observable Patterns
Let’s flip some coins…
Probability of a “run”: $\frac{1}{2^n}$
Slide 56
Bit-Observable Patterns
Let’s flip some coins…
Probability of a “run”: $\frac{1}{2^n}$
5 heads in a row: $\frac{1}{32}$
Slide 57
Bit-Observable Patterns
Let’s flip some coins…
Probability of a “run”: $\frac{1}{2^n}$
5 heads in a row: $\frac{1}{32}$
20 heads in a row: $\frac{1}{1048576}$
Slide 58
Bit-Observable Patterns
Let’s flip some coins…
Probability of a “run”: $\frac{1}{2^n}$
5 heads in a row: $\frac{1}{32}$ (Could do this in one sitting)
20 heads in a row: $\frac{1}{1048576}$ (Might take all day)
Slide 59
Key Insight:
Length of the run ~= duration of coin flipping
Slide 60
Bit-Observable Patterns
Let’s hash values, instead of flipping coins…
v = 12345
h(v) = cbf5a
= 11001011111101011010
Run of 1 zero
Slide 61
Bit-Observable Patterns
Let’s hash values, instead of flipping coins…
v = 12345
h(v) = cbf5a
= 11001011111101011010
Set “register” to 1
[Register: 1]
Slide 62
Bit-Observable Patterns
Let’s hash values, instead of flipping coins…
v = 3456
h(v) = 8D338
= 10001101001100111000
Run of 3 zeros
[Register: 1]
Slide 63
Bit-Observable Patterns
Let’s hash values, instead of flipping coins…
Set “register” to 3
[Register: 3]
v = 3456
h(v) = 8D338
= 10001101001100111000
Slide 64
Bit-Observable Patterns
Let’s hash values, instead of flipping coins…
v = 948
h(v) = 47D34
= 01000111110100110100
Run of 2 zeros
Don’t update register
[Register: 3]
Slide 65
Key Insight:
Length of the run ~= cardinality (replacing “duration of coin flipping”)
Probability of a “run”: $\frac{1}{2^n}$
5 zeros in a row: $\frac{1}{32}$, so ~32 distinct values
20 zeros in a row: $\frac{1}{1048576}$, so ~1048576 distinct values
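The single-register walk-through from the previous slides can be replayed with a short sketch. The three hash values are the ones shown on the slides (the hash function itself is not given, so the values are used directly), and `trailing_zeros` is a hypothetical helper that counts the run of zeros at the low end of the hash.

```python
def trailing_zeros(h, width):
    """Length of the run of zero bits at the low end of a width-bit hash."""
    return width if h == 0 else (h & -h).bit_length() - 1


def longest_run(hashes, width=20):
    """Keep a single register holding the longest run seen so far."""
    register = 0
    for h in hashes:
        register = max(register, trailing_zeros(h, width))
    return register


slide_hashes = [0xCBF5A, 0x8D338, 0x47D34]   # h(12345), h(3456), h(948)
register = longest_run(slide_hashes)
print(register, 2 ** register)   # longest run, and the crude 2^n estimate
```

The runs are 1, 3 and 2 zeros, so the register ends at 3, as on the slides. With a single register the 2^n estimate is very coarse, which is the weakness the next slides address.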
Slide 66
What if you get unlucky on first value?
v = 938
h(v) = 0400
= 0000010000000000
Run of 10 zeros
oops :(
Slide 68
Solution: keep multiple counters
v = 938
h(v) = 0400
= 0000010000000000
Stochastic Averaging
Run of 10 zeros
Use first 3 bits as register index
Slide 69
Solution: keep multiple counters
v = 938
h(v) = 0400
= 0000010000000000
Stochastic Averaging
Set register[0] to 10
[Register 0: 10]
Slide 70
Solution: keep multiple counters
v = 7482
h(v) = 9D3A
= 1001110100111010
Stochastic Averaging
Run of 1 zero
Use first 3 bits as register index
[Register 0: 10]
Slide 71
Solution: keep multiple counters
v = 7482
h(v) = 9D3A
= 1001110100111010
Stochastic Averaging
Set register[4] to 1
[Register 0: 10, register 4: 1]
Slide 72
Cardinality is the Harmonic Mean of the registers
Stochastic Averaging
Registers: 10, 1, 1, 8, 3
Harmonic Mean = 1.9544
(and some empirical constants)
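The register updates and the harmonic mean from these slides can be checked with a short sketch. It follows the slides' simplified description: the first 3 bits of a 16-bit hash pick the register, and the run is counted over the whole hash. The empirical constants mentioned on the slide are left out, so this is not a full HyperLogLog++ estimator. All function names are hypothetical.

```python
def trailing_zeros(h, width):
    """Length of the run of zero bits at the low end of a width-bit hash."""
    return width if h == 0 else (h & -h).bit_length() - 1


def update(registers, h, width=16, index_bits=3):
    """Route the hash to a register by its first bits, keep the longest run."""
    index = h >> (width - index_bits)
    registers[index] = max(registers[index], trailing_zeros(h, width))


def harmonic_mean(values):
    return len(values) / sum(1 / v for v in values)


registers = [0] * 8          # 3 index bits give 8 registers
update(registers, 0x0400)    # h(938): index 0, run of 10 zeros
update(registers, 0x9D3A)    # h(7482): index 4, run of 1 zero
print(registers)

slide_registers = [10, 1, 1, 8, 3]
print(round(harmonic_mean(slide_registers), 4))
```

The harmonic mean is dominated by the small registers, so one unlucky register holding 10 no longer skews the result the way it did with a single counter.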
Slide 73
Other neat attributes
Registers are small!
5-6 bits
Registers: 10, 1, 1, 8, 3
Slide 74
Other neat attributes
Unions are lossless!
[10, 1, 1, 8, 3] U [4, 5, 1, 2, 7] = [10, 5, 1, 8, 7]
Take max of each register
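The union rule is small enough to state directly. In this sketch `union` is a hypothetical helper, and the register values are the ones on the slide.

```python
def union(registers_a, registers_b):
    """Merge two sketches of equal size by taking the max of each register."""
    if len(registers_a) != len(registers_b):
        raise ValueError("sketches must have the same number of registers")
    return [max(a, b) for a, b in zip(registers_a, registers_b)]


node_1 = [10, 1, 1, 8, 3]
node_2 = [4, 5, 1, 2, 7]
print(union(node_1, node_2))
```

Because each register only ever keeps a maximum, merging two sketches gives the same registers as feeding both sets of values into one sketch, and the merge can happen in any order.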
Slide 75
Other neat attributes
Which is perfect for distributed environments
[Figure: Node 1, Node 2, Node 3 and Node 4, with a Merge step]
Slide 76
In closing…
Stop worrying and learn to love approximate algorithms
Slide 77
Thank you!
@jpountz @polyfractal
Slide 78
This work is licensed under the Creative Commons
Attribution-NoDerivatives 4.0 International License.
To view a copy of this license, visit:
http://creativecommons.org/licenses/by-nd/4.0/
or send a letter to:
Creative Commons
PO Box 1866
Mountain View, CA 94042
USA
CC-BY-ND 4.0
Slide 79
Strings: Just a big ol' blob
• Strings are simpler
• Basic idea is to:
  • Encode a term dictionary in a binary blob
  • Compress ordinals using numeric compression
• Three schemes to encode the blob:
  • Fixed, Variable, Prefix
Ordinal Map (Doc ID → “widget” ordinal): 0 → 0, 5 → 1, 10 → 2
Term Dictionary (Ord ID → Term): 0 → aaa, 1 → bbb, 2 → abc
Slide 81
Strings: Equal-sized terms
Fixed-width Encoding
Serialize bytes: 61 61 61 62 62 62 61 62 63 … (‘aaa’ ‘bbb’ ‘abc’)
Term Dictionary (Ord ID → Term): 0 → aaa, 1 → bbb, 2 → abc
Ordinal Map (Doc ID → “widget” ordinal): 0 → 0, 5 → 1, 10 → 2
Compress as Numeric data
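A small sketch of the fixed-width scheme, using the slide's terms and tables. The function names are hypothetical, and a Python dict stands in for the compressed ordinal map.

```python
def fixed_width_encode(terms):
    """Concatenate equal-length terms into one blob; the ordinal is the slot."""
    encoded = [t.encode("utf-8") for t in terms]
    width = len(encoded[0])
    if any(len(e) != width for e in encoded):
        raise ValueError("fixed-width encoding needs equal-sized terms")
    return width, b"".join(encoded)


def fixed_width_lookup(width, blob, ord_id):
    """A term is found by arithmetic alone: no per-term offsets are stored."""
    return blob[ord_id * width:(ord_id + 1) * width].decode("utf-8")


width, blob = fixed_width_encode(["aaa", "bbb", "abc"])
ordinal_map = {0: 0, 5: 1, 10: 2}    # Doc ID -> ordinal, as on the slide
print(fixed_width_lookup(width, blob, ordinal_map[10]))
```

Resolving a document's term is then two steps: doc ID to ordinal through the ordinal map, and ordinal to bytes through a multiplication.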
Slide 82
Strings: Variable-sized, < 1024 terms
Variable-width Encoding
Serialize bytes: 61 61 62 61 62 63 … (‘a’ ‘ab’ ‘abc’)
Term Dictionary (Ord ID → Term): 0 → a, 1 → ab, 2 → abc
Ordinal Map (Doc ID → “widget” ordinal): 0 → 0, 5 → 1, 10 → 2
Slide 83
Strings: Variable-sized, < 1024 terms
Variable-width Encoding
Serialize bytes: 61 61 62 61 62 63 … (‘a’ ‘ab’ ‘abc’)
Pack lengths in VarInts: 06
Term Dictionary (Ord ID → Term): 0 → a, 1 → ab, 2 → abc
Ordinal Map (Doc ID → “widget” ordinal): 0 → 0, 5 → 1, 10 → 2
Slide 84
Strings: Variable-sized, < 1024 terms
Variable-width Encoding
Serialize bytes: 61 61 62 61 62 63 … (‘a’ ‘ab’ ‘abc’)
Pack lengths in VarInts: 06
Term Dictionary (Ord ID → Term): 0 → a, 1 → ab, 2 → abc
Ordinal Map (Doc ID → “widget” ordinal): 0 → 0, 5 → 1, 10 → 2
Compress as Numeric data
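A sketch of the variable-width scheme with the slide's terms. Storing cumulative end offsets rather than raw lengths is an assumption made here so that a lookup needs no running sum. The VarInt packing itself is omitted, and the function names are hypothetical.

```python
def variable_width_encode(terms):
    """Concatenate terms of any length and remember where each one ends."""
    blob = bytearray()
    end_offsets = []
    for term in terms:
        blob += term.encode("utf-8")
        end_offsets.append(len(blob))   # cumulative length so far
    return bytes(blob), end_offsets


def variable_width_lookup(blob, end_offsets, ord_id):
    start = end_offsets[ord_id - 1] if ord_id > 0 else 0
    return blob[start:end_offsets[ord_id]].decode("utf-8")


blob, end_offsets = variable_width_encode(["a", "ab", "abc"])
print(blob.hex(), end_offsets)
print(variable_width_lookup(blob, end_offsets, 1))
```

The offsets are a non-decreasing sequence of small integers, which is the kind of data the numeric compression schemes from the earlier slides handle well.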
Slide 85
Strings: Everything Else
Prefix Encoding
Term Dictionary (Ord ID → Term): 0 → aa, 1 → aaa, 2 → abc
Serialize first: 61 61 (‘aa’)
Slide 86
Strings: Everything Else
Prefix Encoding
Term Dictionary (Ord ID → Term): 0 → aa, 1 → aaa, 2 → abc
Serialize first: 61 61 (‘aa’)
Write prefix length: 02
Slide 87
Strings: Everything Else
Prefix Encoding
Term Dictionary (Ord ID → Term): 0 → aa, 1 → aaa, 2 → abc
Serialize first: 61 61 (‘aa’)
Write prefix length: 02
Write remaining bytes: 61 (‘a’)
Slide 88
Strings: Everything Else
Prefix Encoding
Term Dictionary (Ord ID → Term): 0 → aa, 1 → aaa, 2 → abc
Serialize first: 61 61 (‘aa’)
Write prefix length: 02
Write remaining bytes: 61 (‘a’)
Write prefix length: 01
Write remaining bytes: 62 63 (‘bc’)
Slide 89
Strings: Everything Else
Prefix Encoding
Term Dictionary (Ord ID → Term): 0 → aa, 1 → aaa, 2 → abc
Serialize first: 61 61 (‘aa’)
Write prefix length: 02
Write remaining bytes: 61 (‘a’)
Write prefix length: 01
Write remaining bytes: 62 63 (‘bc’)
Re-serialize prefix start point every 16 terms
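The steps above can be sketched as follows. Each entry is kept as a Python tuple of (shared prefix length, remaining bytes) instead of a packed byte stream, and `None` marks a term written in full. Both choices are simplifications for illustration, and the function names are hypothetical.

```python
def common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def prefix_encode(terms, block_size=16):
    """Write each term as (bytes shared with the previous term, remaining bytes)."""
    entries, prev = [], b""
    for i, term in enumerate(terms):
        data = term.encode("utf-8")
        if i % block_size == 0:
            entries.append((None, data))   # start point: serialize in full
        else:
            shared = common_prefix_len(prev, data)
            entries.append((shared, data[shared:]))
        prev = data
    return entries


def prefix_decode(entries):
    terms, prev = [], b""
    for shared, suffix in entries:
        data = suffix if shared is None else prev[:shared] + suffix
        terms.append(data.decode("utf-8"))
        prev = data
    return terms


entries = prefix_encode(["aa", "aaa", "abc"])
print(entries)
print(prefix_decode(entries))
```

Writing a full term at regular intervals bounds how far a reader has to replay prefixes before it can reconstruct any given term.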
Slide 90
Strings: Everything Else
Prefix Encoding
When done, write a ReverseTermIndex every 1024 terms
ReverseTermIndex (Position → Term): 0 → aa, 1024 → gef, 2048 → xyz
Pack as VarInts
Slide 91
Strings: Everything Else
Prefix Encoding
Finally, write out Ordinal Map
Ordinal Map (Doc ID → “widget” ordinal): 0 → 0, 5 → 1, 10 → 2
Compress as Numeric data