Elastic{ON} 2018 - Get the Lay of the Lucene Land

Slide 1

Slide 1 text

Get the Lay of the (Lucene) Land
Adrien Grand, Software engineer, Elastic
@jpountz

Slide 2

Slide 2 text

• Working with Lucene since 2010
• Lucene committer since 2012
• Lucene PMC since 2013
• Elastic employee since 2013

Slide 3

Slide 3 text

"Apache Lucene is a free and open-source information retrieval software library" (Wikipedia)

Slide 4

Slide 4 text

Where is Lucene heading?
Lucene 4 (2012) (Elasticsearch 0.90, 1)
• Doc values
• Flexible scoring
• Better postings/store compression
• Fuzzy queries speedup
Lucene 5 (2015) (Elasticsearch 2)
• Index safety
• Slow query execution
Lucene 6 (2016) (Elasticsearch 5)
• Points
• Index sorting
• BM25 by default
• Multi-term synonyms

Slide 5

Slide 5 text

Where is Lucene heading?
Lucene 7 (2017) (Elasticsearch 6)
• Query planning
• Sparse doc values
Lucene 7.x (Elasticsearch 6.x)
• ???
Lucene 8 (2018/2019?) (Elasticsearch 7)
• ???

Slide 6

Slide 6 text

Agenda
Lucene 7 (2017) (Elasticsearch 6)
• Query planning
• Sparse doc values
Lucene 7.x (Elasticsearch 6.x)
• CoveringQuery
• Fine-grained flushing
• BKD-based geo bounding boxes
Lucene 8 (2018/2019?) (Elasticsearch 7)
• WAND (Weak AND)
• Impacts indexing

Slide 7

Slide 7 text

Recap of 7.0 highlights
• Query planning
• Sparse doc values

Slide 8

Slide 8 text

Query planning: 2 options for ranges
             Find all matches   Verify P matches
Points       O(num_matches)     O(num_matches)
Doc values   O(num_docs)        O(P)
⇒ IndexOrDocValuesQuery
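The trade-off in the table above can be sketched as a small decision function. This is only an illustration of the idea behind IndexOrDocValuesQuery: the function name, its parameters and the plain cost comparison are assumptions made for this sketch, not the heuristic Lucene actually applies.

```python
from typing import Optional


def choose_range_strategy(num_range_matches: int,
                          num_candidates: Optional[int]) -> str:
    """Pick how to evaluate a range clause (hypothetical helper).

    num_range_matches: estimated number of docs matching the range.
    num_candidates:    P, the number of docs that other clauses already
                       selected and that only need to be verified, or
                       None when the range has to drive the iteration.
    """
    if num_candidates is None:
        # Nothing to verify against: points find all matches in
        # O(num_matches), doc values would need O(num_docs).
        return "points"
    # Verifying P candidates costs O(P) with doc values, while points
    # cost O(num_matches) no matter how small P is.
    if num_candidates < num_range_matches:
        return "doc_values"
    return "points"


# A selective term clause (1,000 hits) combined with a broad range
# (5,000,000 hits): verifying the 1,000 candidates is cheaper.
print(choose_range_strategy(5_000_000, 1_000))   # doc_values
print(choose_range_strategy(5_000_000, None))    # points
```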

Slide 9

Slide 9 text

Query planning (6.5+)
30x faster

Slide 10

Slide 10 text

Sparse doc values fields (7.0+)
Doc ID   Value
0        42
1
2
3        -3
4        100
5
6.x storage
  Docs with field   [T, F, F, T, T, F]
  Value             [42, 0, 0, -3, 100, 0]
7.0 storage
  Docs with field   [0, 3, 4]
  Value             [42, -3, 100]
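The two layouts on this slide can be reproduced with a few lines of Python. This is a conceptual sketch only: the function names are made up, and the real on-disk encodings are more involved than plain lists.

```python
def encode_dense(values, num_docs, missing=0):
    """6.x-style layout: one slot per document, plus a has-value flag."""
    docs_with_field = [doc in values for doc in range(num_docs)]
    dense_values = [values.get(doc, missing) for doc in range(num_docs)]
    return docs_with_field, dense_values


def encode_sparse(values):
    """7.0-style layout: only documents that have a value are stored."""
    docs_with_field = sorted(values)
    return docs_with_field, [values[doc] for doc in docs_with_field]


# Same data as on the slide: docs 0, 3 and 4 have a value.
values = {0: 42, 3: -3, 4: 100}
print(encode_dense(values, num_docs=6))
print(encode_sparse(values))
```

With the sparse layout, storage grows with the number of documents that have the field rather than with the total number of documents.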

Slide 11

Slide 11 text

CoveringQuery

Slide 12

Slide 12 text

CoveringQuery (7.1+)
Remember minimum_should_match?
GET hotels/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "amenities": "pool" } },
        { "term": { "amenities": "spa" } },
        { "term": { "amenities": "fitness_center" } }
      ],
      "minimum_should_match": 2
    }
  }
}

Slide 13

Slide 13 text

CoveringQuery (7.1+)
Attribute-based control
PUT documents/_doc/1
{
  "content": "…",
  "security_attributes": [ "employee", "project:lambda" ],
  "security_attributes_length": 2
}

Slide 14

Slide 14 text

CoveringQuery (7.1+)
Attribute-based control
GET documents/_search
{
  "query": {
    "terms_set": {
      "security_attributes": {
        "terms": [ "employee", "project:gamma", "project:lambda" ],
        "minimum_should_match_field": "security_attributes_length"
      }
    }
  }
}
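The matching rule behind the request above can be sketched in a few lines: a document matches when the number of query terms it contains reaches the minimum stored in the document itself. The helper name and the dict-based document are assumptions for illustration, not Elasticsearch or Lucene code.

```python
def terms_set_matches(doc, field, query_terms, min_match_field):
    """True if `doc` contains at least doc[min_match_field] query terms."""
    matching = len(set(doc[field]) & set(query_terms))
    return matching >= doc[min_match_field]


doc = {
    "security_attributes": ["employee", "project:lambda"],
    "security_attributes_length": 2,
}

# The user holds every attribute the document requires.
user_a = ["employee", "project:gamma", "project:lambda"]
# This user lacks "project:lambda", so only 1 of the 2 attributes matches.
user_b = ["employee", "project:gamma"]

print(terms_set_matches(doc, "security_attributes", user_a,
                        "security_attributes_length"))  # True
print(terms_set_matches(doc, "security_attributes", user_b,
                        "security_attributes_length"))  # False
```

Because the threshold equals the number of attributes on the document, the document is only visible to users whose attributes cover all of them.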

Slide 15

Slide 15 text

BKD-based geo shapes

Slide 16

Slide 16 text

BKD-based geo-shapes
• Current shape support still based on postings
• BKD-based bounding boxes available in 7.1: LatLonBoundingBox
• Indexed as a 4-dimensional point
• Using the BKD tree as an R-tree
• Upcoming general BKD-based shape support (7.x)
[Figure: Distance filter on geo points]
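To see why a bounding box can be indexed as a 4-dimensional point, note that "box intersects query box" becomes a plain range condition on each of the four dimensions. The sketch below uses hypothetical names and ignores boxes that cross the dateline; it illustrates the idea rather than the LatLonBoundingBox implementation.

```python
from typing import NamedTuple

INF = float("inf")


class Box(NamedTuple):
    # A box is stored as the 4-d point (min_lat, min_lon, max_lat, max_lon).
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float


def intersects_as_ranges(query: Box):
    """Per-dimension (lower, upper) bounds selecting boxes that intersect `query`."""
    return [
        (-INF, query.max_lat),  # box.min_lat <= query.max_lat
        (-INF, query.max_lon),  # box.min_lon <= query.max_lon
        (query.min_lat, INF),   # box.max_lat >= query.min_lat
        (query.min_lon, INF),   # box.max_lon >= query.min_lon
    ]


def in_ranges(point: Box, ranges) -> bool:
    """Check a 4-d point against a 4-d range, one dimension at a time."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(point, ranges))


query = Box(5, 5, 20, 20)
print(in_ranges(Box(0, 0, 10, 10), intersects_as_ranges(query)))    # True
print(in_ranges(Box(30, 30, 40, 40), intersects_as_ranges(query)))  # False
```

A multi-dimensional range like this is exactly what a BKD tree answers efficiently, which is how it can stand in for an R-tree here.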

Slide 17

Slide 17 text

Interested in geo?
The state of geo in Elasticsearch, tomorrow 9:30

Slide 18

Slide 18 text

Fine-grained flushing

Slide 19

Slide 19 text

Fine-grained flushing
Say you want to spend 1GB on indexing and have 2 shards, how do you do it?

Slide 20

Slide 20 text

Fine-grained flushing
[Figure: shard 1 uses 124MB, shard 2 uses 900MB]
Flush largest shard when total memory usage ≥ limit

Slide 21

Slide 21 text

Fine-grained flushing
[Figure: shard 1 uses 124MB, shard 2 uses 0MB]
Flush largest shard when total memory usage ≥ limit

Slide 22

Slide 22 text

Fine-grained flushing
[Figure: shard 1 uses 124MB, shard 2 uses 900MB]
Flush largest DWPT when total memory usage ≥ limit

Slide 23

Slide 23 text

Fine-grained flushing
[Figure: shard 1 uses 124MB, shard 2 uses 600MB]
Flush largest DWPT when total memory usage ≥ limit

Slide 24

Slide 24 text

Fine-grained flushing
• Creates larger segments
• Hopefully integrated in Elasticsearch 6.4
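The two flushing policies from the preceding slides can be compared with a toy model. A DWPT (DocumentsWriterPerThread) is one of Lucene's per-thread in-memory indexing buffers. The slides only give per-shard totals, so the individual buffer sizes below (three 300MB buffers in shard 2) are an assumption chosen to reproduce those totals, and the function names are hypothetical.

```python
def flush_largest_shard(shards):
    """Flush every buffer of the shard that uses the most memory."""
    largest = max(shards, key=lambda shard: sum(shards[shard]))
    return {shard: ([] if shard == largest else list(buffers))
            for shard, buffers in shards.items()}


def flush_largest_dwpt(shards):
    """Flush only the single largest buffer across all shards."""
    shard, index = max(
        ((s, i) for s, buffers in shards.items() for i in range(len(buffers))),
        key=lambda si: shards[si[0]][si[1]],
    )
    result = {s: list(buffers) for s, buffers in shards.items()}
    del result[shard][index]
    return result


def totals(shards):
    return {shard: sum(buffers) for shard, buffers in shards.items()}


# Buffer sizes in MB; 124 + 900 = 1024MB reaches the 1GB limit.
shards = {1: [124], 2: [300, 300, 300]}
print(totals(flush_largest_shard(shards)))  # {1: 124, 2: 0}
print(totals(flush_largest_dwpt(shards)))   # {1: 124, 2: 600}
```

Flushing one buffer at a time frees enough memory to keep indexing while the remaining buffers keep growing, which is where the larger segments come from.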

Slide 25

Slide 25 text

WAND (Weak AND)

Slide 26

Slide 26 text

WAND (8.0)
Can you make queries faster if you don’t need total hit counts?
• Sorted by field: index sorting (6.0)
• Sorted by score: ???

Slide 27

Slide 27 text

Anatomy of a Lucene index/query
• Documents are identified by doc ids 0..N
• Queries produce iterators over (doc id, score) pairs, sorted by doc id
• Score of a boolean query is the sum of the scores of its clauses

Slide 28

Slide 28 text

How do disjunctions work?
[Figure: terms "the", "quick" and "fox" against doc ids 0 to 7, with an empty score row]

Slide 29

Slide 29 text

How do disjunctions work?
[Figure: terms "the", "quick" and "fox" against doc ids 0 to 7. Scores shown: 2.5]

Slide 30

Slide 30 text

How do disjunctions work?
[Figure: terms "the", "quick" and "fox" against doc ids 0 to 7. Scores shown: 2.5, 1.6]

Slide 31

Slide 31 text

How do disjunctions work?
[Figure: terms "the", "quick" and "fox" against doc ids 0 to 7. Scores shown: 2.5, 1.6, 2.3]

Slide 32

Slide 32 text

How do disjunctions work?
[Figure: terms "the", "quick" and "fox" against doc ids 0 to 7. Scores shown: 2.5, 1.6, 2.3, 2.0]

Slide 33

Slide 33 text

How do disjunctions work?
[Figure: terms "the", "quick" and "fox" against doc ids 0 to 7. Scores shown: 2.5, 1.6, 2.3, 2.0, 0.1]

Slide 34

Slide 34 text

How do disjunctions work?
[Figure: terms "the", "quick" and "fox" against doc ids 0 to 7. Scores shown: 2.5, 1.6, 2.3, 2.0, 0.1, 1.9]

Slide 35

Slide 35 text

How do disjunctions work?
[Figure: terms "the", "quick" and "fox" against doc ids 0 to 7. Scores shown: 2.5, 1.6, 2.3, 2.0, 4.0, 0.1, 1.9]

Slide 36

Slide 36 text

How do disjunctions work?
[Figure: terms "the", "quick" and "fox" against doc ids 0 to 7. Scores shown: 2.5, 1.6, 2.3, 4.0, 2.4, 2.0, 0.1, 1.9]

Slide 37

Slide 37 text

How do disjunctions work?
[Figure: terms "the", "quick" and "fox" against doc ids 0 to 7. Scores shown: 2.5, 1.6, 2.3, 4.0, 2.4, 2.0, 0.1, 1.9]
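The exhaustive evaluation walked through on these slides can be sketched as a merge of per-term iterators: every matching document is visited and its score is the sum of the scores of the clauses that match it. The postings and scores below are made up for illustration, since the figure's per-term values are not recoverable from the slide text.

```python
import heapq
from itertools import groupby


def disjunction(*clauses):
    """Merge (doc id, score) iterators sorted by doc id, summing scores per doc."""
    merged = heapq.merge(*clauses, key=lambda pair: pair[0])
    for doc_id, group in groupby(merged, key=lambda pair: pair[0]):
        yield doc_id, sum(score for _, score in group)


# Hypothetical postings: (doc id, score contributed by the term).
the = [(0, 0.5), (1, 0.5), (2, 0.5)]
quick = [(1, 1.0)]
fox = [(0, 2.0), (2, 1.5)]

print(list(disjunction(the, quick, fox)))
# [(0, 2.5), (1, 1.5), (2, 2.0)]
```

Every document that matches at least one term gets scored, even when it has no chance of making the top hits. That is the cost WAND tries to avoid.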

Slide 38

Slide 38 text

WAND: intuition
• Search for "the OR fox"
• If
  • minimum competitive score is 1
  • "the" contributes at most 0.2 to the score
• Then documents MUST match "fox" to be competitive

Slide 39

Slide 39 text

WAND: max score?
≤ BM25 score
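The formula shown on this slide is not recoverable from the text, so the sketch below falls back on the textbook BM25 term score to show why a per-term maximum exists: the term-frequency part saturates, which bounds the score by idf × (k1 + 1) whatever the frequency and document length. The function names are hypothetical; k1 = 1.2 and b = 0.75 are the usual defaults.

```python
def bm25_term_score(idf, freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document."""
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * freq * (k1 + 1) / (freq + length_norm)


def bm25_max_score(idf, k1=1.2):
    """Upper bound: freq * (k1 + 1) / (freq + length_norm) < k1 + 1."""
    return idf * (k1 + 1)


idf = 2.0
bound = bm25_max_score(idf)
for freq in (1, 10, 1000):
    for doc_len in (1, 100, 10_000):
        score = bm25_term_score(idf, freq, doc_len, avg_doc_len=100)
        assert score <= bound
        print(freq, doc_len, round(score, 3), "<=", bound)
```

A bound that depends only on the term (through its IDF) is what lets WAND reason about which clauses can still produce a competitive document.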

Slide 40

Slide 40 text

WAND: algorithm
• Given C clauses, find next target:
  • Sort by non-decreasing current doc id
  • Sum up max scores until Σ max_score ≥ min_competitive_score
  • Return doc id of the first clause to meet this requirement
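The steps above can be sketched as a single function. The representation of a clause as a (current doc id, max score) tuple and the current positions used in the example are assumptions for illustration. Only the max scores and the minimum competitive score are taken from the slides.

```python
def find_next_target(clauses, min_competitive_score):
    """Return the next doc id worth evaluating, or None if there is none.

    clauses: list of (current_doc_id, max_score), one entry per clause.
    """
    total = 0.0
    # Sort by non-decreasing current doc id.
    for current_doc_id, max_score in sorted(clauses, key=lambda c: c[0]):
        # Sum up max scores until the sum becomes competitive.
        total += max_score
        if total >= min_competitive_score:
            # Earlier doc ids can only match clauses whose max scores
            # add up to less than the minimum competitive score.
            return current_doc_id
    return None


# Hypothetical positions: "the" is on doc 3, "quick" on doc 5, "fox" on doc 6.
clauses = [(3, 0.2), (5, 2.0), (6, 3.0)]
print(find_next_target(clauses, min_competitive_score=2.3))  # 6
print(find_next_target(clauses, min_competitive_score=6.0))  # None
```

In the first call, "the" and "quick" together reach at most 2.2, so documents 3 to 5 cannot be competitive and evaluation can jump straight to doc 6.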

Slide 41

Slide 41 text

WAND: example
[Figure: "the" (max score 0.2), "quick" (max score 2.0) and "fox" (max score 3.0) against doc ids 0 to 7. Min competitive score = 2.3. The next target is highlighted]

Slide 42

Slide 42 text

WAND: compute top 2 matches
[Figure: "the" (max score 0.2), "quick" (max score 2.0) and "fox" (max score 3.0) against doc ids 0 to 7. Min competitive score = 0. No scores shown yet]

Slide 43

Slide 43 text

WAND: compute top 2 matches
[Figure: "the" (max score 0.2), "quick" (max score 2.0) and "fox" (max score 3.0) against doc ids 0 to 7. Min competitive score = 0. Scores shown: 2.5]

Slide 44

Slide 44 text

WAND: compute top 2 matches
[Figure: "the" (max score 0.2), "quick" (max score 2.0) and "fox" (max score 3.0) against doc ids 0 to 7. Min competitive score = 1.6. Scores shown: 2.5, 1.6]

Slide 45

Slide 45 text

WAND: compute top 2 matches
[Figure: "the" (max score 0.2), "quick" (max score 2.0) and "fox" (max score 3.0) against doc ids 0 to 7. Min competitive score = 2.3. Scores shown: 2.5, 1.6, 2.3]

Slide 46

Slide 46 text

WAND: compute top 2 matches
[Figure: "the" (max score 0.2), "quick" (max score 2.0) and "fox" (max score 3.0) against doc ids 0 to 7. Min competitive score = 2.3. Scores shown: 2.5, 1.6, 2.3, X (skipped)]

Slide 47

Slide 47 text

WAND: compute top 2 matches
[Figure: "the" (max score 0.2), "quick" (max score 2.0) and "fox" (max score 3.0) against doc ids 0 to 7. Min competitive score = 2.5. Scores shown: 2.5, 1.6, 2.3, X (skipped), 4.0]

Slide 48

Slide 48 text

WAND: compute top 2 matches
[Figure: "the" (max score 0.2), "quick" (max score 2.0) and "fox" (max score 3.0) against doc ids 0 to 7. Min competitive score = 2.5. Scores shown: 2.5, 1.6, 2.3, X (skipped), 4.0, 2.4]

Slide 49

Slide 49 text

WAND: compute top 2 matches
[Figure: "the" (max score 0.2), "quick" (max score 2.0) and "fox" (max score 3.0) against doc ids 0 to 7. Min competitive score = 2.5. Scores shown: 2.5, 1.6, 2.3, X (skipped), 4.0, 2.4]
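The walkthrough above can be approximated with a simplified, in-memory WAND loop. This is a sketch under stated assumptions, not Lucene's implementation: postings are plain Python lists advanced one entry at a time (no skip lists), the function name is made up, and the postings and per-document scores are hypothetical because the figure's exact data is not recoverable. Only the max scores (0.2, 2.0, 3.0) and k = 2 come from the slides.

```python
import heapq

EXHAUSTED = float("inf")


def wand_top_k(postings, max_scores, k):
    """Return (top-k scores, doc ids that were actually scored)."""
    pos = {term: 0 for term in postings}

    def current_doc(term):
        plist = postings[term]
        return plist[pos[term]][0] if pos[term] < len(plist) else EXHAUSTED

    top = []     # min-heap holding the k best scores seen so far
    scored = []  # doc ids that had to be scored
    while True:
        min_competitive = top[0] if len(top) == k else 0.0
        live = sorted((t for t in postings if current_doc(t) != EXHAUSTED),
                      key=current_doc)
        # Find the next target: first clause where the running sum of
        # max scores reaches the minimum competitive score.
        total, target = 0.0, None
        for term in live:
            total += max_scores[term]
            if total >= min_competitive:
                target = current_doc(term)
                break
        if target is None:
            break  # no remaining document can be competitive
        # Skip every clause forward to the target, then score it.
        score = 0.0
        for term in live:
            while current_doc(term) < target:
                pos[term] += 1
            if current_doc(term) == target:
                score += postings[term][pos[term]][1]
                pos[term] += 1
        scored.append(target)
        if len(top) < k:
            heapq.heappush(top, score)
        elif score > top[0]:
            heapq.heapreplace(top, score)
    return sorted(top, reverse=True), scored


# Hypothetical postings: (doc id, score contributed by the term).
postings = {
    "the": [(doc, 0.1) for doc in range(8)],
    "quick": [(1, 1.5), (3, 2.0), (6, 1.0)],
    "fox": [(0, 2.5), (6, 3.0)],
}
max_scores = {"the": 0.2, "quick": 2.0, "fox": 3.0}

top, scored = wand_top_k(postings, max_scores, k=2)
print(top, scored)
```

On this data only 4 of the 8 documents are scored. The documents that match only "the" are skipped as soon as two better hits have been collected.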

Slide 50

Slide 50 text

WAND: speedup?
• 0 to 1000x faster
• If all terms have the same IDF: no improvement
• Otherwise: could be 1000x faster!

Slide 51

Slide 51 text

Where are we now?
• Disjunctions: ✓
• Other queries: ???

Slide 52

Slide 52 text

Impacts indexing

Slide 53

Slide 53 text

Indexing of impacts
What does the .doc file store?
• Block of 128 doc ids
• Skip data
• Block of 128 freqs

Slide 54

Slide 54 text

Skip data
• First doc id of the block
• Offset of block in .doc (same file)
• Offset in .pos (if positions indexed)
• Offset in .pay (if offsets or payloads indexed)
• List of competitive (freq, norm) pairs (NEW)
  • Makes it easy to know the upper bound of scores
  • Still allows changing the Similarity on an existing index
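One way to picture the list of competitive (freq, norm) pairs is as a Pareto front. The sketch assumes that the norm behaves like a document length, so that scores never decrease with a higher freq and never increase with a higher norm. Under that assumption a pair is dropped whenever another pair has a freq at least as high and a norm at least as low. The function names and the scoring function are hypothetical.

```python
def competitive_impacts(pairs):
    """Keep only (freq, norm) pairs that no other pair dominates."""
    front = []
    best_norm = float("inf")
    # Highest freq first; for equal freq, lowest norm first.
    for freq, norm in sorted(set(pairs), key=lambda p: (-p[0], p[1])):
        if norm < best_norm:
            front.append((freq, norm))
            best_norm = norm
    return front


def block_max_score(impacts, score_fn):
    """Upper bound of the scores in a block, for any monotonic score_fn."""
    return max(score_fn(freq, norm) for freq, norm in impacts)


def example_score(freq, norm, avg_len=20.0, k1=1.2, b=0.75):
    # BM25-like term frequency saturation, used only for illustration.
    return freq / (freq + k1 * (1 - b + b * norm / avg_len))


block = [(1, 10), (3, 50), (2, 10), (3, 40), (1, 5)]
impacts = competitive_impacts(block)
print(impacts)  # [(3, 40), (2, 10), (1, 5)]
print(block_max_score(impacts, example_score)
      == max(example_score(f, n) for f, n in block))  # True
```

Because the stored pairs are raw frequencies and norms rather than precomputed scores, the bound can be recomputed with a different scoring function, which is consistent with the point about changing the Similarity.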

Slide 55

Slide 55 text

Usage of impacts?
• Term queries: skip blocks whose max score is not competitive
• Conjunctions: skip blocks whose sum of max scores is not competitive
• Disjunctions: WAND ⇒ block-max WAND
• Other queries: TODO
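For the term query case, block skipping can be sketched as a filter over blocks. The (max score, doc ids) block representation and the function name are assumptions made for illustration, and real blocks hold 128 doc ids rather than the two used here.

```python
def competitive_candidates(blocks, min_competitive_score):
    """Yield doc ids from blocks whose max score can still be competitive.

    blocks: list of (block_max_score, [doc ids]) in doc id order.
    """
    for block_max_score, doc_ids in blocks:
        if block_max_score < min_competitive_score:
            continue  # the whole block is skipped without decoding it
        yield from doc_ids


blocks = [
    (0.4, [0, 1]),
    (2.8, [2, 3]),
    (1.0, [4, 5]),
]
print(list(competitive_candidates(blocks, min_competitive_score=2.5)))  # [2, 3]
```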

Slide 56

Slide 56 text

Speedup?
• Term queries: ~8x faster
• Conjunctions and disjunctions
  • Many times faster when terms frequently appear together (united AND kingdom, new OR york)
  • Depends a lot on data distribution otherwise

Slide 57

Slide 57 text

More Questions? Visit us at the AMA

Slide 58

Slide 58 text

www.elastic.co

Slide 59

Slide 59 text

Please attribute Elastic with a link to elastic.co
Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/
Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.