Slide 1

Slide 1 text

Geospatial Data Structures in Elasticsearch and Lucene
Nick Knize @nknize

Slide 2

Slide 2 text

Geo capabilities are becoming more popular among Elasticsearch users

Slide 3

Slide 3 text

From geo combined with free-text search…

Slide 4

Slide 4 text

… to monitoring network traffic to identify “bad actors”

Slide 5

Slide 5 text

Topics
1. Geo field types
2. Geo indexing
3. Geo search
4. Geo Aggregations

Slide 6

Slide 6 text

geo_point Mappings: define

PUT crime/incidents/_mapping
{
  "properties" : {
    "location" : {
      "type" : "geo_point",
      "ignore_malformed" : true,
      "geohash_precision" : 6,
      "geohash_prefix" : true
    }
  }
}

Slide 7

Slide 7 text

geo_point Mappings: insert

POST crime/incidents
{ "location" : { "lat" : 41.12, "lon" : -71.34 } }

POST crime/incidents
{ "location" : "41.12, -71.34" }

POST crime/incidents
{ "location" : [[-71.34, 41.12], [-71.32, 41.21]] }
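The three input forms above differ in coordinate order: the object and string forms read as lat, lon, while the array form is lon, lat. A minimal sketch of a normalizer follows, assuming only those three forms need handling (geohash strings, for example, are ignored). The function names `parse_geo_point` and `parse_location` are hypothetical and not part of any Elasticsearch client.

```python
def parse_geo_point(value):
    """Return (lat, lon) for one point given in object, string or array form."""
    if isinstance(value, dict):
        return (float(value["lat"]), float(value["lon"]))
    if isinstance(value, str):
        # String form is "lat, lon".
        lat, lon = (float(part) for part in value.split(","))
        return (lat, lon)
    # Array form is [lon, lat].
    lon, lat = value
    return (float(lat), float(lon))


def parse_location(value):
    """Return a list of (lat, lon) tuples; an array of arrays holds several points."""
    if isinstance(value, list) and value and isinstance(value[0], list):
        return [parse_geo_point(item) for item in value]
    return [parse_geo_point(value)]


print(parse_location({"lat": 41.12, "lon": -71.34}))
print(parse_location("41.12, -71.34"))
print(parse_location([[-71.34, 41.12], [-71.32, 41.21]]))
```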

Slide 8

Slide 8 text

geo_shape Mappings: define

PUT police/precincts/_mapping
{
  "properties" : {
    "coverage" : {
      "type" : "geo_shape",
      "ignore_malformed" : false,
      "tree" : "quadtree",
      "precision" : "5m",
      "distance_error_pct" : 0.025,
      "orientation" : "ccw",
      "points_only" : false
    }
  }
}

Slide 9

Slide 9 text

geo_shape Mappings: insert

• Shapes are parsed using OGC and ISO standards definitions
  ‒ OGC Simple Feature Access
  ‒ ISO Geographic information - Spatial Schema (19107:2003)
• Supports the following geo_shape types
  ‒ Point, MultiPoint
  ‒ LineString, MultiLineString
  ‒ Polygon (with holes), MultiPolygon (with holes)
  ‒ Envelope (box)

Slide 10

Slide 10 text

geo_shape Mappings: insert geo_shape

PUT police/precincts/
{
  "coverage" : {
    "type" : "polygon",
    "coordinates" : [[
      [-73.9762134, 40.7538588],
      [-73.9742356, 40.7526327],
      [-73.9656733, 40.7516774],
      [-73.9763236, 40.7521246],
      [-73.9723788, 40.7516733],
      [-73.9732423, 40.7523556],
      [-73.9762134, 40.7538588]
    ]]
  }
}

Slide 11

Slide 11 text

Geo Indexing

Slide 12

Slide 12 text

geo_point Indexing
postings based GeoPointField introduced in LUCENE 5.4

term    postings (doc ids)
1       1, 2, 3, 4, 5
10      1, 2, 4
11      3, 5
100     1
101     2, 4
111     3, 5
1000    1
1010    4
1011    2
1110    3
1111    5
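The term and postings listing above can be reproduced with a small sketch. It assumes each document reduces to a single 4-bit code (the full-length terms at the end of the listing) and that every prefix of that code is indexed as its own term. `build_prefix_postings` is a hypothetical helper, not a Lucene API.

```python
from collections import defaultdict


def build_prefix_postings(doc_codes):
    """Map every prefix of each document's bit string to a sorted doc id list."""
    postings = defaultdict(list)
    for doc_id in sorted(doc_codes):
        code = doc_codes[doc_id]
        for length in range(1, len(code) + 1):
            postings[code[:length]].append(doc_id)
    return dict(postings)


# Full-precision codes read off the last five rows of the listing.
doc_codes = {1: "1000", 2: "1011", 3: "1110", 4: "1010", 5: "1111"}
postings = build_prefix_postings(doc_codes)

# Print shortest (coarsest) terms first, as on the slide.
for term in sorted(postings, key=lambda t: (len(t), t)):
    print(term, postings[term])
```

Short terms cover large areas and long postings lists; long terms cover small areas and few documents, which is what lets a query pick coarse terms for the interior of a region and fine terms only along its edge.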

Slide 13

Slide 13 text

geo_point Indexing
postings based GeoPointField introduced in LUCENE 5.4

1. Encode lat, lon as a 64 bit integer (“z-curve”, or “morton code”)
2. Create prefix terms: PRECISION_STEP (p) blocks up to 4p bits
3. Add doc id to postings list (inverted index)
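A rough sketch of steps 1 and 2, under stated assumptions: each coordinate is quantized to 32 bits, longitude takes the even bit positions, and one prefix term is produced per `precision_step` bits by shifting low bits away. The step value of 9 and all function names are illustrative; Lucene's actual term encoding is more involved than this.

```python
def scale(value, lo, hi, bits=32):
    """Quantize value in [lo, hi] to an unsigned integer of `bits` bits."""
    cells = (1 << bits) - 1
    return int((value - lo) / (hi - lo) * cells)


def interleave(x, y, bits=32):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def morton_encode(lat, lon):
    """Encode a lat/lon pair as a 64 bit z-curve (Morton) value."""
    return interleave(scale(lon, -180.0, 180.0), scale(lat, -90.0, 90.0))


def prefix_terms(code, precision_step=9, total_bits=64):
    """Return (shift, prefix) pairs; each term drops `shift` low-order bits."""
    return [(shift, code >> shift) for shift in range(0, total_bits, precision_step)]


code = morton_encode(41.12, -71.34)
for shift, prefix in prefix_terms(code):
    print(shift, bin(prefix))
```

A larger step produces fewer terms per point (smaller index) at the cost of more terms visited per query, which is why the same value has to be used at index and search time.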

Slide 14

Slide 14 text

geo_point Indexing
postings based GeoPointField introduced in LUCENE 5.4

[Chart: Throughput, Index Size, Time and Heap on a 0% to 100% axis, series 2.1, 2.2 and 2.3; data labels as extracted: 12%, 39%, 28%, 54%, 66%, 52%, 80%, 53%]

Slide 15

Slide 15 text

geo_shape Indexing
QuadPrefixTree and PackedQuadPrefixTree introduced in LUCENE 5.4

1. Divide earth into 64 bit integer quad cells (QuadTree)
2. Create WITHIN and INTERSECT prefix terms (cells) (WITHIN == lowest res) (INTERSECT == highest res)
3. Add doc id to postings list (inverted index)
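The cell decomposition in steps 1 and 2 can be sketched for the simplest shape, a rectangle. This is an illustration only: cells are named by a string of quadrant digits rather than packed into a 64 bit integer, and `relate` and `quad_cells` are hypothetical names. A cell fully inside the shape is emitted as soon as it is found (coarse, WITHIN); a cell crossing the edge is split until `max_levels` is reached (fine, INTERSECT).

```python
def relate(cell, shape):
    """Relate a cell box to a shape box; boxes are (min_x, min_y, max_x, max_y)."""
    cx0, cy0, cx1, cy1 = cell
    sx0, sy0, sx1, sy1 = shape
    if cx1 <= sx0 or cx0 >= sx1 or cy1 <= sy0 or cy0 >= sy1:
        return "disjoint"
    if cx0 >= sx0 and cx1 <= sx1 and cy0 >= sy0 and cy1 <= sy1:
        return "within"
    return "intersects"


def quad_cells(shape, max_levels, cell=(-180.0, -90.0, 180.0, 90.0), path=""):
    """Return (cell path, relation) terms covering the shape."""
    rel = relate(cell, shape)
    if rel == "disjoint":
        return []
    if rel == "within" or len(path) == max_levels:
        return [(path, rel)]
    x0, y0, x1, y1 = cell
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    # Quadrants: 0 = north-west, 1 = north-east, 2 = south-west, 3 = south-east.
    quads = [(x0, my, mx, y1), (mx, my, x1, y1), (x0, y0, mx, my), (mx, y0, x1, my)]
    terms = []
    for index, quad in enumerate(quads):
        terms.extend(quad_cells(shape, max_levels, quad, path + str(index)))
    return terms


print(quad_cells((0.0, 0.0, 90.0, 90.0), max_levels=1))
print(quad_cells((0.0, 0.0, 90.0, 90.0), max_levels=2))
```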

Slide 16

Slide 16 text

geo_shape Indexing
QuadPrefixTree and PackedQuadPrefixTree introduced in LUCENE 5.4

• Max tree_levels == 32 (2 bits / cell)
• distance_error_pct
  ‒ “slop” factor to manage transient memory usage
  ‒ % of the diagonal distance (degrees) of the shape
  ‒ Default == 0 if precision set (2.0)
• points_only
  ‒ optimization for points only shape index
  ‒ short-circuits recursion

Slide 17

Slide 17 text

geo_shape vs. geo_point
…why can’t we all just get along?

Slide 18

Slide 18 text

geo @experimental

Slide 19

Slide 19 text

geo Indexing
Tree Structure (e.g., Balanced K-d Tree, Bkd-tree) coming LUCENE 6.0

[Figure: diagram with elements labeled a through p]

Slide 20

Slide 20 text

geo Indexing
Tree Structure (e.g., Balanced K-d Tree, Bkd-tree) coming LUCENE 6.0

1. Encode lat, lon as a 64 bit integer (“z-curve”, or “morton code”)
2. Start at root, recurse to the ideal “leaf” bucket (either its bounding box can contain the point, or expand the box)
3. Split on the longest span dimension (ensures “squareness”)
4. Write blocks (when #children == M, or on flush when M/2 < #children < M)
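The "split on longest span dimension" rule in step 3 can be illustrated with a simplified bulk build. This is a plain in-memory k-d tree sketch, not the Bkd-tree writer: it ignores the Morton encoding and on-disk blocks, splits at the median, and uses a hypothetical `max_leaf` in place of M.

```python
def build_kd(points, max_leaf=2):
    """Build a k-d tree from (doc_id, lat, lon) tuples.

    Every node stores its bounding box as (min_lat, min_lon, max_lat, max_lon).
    """
    lats = [p[1] for p in points]
    lons = [p[2] for p in points]
    box = (min(lats), min(lons), max(lats), max(lons))
    if len(points) <= max_leaf:
        return {"box": box, "docs": [p[0] for p in points]}
    # Split along whichever dimension has the longer span to keep cells "square".
    dim = 1 if (box[2] - box[0]) >= (box[3] - box[1]) else 2
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    return {
        "box": box,
        "split_dim": "lat" if dim == 1 else "lon",
        "left": build_kd(ordered[:mid], max_leaf),
        "right": build_kd(ordered[mid:], max_leaf),
    }


points = [(1, 0.0, 0.0), (2, 0.0, 10.0), (3, 1.0, 20.0), (4, 1.0, 30.0)]
tree = build_kd(points)
print(tree["split_dim"], tree["left"]["docs"], tree["right"]["docs"])
```

In this sample the longitude span (30) is wider than the latitude span (1), so the root splits on longitude.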

Slide 21

Slide 21 text

geo Indexing: points

Slide 22

Slide 22 text

geo Indexing: shapes

• Dimensional Shapes represented using Minimum Bounding Ranges (MBR)
  ‒ Ranges (1D)
  ‒ Rectangles (2D)
  ‒ Cubes (3D)
  ‒ Hexadecant (4D)
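As a small illustration of a minimum bounding range, the sketch below takes the per-dimension minimum and maximum over a shape's vertices. It works for any number of dimensions; the function name `mbr` is an assumption made for this example.

```python
def mbr(points):
    """Return (mins, maxs) tuples bounding the given N-dimensional points."""
    dims = range(len(points[0]))
    mins = tuple(min(p[d] for p in points) for d in dims)
    maxs = tuple(max(p[d] for p in points) for d in dims)
    return mins, maxs


print(mbr([(4,), (9,), (6,)]))            # 1D: a range
print(mbr([(0, 5), (2, 1), (1, 3)]))      # 2D: a rectangle
print(mbr([(0, 0, 0), (1, 2, 3)]))        # 3D: a box
```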

Slide 23

Slide 23 text

geo Indexing: shapes

[Figure: animated diagram with elements labeled m, n, o, p]

Slide 30

Slide 30 text

[Figure; source: Wikipedia]

Slide 31

Slide 31 text

geo Indexing: 1D numerics

[Chart: Index Size and Index Time, NumericField vs. PointField, on a 0% to 100% axis; data labels as extracted: 49%, 49%, 100%, 100%]

Slide 32

Slide 32 text

Geo Search

Slide 33

Slide 33 text

geo_point Search
postings based GeoPointField introduced in LUCENE 5.4

1. Divide earth into 64 bit integer quad cells (QuadTree)
2. Create WITHIN and INTERSECT prefix terms (cells) (WITHIN == lowest res) (INTERSECT == highest res)
3. Retrieve doc ids from postings list, use DocValues to post filter INTERSECT terms

Slide 34

Slide 34 text

geo_point Search
postings based GeoPointField introduced in LUCENE 5.4

• New LUCENE Spatial Queries (5.4 / ES 2.2)
  ‒ BoundingBox, Distance, DistanceRange, Polygon
• PRECISION_STEP controls number of query terms (must match with index)
• TwoPhaseIterator (2.3)
  ‒ Delays boundary confirmation so other queries (filters, conjunctions) can pre-filter
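The two-phase idea can be sketched without Lucene: the postings give a cheap approximation (documents in boundary cells), the rest of the query narrows that set first, and only the survivors pay for the exact check against their stored coordinates. Everything here is a stand-in; `two_phase_search`, `doc_values` and `in_box` are hypothetical names, and Lucene's `TwoPhaseIterator` operates on iterators rather than lists.

```python
def two_phase_search(candidates, other_filter, confirm):
    """Return (hits, number of exact checks performed).

    candidates   -- doc ids from boundary (INTERSECT) terms, the approximation
    other_filter -- set of doc ids matching the other clauses of the query
    confirm      -- exact per-document check, the expensive second phase
    """
    hits, checked = [], 0
    for doc in candidates:
        if doc not in other_filter:
            continue  # rejected by the conjunction before any geometry work
        checked += 1
        if confirm(doc):
            hits.append(doc)
    return hits, checked


# Stored (lat, lon) per document, standing in for DocValues.
doc_values = {1: (41.12, -71.34), 2: (41.50, -71.00), 3: (50.00, -71.30), 4: (41.20, -71.30)}


def in_box(doc):
    lat, lon = doc_values[doc]
    return 41.0 <= lat <= 41.3 and -71.5 <= lon <= -71.2


hits, checked = two_phase_search([1, 2, 3, 4], {1, 3}, in_box)
print(hits, checked)
```

Four candidates come out of the postings, but only two exact checks run because the other filter removes documents 2 and 4 first.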

Slide 35

Slide 35 text

geo_point Search
postings based GeoPointField introduced in LUCENE 5.4

[Chart: BoundingBox, Distance, DistanceRange and Polygon queries on a 0% to 100% axis, series 2.1, 2.2 and 2.3; data labels as extracted: 11%, 21%, 26%, 36%, 51%, 45%, 70%, 82%]

Slide 36

Slide 36 text

geo_shape Search: geo_shape field

• Supports the following geo_shape types
  ‒ Point, MultiPoint
  ‒ LineString, MultiLineString
  ‒ Polygon (with holes), MultiPolygon (with holes)
  ‒ Envelope (box)
• Shapes are parsed using OGC (SFA) and ISO (19107:2003) standards definitions
• Supports relational queries
  ‒ INTERSECTS, DISJOINT, WITHIN, CONTAINS

Slide 37

Slide 37 text

geo_shape Search: INTERSECTS

1. Recursively traverse query terms
2. OR the DocIDs from each term’s postings list

Slide 38

Slide 38 text

geo_shape Search: DISJOINT

DISJOINT = !INTERSECTS, expressed as a BooleanQuery:
1. (MUST, EXISTS)
2. (MUST_NOT, INTERSECTS)
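INTERSECTS and DISJOINT reduce to set operations over postings lists. The sketch below assumes the query shape has already been turned into cell terms and that postings are plain Python lists; the function names and the sample cell names are made up for illustration.

```python
def intersects(query_terms, postings):
    """OR together the doc ids of every query term."""
    hits = set()
    for term in query_terms:
        hits |= set(postings.get(term, ()))
    return hits


def disjoint(query_terms, postings, docs_with_field):
    """MUST exist, MUST_NOT intersect: every doc with a shape minus the intersecting ones."""
    return set(docs_with_field) - intersects(query_terms, postings)


postings = {"a": [1, 2], "b": [2, 3], "c": [4]}
docs_with_field = [1, 2, 3, 4, 5]

print(sorted(intersects(["a", "b"], postings)))
print(sorted(disjoint(["a", "b"], postings, docs_with_field)))
```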

Slide 39

Slide 39 text

geo_shape Search: WITHIN

1. Buffer shape by PERCENT_DISTANCE (provides EXCLUDE terms outside the shape perimeter)
2. Traverse query terms of the buffered shape
3. Compute relation between term and unbuffered query shape
4. Accept docs whose terms INTERSECT / WITHIN but MUST_NOT contain DISJOINT terms
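Step 4 can be sketched as follows, assuming steps 1 to 3 have already produced a relation for every cell of the buffered shape. Cells in the buffer ring are disjoint from the original query shape, so any document indexed under one of them reaches outside the query and is rejected. `within_search` and the cell names are hypothetical.

```python
def within_search(cell_relations, postings):
    """Return sorted doc ids whose cells all lie inside the unbuffered query shape.

    cell_relations -- {cell term: "within" | "intersects" | "disjoint"} relative
                      to the unbuffered query shape
    postings       -- {cell term: [doc ids]}
    """
    accept, exclude = set(), set()
    for term, relation in cell_relations.items():
        docs = postings.get(term, ())
        if relation == "disjoint":
            exclude.update(docs)   # EXCLUDE terms from the buffer ring
        else:
            accept.update(docs)
    return sorted(accept - exclude)


postings = {"a": [1, 2], "b": [2, 3], "x": [3, 4]}
cell_relations = {"a": "within", "b": "intersects", "x": "disjoint"}
print(within_search(cell_relations, postings))
```

Document 3 is dropped because it also occupies cell `x`, which lies outside the query shape; a wider buffer catches more such documents at the cost of more terms to traverse.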

Slide 40

Slide 40 text

geo_shape Search: WITHIN

• PERCENT_DISTANCE
  ‒ -1: traverses entire map. Costly, but more accurate
  ‒ >0: buffered shape. Faster at the cost of accuracy
  ‒ ES 2.3 uses distance_error_pct, or 2.5% if set to 0

Slide 41

Slide 41 text

geo_shape Search: CONTAINS - NEW in 2.2

1. Recursively traverse query terms
2. AND the DocIDs from each term’s postings list
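CONTAINS is the mirror image of INTERSECTS: an indexed shape can only contain the query shape if it appears in the postings of every query cell. A minimal sketch under the same simplifications (terms precomputed, postings as lists, hypothetical function name):

```python
def contains(query_terms, postings):
    """AND together the doc ids of every query term."""
    result = None
    for term in query_terms:
        docs = set(postings.get(term, ()))
        result = docs if result is None else result & docs
    return result or set()


postings = {"a": [1, 2], "b": [2, 3], "c": [2, 4]}
print(sorted(contains(["a", "b", "c"], postings)))
```

Only document 2 is present in all three cells, so it is the only candidate that can cover the whole query shape.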

Slide 42

Slide 42 text

geo @experimental

Slide 43

Slide 43 text

geo Search
Tree Structure (e.g., Balanced K-d Tree, Bkd-tree) coming LUCENE 6.0

1. Begin with root node bounding box
2. RELATE with search criteria (WITHIN, CONTAIN, INTERSECT)
3. DFS traverse (until Internal Node WITHIN or Leaf Node WITHIN or INTERSECTS)
4. Collect doc IDs (WITHIN == ALL, INTERSECTS == post filter)
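The traversal above can be sketched against a tiny hand-built tree with a bounding-box query. Node layout, field names and the `relate`, `collect_all` and `search` helpers are assumptions for illustration; the real structure stores packed blocks on disk.

```python
def relate(box, query):
    """Relate a node box to the query box; both are (min_lat, min_lon, max_lat, max_lon)."""
    if box[2] < query[0] or box[0] > query[2] or box[3] < query[1] or box[1] > query[3]:
        return "disjoint"
    if box[0] >= query[0] and box[2] <= query[2] and box[1] >= query[1] and box[3] <= query[3]:
        return "within"
    return "intersects"


def collect_all(node, hits):
    """Add every doc under a node without checking individual points."""
    if "points" in node:
        hits.extend(doc for doc, _lat, _lon in node["points"])
    else:
        for child in node["children"]:
            collect_all(child, hits)


def search(node, query, hits=None):
    """Depth-first search returning doc ids of points inside the query box."""
    hits = [] if hits is None else hits
    relation = relate(node["box"], query)
    if relation == "disjoint":
        return hits
    if relation == "within":
        collect_all(node, hits)              # WITHIN == ALL
    elif "points" in node:
        for doc, lat, lon in node["points"]:  # INTERSECTS == post filter
            if query[0] <= lat <= query[2] and query[1] <= lon <= query[3]:
                hits.append(doc)
    else:
        for child in node["children"]:
            search(child, query, hits)
    return hits


tree = {
    "box": (0, 0, 9, 9),
    "children": [
        {"box": (0, 0, 1, 1), "points": [(1, 0, 0), (2, 1, 1)]},
        {"box": (5, 5, 9, 9), "points": [(3, 5, 5), (4, 9, 9)]},
    ],
}
print(search(tree, (0, 0, 6, 6)))
```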

Slide 44

Slide 44 text

geo Search
Tree Structure for points (e.g., Balanced K-d Tree, Bkd-tree) coming LUCENE 6.0

1. Leaf cell is fully within polygon (salmon) - return all docs
2. Leaf cell crosses the boundary (gray) - two-phase check

Slide 45

Slide 45 text

geo Search
Tree Structure for shapes (e.g., R* Tree) coming LUCENE 6.x/?

[Figure: R* tree diagram with elements labeled a through p and nodes m, n, o, p; search annotations "a!" and "x"]

Slide 46

Slide 46 text

geo Search
Tree Structure for shapes (e.g., R* Tree) coming LUCENE 6.x/?

[Figure: R* tree diagram with elements labeled a through p and nodes m, n, o, p; search annotations "a!" and "x"; elements a, b, c, d, e, g highlighted]

Slide 47

Slide 47 text

geo Search
Tree Structure for shapes (e.g., R* Tree) coming LUCENE 6.x/?

[Figure; source: Wikipedia]

Slide 48

Slide 48 text

geo Search: 1D numerics

[Chart: Search Time and Heap Usage, NumericField vs. PointField, on a 0% to 100% axis; data labels as extracted: 15%, 76%, 100%, 100%]

Slide 49

Slide 49 text

Geo Aggregations

Slide 50

Slide 50 text

GeoDistance Agg

{
  "aggs" : {
    "sf_rings" : {
      "geo_distance" : {
        "field" : "location",
        "origin" : [-96.82, 32.95],
        "ranges" : [
          { "to" : 50 },
          { "from" : 50, "to" : 100 },
          { "from" : 100, "to" : 300 }
        ]
      }
    }
  }
}
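What the rings compute can be approximated in a few lines: measure each point's distance from the origin and count it in the range it falls into, with `from` inclusive and `to` exclusive. The sketch assumes distances in meters, a haversine formula on a spherical earth, and an origin at lat 32.95, lon -96.82; the mean radius constant and function names are choices made for this example, not what Elasticsearch uses internally.

```python
import math

EARTH_RADIUS_M = 6371008.8  # assumed mean earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bucket_counts(origin, points, ranges):
    """Count points per (from, to) range; None means the bound is open."""
    counts = [0] * len(ranges)
    for lat, lon in points:
        dist = haversine_m(origin[0], origin[1], lat, lon)
        for i, (lo, hi) in enumerate(ranges):
            if (lo is None or dist >= lo) and (hi is None or dist < hi):
                counts[i] += 1
    return counts


origin = (32.95, -96.82)                      # (lat, lon)
ranges = [(None, 50), (50, 100), (100, 300)]  # same rings as the request above
points = [
    (32.95, -96.82),    # at the origin
    (32.9507, -96.82),  # roughly 78 m north
    (32.952, -96.82),   # roughly 222 m north
    (32.96, -96.82),    # roughly 1.1 km north, outside every ring
]
print(bucket_counts(origin, points, ranges))
```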

Slide 51

Slide 51 text

GeoDistance Agg

Slide 52

Slide 52 text

GeoGrid Agg

{
  "aggs" : {
    "crime_cells" : {
      "geohash_grid" : {
        "field" : "location",
        "precision" : 8
      }
    }
  }
}
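The grid buckets are keyed by geohash, and `precision` is the number of geohash characters. A sketch of the standard geohash encoding follows (alternating longitude and latitude bisection, five bits per base-32 character); the function name is an assumption, and grouping documents by the returned string approximates what the aggregation buckets look like.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat, lon, precision=8):
    """Encode a lat/lon pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, value = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid
        else:
            value <<= 1
            rng[1] = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


print(geohash(41.12, -71.34, precision=8))
print(geohash(57.64911, 10.40744, precision=5))
```

Because a shorter geohash is a prefix of a longer one for the same point, lowering `precision` merges neighbouring buckets into their parent cell.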

Slide 53

Slide 53 text

GeoGrid Agg

Slide 54

Slide 54 text

GeoCentroid Agg

{
  "query" : {
    "match" : { "crime" : "burglary" }
  },
  "aggs" : {
    "towns" : {
      "terms" : { "field" : "town" },
      "aggs" : {
        "centroid" : {
          "geo_centroid" : { "field" : "location" }
        }
      }
    }
  }
}
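The nesting above (a terms bucket per town, then a centroid per bucket) can be mimicked on a list of documents. This sketch assumes the centroid is a plain arithmetic mean of latitude and longitude, which is only a reasonable approximation for points that sit close together and away from the antimeridian; the helper name and sample documents are invented.

```python
from collections import defaultdict


def centroids_by_term(docs, term_field="town", geo_field="location"):
    """Return {term: (mean lat, mean lon)} over docs holding (lat, lon) tuples."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for doc in docs:
        lat, lon = doc[geo_field]
        acc = sums[doc[term_field]]
        acc[0] += lat
        acc[1] += lon
        acc[2] += 1
    return {term: (s[0] / s[2], s[1] / s[2]) for term, s in sums.items()}


docs = [
    {"town": "north", "location": (1.0, 2.0)},
    {"town": "north", "location": (3.0, 4.0)},
    {"town": "south", "location": (-5.0, 10.0)},
]
print(centroids_by_term(docs))
```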

Slide 55

Slide 55 text

GeoCentroid Agg

Slide 56

Slide 56 text

GeoCentroid Agg

Slide 57

Slide 57 text

Questions?
Also find us at the AMA Booth

Slide 58

Slide 58 text

Please attribute Elastic with a link to elastic.co. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/ Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.