Elastic Co
February 19, 2016

Geospatial Data Structures in Elasticsearch and Apache Lucene

This talk covered everything you ever wanted to know about geo and Elasticsearch! Get advice on field mapping strategies, learn about geo aggregations and visualizations for exploratory spatial data analysis, as well as get insights into new spatial data structures being added to Lucene and Elasticsearch.


Transcript

  1. geo_point Mappings: define

    PUT crime/incidents/_mapping
    {
      "properties" : {
        "location" : {
          "type" : "geo_point",
          "ignore_malformed" : true,
          "geohash_precision" : 6,
          "geohash_prefix" : true
        }
      }
    }
  2. geo_point Mappings: insert

    POST crime/incidents
    { "location" : { "lat" : 41.12, "lon" : -71.34 } }

    POST crime/incidents
    { "location" : "41.12, -71.34" }

    POST crime/incidents
    { "location" : [[-71.34, 41.12], [-71.32, 41.21]] }
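
The three request bodies above use different coordinate orders: the object form names `lat` and `lon`, the string form is "lat, lon", and the array form is [lon, lat]. A minimal Python sketch of how a client might normalize them before indexing follows. The function name `normalize_geo_points` is hypothetical and is not part of any Elasticsearch client.

```python
from typing import Any, List, Tuple


def normalize_geo_points(value: Any) -> List[Tuple[float, float]]:
    """Return (lat, lon) tuples for the geo_point input forms shown on the slide."""
    if isinstance(value, dict):
        # Object form: {"lat": ..., "lon": ...}
        return [(float(value["lat"]), float(value["lon"]))]
    if isinstance(value, str):
        # String form: "lat, lon"
        lat, lon = (float(part) for part in value.split(","))
        return [(lat, lon)]
    if isinstance(value, (list, tuple)):
        if value and isinstance(value[0], (int, float)):
            # Array form: [lon, lat] (longitude first)
            lon, lat = value
            return [(float(lat), float(lon))]
        # A list of points: normalize each element in turn
        points: List[Tuple[float, float]] = []
        for item in value:
            points.extend(normalize_geo_points(item))
        return points
    raise ValueError(f"unsupported geo_point value: {value!r}")
```

All three slide examples reduce to the same first point, (41.12, -71.34), which is an easy way to check that the longitude-first array order has not been mixed up.
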
  3. geo_shape Mappings: define

    PUT police/precincts/_mapping
    {
      "properties" : {
        "coverage" : {
          "type" : "geo_shape",
          "ignore_malformed" : false,
          "tree" : "quadtree",
          "precision" : "5m",
          "distance_error_pct" : 0.025,
          "orientation" : "ccw",
          "points_only" : false
        }
      }
    }
  4. geo_shape Mappings: insert

    • Shapes are parsed using OGC and ISO standards definitions
      ‒ OGC Simple Feature Access
      ‒ ISO Geographic information — Spatial Schema (19107:2003)
    • Supports the following geo_shape types
      ‒ Point, MultiPoint
      ‒ LineString, MultiLineString
      ‒ Polygon (with holes), MultiPolygon (with holes)
      ‒ Envelope (box)
  5. geo_shape Mappings: insert geo_shape

    POST police/precincts
    {
      "coverage" : {
        "type" : "polygon",
        "coordinates" : [[
          [40.7538588, -73.9762134],
          [40.7526327, -73.9742356],
          [40.7516774, -73.9656733],
          [40.7521246, -73.9763236],
          [40.7516733, -73.9723788],
          [40.7523556, -73.9732423],
          [40.7538588, -73.9762134]
        ]]
      }
    }
  6. geo_point Indexing: postings-based GeoPointField, introduced in Lucene 5.4

    term    postings (doc ids)
    1       1, 2, 3, 4, 5
    10      1, 2, 4
    11      3, 5
    100     1
    101     2, 4
    111     3, 5
    1000    1
    1010    4
    1011    2
    1110    3
    1111    5
  7. geo_point Indexing: postings-based GeoPointField, introduced in Lucene 5.4

    1. Encode lat, lon as a 64-bit integer ("z-curve", or "Morton code")
    2. Create prefix terms: PRECISION_STEP (p) blocks, up to 4p bits
    3. Add doc id to postings list (inverted index)
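
The three indexing steps can be sketched in Python as below. This is an illustration under stated assumptions, not Lucene's implementation: the quantization to 32 bits per dimension, the default `precision_step=9`, and the way prefixes are produced by shifting off low bits are all choices made for this sketch, and every function name is hypothetical.

```python
from collections import defaultdict


def quantize(value, low, high, bits=32):
    """Scale value from [low, high] onto an unsigned integer of `bits` bits."""
    scale = ((1 << bits) - 1) / (high - low)
    return int((value - low) * scale)


def interleave(x, y, bits=32):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def morton_encode(lat, lon):
    """Step 1: encode lat, lon as a single 64-bit z-curve (Morton) code."""
    return interleave(quantize(lon, -180.0, 180.0), quantize(lat, -90.0, 90.0))


def prefix_terms(code, precision_step=9, total_bits=64):
    """Step 2: (shift, prefix) terms, dropping precision_step more low bits each time."""
    return [(shift, code >> shift) for shift in range(0, total_bits, precision_step)]


def index_points(docs):
    """Step 3: add each doc id to the postings list of every prefix term."""
    postings = defaultdict(list)
    for doc_id, (lat, lon) in docs.items():
        for term in prefix_terms(morton_encode(lat, lon)):
            postings[term].append(doc_id)
    return postings
```

Points that are close together share their coarse prefixes, so they end up in the same postings lists for the low-resolution terms and diverge only at the finer ones.
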
  8. geo_point Indexing: postings-based GeoPointField, introduced in Lucene 5.4

    [Bar chart, axis 0% to 100%, comparing versions 2.1, 2.2 and 2.3 on Throughput, Index Size, Time and Heap. Bar labels as extracted: 12%, 39%, 28%, 54%, 66%, 52%, 80%, 53%.]
  9. geo_shape Indexing: QuadPrefixTree and PackedQuadPrefixTree, introduced in Lucene 5.4

    1. Divide earth into 64-bit integer quad cells (QuadTree)
    2. Create WITHIN and INTERSECT prefix terms (cells) (WITHIN == lowest res) (INTERSECT == highest res)
    3. Add doc id to postings list (inverted index)
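
A minimal Python sketch of the cell decomposition described above, assuming the indexed shape is a simple rectangle and using hypothetical quadrant labels A to D. Cells that lie fully inside the shape stop recursing early (low resolution), while cells that straddle the boundary are refined down to `max_levels` (high resolution). Lucene's prefix trees handle arbitrary shapes and encode cells as integers; none of that is modeled here.

```python
def relate(cell, shape):
    """Relation of a cell to a shape; both are (min_x, min_y, max_x, max_y)."""
    cx0, cy0, cx1, cy1 = cell
    sx0, sy0, sx1, sy1 = shape
    if cx1 <= sx0 or cx0 >= sx1 or cy1 <= sy0 or cy0 >= sy1:
        return "DISJOINT"
    if cx0 >= sx0 and cx1 <= sx1 and cy0 >= sy0 and cy1 <= sy1:
        return "WITHIN"
    return "INTERSECTS"


def cover(shape, cell=(-180.0, -90.0, 180.0, 90.0), prefix="", max_levels=3):
    """Return (cell_prefix, relation) terms that cover the shape."""
    relation = relate(cell, shape)
    if relation == "DISJOINT":
        return []
    # WITHIN cells stop early; INTERSECTS cells stop only at the finest level.
    if relation == "WITHIN" or len(prefix) == max_levels:
        return [(prefix, relation)]
    x0, y0, x1, y1 = cell
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = {
        "A": (x0, my, mx, y1),  # north-west
        "B": (mx, my, x1, y1),  # north-east
        "C": (x0, y0, mx, my),  # south-west
        "D": (mx, y0, x1, my),  # south-east
    }
    terms = []
    for label, quad in quadrants.items():
        terms.extend(cover(shape, quad, prefix + label, max_levels))
    return terms
```

Each returned prefix would become a term whose postings list receives the document id.
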
  10. geo_shape Indexing: QuadPrefixTree and PackedQuadPrefixTree, introduced in Lucene 5.4

    • Max tree_levels == 32 (2 bits / cell)
    • distance_error_pct
      ‒ "slop" factor to manage transient memory usage
      ‒ % of the diagonal distance (degrees) of the shape
      ‒ Default == 0 if precision set (2.0)
    • points_only
      ‒ optimization for points only shape index
      ‒ short-circuits recursion
  11. geo Indexing: tree structure (e.g., balanced k-d tree, Bkd-tree), coming in Lucene 6.0

    [Diagram: tree over points labeled a to p.]
  12. geo Indexing: tree structure (e.g., balanced k-d tree, Bkd-tree), coming in Lucene 6.0

    1. Encode lat, lon as a 64-bit integer ("z-curve", or "Morton code")
    2. Start at root, recurse to ideal "leaf" bucket (either bounding box can contain point, or expand box)
    3. Split on longest span dimension (ensure "squareness")
    4. Write blocks (#children == M, or on flush when M/2 < #children < M)
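
The "split on longest span dimension" step can be illustrated with a small in-memory k-d tree build in Python. This is only a sketch of the splitting rule: it bulk-builds from a list instead of inserting point by point, skips the Morton encoding, and writes no blocks to disk. The `build` function and the dict layout are hypothetical.

```python
def build(points, max_leaf=2):
    """Build a k-d tree from a non-empty list of (doc_id, lat, lon) tuples."""
    lats = [p[1] for p in points]
    lons = [p[2] for p in points]
    box = (min(lats), min(lons), max(lats), max(lons))
    if len(points) <= max_leaf:
        return {"box": box, "docs": [p[0] for p in points]}
    # Split on the dimension with the longest span to keep cells roughly square.
    lat_span = box[2] - box[0]
    lon_span = box[3] - box[1]
    dim = 1 if lat_span >= lon_span else 2
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    return {
        "box": box,
        "left": build(ordered[:mid], max_leaf),
        "right": build(ordered[mid:], max_leaf),
    }
```

Splitting at the median keeps the tree balanced, which is the property the "balanced k-d tree" label on the slide refers to.
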
  13. geo Indexing: shapes

    • Dimensional shapes represented using Minimum Bounding Ranges (MBR)
      ‒ Ranges (1D)
      ‒ Rectangles (2D)
      ‒ Cubes (3D)
      ‒ Hexadecant (4D)
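
A minimum bounding range is just the per-dimension minimum and maximum of a shape's coordinates, so the same code covers the 1D to 4D cases listed above. A short Python sketch, with a hypothetical function name:

```python
def minimum_bounding_range(points):
    """Return (mins, maxs) tuples covering a non-empty list of N-dimensional points."""
    dims = len(points[0])
    mins = tuple(min(p[d] for p in points) for d in range(dims))
    maxs = tuple(max(p[d] for p in points) for d in range(dims))
    return mins, maxs
```

For example, the points (1, 5), (3, 2) and (2, 9) are bounded by the rectangle with corners (1, 2) and (3, 9).
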
  14. geo Indexing: 1D numerics

    [Bar chart, axis 0% to 100%, comparing NumericField and PointField on Index Size and Index Time. Bar labels as extracted: 49%, 49%, 100%, 100%.]
  15. geo_point Search: postings-based GeoPointField, introduced in Lucene 5.4

    1. Divide earth into 64-bit integer quad cells (QuadTree)
    2. Create WITHIN and INTERSECT prefix terms (cells) (WITHIN == lowest res) (INTERSECT == highest res)
    3. Retrieve doc ids from postings list, use DocValues to post filter INTERSECT terms
  16. geo_point Search: postings-based GeoPointField, introduced in Lucene 5.4

    • New Lucene spatial queries (5.4 / ES 2.2)
      ‒ BoundingBox, Distance, DistanceRange, Polygon
    • PRECISION_STEP controls number of query terms (must match with index)
    • TwoPhaseIterator (2.3)
      ‒ Delays boundary confirmation so other queries (filters, conjunctions) can pre-filter
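
The two-phase idea can be sketched in Python as follows. The cheap phase is the set of documents matched by the prefix terms, the expensive phase is an exact check against each document's stored coordinates, and the other clauses of the query narrow the candidates before the expensive check runs. This only mimics the concept; it is not Lucene's `TwoPhaseIterator` API, and all names below are hypothetical.

```python
def two_phase_search(approx_docs, other_filter_docs, doc_values, matches):
    """Confirm approximate geo matches only for docs that pass the other filters.

    approx_docs:       set of doc ids matched by the cheap term-level approximation
    other_filter_docs: set of doc ids matched by the rest of the query
    doc_values:        dict of doc id -> (lat, lon) used for exact confirmation
    matches:           exact predicate taking a (lat, lon) tuple
    Returns (confirmed_doc_ids, number_of_exact_checks).
    """
    confirmed = []
    checks = 0
    for doc_id in sorted(approx_docs & other_filter_docs):
        checks += 1  # the expensive boundary confirmation happens only here
        if matches(doc_values[doc_id]):
            confirmed.append(doc_id)
    return confirmed, checks
```

With four approximate matches and a filter that keeps only two of them, the exact check runs twice instead of four times.
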
  17. geo_point Search: postings-based GeoPointField, introduced in Lucene 5.4

    [Bar chart, axis 0% to 100%, comparing versions 2.1, 2.2 and 2.3 on BoundingBox, Distance, DistanceRange and Polygon queries. Bar labels as extracted: 11%, 21%, 26%, 36%, 51%, 45%, 70%, 82%.]
  18. geo_shape Search: geo_shape field

    • Supports the following geo_shape types
      ‒ Point, MultiPoint
      ‒ LineString, MultiLineString
      ‒ Polygon (with holes), MultiPolygon (with holes)
      ‒ Envelope (box)
    • Shapes are parsed using OGC (SFA) and ISO (19107:2003) standards definitions
    • Supports relational queries
      ‒ INTERSECTS, DISJOINT, WITHIN, CONTAINS
  19. geo_shape Search: INTERSECTS

    1. Recursively traverse query terms
    2. OR the DocIDs from each term's postings list
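
A minimal Python sketch of the OR step, reusing the term-to-postings table from the earlier geo_point indexing slide as sample data. The function name is hypothetical, and the recursive traversal that produces the query terms is assumed to have already happened.

```python
POSTINGS = {
    "1": [1, 2, 3, 4, 5], "10": [1, 2, 4], "11": [3, 5],
    "100": [1], "101": [2, 4], "111": [3, 5],
    "1000": [1], "1010": [4], "1011": [2], "1110": [3], "1111": [5],
}


def intersects(postings, query_terms):
    """Union (OR) the doc ids found in the postings list of each query term."""
    docs = set()
    for term in query_terms:
        docs |= set(postings.get(term, ()))
    return sorted(docs)
```

A document matches as soon as it shares any one cell with the query shape, which is why a union is enough for INTERSECTS.
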
  20. geo_shape Search: WITHIN

    1. Buffer shape by PERCENT_DISTANCE (provides EXCLUDE terms outside the shape perimeter)
    2. Traverse query terms of the buffered shape
    3. Compute relation between term and unbuffered query shape
    4. Accept docs whose terms INTERSECT / WITHIN but MUST_NOT contain DISJOINT terms
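
The accept/exclude logic of the last step can be sketched in Python with set operations. The sketch assumes the earlier steps have already split the buffered shape's terms into two groups: terms that relate to the unbuffered query shape as INTERSECT or WITHIN, and terms that are DISJOINT from it. The term names and the function are hypothetical.

```python
def within(postings, inside_terms, exclude_terms):
    """Docs with a term inside the query shape and no term in the outer buffer ring."""
    candidates = set()
    for term in inside_terms:
        candidates |= set(postings.get(term, ()))
    excluded = set()
    for term in exclude_terms:
        excluded |= set(postings.get(term, ()))
    # MUST_NOT: drop any doc that also has a term outside the query shape.
    return sorted(candidates - excluded)
```

A document whose shape pokes outside the query shape is indexed under at least one buffer-ring term, so it is removed even though it also has terms inside.
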
  21. geo_shape Search: WITHIN

    • PERCENT_DISTANCE
      ‒ -1, traverses entire map. Costly, but more accurate
      ‒ >0, buffered shape. Faster at the cost of accuracy.
      ‒ ES 2.3 uses distance_error_pct or 2.5% if set to 0.
  22. geo_shape Search: CONTAINS (new in 2.2)

    1. Recursively traverse query terms
    2. AND the DocIDs from each term's postings list
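
For CONTAINS the postings lists are combined with AND instead of OR: an indexed shape contains the query shape only if it covers every query cell. A Python sketch with hypothetical cell names and sample data:

```python
def contains(postings, query_terms):
    """Intersect (AND) the doc ids found in the postings list of each query term."""
    result = None
    for term in query_terms:
        docs = set(postings.get(term, ()))
        result = docs if result is None else result & docs
    return sorted(result or ())
```

With postings `{"A": [1, 2], "B": [1, 3], "C": [1, 2, 3]}`, only document 1 covers both cell A and cell B. A query term with no postings empties the result, since no indexed shape covers that cell.
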
  23. geo Search: tree structure (e.g., balanced k-d tree, Bkd-tree), coming in Lucene 6.0

    1. Begin with root node bounding box
    2. RELATE with search criteria (WITHIN, CONTAIN, INTERSECT)
    3. DFS traverse (until Internal Node WITHIN or Leaf Node WITHIN or INTERSECTS)
    4. Collect doc IDs (WITHIN == ALL, INTERSECTS == post filter)
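
The traversal can be sketched in Python for a bounding-box query over a tiny hand-built tree. The node layout (dicts with `box`, `left`, `right`, and `points` on leaves) is an assumption of this sketch and not Lucene's on-disk format. Boxes are (min_lat, min_lon, max_lat, max_lon).

```python
def relate(box, query):
    """Relation of a node's bounding box to the query box."""
    b0, b1, b2, b3 = box
    q0, q1, q2, q3 = query
    if b2 < q0 or b0 > q2 or b3 < q1 or b1 > q3:
        return "DISJOINT"
    if q0 <= b0 and b2 <= q2 and q1 <= b1 and b3 <= q3:
        return "WITHIN"
    return "INTERSECTS"


def all_docs(node):
    """Collect every doc id below a node without checking individual points."""
    if "points" in node:
        return [doc_id for doc_id, _, _ in node["points"]]
    return all_docs(node["left"]) + all_docs(node["right"])


def search(node, query):
    """Depth-first search: WITHIN collects all docs, INTERSECTS post-filters leaves."""
    relation = relate(node["box"], query)
    if relation == "DISJOINT":
        return []
    if relation == "WITHIN":
        return all_docs(node)
    if "points" in node:
        # Leaf crossing the query boundary: check each point individually.
        return [doc_id for doc_id, lat, lon in node["points"]
                if relate((lat, lon, lat, lon), query) == "WITHIN"]
    return search(node["left"], query) + search(node["right"], query)


TREE = {
    "box": (0, 0, 10, 10),
    "left": {"box": (0, 0, 4, 4), "points": [(1, 1, 1), (2, 4, 4)]},
    "right": {"box": (6, 6, 10, 10), "points": [(3, 6, 6), (4, 10, 10)]},
}
```

For the query box (0, 0, 7, 7), the left leaf is fully within the query and contributes both of its documents unchecked, while the right leaf crosses the boundary and only document 3 survives the per-point filter.
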
  24. geo Search: tree structure for points (e.g., balanced k-d tree, Bkd-tree), coming in Lucene 6.0

    1. Leaf cell is fully within polygon (salmon): return all docs
    2. Leaf cell crosses the boundary (gray): two-phase check
  25. geo Search: tree structure for shapes (e.g., R* Tree), coming in Lucene 6.x/?

    [Diagram: R*-tree search over shapes labeled a to p.]
  26. geo Search: tree structure for shapes (e.g., R* Tree), coming in Lucene 6.x/?

    [Diagram: R*-tree search over shapes labeled a to p, continued.]
  27. geo Search: tree structure for shapes (e.g., R* Tree), coming in Lucene 6.x/?

    [Figure. Source: Wikipedia.]
  28. geo Search: 1D numerics

    [Bar chart, axis 0% to 100%, comparing NumericField and PointField on Search Time and Heap Usage. Bar labels as extracted: 15%, 76%, 100%, 100%.]
  29. GeoDistance Agg

    {
      "aggs" : {
        "sf_rings" : {
          "geo_distance" : {
            "field" : "location",
            "origin" : [-96.82, 32.95],
            "ranges" : [
              { "to" : 50 },
              { "from" : 50, "to" : 100 },
              { "from" : 100, "to" : 300 }
            ]
          }
        }
      }
    }
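
What this aggregation computes can be sketched in Python: measure the distance from the origin to each point and count the points that fall in each ring. The sketch assumes distances in meters, a haversine distance on a sphere with the mean Earth radius, and ranges where `from` is inclusive and `to` is exclusive. The function names are hypothetical, and the origin is passed as a (lat, lon) tuple.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (assumption of this sketch)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def distance_buckets(origin, points, ranges):
    """Count points per (start, end) range; None means the range is open on that side."""
    counts = [0] * len(ranges)
    for lat, lon in points:
        distance = haversine_m(origin[0], origin[1], lat, lon)
        for i, (start, end) in enumerate(ranges):
            if (start is None or distance >= start) and (end is None or distance < end):
                counts[i] += 1
    return counts
```

The slide's three rings correspond to `ranges=[(None, 50), (50, 100), (100, 300)]`. Points farther than 300 m from the origin fall in no bucket.
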
  30. GeoGrid Agg

    {
      "aggs" : {
        "crime_cells" : {
          "geohash_grid" : {
            "field" : "location",
            "precision" : 8
          }
        }
      }
    }
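
The grid cells in this aggregation are geohashes, and `precision` is the length of the geohash string. A short Python sketch of the standard geohash encoding shows how a point maps to its cell key; the function name is hypothetical and this is not Elasticsearch's own code.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat, lon, precision=8):
    """Encode a point as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    value = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            value = value << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            value = 0
            bit_count = 0
    return "".join(chars)
```

Every document whose location encodes to the same 8-character string lands in the same bucket, and shortening the string yields the enclosing coarser cell.
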
  31. GeoCentroid Agg

    {
      "query" : {
        "match" : { "crime" : "burglary" }
      },
      "aggs" : {
        "towns" : {
          "terms" : { "field" : "town" },
          "aggs" : {
            "centroid" : {
              "geo_centroid" : { "field" : "location" }
            }
          }
        }
      }
    }
  32. Please attribute Elastic with a link to elastic.co

    Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/. Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.