How Elasticsearch is SPARKing Our Geospatial Analysis: An Esri Story

Elastic Co
December 01, 2015

This session will explore how to apply geospatial analytics to high-velocity streaming data (data-in-motion) and high-volume batch data (data-at-rest) using Elasticsearch and Apache Spark. Demonstrations will be performed throughout the session to cement these concepts.


Transcript

  1. How Elasticsearch is SPARKing Our Geospatial Analysis: An Esri Story

    Adam Mollenkopf, Real-Time GIS Capability Lead, Esri (@amollenkopf, amollenkopf@esri.com)
  2. Esri Geographic Information System (GIS)

    • Environmental Systems Research Institute (ESRI) was founded in 1969
    • Esri develops GIS software
    • Global company with over 350,000 user organizations worldwide
    • Headquarters in Redlands, CA
    • 80 Esri distributors worldwide
  3. How Elasticsearch is SPARKing our Geospatial Analysis: agenda

    • Use Cases
    • Real-Time Ingestion
    • Streaming Analytics
    • Storage & Search
    • Visualization
    • Batch Analytics
  4. Spatiotemporal Observation Data: data-in-motion use cases

    [Architecture diagram: Ingestion, Streaming Analytics, Spatiotemporal Storage & Search, Batch Analytics, Visualization (Desktop, Web, Device)]
    • Moving Objects:
      - Aircraft, Drones, Trucks, Cars, Railways, Vessels, People, …
    • Sensor Networks:
      - Weather Stations, Road Traffic, Gas & Electric Utility Networks, Environmental Sensors, …
  5. TODO: INSERT “02” VIDEO HERE

  6. Ingestion of high velocity spatiotemporal data

  7. Ingestion of high velocity spatiotemporal data

    [Diagram: spatiotemporal observation data, Ingestion]
    • Requirements:
      - Sustain a single-node ingestion throughput of at least tens of thousands of events per second.
      - Achieve near-linear scalability of throughput when adding nodes.
      - Gracefully handle bursty data.
  8. Apache Kafka: publish-subscribe messaging rethought as a distributed commit log

    • Fast - a single broker can handle hundreds of MBs of reads and writes per second.
    • Scalable - data streams are partitioned and spread over a cluster of machines.
    • Durable - messages are persisted to disk and replicated within the cluster.
    • Distributed - cluster-centric design that offers strong durability and fault-tolerance guarantees.
  9. Apache Spark: a fast and general engine for large-scale data processing

    • Unified big data processing:
      - write streaming jobs the same way you write batch jobs.
      - can combine streaming with batch and interactive queries.
    • Spark apps can be written in Java, Scala, Python, and R.
  10. Ingestion of high velocity spatiotemporal data: 1 node benchmark

    c4.2xlarge (Windows Server 2012 R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS

    Ingestion                   1 node
    Spark Streaming w/ Kafka    132k
  11. Ingestion of high velocity spatiotemporal data: 2 node benchmark

    c4.2xlarge (Windows Server 2012 R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS

    Ingestion                   1 node   2 node
    Spark Streaming w/ Kafka    132k     282k
  12. Streaming Analytics on high velocity & volume spatiotemporal data

  13. Streaming Analytics of high velocity & volume spatiotemporal data

    [Diagram: spatiotemporal observation data, Ingestion, Streaming Analytics]
    • Configure the flow of events,
      - the filtering and analytic steps to perform,
      - what ingestion stream(s) to apply them to,
      - and where to send the results.
  14. Streaming Analytics of high velocity & volume spatiotemporal data

    KafkaUtils.createStream(ssc, …)
      .map( event => SlidingTimeWindow.tumble(event, …) )
      .map( event => Aggregator.spatialAggregation(event, …) )
      .map( event => MapService.density(event, …) )
      .saveToEs(…)
    => DAG (Directed Acyclic Graph)

    • Configure the flow of events,
      - the filtering and analytic steps to perform,
      - what ingestion stream(s) to apply them to,
      - and where to send the results.
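The windowing step in the pipeline above can be illustrated without Spark. The following is a minimal sketch in plain Scala, assuming a hypothetical `Event` case class and a tumbling (fixed-size, non-overlapping) window. The names `Event`, `TumblingWindow`, `windowStart`, and `countPerWindow` are illustrative only; they are not the `SlidingTimeWindow` helper named on the slide, whose implementation the deck does not show.

```scala
// Hypothetical event type; the real event schema is not shown in the deck.
final case class Event(id: String, timeMs: Long, lat: Double, lon: Double)

object TumblingWindow {
  // Start of the fixed-size window that contains timeMs.
  // floorMod keeps the result correct for negative timestamps too.
  def windowStart(timeMs: Long, sizeMs: Long): Long =
    timeMs - Math.floorMod(timeMs, sizeMs)

  // Count events per window; a stand-in for whatever analytic runs per window.
  def countPerWindow(events: Seq[Event], sizeMs: Long): Map[Long, Int] =
    events
      .groupBy(e => windowStart(e.timeMs, sizeMs))
      .map { case (start, inWindow) => start -> inWindow.size }
}
```

In a streaming job the same bucketing function would be applied per event, and the grouping would be done by the engine rather than by an in-memory `groupBy`.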
  15. Streaming Analytics of high velocity & volume spatiotemporal data

    • Run continuous analytics on high velocity spatiotemporal data-in-motion.
    [Figures: Spatial Aggregation with a Sliding Time Window, 30 meter cells; Spatial Aggregation, 200 meter cells]
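Spatial aggregation into fixed-size cells can be sketched as a binning step. This is a minimal plain-Scala illustration, assuming points are projected to Web Mercator meters before binning; the deck does not say which projection or cell scheme is used, so `GridAggregation`, `toMercator`, `cellOf`, and `countPerCell` are hypothetical names and the projection is an assumption.

```scala
object GridAggregation {
  private val EarthRadiusM = 6378137.0 // WGS84 semi-major axis, in meters

  // Project latitude/longitude (degrees) to Web Mercator x/y in meters.
  // Assumption: the deck does not state the projection behind its cells.
  def toMercator(lat: Double, lon: Double): (Double, Double) = {
    val x = EarthRadiusM * math.toRadians(lon)
    val y = EarthRadiusM * math.log(math.tan(math.Pi / 4 + math.toRadians(lat) / 2))
    (x, y)
  }

  // Integer (column, row) index of the square cell containing the point.
  def cellOf(lat: Double, lon: Double, cellSizeM: Double): (Long, Long) = {
    val (x, y) = toMercator(lat, lon)
    (math.floor(x / cellSizeM).toLong, math.floor(y / cellSizeM).toLong)
  }

  // Number of points falling in each cell.
  def countPerCell(points: Seq[(Double, Double)], cellSizeM: Double): Map[(Long, Long), Int] =
    points
      .groupBy { case (lat, lon) => cellOf(lat, lon, cellSizeM) }
      .map { case (cell, inCell) => cell -> inCell.size }
}
```

Calling `countPerCell(points, 30.0)` or `countPerCell(points, 200.0)` corresponds to the two cell sizes named on the slide. Note that Web Mercator meters stretch with latitude, so cells of a given projected size cover less ground distance away from the equator.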
  16. GIS Tools for Hadoop

    http://esri.github.io/gis-tools-for-hadoop/
    • Esri Geometry API for Java:
      - Geometry objects: points, lines, polygons.
      - Spatial relations: intersects, touches, overlaps, …
      - Spatial operations: buffer, cut, union, …
    • Spatial Framework for Hadoop
      - Includes Spatial UDFs (User Defined Functions).
    • GeoProcessing Tools for Hadoop
    Ch. 8 Geospatial & Temporal Data Analysis
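To give a feel for what a spatial relation test does, here is a minimal plain-Scala sketch of a point-in-polygon check using the ray-casting rule. It is a stand-in written for illustration and is not the Esri Geometry API; `PointInPolygon` and `contains` are hypothetical names, and the sketch handles only a single ring without holes.

```scala
object PointInPolygon {
  // ring: polygon vertices as (x, y); the last vertex connects back to the first.
  // Returns true when (px, py) lies strictly inside the ring (ray-casting rule).
  def contains(ring: IndexedSeq[(Double, Double)], px: Double, py: Double): Boolean = {
    var inside = false
    var j = ring.length - 1
    for (i <- ring.indices) {
      val (xi, yi) = ring(i)
      val (xj, yj) = ring(j)
      // Toggle when a horizontal ray cast to the right of the point crosses edge (j, i).
      val crosses = (yi > py) != (yj > py) &&
        px < (xj - xi) * (py - yi) / (yj - yi) + xi
      if (crosses) inside = !inside
      j = i
    }
    inside
  }
}
```

A production library also has to handle holes, multipart geometries, boundary cases, and spatial references, which is the reason to use one rather than a hand-rolled test like this.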
  17. Storage & Search of high volume spatiotemporal data

  18. Storage & Search of high volume spatiotemporal data

    [Diagram: Ingestion, Streaming Analytics, Spatiotemporal Storage & Search]
    • Requirements:
      - Sustain a single-node write throughput of at least tens of thousands of events per second.
      - Achieve growth in volume capacity & write throughput when adding nodes.
  19. Elasticsearch: search & analyze data in real time

    • Distributed, scalable, and highly available.
    • Simple, yet sophisticated, RESTful API.
    • Real-time full-text search, structured search, and analytic capabilities.
    • Has the ability to easily combine Geolocation with search and analytic capabilities.
    • Spark Elasticsearch Connector:
      - https://github.com/elastic/elasticsearch-hadoop (org.elasticsearch.spark.rdd.EsSpark)
  20. Storage & Search of high volume spatiotemporal data: 5 Node Elasticsearch Cluster Write Throughput

    c4.2xlarge (Windows Server 2012 R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS

    Ingest           1 node   2 node
    Spark + Kafka    132k     282k

    Storage          1 node   2 node   3 node   4 node   5 node
    Elasticsearch    106k     143k     192k     224k     249k
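The benchmark figures above can be turned into speedup and scaling-efficiency numbers with a few lines of arithmetic. This is a small sketch in plain Scala that only reuses the numbers from the slides; `Scaling`, `speedup`, and `efficiency` are illustrative names, and "efficiency" here simply means speedup divided by node count.

```scala
object Scaling {
  // Throughput figures from the slides, in thousands
  // (the deck's requirements are stated in events per second).
  val esWrite: Vector[Double] = Vector(106.0, 143.0, 192.0, 224.0, 249.0)
  val ingest: Vector[Double] = Vector(132.0, 282.0)

  // Throughput relative to the single-node figure.
  def speedup(throughput: Vector[Double]): Vector[Double] =
    throughput.map(_ / throughput.head)

  // Speedup divided by node count; 1.0 would be perfectly linear scaling.
  def efficiency(throughput: Vector[Double]): Vector[Double] =
    speedup(throughput).zipWithIndex.map { case (s, i) => s / (i + 1) }
}
```

By this measure, two-node ingestion reaches about 2.14 times the single-node figure (282 / 132), while the five-node Elasticsearch cluster writes about 2.35 times what one node does (249 / 106), or roughly 47% of linear scaling.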
  21. Searching high volume spatiotemporal data

    • Efficiently access and search a large volume of spatiotemporal data.
      - Query by any combination of id, time, space, and attributes.
    • Elasticsearch has the ability to easily combine Geolocation with structured & full-text search.
  22. Searching high volume spatiotemporal data

    • Geolocation search is made possible via spatial field types:
      - geo_point: a latitude-longitude pair
        - can calculate distance; used for sorting and relevance.
        - can be filtered by geo_bounding_box, geo_distance, or geo_distance_range.
        - can be aggregated into a grid to display on a map; uses Geohash.
      - geo_shape: complex shapes including polygon and polyline
        - used purely for filtering; expressed as GeoJSON.
      - For more info see: https://www.elastic.co/guide/en/elasticsearch/guide/current/geoloc.html
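Since grid aggregation of geo_point values relies on Geohash, it may help to see how a geohash is derived from a coordinate. The following is a minimal sketch of the standard public geohash encoding in plain Scala; it is not Elasticsearch's internal code, and `Geohash.encode` is an illustrative name.

```scala
object Geohash {
  private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  // Encode a latitude/longitude pair (degrees) as a geohash of `precision` characters.
  def encode(lat: Double, lon: Double, precision: Int): String = {
    var latLo = -90.0
    var latHi = 90.0
    var lonLo = -180.0
    var lonHi = 180.0
    val sb = new StringBuilder
    var bits = 0       // bits collected for the current character
    var value = 0      // value of those bits
    var evenBit = true // even bits halve the longitude range, odd bits the latitude range
    while (sb.length < precision) {
      if (evenBit) {
        val mid = (lonLo + lonHi) / 2
        if (lon >= mid) { value = value * 2 + 1; lonLo = mid }
        else { value = value * 2; lonHi = mid }
      } else {
        val mid = (latLo + latHi) / 2
        if (lat >= mid) { value = value * 2 + 1; latLo = mid }
        else { value = value * 2; latHi = mid }
      }
      evenBit = !evenBit
      bits += 1
      if (bits == 5) { // every 5 bits become one base-32 character
        sb.append(Base32.charAt(value))
        bits = 0
        value = 0
      }
    }
    sb.toString
  }
}
```

Each additional character narrows the cell, so points that share a geohash prefix fall in the same grid cell at that precision. That property is what makes the hash usable as a bucket key for map display.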
  23. Visualization of high velocity & volume spatiotemporal data

  24. Visualization of high velocity & volume spatiotemporal data

    [Architecture diagram: Ingestion, Streaming Analytics, Spatiotemporal Storage & Search, Visualization (Desktop, Web, Device)]
    • ArcGIS API for JavaScript
      - A lightweight way to embed maps in web apps.
      - Renders any Map or Feature Service compliant source.
        https://www.esri.com/library/whitepapers/pdfs/geoservices-rest-spec.pdf
  25. Visualization of high velocity & volume spatiotemporal data

    • Render with the ability to do aggregation
      - Aggregations are calculated at various levels of detail and are specific to each user session.
      - When zoomed in, raw observations are returned and rendered.
  26. Visualization of high velocity & volume spatiotemporal data

  27. None
  28. Batch Analytics of high velocity & volume spatiotemporal data

  29. Batch Analytics of high volume spatiotemporal data

    [Architecture diagram: Ingestion, Streaming Analytics, Spatiotemporal Storage & Search, Batch Analytics, Visualization (Desktop, Web, Device)]
  30. Batch Analytics of high volume spatiotemporal data

  31. Port of Rotterdam: vessel and port usage behavioral analytics

    [Image: Port of Rotterdam, courtesy of Frank Cremer]
    • 8th largest port in the world.
    • Largest port in Europe.
  32. Port of Rotterdam: vessel and port usage behavioral analytics

    • Polyline Track Batch Analytic Tool
    • Speed Batch Analytic Tool
    • Line Crosses Batch Analytic Tool
    • Density Batch Analytic Tool
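The deck does not show how these tools are implemented. As an illustration of the kind of computation a speed analytic over a vessel track involves, here is a minimal plain-Scala sketch that derives per-segment speed from consecutive position fixes using the haversine great-circle distance. `Fix`, `TrackSpeed`, `haversineM`, and `segmentSpeeds` are hypothetical names, and the use of haversine with a mean Earth radius is an assumption.

```scala
// Hypothetical position fix for one vessel; the real track schema is not shown in the deck.
final case class Fix(timeMs: Long, lat: Double, lon: Double)

object TrackSpeed {
  private val EarthRadiusM = 6371008.8 // mean Earth radius, in meters

  // Great-circle distance in meters between two lat/lon points given in degrees.
  def haversineM(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * EarthRadiusM * math.asin(math.sqrt(a))
  }

  // Speed in meters per second for each consecutive pair of fixes, ordered by time.
  // Pairs with identical timestamps are skipped to avoid dividing by zero.
  def segmentSpeeds(track: Seq[Fix]): Seq[Double] =
    track.sortBy(_.timeMs).sliding(2).collect {
      case Seq(a, b) if b.timeMs > a.timeMs =>
        haversineM(a.lat, a.lon, b.lat, b.lon) / ((b.timeMs - a.timeMs) / 1000.0)
    }.toSeq
}
```

The ordered fixes are also the vertices of the vessel's polyline track, so the same sorted sequence could feed a track-building step.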
  33. Port of Rotterdam polyline track analytics

  34. Port of Rotterdam polyline track analytics

  35. Port of Rotterdam density analytics

  36. Port of Rotterdam dredging prioritization

    [Figure labels: D, d, Δ, (lat, lon)] Where is Δ ≃ 0?
  37. Port of Rotterdam dredging prioritization

  38. How Elasticsearch is SPARKing our Geospatial Analysis: summary

    • When working with high velocity & volume spatiotemporal data we have found the best technology selections are as follows:
      - Real-Time Ingestion = Spark Streaming + Kafka.
      - Streaming Analytics = Spark Streaming + GIS Tools for Hadoop.
      - Storage & Search = Elasticsearch + Spark Elasticsearch Connector.
      - Visualization = ArcGIS API for JavaScript.
      - Batch Analytics = Spark Elasticsearch Connector + Spark Core + GIS Tools for Hadoop.
    • GIS Tools for Hadoop
      - Can be used as a basis to add spatial geometries, relations, and operators to Spark.
        http://esri.github.io/gis-tools-for-hadoop/
  39. Q & A

    Thank you!