Slide 1

Slide 1 text

How Elasticsearch is SPARKing Our Geospatial Analysis: An Esri Story
Adam Mollenkopf, Real-Time GIS Capability Lead, Esri
@amollenkopf

Slide 2

Slide 2 text

Esri
Geographic Information System (GIS)
• Environmental Systems Research Institute (ESRI) was founded in 1969.
• Esri develops GIS software.
• Global company with over 350,000 user organizations worldwide.
Headquarters in Redlands, CA
80 Esri distributors worldwide

Slide 3

Slide 3 text

How Elasticsearch is SPARKing our Geospatial Analysis
agenda
• Use Cases
• Real-Time Ingestion
• Streaming Analytics
• Storage & Search
• Visualization
• Batch Analytics

Slide 4

Slide 4 text

Spatiotemporal Observation Data
data-in-motion use cases
[Diagram: Ingestion, Streaming Analytics, Batch Analytics, Spatiotemporal Storage & Search, Visualization (Desktop, Web, Device)]
• Moving Objects:
  - Aircraft, Drones, Trucks, Cars, Railways, Vessels, People, …
• Sensor Networks:
  - Weather Stations, Road Traffic, Gas & Electric Utility Networks, Environmental Sensors, …

Slide 5

Slide 5 text

TODO: INSERT “02” VIDEO HERE

Slide 6

Slide 6 text

Ingestion
of high velocity spatiotemporal data

Slide 7

Slide 7 text

Ingestion
of high velocity spatiotemporal data
• Requirements:
  - Sustain a single-node ingestion throughput of at least tens of thousands of events per second.
  - Achieve near-linear scalability of throughput when adding nodes.
  - Gracefully handle bursty data.
[Diagram: spatiotemporal observation data, Ingestion]

Slide 8

Slide 8 text

Apache Kafka
publish-subscribe messaging rethought as a distributed commit log
• Fast - a single broker can handle hundreds of MBs of reads and writes per second.
• Scalable - data streams are partitioned and spread over a cluster of machines.
• Durable - messages are persisted to disk and replicated within the cluster.
• Distributed - cluster-centric design that offers strong durability and fault-tolerance guarantees.
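The "Scalable" point can be pictured with a small sketch of key-based partitioning: events that share a key always land in the same partition, while different keys spread across partitions. This is plain Python for illustration only, not Kafka client code. The `partition_for` helper, the vessel ids, and the CRC32 hash are assumptions; Kafka's own partitioner uses a different hash function.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index (illustrative stand-in hash)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # CRC32 is deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical vessel ids used as message keys.
keys = [f"vessel-{i}" for i in range(1000)]

# Number of keys assigned to each of 4 partitions.
load = Counter(partition_for(k, 4) for k in keys)
```

Keying by a track or vessel id keeps each moving object's observations in order within one partition, which is one common reason to choose such a key.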

Slide 9

Slide 9 text

Apache Spark
a fast and general engine for large-scale data processing
• Unified big data processing:
  - write streaming jobs the same way you write batch jobs.
  - can combine streaming with batch and interactive queries.
• Spark apps can be written in Java, Scala, Python, and R.

Slide 10

Slide 10 text

Ingestion: 1 node benchmark
of high velocity spatiotemporal data
c4.2xlarge (Windows 2012 Server R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS
Ingestion                   1 node
Spark Streaming w/ Kafka    132k

Slide 11

Slide 11 text

Ingestion: 2 node benchmark
of high velocity spatiotemporal data
c4.2xlarge (Windows 2012 Server R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS
Ingestion                   1 node    2 node
Spark Streaming w/ Kafka    132k      282k

Slide 12

Slide 12 text

Streaming Analytics
on high velocity & volume spatiotemporal data

Slide 13

Slide 13 text

Streaming Analytics
of high velocity & volume spatiotemporal data
• Configure the flow of events,
  - the filtering and analytic steps to perform,
  - what ingestion stream(s) to apply them to,
  - and where to send the results.
[Diagram: spatiotemporal observation data, Ingestion, Streaming Analytics]

Slide 14

Slide 14 text

Streaming Analytics
of high velocity & volume spatiotemporal data
• Configure the flow of events,
  - the filtering and analytic steps to perform,
  - what ingestion stream(s) to apply them to,
  - and where to send the results.
KafkaUtils.createStream(ssc, …)
  .map( event => SlidingTimeWindow.tumble(event, …) )
  .map( event => Aggregator.spatialAggregation(event, …) )
  .map( event => MapService.density(event, …) )
  .saveToEs(…)
=> DAG (Directed Acyclic Graph)
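The arguments to `SlidingTimeWindow.tumble` are elided on the slide, so its exact behavior is not shown. As a rough illustration of what a tumbling (fixed-width, non-overlapping) time window does, here is a minimal plain-Python sketch. The `tumble` function, the event tuple layout, and the 60-second width are assumptions made for illustration; this is neither Esri's implementation nor Spark API code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event layout: (track_id, lat, lon, epoch_seconds)
Event = Tuple[str, float, float, int]


def tumble(events: Iterable[Event], width_s: int) -> Dict[int, List[Event]]:
    """Group events into non-overlapping windows keyed by window start time."""
    if width_s <= 0:
        raise ValueError("width_s must be positive")
    windows: Dict[int, List[Event]] = defaultdict(list)
    for ev in events:
        ts = ev[3]
        # Round the timestamp down to the start of its window.
        windows[ts - ts % width_s].append(ev)
    return dict(windows)


events = [
    ("a", 51.95, 4.14, 100),
    ("b", 51.96, 4.15, 159),
    ("a", 51.95, 4.16, 160),
]

# With 60-second windows, the first event falls in the window starting at 60
# and the other two in the window starting at 120.
windows = tumble(events, 60)
```

Each window can then be passed to the next step of the flow, such as a spatial aggregation, before the result is written out.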

Slide 15

Slide 15 text

Streaming Analytics
of high velocity & volume spatiotemporal data
• Run continuous analytics on high velocity spatiotemporal data-in-motion.
Spatial Aggregation with a Sliding Time Window: 30 meter cells
Spatial Aggregation: 200 meter cells
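To make the effect of cell size concrete, the sketch below counts points per square cell at the two sizes named on the slide. It assumes the points are already in a planar coordinate system measured in meters (a simplification; latitude/longitude data would need to be projected first), and the `aggregate` helper and the sample points are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def aggregate(points_m: Iterable[Tuple[float, float]], cell_size_m: float) -> Counter:
    """Count points per square cell; cells are indexed by (column, row)."""
    if cell_size_m <= 0:
        raise ValueError("cell_size_m must be positive")
    return Counter(
        (math.floor(x / cell_size_m), math.floor(y / cell_size_m))
        for x, y in points_m
    )


# Hypothetical observations as (x, y) offsets in meters from a local origin.
points = [(5.0, 5.0), (25.0, 10.0), (45.0, 10.0), (150.0, 190.0), (210.0, 10.0)]

fine = aggregate(points, 30)     # 30 meter cells: four occupied cells
coarse = aggregate(points, 200)  # 200 meter cells: two occupied cells
```

Smaller cells preserve more spatial detail at the cost of more cells to store and render; larger cells summarize more aggressively.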

Slide 16

Slide 16 text

GIS Tools for Hadoop
http://esri.github.io/gis-tools-for-hadoop/
• Esri Geometry API for Java:
  - Geometry objects: points, lines, polygons.
  - Spatial relations: intersects, touches, overlaps, …
  - Spatial operations: buffer, cut, union, …
• Spatial Framework for Hadoop
  - Includes Spatial UDFs (User Defined Functions).
• GeoProcessing Tools for Hadoop
Ch. 8 Geospatial & Temporal Data Analysis
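For readers new to spatial relations, the following sketch shows the idea behind one such test, whether a point lies inside a polygon, using the textbook ray-casting method. It is plain Python written for illustration and does not use or mirror the Esri Geometry API; the `point_in_polygon` name and the sample square are assumptions. Points exactly on the boundary are not handled specially.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(pt: Point, ring: List[Point]) -> bool:
    """Ray casting: count how many polygon edges a ray to the right of pt crosses."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical polygon: a 4 x 4 square with one corner at the origin.
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
```

A production system would rely on a geometry library for this, since robust handling of boundaries, holes, and multi-part shapes is considerably more involved.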

Slide 17

Slide 17 text

Storage & Search
of high volume spatiotemporal data

Slide 18

Slide 18 text

Storage & Search
of high volume spatiotemporal data
• Requirements:
  - Sustain a single-node write throughput of at least tens of thousands of events per second.
  - Achieve growth in volume capacity & write throughput when adding nodes.
[Diagram: Ingestion, Streaming Analytics, Spatiotemporal Storage & Search]

Slide 19

Slide 19 text

Elasticsearch
search & analyze data in real time
• Distributed, scalable, and highly available.
• Simple, yet sophisticated, RESTful API.
• Real-time full-text search, structured search, and analytic capabilities.
• Has the ability to easily combine Geolocation with search and analytic capabilities.
• Spark Elasticsearch Connector:
  - https://github.com/elastic/elasticsearch-hadoop (org.elasticsearch.spark.rdd.EsSpark)
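As a small illustration of the RESTful API, the sketch below builds the newline-delimited JSON body that the Elasticsearch bulk endpoint expects: one action line followed by one document line per record, ending with a newline. It only constructs the request body and sends nothing. The `bulk_body` helper, the `observations` index name, and the document fields are hypothetical, and some Elasticsearch versions also require a `_type` in the action line. The talk itself writes through the Spark connector rather than hand-built requests.

```python
import json
from typing import Dict, Iterable


def bulk_body(index: str, docs: Iterable[Dict]) -> str:
    """Build a bulk request body: an action line plus a source line per document."""
    lines = []
    for doc in docs:
        # Action metadata: index this document under its own id.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        # Document source; "location" uses the lat/lon object form of a geo point.
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical observations.
docs = [
    {"id": "v1-100", "track_id": "v1", "timestamp": 100,
     "location": {"lat": 51.95, "lon": 4.14}},
    {"id": "v1-160", "track_id": "v1", "timestamp": 160,
     "location": {"lat": 51.95, "lon": 4.16}},
]
body = bulk_body("observations", docs)
```

Batching many documents into one request like this is a common way to reach high write throughput, because it amortizes the per-request overhead.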

Slide 20

Slide 20 text

Storage & Search: 5 Node Elasticsearch Cluster Write Throughput
of high volume spatiotemporal data
c4.2xlarge (Windows 2012 Server R2): 8 vCPU, 15 GiB, 100GB SSD, 1,000 Mbps EBS
Ingest           1 node    2 node
Spark + Kafka    132k      282k
Storage          1 node    2 node    3 node    4 node    5 node
{es}             106k      143k      192k      224k      249k
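One way to read these figures against the scalability requirements is to compute scaling efficiency: measured throughput divided by what perfect linear scaling from the single-node figure would give. The sketch below does that arithmetic on the numbers from the slide, taken at face value (the slide reports them as bare counts such as 132k). The `scaling_efficiency` helper is an assumption made for illustration.

```python
from typing import Dict


def scaling_efficiency(throughput_by_nodes: Dict[int, float]) -> Dict[int, float]:
    """Ratio of measured throughput to ideal linear scaling from the 1-node figure."""
    base = throughput_by_nodes[1]
    return {n: t / (n * base) for n, t in sorted(throughput_by_nodes.items())}


# Figures as reported on the slide.
ingest = {1: 132_000, 2: 282_000}
storage = {1: 106_000, 2: 143_000, 3: 192_000, 4: 224_000, 5: 249_000}

ingest_eff = scaling_efficiency(ingest)    # 2 nodes: about 1.07
storage_eff = scaling_efficiency(storage)  # 5 nodes: about 0.47
```

On these numbers, ingestion slightly more than doubles from one node to two, while storage write throughput rises with every added node but less than proportionally.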

Slide 21

Slide 21 text

Searching high volume spatiotemporal data
• Efficiently access and search a large volume of spatiotemporal data.
  - Query by any combination of id, time, space, and attributes.
• Elasticsearch has the ability to easily combine Geolocation with structured & full-text search.
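A query "by any combination of id, time, space, and attributes" can be sketched as a function that assembles an Elasticsearch query body from whichever criteria are supplied. The sketch uses `term`, `range`, and `geo_bounding_box` clauses inside a `bool` filter, as available in Elasticsearch 2.0 and later. The `build_query` helper and the field names `track_id`, `timestamp`, and `location` are hypothetical, and the code only builds the body without contacting a cluster.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_query(
    track_id: Optional[str] = None,
    time_range: Optional[Tuple[int, int]] = None,
    bbox: Optional[Tuple[float, float, float, float]] = None,
    attributes: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Combine optional id, time, space, and attribute filters into one query body."""
    filters: List[Dict[str, Any]] = []
    if track_id is not None:
        filters.append({"term": {"track_id": track_id}})
    if time_range is not None:
        start, end = time_range
        filters.append({"range": {"timestamp": {"gte": start, "lt": end}}})
    if bbox is not None:
        top, left, bottom, right = bbox  # bounding box corners in degrees
        filters.append({"geo_bounding_box": {"location": {
            "top_left": {"lat": top, "lon": left},
            "bottom_right": {"lat": bottom, "lon": right},
        }}})
    for field, value in (attributes or {}).items():
        filters.append({"term": {field: value}})
    if not filters:
        return {"query": {"match_all": {}}}
    return {"query": {"bool": {"filter": filters}}}


# Hypothetical search: one vessel inside a bounding box.
q = build_query(track_id="v1", bbox=(52.0, 4.0, 51.9, 4.2))
```

Because every criterion becomes an independent filter clause, any subset of them can be combined without changing the shape of the request.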

Slide 22

Slide 22 text

Searching high volume spatiotemporal data
• Geolocation search is made possible via spatial field types:
  - geo_point: a latitude-longitude pair
    - can calculate distance; used for sorting and relevance.
    - can be filtered by geo_bounding_box, geo_distance, or geo_distance_range.
    - can be aggregated into a grid to display on a map; uses Geohash.
  - geo_shape: complex shapes including polygon and polyline
    - used purely for filtering; expressed as GeoJSON.
  - For more info see: https://www.elastic.co/guide/en/elasticsearch/guide/current/geoloc.html
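Since grid aggregation relies on Geohash, a short sketch of the public geohash encoding may help: longitude and latitude ranges are halved alternately, each halving yields one bit, and every five bits become one base-32 character. This is a plain-Python illustration of the standard algorithm, not Elasticsearch's internal code, and the `geohash` function name is an assumption.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value, lon_lo = value * 2 + 1, mid
            else:
                value, lon_hi = value * 2, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value, lat_lo = value * 2 + 1, mid
            else:
                value, lat_hi = value * 2, mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # five bits make one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


cell = geohash(57.64911, 10.40744, 5)  # "u4pru"
```

Points that share a geohash prefix fall in the same grid cell, so a shorter prefix gives a coarser grid and a longer one a finer grid.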

Slide 23

Slide 23 text

Visualization
of high velocity & volume spatiotemporal data

Slide 24

Slide 24 text

Visualization
of high velocity & volume spatiotemporal data
[Diagram: Ingestion, Streaming Analytics, Spatiotemporal Storage & Search, Visualization (Desktop, Web, Device)]
• ArcGIS API for JavaScript
  - A lightweight way to embed maps in web apps.
  - Renders any Map or Feature Service compliant source.
    https://www.esri.com/library/whitepapers/pdfs/geoservices-rest-spec.pdf

Slide 25

Slide 25 text

Visualization
of high velocity & volume spatiotemporal data
• Render with the ability to do aggregation
  - Aggregations are calculated at various levels of detail and are specific to each user session.
  - When zoomed in, raw observations are returned and rendered.
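The switch between aggregated and raw rendering can be sketched as a function of map zoom level. Everything specific here is an assumption made for illustration: the `render_plan` name, the zoom threshold of 15, and the rule mapping zoom to a geohash precision. The slides do not describe how the actual service chooses its level of detail.

```python
from typing import Any, Dict


def render_plan(zoom: int, raw_zoom_threshold: int = 15) -> Dict[str, Any]:
    """Decide whether to return raw observations or an aggregated grid."""
    if zoom >= raw_zoom_threshold:
        # Zoomed in far enough: return and render individual observations.
        return {"mode": "raw"}
    # Zoomed out: aggregate into grid cells, finer as the user zooms in.
    # Geohash precision is clamped to the range 1..12.
    precision = max(1, min(12, zoom // 2 + 1))
    return {"mode": "aggregated", "geohash_precision": precision}


world_view = render_plan(4)    # coarse grid
city_view = render_plan(14)    # fine grid
street_view = render_plan(16)  # raw observations
```

Because the plan depends only on the current view, each user session can request its own level of detail independently.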

Slide 26

Slide 26 text

Visualization of high velocity & volume spatiotemporal data

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Batch Analytics
of high velocity & volume spatiotemporal data

Slide 29

Slide 29 text

Batch Analytics
of high volume spatiotemporal data
[Diagram: Ingestion, Streaming Analytics, Batch Analytics, Spatiotemporal Storage & Search, Visualization (Desktop, Web, Device)]

Slide 30

Slide 30 text

Batch Analytics of high volume spatiotemporal data

Slide 31

Slide 31 text

Port of Rotterdam, courtesy of Frank Cremer
vessel and port usage behavioral analytics
• 8th largest port in the world.
• Largest port in Europe.

Slide 32

Slide 32 text

Port of Rotterdam
vessel and port usage behavioral analytics
• Polyline Track Batch Analytic Tool
• Speed Batch Analytic Tool
• Line Crosses Batch Analytic Tool
• Density Batch Analytic Tool
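The slides name these tools without showing how they work. As one plausible reading of a speed analytic, the sketch below derives per-segment speed from a time-ordered track using the haversine great-circle distance. The function names, the track layout, and the spherical Earth radius are assumptions made for illustration, not a description of Esri's tool.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def segment_speeds(track: List[Tuple[float, float, int]]) -> List[float]:
    """Speed in m/s for each consecutive pair of (lat, lon, epoch_seconds) points."""
    speeds = []
    for (la1, lo1, t1), (la2, lo2, t2) in zip(track, track[1:]):
        dt = t2 - t1
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speeds.append(haversine_m(la1, lo1, la2, lo2) / dt)
    return speeds


# Hypothetical track: one degree of latitude covered in 1000 seconds.
speeds = segment_speeds([(0.0, 0.0, 0), (1.0, 0.0, 1000)])
```

The same ordered points, joined in sequence, are what a polyline track would be built from.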

Slide 33

Slide 33 text

Port of Rotterdam
polyline track analytics

Slide 34

Slide 34 text

Port of Rotterdam
polyline track analytics

Slide 35

Slide 35 text

Port of Rotterdam
density analytics

Slide 36

Slide 36 text

Port of Rotterdam
dredging prioritization
[Figure labels: D, d, Δ, (lat, lon)]
Where is Δ ≃ 0?

Slide 37

Slide 37 text

Port of Rotterdam
dredging prioritization

Slide 38

Slide 38 text

How Elasticsearch is SPARKing our Geospatial Analysis
summary
• When working with high velocity & volume spatiotemporal data we have found the best technology selections are as follows:
  - Real-Time Ingestion = Spark Streaming + Kafka.
  - Streaming Analytics = Spark Streaming + GIS Tools for Hadoop.
  - Storage & Search = Elasticsearch + Spark Elasticsearch Connector.
  - Visualization = ArcGIS API for JavaScript.
  - Batch Analytics = Spark Elasticsearch Connector + Spark Core + GIS Tools for Hadoop.
  - GIS Tools for Hadoop
    - Can be used as a basis to add spatial geometries, relations, and operators to Spark.
      http://esri.github.io/gis-tools-for-hadoop/

Slide 39

Slide 39 text

Q & A
Thank you!