Title: Big Geospatial Data with Open-source Tech - Masood Krohy

Big Geospatial Data with Open-source Tech (Vectors, Rasters & Map-matching)
Presenter: Masood Krohy, Ph.D.
June 11, 2019
Copyright © 2019, PatternedScience Inc. www.patterned.science

Presentation Layout

1. Intros
• Presenter bio
• Quick intro to Spark and a tour of Web UIs

2. Operations on vectors with GeoMesa
• Data ingestion and temporal/spatial partitioning
• Demo: sample vector operations in a Zeppelin note

3. Operations on rasters with GeoPySpark
• Configs for a Spark cluster using GeoPySpark
• Code walkthrough of a sample raster operation

4. Large-scale map-matching with Spark and OSRM
• Intro to mtl-trajet dataset and the math behind map-matching
• Demo: map-matching in a Zeppelin note and viewing of results

Presenter Bio

Masood Krohy, Data Science Platform Architect & Advisor

• Ph.D. in Computer Engineering: Analytical modeling of botnets. Validated by data collected in industry. 3 top publications.
• Senior Analyst, Rogers: Managed the analytics reporting and statistical analyses of the national benchmarking program.
• Data Scientist, Intact: First Data Scientist of the company. Led the Big Data mining project for the UBI program.
• Lead Data Scientist, CN: Implemented an object-within-object detection system to detect cracks in railway equipment.
• Sr Data Science Advisor, B.Yond: Implemented a pattern detection system for the stream of alarms coming from telecom devices.
• Chief Architect, UniAnalytica (advanced data science platform): The platform contains Apache Spark, GeoMesa, GeoPySpark and OSRM, among many others.

Timeline markers on the slide: 2013, 2014, 2016, 2017, 2018, 2019

Geospatial/Climate Data Processing Stack (UniAnalytica Platform)

1. Map-matching: OSRM (Open Source Routing Machine)
• Used in combination with Spark for large-scale map-matching

2. Slicing/dicing of GPS/LiDAR data: OmniSci (MapD) GPU Database
• Slicing and dicing of GPS and LiDAR data
• Used along with the GPU DataFrame lib, it gives a much faster alternative to the Pandas DataFrame

3. Vector processing: GeoMesa
• Vector processing/operations
• GeoMesa HDFS DataStore provides temporal and geospatial indexing, and the GeoMesa lib enables Spark to query the DataStore

4. Raster processing: GeoPySpark (GeoTrellis)
• Raster processing/operations
• Workflow to ingest NetCDF data

5. Interactive geospatial analysis: GeoNotebook
• Jupyter notebook on the left side of the screen, interactive map on the right, with operations that get translated from one to the other

6. Processing of point cloud data: PDAL
• Python lib on workers enabling many geospatial operations on point cloud data
• Can be hooked up with Spark to enable geospatial indexing and exploratory analysis of massive LiDAR datasets

GeoMesa FileSystem Data Store
Data ingestion and temporal/spatial partitioning

GDELT: Global Database of Events, Language, and Tone

“GDELT Event Database periodically scans news articles and uses natural language processing to identify the people, locations, organizations, counts, themes, sources, emotions, quotes and events driving our global society every second of every day.”

GDELT data is updated each morning at 6am.

S3 to HDFS with distcp

    hadoop distcp \
      s3a://gdelt-open-data/events/2017010* \
      /tmp/geomesa_source

Ingesting (and partitioning) GDELT

    $ geomesa-fs ingest \
    > --path hdfs:///data/geomesa \
    > --encoding parquet \
    > --partition-scheme daily,z2-2bit \
    > --converter gdelt \
    > --spec gdelt \
    > --num-reducers 60 \
    > hdfs:///tmp/geomesa_source/*
    INFO Schema 'gdelt' exists
    INFO Running ingestion in distributed mode
    INFO Submitting job - please wait...
    INFO Tracking available at http://master1:8088/proxy/application_1542841752250_0003/
    Map (stage 1/2): [============================================================] 100% complete 1285001 mapped 0 failed in 00:00:54
    Reduce (stage 2/2): [============================================================] 100% complete 1285001 written in 00:01:49
    INFO Distributed ingestion complete in 00:02:44
    INFO Ingested 1285001 features with no failures.
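To make the `daily,z2-2bit` partition scheme more concrete, here is a minimal Python sketch of how a feature's timestamp and location could map to a partition key: a day bucket plus a 2-bit Z-order (Z2) cell, which splits the world into four quadrants. This is an illustration only, not GeoMesa's implementation. The function names, the `YYYY/MM/DD/cell` key layout, and the bit order of the interleaving are assumptions made for the example.

```python
from datetime import datetime


def z2_index(lon: float, lat: float, bits: int = 2) -> int:
    """Interleave lon/lat cell indices into a Z-order value using `bits` total bits."""
    if bits <= 0 or bits % 2 != 0:
        raise ValueError("bits must be a positive even number")
    per_dim = bits // 2          # bits available for each dimension
    cells = 1 << per_dim         # number of cells along each axis
    # Normalize to a cell index; clamp so lon=180 / lat=90 fall in the last cell.
    x = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    y = min(int((lat + 90.0) / 180.0 * cells), cells - 1)
    z = 0
    for i in range(per_dim):
        z |= ((x >> i) & 1) << (2 * i)        # longitude in even bits (assumed order)
        z |= ((y >> i) & 1) << (2 * i + 1)    # latitude in odd bits (assumed order)
    return z


def partition_key(ts: datetime, lon: float, lat: float, bits: int = 2) -> str:
    """Hypothetical key layout: one directory per day, then one per Z2 cell."""
    return f"{ts:%Y/%m/%d}/{z2_index(lon, lat, bits)}"


# An event near Montreal on 2017-01-05 lands in the north-west quadrant.
key = partition_key(datetime(2017, 1, 5), lon=-73.57, lat=45.50)
```

With a layout of this kind, a query restricted to a few days and a bounding box only needs to read the files under the matching day and cell partitions, which is the point of combining temporal and spatial partitioning at ingest time.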

Code Walkthrough & Live Demo
Notebook/Script: Zeppelin note: GeoMesa examples

Code Walkthrough
Notebook/Script: Jupyter notebook: geotrellis_geopyspark_quick_example

Map Matching: The math behind what we use

• Hidden Markov Model (HMM) serves to find the most probable state sequence for a given sequence of observations;
• The states of the HMM are the individual road segments and the state measurements/observations are the noisy vehicle location (GPS) measurements;
• After the probabilities of observation and transitions are computed with the HMM, the Viterbi algorithm is used to identify the most probable sequence of states (i.e., street segments);
• This approach can also be applied to other networks, such as bike paths and railroads, and the GPS data collected on those modes of transport;
• References and more information:
  ◦ P. Newson and J. Krumm. Hidden Markov Map Matching Through Noise and Sparseness. In Proceedings of International Conference on Advances in Geographic Information Systems, 2009. Project URL: https://www.microsoft.com/en-us/research/publication/hidden-markov-map-matching-noise-sparseness/
  ◦ OSRM developers' announcement and introduction regarding the map-matching feature: https://www.mapbox.com/blog/map-matching/
  ◦ Barefoot, another map-matching solution, also implements the same algorithm: https://github.com/bmwcarit/barefoot/wiki#hmm-map-matching
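The HMM and Viterbi steps described above can be sketched in a few lines of Python. This is a toy illustration in the spirit of the Newson and Krumm formulation, not OSRM's implementation: the emission probability is a Gaussian in the distance from a GPS point to a road segment, and the transition probability is an exponential in the difference between the straight-line distance and the driving distance. The function names, the `sigma` and `beta` values, and the two-segment example data are all assumptions made for the example.

```python
import math


def emission_logp(dist_m: float, sigma: float = 4.07) -> float:
    """Log-probability of a GPS point lying dist_m metres from a segment (Gaussian)."""
    return -0.5 * (dist_m / sigma) ** 2 - math.log(math.sqrt(2 * math.pi) * sigma)


def transition_logp(gc_dist_m: float, route_dist_m: float, beta: float = 3.0) -> float:
    """Log-probability of a move, penalizing routes much longer than the straight line."""
    return -abs(gc_dist_m - route_dist_m) / beta - math.log(beta)


def viterbi(obs_dists, gc_dists, route_dists):
    """Return the most probable segment index for each GPS point.

    obs_dists[t][s]      distance from GPS point t to candidate segment s
    gc_dists[t]          straight-line distance between points t and t+1
    route_dists[t][i][j] driving distance from segment i (at t) to segment j (at t+1)
    """
    n_states = len(obs_dists[0])
    score = [emission_logp(d) for d in obs_dists[0]]
    back = []  # back[t-1][j] = best predecessor of state j at step t
    for t in range(1, len(obs_dists)):
        new_score, ptr = [], []
        for j in range(n_states):
            cands = [
                score[i] + transition_logp(gc_dists[t - 1], route_dists[t - 1][i][j])
                for i in range(n_states)
            ]
            best_i = max(range(n_states), key=lambda i: cands[i])
            new_score.append(cands[best_i] + emission_logp(obs_dists[t][j]))
            ptr.append(best_i)
        score = new_score
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Two parallel roads (0 and 1); the middle GPS fix is slightly closer to road 1.
obs = [[2.0, 8.0], [6.0, 5.0], [3.0, 9.0]]
gc = [50.0, 50.0]
routes = [[[50.0, 200.0], [200.0, 50.0]]] * 2  # switching roads needs a long detour
matched = viterbi(obs, gc, routes)
nearest = [min(range(2), key=lambda s: row[s]) for row in obs]
```

In this toy case a nearest-segment rule would jump to road 1 for the middle point, while the Viterbi path stays on road 0 because switching roads would require an implausibly long detour between consecutive fixes.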

Code Walkthrough & Live Demo
Notebooks/Scripts:
• Zeppelin note: Perform Map Matching
• Jupyter notebook: mapping_map_matching
• Zeppelin note: Map Matching Perf Test
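For readers without access to the notebooks, the following sketch shows what a single map-matching call to OSRM could look like from Python: it only builds the request URL for OSRM's HTTP match service and does not contact a server. It is not taken from the notebooks. The helper name, the `localhost:5000` address, the `driving` profile, and the sample coordinates are assumptions.

```python
from urllib.parse import urlencode


def osrm_match_url(points, timestamps=None,
                   base_url="http://localhost:5000", profile="driving"):
    """Build a URL for OSRM's match service from (lon, lat) pairs.

    points     list of (lon, lat) tuples for one trip, in time order
    timestamps optional list of UNIX timestamps, one per point
    """
    if len(points) < 2:
        raise ValueError("map-matching needs at least two points")
    # OSRM expects coordinates as lon,lat pairs separated by semicolons.
    coords = ";".join(f"{lon:.6f},{lat:.6f}" for lon, lat in points)
    params = {"geometries": "geojson", "overview": "full"}
    if timestamps is not None:
        if len(timestamps) != len(points):
            raise ValueError("need exactly one timestamp per point")
        params["timestamps"] = ";".join(str(int(t)) for t in timestamps)
    # Keep ';' unescaped so list-valued options stay readable.
    return f"{base_url}/match/v1/{profile}/{coords}?{urlencode(params, safe=';')}"


trip = [(-73.5673, 45.5017), (-73.5680, 45.5020)]  # hypothetical GPS fixes
url = osrm_match_url(trip, timestamps=[1500000000, 1500000010])
```

One plausible way to scale this with Spark is to group GPS points by trip and issue one such request per trip from each worker against a nearby OSRM instance; whether the demo notebooks do exactly this is not stated in the transcript.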


Graph source: Databricks


Map Matching: Introducing MTL-Trajet


Map Matching: Map-matching of


Q&A