
Title: Big Geospatial Data with Open-source Tech - Masood Krohy

Masood Krohy at the June 11, 2019 event of EDPP Montreal (montrealml.dev/data)

Title: Big Geospatial Data with Open-source Tech (Vectors, Rasters & Map-matching)

Presentation/Demo video: check out PatternedScience's YouTube channel at https://www.youtube.com/channel/UCjbEIZlS2DA45Bswi5EXWRw

Summary: Geospatial datasets (i.e., geocoded data points) are everywhere nowadays and often add enormous value to data analytics/mining and machine learning projects. In this era of Big Data, libraries and engines such as GeoPandas and PostGIS, and their commercial equivalents, often cannot scale up far enough to let us tap into the Big Data being collected by many organizations across many use cases. In this talk/demo, we explore free, open-source, Big Data-ready technologies and workflows such as GeoMesa, GeoPySpark and OSRM-on-Spark, and show how to use these Apache Spark-based workflows for key geospatial operations and use cases. We start by introducing GeoMesa and demoing how it can be used to ingest Big Geospatial Data and perform operations on vectors. Next, we briefly introduce GeoPySpark, the Python interface to GeoTrellis, for performing operations on rasters. Finally, we turn to map-matching, the process of associating geocoded data points with named elements of an underlying network (e.g., determining which street a particular GPS point belongs to). We describe and demo how OSRM can be combined with Spark to do scalable map-matching on Big Data, which opens up many possibilities for advanced data mining and machine learning projects.

Bio: Masood Krohy is a Data Science Platform Architect/Advisor and most recently acted as the Chief Architect of UniAnalytica, an advanced data science platform with wide, out-of-the-box support for time-series and geospatial use cases. He has worked with several corporations in different industries in the past few years to design, implement and productionize Deep Learning and Big Data products. He holds a Ph.D. in computer engineering.

PatternedScience

June 11, 2019

Transcript

  1. Big Geospatial Data with Open-source Tech (Vectors, Rasters & Map-matching)

    Presenter: Masood Krohy, Ph.D.
    June 11, 2019
    Copyright © 2019, PatternedScience Inc. www.patterned.science
  2. Presentation Layout

    Section 1: Intros
    • Presenter bio
    • Quick intro to Spark and a tour of Web UIs
    Section 2: Operations on vectors with GeoMesa
    • Data ingestion and temporal/spatial partitioning
    • Demo: sample vector operations in a Zeppelin note
    Section 3: Operations on rasters with GeoPySpark
    • Configs for a Spark cluster using GeoPySpark
    • Code walkthrough of a sample raster operation
    Section 4: Large-scale map-matching with Spark and OSRM
    • Intro to mtl-trajet dataset and the math behind map-matching
    • Demo: map-matching in a Zeppelin note and viewing of results
  3. Presenter Bio: Masood Krohy, Data Science Platform Architect & Advisor

    • Ph.D. in Computer Engineering: Analytical modeling of botnets. Validated by data collected in industry. 3 top publications.
    • Senior Analyst, Rogers: Managing the analytics reporting/statistical analyses of the national benchmarking program.
    • Data Scientist, Intact: First Data Scientist of the company. Led the Big Data mining project for the UBI program.
    • Lead Data Scientist, CN: Implemented an object-within-object detection system to detect cracks in railway equipment.
    • Sr Data Science Advisor, B.Yond: Implemented a pattern detection system for stream of alarms coming from telecom devices.
    • Chief Architect, UniAnalytica (advanced data science platform): Platform contains Apache Spark, GeoMesa, GeoPySpark and OSRM, among many others.
    • Timeline: 2013, 2014, 2016, 2017, 2018, 2019
  4. Geospatial/Climate Data Processing Stack (UniAnalytica Platform)

    1. Map-matching: OSRM (Open Source Routing Machine), in combination with Spark for large-scale map-matching.
    2. Slicing/dicing of GPS/LiDAR data: OmniSci (MapD) GPU database. Used along with the GPU DataFrame lib, it gives a much faster alternative to the Pandas DataFrame.
    3. Vector processing: GeoMesa. The GeoMesa HDFS DataStore provides temporal and geospatial indexing, and the GeoMesa lib enables Spark to query the DataStore.
    4. Raster processing: GeoPySpark (GeoTrellis), with a workflow to ingest NetCDF data.
    5. Interactive geospatial analysis: GeoNotebook (Jupyter notebook on the left side of the screen, interactive map on the right, with operations that get translated from one to the other).
    6. Processing of point cloud data: PDAL. A Python lib on the workers enables many geospatial operations on point cloud data; it can be hooked up with Spark to enable geospatial indexing and exploratory analysis of massive LiDAR datasets.
  6. GeoMesa FileSystem Data Store: Data ingestion and temporal/spatial partitioning

    GDELT: Global Database of Events, Language, and Tone

    “GDELT Event Database periodically scans news articles and uses natural language processing to identify the people, locations, organizations, counts, themes, sources, emotions, quotes and events driving our global society every second of every day.” GDELT data is updated each morning at 6am.

    S3 to HDFS with distcp:

        hadoop distcp \
          s3a://gdelt-open-data/events/2017010* \
          /tmp/geomesa_source

    Ingesting (and partitioning) GDELT:

        $ geomesa-fs ingest \
        > --path hdfs:///data/geomesa \
        > --encoding parquet \
        > --partition-scheme daily,z2-2bit \
        > --converter gdelt \
        > --spec gdelt \
        > --num-reducers 60 \
        > hdfs:///tmp/geomesa_source/*
        INFO Schema 'gdelt' exists
        INFO Running ingestion in distributed mode
        INFO Submitting job - please wait...
        INFO Tracking available at http://master1:8088/proxy/application_1542841752250_0003/
        Map (stage 1/2): [============================================================] 100% complete 1285001 mapped 0 failed in 00:00:54
        Reduce (stage 2/2): [============================================================] 100% complete 1285001 written in 00:01:49
        INFO Distributed ingestion complete in 00:02:44
        INFO Ingested 1285001 features with no failures.
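To make the `daily,z2-2bit` partition scheme above more concrete, here is a minimal Python sketch of how a date plus a 2-bit Z-order cell could be turned into a partition key. This is an illustration of the idea only: the function names `z2_index` and `partition_key`, the bit layout, and the `YYYY/MM/DD/cell` key format are assumptions made for this example and may not match GeoMesa's actual on-disk layout.

```python
from datetime import datetime, timezone


def z2_index(lon: float, lat: float, bits_per_dim: int = 1) -> int:
    """Interleave the grid-cell bits of lon/lat into a Z-order index.

    Hypothetical helper: with bits_per_dim=1 the world is split into
    2 x 2 = 4 cells, i.e. a 2-bit index.
    """
    cells = 1 << bits_per_dim
    # Map each coordinate to an integer cell in [0, cells - 1].
    x = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    y = min(int((lat + 90.0) / 180.0 * cells), cells - 1)
    z = 0
    for i in range(bits_per_dim):
        z |= ((x >> i) & 1) << (2 * i)      # longitude bits on even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # latitude bits on odd positions
    return z


def partition_key(event_time: datetime, lon: float, lat: float) -> str:
    """Hypothetical 'daily,z2-2bit'-style key: a day path plus a 2-bit cell."""
    day = event_time.astimezone(timezone.utc).strftime("%Y/%m/%d")
    return f"{day}/{z2_index(lon, lat, bits_per_dim=1)}"


# An event near Montreal on 5 January 2017 (timezone-aware timestamp).
montreal_event = datetime(2017, 1, 5, 14, 30, tzinfo=timezone.utc)
print(partition_key(montreal_event, -73.57, 45.50))  # 2017/01/05/2
```

The point of such a scheme is that a query restricted to a few days and a bounding box only needs to read the matching day/cell partitions rather than the whole dataset.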
  9. Map Matching: Introducing the MTL-Trajet dataset

    Also available from the City of Montreal’s website: http://donnees.ville.montreal.qc.ca/dataset/mtl-trajet
  10. Map Matching: The math behind what we use

    • Hidden Markov Model (HMM) serves to find the most probable state sequence for a given sequence of observations;
    • The states of the HMM are the individual road segments and the state measurements/observations are the noisy vehicle location (GPS) measurements;
    • After the probabilities of observation and transitions are computed with the HMM, the Viterbi algorithm is used to identify the most probable sequence of states (i.e., street segments);
    • This approach can also be applied to other networks, such as bike paths and railroads, and the GPS data collected on those modes of transport;
    • References and more information:
      ◦ P. Newson and J. Krumm. Hidden Markov Map Matching Through Noise and Sparseness. In Proceedings of International Conference on Advances in Geographic Information Systems, 2009. Project URL: https://www.microsoft.com/en-us/research/publication/hidden-markov-map-matching-noise-sparseness/
      ◦ OSRM developers' announcement and introduction regarding the map-matching feature: https://www.mapbox.com/blog/map-matching/
      ◦ Barefoot, another map-matching solution, also implements the same algorithm: https://github.com/bmwcarit/barefoot/wiki#hmm-map-matching
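The HMM/Viterbi idea on this slide can be sketched with a toy example. The code below is not OSRM's implementation; it is a minimal illustration that assumes a Gaussian emission model on the point-to-segment distance and a hand-picked transition matrix (both invented for this example, as are the segment layout and the numbers). It shows how Viterbi can reject a noisy GPS fix that a naive nearest-segment match would assign to the wrong street.

```python
import math

SIGMA_M = 10.0  # assumed GPS noise in metres (illustrative value)


def log_emission(distance_m):
    """Log of an unnormalised Gaussian on the point-to-segment distance."""
    return -0.5 * (distance_m / SIGMA_M) ** 2


def viterbi(log_emit, log_trans):
    """Return the most probable state sequence, working in log space.

    log_emit[t][s]  : log-likelihood of observation t given segment s
    log_trans[r][s] : log-probability of moving from segment r to segment s
    """
    n_states = len(log_trans)
    score = list(log_emit[0])  # uniform prior over starting segments
    back_pointers = []
    for emit_t in log_emit[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + emit_t[s])
        score = new_score
        back_pointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy network: segments 0 and 1 are consecutive pieces of one street,
# segment 2 is a parallel street that is not directly connected to them.
# Rows are GPS fixes, columns are distances (m) to segments 0, 1 and 2.
distances_m = [[5, 40, 30], [20, 15, 12], [45, 4, 35]]
transition = [[0.60, 0.39, 0.01],
              [0.39, 0.60, 0.01],
              [0.01, 0.01, 0.98]]

log_emit = [[log_emission(d) for d in row] for row in distances_m]
log_trans = [[math.log(p) for p in row] for row in transition]

nearest = [row.index(min(row)) for row in distances_m]
matched = viterbi(log_emit, log_trans)
print(nearest)  # [0, 2, 1]
print(matched)  # [0, 1, 1]
```

The second fix is closest to the parallel street (segment 2), but jumping there and back is so unlikely under the transition model that the most probable sequence stays on the connected segments. In the Newson and Krumm formulation, the transition probabilities are derived from route and great-circle distances rather than fixed by hand as they are here.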
  11. Code Walkthrough & Live Demo

    Notebooks/Scripts:
    • Zeppelin note: Perform Map Matching
    • Jupyter notebook: mapping_map_matching
    • Zeppelin note: Map Matching Perf Test
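The notebooks themselves are not reproduced in this transcript, so the following is only a rough sketch of one way OSRM and Spark could be combined, not the presenter's code. It assumes an OSRM server reachable over HTTP from every Spark worker (the `OSRM_URL` value is hypothetical) and uses OSRM's `match` service; the helper names `build_match_url`, `street_names` and `match_partition` are invented for this example.

```python
import json
from urllib.request import urlopen

OSRM_URL = "http://localhost:5000"  # hypothetical OSRM endpoint on each worker


def build_match_url(points, base_url=OSRM_URL, profile="driving"):
    """Build an OSRM match-service URL from (lon, lat, unix_ts) tuples."""
    coords = ";".join(f"{lon:.6f},{lat:.6f}" for lon, lat, _ in points)
    timestamps = ";".join(str(int(ts)) for _, _, ts in points)
    return (f"{base_url}/match/v1/{profile}/{coords}"
            f"?timestamps={timestamps}&overview=false")


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def street_names(response):
    """One street name per input point; None where OSRM dropped the point."""
    return [tp["name"] if tp else None
            for tp in response.get("tracepoints", [])]


def match_partition(trips, fetch=fetch_json):
    """Map-match every trip in one partition.

    `trips` yields (trip_id, [(lon, lat, unix_ts), ...]) pairs. The `fetch`
    argument can be swapped for a stub when no OSRM server is available.
    """
    for trip_id, points in trips:
        yield trip_id, street_names(fetch(build_match_url(points)))


# With PySpark, given an RDD of (trip_id, points) pairs, the call would be:
#   matched_rdd = trips_rdd.mapPartitions(match_partition)
```

Working per partition keeps the HTTP calls on the workers, so throughput scales with the number of executors, provided the OSRM instances behind `OSRM_URL` can keep up.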
  12. Q&A