Title: Big Geospatial Data with Open-source Tech - Masood Krohy

Masood Krohy at the June 11, 2019 event of EDPP Montreal (montrealml.dev/data)

Title: Big Geospatial Data with Open-source Tech (Vectors, Rasters & Map-matching)

Presentation/Demo video: check out PatternedScience's YouTube channel at https://www.youtube.com/channel/UCjbEIZlS2DA45Bswi5EXWRw

Summary: Geospatial datasets (i.e., geocoded data points) are everywhere nowadays and often add enormous value to data analytics/mining and machine learning projects. In this era of Big Data, libraries and engines such as GeoPandas and PostGIS, and their commercial equivalents, often cannot scale up sufficiently to let us tap into the Big Data being collected in many use cases and by many organizations. In this talk/demo, we explore free, open-source, Big Data-ready technologies and workflows such as GeoMesa, GeoPySpark and OSRM-on-Spark, and show how to use these Apache Spark-based technologies for key geospatial operations and use cases. We start by introducing GeoMesa and demoing how it can be used to ingest Big Geospatial Data and perform operations on vectors. Next, we briefly introduce GeoPySpark, the Python interface to GeoTrellis, for performing operations on rasters. Finally, we turn to map-matching, the process of associating geocoded data points with named elements of an underlying network (e.g., determining which street a particular GPS point should be associated with). We describe and demo how OSRM can be combined with Spark to do scalable map-matching on Big Data, which opens up many possibilities for advanced data mining and machine learning projects.

Bio: Masood Krohy is a Data Science Platform Architect/Advisor and most recently acted as the Chief Architect of UniAnalytica, an advanced data science platform with wide, out-of-the-box support for time-series and geospatial use cases. He has worked with several corporations in different industries in the past few years to design, implement and productionize Deep Learning and Big Data products. He holds a Ph.D. in computer engineering.

PatternedScience

June 11, 2019

Transcript

  1. Copyright © 2019, PatternedScience Inc.
    www.patterned.science
    Big Geospatial Data with Open-source Tech
    (Vectors, Rasters & Map-matching)
    Presenter
    Masood Krohy, Ph.D.
    June 11, 2019

  2. Presentation Layout
    1. Intros
      ● Presenter bio
      ● Quick intro to Spark and a tour of Web UIs
    2. Operations on vectors with GeoMesa
      ● Data ingestion and temporal/spatial partitioning
      ● Demo: sample vector operations in a Zeppelin note
    3. Operations on rasters with GeoPySpark
      ● Configs for a Spark cluster using GeoPySpark
      ● Code walkthrough of a sample raster operation
    4. Large-scale map-matching with Spark and OSRM
      ● Intro to mtl-trajet dataset and the math behind map-matching
      ● Demo: map-matching in a Zeppelin note and viewing of results

  3. Presenter Bio
    Masood Krohy, Data Science Platform Architect & Advisor
    Years marked on the slide's timeline: 2013, 2014, 2016, 2017, 2018, 2019
    ● Ph.D. in Computer Engineering: Analytical modeling of botnets. Validated by data collected in industry. 3 top publications.
    ● Senior Analyst, Rogers: Managing the analytics reporting/statistical analyses of the national benchmarking program.
    ● Data Scientist, Intact: First Data Scientist of the company. Led the Big Data mining project for the UBI program.
    ● Lead Data Scientist, CN: Implemented an object-within-object detection system to detect cracks in railway equipment.
    ● Sr Data Science Advisor, B.Yond: Implemented a pattern detection system for the stream of alarms coming from telecom devices.
    ● Chief Architect, UniAnalytica (advanced data science platform): Platform contains Apache Spark, GeoMesa, GeoPySpark and OSRM, among many others.

  4. Geospatial/Climate Data Processing Stack (UniAnalytica Platform)
    1. Map-matching
      OSRM (Open Source Routing Machine), in combination with Spark for large-scale map-matching.
    2. Slicing/dicing of GPS/LiDAR data
      OmniSci (MapD) GPU Database: slicing and dicing of GPS and LiDAR data. Used along with the GPU DataFrame lib, it gives a much faster alternative to Pandas DataFrames.
    3. Vector processing
      GeoMesa: vector processing/operations. The GeoMesa HDFS DataStore provides temporal and geospatial indexing, and the GeoMesa lib enables Spark to query the DataStore.
    4. Raster processing
      GeoPySpark (GeoTrellis): raster processing/operations, including a workflow to ingest NetCDF data.
    5. Interactive geospatial analysis
      GeoNotebook: interactive geospatial analysis (Jupyter notebook on the left side of the screen, interactive map on the right, with operations that get translated from one to the other).
    6. Processing of point cloud data
      PDAL: Python lib on workers enabling many geospatial operations on point cloud data. It can be hooked up with Spark to enable geospatial indexing and exploratory analysis of massive LiDAR datasets.

  5. [Figure; graph source: Databricks]

  12. GeoMesa FileSystem Data Store
    Data ingestion and temporal/spatial partitioning

    GDELT: Global Database of Events, Language, and Tone
    “GDELT Event Database periodically scans news articles and uses natural language processing to identify the people, locations, organizations, counts, themes, sources, emotions, quotes and events driving our global society every second of every day.”
    GDELT data is updated each morning at 6am.

    S3 to HDFS with distcp:
    hadoop distcp \
      s3a://gdelt-open-data/events/2017010* \
      /tmp/geomesa_source

    Ingesting (and partitioning) GDELT:
    $ geomesa-fs ingest \
        --path hdfs:///data/geomesa \
        --encoding parquet \
        --partition-scheme daily,z2-2bit \
        --converter gdelt \
        --spec gdelt \
        --num-reducers 60 \
        hdfs:///tmp/geomesa_source/*
    INFO Schema 'gdelt' exists
    INFO Running ingestion in distributed mode
    INFO Submitting job - please wait...
    INFO Tracking available at http://master1:8088/proxy/application_1542841752250_0003/
    Map (stage 1/2): [============================================================] 100% complete 1285001 mapped 0 failed in 00:00:54
    Reduce (stage 2/2): [============================================================] 100% complete 1285001 written in 00:01:49
    INFO Distributed ingestion complete in 00:02:44
    INFO Ingested 1285001 features with no failures.
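
The `daily,z2-2bit` partition scheme passed to `geomesa-fs ingest` above pairs a per-day temporal partition with a coarse spatial one. The following is a minimal sketch of that idea, not GeoMesa's own code. It assumes that a 2-bit Z2 curve splits the world into four lon/lat quadrants and that longitude takes the low bit; the function names and the `yyyy/MM/dd/<cell>` key layout are hypothetical and may differ from what GeoMesa writes to HDFS.

```python
from datetime import date


def z2_index(lon: float, lat: float, bits_per_dim: int) -> int:
    """Interleave the top bits of normalized lon/lat into a Z-order index."""
    cells = 1 << bits_per_dim
    # Normalize to [0, cells) and clamp the upper edge into the last cell.
    x = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    y = min(int((lat + 90.0) / 180.0 * cells), cells - 1)
    z = 0
    for i in range(bits_per_dim):
        z |= ((x >> i) & 1) << (2 * i)      # longitude bits in even positions (assumption)
        z |= ((y >> i) & 1) << (2 * i + 1)  # latitude bits in odd positions (assumption)
    return z


def partition_key(day: date, lon: float, lat: float, total_bits: int = 2) -> str:
    """Hypothetical 'daily,z2-2bit'-style key: one folder per day, then one per Z2 cell."""
    if total_bits % 2 != 0:
        raise ValueError("total_bits must be even (half per dimension)")
    z = z2_index(lon, lat, total_bits // 2)
    return f"{day:%Y/%m/%d}/{z}"


# An event located in Montreal on 2017-01-01
montreal_key = partition_key(date(2017, 1, 1), -73.57, 45.50)
```

Under these assumptions, a query restricted to one day and one region only needs to read the files under the matching day and quadrant, which is the point of partitioning at ingestion time.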

  13. Code Walkthrough & Live Demo
    Notebook/Script
    ● Zeppelin note: GeoMesa examples

  15. Code Walkthrough
    Notebook/Script
    ● Jupyter notebook: geotrellis_geopyspark_quick_example

  17. Map Matching
    Introducing MTL-Trajet dataset
    Also available from the City of Montreal’s website: http://donnees.ville.montreal.qc.ca/dataset/mtl-trajet

  18. Map Matching
    The math behind what we use
    ● A Hidden Markov Model (HMM) is used to find the most probable state sequence for a given sequence of observations;
    ● The states of the HMM are the individual road segments, and the observations are the noisy vehicle location (GPS) measurements;
    ● Once the observation (emission) and transition probabilities of the HMM are computed, the Viterbi algorithm is used to identify the most probable sequence of states (i.e., street segments);
    ● This approach can also be applied to other networks, such as bike paths and railroads, and to the GPS data collected on those modes of transport;
    ● References and more information:
      ○ P. Newson and J. Krumm. Hidden Markov Map Matching Through Noise and Sparseness. In Proceedings of International Conference on Advances in Geographic Information Systems, 2009. Project URL: https://www.microsoft.com/en-us/research/publication/hidden-markov-map-matching-noise-sparseness/
      ○ OSRM developers' announcement and introduction regarding the map-matching feature: https://www.mapbox.com/blog/map-matching/
      ○ Barefoot, another map-matching solution, also implements the same algorithm: https://github.com/bmwcarit/barefoot/wiki#hmm-map-matching
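
The HMM/Viterbi approach described on this slide can be sketched on a toy example. This is an illustrative sketch, not OSRM's or Barefoot's implementation. It follows the general shape of the Newson and Krumm model: a Gaussian emission probability on the distance from a GPS point to a candidate segment, and an exponential transition probability on the difference between straight-line and route distance. The values of `sigma` and `beta`, the two segments "A" and "B", and all distances are made-up assumptions.

```python
import math


def emission_logp(gps_to_segment_m: float, sigma: float = 5.0) -> float:
    """Log of a zero-mean Gaussian on the GPS-to-segment distance (sigma is assumed)."""
    return -0.5 * (gps_to_segment_m / sigma) ** 2 - math.log(math.sqrt(2 * math.pi) * sigma)


def transition_logp(straight_m: float, route_m: float, beta: float = 5.0) -> float:
    """Log of an exponential on |straight-line distance - route distance| (beta is assumed)."""
    return -abs(straight_m - route_m) / beta - math.log(beta)


def viterbi(states, emis, trans):
    """Most probable state sequence, computed in log-space.

    emis:  one dict per observation, state -> log emission probability
    trans: one dict per step, (prev_state, cur_state) -> log transition probability
    """
    score = {s: emis[0][s] for s in states}
    back = []  # back-pointers, one dict per step
    for t in range(1, len(emis)):
        new_score, pointers = {}, {}
        for cur in states:
            best_prev = max(states, key=lambda p: score[p] + trans[t - 1][(p, cur)])
            new_score[cur] = score[best_prev] + trans[t - 1][(best_prev, cur)] + emis[t][cur]
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Walk the back-pointers from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy data: two parallel streets; the second GPS fix is noisy and lies closer to B.
states = ["A", "B"]
dist_to_segment = [{"A": 4.0, "B": 10.0}, {"A": 12.0, "B": 6.0}, {"A": 3.0, "B": 11.0}]
straight_m = 50.0  # straight-line distance between consecutive fixes
route_m = {("A", "A"): 50.0, ("B", "B"): 50.0, ("A", "B"): 250.0, ("B", "A"): 250.0}

emis = [{s: emission_logp(d[s]) for s in states} for d in dist_to_segment]
trans = [{k: transition_logp(straight_m, r) for k, r in route_m.items()}
         for _ in range(len(emis) - 1)]

matched = viterbi(states, emis, trans)                       # HMM result
nearest = [min(d, key=d.get) for d in dist_to_segment]       # naive nearest-segment result
```

In this toy setup the nearest-segment rule jumps to street B for the noisy middle fix, while the Viterbi path stays on A because switching streets would require a long detour that the transition probability penalizes.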

  19. Map Matching
    [Figure: map-matching of 16 randomly-selected trips]

  20. Code Walkthrough & Live Demo
    Notebooks/Scripts
    ● Zeppelin note: Perform Map Matching
    ● Jupyter notebook: mapping_map_matching
    ● Zeppelin note: Map Matching Perf Test
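
The notebooks themselves are not part of this transcript, so the following is only a sketch of one way a trip could be sent to OSRM's HTTP match service and the matched street names read back. It assumes an OSRM server reachable at `http://localhost:5000` (for example, one instance per Spark worker) and the `driving` profile; the helper names `build_match_url` and `street_names` are hypothetical.

```python
from urllib.parse import urlencode


def build_match_url(points, base_url="http://localhost:5000", profile="driving"):
    """Build an OSRM /match/v1 request URL.

    points: list of (lon, lat, unix_timestamp) tuples ordered by time.
    base_url and profile are assumptions about the deployment.
    """
    coords = ";".join(f"{lon:.6f},{lat:.6f}" for lon, lat, _ in points)
    query = urlencode(
        {
            "timestamps": ";".join(str(int(ts)) for _, _, ts in points),
            "geometries": "geojson",
            "overview": "full",
        },
        safe=";",  # keep the ';' separators readable in the URL
    )
    return f"{base_url}/match/v1/{profile}/{coords}?{query}"


def street_names(match_response: dict):
    """Return one street name per input point; None where OSRM dropped the point."""
    if match_response.get("code") != "Ok":
        return []
    return [tp["name"] if tp is not None else None
            for tp in match_response["tracepoints"]]


# Two GPS fixes ten seconds apart (made-up coordinates in downtown Montreal)
trip = [(-73.5673, 45.5017, 1560250000), (-73.5680, 45.5025, 1560250010)]
url = build_match_url(trip)
```

On Spark, a function like this could be applied once per trip inside `mapPartitions`, so that each executor sends its trips to a nearby OSRM instance and the matching scales out with the cluster.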

  21. Q&A
