Working with OpenStreetMap using Apache Spark and GeoTrellis - SotMUS 2018

Seth Fitzsimmons

October 07, 2018

  1. What if you wanted to query all of OSM for all time (for every change ever)?
  2. We need a new name! Think of suggestions, and at the end of the talk we will vote!
  3. OSMesa is a set of Spark functions to churn through enormous quantities of weird data and produce useful results quickly.
  4. Philosophy: Make data accessible from standard tools
    • OSM PBF - supported by OSM-aware tools
    • ORC - supported by the Hadoop ecosystem
    • OSM data model - (partially) supported by OSM-aware tools
    • OGC data model - supported by everything else
  5. OSM Data Model
    The OSM data model consists mainly of 3 elements:
    • Nodes - Points
    • Ways - LineStrings, Polygons
    • Relations - GeometryCollections, Polygons with holes, MultiPolygons
    There is also the tag-based metadata that applies to each element, and changesets that group edits.
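The three element types above can be sketched as plain Python data classes. This is only an illustration of the data model, not OSMesa's own representation: the class and field names are assumptions, and treating any closed way as a polygon candidate is a simplification (in practice the tags also matter).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A point with tag-based metadata."""
    id: int
    lon: float
    lat: float
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    """An ordered list of node references: a LineString, or a Polygon when closed."""
    id: int
    node_ids: List[int]
    tags: Dict[str, str] = field(default_factory=dict)

    def is_closed(self) -> bool:
        # Simplification: a ring needs at least 4 references and must end where it starts.
        return len(self.node_ids) >= 4 and self.node_ids[0] == self.node_ids[-1]


@dataclass
class Member:
    """One member of a relation."""
    type: str  # "node", "way", or "relation"
    ref: int
    role: str  # e.g. "outer" or "inner" for multipolygons


@dataclass
class Relation:
    """A group of members, e.g. a MultiPolygon or a Polygon with holes."""
    id: int
    members: List[Member]
    tags: Dict[str, str] = field(default_factory=dict)


def way_coordinates(way: Way, nodes_by_id: Dict[int, Node]) -> List[Tuple[float, float]]:
    """Resolve a way's node references into (lon, lat) coordinates."""
    return [(nodes_by_id[n].lon, nodes_by_id[n].lat) for n in way.node_ids]


# Hypothetical sample data: a small closed way tagged as a building.
nodes = {
    1: Node(1, 0.0, 0.0),
    2: Node(2, 1.0, 0.0),
    3: Node(3, 1.0, 1.0),
}
building = Way(10, [1, 2, 3, 1], {"building": "yes"})
ring = way_coordinates(building, nodes)
```

A way carries no coordinates of its own, which is why geometries have to be assembled from the nodes it references.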
  6. OSM Data Model: Changesets
    • Edits are grouped into changesets, which have their own metadata, such as user comments (for developers, think commit messages)
    • Adding hashtags to user comments allows downstream processing to group changes - for example, #HOTLunch
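As a minimal sketch of the hashtag grouping described above, the following pulls hashtags out of changeset comments and groups changeset IDs by hashtag. The regular expression, the lowercasing of hashtags, and the dictionary layout of a changeset are assumptions made for illustration; OSMesa's own rules may differ.

```python
import re
from collections import defaultdict

# Assumed pattern: "#" followed by word characters or hyphens.
HASHTAG = re.compile(r"#(\w[\w-]*)")


def extract_hashtags(comment):
    """Return the distinct hashtags in a comment, lowercased and sorted."""
    return sorted({tag.lower() for tag in HASHTAG.findall(comment or "")})


def group_by_hashtag(changesets):
    """Map each hashtag to the IDs of the changesets whose comment mentions it."""
    groups = defaultdict(list)
    for changeset in changesets:
        for tag in extract_hashtags(changeset.get("comment", "")):
            groups[tag].append(changeset["id"])
    return dict(groups)


# Hypothetical changesets; only the #HOTLunch hashtag comes from the talk.
changesets = [
    {"id": 1, "comment": "Added buildings #HOTLunch"},
    {"id": 2, "comment": "Fixed road names"},
    {"id": 3, "comment": "More buildings #hotlunch #example-tag"},
]
groups = group_by_hashtag(changesets)
```

Lowercasing makes `#HOTLunch` and `#hotlunch` land in the same group, which is a design choice rather than something the talk specifies.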
  7. Creating features from History
    • With OSMesa, we can create full historical geometries.
    • To do this, we needed to create a concept of “minor versions” of geometries
    • We converted timestamp to an update date that propagates up to the way or relation
    • We added a “valid_until” tag on elements that tells when an element is no longer valid (either replaced or deleted)
  8. [Diagram: way v1 (highway=unclassified) and way v2 (highway=primary), each drawn with the node versions (v1, v2) it references]
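  9. [Diagram: way versions and the node versions they reference, with a minor version change]
    • way v1, highway=unclassified: node v1, node v1, node v1, node v1
    • way v1.1, highway=unclassified: node v1, node v1, node v2, node v2 (minor version change)
    • way v2, highway=primary: node v1, node v1, node v2, node v2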
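One way to derive minor versions like the v1.1 in the diagram is sketched below: a node edit that lands while a way version is current produces a new minor version of that way, and each version is valid until the next one starts. All class and function names are hypothetical, timestamps are plain integers, and deletions are not handled; this is not OSMesa's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class NodeVersion:
    node_id: int
    version: int
    timestamp: int  # any sortable time value; integers keep the sketch simple


@dataclass(frozen=True)
class WayVersion:
    version: int
    timestamp: int
    node_ids: Tuple[int, ...]
    highway: str


@dataclass(frozen=True)
class GeometryVersion:
    label: str                 # "1", "1.1", "2", ...
    highway: str
    updated: int               # update date propagated up from the way or its nodes
    valid_until: Optional[int]  # None while the version is still current
    node_versions: Tuple[int, ...]


def node_version_at(node_id: int, when: int, node_history: List[NodeVersion]) -> int:
    """Latest version of a node at a given time (assumes the node already exists)."""
    candidates = [n for n in node_history if n.node_id == node_id and n.timestamp <= when]
    return max(candidates, key=lambda n: n.timestamp).version


def minor_versions(way_history: List[WayVersion],
                   node_history: List[NodeVersion]) -> List[GeometryVersion]:
    ways = sorted(way_history, key=lambda w: w.timestamp)
    out = []
    for i, way in enumerate(ways):
        end = ways[i + 1].timestamp if i + 1 < len(ways) else None
        refs = set(way.node_ids)
        # Node edits strictly inside this way version's lifetime start minor versions.
        edits = sorted({
            n.timestamp for n in node_history
            if n.node_id in refs and n.timestamp > way.timestamp
            and (end is None or n.timestamp < end)
        })
        starts = [way.timestamp] + edits
        for minor, start in enumerate(starts):
            valid_until = starts[minor + 1] if minor + 1 < len(starts) else end
            label = f"{way.version}.{minor}" if minor else str(way.version)
            versions = tuple(node_version_at(n, start, node_history) for n in way.node_ids)
            out.append(GeometryVersion(label, way.highway, start, valid_until, versions))
    return out


# Data shaped like the diagram: two nodes move between way v1 and way v2.
nodes = [NodeVersion(i, 1, 5) for i in (1, 2, 3, 4)]
nodes += [NodeVersion(3, 2, 20), NodeVersion(4, 2, 20)]
ways = [
    WayVersion(1, 10, (1, 2, 3, 4), "unclassified"),
    WayVersion(2, 30, (1, 2, 3, 4), "primary"),
]
history = minor_versions(ways, nodes)
```

With this input, `history` holds three geometry versions labeled `1`, `1.1`, and `2`, mirroring the diagram.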
  10. Full historical geometries
    • With minor versions, we can bake new ORC files that contain geometries of every element in OSM history, with ways/relations representing every edit to the element as well as to the elements that they contain
    • Then, we compute statistics per changeset based on geometries, and roll up the statistics per user and hashtag
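The two-stage aggregation described above (statistics per changeset, then a roll-up per user and per hashtag) can be sketched in plain Python. The record layout, the choice of road length as the statistic, and the haversine approximation are all assumptions for illustration; the real pipeline runs on Spark over ORC files.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance between two (lon, lat) points in kilometres."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def line_length_km(coords):
    """Length of a line given as a list of (lon, lat) points."""
    return sum(haversine_km(p, q) for p, q in zip(coords, coords[1:]))


def changeset_stats(edits):
    """Stage 1: one statistics record per changeset, computed from geometries."""
    stats = {}
    for e in edits:
        s = stats.setdefault(e["changeset"], {
            "user": e["user"], "hashtags": set(e["hashtags"]),
            "road_km": 0.0, "edits": 0,
        })
        s["road_km"] += line_length_km(e["coords"])
        s["edits"] += 1
    return stats


def roll_up(stats, key):
    """Stage 2: sum changeset statistics per user (key='user') or per hashtag (key='hashtags')."""
    totals = defaultdict(lambda: {"road_km": 0.0, "edits": 0})
    for s in stats.values():
        groups = [s["user"]] if key == "user" else s["hashtags"]
        for g in groups:
            totals[g]["road_km"] += s["road_km"]
            totals[g]["edits"] += s["edits"]
    return dict(totals)


# Hypothetical edits: each is a road geometry attributed to a changeset.
edits = [
    {"changeset": 1, "user": "alice", "hashtags": ["hotlunch"], "coords": [(0, 0), (0, 1)]},
    {"changeset": 1, "user": "alice", "hashtags": ["hotlunch"], "coords": [(0, 1), (0, 2)]},
    {"changeset": 2, "user": "bob", "hashtags": [], "coords": [(0, 0), (0, 1)]},
]
per_changeset = changeset_stats(edits)
per_user = roll_up(per_changeset, "user")
per_hashtag = roll_up(per_changeset, "hashtags")
```

Computing statistics once per changeset keeps the roll-ups cheap: per-user and per-hashtag totals are sums over small changeset records rather than over the geometries themselves.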
  11. Processing OSM data at scale
    • Processing of full history into features in under 40 minutes (cluster of 255 m3.2xlarge nodes)
    • This is not a small cluster (≈$65/hour). YMMV with smaller clusters.
    • We are building update mechanisms to avoid refreshing the entire dataset
    • Produces 600GB of ORC
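For a rough sense of what one full-history run costs, the figures above can be combined directly. This is back-of-the-envelope arithmetic under the assumption of simple pro-rata billing at the quoted hourly rate; it ignores cluster start-up time, storage, and billing granularity.

```python
def run_cost_usd(hourly_rate_usd: float, minutes: float) -> float:
    """Pro-rata cost of a run, assuming billing proportional to runtime."""
    return hourly_rate_usd * minutes / 60.0


CLUSTER_RATE_USD_PER_HOUR = 65.0  # approximate rate quoted on the slide
RUNTIME_MINUTES = 40              # "under 40 minutes", so this is an upper bound
NODE_COUNT = 255

upper_bound_usd = run_cost_usd(CLUSTER_RATE_USD_PER_HOUR, RUNTIME_MINUTES)
per_node_hourly_usd = CLUSTER_RATE_USD_PER_HOUR / NODE_COUNT

print(f"Full-history run: at most about ${upper_bound_usd:.2f}")
print(f"Implied rate per node: about ${per_node_hourly_usd:.3f}/hour")
```

Under these assumptions a full run costs at most roughly $43, at an implied rate of about $0.25 per node per hour.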
  12. Other Features
    • Streaming updates of vector tiles based on replication files (Spark Streaming)
    • Streaming aggregations
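OSM replication files are published under sequence numbers, and a consumer catches up by fetching every sequence after the last one it applied. The sketch below shows only that bookkeeping, using the zero-padded `AAA/BBB/CCC` directory layout of the OSM replication servers. The function names are hypothetical, and how OSMesa wires this into Spark Streaming is not shown here.

```python
def replication_path(sequence: int, suffix: str = ".osc.gz") -> str:
    """Relative path of a replication file for a sequence number.

    The number is zero-padded to 9 digits and split into three groups of three.
    """
    if not 0 <= sequence <= 999_999_999:
        raise ValueError("sequence number out of range")
    digits = f"{sequence:09d}"
    return f"{digits[0:3]}/{digits[3:6]}/{digits[6:9]}{suffix}"


def pending_sequences(last_applied: int, current: int) -> range:
    """Sequence numbers that still need to be fetched and applied, in order."""
    return range(last_applied + 1, current + 1)


# Hypothetical state: we applied sequence 3141590 and the server is at 3141592.
paths = [replication_path(s) for s in pending_sequences(3141590, 3141592)]
```

Here `paths` lists the two files still to be applied, oldest first, so updates are replayed in the order they were published.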
  13. OSMesa: Other current uses
    • Building matching between OSM and other vector datasets
    • Generating vector tiles for URCHN containing a subset of historical data to front-end analytics
  14. The Future: Validation workflows, Reputation scores
    • Better validation workflows are a big question in the OSM community right now (according to SOTM US 2017)
    • HOT Tasking Manager does some; we can do better
    • One way to improve validation workflows is to suggest that validation be done by veteran mappers, and that the work of more junior mappers be suggested for validation (“reputation score”)
    • Development Seed, who contribute to and use OSMesa, have great ideas in this space.
  15. The Future: Machine Learning pre- and post-processing
    • Pre-processing geospatial imagery and OSM into training chips - a distributed label-maker
    • Managing data into and out of Raster Vision
    • Post-processing by cleaning the model output, matching to OSM or other vector data to remove duplicates, conflation workflows
    • Matching OSM to imagery dates: e.g. pre- and post-disaster.
  16. THANKS!
    Rob Emanuele, Azavea - @lossyrob (Twitter, GitHub) - www.azavea.com
    Seth Fitzsimmons, Pacific Atlas - @mojodna (Twitter, GitHub) - www.pacatlas.com
    github.com/azavea/osmesa
    One more cool visualization: https://vimeo.com/269953189