Working with OpenStreetMap using Apache Spark and GeoTrellis - SotMUS 2018


Seth Fitzsimmons

October 07, 2018


  1. What if you wanted to query all of OSM for all time (for every change ever)?
  2. What if you wanted to do it at scale?

  3. What if you wanted to do it quickly?

  4. We needed this for different reasons.

  6. We approached it from different angles.

  7. Premise: with some work, relational algebra (SQL-like) can be used to process OSM data
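To make the premise concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a SQL engine such as Spark SQL. The table layout (nodes, way_nodes, way_tags) and the sample values are assumptions made for illustration, not OSMesa's actual schema.

```python
import sqlite3

# Toy OSM-like tables; schema and values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL);
CREATE TABLE way_nodes (way_id INTEGER, seq INTEGER, node_id INTEGER);
CREATE TABLE way_tags (way_id INTEGER, key TEXT, value TEXT);
""")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
                 [(1, 41.82, -71.41), (2, 41.83, -71.40), (3, 41.84, -71.39)])
conn.executemany("INSERT INTO way_nodes VALUES (?, ?, ?)",
                 [(10, 0, 1), (10, 1, 2), (11, 0, 2), (11, 1, 3)])
conn.executemany("INSERT INTO way_tags VALUES (?, ?, ?)",
                 [(10, "highway", "primary"), (11, "building", "yes")])


def highway_coordinates(connection):
    """Select ways tagged as highways and join them to their nodes, in node order."""
    return connection.execute("""
        SELECT wn.way_id, n.lat, n.lon
        FROM way_tags t
        JOIN way_nodes wn ON wn.way_id = t.way_id
        JOIN nodes n ON n.id = wn.node_id
        WHERE t.key = 'highway'
        ORDER BY wn.way_id, wn.seq
    """).fetchall()


coords = highway_coordinates(conn)
```

The selection on tags and the join from ways to nodes are ordinary relational operations, so the same query shape could be expressed against ORC-backed tables in a distributed engine.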

  10. We worked together to start OSMesa.

  11. We need a new name! Think of suggestions, and at the end of the talk we will vote!
  12. OSMesa is a set of Spark functions to churn through enormous quantities of weird data and produce useful results quickly.
  13. Doesn’t that already exist?

  14. Kind of...

  15. Not really.

  16. Not off the shelf.

  17. It takes forever. (not horizontally scalable)

  18. History is the straw that breaks everything.

  19. OSMesa is solving our problems.

  20. ...and it might be useful for you, too.

  21. Goal: Facilitate rapid concept iteration

  22. Goal: Produce useful derivatives that can be inputs into other

  23. Goal: Enable inconceivable things

  24. Philosophy: Compatibility with off-the-shelf tools

  25. Philosophy: Make data accessible from standard tools
    • OSM PBF - supported by OSM-aware tools
    • ORC - supported by the Hadoop ecosystem
    • OSM data model - (partially) supported by OSM-aware tools
    • OGC data model - supported by everything else
  26. OSMesa workflow (diagram labels): AWS EMR Cluster, AWS S3, ORC files, Statistics, Vector Tiles
  27. Side-effect: Exchange of ideas

  28. OSM ←→ Spark / GeoTrellis / JTS

  29. Differentiators:
    • Spark-based
    • Defaults to history
    • Handles relations
    • Uses changesets for enrichment
  30. Things that are hard: • History • Relations

  31. OSM Data Model
    The OSM data model consists mainly of 3 elements:
    • Nodes - Points
    • Ways - LineStrings, Polygons
    • Relations - GeometryCollections, Polygons with holes, MultiPolygons
    As well as the tag-based metadata that applies to each element, and changesets grouping edits
  32. OSM Data Model: Relations

  33. OSM Data Model: Changesets
    • Edits are grouped into changesets, which have their own metadata such as user comments (for developers, think commit messages)
    • Adding hashtags to user comments allows downstream processing to group changes - for example, #HOTLunch
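As a rough illustration of grouping changes by hashtag, the sketch below pulls hashtags out of changeset comments and groups changeset IDs under each one. The regular expression, the case-folding of tags, and the sample comments are assumptions for illustration; they are not taken from OSMesa.

```python
import re
from collections import defaultdict

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def hashtags(comment):
    """Return the distinct hashtags in a comment, lower-cased and sorted."""
    return sorted({tag.lower() for tag in HASHTAG.findall(comment or "")})


def group_by_hashtag(changesets):
    """Map each hashtag to the list of changeset IDs whose comment mentions it."""
    groups = defaultdict(list)
    for changeset_id, comment in changesets:
        for tag in hashtags(comment):
            groups[tag].append(changeset_id)
    return dict(groups)


# Hypothetical (changeset_id, comment) pairs.
changesets = [
    (1, "Added buildings #HOTLunch"),
    (2, "Fixed road names"),
    (3, "More buildings #hotlunch #mapathon"),
]
grouped = group_by_hashtag(changesets)
```

Lower-casing makes #HOTLunch and #hotlunch land in the same group; whether that is desirable depends on how campaigns are named.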
  36. Creating features from History
    • With OSMesa, we can create full historical geometries.
    • To do this, we needed to create a concept of “minor versions” of geometries
    • We converted timestamp to an update date that propagates up to the way or relation
    • We added a “valid_until” tag on elements that tells when an element is no longer valid (either replaced or deleted)
  37. Diagram: way and node versions over time
    • way v1 highway=unclassified: node v1, node v1, node v1, node v1
    • way v1 highway=unclassified: node v1, node v1, node v2, node v2
    • way v2 highway=primary: node v1, node v1, node v2, node v2
  38. Diagram: the same history with a minor version
    • way v1 highway=unclassified: node v1, node v1, node v1, node v1
    • way v1.1 highway=unclassified (minor version change): node v1, node v1, node v2, node v2
    • way v2 highway=primary: node v1, node v1, node v2, node v2
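The minor-version idea can be sketched as follows: whenever a node that belongs to a way is edited between two versions of that way, the way gets a new minor version, and each entry's valid_until is the update date of the entry that follows it. This is a simplified sketch under stated assumptions, not OSMesa's implementation: the function name, the tuple layouts, and the dates are hypothetical, and node deletions and relations are ignored.

```python
def minor_versions(way_versions, node_versions):
    """Derive (major, minor) versions of one way from its node history.

    way_versions: list of (version, timestamp, node_ids), sorted by version.
    node_versions: list of (node_id, version, timestamp).
    Timestamps are ISO date strings, so string comparison orders them.
    """
    events = []
    for i, (version, ts, node_ids) in enumerate(way_versions):
        next_ts = way_versions[i + 1][1] if i + 1 < len(way_versions) else None
        events.append((ts, version, 0))
        # Node edits made while this way version was current.
        change_times = sorted({
            nts for (nid, _nver, nts) in node_versions
            if nid in node_ids and nts > ts and (next_ts is None or nts < next_ts)
        })
        for minor, nts in enumerate(change_times, start=1):
            events.append((nts, version, minor))

    result = []
    for j, (ts, version, minor) in enumerate(events):
        # An entry stops being valid when the next entry takes effect.
        valid_until = events[j + 1][0] if j + 1 < len(events) else None
        result.append({"version": version, "minor_version": minor,
                       "updated": ts, "valid_until": valid_until})
    return result


# Hypothetical history: two of the way's nodes move between way v1 and way v2.
way_versions = [(1, "2018-01-01", [1, 2, 3, 4]), (2, "2018-03-01", [1, 2, 3, 4])]
node_versions = [(1, 1, "2018-01-01"), (2, 1, "2018-01-01"),
                 (3, 1, "2018-01-01"), (4, 1, "2018-01-01"),
                 (3, 2, "2018-02-01"), (4, 2, "2018-02-01")]
history = minor_versions(way_versions, node_versions)
```

With this input the way has three entries: v1.0, v1.1 (when the two nodes changed), and v2.0, with only the last one still valid.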
  39. Full historical geometries
    • With minor versions, we can bake new ORC files that contain geometries of every element in OSM history, with ways and relations reflecting every edit to the element itself as well as to the elements it contains
    • Then, we compute statistics per changeset based on geometries, and roll up the statistics per user and hashtag
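The roll-up step can be pictured as a plain aggregation: per-changeset statistics are summed once per user and once per hashtag. The sketch below assumes hypothetical statistic names (buildings_added, road_m_added) and hypothetical users; in a Spark job this would be a group-by over a much larger table.

```python
def add_into(totals, stats):
    """Add each statistic in stats to the running totals, in place."""
    for name, value in stats.items():
        totals[name] = totals.get(name, 0) + value


def roll_up(changeset_stats, changesets):
    """Sum per-changeset statistics by user and by hashtag.

    changeset_stats: {changeset_id: {stat_name: number}}
    changesets: {changeset_id: {"user": str, "hashtags": [str]}}
    """
    by_user, by_hashtag = {}, {}
    for changeset_id, stats in changeset_stats.items():
        meta = changesets[changeset_id]
        add_into(by_user.setdefault(meta["user"], {}), stats)
        for tag in meta["hashtags"]:
            add_into(by_hashtag.setdefault(tag, {}), stats)
    return by_user, by_hashtag


# Hypothetical inputs; the statistic names are illustrative only.
changeset_stats = {
    1: {"buildings_added": 12, "road_m_added": 0},
    2: {"buildings_added": 3, "road_m_added": 450},
    3: {"buildings_added": 0, "road_m_added": 1200},
}
changesets = {
    1: {"user": "alice", "hashtags": ["hotlunch"]},
    2: {"user": "bob", "hashtags": ["hotlunch", "mapathon"]},
    3: {"user": "alice", "hashtags": []},
}
by_user, by_hashtag = roll_up(changeset_stats, changesets)
```

A changeset with several hashtags counts toward each of them, so hashtag totals can overlap while user totals do not.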
  40. Processing OSM data at scale
    • Processing of full history into features in under 40 minutes (cluster of 255 m3.2xlarge nodes)
    • This is not a small cluster (≈$65/hour). YMMV with smaller clusters.
    • We are building update mechanisms to avoid refreshing the entire dataset
    • Produces 600GB of ORC
  41. Other Features
    • Streaming updates of vector tiles based on replication files (Spark Streaming)
    • Streaming aggregations
  42. Related projects:
    • ohsome
    • Telenav Parquet exports (?)
    • Atlas (?)
    • OSM Wayback
  43. Some data created by OSMesa...

  45. Viewing time slices of Rhode Island OSM

  46. Historical edits for several hashtag campaigns

  47. Global friction surface for cost distance calculations using elevation (SRTM) and OSM roads + water bodies
  48. Explorable Detroit (

  49. Detroit Contributor Heatmap (

  50. OSMesa: Other current uses
    • Building matching between OSM and other vector datasets
    • Generating vector tiles for URCHN containing a subset of historical data to front-end analytics
  51. The Future: Validation workflows, Reputation scores
    • Better validation workflows are a big question in the OSM community right now (according to SOTM US 2017)
    • HOT Tasking Manager does some; we can do better
    • One way to improve validation workflows is to suggest that validation be done by veteran mappers, and that edits by more junior mappers be suggested for validation (“reputation score”)
    • Development Seed, who contribute to and use OSMesa, have great ideas in this space.
  52. The Future: Machine Learning pre- and post-processing
    • Pre-processing geospatial imagery and OSM into training chips - a distributed label-maker
    • Managing data into and out of Raster Vision
    • Post-processing by cleaning the model output, matching to OSM or other vector data to remove duplicates, conflation workflows
    • Matching OSM to imagery dates: e.g. pre- and post-disaster.
  53. THANKS!
    • Rob Emanuele, Azavea - @lossyrob (Twitter, GitHub)
    • Seth Fitzsimmons, Pacific Atlas - @mojodna (Twitter, GitHub)
    One more cool visualization: