Working with OpenStreetMap using Apache Spark and GeoTrellis - SotMUS 2018

Seth Fitzsimmons

October 07, 2018

Transcript

  1. What if you wanted to query all of OSM for all
    time (for every change ever)?

  2. What if you wanted to do it at scale?

  3. What if you wanted to do it quickly?

  4. We needed this for different reasons.

  5. We approached it from different angles.

  6. Premise: with some work, relational algebra
    (SQL-like) can be used to process OSM data

  7. We worked together to start OSMesa.

  8. We need a new name!
    Think of suggestions, and at the end of the
    talk we will vote!

  9. OSMesa is a set of Spark functions to churn
    through enormous quantities of weird data
    and produce useful results quickly.

  10. Doesn’t that already exist?

  11. Not off the shelf.

  12. It takes forever.
    (not horizontally scalable)

  13. History is the straw that breaks everything.

  14. OSMesa is solving our problems.

  15. ...and it might be useful for you, too.

  16. Goal: Facilitate rapid concept iteration

  17. Goal: Produce useful derivatives that can be
    inputs into other systems

  18. Goal: Enable inconceivable things

  19. Philosophy: Compatibility with off-the-shelf
    tools

  20. Philosophy: Make data accessible from
    standard tools
    OSM PBF - supported by OSM-aware tools
    ORC - supported by the Hadoop ecosystem
    OSM data model - (partially) supported by
    OSM-aware tools
    OGC data model - supported by everything
    else

  21. OSMesa workflow
    AWS EMR Cluster
    AWS S3
    ORC
    Statistics
    Vector Tiles
    ORC files

  22. Side-effect: Exchange of ideas

  23. OSM ←→ Spark / GeoTrellis / JTS

  24. Differentiators:
    ● Spark-based
    ● Defaults to history
    ● Handles relations
    ● Uses changesets for enrichment

  25. Things that are hard:
    ● History
    ● Relations

  26. OSM Data Model
    The OSM data model consists mainly of 3 elements:
    ● Nodes - Points
    ● Ways - LineStrings, Polygons
    ● Relations - GeometryCollections, Polygon with holes,
    MultiPolygons
    As well as the tag-based metadata that applies to each
    element, and changesets grouping edits
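
As a rough illustration of the three element types above, here is a minimal Python sketch. The class and field names (`Element`, `nds`, `Member`, `is_closed`) are assumptions made for this example, not OSMesa's schema; a closed ring only hints at a polygon, since the tags decide how a way is interpreted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Element:
    """Fields shared by every OSM element (names are illustrative)."""
    id: int
    version: int
    changeset: int
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Node(Element):
    lat: float = 0.0
    lon: float = 0.0


@dataclass
class Way(Element):
    # Ordered references to node ids
    nds: List[int] = field(default_factory=list)


@dataclass
class Member:
    type: str  # "node", "way" or "relation"
    ref: int
    role: str = ""


@dataclass
class Relation(Element):
    members: List[Member] = field(default_factory=list)


def is_closed(way: Way) -> bool:
    """A way is closed when it has at least 4 node refs and ends where it starts."""
    return len(way.nds) >= 4 and way.nds[0] == way.nds[-1]


building = Way(id=1, version=1, changeset=10,
               tags={"building": "yes"}, nds=[1, 2, 3, 1])
road = Way(id=2, version=1, changeset=10,
           tags={"highway": "primary"}, nds=[4, 5, 6])
print(is_closed(building), is_closed(road))
```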

  27. OSM Data Model: Relations

  28. OSM Data Model: Changesets
    ● Edits are grouped into changesets, which have their own
    metadata such as user comments (for developers, think
    commit messages)
    ● Adding hashtags to user comments allows downstream
    processing to group changes - for example, #HOTLunch

  29. Creating features from History
    ● With OSMesa, we can create full historical geometries.
    ● To do this, we needed to create a concept of “minor
    versions” of geometries
    ● We converted each element's timestamp to an update date
    that propagates up to the containing way or relation
    ● We added a “valid_until” tag on elements that tells when an
    element is no longer valid (either replaced or deleted)
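
The minor-version idea can be sketched in plain Python. This is not OSMesa's Spark implementation: the names `WayVersion`, `MinorVersion` and `minor_versions` are hypothetical, timestamps are plain integers, and deletions are ignored. Each way version is split wherever one of its nodes is edited before the next way version appears, and every piece gets a `valid_until`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WayVersion:
    version: int
    timestamp: int  # any sortable value works; integers keep the example short
    node_ids: List[int]


@dataclass
class MinorVersion:
    version: int
    minor_version: int
    updated: int
    valid_until: Optional[int]  # None means "still current"


def minor_versions(way_versions: List[WayVersion],
                   node_timestamps: Dict[int, List[int]]) -> List[MinorVersion]:
    ways = sorted(way_versions, key=lambda w: w.version)
    out = []
    for i, way in enumerate(ways):
        # This way version stops being valid when the next one is created.
        window_end = ways[i + 1].timestamp if i + 1 < len(ways) else None
        # Node edits made strictly inside this way version's lifetime.
        changes = sorted({
            ts
            for nid in way.node_ids
            for ts in node_timestamps.get(nid, [])
            if ts > way.timestamp and (window_end is None or ts < window_end)
        })
        starts = [way.timestamp] + changes
        ends = changes + [window_end]
        for minor, (start, end) in enumerate(zip(starts, ends)):
            out.append(MinorVersion(way.version, minor, start, end))
    return out


# Way v1 at t=100, two of its nodes move at t=150, way v2 at t=200.
ways = [WayVersion(1, 100, [1, 2, 3, 4]), WayVersion(2, 200, [1, 2, 3, 4])]
nodes = {1: [90], 2: [90], 3: [90, 150], 4: [90, 150]}
for mv in minor_versions(ways, nodes):
    print(f"v{mv.version}.{mv.minor_version}", mv.updated, mv.valid_until)
# v1.0 100 150
# v1.1 150 200
# v2.0 200 None
```

Node edits that share a timestamp with the way edit are folded into the major version here; that choice is an assumption of the sketch.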

  30. way v1
    highway=unclassified
    node v1
    node v1
    node v1
    node v1
    node v1
    node v1
    node v2
    node v2
    way v2
    highway=primary
    node v1
    node v1
    node v2
    node v2
    way v1
    highway=unclassified

  31. way v1
    highway=unclassified
    node v1
    node v1
    node v1
    node v1
    way v1.1
    highway=unclassified
    node v1
    node v1
    node v2
    node v2
    way v2
    highway=primary
    node v1
    node v1
    node v2
    node v2
    minor version change

  32. Full historical geometries
    ● With minor versions, we can bake new ORC files that
    contain geometries of every element in OSM history, with
    ways/relations reflecting every edit to the element itself as
    well as to the elements it contains
    ● Then, we compute statistics per changeset based on
    geometries, and roll up the statistics per user and hashtag
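
The roll-up step can be sketched as follows, in plain Python rather than Spark. The record layout and the statistic names (`road_km_added`, `buildings_added`) are assumptions made for this example; the slides do not list the statistics actually computed.

```python
from collections import defaultdict


def roll_up(changeset_stats, key):
    """Sum per-changeset statistics by a grouping key ("user" or "hashtags")."""
    totals = defaultdict(lambda: defaultdict(float))
    for cs in changeset_stats:
        value = cs[key]
        # A changeset has one user but may carry several hashtags.
        groups = value if isinstance(value, (list, tuple, set)) else [value]
        for group in groups:
            for name, amount in cs["stats"].items():
                totals[group][name] += amount
    return {group: dict(stats) for group, stats in totals.items()}


changesets = [
    {"user": "alice", "hashtags": ["hotlunch"],
     "stats": {"road_km_added": 1.5, "buildings_added": 10}},
    {"user": "alice", "hashtags": ["hotlunch", "missingmaps"],
     "stats": {"road_km_added": 0.5}},
    {"user": "bob", "hashtags": [],
     "stats": {"buildings_added": 3}},
]

print(roll_up(changesets, "user"))
print(roll_up(changesets, "hashtags"))
```

A changeset with several hashtags counts toward each of them, so hashtag totals can add up to more than the user totals.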

  33. Processing OSM data at scale
    ● Processing of full history into features in under 40 minutes
    (cluster of 255 m3.2xlarge nodes)
    ● This is not a small cluster (≈$65/hour). YMMV with smaller
    clusters.
    ● We are building update mechanisms to avoid refreshing the
    entire dataset
    ● Produces 600GB of ORC
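
A back-of-the-envelope check on what one full run costs, using only the figures on this slide. It assumes the hourly rate is prorated by the minute and ignores cluster start-up time, storage and data transfer, so treat the result as a rough upper bound for the 40-minute job.

```python
def run_cost(hourly_rate_usd, runtime_minutes):
    """Cost of a run billed at a flat hourly rate, prorated by the minute."""
    return hourly_rate_usd * runtime_minutes / 60


cost = run_cost(65, 40)         # whole cluster, full 40 minutes
per_node_hour = 65 / 255        # implied rate for one of the 255 nodes
print(f"${cost:.2f} per run, ${per_node_hour:.3f} per node-hour")
```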

  34. Other Features
    ● Streaming updates of vector tiles based on
    replication files (Spark Streaming)
    ● Streaming aggregations

  35. Related projects:
    ● ohsome
    ● Telenav Parquet exports (?)
    ● Atlas (?)
    ● OSM Wayback

  36. Some data created by OSMesa...

  37. Viewing time slices of Rhode Island OSM

  38. Historical edits for several hashtag campaigns

  39. Global friction surface for cost distance calculations using elevation (SRTM) and OSM roads + water bodies

  40. Explorable Detroit (https://tinyurl.com/sotmus-explorable-detroit)

  41. Detroit Contributor Heatmap (https://tinyurl.com/sotmus-detroit-heatmap)

  42. OSMesa: Other current uses
    ● Building matching between OSM and other vector datasets
    ● Generating vector tiles for URCHN containing a subset of
    historical data to front-end analytics

  43. The Future: Validation workflows, Reputation
    scores
    ● Better validation workflows are a big question in the OSM
    community right now (according to SOTM US 2017)
    ● HOT Tasking Manager does some; we can do better
    ● One way to improve validation workflows is to suggest
    that validation be done by veteran mappers, and that the
    work of more junior mappers be suggested for validation
    (“reputation score”)
    ● Development Seed, who contribute to and use OSMesa,
    have great ideas in this space.

  44. The Future: Machine Learning pre- and post-
    processing
    ● Pre-processing geospatial imagery and OSM into training
    chips - a distributed label-maker
    ● Managing data into and out of Raster Vision
    ● Post-processing by cleaning the model output, matching to
    OSM or other vector data to remove duplicates, conflation
    workflows
    ● Matching OSM to imagery dates: e.g. pre- and post-
    disaster.

  45. THANKS!
    Rob Emanuele, Azavea
    @lossyrob (Twitter, GitHub)
    www.azavea.com
    Seth Fitzsimmons, Pacific Atlas
    @mojodna (Twitter, GitHub)
    www.pacatlas.com
    github.com/azavea/osmesa
    One more cool visualization: https://vimeo.com/269953189
