Working with OpenStreetMap using Apache Spark and GeoTrellis - SotMUS 2018

What if you wanted to query all of OSM for all time (for every change ever)?

What if you wanted to do it at scale?

What if you wanted to do it quickly?

We needed this for different reasons.

We approached it from different angles.

Premise: with some work, relational algebra (SQL-like) can be used to process OSM data

INSANITY

We worked together to start OSMesa.

We need a new name! Think of suggestions and at the end of the talk we will vote!

OSMesa is a set of Spark functions to churn through enormous quantities of weird data and produce useful results quickly.

Doesn’t that already exist?

Kind of...

Not really.

Not off the shelf.

It takes forever. (not horizontally scalable)

History is the straw that breaks everything.

OSMesa is solving our problems.

...and it might be useful for you, too.

Goal: Facilitate rapid concept iteration

Goal: Produce useful derivatives that can be inputs into other systems

Goal: Enable inconceivable things

Philosophy: Compatibility with off-the-shelf tools

Philosophy: Make data accessible from standard tools
• OSM PBF - supported by OSM-aware tools
• ORC - supported by the Hadoop ecosystem
• OSM data model - (partially) supported by OSM-aware tools
• OGC data model - supported by everything else

OSMesa workflow (diagram labels: AWS EMR Cluster, AWS S3, ORC, Statistics, Vector Tiles, ORC files)

Side-effect: Exchange of ideas

OSM ←→ Spark / GeoTrellis / JTS

Differentiators:
• Spark-based
• Defaults to history
• Handles relations
• Uses changesets for enrichment

Things that are hard:
• History
• Relations

OSM Data Model
The OSM data model consists mainly of 3 elements:
• Nodes - Points
• Ways - LineStrings, Polygons
• Relations - GeometryCollections, Polygons with holes, MultiPolygons
There is also the tag-based metadata that applies to each element, and changesets that group edits.

OSM Data Model: Relations

OSM Data Model: Changesets
• Edits are grouped into changesets, which have their own metadata such as user comments (for developers, think commit messages)
• Adding hashtags to user comments allows downstream processing to group changes - for example, #HOTLunch
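A minimal sketch of the hashtag grouping idea, in plain Python rather than Spark. The sample changesets, the `#missingmaps` tag, the field names, and the choice to lowercase hashtags are all assumptions made for illustration; this is not OSMesa's actual code.

```python
import re
from collections import defaultdict

# Assumption: a hashtag is "#" followed by letters, digits, or underscores.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(comment):
    """Return lowercased hashtags from a changeset comment, in order, deduplicated."""
    seen = []
    for tag in HASHTAG_RE.findall(comment or ""):
        tag = tag.lower()
        if tag not in seen:
            seen.append(tag)
    return seen


def group_by_hashtag(changesets):
    """Map each hashtag to the ids of the changesets whose comment mentions it."""
    groups = defaultdict(list)
    for cs in changesets:
        for tag in extract_hashtags(cs.get("comment")):
            groups[tag].append(cs["id"])
    return dict(groups)


# Made-up changesets for illustration only.
changesets = [
    {"id": 1, "comment": "Added buildings #HOTLunch"},
    {"id": 2, "comment": "Fixed road names"},
    {"id": 3, "comment": "More buildings #hotlunch #missingmaps"},
]
groups = group_by_hashtag(changesets)
```

Changesets without a hashtag (id 2 here) simply fall out of the grouping.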

Creating features from History
• With OSMesa, we can create full historical geometries.
• To do this, we needed to create a concept of “minor versions” of geometries
• We converted timestamp to an update date that propagates up to the way or relation
• We added a “valid_until” tag on elements that tells when an element is no longer valid (either replaced or deleted)
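One way to picture “valid_until”, sketched in plain Python: each version of an element stays valid until the next version of the same element appears, whether that next version replaces it or deletes it. The dictionary layout, the `visible` flag, and the ISO date strings are assumptions for illustration, not OSMesa's schema.

```python
def add_valid_until(versions):
    """Attach valid_until to each version of a single element.

    valid_until is the timestamp of the following version (a replacement or
    a deletion), or None for the most recent version.
    """
    ordered = sorted(versions, key=lambda v: v["version"])
    out = []
    for current, following in zip(ordered, ordered[1:] + [None]):
        row = dict(current)  # copy so the input is left untouched
        row["valid_until"] = following["timestamp"] if following else None
        out.append(row)
    return out


# Made-up history of one node: created, moved, then deleted.
history = [
    {"version": 1, "timestamp": "2015-01-01", "visible": True},
    {"version": 2, "timestamp": "2016-06-01", "visible": True},
    {"version": 3, "timestamp": "2017-03-01", "visible": False},
]
with_validity = add_valid_until(history)
```

With these intervals, a time slice at date `t` can select the versions where `timestamp <= t` and `valid_until` is either `None` or later than `t`.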

Diagram: way and node versions
• way v1, highway=unclassified: node v1, node v1, node v1, node v1
• way v1, highway=unclassified: node v1, node v1, node v2, node v2
• way v2, highway=primary: node v1, node v1, node v2, node v2

Diagram: way and node versions, with a minor version change
• way v1, highway=unclassified: node v1, node v1, node v1, node v1
• way v1.1, highway=unclassified: node v1, node v1, node v2, node v2 (minor version change)
• way v2, highway=primary: node v1, node v1, node v2, node v2
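The v1 → v1.1 → v2 sequence can be sketched in plain Python as follows. The assumption here is that a minor version is created whenever a member node changes while a way version is current, with the node's timestamp propagated up as the way's update date. Field names and sample dates are invented for illustration; the real OSMesa logic runs on Spark and may differ.

```python
def minor_versions(way_versions, node_versions):
    """Expand the versions of one way into (version, minor_version) rows.

    A new minor version starts each time a member node changes strictly
    after the way version begins and before the next way version begins.
    ISO date strings are used so that plain string comparison orders them.
    """
    ways = sorted(way_versions, key=lambda w: w["version"])
    result = []
    for way, nxt in zip(ways, ways[1:] + [None]):
        start = way["timestamp"]
        end = nxt["timestamp"] if nxt else None
        members = set(way["nodes"])
        # Distinct times at which any member node changed during this version.
        change_times = sorted({
            n["timestamp"]
            for n in node_versions
            if n["id"] in members
            and n["timestamp"] > start
            and (end is None or n["timestamp"] < end)
        })
        for minor, updated in enumerate([start] + change_times):
            result.append({
                "version": way["version"],
                "minor_version": minor,
                "updated": updated,
                "tags": way["tags"],
            })
    return result


# Made-up data mirroring the diagram: two nodes move between way v1 and way v2.
ways = [
    {"version": 1, "timestamp": "2015-01-01", "nodes": [1, 2, 3, 4],
     "tags": {"highway": "unclassified"}},
    {"version": 2, "timestamp": "2017-01-01", "nodes": [1, 2, 3, 4],
     "tags": {"highway": "primary"}},
]
nodes = [
    {"id": i, "version": 1, "timestamp": "2014-12-31"} for i in (1, 2, 3, 4)
] + [
    {"id": i, "version": 2, "timestamp": "2016-01-01"} for i in (3, 4)
]
rows = minor_versions(ways, nodes)
labels = [(r["version"], r["minor_version"]) for r in rows]
```

Here `labels` corresponds to v1 (written as 1.0), v1.1, and v2; the two node edits share a timestamp, so they collapse into a single minor version.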

Full historical geometries
• With minor versions, we can bake new ORC files that contain geometries of every element in OSM history, with ways/relations representing every edit to the element as well as elements that they contain
• Then, we compute statistics per changeset based on geometries, and roll up the statistics per user and hashtag
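The roll-up step can be sketched in plain Python: per-changeset statistics are summed once per user and once per hashtag. The statistic names (`buildings_added`, `road_m_added`), user names, and values are hypothetical; in OSMesa this would be a Spark aggregation, not an in-memory loop.

```python
from collections import Counter, defaultdict


def roll_up(changeset_stats):
    """Sum per-changeset statistics by user and by hashtag."""
    per_user = defaultdict(Counter)
    per_hashtag = defaultdict(Counter)
    for cs in changeset_stats:
        per_user[cs["user"]].update(cs["stats"])  # adds values key by key
        for tag in cs["hashtags"]:
            per_hashtag[tag].update(cs["stats"])
    # Convert to plain dicts for easier inspection.
    return (
        {user: dict(stats) for user, stats in per_user.items()},
        {tag: dict(stats) for tag, stats in per_hashtag.items()},
    )


# Made-up per-changeset statistics.
changeset_stats = [
    {"user": "alice", "hashtags": ["hotlunch"],
     "stats": {"buildings_added": 10, "road_m_added": 500}},
    {"user": "bob", "hashtags": [],
     "stats": {"buildings_added": 3}},
    {"user": "alice", "hashtags": ["hotlunch", "missingmaps"],
     "stats": {"buildings_added": 5, "road_m_added": 200}},
]
by_user, by_hashtag = roll_up(changeset_stats)
```

A changeset with several hashtags contributes to each of them, so hashtag totals are not expected to sum to the user totals.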

Processing OSM data at scale
• Processing of full history into features in under 40 minutes (cluster of 255 m3.2xlarge nodes)
• This is not a small cluster (≈$65/hour). YMMV with smaller clusters.
• We are building update mechanisms to avoid refreshing the entire dataset
• Produces 600 GB of ORC

Other Features
• Streaming updates of vector tiles based on replication files (Spark Streaming)
• Streaming aggregations

Related projects:
• ohsome
• Telenav Parquet exports (?)
• Atlas (?)
• OSM Wayback

Some data created by OSMesa...

Viewing time slices of Rhode Island OSM

Historical edits for several hashtag campaigns

Global friction surface for cost distance calculations using elevation (SRTM) and OSM roads + water bodies

Explorable Detroit (https://tinyurl.com/sotmus-explorable-detroit)

Detroit Contributor Heatmap (https://tinyurl.com/sotmus-detroit-heatmap)

OSMesa: Other current uses
• Building matching between OSM and other vector datasets
• Generating vector tiles for URCHN containing a subset of historical data to front-end analytics

The Future: Validation workflows, Reputation scores
• Better validation workflows are a big question in the OSM community right now (according to SOTM US 2017)
• HOT Tasking Manager does some; we can do better
• One way to improve validation workflows is to suggest that validation be done by veteran mappers and that the work of more junior mappers be suggested for validation (“reputation score”)
• Development Seed, who contribute to and use OSMesa, have great ideas in this space.

The Future: Machine Learning pre- and post-processing
• Pre-processing geospatial imagery and OSM into training chips - a distributed label-maker
• Managing data into and out of Raster Vision
• Post-processing by cleaning the model output, matching to OSM or other vector data to remove duplicates, conflation workflows
• Matching OSM to imagery dates: e.g. pre- and post-disaster.

THANKS!
Rob Emanuele, Azavea
@lossyrob (Twitter, GitHub)
www.azavea.com
Seth Fitzsimmons, Pacific Atlas
@mojodna (Twitter, GitHub)
www.pacatlas.com
github.com/azavea/osmesa
One more cool visualization: https://vimeo.com/269953189
