Day3-1515-Rethinking binary map data exchange formats

sotm2017
September 01, 2017


Transcript

  1. Rethinking Binary Map Data Exchange Formats
     State of the Map 2017, Aizu-Wakamatsu
     Andrew Byrd, Conveyal
  2. Subject: Data formats for bulk storage and transfer of OpenStreetMap data
     Not indexed databases for active editing
  3. Who am I?
     • Andrew Byrd
     • OpenStreetMap user and Github user abyrd
     • [email protected]
     • OpenTripPlanner (public transport + walking and biking)
       • major contributor
       • coordinate development (steering committee)
     • Conveyal
       • principal & co-founder
       • applying technology to urban transportation planning
       • long time participants in the open transit data movement (GTFS)
     • My team: accessibility for alternatives analysis in urban planning
     We use OpenStreetMap data every day: Not just for visual base maps but a routable network
  4. Main data exchange formats
     • OSM.XML
       • Original format circa 2004
       • Considered essentially human readable, arguably “self-documenting”
       • XML very prevalent in industry at the time for general purpose data exchange
       • Voluminous, contains many repeated XML tag and attribute names
       • In practice text is compressed (planet-latest.osm.gz or planet-latest.osm.bz2)
     • OSM.PBF
       • Based on Google’s Protocol Buffers
       • Appeared at the end of 2010
       • Major advance for speed and compactness
       • Reduced file sizes by 50% or more over compressed XML
       • Used since 2011 by OpenTripPlanner and Conveyal
  5. OpenStreetMap XML format
     <?xml version="1.0" encoding="UTF-8"?>
     <osm version="0.6" generator="CGImap 0.0.2">
       <bounds minlat="54.0889580" minlon="12.2487570" maxlat="54.0913900" maxlon="12.2524800"/>
       <node id="1831881213" version="1" changeset="12370172" lat="54.0900666" lon="12.2539381" user="lafkor" uid="75625" visible="true" timestamp="2012-07-20T09:43:19Z">
         <tag k="name" v="Neu Broderstorf"/>
         <tag k="traffic_sign" v="city_limit"/>
       </node>
       <node id="298884272" lat="54.0901447" lon="12.2516513" user="SvenHRO" uid="46882" visible="true" version="1" changeset="676636" timestamp="2008-09-21T21:37:45Z"/>
       <way id="26659127" user="Masch" uid="55988" visible="true" version="5" changeset="4142606" timestamp="2010-03-16T11:47:08Z">
         <nd ref="292403538"/>
         <nd ref="298884289"/>
         ...
         <nd ref="261728686"/>
         <tag k="highway" v="unclassified"/>
         <tag k="name" v="Pastower Straße"/>
       </way>
       <relation id="56688" user="kmvar" uid="56190" visible="true" version="28" changeset="6947637" timestamp="2011-01-12T14:23:49Z">
         <member type="node" ref="294942404" role=""/>
         ...
         <member type="node" ref="364933006" role=""/>
         <member type="way" ref="4579143" role=""/>
         ...
         <member type="node" ref="249673494" role=""/>
         <tag k="name" v="Küstenbus Linie 123"/>
         <tag k="network" v="VVW"/>
         <tag k="operator" v="Regionalverkehr Küste"/>
         <tag k="ref" v="123"/>
         <tag k="route" v="bus"/>
         <tag k="type" v="route"/>
       </relation>
       ...
     </osm>
  6. OpenStreetMap PBF format
     00000000 12 - S 2 'primitivegroup'
     00000000 __ ad d5 07 - length 125613 bytes
     00000000 __ __ __ __ 12 - PrimitiveGroup containing DenseNodes
     00000000 __ __ __ __ __ a9 d5 07 - length 125609 bytes
     00000000 __ __ __ __ __ __ __ __ 0a - DenseNodes
     00000000 __ __ __ __ __ __ __ __ __ df 42 - length 8543 bytes
     00000000 __ __ __ __ __ __ __ __ __ __ __ ce ad 0f 02 02
     00000010 02 02 04 02 02 02 02 02 02 02 02 02 02 02 02 02
     00000020 02 02 02 c6 8b ef 13 02 02 02 02 02 02 02 02 f0
     00000030 ea 01 02 02 02 02 02 02 02 02 02 02 02 02 02 02
  7. OpenStreetMap PBF spec
     message PrimitiveBlock {
       required StringTable stringtable = 1;
       repeated PrimitiveGroup primitivegroup = 2;
       // Granularity, units of nanodegrees,
       // used to store coordinates in this block
       optional int32 granularity = 17 [default=100];
       // Offset value between the output coordinates
       // and the granularity grid, in units of nanodegrees.
       optional int64 lat_offset = 19 [default=0];
       optional int64 lon_offset = 20 [default=0];
       // Granularity of dates,
       // normally represented in milliseconds since the epoch
       optional int32 date_granularity = 18 [default=1000];
       // Proposed extension:
       //optional BBox bbox = XX;
     }
     message PrimitiveGroup {
       repeated Node nodes = 1;
       optional DenseNodes dense = 2;
       repeated Way ways = 3;
       repeated Relation relations = 4;
       repeated ChangeSet changesets = 5;
     }
  8. OpenStreetMap PBF spec
     message Way {
       required int64 id = 1;
       // Parallel arrays.
       repeated uint32 keys = 2 [packed = true];
       repeated uint32 vals = 3 [packed = true];
       optional Info info = 4;
       repeated sint64 refs = 8 [packed = true]; // DELTA coded
     }
     message DenseNodes {
       repeated sint64 id = 1 [packed = true]; // DELTA coded
       //repeated Info info = 4;
       optional DenseInfo denseinfo = 5;
       repeated sint64 lat = 8 [packed = true]; // DELTA coded
       repeated sint64 lon = 9 [packed = true]; // DELTA coded
       // Special packing of keys and vals into one array.
       // May be empty if all nodes in this block are tagless.
       repeated int32 keys_vals = 10 [packed = true];
     }
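
A minimal sketch of how a reader might turn the delta-coded DenseNodes arrays back into node IDs and coordinates in degrees. It assumes the arrays have already been unpacked from Protobuf into plain Python lists, and it applies the scaling the PBF documentation gives for granularity and offsets (degrees = 1e-9 * (offset + granularity * value)). The function name and the sample values are illustrative, not taken from any existing library.

```python
def decode_dense_nodes(ids, lats, lons, granularity=100, lat_offset=0, lon_offset=0):
    """Undo delta coding and scale grid units to degrees.

    ids, lats and lons are the DELTA-coded arrays of one DenseNodes message:
    the first element is absolute, each later element is a difference from
    the previous value. Defaults mirror the PrimitiveBlock defaults.
    """
    nodes = []
    node_id = lat = lon = 0
    for d_id, d_lat, d_lon in zip(ids, lats, lons):
        # Running sums restore the absolute values.
        node_id += d_id
        lat += d_lat
        lon += d_lon
        nodes.append((
            node_id,
            1e-9 * (lat_offset + granularity * lat),  # nanodegrees -> degrees
            1e-9 * (lon_offset + granularity * lon),
        ))
    return nodes


# Hypothetical sample data: three nodes, deltas after the first element.
SAMPLE_NODES = decode_dense_nodes(
    ids=[102773, 2, 1],
    lats=[456132230, 1, 4],
    lons=[124432140, -4, 3],
)
```

With the default granularity of 100, one grid unit is 100 nanodegrees, so the first sample node lands near 45.613223, 12.443214.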
  9. Dense messages and parallel arrays
     PrimitiveBlock:
       1. StringTable:
          1. highway
          2. secondary
          3. high street
          4. tertiary
          5. second street
          6. surface
          7. name
          …
       2. PrimitiveGroup:
          2. DenseNodes:
             1. id: 102773, +2, +1, +1, +2, +2, +1, +4, …
             8. lat: 45613223, +1, +4, +6, -3, -2, +1, +2, …
             9. lon: 12443214, -4, +3, +2, +1, +1, +5, -2, …
             10. keys_vals: 1, 2, 7, 3, 0, 1, 4, 7, 5, 0, …
             …
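
The keys_vals packing is easier to follow in code. Below is a rough sketch that splits the flat array from the slide into one tag dictionary per node: indexes come in (key, value) pairs that point into the string table, and a 0 closes the current node. Index 0 of the string table is assumed to be a reserved empty entry, which is why the slide's numbering starts at 1. The function name is made up for illustration.

```python
def decode_keys_vals(keys_vals, strings):
    """Split a DenseNodes keys_vals array into one tag dict per node."""
    tags_per_node = []
    current = {}
    i = 0
    while i < len(keys_vals):
        key_index = keys_vals[i]
        if key_index == 0:
            # A lone 0 is the delimiter between consecutive nodes.
            tags_per_node.append(current)
            current = {}
            i += 1
        else:
            # Otherwise the entry is a key index followed by a value index.
            val_index = keys_vals[i + 1]
            current[strings[key_index]] = strings[val_index]
            i += 2
    return tags_per_node


# String table and keys_vals as shown on the slide (index 0 reserved).
STRINGS = ["", "highway", "secondary", "high street",
           "tertiary", "second street", "surface", "name"]
KEYS_VALS = [1, 2, 7, 3, 0, 1, 4, 7, 5, 0]

TAGS = decode_keys_vals(KEYS_VALS, STRINGS)
```

For the slide's data this gives two nodes: one tagged highway=secondary, name=high street, and one tagged highway=tertiary, name=second street. A tagless node shows up as a bare 0, which decodes to an empty dictionary.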
  10. “Natural” protocol buffer messages
      NodeBlock
        Node 102773
          lat: 45.123302
          lon: 16.224122
          Tags:
            highway: secondary
            name: “main street”
        Node 102775
          lat: 45.123503
          lon: 16.224220
          Tags:
            highway: secondary
            name: “second avenue”
      maps better to function calls, objects, and structs
  11. Conveyal experience of PBF internals
      • OpenTripPlanner
        • Build street network from OpenStreetMap data
        • Re-used existing PBF loader code
      • Vanilla Extract
        • Fast extract of large geographic areas for routing
        • https://github.com/conveyal/vanilla-extract
        • Required us to implement PBF reading and writing
        • Provided opportunity to experiment with the format
        • Implemented twice in Java and C
  12. Observations on PBF
      • Multi-level structure
        • Optionally compressed Blobs -> PrimitiveGroups -> lists of OSM entities
        • Breaking format into blocks is good, but layering strikes me as unnecessarily complex
        • Flat stream is much simpler to work with
      • Circumvents one-to-one mapping from Protobuf messages to OSM entities
        • Must deter Protobuf from adding repetitive integer field tags to each stored field of an entity
        • Separate “dense nodes” data type (two ways to express nodes)
        • Heavy use of parallel arrays
        • Sign that Protobuf is not a natural fit for OSM where compactness is desired
      • Our observations confirmed by others who have implemented PBF read/write
  13. Observations on PBF
      • String tables
        • Blocks can be (and usually are) gzipped
        • General purpose compression (deflate) builds very efficient lookup table for all data (strings or otherwise)
        • The additional code and computation are redundant
      • Two separate Protobuf specifications
        • File block structure
        • OSM entity data
      • Encoding and decoding requires implementer to intersperse hand-written auxiliary code
        • Arguably defeats the purpose of using declarative, schema-driven code generator
      • Unused or unimplemented features and options (granularity and offsets)
  14. Observations on O5M
      • Flat binary structure
      • Also uses string tables
        • Fixed size table, least-recently-used eviction strategy
        • Complex to implement, can backfire on certain inputs
      • Rejects general purpose compression
        • zlib compression quite resource efficient, can operate on streams, available practically everywhere
        • Use of general purpose compression justified by test results
      • Size can be larger
        • gzipped O5M (redundant string table construction) larger than PBF
        • Oregon compressed O5M output 12% larger than bzipped XML
  15. Advantageous aspects of PBF
      • Delta coding
        • Way IDs and node IDs are sequential if sorted
        • Nodes in ways are often sequential
        • Nodes with similar IDs are often close together
        • 10, 12, 13, 14, 16, 18 -> 2, 1, 1, 2, 2
        • Reduces size and number of elements, increases repetition
        • Related to “filtering” in PNG and residual coding in FLAC
      • Variable-width zigzag integers
        • 0, -1, 1, -2, 2, -3, 3 -> 0, 1, 2, 3, 4, 5, 6
        • -1 is the largest number in standard coding (18,446,744,073,709,551,615)
        • Decimal variable width example: 24 -> 42X rather than 00000024
        • Minor impact but does help (reduces memory buffer requirements)
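
A small sketch of the three pieces named on this slide, written from the Protocol Buffers encoding rules rather than from any OSM library: delta coding, zigzag mapping of signed integers (0, -1, 1, -2, 2 map to 0, 1, 2, 3, 4), and base-128 variable-width integers. All function names are illustrative, and the zigzag helper assumes values fit in a signed 64-bit range.

```python
def delta_encode(values):
    """Replace each value with its difference from the previous one."""
    out = []
    prev = 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Running sum, the inverse of delta_encode."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def zigzag(n):
    """Map a signed 64-bit integer to unsigned so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)


def unzigzag(z):
    """Inverse of zigzag."""
    return (z >> 1) ^ -(z & 1)


def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint, low 7 bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


# The slide's example: the first element stays absolute, the rest are deltas.
DELTAS = delta_encode([10, 12, 13, 14, 16, 18])

# Without zigzag, -1 stored as an unsigned 64-bit value needs ten bytes.
MINUS_ONE_PLAIN = encode_varint(-1 & 0xFFFFFFFFFFFFFFFF)
MINUS_ONE_ZIGZAG = encode_varint(zigzag(-1))
```

The same varint routine reproduces the length prefix in the hex dump shown earlier in the deck: 125613 encodes to the bytes ad d5 07.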
  16. • With these observations we can achieve PBF-like compression ratios and processing speeds with much simpler file formats
      • Delta coding alone with general purpose compression (zlib) yields roughly equal size to PBF
      • Both OSM exchange formats in widespread use are complex to parse and produce
      • This is likely impeding the development of OSM infrastructure components
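
To make the "delta coding plus zlib" claim concrete, here is a toy comparison on synthetic, nearly sequential node IDs. It does not reproduce the measurements from the talk: the ID sequence is invented and far more regular than real data, so the gap it shows is exaggerated. It only illustrates the direction of the effect. All names are hypothetical.

```python
import struct
import zlib


def encode_varint(n):
    """Base-128 varint for a non-negative integer."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def zigzag(n):
    """Signed 64-bit to unsigned, so small negative deltas stay short."""
    return (n << 1) ^ (n >> 63)


def fixed_width(ids):
    """Baseline: every ID as a little-endian signed 64-bit integer."""
    return struct.pack(f"<{len(ids)}q", *ids)


def delta_varint(ids):
    """Delta-code the IDs, then write each delta as a zigzag varint."""
    out = bytearray()
    prev = 0
    for i in ids:
        out += encode_varint(zigzag(i - prev))
        prev = i
    return bytes(out)


def make_ids(count=10000, start=102773):
    """Synthetic sorted IDs with small repeating gaps (an assumption)."""
    steps = (2, 1, 1, 2, 2, 1, 4)
    ids = []
    current = start
    for k in range(count):
        ids.append(current)
        current += steps[k % len(steps)]
    return ids


IDS = make_ids()
RAW = fixed_width(IDS)
DELTA = delta_varint(IDS)
RAW_ZLIB = zlib.compress(RAW)
DELTA_ZLIB = zlib.compress(DELTA)

print(len(RAW), len(RAW_ZLIB), len(DELTA), len(DELTA_ZLIB))
```

The delta-coded stream is shorter before compression and also gives zlib far more repetition to work with, which is the combination the slide credits for PBF-like sizes.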
  17. Proposed format characteristics
      • Each block contains entities of a single type (nodes, ways, or relations)
      • Tiny uncompressed block header allows skipping over blocks without decompression
      • Entity contents are positional
      • All entities share common ID/tags segment
      • Grammar for this format is LL(1): can be consumed top-down using nothing more than a few clearly named functions that call one another
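
The block-skipping idea can be sketched in a few lines. The header layout below (one type byte followed by a 4-byte big-endian compressed length) is purely an assumption made for illustration; it is not the layout of the proposed format or of any existing file format. The point is only that an uncompressed length in the header lets a reader seek past blocks it does not need.

```python
import io
import struct
import zlib

# Hypothetical header: 1-byte entity type, 4-byte big-endian payload length.
HEADER = struct.Struct(">cI")


def write_block(stream, entity_type, payload):
    """Compress one single-type block and prefix it with a plain header."""
    compressed = zlib.compress(payload)
    stream.write(HEADER.pack(entity_type, len(compressed)))
    stream.write(compressed)


def read_blocks(stream, wanted_type):
    """Yield decompressed payloads of one entity type, seeking past the rest."""
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            return  # end of stream
        entity_type, length = HEADER.unpack(header)
        if entity_type == wanted_type:
            yield zlib.decompress(stream.read(length))
        else:
            # Skip the block without decompressing it.
            stream.seek(length, io.SEEK_CUR)


# Usage: three blocks, then read back only the way block.
BUFFER = io.BytesIO()
write_block(BUFFER, b"N", b"nodes-1")
write_block(BUFFER, b"W", b"ways-1")
write_block(BUFFER, b"N", b"nodes-2")
BUFFER.seek(0)
WAY_BLOCKS = list(read_blocks(BUFFER, b"W"))
```

Each reader function here consumes exactly one construct and decides what to do from the next few bytes alone, which is the spirit of the top-down, LL(1) style the slide describes.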
  18. OSM data sizes relative to xml.bz2 (metadata omitted)
      [Bar chart. Regions: Switzerland, China, Croatia, Oregon. Formats: o5m, txt.gz, o5m.gz, pbf, vex. Axis: 0 % to 100 %.]
  19. Size reduction relative to xml.bz2 with metadata omitted
      [Bar chart. Series: Switzerland, China, Croatia, Oregon, Average. Formats: o5m, txt.gz, o5m.gz, pbf, vex. Value labels: 41 %, 42 %, 37 %, 18 %, 4 %. Axis: 0 % to 50 %.]
  20. Characteristics most important in binary OSM exchange
      • Developers
        • Ease of implementation and maintenance
        • Convenient development API
        • Seeking/skipping within files
        • Streamable
        • “Elegance”
      • Users
        • Layers (buildings, roads, POIs)
        • Speed
        • Compactness
  21. Social and technical barriers to adoption of new formats
      • Appreciation for substantial existing contributions
      • Stability
      • Concentrate effort on less code
      • Approachability of community (fewer things to explain)
      • Leave good enough alone
  22. Next steps
      • Review and finalize format specification
      • Feature parity: author and version metadata
      • Finalize reference implementation: https://github.com/conveyal/osm-lib
      • Integrate with existing tools (Osmosis)
      • Provide reusable libraries in Java and C (SWIG?)