Simple Binary Formats

Using simple binary formats for your data to make your app faster and cooler


Tristan Hume

April 03, 2016


  1. Simple Binary Formats By Tristan Hume

  2. TerribleHack Projects
    • http://ratewith.science/ (with Dave Pagurek van Mossel)
    • http://dayder.thume.ca/ (with Marc Mailhot)
  3. TerribleHack Projects
    http://ratewith.science/
    • Custom binary graph format
    • Designed for efficient path finding directly on the format
    • Wikipedia link graph is only 700MB
    • Can be memory mapped and cast to an array of uint32
    http://dayder.thume.ca/
    • Custom binary time series format
    • Stores tons of time series efficiently in one file
    • Allows 6591 time series records to be transmitted to the JS client in one megabyte of data
  4. JSON vs. Binary
    {"name":"Deaths by cholera, unspecified","data":[
    {"t":915166800,"v":1},{"t":946702800,"v":1},{"t":978325200,"v":0},
    {"t":1009861200,"v":0},{"t":1041397200,"v":0},{"t":1072933200,"v":0},
    {"t":1104555600,"v":0},{"t":1136091600,"v":0},{"t":1167627600,"v":1},
    {"t":1199163600,"v":0},{"t":1230786000,"v":1},{"t":1262322000,"v":0},
    {"t":1293858000,"v":0},{"t":1325394000,"v":0},{"t":1357016400,"v":1}]}
  5. The JSON on the previous slide takes 2½ times the space of the binary format
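To make the size comparison concrete, here is a rough Python sketch that encodes the same series both ways. The binary layout used here (two little-endian uint32 counts, the UTF-8 name, then a uint32 time and a float32 value per point) is an assumption for illustration only; the slides do not state the field widths, and the helper names are hypothetical.

```python
import json
import struct

NAME = "Deaths by cholera, unspecified"
TIMES = [915166800, 946702800, 978325200, 1009861200, 1041397200,
         1072933200, 1104555600, 1136091600, 1167627600, 1199163600,
         1230786000, 1262322000, 1293858000, 1325394000, 1357016400]
VALUES = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1]


def encode_json(name, times, values):
    """Compact JSON (no extra whitespace), encoded as UTF-8 bytes."""
    doc = {"name": name,
           "data": [{"t": t, "v": v} for t, v in zip(times, values)]}
    return json.dumps(doc, separators=(",", ":")).encode("utf-8")


def encode_binary(name, times, values):
    """Assumed layout: point count, name length, name bytes, then (time, value) pairs."""
    name_bytes = name.encode("utf-8")
    out = struct.pack("<II", len(times), len(name_bytes)) + name_bytes
    for t, v in zip(times, values):
        out += struct.pack("<If", t, float(v))  # uint32 time, float32 value
    return out


json_bytes = encode_json(NAME, TIMES, VALUES)
binary_bytes = encode_binary(NAME, TIMES, VALUES)
print(len(json_bytes), len(binary_bytes),
      round(len(json_bytes) / len(binary_bytes), 2))
```

With these assumed field widths the printed ratio should land close to the 2½ figure on the slide; a different choice of widths would shift it.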
  6. Advantages of binary formats
    • Often take significantly less space
      ◦ Faster network transmission
      ◦ Less disk space for large data sets
    • Can sometimes directly map them into memory
      ◦ Directly read binary format and avoid deserialization time
    • Really fast to serialize/deserialize
    • Almost every language can read them without a third party library
      ◦ No libraries means no time linking things, less reading docs and fewer dependencies
    • You feel like a pro
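The memory-mapping point can be sketched in Python as follows. This is a minimal illustration, assuming the file is nothing but a flat array of uint32 values stored in the host's byte order; the file name and helper functions are made up for the example.

```python
import mmap
import os
import struct
import tempfile


def write_uint32_array(path, values):
    """Write values as consecutive 4-byte unsigned ints in native byte order."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"={len(values)}I", *values))


def read_uint32_at(path, index):
    """Map the file and index it as a uint32 array, with no parsing step."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # cast() reinterprets the mapped bytes in place; nothing is copied.
        with memoryview(mm) as raw, raw.cast("I") as words:
            return words[index]


path = os.path.join(tempfile.mkdtemp(), "example.bin")  # hypothetical file
write_uint32_array(path, [10, 20, 30, 40])
third = read_uint32_at(path, 2)
print(third)
```

Only the pages that are actually touched get read from disk, which is what makes this approach attractive for large files.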
  7. But, that sounds hard...

  8. Nope.

  9. btsf
    Record: Number of points | Length of name | Bytes of name in UTF-8 ... | Time 1 | Value 1 | Time 2 | Value 2 | ...
    File: Version | File header size | Record header size | Number of records | Record 1 | Record 2 | ...
  10. Writing btsf
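The code from this slide did not survive in the transcript. As a stand-in, here is a minimal Python sketch of a writer for the layout on the btsf slide. The field widths (little-endian uint32 for every header field and for times, float32 for values), the default version number, and the function name are all assumptions, not the author's actual implementation.

```python
import struct

# Assumed encodings; the slides name the fields but not their widths.
FILE_HEADER = struct.Struct("<IIII")  # version, file header size, record header size, record count
RECORD_HEADER = struct.Struct("<II")  # number of points, name length in bytes
POINT = struct.Struct("<If")          # time (uint32), value (float32)


def write_btsf(records, version=1):
    """Encode records, given as [(name, [(time, value), ...]), ...], to bytes."""
    chunks = [FILE_HEADER.pack(version, FILE_HEADER.size,
                               RECORD_HEADER.size, len(records))]
    for name, points in records:
        name_bytes = name.encode("utf-8")
        chunks.append(RECORD_HEADER.pack(len(points), len(name_bytes)))
        chunks.append(name_bytes)
        for time, value in points:
            chunks.append(POINT.pack(time, value))
    return b"".join(chunks)


data = write_btsf([("cholera", [(915166800, 1.0), (946702800, 1.0)])])
print(len(data))
```

Storing the header sizes in the file presumably lets a later version add header fields without breaking older readers, which can skip ahead by the stored size.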

  11. Reading btsf
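The code from this slide is also missing from the transcript. The sketch below shows what a reader could look like in Python, under assumed field widths: little-endian uint32 for header fields and times, float32 for values. The function name and the sample bytes are hypothetical.

```python
import struct


def read_btsf(data):
    """Decode bytes in the assumed btsf layout into (version, records)."""
    version, file_header_size, record_header_size, num_records = \
        struct.unpack_from("<IIII", data, 0)
    # Jump by the stored size rather than a constant, so extra header
    # fields (if a file has any) are skipped.
    offset = file_header_size
    records = []
    for _ in range(num_records):
        num_points, name_len = struct.unpack_from("<II", data, offset)
        offset += record_header_size
        name = data[offset:offset + name_len].decode("utf-8")
        offset += name_len
        points = [struct.unpack_from("<If", data, offset + 8 * i)
                  for i in range(num_points)]
        offset += 8 * num_points
        records.append((name, points))
    return version, records


# A tiny hand-built file: one record named "ab" with two points.
sample = (struct.pack("<IIII", 1, 16, 8, 1)
          + struct.pack("<II", 2, 2) + b"ab"
          + struct.pack("<IfIf", 10, 1.0, 20, 0.0))
version, records = read_btsf(sample)
print(version, records)
```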

  12. How to binary in some languages
    • Ruby: The pack method on Array and the unpack method on String.
    • Python: The pack and unpack functions in the struct module.
    • C/C++: Casting pointers into byte arrays to the right type, or a library.
    • Rust: Some good third party libraries on crates.io
      ◦ https://github.com/BurntSushi/byteorder
      ◦ https://github.com/TyOverby/bincode
    • JavaScript: ArrayBuffer, DataView, TextEncoder, TextDecoder
      ◦ Some of these are only in newer browsers but you can use polyfills for older ones
    (ordered by approximate ease of use, easiest to hardest)
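As a small illustration of the Python entry in this list, the sketch below round-trips a few values through the struct module. The particular fields and values are arbitrary examples.

```python
import struct

# Format string: "<" = little-endian with no padding,
# H = uint16, I = uint32, d = float64.
FORMAT = "<HId"

packed = struct.pack(FORMAT, 7, 915166800, 0.5)
size = struct.calcsize(FORMAT)  # bytes one packed record occupies
kind, timestamp, value = struct.unpack(FORMAT, packed)

# The same integer under the two byte orders.
big_endian = struct.pack(">I", 1)
little_endian = struct.pack("<I", 1)
print(size, kind, timestamp, value, big_endian, little_endian)
```

Spelling out the byte order in the format string, rather than relying on the native default, keeps files readable across machines.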
  13. Some pre-built options
    • Protocol Buffers & Thrift: https://developers.google.com/protocol-buffers/ https://en.wikipedia.org/wiki/Apache_Thrift
      ◦ Generates code for reading and writing binary data into language structures
      ◦ Available for tons of languages
    • Cap’n Proto: https://capnproto.org/
      ◦ Can be accessed and used directly from memory, no serialization/deserialization step
      ◦ Exceedingly awesome, but only available for a few languages
    • BSON & MessagePack: http://msgpack.org/ http://bsonspec.org/
      ◦ More compact JSON, but you still pay overhead for structure field names
      ◦ Available for TONS of
    • Avro: http://avro.apache.org/docs/1.3.0/
      ◦ A middle ground that, unlike MessagePack, doesn’t redundantly encode names, but includes the schema once so that no external schema is necessary to decode it.
      ◦ Better for dynamic languages, approaches Thrift size on larger files
  14. That’s all. Tristan Hume http://thume.ca/