Simple Binary Formats

Using simple binary formats for your data to make your app faster and cooler

Tristan Hume

April 03, 2016

Transcript

  1. Simple Binary Formats By Tristan Hume

  2. TerribleHack Projects
    • http://ratewith.science/ with Dave Pagurek van Mossel
    • http://dayder.thume.ca/ with Marc Mailhot
  3. TerribleHack Projects
    http://ratewith.science/
    • Custom binary graph format
    • Designed for efficient path finding directly on the format
    • Wikipedia link graph is only 700MB
    • Can be memory mapped and cast to an array of uint32
    http://dayder.thume.ca/
    • Custom binary time series format
    • Stores tons of time series efficiently in one file
    • Allows 6591 time series records to be transmitted to the JS client in one megabyte of data
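To make the "memory mapped and cast to an array of uint32" idea concrete, here is a minimal Python sketch. It does not reproduce the actual ratewith.science graph layout, which the slides don't describe; the demo file, its contents, and the helper name `read_words` are all hypothetical. It also assumes the platform's native `unsigned int` is 32 bits wide, which is typical but not guaranteed.

```python
import mmap
import os
import tempfile
from array import array

def read_words(path, start, count):
    """Memory-map `path` and return `count` unsigned ints starting at index `start`."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # cast("I") reinterprets the mapped bytes as native unsigned ints
            # in place, so nothing is parsed or copied up front.
            with memoryview(mm) as raw, raw.cast("I") as words:
                return [words[i] for i in range(start, start + count)]

# Hypothetical demo data: eight native-endian unsigned ints written to a temp file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as f:
    array("I", [10, 20, 30, 40, 50, 60, 70, 80]).tofile(f)

print(read_words(demo_path, 2, 3))  # [30, 40, 50]
```

Only the pages that are actually touched get read from disk, which is what makes this approach attractive for a file of several hundred megabytes.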
  4. JSON vs. Binary
    {"name":"Deaths by cholera, unspecified","data":[
    {"t":915166800,"v":1},{"t":946702800,"v":1},{"t":978325200,"v":0},
    {"t":1009861200,"v":0},{"t":1041397200,"v":0},{"t":1072933200,"v":0},
    {"t":1104555600,"v":0},{"t":1136091600,"v":0},{"t":1167627600,"v":1},
    {"t":1199163600,"v":0},{"t":1230786000,"v":1},{"t":1262322000,"v":0},
    {"t":1293858000,"v":0},{"t":1325394000,"v":0},{"t":1357016400,"v":1}]}
  5. The JSON on the previous slide takes 2½ times the space of the binary format
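The size difference can be checked with a rough Python sketch. The binary layout used here is an assumption, not necessarily what the talk's format does: a little-endian uint32 point count, a uint32 name length, the UTF-8 name, then each point as a uint32 time and a float32 value.

```python
import json
import struct

name = "Deaths by cholera, unspecified"
times = [915166800, 946702800, 978325200, 1009861200, 1041397200,
         1072933200, 1104555600, 1136091600, 1167627600, 1199163600,
         1230786000, 1262322000, 1293858000, 1325394000, 1357016400]
values = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1]

# Compact JSON, with no whitespace between tokens.
record = {"name": name, "data": [{"t": t, "v": v} for t, v in zip(times, values)]}
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Assumed binary layout: counts, then the name, then (time, value) pairs.
name_bytes = name.encode("utf-8")
binary_bytes = struct.pack("<II", len(times), len(name_bytes)) + name_bytes
for t, v in zip(times, values):
    binary_bytes += struct.pack("<If", t, float(v))

print(len(json_bytes), len(binary_bytes))
print(round(len(json_bytes) / len(binary_bytes), 2))
```

Under these assumptions the ratio comes out close to the 2½ times quoted on the slide. A different choice of field widths would shift it.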
  6. Advantages of binary formats
    • Often take significantly less space
      ◦ Faster network transmission
      ◦ Less disk space for large data sets
    • Can sometimes be mapped directly into memory
      ◦ Read the binary format directly and avoid deserialization time
    • Really fast to serialize/deserialize
    • Almost every language can read them without a third-party library
      ◦ No libraries means no time spent linking things, less reading docs and fewer dependencies
    • You feel like a pro
  7. But, that sounds hard...

  8. Nope.

  9. btsf
    File: Version | File header size | Record header size | Number of records | Record 1 | Record 2 | ...
    Record: Number of points | Length of name | Bytes of name in UTF-8 | ... | Time 1 | Value 1 | Time 2 | Value 2 | ...
  10. Writing btsf

  11. Reading btsf
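The code from the writing and reading slides is not part of this transcript, so the following is only a Python sketch of what a writer and reader for the layout on slide 9 might look like. The field widths are assumptions: every integer field is taken to be a little-endian uint32, times are uint32, values are float32, and the unexplained "..." in the record layout is ignored. The function names `write_btsf` and `read_btsf` are hypothetical.

```python
import struct

# Assumed field widths; the slides only give the field order.
FILE_HEADER = struct.Struct("<IIII")  # version, file header size, record header size, record count
RECORD_HEADER = struct.Struct("<II")  # number of points, name length in bytes
POINT = struct.Struct("<If")          # time, value

def write_btsf(records):
    """Serialize a list of (name, [(time, value), ...]) tuples to bytes."""
    out = bytearray(FILE_HEADER.pack(1, FILE_HEADER.size, RECORD_HEADER.size, len(records)))
    for name, points in records:
        name_bytes = name.encode("utf-8")
        out += RECORD_HEADER.pack(len(points), len(name_bytes))
        out += name_bytes
        for t, v in points:
            out += POINT.pack(t, v)
    return bytes(out)

def read_btsf(buf):
    """Parse bytes produced by write_btsf back into the same list structure."""
    _version, file_header_size, record_header_size, n_records = FILE_HEADER.unpack_from(buf, 0)
    # Advancing by the stored header sizes lets a reader skip header
    # fields it does not know about.
    offset = file_header_size
    records = []
    for _ in range(n_records):
        n_points, name_len = RECORD_HEADER.unpack_from(buf, offset)
        offset += record_header_size
        name = buf[offset:offset + name_len].decode("utf-8")
        offset += name_len
        points = [POINT.unpack_from(buf, offset + i * POINT.size) for i in range(n_points)]
        offset += n_points * POINT.size
        records.append((name, points))
    return records

sample = [
    ("Deaths by cholera, unspecified", [(915166800, 1.0), (946702800, 1.0), (978325200, 0.0)]),
    ("Température", [(915166800, 2.5)]),
]
assert read_btsf(write_btsf(sample)) == sample
```

Storing the header sizes in the file is presumably what allows new header fields to be added later without breaking old readers, though the slides don't say so explicitly.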

  12. How to binary in some languages (ordered by approximate ease of use, easiest to hardest)
    • Ruby: The pack method on Array and the unpack method on String.
    • Python: The pack and unpack methods in the struct module.
    • C/C++: Casting pointers into byte arrays to the right type, or a library.
    • Rust: Some good third-party libraries on crates.io
      ◦ https://github.com/BurntSushi/byteorder
      ◦ https://github.com/TyOverby/bincode
    • JavaScript: ArrayBuffer, DataView, TextEncoder, TextDecoder
      ◦ Some of these are only in newer browsers but you can use polyfills for older ones
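As a small illustration of the Python entry in the list above, this sketch packs and unpacks a few fixed-width fields with the `struct` module and then handles a length-prefixed UTF-8 string. The field choices and the helper names `pack_string` and `unpack_string` are made up for the example.

```python
import struct

# "<" means little-endian with no padding; "I" = uint32, "H" = uint16, "d" = float64.
packed = struct.pack("<IHd", 915166800, 7, 0.25)
print(len(packed))                    # 14
print(struct.unpack("<IHd", packed))  # (915166800, 7, 0.25)

def pack_string(s):
    """Encode a string as a uint32 byte length followed by its UTF-8 bytes."""
    data = s.encode("utf-8")
    return struct.pack("<I", len(data)) + data

def unpack_string(buf, offset=0):
    """Decode a length-prefixed string; return it with the offset just past it."""
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    return buf[start:start + length].decode("utf-8"), start + length

blob = pack_string("héllo")
print(unpack_string(blob))            # ('héllo', 10)
```

Spelling out the byte order with `<` keeps the format identical across machines. Without it, `struct` uses native alignment and may insert padding bytes.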
  13. Some pre-built options
    • Protocol Buffers & Thrift: https://developers.google.com/protocol-buffers/ https://en.wikipedia.org/wiki/Apache_Thrift
      ◦ Generates code for reading and writing binary data into language structures
      ◦ Available for tons of languages
    • Cap’n Proto: https://capnproto.org/
      ◦ Can be accessed and used directly from memory, no serialization/deserialization step
      ◦ Exceedingly awesome, but only available for a few languages
    • BSON & MessagePack: http://msgpack.org/ http://bsonspec.org/
      ◦ More compact than JSON, but you still pay overhead for structure field names
      ◦ Available for TONS of languages
    • Avro: http://avro.apache.org/docs/1.3.0/
      ◦ A middle ground: unlike MessagePack it doesn’t redundantly encode field names, but it includes the schema once so that no external schema is necessary to decode it.
      ◦ Better for dynamic languages, approaches Thrift size on larger files
  14. That’s all. Tristan Hume http://thume.ca/