Simple Binary Formats

Using simple binary formats for your data to make your app faster and cooler

Tristan Hume

April 03, 2016

Transcript

  1. TerribleHack Projects
    http://ratewith.science/
    • Custom binary graph format
    • Designed for efficient path finding directly on the format
    • Wikipedia link graph only 700MB
    • Can be memory mapped and cast to an array of uint32
    http://dayder.thume.ca/
    • Custom binary time series format
    • Stores tons of time series efficiently in one file
    • Allows 6591 time series records to be transmitted to JS client with one megabyte of data
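To make the "memory mapped and cast to an array of uint32" point concrete, here is a minimal Python sketch. The file name and its contents are made up for illustration and are not the actual ratewith.science graph layout; it also assumes the file was written in the host's native byte order, as a C pointer cast would.

```python
import mmap
import os
import struct
import tempfile

def write_words(path, words):
    """Write integers as consecutive uint32 values in native byte order."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"={len(words)}I", *words))

def read_word(path, index):
    """Map the file and index it as a uint32 array without parsing it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # cast("I") reinterprets the bytes in place, like a C pointer cast.
            with memoryview(mapped) as raw, raw.cast("I") as words:
                return words[index]

# Hypothetical demo file: four uint32 values.
path = os.path.join(tempfile.mkdtemp(), "graph.bin")
write_words(path, [3, 10, 20, 30])
print(read_word(path, 2))
```

Nothing is deserialized here: the operating system pages in only the parts of the file that are actually indexed.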
  2. JSON vs. Binary
    {"name":"Deaths by cholera, unspecified","data":[
    {"t":915166800,"v":1},{"t":946702800,"v":1},{"t":978325200,"v":0},
    {"t":1009861200,"v":0},{"t":1041397200,"v":0},{"t":1072933200,"v":0},
    {"t":1104555600,"v":0},{"t":1136091600,"v":0},{"t":1167627600,"v":1},
    {"t":1199163600,"v":0},{"t":1230786000,"v":1},{"t":1262322000,"v":0},
    {"t":1293858000,"v":0},{"t":1325394000,"v":0},{"t":1357016400,"v":1}]}
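As a rough illustration of the size difference, the sketch below encodes the record from this slide both as compact JSON and as a flat binary blob, then compares byte counts. The binary layout used here (two uint32 counts, the UTF-8 name, then uint32 time / float32 value pairs, all little-endian) is an assumption chosen for the comparison, not necessarily what the talk's format uses.

```python
import json
import struct

name = "Deaths by cholera, unspecified"
times = [915166800, 946702800, 978325200, 1009861200, 1041397200,
         1072933200, 1104555600, 1136091600, 1167627600, 1199163600,
         1230786000, 1262322000, 1293858000, 1325394000, 1357016400]
values = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1]

# Compact JSON, with no whitespace between tokens.
as_json = json.dumps(
    {"name": name, "data": [{"t": t, "v": v} for t, v in zip(times, values)]},
    separators=(",", ":"),
).encode("utf-8")

# Assumed binary layout: point count, name length, name bytes, then points.
raw_name = name.encode("utf-8")
as_binary = struct.pack("<II", len(times), len(raw_name)) + raw_name
for t, v in zip(times, values):
    as_binary += struct.pack("<If", t, v)

print(len(as_json), len(as_binary))
```

Most of the JSON bytes go to repeated keys, punctuation and decimal digits, which is exactly the overhead a fixed-width binary layout avoids.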
  3. Advantages of binary formats
    • Often take significantly less space
      ◦ Faster network transmission
      ◦ Less disk space for large data sets
    • Can sometimes directly map them into memory
      ◦ Directly read binary format and avoid deserialization time
    • Really fast to serialize/deserialize
    • Almost every language can read them without a third party library
      ◦ No libraries means no time linking things, less reading docs and fewer dependencies
    • You feel like a pro
  4. btsf
    Record layout: Number of points | Length of name | Bytes of name in UTF-8 ... | Time 1 | Value 1 | Time 2 | Value 2 | ...
    File layout: Version | File header size | Record header size | Number of records | Record 1 | Record 2 | ...
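The layout on this slide can be sketched as a small writer and reader. The slide does not give field widths, so the ones below are assumptions: every header field is a little-endian uint32, times are uint32 and values are float32. The function names are hypothetical, and the reason given for the stored header sizes (letting a reader skip fields it does not know about) is a guess rather than something the slide states.

```python
import struct

# Assumed widths, not specified on the slide.
FILE_HEADER = struct.Struct("<4I")    # version, file header size, record header size, record count
RECORD_HEADER = struct.Struct("<2I")  # number of points, length of name in bytes
POINT = struct.Struct("<If")          # time, value

def write_series(records, version=1):
    """Serialize a list of (name, [(time, value), ...]) tuples to bytes."""
    out = bytearray(FILE_HEADER.pack(
        version, FILE_HEADER.size, RECORD_HEADER.size, len(records)))
    for name, points in records:
        raw_name = name.encode("utf-8")
        out += RECORD_HEADER.pack(len(points), len(raw_name))
        out += raw_name
        for t, v in points:
            out += POINT.pack(t, v)
    return bytes(out)

def read_series(buf):
    """Parse bytes produced by write_series back into (version, records)."""
    version, file_hdr_size, rec_hdr_size, count = FILE_HEADER.unpack_from(buf, 0)
    offset = file_hdr_size  # advance by the stored size, not our own struct size
    records = []
    for _ in range(count):
        n_points, name_len = RECORD_HEADER.unpack_from(buf, offset)
        offset += rec_hdr_size
        name = buf[offset:offset + name_len].decode("utf-8")
        offset += name_len
        points = [POINT.unpack_from(buf, offset + i * POINT.size)
                  for i in range(n_points)]
        offset += n_points * POINT.size
        records.append((name, points))
    return version, records

blob = write_series([("cholera", [(915166800, 1.0), (946702800, 0.5)])])
print(read_series(blob))
```

Because the name length is stored in bytes rather than characters, non-ASCII names round-trip without any special handling.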
  5. How to binary in some languages
    • Ruby: The pack method on Array and the unpack method on String.
    • Python: The pack and unpack functions in the struct module.
    • C/C++: Casting pointers into byte arrays to the right type, or a library.
    • Rust: Some good third party libraries on crates.io
      ◦ https://github.com/BurntSushi/byteorder
      ◦ https://github.com/TyOverby/bincode
    • JavaScript: ArrayBuffer, DataView, TextEncoder, TextDecoder
      ◦ Some of these are only in newer browsers but you can use polyfills for older ones
    (ordered by approximate ease of use, easiest to hardest)
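For the Python entry, a short sketch of the standard struct module may help. It focuses on the two things that tend to trip people up, byte order prefixes and fixed record sizes; the field choices are arbitrary examples.

```python
import struct

# "<" means little-endian with no padding; ">" means big-endian.
# "I" is a uint32, "H" a uint16, "f" a float32.
FORMAT = "<IHf"

packed = struct.pack(FORMAT, 1357016400, 7, 0.5)
size = struct.calcsize(FORMAT)          # 4 + 2 + 4 bytes
fields = struct.unpack(FORMAT, packed)  # a tuple, one entry per format code

# The same integer, in both byte orders.
little = struct.pack("<I", 1)
big = struct.pack(">I", 1)

# iter_unpack walks a buffer made of repeated fixed-size records.
buf = struct.pack("<If", 10, 1.0) + struct.pack("<If", 20, 0.0)
pairs = list(struct.iter_unpack("<If", buf))

print(size, fields, little, big, pairs)
```

Always writing an explicit "<" or ">" prefix keeps the file portable; without a prefix, struct uses the host's native byte order and alignment padding.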
  6. Some pre-built options
    • Protocol Buffers & Thrift: https://developers.google.com/protocol-buffers/ https://en.wikipedia.org/wiki/Apache_Thrift
      ◦ Generates code for reading and writing binary data into language structures
      ◦ Available for tons of languages
    • Cap’n Proto: https://capnproto.org/
      ◦ Can be accessed and used directly from memory, no serialization/deserialization step
      ◦ Exceedingly awesome, but only available for a few languages
    • BSON & MessagePack: http://msgpack.org/ http://bsonspec.org/
      ◦ More compact JSON, but you still pay overhead for structure field names
      ◦ Available for TONS of
    • Avro: http://avro.apache.org/docs/1.3.0/
      ◦ A middle ground that doesn’t redundantly encode names like MessagePack, but includes the schema once so that no external schema is necessary to decode it.
      ◦ Better for dynamic languages, approaches Thrift size on larger files