Simple Binary Formats

Simple Binary Formats By Tristan Hume

TerribleHack Projects

http://ratewith.science/ (with Dave Pagurek van Mossel)
• Custom binary graph format
• Designed for efficient path finding directly on the format
• Wikipedia link graph is only 700MB
• Can be memory mapped and cast to an array of uint32

http://dayder.thume.ca/ (with Marc Mailhot)
• Custom binary time series format
• Stores tons of time series efficiently in one file
• Allows 6591 time series records to be transmitted to the JS client in one megabyte of data
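The "memory mapped and cast to an array of uint32" idea can be sketched in Python as follows. This is only an illustration: the helper name `read_words` is hypothetical, the file is assumed to hold native-byte-order 32-bit words with a length that is a multiple of 4, and nothing here reflects the actual layout of the graph file.

```python
import mmap


def read_words(path, start, count):
    """Return `count` uint32 words starting at word index `start`.

    Assumes (hypothetically) a file of native-endian 32-bit unsigned
    integers whose size is a multiple of 4 bytes.
    """
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # cast("I") reinterprets the mapped bytes as unsigned ints in place;
        # only the slice that is actually requested gets copied out.
        with memoryview(mm).cast("I") as words:
            return words[start:start + count].tolist()
```

Because the operating system pages the file in on demand, looking up a few words does not require reading the whole file first.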

JSON vs. Binary

{"name":"Deaths by cholera, unspecified","data":[
{"t":915166800,"v":1},{"t":946702800,"v":1},{"t":978325200,"v":0},
{"t":1009861200,"v":0},{"t":1041397200,"v":0},{"t":1072933200,"v":0},
{"t":1104555600,"v":0},{"t":1136091600,"v":0},{"t":1167627600,"v":1},
{"t":1199163600,"v":0},{"t":1230786000,"v":1},{"t":1262322000,"v":0},
{"t":1293858000,"v":0},{"t":1325394000,"v":0},{"t":1357016400,"v":1}]}

The JSON on the previous slide takes 2½ times the space of the binary format.
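A rough way to check this kind of comparison is to encode the same record both ways and count bytes. The sketch below assumes a hypothetical record layout (little-endian uint32 point count and name length, the UTF-8 name, then a uint32 time and float32 value per point); the talk does not state the real field widths, and the per-file header is left out.

```python
import json
import struct

name = "Deaths by cholera, unspecified"
points = [
    (915166800, 1), (946702800, 1), (978325200, 0), (1009861200, 0),
    (1041397200, 0), (1072933200, 0), (1104555600, 0), (1136091600, 0),
    (1167627600, 1), (1199163600, 0), (1230786000, 1), (1262322000, 0),
    (1293858000, 0), (1325394000, 0), (1357016400, 1),
]

# Compact JSON, with no whitespace between tokens.
json_bytes = json.dumps(
    {"name": name, "data": [{"t": t, "v": v} for t, v in points]},
    separators=(",", ":"),
).encode("utf-8")

# Assumed binary record: counts, name bytes, then fixed-size (time, value) pairs.
name_bytes = name.encode("utf-8")
binary_bytes = (
    struct.pack("<II", len(points), len(name_bytes))
    + name_bytes
    + b"".join(struct.pack("<If", t, float(v)) for t, v in points)
)

ratio = len(json_bytes) / len(binary_bytes)
print(len(json_bytes), len(binary_bytes), round(ratio, 2))
```

Under these assumed widths the ratio lands close to the figure quoted on the slide, though the exact number depends on the field sizes chosen.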

Advantages of binary formats
• Often take significantly less space
◦ Faster network transmission
◦ Less disk space for large data sets
• Can sometimes be mapped directly into memory
◦ Read the binary format in place and avoid deserialization time
• Really fast to serialize/deserialize
• Almost every language can read them without a third party library
◦ No libraries means no time spent linking things, less reading of docs and fewer dependencies
• You feel like a pro

But, that sounds hard...

btsf

File:
Version | File header size | Record header size | Number of records | Record 1 | Record 2 | ...

Each record:
Number of points | Length of name | Bytes of name in UTF-8 ... | Time 1 | Value 1 | Time 2 | Value 2 | ...

Writing btsf
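A minimal sketch of a writer for a btsf-like layout, in Python. The field order follows the layout slide, but the widths are assumptions made for illustration: every header field is a little-endian uint32, each time is a uint32 and each value a float32. The function name `encode_btsf` and the default version number are hypothetical.

```python
import struct

# Assumed layouts (not confirmed by the talk).
FILE_HEADER = struct.Struct("<IIII")  # version, file header size, record header size, record count
RECORD_HEADER = struct.Struct("<II")  # number of points, name length in bytes
POINT = struct.Struct("<If")          # time, value


def encode_btsf(series, version=1):
    """Encode a list of (name, [(time, value), ...]) pairs into bytes."""
    out = bytearray()
    out += FILE_HEADER.pack(version, FILE_HEADER.size, RECORD_HEADER.size, len(series))
    for name, points in series:
        name_bytes = name.encode("utf-8")
        out += RECORD_HEADER.pack(len(points), len(name_bytes))
        out += name_bytes
        for t, v in points:
            out += POINT.pack(t, v)
    return bytes(out)
```

The whole writer is a header, a loop, and a handful of `pack` calls, with no library beyond the standard `struct` module.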

Reading btsf
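A matching reader can be sketched the same way. As before, the field widths (little-endian uint32 header fields, uint32 times, float32 values) and the name `decode_btsf` are assumptions for illustration, not the talk's actual code. The reader advances by the header sizes stored in the file rather than hard-coded ones, which is one plausible reason for storing those sizes at all.

```python
import struct


def decode_btsf(data):
    """Decode bytes into (version, [(name, [(time, value), ...]), ...])."""
    version, file_header_size, record_header_size, n_records = \
        struct.unpack_from("<IIII", data, 0)
    offset = file_header_size  # skip any header fields this reader does not know
    series = []
    for _ in range(n_records):
        n_points, name_len = struct.unpack_from("<II", data, offset)
        offset += record_header_size
        name = bytes(data[offset:offset + name_len]).decode("utf-8")
        offset += name_len
        # Each point is assumed to be 8 bytes: uint32 time + float32 value.
        points = [struct.unpack_from("<If", data, offset + 8 * i)
                  for i in range(n_points)]
        offset += 8 * n_points
        series.append((name, points))
    return version, series
```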

How to binary in some languages
(ordered by approximate ease of use, easiest to hardest)
• Ruby: The pack method on Array and the unpack method on String.
• Python: The pack and unpack functions in the struct module.
• C/C++: Casting pointers to byte arrays to the right type, or a library.
• Rust: Some good third party libraries on crates.io
◦ https://github.com/BurntSushi/byteorder
◦ https://github.com/TyOverby/bincode
• JavaScript: ArrayBuffer, DataView, TextEncoder, TextDecoder
◦ Some of these are only in newer browsers but you can use polyfills for older ones
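For the Python entry, a small illustration of the `struct` module. The field choice here (a uint32, a uint16 and a float32) is arbitrary and only meant to show how format strings control width and byte order.

```python
import struct

# "<" selects little-endian with no padding: I = uint32, H = uint16, f = float32.
packed = struct.pack("<IHf", 1357016400, 7, 1.5)
timestamp, count, value = struct.unpack("<IHf", packed)

# The same integer in both byte orders.
little = struct.pack("<I", 1)  # least significant byte first
big = struct.pack(">I", 1)     # most significant byte first
```

Spelling out the byte order in the format string, instead of relying on the native default, keeps files portable between machines.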

Some pre-built options
• Protocol Buffers & Thrift: https://developers.google.com/protocol-buffers/ https://en.wikipedia.org/wiki/Apache_Thrift
◦ Generate code for reading and writing binary data into language structures
◦ Available for tons of languages
• Cap’n Proto: https://capnproto.org/
◦ Can be accessed and used directly from memory, no serialization/deserialization step
◦ Exceedingly awesome, but only available for a few languages
• BSON & MessagePack: http://msgpack.org/ http://bsonspec.org/
◦ More compact than JSON, but you still pay overhead for structure field names
◦ Available for TONS of languages
• Avro: http://avro.apache.org/docs/1.3.0/
◦ A middle ground: it doesn’t redundantly encode names the way MessagePack does, but it includes the schema once so that no external schema is necessary to decode it.
◦ Better for dynamic languages, approaches Thrift size on larger files

That’s all. Tristan Hume http://thume.ca/
