Slide 1

Simple Binary Formats By Tristan Hume

Slide 2

TerribleHack Projects
http://ratewith.science/ (with Dave Pagurek van Mossel)
http://dayder.thume.ca/ (with Marc Mailhot)

Slide 3

TerribleHack Projects

http://ratewith.science/
● Custom binary graph format
● Designed for efficient path finding directly on the format
● Wikipedia link graph is only 700 MB
● Can be memory mapped and cast to an array of uint32

http://dayder.thume.ca/
● Custom binary time series format
● Stores tons of time series efficiently in one file
● Allows 6591 time series records to be transmitted to the JS client in one megabyte of data

Slide 4

JSON vs. Binary

{"name":"Deaths by cholera, unspecified","data":[
{"t":915166800,"v":1},{"t":946702800,"v":1},{"t":978325200,"v":0},
{"t":1009861200,"v":0},{"t":1041397200,"v":0},{"t":1072933200,"v":0},
{"t":1104555600,"v":0},{"t":1136091600,"v":0},{"t":1167627600,"v":1},
{"t":1199163600,"v":0},{"t":1230786000,"v":1},{"t":1262322000,"v":0},
{"t":1293858000,"v":0},{"t":1325394000,"v":0},{"t":1357016400,"v":1}
]}

Slide 5

The JSON on the previous slide takes 2½ times the space of the binary format
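A rough way to check a comparison like this is to encode the same record both ways and compare byte counts. The sketch below uses the data from the JSON slide. The binary record layout (uint32 point count, uint32 name length, UTF-8 name, then a uint32 time and float32 value per point, all little-endian) is an assumption for illustration, not necessarily the exact layout dayder uses.

```python
import json
import struct

name = "Deaths by cholera, unspecified"
points = [
    (915166800, 1), (946702800, 1), (978325200, 0), (1009861200, 0),
    (1041397200, 0), (1072933200, 0), (1104555600, 0), (1136091600, 0),
    (1167627600, 1), (1199163600, 0), (1230786000, 1), (1262322000, 0),
    (1293858000, 0), (1325394000, 0), (1357016400, 1),
]

# Compact JSON in the same shape as the slide (no extra whitespace).
json_bytes = json.dumps(
    {"name": name, "data": [{"t": t, "v": v} for t, v in points]},
    separators=(",", ":"),
).encode("utf-8")

# Assumed binary record: point count, name length, name bytes,
# then one (uint32 time, float32 value) pair per point.
name_bytes = name.encode("utf-8")
binary = struct.pack("<II", len(points), len(name_bytes)) + name_bytes
for t, v in points:
    binary += struct.pack("<If", t, v)

print(len(json_bytes), len(binary), round(len(json_bytes) / len(binary), 2))
```

The gap grows with the number of points, since every JSON point repeats its field names and spells out each timestamp in decimal digits.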

Slide 6

Advantages of binary formats
● Often take significantly less space
  ○ Faster network transmission
  ○ Less disk space for large data sets
● Can sometimes be mapped directly into memory
  ○ Read the binary format in place and avoid deserialization time
● Really fast to serialize/deserialize
● Almost every language can read them without a third party library
  ○ No libraries means no time spent linking things, less reading of docs and fewer dependencies
● You feel like a pro
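As a minimal sketch of the memory mapping point, the following maps a file of unsigned 32-bit integers and reads it as an array without parsing or copying it. The file name and contents are made up for illustration. The values are written in native byte order so that the cast is valid on whatever machine runs this; a real format would pin one byte order and rely on readers matching it.

```python
import mmap
import os
import struct
import tempfile

values = [3, 1, 4, 1, 5, 9, 2, 6]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.bin")  # hypothetical file name

    # Write the values as raw 4-byte unsigned integers, native byte order.
    with open(path, "wb") as f:
        f.write(struct.pack(f"={len(values)}I", *values))

    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Reinterpret the mapped bytes as uint32 values in place.
            with memoryview(mapped) as raw, raw.cast("I") as view:
                third = view[2]      # random access, no deserialization step
                total = sum(view)    # iterate directly over the mapped data

print(third, total)
```

The operating system pages the file in on demand, so only the parts that are actually touched need to be read from disk.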

Slide 7

But, that sounds hard...

Slide 8

Nope.

Slide 9

btsf

File layout: Version | File header size | Record header size | Number of records | Record 1 | Record 2 | ...

Record layout: Number of points | Length of name | Bytes of name in UTF-8 ... | Time 1 | Value 1 | Time 2 | Value 2 | ...

Slide 10

Writing btsf
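A minimal sketch of a writer for a btsf-like file, using Python's struct module. The field order follows the layout slide, but the field widths, the little-endian byte order, the four-byte `btsf` magic at the start, and the name `write_btsf` are all assumptions made for this example, not the actual dayder implementation.

```python
import io
import struct

# Assumed encodings (little-endian, fixed-width fields):
FILE_HEADER = struct.Struct("<4sIIII")  # magic, version, file header size,
                                        # record header size, number of records
RECORD_HEADER = struct.Struct("<II")    # number of points, length of name
POINT = struct.Struct("<If")            # time (uint32), value (float32)


def write_btsf(out, records):
    """Write records, a list of (name, [(time, value), ...]), to a binary stream."""
    out.write(FILE_HEADER.pack(
        b"btsf", 1, FILE_HEADER.size, RECORD_HEADER.size, len(records)))
    for name, points in records:
        name_bytes = name.encode("utf-8")
        out.write(RECORD_HEADER.pack(len(points), len(name_bytes)))
        out.write(name_bytes)
        for time, value in points:
            out.write(POINT.pack(time, value))


buf = io.BytesIO()
write_btsf(buf, [
    ("Deaths by cholera, unspecified", [(915166800, 1.0), (946702800, 1.0)]),
])
data = buf.getvalue()
print(len(data), data[:4])
```

Passing an open file in `"wb"` mode instead of the `BytesIO` buffer writes the same bytes to disk.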

Slide 11

Reading btsf
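A matching sketch of a reader. As with any illustration of this format, the field widths, byte order, `btsf` magic and the name `read_btsf` are assumptions. The header size fields are used to advance through the data, on the assumption that they exist so that a reader can skip header fields it does not know about.

```python
import struct

# Assumed encodings (little-endian, fixed-width fields):
FILE_HEADER = struct.Struct("<4sIIII")  # magic, version, file header size,
                                        # record header size, number of records
RECORD_HEADER = struct.Struct("<II")    # number of points, length of name
POINT = struct.Struct("<If")            # time (uint32), value (float32)


def read_btsf(data):
    """Parse bytes into a list of (name, [(time, value), ...])."""
    magic, version, file_header_size, record_header_size, count = \
        FILE_HEADER.unpack_from(data, 0)
    if magic != b"btsf":
        raise ValueError("not a btsf file")
    offset = file_header_size  # jump past the whole file header
    records = []
    for _ in range(count):
        num_points, name_len = RECORD_HEADER.unpack_from(data, offset)
        offset += record_header_size
        name = data[offset:offset + name_len].decode("utf-8")
        offset += name_len
        points = [POINT.unpack_from(data, offset + i * POINT.size)
                  for i in range(num_points)]
        offset += num_points * POINT.size
        records.append((name, points))
    return records


# Build a tiny sample file in memory so the reader has something to parse.
name = "example".encode("utf-8")
sample = (
    FILE_HEADER.pack(b"btsf", 1, FILE_HEADER.size, RECORD_HEADER.size, 1)
    + RECORD_HEADER.pack(2, len(name)) + name
    + POINT.pack(915166800, 1.0) + POINT.pack(946702800, 0.0)
)
records = read_btsf(sample)
print(records)
```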

Slide 12

How to binary in some languages
(ordered by approximate ease of use, easiest to hardest)
● Ruby: The pack method on Array and the unpack method on String.
● Python: The pack and unpack functions in the struct module.
● C/C++: Casting pointers into byte arrays to the right type, or a library.
● Rust: Some good third party libraries on crates.io
  ○ https://github.com/BurntSushi/byteorder
  ○ https://github.com/TyOverby/bincode
● JavaScript: ArrayBuffer, DataView, TextEncoder, TextDecoder
  ○ Some of these are only in newer browsers but you can use polyfills for older ones

Slide 13

Some pre-built options
● Protocol Buffers & Thrift: https://developers.google.com/protocol-buffers/ https://en.wikipedia.org/wiki/Apache_Thrift
  ○ Generates code for reading and writing binary data into language structures
  ○ Available for tons of languages
● Cap’n Proto: https://capnproto.org/
  ○ Can be accessed and used directly from memory, no serialization/deserialization step
  ○ Exceedingly awesome, but only available for a few languages
● BSON & MessagePack: http://msgpack.org/ http://bsonspec.org/
  ○ More compact than JSON, but you still pay overhead for structure field names
  ○ Available for TONS of languages
● Avro: http://avro.apache.org/docs/1.3.0/
  ○ A middle ground that doesn’t redundantly encode names like MessagePack does, but includes the schema once so that no external schema is necessary to decode it.
  ○ Better for dynamic languages, approaches Thrift size on larger files

Slide 14

That’s all. Tristan Hume http://thume.ca/