Slide 1

Blosc / bcolz: extremely fast compression & storage
Francesc Alted, Freelance Consultant (Department of GeoSciences, University of Oslo)
DS4DS Workshop, Berkeley, USA, September 2015

Slide 2

I’ve spent the last few years fighting the memory-CPU speed mismatch. The gap is wide and still widening.

Slide 3

The same data take less RAM. Transmission + decompression faster than direct transfer?
[Diagram: the original dataset vs. the compressed dataset travelling over the disk or memory bus, through decompression, into the CPU cache]

Slide 4

Blosc: (de-)compressing faster than memory
Reads from Blosc chunks can be up to 5x faster than memcpy() (on synthetic data).
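As a rough illustration, here is a minimal sketch of such a measurement using the python-blosc bindings; the array contents, codec, and timing harness are illustrative assumptions, not the benchmark behind the slide:

    import time
    import numpy as np
    import blosc

    # Highly compressible synthetic data (assumption: 10M float64 values).
    a = np.arange(1e7)
    raw = a.tobytes()

    # Compress with the default shuffle filter; typesize matches float64.
    packed = blosc.compress(raw, typesize=8, cname='blosclz')

    t0 = time.time()
    blosc.decompress(packed)
    t_blosc = time.time() - t0

    t0 = time.time()
    bytearray(raw)  # plain in-memory copy, a stand-in for memcpy()
    t_copy = time.time() - t0

    print("decompress: %.4fs, copy: %.4fs" % (t_blosc, t_copy))

On synthetic data like this, the decompression path can indeed beat the plain copy, because far less data has to cross the memory bus.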

Slide 5

Compression matters!
“Blosc compressors are the fastest ones out there at this point; there is no better publicly available option that I'm aware of. That's not just ‘yet another compressor library’ case.”
— Ivan Smirnov (advocating for Blosc inclusion in h5py)

Slide 6

Blosc ecosystem: small, but with big impact (thanks mainly to PyTables/pandas)
[Diagram: Blosc at the center, surrounded by PyTables, pandas, bcolz, Castra, h5py, Bloscpack, scikit-allel, bquery, and the C / C++ world (e.g. OpenVDB)]

Slide 7

What is bcolz?
• Provides a storage layer that is both chunked and compressible
• Meant for both in-memory and persistent (on-disk) storage
• Main goal: to demonstrate that compression can accelerate data access (both on disk and in memory)
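For orientation, a minimal bcolz session might look like this (the array contents and the rootdir name are illustrative assumptions):

    import numpy as np
    import bcolz

    # An in-memory compressed carray.
    a = np.arange(1e7)
    ca = bcolz.carray(a)
    print(ca.nbytes, ca.cbytes)  # uncompressed vs. compressed size

    # The same data persisted to disk as a directory of compressed chunks.
    cd = bcolz.carray(a, rootdir='mydata.bcolz', mode='w')
    cd.flush()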

Slide 8

bcolz vs pandas (size): the MovieLens dataset
https://github.com/Blosc/movielens-bench
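A hedged sketch of how such a size comparison can be made; the synthetic column below is a stand-in for the real MovieLens data, for which see the repo above:

    import numpy as np
    import pandas as pd
    import bcolz

    # Synthetic stand-in for a MovieLens-like ratings column.
    df = pd.DataFrame({'rating': np.random.randint(1, 6, int(1e6))})
    ct = bcolz.ctable.fromdataframe(df)

    print(df.memory_usage(index=True).sum())  # pandas, uncompressed bytes
    print(ct.cbytes)                          # bcolz, compressed bytes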

Slide 9

Query times on a 5-year-old laptop (Intel Core2, 2 cores): compression still slows things down.

Slide 10

Query times on a 3-year-old laptop (Intel Ivy Bridge, 2 cores): compression speeds things up.
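The flavor of query being timed looks roughly like this; the column names and the on-disk path are assumptions for illustration, not necessarily those used in movielens-bench:

    import bcolz

    # pandas-style filtering would be: df[(df.rating >= 4.0) & (df.year > 2000)]
    # bcolz runs the same predicate as a numexpr-powered streaming query.
    ct = bcolz.open('movielens.bcolz')  # hypothetical on-disk ctable
    result = [row for row in ct.where("(rating >= 4.0) & (year > 2000)")]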

Slide 11

Streaming analytics with bcolz
bcolz is meant to be simple: note the modular approach!
[Diagram: a bcolz container (disk or memory) feeds the bcolz iterators/filters with blocking (iter(), iterblocks(), where(), whereblocks(), __getitem__()); on top of these, itertools, Dask, bquery, … build map(), filter(), groupby(), sortby(), reduceby(), join()]
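A small sketch of that layering, pairing a bcolz streaming query with itertools (the table layout and grouping key are assumptions for illustration):

    from itertools import groupby
    import numpy as np
    import bcolz

    # A tiny ctable; in practice it could live on disk via rootdir=...
    ct = bcolz.ctable(columns=[np.array([1, 1, 2, 2]),
                               np.array([10., 20., 30., 40.])],
                      names=['key', 'value'])

    # bcolz streams the filtered rows; itertools does the group-wise reduction.
    rows = sorted(ct.where('value > 15'), key=lambda r: r.key)
    for k, grp in groupby(rows, key=lambda r: r.key):
        print(k, sum(r.value for r in grp))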

Slide 12

“The future for me clearly involves lots of block-wise processing of multidimensional bcolz carrays.”
— Alistair Miles, Head of Epidemiological Informatics for the Kwiatkowski group; author of scikit-allel

Slide 13

Introducing Blosc2: the next generation of Blosc
[Diagram of the superchunk layout: Blosc2 header, then Chunk 1 (Blosc1), Chunk 2 (Blosc1), Chunk 3 (Blosc1), …, Chunk N (Blosc1)]

Slide 14

Blosc2
• Blosc1 only works with fixed-length, equal-sized chunks (blocks)
• This can lead to poor use of space when accommodating variable-length data (potentially large zero-paddings)
• Blosc2 addresses this shortcoming by using superchunks of variable-length chunks (see the sketch below)
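To make the superchunk idea concrete, here is a conceptual Python sketch built on the Blosc1 bindings. It illustrates the layout only; the SuperChunk class is hypothetical and this is not the Blosc2 API:

    import blosc

    class SuperChunk:
        """Conceptual superchunk: header metadata plus variable-length chunks."""

        def __init__(self, typesize=8):
            self.typesize = typesize  # header metadata shared by all chunks
            self.chunks = []          # compressed chunks, each its own length

        def append(self, buf):
            # Each appended buffer may have a different uncompressed length,
            # so no zero-padding is needed.
            self.chunks.append(blosc.compress(buf, typesize=self.typesize))

        def decompress_chunk(self, i):
            return blosc.decompress(self.chunks[i])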

Slide 15

ARM/NEON: a first-class citizen for Blosc2
• At 3 GB/s, Blosc2 on ARM achieves one of the best bandwidth/Watt ratios on the market
• Profound implications for the density of data storage devices (e.g. arrays of disks driven by ARM)
[Chart: compression bandwidth with and without NEON]

Slide 16

Other planned features for Blosc2
• Looking into inter-chunk redundancies (delta filter); a sketch of the idea follows
• Support for more codecs and filters
• Serialized version of the super-chunk (disk, network)
…
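A delta filter exploits inter-chunk redundancy by storing a chunk as its difference from a reference chunk. The sketch below is a conceptual NumPy illustration of why that helps, not Blosc2's actual filter code:

    import numpy as np
    import blosc

    ref = np.arange(1e6)   # reference chunk
    nxt = ref + 0.001      # a later, nearly identical chunk

    # The delta is almost constant, so it compresses far better than the raw chunk.
    delta = nxt - ref
    c_raw = blosc.compress(nxt.tobytes(), typesize=8)
    c_delta = blosc.compress(delta.tobytes(), typesize=8)
    print(len(c_raw), len(c_delta))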

Slide 17

Blosc2 has its own repo: https://github.com/Blosc/c-blosc2
It is meant to be used only once it has been heavily tested!