
Blosc & bcolz: efficient compressed data containers

Francesc Alted
September 22, 2015


Compressed data containers can be critical when you want to process datasets that exceed the capacity of your computer. Blosc and bcolz let you do that in a very efficient way.


Transcript

  1. Blosc / bcolz
    Extremely fast compression & storage
    Francesc Alted
    Freelance Consultant
    (Department of GeoSciences, University of Oslo)
    DS4DS Workshop held in Berkeley, USA, Sep 2015

  2. I’ve spent the last few years fighting the
    memory-CPU speed mismatch
    The gap is wide and still widening

  3. The same data take less RAM
    Transmission + decompression faster than direct
    transfer?
    [Diagram: the compressed dataset travels from disk or memory (RAM) across
    the disk/memory bus and is decompressed into the original dataset inside
    the CPU cache]

  4. Blosc: (de-)compressing faster than
    memory
    Reads from Blosc chunks up to 5x faster than memcpy()
    (on synthetic data)
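
A minimal sketch of the kind of compress/decompress round trip behind this claim, using the python-blosc bindings; the array size, codec and compression level here are illustrative assumptions, not the benchmark's actual settings.

```python
import numpy as np
import blosc

# Synthetic, highly compressible data (a linear ramp), in the spirit of the benchmark.
a = np.linspace(0, 100, 10000000)
raw = a.tobytes()

# typesize tells Blosc the element width so the shuffle filter can do its job.
packed = blosc.compress(raw, typesize=a.itemsize, clevel=5, cname='blosclz')
print("compression ratio:", len(raw) / len(packed))

# Decompress and rebuild a NumPy view on the restored buffer.
restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)
assert np.array_equal(a, restored)
```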

  5. Compression matters!
    “Blosc compressors are the fastest ones out there at
    this point; there is no better publicly available option
    that I'm aware of. That's not just "yet another
    compressor library" case.”
    — Ivan Smirnov
    (advocating for Blosc inclusion in h5py)

  6. Blosc ecosystem
    Small, but with big impact
    (thanks mainly to PyTables/pandas)
    [Diagram: projects built around Blosc: PyTables, pandas, bcolz, Castra,
    h5py, Bloscpack, scikit-allel, bquery, and the C/C++ world (e.g. OpenVDB)]

  7. What is bcolz?
    • Provides a storage layer that is both chunked
    and compressed
    • It is meant for both in-memory and persistent
    storage (on disk)
    • Main goal: to demonstrate that compression
    can accelerate data access (both on disk and
    in-memory)
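
A minimal sketch of both flavours using the bcolz Python API; the data, compression level, codec and directory name are illustrative assumptions.

```python
import numpy as np
import bcolz

data = np.arange(10000000)

# In-memory compressed container: a chunked, compressed carray.
ca = bcolz.carray(data, cparams=bcolz.cparams(clevel=5, cname='lz4'))
print(ca.nbytes, "->", ca.cbytes)   # uncompressed vs. compressed size in bytes

# The same container persisted on disk: just point it at a directory.
cd = bcolz.carray(data, rootdir='mydata.bcolz', mode='w')
cd.flush()
```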

  8. bcolz vs pandas (size)
    The MovieLens Dataset
    https://github.com/Blosc/movielens-bench
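
A hedged sketch of the kind of size comparison behind this slide, assuming the MovieLens ratings are already loaded into a pandas DataFrame called df; the actual columns and compression settings used in the linked benchmark may differ.

```python
import bcolz

# df: pandas DataFrame with the MovieLens ratings (assumed to be loaded already).
ct = bcolz.ctable.fromdataframe(df, cparams=bcolz.cparams(clevel=5))

pandas_bytes = int(df.memory_usage(index=True).sum())
print("pandas:", pandas_bytes, "bytes")
print("bcolz :", ct.cbytes, "bytes compressed (", ct.nbytes, "uncompressed )")
```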

  9. Query Times
    5-year old laptop (Intel Core2, 2 cores)
    Compression still slows things down

  10. Query Times
    3-year old laptop (Intel Ivy-Bridge, 2 cores)
    Compression speeds things up
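
A hedged sketch of the kind of query timed on both machines; the synthetic ctable below is a stand-in for the MovieLens data, and the actual benchmark query may differ.

```python
import numpy as np
import bcolz

# A small synthetic stand-in for the MovieLens ratings table.
n = 1000000
ct = bcolz.ctable(
    columns=[np.random.randint(0, 6000, n),    # user_id
             np.random.randint(0, 4000, n),    # movie_id
             np.random.randint(1, 6, n)],      # rating (1-5)
    names=['user_id', 'movie_id', 'rating'])

# bcolz evaluates the expression block by block (via numexpr), so only the
# chunks that hold matching rows need to be fully decompressed.
liked = [row.rating for row in ct.where('rating >= 4')]
```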

  11. Streaming analytics with bcolz
    bcolz is meant to be simple: note the modular approach!
    [Diagram: a bcolz container (on disk or in memory) feeds the bcolz
    iterators/filters with blocking (iter(), iterblocks(), where(),
    whereblocks(), __getitem__()), which in turn feed external tools such as
    itertools, Dask and bquery for map(), filter(), groupby(), sortby(),
    reduceby() and join()]
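
A minimal sketch of that modular pipeline, combining bcolz block-wise iteration with plain itertools; it assumes a ctable ct with a 'rating' column (such as the one built in the previous sketch), and the whereblocks() block length is an illustrative choice.

```python
import itertools

# whereblocks() yields NumPy structured arrays of matching rows, block by
# block, so memory use stays bounded regardless of the container size.
total = 0.0
count = 0
for block in ct.whereblocks('rating >= 4', blen=100000):
    total += block['rating'].sum()
    count += len(block)
print("mean rating among liked movies:", total / count)

# Row-wise iterators compose with itertools as well:
first_ten = list(itertools.islice(ct.where('rating >= 4'), 10))
```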

  12. –Alistair Miles
    Head of Epidemiological Informatics for the Kwiatkowski group.
    Author of scikit-allel.
    “The future for me clearly involves lots of
    block-wise processing of multidimensional
    bcolz carrays”

  13. Introducing Blosc2
    Next generation for Blosc
    [Diagram: a Blosc2 superchunk is a Blosc2 header followed by chunk 1,
    chunk 2, chunk 3 … chunk N, each stored as a regular Blosc1 chunk]

  14. Blosc2
    • Blosc1 only works with fixed-length, equal-sized
    chunks (blocks)
    • This can lead to a poor use of space to
    accommodate variable-length data (potentially
    large zero-paddings)
    • Blosc2 addresses this shortcoming by using
    superchunks of variable-length chunks

  15. ARM/NEON: a first-class citizen for Blosc2
    • At 3 GB/s, Blosc2 on ARM achieves one of the best
    bandwidth/Watt ratios on the market
    • Profound implications for the density of data storage
    devices (e.g. arrays of disks driven by ARM)
    [Benchmark plot: compression bandwidth not using NEON vs. using NEON]

  16. Other planned features for Blosc2
    • Looking into inter-chunk
    redundancies (delta filter)
    • Support for more codecs
    and filters
    • Serialized version of the
    super-chunk (disk, network)

  17. Blosc2 has its own repo
    https://github.com/Blosc/c-blosc2
    It is meant to be used only once it has been heavily tested!
