Blosc & bcolz: efficient compressed data containers
Compressed data containers can be critical when you want to process datasets that exceed the capacity of your computer. Blosc and bcolz let you do that very efficiently.
Blosc / bcolz: Extremely fast compression & storage!
Francesc Alted, Freelance Consultant (Department of GeoSciences, University of Oslo)
DS4DS Workshop held in Berkeley, USA, Sep 2015
The same data take less RAM
Transmission + decompression faster than direct transfer?
[Diagram comparing two paths to the CPU cache, one for the original dataset and one for the compressed dataset. Labels: Disk or Memory (RAM); Disk or Memory Bus; Decompression; CPU Cache]
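The question on this slide can be put as a back-of-the-envelope model. The following is a minimal sketch, assuming that transfer and decompression happen one after the other without overlapping; the function names, parameters and numbers are illustrative and are not taken from the talk.

```python
def transfer_time(size_bytes, bandwidth):
    """Seconds needed to move size_bytes over a channel of `bandwidth` bytes/s."""
    return size_bytes / bandwidth


def compressed_transfer_time(size_bytes, bandwidth, ratio, decompress_speed):
    """Seconds needed to move the compressed data and then decompress it.

    ratio: compression ratio (original size / compressed size).
    decompress_speed: bytes of decompressed output produced per second.
    Assumes the two steps do not overlap (a pessimistic simplification).
    """
    return size_bytes / ratio / bandwidth + size_bytes / decompress_speed


def compression_wins(bandwidth, ratio, decompress_speed):
    """True when the compressed path is faster (independent of the data size)."""
    # S/(r*B) + S/D < S/B   is equivalent to   D > B * r / (r - 1)
    return ratio > 1 and decompress_speed > bandwidth * ratio / (ratio - 1)


# Illustrative numbers only: 1 GB of data, a 100 MB/s channel,
# a 4x compression ratio and a 1 GB/s decompressor.
direct = transfer_time(1e9, 1e8)
packed = compressed_transfer_time(1e9, 1e8, 4, 1e9)
```

Under this model the compressed path wins whenever the decompressor is faster than the channel by a factor of r / (r - 1), which is why very fast codecs matter more than very high compression ratios.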
Compression matters!
“Blosc compressors are the fastest ones out there at this point; there is no better publicly available option that I'm aware of. That's not just "yet another compressor library" case.”
– Ivan Smirnov (advocating for Blosc inclusion in h5py)
Blosc ecosystem
Small, but with big impact (thanks mainly to PyTables/pandas)
[Diagram of the ecosystem around Blosc: PyTables, pandas, bcolz, Castra, h5py, Bloscpack, scikit-allel, bquery, and the C / C++ world (e.g. OpenVDB)]
What is bcolz?
• Provides a storage layer that is both chunked and compressible
• It is meant for both memory and persistent storage (disk)
• Main goal: to demonstrate that compression can accelerate data access (both on disk and in-memory)
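To make "chunked and compressible" concrete, here is a toy container written with the Python standard library only. It is a sketch of the general idea, not bcolz's API or implementation: the class name, its methods and the use of zlib (instead of Blosc) are all assumptions made for illustration.

```python
import zlib
from array import array


class ChunkedCompressedArray:
    """Toy container: 64-bit integers stored in independently compressed chunks."""

    def __init__(self, chunk_len=1024):
        self.chunk_len = chunk_len   # number of items per chunk
        self.chunks = []             # compressed bytes of each full chunk
        self.tail = array("q")       # uncompressed leftovers (< chunk_len items)
        self.length = 0

    def append(self, values):
        values = array("q", values)
        self.tail.extend(values)
        self.length += len(values)
        # Compress and store every full chunk that has accumulated in the tail.
        while len(self.tail) >= self.chunk_len:
            full = self.tail[:self.chunk_len]
            self.chunks.append(zlib.compress(full.tobytes()))
            del self.tail[:self.chunk_len]

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        n, offset = divmod(i, self.chunk_len)
        if n < len(self.chunks):
            # Only the chunk that holds item i is decompressed.
            chunk = array("q")
            chunk.frombytes(zlib.decompress(self.chunks[n]))
            return chunk[offset]
        return self.tail[offset]

    def compressed_nbytes(self):
        tail_nbytes = self.tail.itemsize * len(self.tail)
        return sum(len(c) for c in self.chunks) + tail_nbytes


ca = ChunkedCompressedArray(chunk_len=1000)
ca.append(range(10_000))
```

Because each chunk is compressed on its own, reading one item only costs the decompression of one chunk, and appending never requires rewriting earlier data.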
“The future for me clearly involves lots of block-wise processing of multidimensional bcolz carrays”
– Alistair Miles, Head of Epidemiological Informatics for the Kwiatkowski group. Author of scikit-allel.
Blosc2
• Blosc1 only works with fixed-length, equal-sized chunks (blocks)
• This can lead to poor use of space when accommodating variable-length data (potentially large zero-paddings)
• Blosc2 addresses this shortcoming by using superchunks of variable-length chunks
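The zero-padding cost mentioned above can be estimated with simple arithmetic. The sketch below assumes a simplified model in which each variable-length record occupies one slot and every slot is padded to the length of the longest record. It counts uncompressed bytes only, and the record lengths are made up for illustration.

```python
def equal_slot_nbytes(lengths):
    """Uncompressed bytes when every record is zero-padded to the longest one."""
    return max(lengths) * len(lengths) if lengths else 0


def variable_nbytes(lengths):
    """Uncompressed bytes when every record keeps its own length."""
    return sum(lengths)


def padding_fraction(lengths):
    """Share of the equal-sized layout that is spent on padding."""
    total = equal_slot_nbytes(lengths)
    return (total - variable_nbytes(lengths)) / total if total else 0.0


# Hypothetical record lengths, in bytes.
record_lengths = [10, 200, 35, 1000, 7]
padded = equal_slot_nbytes(record_lengths)
exact = variable_nbytes(record_lengths)
wasted = padding_fraction(record_lengths)
```

With these made-up lengths, roughly three quarters of the equal-sized layout is padding. Zeros compress well, so the cost on disk is smaller than this figure suggests, but the padding still has to be allocated and processed once decompressed.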
ARM/NEON: a first-class citizen for Blosc2
• At 3 GB/s, Blosc2 on ARM achieves one of the best bandwidth/Watt ratios in the market
• Profound implications for the density of data storage devices (e.g. arrays of disks driven by ARM)
[Figure panels: Not using NEON; Using NEON]
Other planned features for Blosc2
• Looking into inter-chunk redundancies (delta filter)
• Support for more codecs and filters
• Serialized version of the super-chunk (disk, network)
…
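As an illustration of what exploiting inter-chunk redundancy can mean, the sketch below stores each chunk as its byte-wise difference from the previous chunk before compressing it. This is a generic delta scheme using zlib from the standard library, not Blosc2's actual filter; the function names and the test data are assumptions.

```python
import random
import zlib


def delta_encode(chunks):
    """Keep the first chunk; replace each later chunk by its byte-wise
    difference (mod 256) from the previous one.
    Expects a non-empty list of equal-length bytes objects."""
    out = [chunks[0]]
    for prev, cur in zip(chunks, chunks[1:]):
        out.append(bytes((c - p) % 256 for c, p in zip(cur, prev)))
    return out


def delta_decode(deltas):
    """Inverse of delta_encode: rebuild each chunk from the previous one."""
    out = [deltas[0]]
    for d in deltas[1:]:
        prev = out[-1]
        out.append(bytes((x + p) % 256 for x, p in zip(d, prev)))
    return out


def compressed_size(chunks):
    """Total bytes after compressing every chunk independently."""
    return sum(len(zlib.compress(c)) for c in chunks)


# Four chunks of random (hard to compress) bytes that differ from
# their predecessor in a single byte.
rng = random.Random(0)
chunks = [bytes(rng.getrandbits(8) for _ in range(4096))]
for i in range(1, 4):
    nxt = bytearray(chunks[-1])
    nxt[i] ^= 0xFF
    chunks.append(bytes(nxt))

plain = compressed_size(chunks)
filtered = compressed_size(delta_encode(chunks))
```

Each chunk is nearly incompressible on its own, but after the delta step every chunk except the first is almost all zeros, so the filtered total is much smaller than the plain one.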