Blosc & bcolz: efficient compressed data containers

FrancescAlted
September 22, 2015

Compressed data containers can be critical when you want to process datasets that exceed the capacity of your computer. Blosc and bcolz let you do that very efficiently.


Transcript

  1. Blosc / bcolz: Extremely fast compression & storage! Francesc Alted, Freelance Consultant (Department of GeoSciences, University of Oslo). DS4DS Workshop held in Berkeley, USA, Sep 2015
  2. The same data take less RAM. Transmission + decompression faster than direct transfer? [Diagram labels: Original Dataset, Compressed Dataset, Disk or Memory (RAM), Disk or Memory Bus, Decompression, CPU Cache]
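
The question on this slide can be made concrete with a back-of-the-envelope model. The sketch below is an illustration under stated assumptions, not a measurement from the talk: the function names and every bandwidth figure are hypothetical, and it assumes that transfer and decompression run one after the other with no overlap.

```python
def direct_time(size_bytes, bus_bw):
    """Seconds to move the uncompressed data over the bus."""
    return size_bytes / bus_bw


def compressed_time(size_bytes, bus_bw, ratio, decomp_bw):
    """Seconds to move the compressed data, then decompress it.

    ratio     -- compression ratio (original size / compressed size)
    decomp_bw -- decompressor output speed in bytes per second
    """
    transfer = (size_bytes / ratio) / bus_bw
    decompress = size_bytes / decomp_bw
    return transfer + decompress


def break_even_decomp_bw(bus_bw, ratio):
    """Decompression speed above which the compressed path wins."""
    return bus_bw * ratio / (ratio - 1)


# Hypothetical numbers, chosen only for illustration.
size = 1_000_000_000    # 1 GB dataset
bus = 100_000_000       # 100 MB/s transfer bandwidth
ratio = 4.0             # data shrinks to a quarter of its size
decomp = 1_000_000_000  # 1 GB/s decompressor

t_direct = direct_time(size, bus)
t_comp = compressed_time(size, bus, ratio, decomp)
```

With these made-up figures the direct transfer takes 10 s and the compressed path 3.5 s. In this model the compressed path wins whenever the decompressor is faster than `bus_bw * ratio / (ratio - 1)`.
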
  3. Blosc: (de-)compressing faster than memory. Reads from Blosc chunks up to 5x faster than memcpy() (on synthetic data)
  4. Compression matters! “Blosc compressors are the fastest ones out there at this point; there is no better publicly available option that I'm aware of. That's not just "yet another compressor library" case.” – Ivan Smirnov (advocating for Blosc inclusion in h5py)
  5. Blosc ecosystem: small, but with big impact (thanks mainly to PyTables/pandas). [Diagram labels: Blosc, PyTables, pandas, bcolz, Castra, h5py, Bloscpack, scikit-allel, bquery, C / C++ world (e.g. OpenVDB)]
  6. What is bcolz?
    • Provides a storage layer that is both chunked and compressible
    • It is meant for both memory and persistent storage (disk)
    • Main goal: to demonstrate that compression can accelerate data access (both on disk and in-memory)
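
To make "chunked and compressible" concrete, here is a minimal sketch of such a storage layer. It is not bcolz's implementation: the class `ChunkedCompressedArray` is hypothetical, it uses `zlib` from the Python standard library as a stand-in for Blosc, and it only handles 64-bit integers held in memory.

```python
import array
import zlib


class ChunkedCompressedArray:
    """Toy in-memory container: int64 values stored as compressed chunks."""

    def __init__(self, values, chunklen=1024):
        self.chunklen = chunklen
        self.length = len(values)
        self.chunks = []
        # Compress each fixed-length run of values independently.
        for start in range(0, self.length, chunklen):
            raw = array.array("q", values[start:start + chunklen]).tobytes()
            self.chunks.append(zlib.compress(raw))

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        """Decompress only the chunk that holds `index`."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        chunk_no, offset = divmod(index, self.chunklen)
        block = array.array("q")
        block.frombytes(zlib.decompress(self.chunks[chunk_no]))
        return block[offset]

    def compressed_size(self):
        """Total bytes held by the compressed chunks."""
        return sum(len(c) for c in self.chunks)


data = list(range(100_000))
ca = ChunkedCompressedArray(data, chunklen=4096)
```

Because each chunk is compressed on its own, reading `ca[4096]` touches a single chunk rather than the whole dataset. The same layout could be written to disk one chunk per file or record, which is the sense in which one design can serve both memory and persistent storage.
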
  7. Streaming analytics with bcolz. bcolz is meant to be simple: note the modular approach! [Diagram: a bcolz container (disk or memory) exposes bcolz iterators/filters with blocking (iter(), iterblocks(), where(), whereblocks(), __getitem__()), which feed map(), filter(), groupby(), sortby(), reduceby(), join() from itertools, Dask, bquery, …]
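
The modular idea on this slide, a container that hands out blocks and generic tools that consume them, can be sketched with the standard library alone. The functions below borrow the names `iterblocks` and `whereblocks` from the slide, but they are hypothetical stand-ins written for this example, not bcolz's API, and `zlib` again replaces Blosc.

```python
import array
import zlib
from itertools import chain


def compress_blocks(values, blocklen):
    """Split values into blocks and compress each one (stand-in container)."""
    values = list(values)
    chunks = []
    for start in range(0, len(values), blocklen):
        raw = array.array("q", values[start:start + blocklen]).tobytes()
        chunks.append(zlib.compress(raw))
    return chunks


def iterblocks(chunks):
    """Yield one decompressed block at a time, never the whole dataset."""
    for chunk in chunks:
        block = array.array("q")
        block.frombytes(zlib.decompress(chunk))
        yield block


def whereblocks(chunks, predicate):
    """Yield, block by block, the values that satisfy predicate."""
    for block in iterblocks(chunks):
        hits = [v for v in block if predicate(v)]
        if hits:
            yield hits


chunks = compress_blocks(range(10_000), blocklen=1000)

# Downstream code is plain itertools: flatten the filtered blocks and reduce.
matches = whereblocks(chunks, lambda v: v % 1000 == 0)
total = sum(chain.from_iterable(matches))
```

Only one decompressed block is alive at any moment, so peak memory depends on the block length rather than on the dataset size.
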
  8. “The future for me clearly involves lots of block-wise processing of multidimensional bcolz carrays” – Alistair Miles, Head of Epidemiological Informatics for the Kwiatkowski group. Author of scikit-allel.
  9. Introducing Blosc2: next generation for Blosc. [Diagram labels: Blosc2 Header, Chunk 1 (Blosc1), Chunk 2 (Blosc1), Chunk 3 (Blosc1), Chunk N (Blosc1)]
  10. Blosc2
    • Blosc1 only works with fixed-length, equal-sized chunks (blocks)
    • This can lead to poor use of space when accommodating variable-length data (potentially large zero-paddings)
    • Blosc2 addresses this shortcoming by using superchunks of variable-length chunks
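
The padding cost described on this slide can be illustrated with simple arithmetic. The sketch below is a deliberately simplified model, not Blosc's actual layout: it assumes that fixed-size storage pads every record to the length of the longest one, it counts sizes before compression, and the record lengths are invented.

```python
def fixed_size_storage(lengths):
    """Bytes needed when every record is zero-padded to the longest one."""
    return max(lengths) * len(lengths)


def variable_size_storage(lengths):
    """Bytes needed when each record keeps its own length."""
    return sum(lengths)


# Hypothetical record lengths in bytes.
lengths = [10, 200, 35, 1000, 5]

padded = fixed_size_storage(lengths)
packed = variable_size_storage(lengths)
wasted = padded - packed
```

For these invented lengths the padded layout needs 5000 bytes against 1250 bytes for the variable-length one, so three quarters of the padded space is zeros.
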
  11. ARM/NEON: a first-class citizen for Blosc2
    • At 3 GB/s, Blosc2 on ARM achieves one of the best bandwidth/Watt ratios in the market
    • Profound implications for the density of data storage devices (e.g. arrays of disks driven by ARM)
    [Chart labels: Not using NEON, Using NEON]
  12. Other planned features for Blosc2
    • Looking into inter-chunk redundancies (delta filter)
    • Support for more codecs and filters
    • Serialized version of the super-chunk (disk, network)
    …
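
As an illustration of why inter-chunk redundancy is worth exploiting, the sketch below applies a generic XOR delta against a reference chunk before compressing. This is an assumption about what a delta filter could look like, not the planned Blosc2 design: the helper `xor_bytes` is hypothetical, the data are synthetic, and `zlib` stands in for the codec.

```python
import random
import zlib


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


# Synthetic reference chunk: random bytes, which barely compress on their own.
rng = random.Random(0)
reference = bytes(rng.getrandbits(8) for _ in range(4096))

# A second chunk that differs from the reference in only a few bytes.
second = bytearray(reference)
for pos in (10, 500, 3000):
    second[pos] ^= 0xFF
second = bytes(second)

# Compressing the chunk directly versus compressing its delta.
plain_size = len(zlib.compress(second))
delta = xor_bytes(second, reference)
delta_size = len(zlib.compress(delta))

# The filter is reversible: XOR with the reference restores the chunk.
restored = xor_bytes(delta, reference)
```

The delta is almost entirely zeros, so it compresses far better than the chunk itself, and applying the same XOR again recovers the original bytes exactly.
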