
Blosc & bcolz: efficient compressed data containers

Francesc Alted
September 22, 2015

Compressed data containers can be critical when you want to process datasets that exceed the capacity of your computer. Blosc and bcolz let you do that very efficiently.

Transcript

  1. Blosc / bcolz: extremely fast compression & storage! Francesc Alted, Freelance Consultant (Department of GeoSciences, University of Oslo). DS4DS Workshop held in Berkeley, USA, Sep 2015.
  2. I’ve spent the last few years fighting the memory-CPU speed mismatch. The gap is wide and still widening.
  3. The same data take less RAM. Can transmission + decompression be faster than a direct transfer? [Diagram: the compressed dataset travels over the disk or memory bus and is decompressed between disk/RAM and the CPU cache, versus transferring the original dataset directly.]
  4. Blosc: (de-)compressing faster than memory. Reads from Blosc chunks are up to 5x faster than memcpy() (on synthetic data).
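
For reference, a minimal sketch of this read path with the python-blosc package; the array contents and parameters are illustrative, not the benchmark's actual setup:

import numpy as np
import blosc

# Synthetic, highly compressible data, in the spirit of the benchmark above.
a = np.linspace(0, 100, 10**7)

# Compress the raw buffer; typesize tells the shuffle filter the item width.
packed = blosc.compress(a.tobytes(), typesize=a.itemsize,
                        cname='blosclz', clevel=9, shuffle=blosc.SHUFFLE)
print(a.nbytes / len(packed))   # compression ratio

# Decompress back into bytes: this is the path measured against memcpy().
restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)
assert np.array_equal(a, restored)
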
  5. Compression matters! “Blosc compressors are the fastest ones out there at this point; there is no better publicly available option that I'm aware of. That's not just "yet another compressor library" case.” — Ivan Smirnov (advocating for Blosc inclusion in h5py)
  6. Blosc ecosystem: small, but with big impact (thanks mainly to PyTables/pandas): Blosc, PyTables, pandas, bcolz, Castra, h5py, Bloscpack, scikit-allel, bquery, and the C/C++ world (e.g. OpenVDB).
  7. What is bcolz? • Provides a storage layer that is both chunked and compressible • Meant for both in-memory and persistent (on-disk) storage • Main goal: to demonstrate that compression can accelerate data access, both on disk and in memory.
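
A minimal usage sketch, assuming the bcolz package is installed; the rootdir path and cparams values are made up for illustration:

import numpy as np
import bcolz

# In-memory chunked, compressed container.
ca = bcolz.carray(np.arange(10**7),
                  cparams=bcolz.cparams(clevel=5, cname='lz4'))
print(ca.nbytes, ca.cbytes)   # uncompressed vs. compressed size in bytes

# The same container persisted on disk: only rootdir changes.
cd = bcolz.carray(np.arange(10**7), rootdir='mydata.bcolz', mode='w')
cd.flush()
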
  8. bcolz vs pandas (size): the MovieLens dataset. https://github.com/Blosc/movielens-bench
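
A sketch of this kind of size comparison on a toy frame; the columns are hypothetical, not the actual MovieLens schema:

import numpy as np
import pandas as pd
import bcolz

df = pd.DataFrame({'user_id': np.random.randint(0, 10**4, 10**6),
                   'rating': np.random.randint(1, 6, 10**6)})

ct = bcolz.ctable.fromdataframe(df)
print(df.memory_usage(index=False).sum())   # bytes held by pandas
print(ct.cbytes)                            # compressed bytes held by bcolz
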
  9. Query times on a 5-year-old laptop (Intel Core2, 2 cores): compression still slows things down.
  10. Query times on a 3-year-old laptop (Intel Ivy Bridge, 2 cores): compression speeds things up.
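
An illustrative query in the style of these benchmarks; the file name and column names are hypothetical:

import bcolz

ct = bcolz.open('movielens.bcolz')   # assumed on-disk ctable
# where() evaluates the expression block by block (via numexpr),
# decompressing only the chunks it actually needs.
hits = [row.rating for row in ct.where('(rating >= 4) & (user_id < 100)')]
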
  11. Streaming analytics with bcolz. bcolz is meant to be simple: note the modular approach! [Diagram: a bcolz container (on disk or in memory) feeds the bcolz iterators/filters with blocking (iter(), iterblocks(), where(), whereblocks(), __getitem__()); on top of these, itertools, Dask, bquery, etc. provide map(), filter(), groupby(), sortby(), reduceby(), join().]
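
A sketch of that block-wise style, reusing the hypothetical ctable from above: bcolz feeds blocks, plain Python reduces them:

import bcolz

ct = bcolz.open('movielens.bcolz')   # assumed on-disk ctable

# whereblocks() yields numpy structured arrays one block at a time,
# so the whole table never has to fit in RAM.
total, count = 0.0, 0
for block in ct.whereblocks('rating >= 4'):
    total += block['rating'].sum()
    count += len(block)
print(total / count)   # mean rating among the selected rows
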
  12. “The future for me clearly involves lots of block-wise processing of multidimensional bcolz carrays.” – Alistair Miles, Head of Epidemiological Informatics for the Kwiatkowski group, and author of scikit-allel.
  13. Introducing Blosc2: the next generation of Blosc. [Diagram: a Blosc2 superchunk, i.e. a Blosc2 header followed by Chunk 1 (Blosc1), Chunk 2 (Blosc1), Chunk 3 (Blosc1), …, Chunk N (Blosc1).]
  14. Blosc2 • Blosc1 only works with fixed-length, equal-sized chunks (blocks) • This can lead to poor use of space when accommodating variable-length data (potentially large zero paddings) • Blosc2 addresses this shortcoming by using superchunks made of variable-length chunks; a toy model follows.
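
Blosc2 itself is a C library; the following is only a toy Python model of the superchunk idea, built on python-blosc purely for illustration:

import blosc

class ToySuperChunk:
    """A header (the chunk list) plus independently compressed chunks
    that may each hold a different amount of data."""
    def __init__(self, typesize=1):
        self.typesize = typesize
        self.chunks = []   # plays the role of Chunk 1 .. Chunk N
    def append(self, buf):
        # Each chunk is compressed on its own, so lengths can vary
        # freely: no zero-padding to a fixed chunk size is needed.
        self.chunks.append(blosc.compress(buf, typesize=self.typesize))
    def read(self, i):
        return blosc.decompress(self.chunks[i])

sc = ToySuperChunk()
sc.append(b'x' * 1000)     # variable-length chunks...
sc.append(b'y' * 123456)   # ...coexist in one container
assert sc.read(0) == b'x' * 1000
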
  15. ARM/NEON: a first-class citizen for Blosc2 • At 3 GB/s, Blosc2 on ARM achieves one of the best bandwidth/Watt ratios on the market • Profound implications for the density of data storage devices (e.g. arrays of disks driven by ARM). [Plot: Blosc2 throughput with NEON vs. without NEON.]
  16. Other planned features for Blosc2 • Looking into inter-chunk redundancies (delta filter) • Support for more codecs and filters • Serialized version of the super-chunk (disk, network) …
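
A toy stand-in for the delta idea (not Blosc2's actual filter): storing differences of smooth data leaves mostly constant values that compress far better:

import numpy as np
import blosc

a = np.arange(10**6, dtype=np.int64)   # smooth, monotonically growing data
plain = blosc.compress(a.tobytes(), typesize=8)

delta = np.diff(a, prepend=0)          # store differences instead
packed = blosc.compress(delta.tobytes(), typesize=8)
print(len(plain), len(packed))         # the deltas typically shrink much more

# A cumulative sum undoes the delta filter exactly.
assert np.array_equal(np.cumsum(delta), a)
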
  17. Blosc2 has its own repo: https://github.com/Blosc/c-blosc2. It is meant to be usable only once it has been heavily tested!