
Blosc & bcolz: efficient compressed data containers

Francesc Alted
September 22, 2015


Compressed data containers can be critical when you want to process datasets that exceed the capacity of your computer. Blosc and bcolz let you do that in a very efficient way.


Transcript

  1. Blosc / bcolz
    Extremely fast compression & storage
    Francesc Alted
    Freelance Consultant
    (Department of GeoSciences, University of Oslo)
    DS4DS Workshop held in Berkeley, USA, Sep 2015

  2. I’ve spent the last few years fighting the
    memory-CPU speed mismatch
    The gap is wide and still widening

  3. The same data take less RAM
    Transmission + decompression faster than direct
    transfer?
    [Diagram: the compressed dataset travels from disk or memory (RAM) across
    the disk/memory bus and is decompressed into the original dataset inside
    the CPU cache]

  4. Blosc: (de-)compressing faster than
    memory
    Reads from Blosc chunks up to 5x faster than memcpy()
    (on synthetic data)
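
A minimal sketch of the kind of compress/decompress round trip behind this claim, using the python-blosc bindings; the array size, codec and compression level here are illustrative assumptions, not the benchmark's actual settings.

```python
import numpy as np
import blosc

# Synthetic, highly compressible data (a linear ramp), in the spirit of the benchmark.
a = np.linspace(0, 100, 10000000)
raw = a.tobytes()

# typesize tells Blosc the element width so the shuffle filter can do its job.
packed = blosc.compress(raw, typesize=a.itemsize, clevel=5, cname='blosclz')
print("compression ratio:", len(raw) / len(packed))

# Decompress and rebuild a NumPy view on the restored buffer.
restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)
assert np.array_equal(a, restored)
```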

  5. Compression matters!
    “Blosc compressors are the fastest ones out there at
    this point; there is no better publicly available option
    that I'm aware of. That's not just "yet another
    compressor library" case.”
    — Ivan Smirnov
    (advocating for Blosc inclusion in h5py)

  6. Blosc ecosystem
    Small, but with big impact
    (thanks mainly to PyTables/pandas)
    [Diagram: projects built around Blosc: PyTables, pandas, bcolz, Castra,
    h5py, Bloscpack, scikit-allel, bquery, and the C/C++ world (e.g. OpenVDB)]

  7. What is bcolz?
    • Provides a storage layer that is both chunked
    and compressed
    • It is meant for both in-memory and persistent
    storage (on disk)
    • Main goal: to demonstrate that compression
    can accelerate data access (both on disk and
    in-memory)
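
A minimal sketch of both flavours using the bcolz Python API; the data, compression level, codec and directory name are illustrative assumptions.

```python
import numpy as np
import bcolz

data = np.arange(10000000)

# In-memory compressed container: a chunked, compressed carray.
ca = bcolz.carray(data, cparams=bcolz.cparams(clevel=5, cname='lz4'))
print(ca.nbytes, "->", ca.cbytes)   # uncompressed vs. compressed size in bytes

# The same container persisted on disk: just point it at a directory.
cd = bcolz.carray(data, rootdir='mydata.bcolz', mode='w')
cd.flush()
```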

  8. bcolz vs pandas (size)
    The MovieLens Dataset
    https://github.com/Blosc/movielens-bench
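
A hedged sketch of the kind of size comparison behind this slide, assuming the MovieLens ratings are already loaded into a pandas DataFrame called df; the actual columns and compression settings used in the linked benchmark may differ.

```python
import bcolz

# df: pandas DataFrame with the MovieLens ratings (assumed to be loaded already).
ct = bcolz.ctable.fromdataframe(df, cparams=bcolz.cparams(clevel=5))

pandas_bytes = int(df.memory_usage(index=True).sum())
print("pandas:", pandas_bytes, "bytes")
print("bcolz :", ct.cbytes, "bytes compressed (", ct.nbytes, "uncompressed )")
```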

  9. Query Times
    5-year old laptop (Intel Core2, 2 cores)
    Compression still slows things down

  10. Query Times
    3-year old laptop (Intel Ivy-Bridge, 2 cores)
    Compression speeds things up
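
A hedged sketch of the kind of query timed on both machines; the synthetic ctable below is a stand-in for the MovieLens data, and the actual benchmark query may differ.

```python
import numpy as np
import bcolz

# A small synthetic stand-in for the MovieLens ratings table.
n = 1000000
ct = bcolz.ctable(
    columns=[np.random.randint(0, 6000, n),    # user_id
             np.random.randint(0, 4000, n),    # movie_id
             np.random.randint(1, 6, n)],      # rating (1-5)
    names=['user_id', 'movie_id', 'rating'])

# bcolz evaluates the expression block by block (via numexpr), so only the
# chunks that hold matching rows need to be fully decompressed.
liked = [row.rating for row in ct.where('rating >= 4')]
```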

  11. Streaming analytics with bcolz
    bcolz is meant to be simple: note the modular approach!
    [Diagram: a bcolz container (on disk or in memory) feeds the bcolz
    iterators/filters with blocking (iter(), iterblocks(), where(),
    whereblocks(), __getitem__()), which in turn feed external tools such as
    itertools, Dask and bquery for map(), filter(), groupby(), sortby(),
    reduceby() and join()]
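
A minimal sketch of that modular pipeline, combining bcolz block-wise iteration with plain itertools; it assumes a ctable ct with a 'rating' column (such as the one built in the previous sketch), and the whereblocks() block length is an illustrative choice.

```python
import itertools

# whereblocks() yields NumPy structured arrays of matching rows, block by
# block, so memory use stays bounded regardless of the container size.
total = 0.0
count = 0
for block in ct.whereblocks('rating >= 4', blen=100000):
    total += block['rating'].sum()
    count += len(block)
print("mean rating among liked movies:", total / count)

# Row-wise iterators compose with itertools as well:
first_ten = list(itertools.islice(ct.where('rating >= 4'), 10))
```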

  12. –Alistair Miles
    Head of Epidemiological Informatics for the Kwiatkowski group.
    Author of scikit-allel.
    “The future for me clearly involves lots of
    block-wise processing of multidimensional
    bcolz carrays”

  13. Introducing Blosc2
    Next generation for Blosc
    [Diagram: a Blosc2 superchunk is a Blosc2 header followed by chunk 1,
    chunk 2, chunk 3 … chunk N, each stored as a regular Blosc1 chunk]

  14. Blosc2
    • Blosc1 only works with fixed-length, equal-sized
    chunks (blocks)
    • This can lead to a poor use of space to
    accommodate variable-length data (potentially
    large zero-paddings)
    • Blosc2 addresses this shortcoming by using
    superchunks of variable-length chunks

  15. ARM/NEON: a first-class citizen for Blosc2
    • At 3 GB/s, Blosc2 on ARM achieves one of the best
    bandwidth/Watt ratios on the market
    • Profound implications for the density of data storage
    devices (e.g. arrays of disks driven by ARM)
    [Benchmark plot: compression bandwidth not using NEON vs. using NEON]

  16. Other planned features for Blosc2
    • Looking into inter-chunk
    redundancies (delta filter)
    • Support for more codecs
    and filters
    • Serialized version of the
    super-chunk (disk, network)

  17. Blosc2 has its own repo
    https://github.com/Blosc/c-blosc2
    It is meant to be used only once it has been heavily tested!
