
Blosc & bcolz : efficient compressed data containers

Francesc Alted
September 22, 2015

Compressed data containers can be critical when you want to process datasets that exceed the capacity of your computer. Blosc and bcolz let you do that very efficiently.

Transcript

  1. Blosc / bcolz
    Extremely fast compression & storage
    Francesc Alted
    Freelance Consultant
    (Department of GeoSciences, University of Oslo)
    DS4DS Workshop held in Berkeley, USA, Sep 2015


  2. I’ve spent the last few years fighting the
    memory-CPU speed mismatch
    The gap is wide and still widening


  3. The same data take less RAM
    Transmission + decompression faster than direct
    transfer?
    (Diagram: the compressed dataset sits on disk or in memory (RAM); only
    compressed data crosses the disk or memory bus, and decompression happens
    on the way to the CPU cache, where the original dataset is reconstructed.)


  4. Blosc: (de-)compressing faster than
    memory
    Reads from Blosc chunks up to 5x faster than memcpy()
    (on synthetic data)


  5. Compression matters!
    “Blosc compressors are the fastest ones out there at
    this point; there is no better publicly available option
    that I'm aware of. That's not just "yet another
    compressor library" case.”
    — Ivan Smirnov
    (advocating for Blosc inclusion in h5py)


  6. Blosc ecosystem
    Small, but with big impact

    (thanks mainly to PyTables/pandas)
    Blosc is used by PyTables, pandas, bcolz, Castra, h5py,
    Bloscpack, scikit-allel and bquery, as well as by the
    C / C++ world (e.g. OpenVDB)


  7. What is bcolz?
    • Provides a storage layer that is both chunked
    and compressed
    • It is meant for both memory and persistent
    storage (disk)
    • Main goal: to demonstrate that compression
    can accelerate data access (both on disk and
    in-memory)


  8. bcolz vs pandas (size)
    The MovieLens Dataset
    https://github.com/Blosc/movielens-bench


  9. Query Times
    5-year-old laptop (Intel Core2, 2 cores)
    Compression still slows things down


  10. Query Times
    3-year-old laptop (Intel Ivy-Bridge, 2 cores)
    Compression speeds things up


  11. Streaming analytics with bcolz
    bcolz is meant to be simple: note the modular approach!
    (Diagram of the pipeline:)
    bcolz container (disk or memory)
    → bcolz iterators/filters with blocking:
    iter(), iterblocks(), where(), whereblocks(), __getitem__()
    → itertools, Dask, bquery
    → map(), filter(), groupby(), sortby(), reduceby(), join()


    “The future for me clearly involves lots of
    block-wise processing of multidimensional
    bcolz carrays”
    — Alistair Miles,
    Head of Epidemiological Informatics for the Kwiatkowski group.
    Author of scikit-allel.


  13. Introducing Blosc2
    Next generation for Blosc
    (Superchunk layout: a Blosc2 header followed by
    Chunk 1, Chunk 2, Chunk 3, … Chunk N,
    each one a Blosc1 chunk.)


  14. Blosc2
    • Blosc1 only works with fixed-length, equal-sized
    chunks (blocks)
    • This can lead to a poor use of space to
    accommodate variable-length data (potentially
    large zero-paddings)
    • Blosc2 addresses this shortcoming by using
    superchunks of variable-length chunks


  15. ARM/NEON: a first-class citizen for Blosc2
    • At 3 GB/s, Blosc2 on ARM achieves one of the best
    bandwidth/Watt ratios in the market
    • Profound implications for the density of data storage
    devices (e.g. arrays of disks driven by ARM)
    (Benchmark chart: compression throughput without NEON vs. with NEON.)


  16. Other planned features for Blosc2
    • Looking into inter-chunk redundancies (delta filter)
    • Support for more codecs and filters
    • Serialized version of the super-chunk (disk, network)


  17. Blosc2 has its own repo
    https://github.com/Blosc/c-blosc2
    Meant to be used only after it has been heavily tested!
