Blosc/bcolz: Compressing Beyond the Limits of Memory

Blosc is an extremely fast compressor, while bcolz is a columnar data container that supports compression. Together they can change the current rules of the game in data storage and processing.

FrancescAlted

November 22, 2015

Transcript

  1. Blosc / bcolz
    Compressing data beyond the limits of memory
    Francesc Alted
    Freelance Consultant
    (Department of Geosciences, University of Oslo)
    Talk for PyConES 2015, Valencia, 2015

  2. About Me
    • Creator of libraries such as PyTables, Blosc, bcolz.
    I have maintained Numexpr for years.
    • Developer and instructor in areas such as:
    • Python (almost 15 years of experience)
    • High-performance computing and storage.
    • Consultant on data-processing projects.

  3. Motivation:
    The MovieLens dataset

    Materials at:
    https://github.com/FrancescAlted/PyConES2015

  4. The MovieLens Dataset
    • Datasets for movie ratings
    • Different sizes: 100K, 1M, 10M ratings (the 10M
    will be used in benchmarks ahead)
    • The datasets were collected over various
    periods of time

  5. Querying the MovieLens Dataset
    import pandas as pd
    import bcolz

    # Parse and load the CSV files using pandas (loading code not shown on the slide),
    # then merge them into a single dataframe
    lens = pd.merge(movies, ratings)

    # The pandas way of querying
    result = lens.query("(title == 'Tom and Huck (1995)') & (rating == 5)")['user_id']

    # Build a compressed bcolz ctable from the dataframe
    zlens = bcolz.ctable.fromdataframe(lens)

    # The bcolz way of querying (notice the use of the `where` iterator)
    result = [r.user_id for r in zlens.where(
        "(title == 'Tom and Huck (1995)') & (rating == 5)", outcols=['user_id'])]

  6. bcolz vs pandas (size)
    bcolz can store up to 20x more
    data than pandas

  7. Query Times
    3-year old laptop (Intel Ivy-Bridge, 2 cores)
    Compression speeds things up

  8. What?
    Queries on compressed data running faster than on
    uncompressed data? Really?
    [Figure labels: Data input, Data output, Decompression, Compression, Data process]

  9. Query Times
    5-year old laptop (Intel Core2, 2 cores)
    Compression still slows things down

  10. See my article:
    “Why Modern CPUs Are Starving And What You Can Do
    About It”
    Huge speed gap between CPUs and memory!

  11. Hierarchy of Memory
    By 2017 (Educated Guess): 9 levels will be common!
    • L1
    • L2
    • L3
    • L4
    • RAM (addressable)
    • XPoint (persistent)
    • SSD PCIe (persistent)
    • SSD SATA (persistent)
    • HDD (persistent)

  12. How can compression help?

  13. The same data take less storage
    Transmission + decompression faster than direct
    transfer?
    [Figure: Original Dataset vs. Compressed Dataset moving from persistent (disk) or ephemeral (RAM) storage, over the disk or memory bus, into the CPU cache, with a decompression step for the compressed one]
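
The question on this slide can be made concrete with a back-of-the-envelope cost model. What follows is only a sketch: it assumes that transfer and decompression happen one after the other (no overlap), and the function names and the numbers in the usage lines are hypothetical, not measurements from the talk.

```python
def direct_time(nbytes: float, bus_bw: float) -> float:
    """Seconds to move `nbytes` of uncompressed data over a bus of `bus_bw` bytes/s."""
    return nbytes / bus_bw


def compressed_time(nbytes: float, bus_bw: float, ratio: float, decomp_bw: float) -> float:
    """Seconds to move the compressed data and then decompress it.

    ratio     -- compression ratio (original size / compressed size)
    decomp_bw -- decompression speed, in bytes/s of decompressed output
    """
    return nbytes / (ratio * bus_bw) + nbytes / decomp_bw


def breakeven_decomp_bw(bus_bw: float, ratio: float) -> float:
    """Minimum decompression speed for which the compressed path wins.

    From nbytes/(ratio*bus_bw) + nbytes/decomp_bw < nbytes/bus_bw it follows
    that decomp_bw > bus_bw * ratio / (ratio - 1).
    """
    if ratio <= 1:
        raise ValueError("no break-even point without actual compression")
    return bus_bw * ratio / (ratio - 1)


# Hypothetical numbers: 1 GB of data, a 10 GB/s bus, a 4x compression ratio.
GB = 1e9
needed = breakeven_decomp_bw(10 * GB, 4)           # about 13.3 GB/s
fast = compressed_time(1 * GB, 10 * GB, 4, 20 * GB)  # decompressor above break-even
slow = compressed_time(1 * GB, 10 * GB, 4, 5 * GB)   # decompressor below break-even
baseline = direct_time(1 * GB, 10 * GB)
```

Under these assumptions the compressed path only pays off when the decompressor is faster than the bus by a margin that shrinks as the compression ratio grows.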

  14. Meet Blosc:
    A Compressor Designed For
    Modern CPUs

  15. Blosc Outstanding Features
    • Uses multi-threading
    • The shuffle part is accelerated using SSE2 and
    AVX2 (if available)
    • Supports different compressor backends:
    blosclz, lz4, snappy and zlib
    • Fine-tuned for using internal caches (mainly L1
    and L2)
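
To give a feel for what the shuffle step mentioned in this list does, here is a pure-Python sketch of a byte shuffle followed by zlib (one of the backends listed). It is an illustration only, not Blosc's SIMD implementation; the helper names `shuffle` and `unshuffle` and the synthetic input are assumptions made for the example.

```python
import struct
import zlib


def shuffle(buf: bytes, typesize: int) -> bytes:
    """Byte shuffle: group the i-th byte of every element together."""
    if len(buf) % typesize:
        raise ValueError("buffer length must be a multiple of typesize")
    return b"".join(buf[i::typesize] for i in range(typesize))


def unshuffle(buf: bytes, typesize: int) -> bytes:
    """Inverse of shuffle(): restore the original element layout."""
    if len(buf) % typesize:
        raise ValueError("buffer length must be a multiple of typesize")
    nelems = len(buf) // typesize
    out = bytearray(len(buf))
    for i in range(typesize):
        out[i::typesize] = buf[i * nelems:(i + 1) * nelems]
    return bytes(out)


# Synthetic data: 10,000 slowly increasing 64-bit integers (little-endian).
values = list(range(1_000_000, 1_010_000))
raw = struct.pack(f"<{len(values)}q", *values)

plain = zlib.compress(raw, 1)                   # compress the bytes as they are
shuffled = zlib.compress(shuffle(raw, 8), 1)    # shuffle first, then compress

print(len(raw), len(plain), len(shuffled))
```

On data like this, most of the high-order bytes are identical across elements, so grouping them produces long runs that the compressor handles far better than the interleaved layout.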

  16. Blosc: (de-)compressing faster than
    memory
    Reads from Blosc chunks up to 5x faster than memcpy()
    (on synthetic data)

  17. How Blosc Works
    Multithreading & SIMD at work!
    Figure attr: Valentin Haenel

  18. How Shuffling Works

  19. Compression matters!
    “Blosc compressors are the fastest ones out there at
    this point; there is no better publicly available option
    that I'm aware of. That's not just ‘yet another
    compressor library’ case.”
    — Ivan Smirnov
    (advocating for Blosc inclusion in h5py)

  20. Blosc ecosystem
    Small, but with big impact
    (thanks mainly to PyTables/pandas)
    [Figure nodes: Blosc, PyTables, pandas, bcolz, Castra, h5py, Bloscpack, scikit-allel, bquery, C / C++ world (e.g. OpenVDB)]

  21. Blosc In OpenVDB And Houdini
    “Blosc compresses almost as well as ZLIB, but
    it is much faster”
    –Release Notes for OpenVDB 3.0, maintained by DreamWorks Animation

  22. What is bcolz?
    • Provides a storage layer that is both chunked
    and compressible
    • It is meant for both memory and persistent
    storage (disk)
    • Containers come in two flavors: carray
    (multidimensional, homogeneous arrays) and
    ctable (tabular data, made of carrays)

  23. carray: Multidimensional
    Container for Homogeneous Data
    [Figure: a NumPy container occupies contiguous memory, while a carray container is split into chunk 1, chunk 2, ..., chunk N in discontiguous memory]
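
The chunked layout can be illustrated with a toy container. This is a minimal sketch, not bcolz's implementation: the class name `ToyCArray`, the use of zlib instead of Blosc, the restriction to one-dimensional 64-bit integers, and the uncompressed tail buffer are all assumptions made for the example.

```python
import struct
import zlib


class ToyCArray:
    """Toy chunked, compressed container of 64-bit signed integers."""

    def __init__(self, chunklen: int = 1024):
        self.chunklen = chunklen
        self.chunks = []     # full chunks, each one a separate compressed bytes object
        self.leftover = []   # uncompressed tail that does not fill a chunk yet
        self.nitems = 0

    def append(self, values):
        """Append values; a chunk is compressed as soon as it is full."""
        for v in values:
            self.leftover.append(v)
            self.nitems += 1
            if len(self.leftover) == self.chunklen:
                raw = struct.pack(f"<{self.chunklen}q", *self.leftover)
                self.chunks.append(zlib.compress(raw))
                self.leftover = []

    def __len__(self):
        return self.nitems

    def __getitem__(self, i: int) -> int:
        """Random access: decompress only the chunk that holds item i."""
        if not 0 <= i < self.nitems:
            raise IndexError(i)
        nchunk, offset = divmod(i, self.chunklen)
        if nchunk == len(self.chunks):
            return self.leftover[offset]
        raw = zlib.decompress(self.chunks[nchunk])
        return struct.unpack_from("<q", raw, offset * 8)[0]


a = ToyCArray(chunklen=100)
a.append(range(250))      # two compressed chunks plus 50 leftover items
```

Because every chunk is an independent object, appending never requires copying the existing data, which is the practical benefit of giving up contiguous memory.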

  24. “The future for me clearly involves lots of
    block-wise processing of multidimensional
    bcolz carrays”
    –Alistair Miles
    Head of Epidemiological Informatics for the Kwiatkowski group.
    Author of scikit-allel.

  25. The ctable Object
    [Figure: a ctable drawn as a set of carray columns, each split into chunks, with the new rows to append at the end of every column]
    • Chunks follow column order
    • Very efficient for querying
    • Adding or removing columns is cheap too
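
The three points above can be sketched with a toy column store. Everything here is hypothetical and chosen for illustration (the class `ToyCTable`, JSON plus zlib as the chunk format, the made-up rows); it is not how bcolz stores a ctable.

```python
import json
import zlib


class ToyCTable:
    """Toy column store: each column is a list of independently compressed chunks."""

    def __init__(self, names, chunklen=1000):
        self.names = list(names)
        self.chunklen = chunklen
        self.cols = {name: [] for name in self.names}

    def append_rows(self, rows):
        """Append rows (tuples in column order), splitting them column by column."""
        rows = list(rows)
        for start in range(0, len(rows), self.chunklen):
            block = rows[start:start + self.chunklen]
            for pos, name in enumerate(self.names):
                column_values = [row[pos] for row in block]
                self.cols[name].append(zlib.compress(json.dumps(column_values).encode()))

    def column(self, name):
        """Decompress one column chunk by chunk and yield its values."""
        for chunk in self.cols[name]:
            yield from json.loads(zlib.decompress(chunk))

    def where(self, predicate, needed, outcol):
        """Yield `outcol` for rows where predicate(*needed values) is true.

        Only the columns in `needed` and `outcol` are ever decompressed.
        """
        inputs = zip(*(self.column(name) for name in needed))
        for args, out in zip(inputs, self.column(outcol)):
            if predicate(*args):
                yield out

    def drop_column(self, name):
        """Removing a column touches no data of the other columns."""
        self.names.remove(name)
        del self.cols[name]


t = ToyCTable(["title", "rating", "user_id"], chunklen=2)
t.append_rows([
    ("Tom and Huck (1995)", 5, 10),   # made-up rows
    ("Some Other Movie", 5, 11),
    ("Tom and Huck (1995)", 3, 12),
    ("Tom and Huck (1995)", 5, 13),
])
hits = list(t.where(
    lambda title, rating: title == "Tom and Huck (1995)" and rating == 5,
    needed=["title", "rating"], outcol="user_id"))
```

A query only pays for the columns it names, and dropping a column is a dictionary deletion, which is the intuition behind the efficiency claims on this slide.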

  26. Persistency
    • carray and ctable objects can live on disk, not
    only in memory
    • bcolz allows every operation to be executed
    either in-memory or on-disk (out-of-core
    operations)
    • The recipe is to provide high performance
    iterators for carray and ctable, and then
    implement operations with these iterators
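
The recipe in the last bullet can be sketched with the standard library alone. The helpers `write_chunks`, `iterblocks` and `blockwise_sum`, the one-file-per-chunk layout and the use of zlib are assumptions made for this example; they only mimic the idea of an out-of-core operation written against a block iterator.

```python
import os
import struct
import tempfile
import zlib


def write_chunks(rootdir, values, chunklen):
    """Store a list of ints as compressed chunk files under `rootdir`."""
    for n, start in enumerate(range(0, len(values), chunklen)):
        block = values[start:start + chunklen]
        raw = struct.pack(f"<{len(block)}q", *block)
        with open(os.path.join(rootdir, f"{n:08d}.chunk"), "wb") as f:
            f.write(zlib.compress(raw))


def iterblocks(rootdir):
    """Yield one decompressed block (a tuple of ints) at a time."""
    for fname in sorted(os.listdir(rootdir)):
        with open(os.path.join(rootdir, fname), "rb") as f:
            raw = zlib.decompress(f.read())
        yield struct.unpack(f"<{len(raw) // 8}q", raw)


def blockwise_sum(blocks):
    """A reduction written only against the block iterator."""
    return sum(sum(block) for block in blocks)


with tempfile.TemporaryDirectory() as rootdir:
    write_chunks(rootdir, list(range(10_000)), chunklen=1_000)
    total = blockwise_sum(iterblocks(rootdir))
```

Only one block is held in memory at any time, and `blockwise_sum` would work unchanged if the blocks came from an in-memory container instead of files.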

  27. bcolz And The Memory
    Hierarchical Model
    • All the components of bcolz (including Blosc)
    are designed with the memory hierarchy in mind
    to get the best performance
    • Basically, bcolz uses the blocking technique
    extensively so as to leverage the temporal and
    spatial localities all along the hierarchy

  28. Streaming analytics with bcolz
    bcolz is meant to be simple: note the modular approach!
    [Figure: three layers, from bottom to top]
    • bcolz container (disk or memory)
    • bcolz iterators/filters with blocking: iter(), iterblocks(), where(), whereblocks(), __getitem__()
    • itertools, Dask, bquery, ...: map(), filter(), groupby(), sortby(), reduceby(), join()
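
The modular approach can be sketched with plain generators: the container only exposes iterators, and the analytics are composed on top of them. The list of blocks below stands in for a bcolz container, and `iter_rows`, `where` and `reduceby` are hypothetical helpers written for this example (they are not the bcolz, Dask or bquery functions of the same names).

```python
from collections import defaultdict

# Stand-in for a container: a list of blocks, each a list of (user_id, rating) rows.
blocks = [
    [(1, 5), (2, 3), (3, 5)],
    [(1, 4), (2, 5), (3, 5)],
]


def iter_rows(blocks):
    """Bottom layer: stream rows out of the container, block by block."""
    for block in blocks:
        yield from block


def where(rows, predicate):
    """Middle layer: a lazy filter over the row stream."""
    return (row for row in rows if predicate(row))


def reduceby(rows, key, op, init):
    """Top layer: fold the rows of each group into a single value."""
    acc = defaultdict(lambda: init)
    for row in rows:
        k = key(row)
        acc[k] = op(acc[k], row)
    return dict(acc)


# Count the 5-star ratings per user without materializing intermediate results.
five_star = where(iter_rows(blocks), lambda row: row[1] == 5)
counts = reduceby(five_star, key=lambda row: row[0],
                  op=lambda n, _row: n + 1, init=0)
```

Each layer only needs an iterable from the layer below, so any of them can be swapped for another tool that consumes or produces iterators.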

  29. bquery - On-Disk GroupBy
    In-memory (pandas) vs on-disk (bquery+bcolz) groupby
    “Switching to bcolz enabled us to have a much better scalable
    architecture yet with near in-memory performance”

    — Carst Vaartjes, co-founder visualfabriq

  30. Introducing Blosc2
    Next generation for Blosc
    [Figure: a Blosc2 container made of a header followed by Chunk 1, Chunk 2, Chunk 3, ..., Chunk N, each one a Blosc1 chunk]

  31. Blosc2
    • Blosc1 only works with fixed-length, equal-sized
    chunks (blocks)
    • This can lead to a poor use of space to
    accommodate variable-length data (potentially
    large zero-paddings)
    • Blosc2 addresses this shortcoming by using
    superchunks of variable-length chunks
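
The space argument can be checked with simple arithmetic. The sketch below compares fixed-size slots against variable-length storage; the function names, the 8-byte index entry per item and the item lengths are assumptions made for illustration and do not describe the actual Blosc2 format.

```python
def fixed_slot_usage(lengths, slot_size):
    """Return (bytes used, bytes of padding) when every item fills one fixed-size slot."""
    if max(lengths) > slot_size:
        raise ValueError("slot_size must fit the largest item")
    used = slot_size * len(lengths)
    return used, used - sum(lengths)


def variable_chunk_usage(lengths, index_entry=8):
    """Bytes used when items are stored back to back plus one index entry per item."""
    return sum(lengths) + index_entry * len(lengths)


# Hypothetical item sizes in bytes.
lengths = [100, 4000, 250, 4096]
fixed_used, padding = fixed_slot_usage(lengths, slot_size=4096)
variable_used = variable_chunk_usage(lengths)
```

With these made-up sizes, almost half of the fixed-slot storage is padding, while the variable-length layout only adds a small index.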

  32. ARM/NEON: a first-class citizen for Blosc2
    • At 3 GB/s, Blosc2 on ARM achieves one of the best
    bandwidth/Watt ratios in the market
    • Profound implications for the density of data storage
    devices (e.g. arrays of disks driven by ARM)
    [Figure: benchmark plots, not using NEON vs. using NEON]

  33. Other planned features for Blosc2
    • Looking into inter-chunk
    redundancies (delta filter)
    • Support for more codecs
    and filters
    • Serialized version of the
    super-chunk (disk, network)

  34. Summary
    • Due to the evolution of modern architectures,
    compression can be effective for
    two reasons:
    • You can work with more data using the
    same resources
    • The cost of compression can be reduced
    to zero, and even beyond!
