
Blosc/bcolz: Compressing beyond the limits of memory


Blosc is an extremely fast compressor, while bcolz is a columnar data container that supports compression. Together they can change the current rules of the game in data storage and processing.

Francesc Alted

November 22, 2015


  1. Blosc / bcolz
    Compressing data beyond the limits of memory
    Francesc Alted
    Freelance Consultant
    (Department of Geosciences, University of Oslo)
    Talk for PyConES 2015, Valencia, 2015

  2. About Me
    • Creator of libraries such as PyTables, Blosc and bcolz.
    Maintainer of Numexpr for many years.
    • Developer and teacher in areas such as:
    • Python (almost 15 years of experience)
    • High-performance computing and storage.
    • Consultant on data-processing projects.

  3. Motivation:
    The MovieLens dataset

    Materials at:
    https://github.com/FrancescAlted/PyConES2015

  4. The MovieLens Dataset
    • Datasets for movie ratings
    • Different sizes: 100K, 1M, 10M ratings (the 10M
    will be used in benchmarks ahead)
    • The datasets were collected over various
    periods of time


  5. Querying the MovieLens Dataset

    import pandas as pd
    import bcolz

    # Parse and load the CSV files using pandas,
    # then merge some files into a single dataframe
    lens = pd.merge(movies, ratings)

    # The pandas way of querying
    result = lens.query("(title == 'Tom and Huck (1995)') & (rating == 5)")['user_id']

    # The bcolz way of querying (notice the use of the `where` iterator)
    zlens = bcolz.ctable.fromdataframe(lens)
    result = [r.user_id for r in zlens.where(
        "(title == 'Tom and Huck (1995)') & (rating == 5)", outcols=['user_id'])]

  6. bcolz vs pandas (size)
    bcolz can store up to 20x more data than pandas

  7. Query Times
    3-year old laptop (Intel Ivy-Bridge, 2 cores)
    Compression speeds things up


  8. What?
    Queries on compressed data running faster than on
    uncompressed data? Seriously?
    [Diagram: data input → decompression → data process → compression → data output]

  9. Query Times
    5-year old laptop (Intel Core2, 2 cores)
    Compression still slows things down

  10. Why?

  11. See my article:
    “Why Modern CPUs Are Starving And What You Can Do
    About It”
    Huge speed gap between CPUs and memory!

  12. Hierarchy of Memory
    By 2017 (Educated Guess), 9 levels will be common:
    L1
    L2
    L3
    L4
    RAM (addressable)
    XPoint (persistent)
    SSD PCIe (persistent)
    SSD SATA (persistent)
    HDD (persistent)

  13. How can compression help?

  14. The same data takes less storage
    Is transmission + decompression faster than direct transfer?
    [Diagram: an original dataset vs a compressed dataset travelling
    over the disk or memory bus, from persistent (disk) or ephemeral
    (RAM) storage into the CPU cache, with decompression on the way]
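The arithmetic behind that picture is worth making explicit. The following is a minimal model; all throughput and size figures in it are illustrative assumptions, not measurements from the talk:

```python
def effective_time(size_gb, bus_gbps, ratio=1.0, decomp_gbps=None):
    """Seconds to get `size_gb` of logical data into the CPU.

    ratio: compression ratio (original/compressed); 1.0 = uncompressed.
    decomp_gbps: decompression throughput; None = no decompression step.
    """
    t = (size_gb / ratio) / bus_gbps      # move the (smaller) payload
    if decomp_gbps is not None:
        t += size_gb / decomp_gbps        # expand it back near the CPU
    return t

# Hypothetical figures: a 10 GB dataset, a 0.5 GB/s disk bus, a 3x
# compression ratio and 5 GB/s decompression (Blosc-class speed).
plain = effective_time(10, 0.5)                               # 20.0 s
compressed = effective_time(10, 0.5, ratio=3, decomp_gbps=5)  # under 9 s
print(plain, compressed)
```

With a slow bus and a fast decompressor, the compressed path wins; with a fast bus (e.g. RAM on an old CPU, as in the Core2 benchmark above) the decompression term dominates and compression loses.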

  15. Meet Blosc:
    A Compressor Designed For
    Modern CPUs

  16. Blosc Outstanding Features
    • Uses multi-threading
    • The shuffle part is accelerated using SSE2 and
    AVX2 (if available)
    • Supports different compressor backends:
    blosclz, lz4, snappy and zlib
    • Fine-tuned for using internal caches (mainly L1
    and L2)


  17. Blosc: (de-)compressing faster than
    memory
    Reads from Blosc chunks up to 5x faster than memcpy()
    (on synthetic data)


  18. How Blosc Works
    Multithreading & SIMD at work!
    Figure attribution: Valentin Haenel

  19. How Shuffling Works

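The shuffle filter mentioned above regroups the bytes of a block so that byte 0 of every element comes first, then byte 1, and so on. Here is a scalar NumPy sketch of the idea (Blosc's real filter does this with SSE2/AVX2 intrinsics; this is only an illustration):

```python
import zlib

import numpy as np


def shuffle(buf: bytes, typesize: int) -> bytes:
    """Byte-shuffle: gather byte 0 of every element, then byte 1, ..."""
    return np.frombuffer(buf, dtype=np.uint8).reshape(-1, typesize).T.tobytes()


def unshuffle(buf: bytes, typesize: int) -> bytes:
    """Inverse transform: interleave the byte planes back into elements."""
    return np.frombuffer(buf, dtype=np.uint8).reshape(typesize, -1).T.tobytes()


# Small ints stored as int32: after shuffling, the three high bytes of
# every element form long runs of (near-)identical bytes, which any fast
# codec (blosclz, lz4, ...; zlib here as a stand-in) compresses well.
data = np.arange(1000, dtype=np.int32).tobytes()
shuffled = shuffle(data, 4)

assert unshuffle(shuffled, 4) == data  # the transform is lossless
print(len(zlib.compress(data)), len(zlib.compress(shuffled)))
```

On data like this the shuffled buffer compresses noticeably smaller, which is why shuffling typically pays for itself on numerical arrays.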

  20. Compression matters!
    “Blosc compressors are the fastest ones out there at
    this point; there is no better publicly available option
    that I'm aware of. That's not just ‘yet another
    compressor library’ case.”
    — Ivan Smirnov
    (advocating for Blosc inclusion in h5py)


  21. Blosc ecosystem
    Small, but with big impact

    (thanks mainly to PyTables/pandas)
    Blosc
    PyTables
    pandas
    bcolz Castra
    h5py
    Bloscpack
    scikit-allel
    bquery
    C / C++ world
    (e.g. OpenVDB)


  22. Blosc In OpenVDB
    And Houdini
    “Blosc compresses almost as well as ZLIB, but
    it is much faster”
    — Release Notes for OpenVDB 3.0, maintained by DreamWorks Animation

  23. What is bcolz?
    • Provides a storage layer that is both chunked
    and compressible
    • It is meant for both memory and persistent
    storage (disk)
    • Containers come in two flavors: carray
    (multidimensional, homogeneous arrays) and
    ctable (tabular data, made of carrays)
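The chunked-and-compressed storage idea can be sketched in a few lines of plain Python. This toy class (the name `ToyCarray` is mine, and zlib stands in for Blosc; it is not how bcolz is actually implemented) shows the essential mechanics: data lives as independently compressed chunks, and only the chunk you touch gets decompressed:

```python
import zlib

import numpy as np


class ToyCarray:
    """Toy sketch of bcolz's carray idea: a 1-d array stored as a
    list of independently compressed chunks (zlib stands in for Blosc)."""

    def __init__(self, data: np.ndarray, chunklen: int = 4096):
        self.dtype = data.dtype
        self.chunklen = chunklen
        self.length = len(data)
        self.chunks = [zlib.compress(data[i:i + chunklen].tobytes())
                       for i in range(0, len(data), chunklen)]

    def __len__(self):
        return self.length

    def __getitem__(self, i: int):
        # Only the chunk holding item i is decompressed, never the
        # whole container: the key to out-of-core friendliness.
        raw = zlib.decompress(self.chunks[i // self.chunklen])
        return np.frombuffer(raw, dtype=self.dtype)[i % self.chunklen]

    @property
    def cbytes(self):
        """Total compressed size in bytes."""
        return sum(len(c) for c in self.chunks)


a = np.arange(100_000, dtype=np.int64)
ca = ToyCarray(a)
print(ca[12345], ca.cbytes, a.nbytes)  # element access + size comparison
```

On regular data like this, the compressed container is far smaller than the contiguous NumPy buffer, which is exactly the "20x more data than pandas" effect claimed earlier, in miniature.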

  24. carray: Multidimensional
    Container for Homogeneous Data
    [Diagram: a NumPy container in contiguous memory vs a carray
    container made of chunks 1..N in discontiguous memory]

  25. — Alistair Miles
    Head of Epidemiological Informatics for the Kwiatkowski group.
    Author of scikit-allel.
    “The future for me clearly involves lots of
    block-wise processing of multidimensional
    bcolz carrays”

  26. The ctable Object
    [Diagram: a ctable as a set of carray columns, each made of
    chunks, with new rows appended at the end]
    • Chunks follow column order
    • Very efficient for querying
    • Adding or removing columns is cheap too

  27. Persistency
    • carray and ctable objects can live on disk, not
    only in memory
    • bcolz allows every operation to be executed
    either in-memory or on-disk (out-of-core
    operations)
    • The recipe is to provide high performance
    iterators for carray and ctable, and then
    implement operations with these iterators

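That recipe, build fast block iterators first, then express operations on top of them, can be illustrated with a pure-NumPy toy. This generator mirrors the spirit of bcolz's `whereblocks()` iterator (the function and the demo records below are my own sketch, not bcolz code):

```python
import numpy as np


def toy_whereblocks(blocks, predicate):
    """Walk blocks one at a time and yield only the matching rows,
    so the whole table never needs to be in memory at once.

    blocks: any iterable of NumPy record arrays (one block each).
    predicate: function mapping a block to a boolean mask.
    """
    for block in blocks:
        mask = predicate(block)
        if mask.any():
            yield block[mask]


# Assumed demo data: two blocks of (user_id, rating) records.
dt = np.dtype([('user_id', 'i4'), ('rating', 'i4')])
blocks = [np.array([(1, 5), (2, 3)], dtype=dt),
          np.array([(3, 5), (4, 4)], dtype=dt)]

hits = np.concatenate(list(toy_whereblocks(blocks, lambda b: b['rating'] == 5)))
print(hits['user_id'].tolist())  # [1, 3]
```

Because the filter is applied per block, peak memory is bounded by one block, whether the blocks come from RAM or from disk; that is the whole point of implementing operations on top of iterators.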

  28. bcolz And The Memory
    Hierarchical Model
    • All the components of bcolz (including Blosc)
    are designed with the memory hierarchy in mind
    to get the best performance
    • Basically, bcolz uses the blocking technique
    extensively so as to leverage the temporal and
    spatial localities all along the hierarchy


  29. Streaming analytics with bcolz
    bcolz is meant to be simple: note the modular approach!
    [Diagram: a bcolz container (disk or memory) feeds the bcolz
    iterators/filters with blocking (iter(), iterblocks(), where(),
    whereblocks(), __getitem__()), on top of which itertools, Dask,
    bquery, etc. provide map(), filter(), groupby(), sortby(),
    reduceby(), join()]
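The modularity claim is concrete: because the container just hands out iterators, the upper layer can be nothing more than stdlib `itertools`. A small sketch (the `(title, rating)` rows are invented for illustration; a real source would be a `where()`-style iterator over a ctable) of a streaming groupby-mean:

```python
from itertools import groupby

# Rows as a where()-style iterator would yield them; groupby() needs
# the stream already ordered by the grouping key.
rows = iter([('Heat', 4), ('Heat', 5), ('Toy Story', 3), ('Toy Story', 5)])

# Mean rating per title, built purely from iterator plumbing: rows
# stream through and only one group is materialized at a time.
means = {}
for title, grp in groupby(rows, key=lambda r: r[0]):
    ratings = [rating for _, rating in grp]
    means[title] = sum(ratings) / len(ratings)

print(means)  # {'Heat': 4.5, 'Toy Story': 4.0}
```

Tools like bquery or Dask implement the same pattern with better constants (and without requiring pre-sorted input), but the composition model is the one shown here.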

  30. bquery - On-Disk GroupBy
    In-memory (pandas) vs on-disk (bquery+bcolz) groupby
    “Switching to bcolz enabled us to have a much better scalable
    architecture yet with near in-memory performance”
    — Carst Vaartjes, co-founder visualfabriq

  31. Introducing Blosc2
    Next generation for Blosc
    Blosc2
    Header
    Chunk 1 (Blosc1)
    Chunk 2 (Blosc1)
    Chunk 3 (Blosc1)
    Chunk N (Blosc1)


  32. Blosc2
    • Blosc1 only works with fixed-length, equal-sized
    chunks (blocks)
    • This can lead to a poor use of space to
    accommodate variable-length data (potentially
    large zero-paddings)
    • Blosc2 addresses this shortcoming by using
    superchunks of variable-length chunks

  33. ARM/NEON: a first-class citizen for Blosc2
    • At 3 GB/s, Blosc2 on ARM achieves one of the best
    bandwidth/Watt ratios in the market
    • Profound implications for the density of data storage
    devices (e.g. arrays of disks driven by ARM)
    [Figure: throughput without NEON vs with NEON]

  34. Other planned features for Blosc2
    • Looking into inter-chunk
    redundancies (delta filter)
    • Support for more codecs
    and filters
    • Serialized version of the
    super-chunk (disk, network)

  35. Summary
    • Due to the evolution of modern
    architectures, compression can be effective for
    two reasons:
    • You can work with more data using the
    same resources
    • The cost of compression can be reduced
    to zero, and even beyond!

  36. Questions?
    [email protected]