
Squeeze (Gently) Your (Big) Data

The slides for my keynote at PyData Barcelona

Francesc Alted

May 21, 2017


Transcript

  1. Squeeze (Gently) Your
    (Big) Data
    Francesc Alted
    Freelance Consultant
    http://www.blosc.org/professional-services.html
    Barcelona, May 21st, 2017


  2. About Me
    • Physicist by training
    • Computer scientist by passion
    • Open Source enthusiast by philosophy
    • PyTables (2002 - 2011, 2017)
    • Blosc (2009 - now)
    • bcolz (2010 - now)


  3. Why Open Source Projects?
    • Nice way to realize yourself while helping others
    “The art is in the execution of an idea. Not in the
    idea. There is not much left just from an idea.”
    –Manuel Oltra, music composer
    “Real artists ship”
    –Seth Godin, writer


  4. OPSI
    Out-of-core
    Expressions
    Indexed
    Queries
    + a Twist


  5. OPSI
    Out-of-core
    Expressions
    Indexed
    Queries
    + a Twist


  6. Overview
    • Compression through the years
    • The need for speed: storing and processing as
    much data as possible with your existing resources
    • Chunked data containers
    • How machine learning can help compress
    better and faster


  7. Compression
    Through the Years


  8. Compressing Usenet News
    • Circa 1993/1994
    • Initially getting the data stream at 9600 baud and
    then upgraded to 64 Kbit/s (yeah, that was fast!).
    • HP 9000-730 with a speedy PA-7000 RISC
    microprocessor @ 66 MHz, running HP-UX.


  9. Compress for Improving
    Transmission Speed
    [Diagram: the remote news server compresses the original news set, sends it
    over the transmission line, and the local news server decompresses it back
    into the original news set.]
    Is compression + transmission + decompression faster than direct transfer?
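
    A back-of-the-envelope way to answer that question (all numbers below are
    illustrative assumptions, not measurements):

    # Is compress + transmit + decompress faster than transmitting raw?
    size_mb = 100.0             # size of the news batch, in MB
    line_mb_s = 64e3 / 8 / 1e6  # 64 Kbit/s line, expressed in MB/s
    ratio = 3.0                 # assumed compression ratio
    comp_mb_s = 1.0             # assumed compression speed on a 1993-era CPU, MB/s
    decomp_mb_s = 2.0           # assumed decompression speed, MB/s

    t_direct = size_mb / line_mb_s
    t_comp = size_mb / comp_mb_s + (size_mb / ratio) / line_mb_s + size_mb / decomp_mb_s
    print(t_direct, t_comp)     # on a slow line, the compressed path wins by far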


  10. Compression Advantage at
    Different Bandwidths (1993)
    The faster the transmission line, the lower the
    compression level should be, so as to maximise the total
    amount of transmitted data (effective bandwidth).


  11. Nowadays Computers
    CPUs are so fast that the memory bus is a bottleneck
    -> compression can improve memory bandwidth,
    and hence, potentially accelerate computations!


  12. Improving RAM Speed?
    Less data needs to be transmitted to the CPU
    [Diagram: the compressed dataset travels from memory (RAM) over the memory
    bus and is decompressed into the CPU cache, instead of transmitting the
    original dataset.]
    Is transmission + decompression faster than direct transfer?


  13. Recent Trends In
    Computer CPUs


  14. Memory Access Time
    vs CPU Cycle Time
    The gap is wide and still widening!


  15. Reported CPU Usage is
    Usually Wrong
    http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html
    — Brendan Gregg
    • The goal is to reduce the ‘Waiting (“stalled”)’ state to a minimum.


  16. Computer Architecture
    Evolution
    [Figure 1: evolution of the hierarchical memory model. (a) Up to the end of
    the 80’s: CPU, main memory, mechanical disk. (b) 90’s and 2000’s: CPU, level 1
    and level 2 caches, main memory, mechanical disk. (c) 2010’s: CPU, level 1/2/3
    caches, main memory, solid state disk, mechanical disk. Speed grows towards
    the CPU, capacity towards the disk.]


  17. Hierarchy of Memory
    (Circa 2017)
    8 levels are possible!
    • L1 / L2 / L3 caches
    • RAM (addressable)
    • 3D XPoint (Optane) (persistent)
    • SSD PCIe (persistent)
    • SSD SATA (persistent)
    • HDD (persistent)


  18. Forthcoming Trends
    CPU+GPU Integration


  19. Blosc: A compressor that Takes
    Advantage of Vector and Parallel
    Hardware
    (Image courtesy of Morgan Kaufmann, imprint of Elsevier, all rights reserved)


  20. Blosc: Decompressing
    Faster Than memcpy()
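
    A rough way to sanity-check that claim on your own machine (a sketch assuming
    the python-blosc package and NumPy; the synthetic array below is very
    compressible, which is the favourable case for Blosc):

    import time
    import blosc
    import numpy as np

    a = np.arange(10**7, dtype=np.int64)      # ~80 MB of highly compressible data
    raw = a.tobytes()
    blosc.set_nthreads(4)
    packed = blosc.compress(raw, typesize=8, clevel=9, cname='blosclz')

    t0 = time.time(); b = a.copy(); t_copy = time.time() - t0
    t0 = time.time(); out = blosc.decompress(packed); t_decomp = time.time() - t0
    print("plain copy: %.4fs   blosc decompress: %.4fs" % (t_copy, t_decomp))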


  21. Blosc: A Meta-Compressor
    With Many Knobs
    • Blosc accepts different:
    • Compression levels: from 0 to 9
    • Codecs: “blosclz”, “lz4”, “lz4hc”, “snappy”, “zlib” and “zstd”
    • Different filters: “shuffle” and “bitshuffle”
    • Number of threads
    • Block sizes (the chunk is split in blocks internally)
    Nice opportunity for fine tuning for a specific setup!
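
    For reference, these knobs map directly onto the python-blosc API (a minimal
    sketch; the parameter values below are just one possible choice):

    import blosc
    import numpy as np

    data = np.linspace(0, 100, 10**7).tobytes()

    blosc.set_nthreads(4)     # number of threads
    blosc.set_blocksize(0)    # block size in bytes; 0 means automatic
    packed = blosc.compress(data,
                            typesize=8,                # element size, used by the filter
                            clevel=5,                  # compression level: 0 to 9
                            shuffle=blosc.BITSHUFFLE,  # or blosc.SHUFFLE / blosc.NOSHUFFLE
                            cname='zstd')              # blosclz, lz4, lz4hc, snappy, zlib, zstd
    print("ratio: %.1fx" % (len(data) / len(packed)))
    assert blosc.decompress(packed) == data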


  22. Accelerating I/O With
    Blosc
    [Figure: the memory hierarchy from mechanical disk up to the CPU caches
    (capacity grows downwards, speed upwards); Blosc targets the fast levels
    (caches and main memory), while other compressors target the solid state
    and mechanical disks.]


  23. Use Case:
    Handling Data Streams at 500K messages/s


  24. Requirements
    • Being able to handle and ingest several data
    streams simultaneously
    • Speed for the aggregated streams can be up to
    500K messages/sec
    • Each message can host between 10 and 100
    different fields (string, float, int, bool)


  25. Ingesting Streams
    [Diagram: Streams 1..N feed a Repeater, which forwards them as
    Blosc-compressed gRPC streams to Ingestors 1..M; each ingestor writes an
    HDF5 file using the Blosc filter.]


  26. Detail of the Repeater
    [Diagram: Publisher threads 1..N receive Streams 1..N and feed queues;
    Subscriber threads N+1..N+M drain the queues and emit the compressed gRPC
    streams.]


  27. Detail of the Queue
    • Every thread-safe queue is compressed with Blosc once it is filled up, for
    lower memory consumption and faster transmission.
    [Diagram: event fields 1..N are put into thread-safe queues, compressed into
    compressed fields 1..N, and sent over the gRPC streams as Blosc-compressed
    gRPC buffers.]
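
    A minimal sketch of that per-field idea, assuming python-blosc and NumPy (the
    helper names and fields are hypothetical, not the actual project code):

    import blosc
    import numpy as np

    # Hypothetical helpers: one Blosc-packed buffer per event field (column)
    def pack_fields(fields):
        return {name: blosc.pack_array(col) for name, col in fields.items()}

    def unpack_fields(packed):
        return {name: blosc.unpack_array(buf) for name, buf in packed.items()}

    batch = {"price": np.random.random(100000),
             "volume": np.random.randint(0, 1000, 100000)}
    wire = pack_fields(batch)     # these buffers would travel as gRPC message payloads
    restored = unpack_fields(wire)
    assert np.array_equal(batch["price"], restored["price"])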


  28. The Role of Compression
    • Compression allowed a reduction of ~5x in both transmission
    and storage.
    • It was used throughout the whole project:
    • In gRPC buffers, for improved memory consumption in the
    Repeater queues and faster transmission
    • In HDF5, so as to greatly reduce the disk usage and
    ingestion time
    • The system was able to ingest more than 500K messages/sec
    (~650K messages/sec in our setup using a single machine with
    >16 physical cores). Not possible without compression!


  29. Use Cases Where
    Compression
    Accelerates Computation


  30. Case 1: Compression in
    Machine Learning
    “When Lempel-Ziv-Welch Meets Machine Learning: A Case Study of Accelerating
    Machine Learning using Coding”
    Fengan Li et al. (mainly Google and UW-Madison)


  31. Case 1: Compression in
    Machine Learning
    TOC: Tuple Oriented Coding, a specialised LZW
    (article available at: https://arxiv.org/abs/1702.06943)


  32. Case 2: Transferring
    Compressed Data to GPUs
    http://dl.acm.org/citation.cfm?id=3076122
    Eyal Rozenberg, Peter Boncz

    CWI, Amsterdam


  33. Chunked Data
    Containers


  34. Some Examples of Chunked
    Data Containers
    • On-disk:
    • HDF5 (https://support.hdfgroup.org/HDF5/)
    • NetCDF4 (https://www.unidata.ucar.edu/software/
    netcdf/)
    • In-memory (although they can be on-disk too):
    • bcolz (https://github.com/Blosc/bcolz)
    • zarr (https://github.com/alimanfoo/zarr)


  35. HDF5, the Grand Daddy of
    On-disk, Chunked Containers
    • Started back in 1998 at NCSA
    (with the support of NASA)
    • Great adoption in many fields,
    including science, engineering
    and finance
    • Maintained by The HDF
    Group, a non-profit
    corporation
    • Two major Python wrappers:
    h5py and PyTables
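
    From Python, the Blosc filter can be used right away through PyTables (a
    minimal sketch; the file and array names are made up):

    import numpy as np
    import tables

    # Blosc (with the LZ4 codec) as the HDF5 compression filter
    filters = tables.Filters(complevel=5, complib='blosc:lz4', shuffle=True)
    with tables.open_file('example.h5', mode='w') as f:
        f.create_carray(f.root, 'data', obj=np.arange(10**7), filters=filters)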


  36. bcolz: Chunked,
    Compressed Tables
    • ctable objects in bcolz store the data column-wise.
    This gives better performance for big tables, as well
    as a better compression ratio.
    • Efficient shrinks and appends: you can shrink the
    objects or append more data at their end very
    efficiently.
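
    A minimal bcolz sketch of such a compressed, column-wise table (the column
    names and data are illustrative):

    import numpy as np
    import bcolz

    N = 10**6
    ct = bcolz.ctable([np.random.random(N), np.random.randint(0, 100, N)],
                      names=['price', 'qty'],
                      cparams=bcolz.cparams(clevel=5, cname='lz4'))
    ct.append((0.5, 42))   # cheap append of one row at the end
    print(ct)              # the repr also reports the achieved compression ratio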


  37. bcolz vs pandas (size)
    bcolz can store 20x more data than pandas by using compression
    Source: https://github.com/Blosc/movielens-bench


  38. Query Times for bcolz
    2012 laptop (Intel Ivy-Bridge, 2 cores)
    Compression speeds things up
    Source: https://github.com/Blosc/movielens-bench


  39. Query Times for bcolz
    2010 laptop (Intel Core2, 2 cores)
    Compression still slows things down
    Source: https://github.com/Blosc/movielens-bench


  40. zarr: Chunked, Compressed,
    N-dimensional arrays
    • Create N-dimensional arrays with any NumPy
    dtype.
    • Chunk arrays along any dimension.
    • Compress chunks using Blosc or alternatively zlib,
    BZ2 or LZMA.
    • Created by Alistair Miles from MRC Centre for
    Genomics and Global Health for handling genomic
    data in-memory.
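
    A minimal zarr sketch (assuming a recent zarr with the numcodecs Blosc codec;
    shapes and parameters are illustrative):

    import numpy as np
    import zarr
    from numcodecs import Blosc

    compressor = Blosc(cname='zstd', clevel=5, shuffle=Blosc.SHUFFLE)
    z = zarr.array(np.random.randint(0, 4, size=(10000, 1000), dtype='i1'),
                   chunks=(1000, 1000), compressor=compressor)
    print(z.nbytes / z.nbytes_stored)   # achieved compression ratio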


  41. http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html


  42. http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html


  43. How Machine Learning
    Can Help Compress
    Better and Faster


  44. Fine-Tuning Blosc
    • Blosc accepts different:
    • Compression levels: from 0 to 9
    • Codecs: “blosclz”, “lz4”, “lz4hc”, “snappy”, “zlib” and “zstd”
    • Different filters: “shuffle” and “bitshuffle”
    • Number of threads
    • Block sizes (the chunk is split in blocks internally)
    Question: how to choose the best candidates for maximum
    speed? Or for maximum compression? Or for the right balance?


  45. Answer: Use Machine
    Learning
    • The user gives hints on what she prefers:
    • Maximum compression ratio
    • Maximum compression speed
    • Maximum decompression speed
    • A balance between all the above
    • Based on that, and the characteristics of the data to be
    compressed, the training step gives hints on the optimal
    Blosc parameters to be used in new datasets.
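
    A hypothetical sketch of the general idea (not the actual training pipeline):
    brute-force good Blosc settings on a few training chunks, then fit a classifier
    that predicts settings for new chunks from cheap features:

    # Hypothetical sketch; feature choice, scoring and model are illustrative only.
    import time
    import blosc
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def chunk_features(chunk):
        # Cheap descriptors of a chunk: byte entropy and size
        b = np.frombuffer(chunk, dtype=np.uint8)
        p = np.bincount(b, minlength=256) / float(len(b))
        return [-np.sum(p[p > 0] * np.log2(p[p > 0])), len(chunk)]

    def best_params(chunk):
        # Exhaustive search over a few codecs/levels, scoring a speed/ratio balance
        best, best_score = None, -1.0
        for cname in ("blosclz", "lz4", "zstd"):
            for clevel in (1, 5, 9):
                t0 = time.time()
                c = blosc.compress(chunk, typesize=8, clevel=clevel, cname=cname)
                score = (len(chunk) / len(c)) * (len(chunk) / (time.time() - t0 + 1e-9))
                if score > best_score:
                    best, best_score = "%s-%d" % (cname, clevel), score
        return best

    chunks = [np.random.randint(0, 100, 2**16, dtype=np.int64).tobytes()
              for _ in range(20)]
    clf = RandomForestClassifier(n_estimators=50)
    clf.fit([chunk_features(c) for c in chunks], [best_params(c) for c in chunks])

    # For a new chunk, skip the brute force and just predict the parameters
    new_chunk = np.random.randint(0, 100, 2**16, dtype=np.int64).tobytes()
    print(clf.predict([chunk_features(new_chunk)]))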


  46. Predicting Params in New
    Datasets
    Credit: Alberto Sabater


  47. Prediction Time Still Large
    • We still need to shave off a ~10x factor in prediction time before we can
    predict for every chunk.
    • Alternatively, one may reuse a prediction for several chunks in a row.


  48. Summary


  49. The Age of Compression Is
    Now
    • Due to the evolution of computer architecture,
    compression can now be effective for two reasons:
    • We can work with more data using the same
    resources.
    • We can reduce the overhead of compression to
    near zero, and even beyond that!
    We are definitely entering an age where
    compression will be used much more ubiquitously.


  50. But Beware: We Need More Data
    Chunking In Our Infrastructure!
    • Not many data libraries focus on chunked data
    containers nowadays.
    • No silver bullet: we won’t be able to find a single
    container that makes everybody happy; it’s all
    about tradeoffs.
    • With chunked containers we can use persistent
    media (disk) as if it were ephemeral (memory), and the
    other way around -> independence from the media!


  51. When you are short of memory, do not blindly try to
    use different nodes in parallel:
    First give compression an opportunity to squeeze
    all the capabilities out of your single box.
    (You can always parallelise later on ;)


  52. ¡Gràcies! (Thank you!)
