Squeeze (Gently) Your (Big) Data

The slides for my keynote at PyData Barcelona.

Francesc Alted

May 21, 2017

Transcript

  1. Squeeze (Gently) Your (Big) Data Francesc Alted Freelance Consultant http://www.blosc.org/professional-services.html

    Barcelona, May 21st, 2017
  2. About Me • Physicist by training • Computer scientist by

    passion • Open Source enthusiast by philosophy • PyTables (2002 - 2011, 2017) • Blosc (2009 - now) • bcolz (2010 - now)
  3. Why Open Source Projects?

    “The art is in the execution of an idea. Not in the idea. There is not much left just from an idea.” –Manuel Oltra, music composer • “Real artists ship.” –Seth Godin, writer • A nice way to realize yourself while helping others
  4. OPSI: Out-of-core Expressions, Indexed Queries + a Twist

  6. Overview • Compression through the years • The need for

    speed: storing and processing as much data as possible with your existing resources • Chunked data containers • How machine learning can help compress better and faster
  7. Compression Through the Years

  8. Compressing Usenet News • Circa 1993/1994 • Initially getting the

    data stream at 9600 baud, later upgraded to 64 Kbit/s (yeah, that was fast!). • HP 9000-730 with a speedy PA-7000 RISC microprocessor @ 66 MHz, running HP-UX.
  9. Compress for Improving Transmission Speed

    [Diagram: the remote news server compresses the original news set, ships it over the transmission line, and the local news server decompresses it. Is compression + transmission + decompression faster than a direct transfer?]
  10. Compression Advantage at Different Bandwidths (1993)

    The faster the transmission line, the lower the compression level that maximises the total amount of transmitted data (effective bandwidth).
  11. Nowadays Computers

    CPUs are so fast that the memory bus is a bottleneck -> compression can improve effective memory bandwidth and hence potentially accelerate computations!
  12. Improving RAM Speed?

    Less data needs to be transmitted to the CPU. [Diagram: a compressed dataset travels from memory (RAM) over the memory bus and is decompressed in the CPU cache. Is transmission + decompression faster than a direct transfer of the original dataset?]
  13. Recent Trends In Computer CPUs

  14. Memory Access Time vs CPU Cycle Time The gap is

    wide and still widening!
  15. Reported CPU Usage is Usually Wrong

    • The goal is to reduce the ‘Waiting (“stalled”)’ state to a minimum. http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html — Brendan Gregg
  16. Computer Architecture Evolution

    [Figure 1. Evolution of the hierarchical memory model: (a) up to the end of the 80's, the primordial (and simplest) model: CPU, main memory, mechanical disk; (b) the 90's and 2000's: level 1 and level 2 caches appear between the CPU and main memory; (c) the 2010's: a level 3 cache and a solid state disk join the hierarchy. Speed grows towards the CPU, capacity towards the disk.]
  17. Hierarchy of Memory (Circa 2017)

    L1 cache -> L2 cache -> L3 cache -> RAM (addressable) -> 3D XPoint (Optane, persistent) -> SSD PCIe (persistent) -> SSD SATA (persistent) -> HDD (persistent). 8 levels are possible!
  18. Forthcoming Trends: CPU+GPU Integration

  19. Blosc: A Compressor That Takes Advantage of Vector and Parallel

    Hardware (Image courtesy of Morgan Kaufmann, imprint of Elsevier, all rights reserved)
  20. Blosc: Decompressing Faster Than memcpy()
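
    Whether decompression can really beat a plain copy depends on the data, the codec and the core count; a minimal benchmark sketch with python-blosc (sizes and parameters are illustrative):

      import time

      import blosc
      import numpy as np

      # Linearly spaced doubles are highly compressible once shuffled
      a = np.linspace(0, 100, 10_000_000)
      data = a.tobytes()
      packed = blosc.compress(data, typesize=8, clevel=9,
                              shuffle=blosc.SHUFFLE, cname='blosclz')

      t0 = time.perf_counter()
      copied = a.tobytes()                 # plain memory copy (~memcpy)
      t_copy = time.perf_counter() - t0

      t0 = time.perf_counter()
      restored = blosc.decompress(packed)  # multithreaded, SIMD-enabled
      t_decomp = time.perf_counter() - t0

      print(f"copy: {t_copy:.4f}s  decompress: {t_decomp:.4f}s")
      assert restored == data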

  21. Blosc: A Meta-Compressor With Many Knobs

    Blosc accepts different: • Compression levels: from 0 to 9 • Codecs: “blosclz”, “lz4”, “lz4hc”, “snappy”, “zlib” and “zstd” • Filters: “shuffle” and “bitshuffle” • Number of threads • Block sizes (each chunk is split into blocks internally). A nice opportunity for fine-tuning for a specific setup! (See the sketch below.)
  22. Accelerating I/O With Blosc

    [Diagram: the memory hierarchy from the L1/L2/L3 caches through main memory down to solid state and mechanical disks, ordered by speed vs capacity. Blosc operates at the fast (memory) levels; other compressors sit at the slower I/O levels.]
  23. Use Case: Handling Data Streams at 500K messages/s

  24. Requirements

    • Being able to handle and ingest several data streams simultaneously • The aggregated streams can reach up to 500K messages/sec • Each message can host between 10 and 100 different fields (string, float, int, bool)
  25. Ingesting Streams

    [Diagram: streams 1..N feed a Repeater, which fans out Blosc-compressed gRPC streams to ingestors 1..M; each ingestor writes an HDF5 file using the Blosc filter.]
  26. Detail of the Repeater

    [Diagram: publisher threads 1..N receive streams 1..N and feed queues; subscriber threads N+1..N+M drain the queues and emit compressed gRPC streams.]
  27. Detail of the Queue

    • Every thread-safe queue is compressed with Blosc once it fills up, for reduced memory consumption and faster transmission. [Diagram: event fields 1..N, held in thread-safe queues, are compressed field by field into Blosc-compressed gRPC buffers.] A minimal sketch of the idea follows.
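
    A minimal sketch of that idea, assuming float64 events and a made-up batch size (the real system shipped the resulting buffers over gRPC):

      import queue

      import blosc
      import numpy as np

      BATCH = 4096  # hypothetical: compress once this many events accumulate

      def drain_and_compress(q):
          """Drain up to BATCH events from a thread-safe queue and return
          a single Blosc-compressed buffer, ready to be sent downstream."""
          events = []
          while len(events) < BATCH:
              try:
                  events.append(q.get_nowait())
              except queue.Empty:
                  break
          buf = np.asarray(events, dtype=np.float64)
          # shuffle pays off here: every value in the batch has the same type
          return blosc.compress(buf.tobytes(), typesize=8, clevel=5,
                                shuffle=blosc.SHUFFLE, cname='lz4')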
  28. The Role of Compression

    • Compression allowed a reduction of ~5x in both transmission and storage. • It was used throughout the project: • In gRPC buffers, for reduced memory consumption in the Repeater queues and faster transmission • In HDF5, so as to greatly reduce disk usage and ingestion time • The system was able to ingest more than 500K messages/sec (~650K messages/sec in our setup, using a single machine with >16 physical cores). Not possible without compression! (See the HDF5 sketch below.)
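
    A sketch of the HDF5 side using PyTables, whose built-in Blosc filter the pipeline relied on (the 3-field event schema is made up; real messages had 10-100 fields):

      import numpy as np
      import tables

      event_dtype = np.dtype([('ts', 'f8'), ('price', 'f8'), ('qty', 'i4')])

      # complib selects Blosc with the LZ4 codec inside the HDF5 filter
      filters = tables.Filters(complevel=5, complib='blosc:lz4', shuffle=True)
      with tables.open_file('events.h5', mode='w') as h5:
          table = h5.create_table('/', 'events', description=event_dtype,
                                  filters=filters)
          batch = np.zeros(10_000, dtype=event_dtype)  # one ingested batch
          table.append(batch)                          # compressed on the fly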
  29. Use Cases Where Compression Accelerates Computation

  30. Case 1: Compression in Machine Learning

    “When Lempel-Ziv-Welch Meets Machine Learning: A Case Study of Accelerating Machine Learning using Coding”, Fengan Li et al. (mainly Google and UW-Madison)
  31. Case 1: Compression in Machine Learning

    TOC: Tuple Oriented Coding, a specialised LZW variant (article available at https://arxiv.org/abs/1702.06943)
  32. Case 2: Transferring Compressed Data to GPUs

    Eyal Rozenberg, Peter Boncz (CWI, Amsterdam). http://dl.acm.org/citation.cfm?id=3076122
  33. Chunked Data Containers

  34. Some Examples of Chunked Data Containers

    • On-disk: HDF5 (https://support.hdfgroup.org/HDF5/) and NetCDF4 (https://www.unidata.ucar.edu/software/netcdf/) • In-memory (although they can be on-disk too): bcolz (https://github.com/Blosc/bcolz) and zarr (https://github.com/alimanfoo/zarr)
  35. HDF5, the Granddaddy of On-disk, Chunked Containers

    • Started back in 1998 at NCSA (with the support of NASA) • Great adoption in many fields, including science, engineering and finance • Maintained by The HDF Group, a non-profit corporation • Two major Python wrappers: h5py and PyTables
  36. bcolz: Chunked, Compressed Tables

    • ctable objects in bcolz arrange data column-wise: better performance for big tables, and improved compression ratios. • Efficient appends and shrinks: you can append more data at the end of an object, or shrink it, very efficiently. (A small sketch follows.)
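
    A small sketch of the ctable API (column names and data are made up):

      import bcolz
      import numpy as np

      N = 1_000_000
      ct = bcolz.ctable(
          columns=[np.arange(N), np.random.rand(N)],
          names=['id', 'value'],
          cparams=bcolz.cparams(clevel=5, cname='lz4'),  # Blosc parameters
      )
      ct.append((N, 0.5))        # efficient append of one row at the end
      # queries run directly on the compressed, column-wise data
      hits = [row.id for row in ct.where('value > 0.999')]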
  37. bcolz vs pandas (size)

    bcolz can store 20x more data than pandas by using compression. Source: https://github.com/Blosc/movielens-bench
  38. Query Times for bcolz

    2012 laptop (Intel Ivy Bridge, 2 cores): compression speeds things up. Source: https://github.com/Blosc/movielens-bench
  39. Query Times for bcolz

    2010 laptop (Intel Core2, 2 cores): compression still slows things down. Source: https://github.com/Blosc/movielens-bench
  40. zarr: Chunked, Compressed, N-dimensional Arrays

    • Create N-dimensional arrays with any NumPy dtype. • Chunk arrays along any dimension. • Compress chunks using Blosc or, alternatively, zlib, BZ2 or LZMA. • Created by Alistair Miles of the MRC Centre for Genomics and Global Health for handling genomic data in memory. (See the sketch below.)
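
    A sketch of the zarr API (in current versions the Blosc codec lives in numcodecs; shapes and parameters are illustrative):

      import numpy as np
      import zarr
      from numcodecs import Blosc

      compressor = Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE)
      z = zarr.zeros((10_000, 10_000), chunks=(1_000, 1_000), dtype='i4',
                     compressor=compressor)
      z[:] = np.random.randint(0, 100, size=z.shape, dtype='i4')
      print(z.nbytes / z.nbytes_stored)  # in-memory compression ratio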
  41. Genotype compression benchmarks: http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html

  43. How Machine Learning Can Help Compress Better and Faster

  44. Fine-Tuning Blosc

    • Blosc accepts different: • Compression levels: from 0 to 9 • Codecs: “blosclz”, “lz4”, “lz4hc”, “snappy”, “zlib” and “zstd” • Filters: “shuffle” and “bitshuffle” • Number of threads • Block sizes (each chunk is split into blocks internally). Question: how to choose the best candidates for maximum speed? Or for maximum compression? Or for the right balance? (A brute-force baseline is sketched below.)
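
    Before reaching for machine learning, the space can be brute-forced; a naive benchmarking sketch over a few of the knobs (candidate lists and data are illustrative):

      import time

      import blosc
      import numpy as np

      data = np.random.randint(0, 1000, 5_000_000).astype(np.int64).tobytes()

      results = []
      for cname in ('blosclz', 'lz4', 'zstd'):
          for clevel in (1, 5, 9):
              for shuffle in (blosc.SHUFFLE, blosc.BITSHUFFLE):
                  t0 = time.perf_counter()
                  packed = blosc.compress(data, typesize=8, clevel=clevel,
                                          shuffle=shuffle, cname=cname)
                  speed = len(data) / (time.perf_counter() - t0) / 2**30
                  ratio = len(data) / len(packed)
                  results.append((ratio, speed, cname, clevel, shuffle))

      # rank by compression ratio; ML replaces this exhaustive search with
      # a prediction based on the data's characteristics
      for r in sorted(results, reverse=True)[:5]:
          print(r)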
  45. Answer: Use Machine Learning

    • The user gives hints on what she prefers: • Maximum compression ratio • Maximum compression speed • Maximum decompression speed • A balance of all the above • Based on that, and on the characteristics of the data to be compressed, the training step suggests the optimal Blosc parameters to use on new datasets.
  46. Predicting Params in New Datasets (credit: Alberto Sabater)

  47. Prediction Time Still Large

    • We still need to shave off a ~10x factor in prediction time before we can afford a prediction for every chunk. • Alternatively, one may reuse a prediction for several chunks in a row.
  48. Summary

  49. The Age of Compression Is Now

    • Due to the evolution of computer architecture, compression can be effective for two reasons: • We can work with more data using the same resources. • We can reduce the overhead of compression to near zero, and even beyond that! We are definitely entering an age where compression will be used far more ubiquitously.
  50. But Beware: We Need More Data Chunking In Our Infrastructure!

    • Not many data libraries focus on chunked data containers nowadays. • No silver bullet: we won’t find a single container that makes everybody happy; it’s all about tradeoffs. • With chunked containers we can use persistent media (disk) as if they were ephemeral (memory), and the other way around -> independence from the medium!
  51. When you are short of memory, do not blindly try

    to use different nodes in parallel: first give compression a chance to squeeze all the capabilities out of your single box. (You can always parallelise later on ;)
  52. Gràcies! (Thank you!)