
Francesc Alted

bcolz, a data container that can beat memory speed by using extremely fast data compression



July 24, 2015


  1. bcolz Faster Than Memory Storage
    Francesc Alted, francesc@blosc.org
    Lightning talk

  2. What is bcolz?
    • Provides a storage layer that is both chunked and compressible
    • It is meant for both in-memory and persistent (disk) storage
    • Main goal: to demonstrate that compression can accelerate data access (both on disk and in memory)
  3. Contiguous vs Chunked
    [Diagram: a NumPy container occupies contiguous memory; a bcolz container occupies discontiguous memory, split into chunk 1, chunk 2, ..., chunk N]
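The chunked layout can be illustrated with a toy container. This is only a sketch of the idea, not bcolz's implementation: the `ChunkedStore` class and its methods are hypothetical names, and the standard library's `zlib` stands in for Blosc so the example runs without extra packages.

```python
import zlib
from array import array


class ChunkedStore:
    """Toy chunked, compressed container of float64 values (not bcolz itself)."""

    def __init__(self, chunklen=1024):
        self.chunklen = chunklen
        self.chunks = []        # independent compressed blocks, not contiguous in memory
        self.tail = array("d")  # uncompressed values that do not fill a chunk yet
        self.length = 0

    def append(self, values):
        for v in values:
            self.tail.append(v)
            self.length += 1
            if len(self.tail) == self.chunklen:
                # A full chunk is compressed (zlib here, as a stand-in for Blosc)
                self.chunks.append(zlib.compress(self.tail.tobytes(), 1))
                self.tail = array("d")

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        nchunk, offset = divmod(i, self.chunklen)
        if nchunk == len(self.chunks):
            return self.tail[offset]
        # Only the chunk holding the requested item is decompressed
        block = array("d")
        block.frombytes(zlib.decompress(self.chunks[nchunk]))
        return block[offset]

    def compressed_nbytes(self):
        return sum(len(c) for c in self.chunks) + len(self.tail) * self.tail.itemsize


store = ChunkedStore(chunklen=1000)
store.append(float(i % 10) for i in range(10_500))
```

Appending never has to reallocate or recompress earlier chunks, and a single lookup touches one chunk only; those are the properties the contiguous vs chunked comparison is about.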
  4. Why Compression (I)?
    [Diagram: original dataset vs compressed dataset]
    More data can be packed using the same storage
  5. Why Compression (II)?
    Less data needs to be transmitted to the CPU
    [Diagram: the original dataset and the compressed dataset travel from disk or memory (RAM) over the disk or memory bus to the CPU cache, with a decompression step for the compressed dataset]
    Transmission + decompression faster than direct transfer?
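The question on this slide can be put as a back-of-envelope cost model. The sketch below is an illustration only: the function name `transfer_times` and all the figures are hypothetical, and it assumes that transfer and decompression do not overlap.

```python
def transfer_times(nbytes, bus_bw, ratio, decomp_bw):
    """Return (direct, compressed) times in seconds.

    nbytes    -- size of the uncompressed dataset in bytes
    bus_bw    -- disk or memory bus bandwidth in bytes/s
    ratio     -- compression ratio (uncompressed size / compressed size)
    decomp_bw -- decompression speed in bytes/s of uncompressed output
    """
    direct = nbytes / bus_bw
    compressed = (nbytes / ratio) / bus_bw + nbytes / decomp_bw
    return direct, compressed


# Hypothetical figures chosen only for illustration, not measurements
direct, compressed = transfer_times(nbytes=1e9, bus_bw=10e9, ratio=4.0, decomp_bw=20e9)
print(f"direct: {direct:.3f} s, compressed: {compressed:.3f} s")
```

Under this model, the compressed path wins whenever `1 / ratio + bus_bw / decomp_bw < 1`, that is, when the decompressor is fast relative to the bus and the data compresses well.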
  6. Blosc: Compressing Faster Than memcpy()
    • bcolz uses Blosc to compress chunks
  7. Memory Access Time vs CPU Cycle Time

  8. The MovieLens Dataset: bcolz vs pandas (size)
    • Compression means ~20x less space
  9. Query Times: 3-year-old laptop (Ivy Bridge)
    • Compression leads to better performance, even for in-memory bcolz data containers
  10. Query Times: 5-year-old laptop (Core2)
    • Compression still makes things slower on old boxes, but not necessarily on newer ones
  11. Streaming Analytics With bcolz
    • bcolz container (disk or memory)
    • bcolz iterators/filters with blocking: iter(), where(), whereblocks(), __getitem__()
    • itertools, PyToolz, Dask: map(), filter(), groupby(), sortby(), reduceby(), join()
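The blocked-iteration pattern can be sketched with the standard library alone. This is not the bcolz API: `compress_blocks`, `iter_blocks` and `where_values` are hypothetical helpers, and `zlib` again stands in for Blosc. The point is that a filter yields values lazily, one decompressed block at a time, so it composes with `itertools` and similar streaming tools.

```python
import zlib
from array import array
from itertools import islice


def compress_blocks(values, blocklen):
    """Pack an iterable of ints into a list of compressed int64 blocks."""
    it = iter(values)
    chunks = []
    while True:
        block = array("q", islice(it, blocklen))
        if not block:
            break
        chunks.append(zlib.compress(block.tobytes(), 1))
    return chunks


def iter_blocks(chunks):
    """Yield one decompressed block at a time."""
    for c in chunks:
        block = array("q")
        block.frombytes(zlib.decompress(c))
        yield block


def where_values(chunks, predicate):
    """Lazily yield the values that satisfy predicate, block by block."""
    for block in iter_blocks(chunks):
        for v in block:
            if predicate(v):
                yield v


chunks = compress_blocks(range(100_000), blocklen=4096)

# Only as many blocks as needed are decompressed to produce three hits
first = list(islice(where_values(chunks, lambda v: v % 25_000 == 0), 3))

# Any consumer of an iterator works downstream, e.g. a plain reduction
total = sum(where_values(chunks, lambda v: v < 10))
```

Because at most one block is held decompressed at any moment, memory use stays bounded by the block length regardless of the size of the container.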
  12. Give bcolz a try! http://bcolz.blosc.org/ http://www.blosc.org