Slide 1

Slide 1 text

bcolz: Faster Than Memory Storage
Francesc Alted
[email protected]
Lightning talk

Slide 2

Slide 2 text

What is bcolz?
• Provides a storage layer that is both chunked and compressible
• Meant for both in-memory and persistent (on-disk) storage
• Main goal: to demonstrate that compression can accelerate data access (both on disk and in memory)

Slide 3

Slide 3 text

Contiguous vs Chunked
NumPy container: contiguous memory
bcolz container: discontiguous memory (chunk 1, chunk 2, ..., chunk N)
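The contrast on this slide can be illustrated with a minimal sketch. It is not the bcolz API: `ChunkedStore`, `chunklen`, and `nbytes_compressed` are hypothetical names, and the standard-library `zlib` stands in for the compressor only so that the example runs without extra dependencies.

```python
import zlib
from array import array


class ChunkedStore:
    """Hypothetical chunked, compressed container of int64 values (not the bcolz API)."""

    def __init__(self, values, chunklen=1024):
        self.chunklen = chunklen
        self.length = len(values)
        # Each chunk is compressed on its own and stored as a separate bytes
        # object, so the container as a whole is not contiguous in memory.
        self.chunks = [
            zlib.compress(array("q", values[i:i + chunklen]).tobytes())
            for i in range(0, len(values), chunklen)
        ]

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Only the chunk that holds the requested item is decompressed.
        chunk_no, offset = divmod(index, self.chunklen)
        chunk = array("q")
        chunk.frombytes(zlib.decompress(self.chunks[chunk_no]))
        return chunk[offset]

    def nbytes_compressed(self):
        """Total size of all compressed chunks, in bytes."""
        return sum(len(c) for c in self.chunks)


store = ChunkedStore(list(range(10_000)), chunklen=1024)
print(len(store.chunks), store[5000], store.nbytes_compressed())
```

Reading a single item touches one chunk, which is the property that lets a chunked container decompress small pieces on demand instead of the whole dataset.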

Slide 4

Slide 4 text

Why Compression (I)?
[Figure: original dataset vs compressed dataset]
More data can be packed using the same storage

Slide 5

Slide 5 text

Why Compression (II)?
Less data needs to be transmitted to the CPU
[Figure: the original dataset and the compressed dataset each travel from disk or memory (RAM) over the disk or memory bus to the CPU cache; the compressed dataset needs an extra decompression step]
Transmission + decompression faster than direct transfer?
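The question on this slide can be made concrete with a simple model. The symbols are assumptions introduced here, not taken from the talk: $S$ is the uncompressed size, $B$ the bus bandwidth, $r > 1$ the compression ratio, and $D$ the decompression throughput measured in uncompressed bytes per second. The model also assumes that transfer and decompression do not overlap.

```latex
T_{\text{direct}} = \frac{S}{B},
\qquad
T_{\text{compressed}} = \frac{S}{rB} + \frac{S}{D}

T_{\text{compressed}} < T_{\text{direct}}
\iff \frac{1}{D} < \frac{1}{B}\left(1 - \frac{1}{r}\right)
\iff D > \frac{r}{r-1}\,B
```

Under these assumptions, with $r = 2$ the decompressor has to deliver more than twice the bus bandwidth to come out ahead; as $r$ grows, the threshold approaches $B$ itself.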

Slide 6

Slide 6 text

Blosc: Compressing Faster Than memcpy()
• bcolz uses Blosc to compress chunks

Slide 7

Slide 7 text

Memory Access Time vs CPU Cycle Time

Slide 8

Slide 8 text

The MovieLens Dataset: bcolz vs pandas (size)
• Compression means ~20x less space

Slide 9

Slide 9 text

Query Times: 3-year-old laptop (Ivy Bridge)
• Compression leads to better performance, even for in-memory bcolz data containers

Slide 10

Slide 10 text

Query Times: 5-year-old laptop (Core2)
• Compression still makes things slower on old boxes, but not necessarily on newer ones

Slide 11

Slide 11 text

Streaming Analytics With bcolz
bcolz container (disk or memory)
bcolz iterators/filters with blocking: iter(), iterblocks(), where(), whereblocks(), __getitem__()
itertools, PyToolz, Dask: map(), filter(), groupby(), sortby(), reduceby(), join()
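The blocked-iteration idea on this slide can be sketched as follows. The function names `iterblocks` and `where` echo the slide but are hypothetical stand-ins written for this example, not the bcolz implementations, and `zlib` again replaces the real compressor so the code runs with the standard library alone.

```python
import zlib
from array import array


def compress_blocks(values, blocklen):
    """Split values into int64 blocks and compress each block separately."""
    return [
        zlib.compress(array("q", values[i:i + blocklen]).tobytes())
        for i in range(0, len(values), blocklen)
    ]


def iterblocks(blocks):
    """Yield decompressed blocks one at a time, never the whole dataset at once."""
    for compressed in blocks:
        block = array("q")
        block.frombytes(zlib.decompress(compressed))
        yield block


def where(blocks, predicate):
    """Yield the values that satisfy predicate, scanning block by block."""
    for block in iterblocks(blocks):
        for value in block:
            if predicate(value):
                yield value


blocks = compress_blocks(list(range(10)), blocklen=4)
# The generator plugs directly into ordinary streaming consumers such as sum().
even_total = sum(where(blocks, lambda v: v % 2 == 0))
print(even_total)
```

Because `where` is a plain generator, its output can be fed to `itertools` or similar streaming tools without materializing the decompressed data.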

Slide 12

Slide 12 text

Give bcolz a try!
http://bcolz.blosc.org/
http://www.blosc.org