Slide 1

Handling Big Data on Modern Computers: A Developer's View
Francesc Alted, Freelance Consultant
http://www.blosc.org/professional-services.html
Python & HDF5 hackfest, Curtin University, August 8th - 11th, 2016

Slide 2

“No sensible decision can be made any longer without taking into account not only the world as it is, but the world as it will be.” — Isaac Asimov

“No sensible decision can be made any longer without taking into account not only the computer as it is, but the computer as it will be.” — My own rephrasing

Slide 3

About Me
• Physicist by training
• Computer scientist by passion
• Open Source enthusiast by philosophy
• PyTables (2002 - now)
• Blosc (2009 - now)
• bcolz (2010 - now)

Slide 4

Why Open Source Projects?

“The art is in the execution of an idea. Not in the idea. There is not much left just from an idea.” — Manuel Oltra, music composer

“Real artists ship.” — Seth Godin, writer

• Nice way to realize yourself while helping others

Slide 5

OPSI: Out-of-core Expressions + Indexed Queries + a Twist

Slide 6

PyTables + h5py
A group of people is gathering in Perth to simplify the Python HDF5 stack. Thanks to Andrea Bedini at Curtin University for organizing this!

Slide 7

Overview
• Why data arrangement is critical for efficient I/O
• Recent trends in computer architecture
• Blosc / bcolz: examples of data containers for large datasets following the principles of newer computer architectures

Slide 8

Example from Neutrino Detectors
• 12 photomultiplier tubes (PMTs)
• The shape of the signal (48000 int16 values) for each event is registered for each PMT
• Each event has associated metadata that should be recorded
• Question: how to arrange the data so that we can store as much as possible without losing speed?

Slide 9

Two Schemas in PyTables

Diagram: (a) Single Table schema: one Table holding Event ID, Meta, PMT ID and the Raw Data inline in each row. (b) Table + Array schema: a Table with Event ID and Meta, plus a separate Array holding the Raw Data.

Slide 10

Two Schemas in PyTables
• Both approaches seem able to host the data, and apparently the Single Table wins (it's simpler)
• But let's experiment and see how each one behaves…

Notebook available:
https://github.com/PyTables/PyTables/blob/develop/examples/Single_Table-vs-EArray_Table.ipynb
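For reference, here is a minimal sketch of the two schemas in PyTables (simplified from the notebook; the field names and the single placeholder `meta` column are assumptions, not the notebook's exact code):

import numpy as np
import tables as tb

N_SAMPLES = 48000

# Schema A: a single Table, with the raw signal stored inline in each row.
class SingleTableEvent(tb.IsDescription):
    event_id = tb.Int32Col()
    pmt_id   = tb.Int8Col()
    meta     = tb.Float64Col()                  # placeholder metadata field
    raw      = tb.Int16Col(shape=(N_SAMPLES,))  # the PMT signal

# Schema B: a small Table for the metadata plus an EArray for the signals.
class EventMeta(tb.IsDescription):
    event_id = tb.Int32Col()
    pmt_id   = tb.Int8Col()
    meta     = tb.Float64Col()

filters = tb.Filters(complib="blosc", complevel=5, shuffle=True)

with tb.open_file("events.h5", "w") as f:
    single = f.create_table("/", "single", SingleTableEvent, filters=filters)
    meta   = f.create_table("/", "meta", EventMeta, filters=filters)
    raw    = f.create_earray("/", "raw", atom=tb.Int16Atom(),
                             shape=(0, N_SAMPLES), filters=filters)
    for pmt in range(12):   # one synthetic event, 12 PMTs
        signal = np.random.randint(-100, 100, N_SAMPLES).astype(np.int16)
        row = single.row
        row["event_id"], row["pmt_id"], row["meta"] = 0, pmt, 0.0
        row["raw"] = signal
        row.append()
        mrow = meta.row
        mrow["event_id"], mrow["pmt_id"], mrow["meta"] = 0, pmt, 0.0
        mrow.append()
        raw.append(signal[np.newaxis, :])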

Slide 11

Difference in Size

Slide 12

Difference in Speed

Slide 13

Data Arrangement is Critical for Performance
• This happens in many cases, but especially when we want to use compression: the way in which we put data together is very important for our goals

Slide 14

Pending Questions
• Why is data arrangement so important?
• Why can compression bring us performance that is close to the non-compressed scenario?

Slide 15

Trends In Computer CPUs

Slide 16

Memory Access Time vs CPU Cycle Time
The gap is wide and still opening!

Slide 17

Computer Architecture Evolution

Figure: evolution of the hierarchical memory model. (a) Up to the end of the 80's, the primordial (and simplest) model: CPU, main memory, mechanical disk. (b) 90's and 2000's, the most common current model: CPU, level 1 and level 2 caches, main memory, mechanical disk. (c) 2010's: CPU, level 1/2/3 caches, main memory, solid state disk, mechanical disk. Speed increases toward the CPU; capacity increases toward the disk.

Slide 18

Hierarchy of Memory by 2017 (Educated Guess)

9 levels will be common: L1 / L2 / L3 / L4 caches, RAM (addressable), XPoint (persistent), SSD PCIe (persistent), SSD SATA (persistent), HDD (persistent).

Slide 19

Forthcoming Trends: CPU+GPU Integration

Slide 20

Trends in Computer Storage

Slide 21

Forthcoming Trends
The growing gap between DRAM and HDD is facilitating the introduction of new SSD devices: BGA SSD, M.2 SSD, PCIe SSD.

Slide 22

Latency Numbers Every Programmer Should Know

Latency Comparison Numbers
--------------------------
L1 cache reference                         0.5 ns
Branch mispredict                            5 ns
L2 cache reference                           7 ns             14x L1 cache
Mutex lock/unlock                           25 ns
Main memory reference                      100 ns             20x L2 cache, 200x L1 cache
Read 4K randomly from memory             1,000 ns    0.001 ms
Compress 1K bytes with Zippy             3,000 ns
Send 1K bytes over 1 Gbps network       10,000 ns    0.01 ms
Read 4K randomly from SSD*             150,000 ns    0.15 ms
Read 1 MB sequentially from memory     250,000 ns    0.25 ms
Round trip within same datacenter      500,000 ns    0.5 ms
Read 1 MB sequentially from SSD*     1,000,000 ns    1 ms     4x memory
Disk seek                           10,000,000 ns   10 ms     20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000 ns   20 ms     80x memory, 20x SSD
Send packet CA->Netherlands->CA    150,000,000 ns  150 ms

Source: Jeff Dean and Peter Norvig (Google), with some additions
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Slide 23

Reference Time vs Transmission Time

Figure: a block in storage is transmitted to the CPU cache; t_ref is the time to reference (locate) the block and t_trans the time to transmit it. Choosing block sizes so that t_ref ~= t_trans optimizes memory access.

Slide 24

Not All Storage Layers Are Created Equal
• Memory: t_ref: 100 ns / t_trans (1 KB): ~100 ns
• Solid State Disk: t_ref: 10 us / t_trans (4 KB): ~10 us
• Mechanical Disk: t_ref: 10 ms / t_trans (1 MB): ~10 ms

Essentially, blocked data access is mandatory for speed! The slower the medium, the larger the block that is worth transmitting; a quick back-of-the-envelope check follows.
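The balanced block sizes above can be recomputed from first principles: the block for which transmission time matches reference time is latency × bandwidth. The bandwidth figures below are rough assumptions (~10 GB/s DRAM, ~400 MB/s SATA SSD, ~100 MB/s HDD), not measurements from the talk:

# Balanced block size = latency * bandwidth (bandwidths are assumptions).
LAYERS = {
    "DRAM":     (100e-9, 10e9),   # ~100 ns latency, ~10 GB/s bandwidth
    "SATA SSD": (10e-6,  400e6),  # ~10 us, ~400 MB/s
    "HDD":      (10e-3,  100e6),  # ~10 ms, ~100 MB/s
}

for name, (t_ref, bw) in LAYERS.items():
    block = t_ref * bw            # bytes transmissible in one latency period
    print(f"{name:>8}: balanced block ~ {block / 1024:.0f} KiB")
# -> roughly 1 KiB (DRAM), 4 KiB (SSD), 1 MB (HDD), matching the slide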

Slide 25

We Need More Data Blocking In Our Infrastructure!
• Not many data containers allow for efficient blocked access yet (but e.g. HDF5 does; see the sketch below)
• With blocked access we can use persistent media (disk) as if they were ephemeral (memory) and the other way around -> independence of media!
• No silver bullet: we won't be able to find a single container that makes everybody happy; it's all about tradeoffs
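As an illustration of blocked access in HDF5, a small h5py sketch (the chunk shape and gzip level are illustrative choices, not recommendations from the talk):

import h5py
import numpy as np

data = np.random.rand(100, 48000)

with h5py.File("blocked.h5", "w") as f:
    # All I/O on a chunked dataset happens in whole chunks, so the block
    # size that travels from disk is under our control.
    f.create_dataset("signals", data=data,
                     chunks=(1, 48000),          # one row per chunk
                     compression="gzip", compression_opts=5,
                     shuffle=True)               # byte-shuffle filter

with h5py.File("blocked.h5", "r") as f:
    row = f["signals"][42]    # decompresses only the chunks holding row 42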

Slide 26

Question
We know that CPUs are in many cases waiting for data to arrive, mainly due to bandwidth limitations. But… could we get better bandwidth than the hardware allows?

Slide 27

Compression for Random & Sequential Access in SSDs
• Compression does help performance!
(throughput figures from the benchmark plot: 65 MB/s, 240 MB/s, 180 MB/s, 200 MB/s)

Slide 28

Compression for Random & Sequential Access in SSDs
• Compression does help performance!
• However, it is limited by the SATA bandwidth
(throughput figures from the benchmark plot: 65 MB/s, 240 MB/s, 180 MB/s, 200 MB/s)

Slide 29

Leveraging Compression Straight To The CPU

Figure: the compressed dataset travels over the disk bus and is decompressed into the CPU cache, so less data needs to be transmitted to the CPU. Is transmission + decompression faster than direct transfer?

Slide 30

When we have a fast enough compressor we can overcome the limitations of the bus bandwidth. And by bus we mean any kind of bus (the memory bus too!).

Slide 31

Example with actual data (satellite images): Blosc compression does not degrade I/O performance.
Thanks to: Rui Yang, Pablo Larraondo (NCI Australia)

Slide 32

Reading satellite images: Blosc decompression accelerates I/O.
Thanks to: Rui Yang, Pablo Larraondo (NCI Australia)

Slide 33

Time to Answer Pending Questions

Slide 34

Time to Answer Pending Questions (I)
• When using no compression, the single Table takes more time than the Table + EArray.
• Not completely sure why this happens, but probably due to memory alignment issues.
• Take home message: when you want to squeeze all the performance out of a computer, don't be afraid of experimenting. You will find surprises!

Slide 35

Time to Answer Pending Questions (II)
• For the single Table, the rows are too large to allow the shuffle filter to put similar significant bytes in groups, although Zlib can still do a good job at that.
• For the Table + EArray, the raw data is arranged so that shuffle can group similar bytes together, allowing Blosc to perform much better.
• Take home message: using the correct schema inside the data container is critical for getting the best performance.

Slide 36

Can CPU-based Compression Alleviate The Memory Bottleneck?

Slide 37

Blosc: Compressing Faster Than memcpy()
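A rough way to check this claim on your own machine (a sketch, not the original benchmark; the smooth synthetic data and the LZ4 codec are assumptions, and results vary with data and hardware):

import time
import numpy as np
import blosc

a = np.linspace(0, 100, 10_000_000)   # smooth data: highly compressible
raw = a.tobytes()

t0 = time.perf_counter()
copy = bytearray(raw)                 # plain memory copy (memcpy-like)
t_copy = time.perf_counter() - t0

t0 = time.perf_counter()
packed = blosc.compress(raw, typesize=8, cname="lz4", shuffle=blosc.SHUFFLE)
t_comp = time.perf_counter() - t0

print(f"copy: {t_copy:.3f} s   compress: {t_comp:.3f} s   "
      f"ratio: {len(raw) / len(packed):.1f}x")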

Slide 38

Principles of Blosc
• Split data chunks in blocks internally (better cache utilization)
• Supports Shuffle and BitShuffle filters (see later)
• Uses parallelism at two levels:
  • multicores (multithreading)
  • SIMD in Intel/AMD processors (SSE2, AVX2) and ARM (NEON)
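With the python-blosc bindings, the codec, the number of threads and the filter are all exposed; a minimal round-trip sketch (the codec choice and thread count are illustrative):

import numpy as np
import blosc

blosc.set_nthreads(4)                  # multithreaded (de)compression

a = np.arange(10_000_000, dtype=np.int64)
packed = blosc.pack_array(a, cname="lz4", shuffle=blosc.SHUFFLE)
b = blosc.unpack_array(packed)         # round trip

assert (a == b).all()
print(f"compression ratio: {a.nbytes / len(packed):.1f}x")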

Slide 39

The Shuffle Filter
• Shuffle works at byte level, and works well for integers or floats that vary smoothly
• There is also support for a BitShuffle filter that works at bit level
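To see what shuffle does, the byte regrouping can be reproduced with plain NumPy (an illustration of the transform only, not Blosc's actual SIMD implementation):

import numpy as np

a = np.array([1, 2, 3, 4], dtype="<i4")   # little-endian int32
b = a.view(np.uint8).reshape(-1, 4)       # one row of 4 bytes per element

shuffled = b.T.copy().ravel()             # group 1st bytes, then 2nd bytes, ...
print(shuffled)
# [1 2 3 4 0 0 0 0 0 0 0 0 0 0 0 0]  <- long runs of equal bytes compress well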

Slide 40

Improving RAM Speed?

Figure: the compressed dataset travels over the memory bus and is decompressed into the CPU cache, so less data needs to be transmitted to the CPU. Is transmission + decompression faster than direct transfer?

Slide 41

Chunked Query

Figure: a table with columns a (String), b (Int32), c (Float64), d (String), each stored as chunks 1..N. For the query (b == 5) & (d == 'some string'), only the interesting columns b and d are brought to the CPU cache, chunk by chunk, and an iterator yields the interesting rows. Very efficient when query selectivity is high and decompression is fast.
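In bcolz this pattern is the ctable.where() iterator; a tiny sketch with made-up columns (numeric instead of strings, for brevity):

import numpy as np
import bcolz

N = 1_000_000
ct = bcolz.ctable(columns=[np.random.randint(0, 10, N), np.random.rand(N)],
                  names=["b", "c"])

# Only the column in the expression ("b") is decompressed chunk by chunk;
# "c" is materialized just for the selected rows.
hits = [row.c for row in ct.where("b == 5", outcols=["c"])]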

Slide 42

Query Times in bcolz
Recent server (Intel Xeon Skylake, 4 cores): compression speeds things up.
Reference: https://github.com/Blosc/movielens-bench/blob/master/querying-ep14.ipynb

Slide 43

Query Times in bcolz
4-year old laptop (Intel Ivy Bridge, 2 cores): compression speeds things up.

Slide 44

Query Times in bcolz
2010 laptop (Intel Core2, 2 cores): compression slows things down.

Slide 45

Sizes in bcolz
Do not forget compression's main strength: we can store more data using the same resources.

Slide 46

Beware: Compression Does Not Always Go Faster

Slide 47

Accelerating I/O With Blosc

Figure: the storage hierarchy from mechanical disk (capacity) up through solid state disk, main memory and the L3/L2/L1 caches to the CPU (speed). Blosc targets the faster layers (main memory and the CPU caches), while other compressors target the slower, higher-capacity layers.

Slide 48

Bcolz: An Example Of Data Containers Applying The Principles Of New Hardware

Slide 49

What is bcolz?
• bcolz provides data containers that can be used in a similar way to the ones in NumPy and Pandas
• The main difference is that data storage is chunked, not contiguous
• Two flavors (see the sketch below):
  • carray: homogeneous, n-dim data types
  • ctable: heterogeneous types, columnar
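A minimal sketch of both flavors (the compression parameters and the on-disk directory name are illustrative):

import numpy as np
import bcolz

# carray: homogeneous, chunked and compressed; rootdir= persists it on disk,
# so the same container works in memory and on disk alike.
ca = bcolz.carray(np.arange(10_000_000),
                  cparams=bcolz.cparams(clevel=5, cname="lz4"),
                  rootdir="ca.bcolz", mode="w")
print(ca)   # repr reports nbytes vs cbytes, i.e. the compression ratio

# ctable: a columnar collection of carrays
ct = bcolz.ctable(columns=[np.arange(10), np.linspace(0, 1, 10)],
                  names=["i", "x"])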

Slide 50

Contiguous vs Chunked

Figure: a NumPy container is a single block of contiguous memory; a carray container is made of chunks 1..N stored in discontiguous memory.

Slide 51

Why Columnar?
• Because it adapts better to newer computer architectures

Slide 52

In-Memory Row-Wise Table (Structured NumPy array)

Figure: N rows, each holding columns String, …, String, Int32, Float64, Int16. To read one interesting Int32 column: interesting data is N * 4 bytes (Int32), but the actual data read is N * 64 bytes (a whole cache line per row).

Slide 53

In-Memory Column-Wise Table (bcolz ctable)

Figure: the same N rows stored column by column. To read the interesting Int32 column: interesting data is N * 4 bytes (Int32), and the actual data read is also N * 4 bytes. Less memory travels to the CPU!

Slide 54

Some Projects Using bcolz
• Visualfabriq's bquery (out-of-core groupbys): https://github.com/visualfabriq/bquery
• scikit-allel: http://scikit-allel.readthedocs.org/
• Quantopian: http://quantopian.github.io/talks/NeedForSpeed/slides#/


Slide 55

Closing Notes
• Pay attention to hardware and software trends and make informed decisions for your current developments (which, btw, will be deployed in the future)
• If you need a data container that fits your needs, look at the already nice libraries out there (NumPy, DyND, Pandas, xarray, HDF5, bcolz, RDBMS…)
• Compression does help during I/O. Make sure you have it among your tools for data handling.

Slide 56

“In science, one can learn the most by studying what seems the least.” — Marvin Minsky

Slide 57

Thank you!