Handling Big Data on Modern Computers

Nowadays computers are designed quite differently than they were a decade ago; however, very little in software architecture has changed to accommodate those changes in hardware. In this talk I describe what those fundamental changes are and how to deal with them, from the point of view of a long-time developer.

Francesc Alted

August 23, 2016

Transcript

  1. Handling Big Data on Modern Computers A Developer's View Francesc

    Alted Freelance Consultant http://www.blosc.org/professional-services.html Python & HDF5 hackfest Curtin University, August 8th - 11th, 2016
  2. “No sensible decision can be made any longer without taking

    into account not only the computer as it is, but the computer as it will be.” — My own rephrasing “No sensible decision can be made any longer without taking into account not only the world as it is, but the world as it will be.” — Isaac Asimov
  3. About Me • Physicist by training • Computer scientist by

    passion • Open Source enthusiast by philosophy • PyTables (2002 - now) • Blosc (2009 - now) • bcolz (2010 - now)
  4. Why Open Source Projects? "The art is in the execution of an idea. Not in the idea. There is not much left from just an idea." –Manuel Oltra, music composer. "Real artists ship" –Seth Godin, writer • A nice way to fulfill yourself while helping others
  5. OPSI: Out-of-core Expressions, Indexed Queries + a Twist

  6. PyTables + h5py A group of people is gathering in Perth to simplify the Python stack. Thanks to Andrea Bedini at Curtin University for organizing this!
  7. Overview • Why data arrangement is critical for efficient I/O

    • Recent trends in computer architecture • Blosc / bcolz: examples of data containers for large datasets following the principles of newer computer architectures
  8. Example from Neutrino Detectors • 12 photomultipliers (PMTs) • The shape of the signal (48000 int16 values) is registered for each PMT and each event • Each event has associated metadata that should be recorded • Question: how should we store the data so that we can keep as much as possible without losing speed?
  9. Two Schemas in PyTables • Single Table: one row per event holding Event ID, Metadata, PMT ID and the Raw Data inline • Table + EArray: a Table with Event ID and Metadata, plus a separate EArray holding the Raw Data (both sketched below)
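A minimal sketch of how the two schemas might be declared with PyTables (column names, the single-float metadata stand-in, and the file name are assumptions for illustration, not the talk's actual code):

    import tables

    N_SAMPLES = 48000  # signal length registered per PMT and event

    # Schema 1: a single Table, one row per (event, PMT), raw data inline.
    class SingleTable(tables.IsDescription):
        event_id = tables.Int64Col()
        meta = tables.Float64Col()      # stand-in for the real event metadata
        pmt_id = tables.UInt8Col()
        raw_data = tables.Int16Col(shape=(N_SAMPLES,))

    # Schema 2: a Table for the metadata plus an EArray for the raw data.
    class EventMeta(tables.IsDescription):
        event_id = tables.Int64Col()
        meta = tables.Float64Col()

    with tables.open_file("events.h5", mode="w") as f:
        f.create_table("/", "single", SingleTable)
        f.create_table("/", "meta", EventMeta)
        f.create_earray("/", "raw", atom=tables.Int16Atom(),
                        shape=(0, N_SAMPLES))  # extendable along axis 0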
  10. Two Schemas in PyTables • Both approaches seem able to host the data, and apparently the Single Table wins (it's simpler) • But let's experiment and see how each one behaves… Notebook available:
 https://github.com/PyTables/PyTables/blob/develop/examples/Single_Table-vs-EArray_Table.ipynb
  11. Difference in Size

  12. Difference in Speed

  13. Data Arrangement is Critical for Performance • This happens in many cases, but especially when we want to use compression: the way in which we put data together is very important for our goals
  14. Pending Questions • Why is data arrangement so important? • Why can compression bring us performance that is close to the uncompressed scenario?
  15. Trends In Computer CPUs

  16. Memory Access Time vs CPU Cycle Time The gap is

    wide and still opening!
  17. Computer Architecture Evolution Figure: evolution of the hierarchical memory model. (a) Up to the end of the 80's, the primordial (and simplest) model: CPU, main memory, mechanical disk. (b) The 90's and 2000's, the most common current model: CPU, L1/L2 caches, main memory, mechanical disk. (c) The 2010's: CPU, L1/L2/L3 caches, main memory, solid state disk, mechanical disk. Speed grows toward the CPU, capacity toward the disk.
  18. Hierarchy of Memory By 2017 (Educated Guess), 9 levels will be common: L1 / L2 / L3 / L4 caches, RAM (addressable), XPoint (persistent), SSD PCIe (persistent), SSD SATA (persistent), HDD (persistent)
  19. Forthcoming Trends CPU+GPU Integration

  20. Trends in Computer Storage

  21. Forthcoming Trends The growing gap between DRAM and HDD is facilitating the introduction of new SSD devices: BGA SSD, M.2 SSD, PCIe SSD
  22. Latency Numbers Every Programmer Should Know
     L1 cache reference ........................... 0.5 ns
     Branch mispredict .............................. 5 ns
     L2 cache reference ............................. 7 ns  (14x L1 cache)
     Mutex lock/unlock ............................. 25 ns
     Main memory reference ........................ 100 ns  (20x L2 cache, 200x L1 cache)
     Read 4K randomly from memory ............... 1,000 ns  (0.001 ms)
     Compress 1K bytes with Zippy ............... 3,000 ns
     Send 1K bytes over 1 Gbps network ......... 10,000 ns  (0.01 ms)
     Read 4K randomly from SSD* ............... 150,000 ns  (0.15 ms)
     Read 1 MB sequentially from memory ....... 250,000 ns  (0.25 ms)
     Round trip within same datacenter ........ 500,000 ns  (0.5 ms)
     Read 1 MB sequentially from SSD* ....... 1,000,000 ns  (1 ms; 4x memory)
     Disk seek ............................ 10,000,000 ns  (10 ms; 20x datacenter roundtrip)
     Read 1 MB sequentially from disk ..... 20,000,000 ns  (20 ms; 80x memory, 20x SSD)
     Send packet CA->Netherlands->CA ..... 150,000,000 ns  (150 ms)
     Source: Jeff Dean and Peter Norvig (Google), with some additions
     http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
  23. Reference Time vs Transmission Time A block in storage is transmitted to the CPU cache; tref is the time to reference the block and ttrans the time to transmit it. Making tref ~= ttrans optimizes memory access.
  24. Not All Storage Layers Are Created Equal • Memory: tref 100 ns / ttrans (1 KB) ~100 ns • Solid State Disk: tref 10 us / ttrans (4 KB) ~10 us • Mechanical Disk: tref 10 ms / ttrans (1 MB) ~10 ms • Essentially, blocked data access is mandatory for speed: the slower the media, the larger the block that is worth transmitting (a back-of-the-envelope check follows)
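Where those block sizes come from can be checked with a quick computation; the bandwidth figures below are rough assumptions, not measurements from the talk:

    # Block size for which transmission time matches reference time,
    # i.e. ttrans ~= tref, for each storage medium.
    MEDIA = {
        # name: (tref in seconds, assumed bandwidth in bytes/second)
        "RAM": (100e-9, 10e9),   # ~100 ns latency, ~10 GB/s
        "SSD": (10e-6, 0.4e9),   # ~10 us latency, ~400 MB/s (SATA)
        "HDD": (10e-3, 100e6),   # ~10 ms seek, ~100 MB/s
    }

    for name, (tref, bw) in MEDIA.items():
        block = tref * bw  # bytes transmittable in one tref
        print(f"{name}: balanced block ~ {block / 1024:.0f} KB")

This prints roughly 1 KB for RAM, 4 KB for SSD and ~1 MB for HDD, matching the block sizes on the slide.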
  25. We Need More Data Blocking In Our Infrastructure! • Not many data containers allow for efficient blocked access yet (but e.g. HDF5 does; see the sketch below) • With blocked access we can use persistent media (disk) as if it were ephemeral (memory) and the other way around -> independence of the media! • No silver bullet: we won't be able to find a single container that makes everybody happy; it's all about tradeoffs
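As an example of a container with blocked access, HDF5 exposes it through chunked datasets; a minimal sketch with h5py (file name, shape and chunk size are arbitrary):

    import h5py
    import numpy as np

    # Each 256x256 chunk is read, written (and optionally compressed) as a unit.
    with h5py.File("blocked.h5", "w") as f:
        dset = f.create_dataset("data", shape=(10000, 10000), dtype="f4",
                                chunks=(256, 256), compression="gzip")
        dset[0:256, 0:256] = np.random.rand(256, 256).astype("f4")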
  26. Question We know that CPUs are in many cases waiting for data to arrive, mainly due to bandwidth limitations. But… could we get better bandwidth than the hardware allows?
  27. Compression for Random & Sequential Access in SSDs • Compression does help performance! [Chart: throughput figures of 65 MB/s, 240 MB/s, 180 MB/s and 200 MB/s for the different access modes]
  28. Compression for Random & Sequential Access in SSDs • Compression does help performance! • However, it is limited by the SATA bandwidth [same chart]
  29. Leveraging Compression Straight To the CPU Less data needs to be transmitted to the CPU: the compressed dataset travels from disk over the disk bus, and decompression happens right before the CPU cache, recreating the original dataset. Is transmission + decompression faster than direct transfer?
  30. When we have a fast enough compressor, we can overcome the limitations of the bus bandwidth. And by bus we mean any kind of bus (the memory bus too!)
  31. Example with actual data (satellite images): Blosc compression does not degrade I/O performance Thanks to: Rui Yang, Pablo Larraondo (NCI Australia)
  32. Reading satellite images: Blosc decompression accelerates I/O Thanks to: Rui

    Yang, Pablo Larraondo (NCI Australia)
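The satellite data above lives in HDF5; here is a sketch of how a Blosc-compressed dataset of that kind could be written with PyTables (file name, codec choice and shapes are made up for illustration):

    import numpy as np
    import tables

    # Enable Blosc (LZ4 codec plus shuffle) through PyTables filters.
    filters = tables.Filters(complevel=5, complib="blosc:lz4", shuffle=True)

    with tables.open_file("images.h5", mode="w") as f:
        images = f.create_carray("/", "images", atom=tables.Int16Atom(),
                                 shape=(100, 1024, 1024), filters=filters)
        images[0] = np.random.randint(0, 1000, (1024, 1024), dtype=np.int16)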
  33. Time to Answer Pending Questions

  34. Time to Answer Pending Questions (I) • When using no compression, the single Table takes more time than the Table + EArray • Not completely sure why this happens, but probably due to memory alignment issues • Take-home message: when you want to squeeze all the performance out of a computer, don't be afraid of experimenting. You will find surprises!
  35. Time to Answer Pending Questions (II) • For the single Table, the rows are too large to allow the shuffle filter to put similar significant bytes in groups, although Zlib can still do a good job at that • For the Table + EArray, the raw data is arranged so that shuffle can group similar bytes together, allowing Blosc to perform much better • Take-home message: using the correct schema inside the data container is critical for getting the best performance
  36. Can CPU-based Compression Alleviate The Memory Bottleneck?

  37. Blosc: Compressing Faster Than memcpy()
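This claim is easy to probe on your own machine; a toy benchmark sketch (not the talk's benchmark, and the outcome depends on the CPU, the codec and how compressible the data is):

    import time
    import numpy as np
    import blosc

    a = np.zeros(50_000_000, dtype=np.float64)  # highly compressible payload
    buf = a.tobytes()

    t0 = time.perf_counter()
    copy = bytearray(buf)              # a plain memory copy, memcpy-style
    t1 = time.perf_counter()
    packed = blosc.compress(buf, typesize=8)
    t2 = time.perf_counter()
    out = blosc.decompress(packed)
    t3 = time.perf_counter()

    print(f"copy {t1 - t0:.3f}s, compress {t2 - t1:.3f}s, "
          f"decompress {t3 - t2:.3f}s, ratio {len(buf) / len(packed):.0f}x")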

  38. Principles of Blosc • Splits data chunks into blocks internally (better cache utilization) • Supports the Shuffle and BitShuffle filters (see later) • Uses parallelism at two levels: multicore (multithreading) and SIMD, in Intel/AMD processors (SSE2, AVX2) and ARM (NEON); a small example follows
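A minimal example with the python-blosc bindings (the array contents and the parameters are just for illustration):

    import numpy as np
    import blosc

    blosc.set_nthreads(4)  # first level of parallelism: multithreading

    a = np.linspace(0, 1000, 10_000_000)  # varies smoothly: shuffle-friendly

    packed = blosc.compress(a.tobytes(), typesize=a.itemsize,
                            cname="lz4", shuffle=blosc.SHUFFLE)
    print(f"compression ratio: {a.nbytes / len(packed):.1f}x")

    restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)
    assert np.array_equal(a, restored)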
  39. The Shuffle filter • Shuffle works at the byte level and works well for integers or floats that vary smoothly • There is also support for a BitShuffle filter that works at the bit level • The regrouping idea is modeled below
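The idea behind Shuffle can be modeled with plain NumPy (this only illustrates the byte regrouping, not Blosc's actual SIMD implementation):

    import numpy as np

    a = np.arange(1000, dtype=np.int32)  # smoothly varying integers

    # View each 4-byte element as bytes, then regroup by byte position.
    planes = a.view(np.uint8).reshape(-1, a.itemsize).T.copy()

    # On a little-endian machine the two high-order byte planes are all
    # zeros: long constant runs that any codec compresses extremely well.
    print(np.unique(planes[2]), np.unique(planes[3]))  # -> [0] [0]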
  40. Improving RAM Speed? Less data needs to be transmitted to the CPU: the compressed dataset travels from RAM over the memory bus, and decompression happens right before the CPU cache, recreating the original dataset. Is transmission + decompression faster than direct transfer?
  41. Chunked Query Columns: a (String), b (Int32), c (Float64), d (String), each stored as chunk 1 … chunk N. Query: (b == 5) & (d == 'some string'). Only the chunks of the interesting columns (b and d) travel to the CPU cache, and an iterator yields just the interesting rows into the result. Very efficient when query selectivity is high and decompression is fast (see the sketch below).
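A sketch of such a chunked query with bcolz (hypothetical numeric columns; the string condition from the slide is omitted for simplicity):

    import numpy as np
    import bcolz

    N = 1_000_000
    b = np.random.randint(0, 100, N)
    c = np.random.rand(N)

    ct = bcolz.ctable(columns=[b, c], names=["b", "c"])

    # where() scans chunk by chunk, decompressing only the columns the
    # condition touches, and yields just the interesting rows.
    hits = [row.c for row in ct.where("b == 5")]
    print(len(hits), "rows selected")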
  42. Query Times in bcolz Reference: https://github.com/Blosc/movielens-bench/blob/master/querying-ep14.ipynb Recent server (Intel Xeon Skylake, 4 cores): compression speeds things up
  43. Query Times in bcolz 4-year-old laptop (Intel Ivy Bridge, 2 cores): compression speeds things up
  44. Query Times in bcolz 2010 laptop (Intel Core2, 2 cores): compression slows things down
  45. Sizes in bcolz Do not forget compression's main strength: we can store more data using the same resources
  46. Beware: Compression Does Not Always Go Faster

  47. Accelerating I/O With Blosc [Diagram: the storage hierarchy from CPU and L1/L2/L3 caches through main memory, solid state disk and mechanical disk, ordered by speed vs capacity; Blosc targets the upper, faster levels, while other compressors target the lower ones]
  48. Bcolz: An Example Of Data Containers Applying The Principles Of

    New Hardware
  49. What is bcolz? • bcolz provides data containers that can be used in a similar way to the ones in NumPy and Pandas • The main difference is that data storage is chunked, not contiguous • Two flavors: carray (homogeneous, n-dimensional data types) and ctable (heterogeneous types, columnar); a small sketch follows
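A minimal carray sketch (the parameters are illustrative):

    import numpy as np
    import bcolz

    a = np.linspace(0, 1, 10_000_000)

    # An in-memory carray; pass rootdir="..." to get a persistent one on disk.
    ca = bcolz.carray(a, cparams=bcolz.cparams(clevel=5, cname="lz4"))

    print(ca)          # its repr reports nbytes vs cbytes (compression ratio)
    print(ca[10:20])   # slicing decompresses only the chunks involved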
  50. Contiguous vs Chunked • NumPy container: one contiguous block of memory • carray container: chunk 1, chunk 2, …, chunk N in discontiguous memory
  51. Why Columnar? • Because it adapts better to newer computer

    architectures
  52. In-Memory Row-Wise Table (Structured NumPy array) N rows of (String, …, String, Int32, Float64, Int16). To read one interesting Int32 column: interesting data is N * 4 bytes (Int32), but the actual data read is N * 64 bytes (a full cache line per row).
  53. In-Memory Column-Wise Table (bcolz ctable) The same N rows stored column by column. To read the interesting Int32 column: interesting data is N * 4 bytes (Int32), and the actual data read is also N * 4 bytes. Less memory travels to the CPU! (made concrete below)
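The read-amplification argument can be made concrete with NumPy (field sizes are chosen so that one row fills a 64-byte cache line):

    import numpy as np

    N = 1_000_000
    rows = np.zeros(N, dtype=[("a", "S37"), ("b", np.int32),
                              ("c", np.float64), ("d", "S15")])

    # Row-wise: rows["b"] is a strided view; every 4-byte value sits inside
    # a 64-byte record, so whole cache lines of unrelated fields are fetched.
    print(rows.dtype.itemsize, "bytes fetched per row")   # 64

    # Column-wise: the same column stored contiguously reads 4 bytes/value.
    col_b = np.ascontiguousarray(rows["b"])
    print(col_b.itemsize, "bytes needed per value")       # 4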
  54. Some Projects Using bcolz • Visualfabriq's bquery (out-of-core groupbys): https://github.com/visualfabriq/bquery • scikit-allel: http://scikit-allel.readthedocs.org/ • Quantopian: http://quantopian.github.io/talks/NeedForSpeed/slides#/

  55. Closing Notes • Pay attention to hardware and software trends and make informed decisions for your current developments (which, by the way, will be deployed in the future) • If you need a data container that fits your needs, look at the nice libraries already out there (NumPy, DyND, Pandas, xarray, HDF5, bcolz, RDBMSs…) • Compression does help during I/O; make sure you have it among your tools for data handling
  56. –Marvin Minsky “In science, one can learn the most by

    studying what seems the least.”
  57. Thank you!