Slide 1

New Computer Trends and How They Affect Us
Francesc Alted, Freelance Consultant
http://www.blosc.org/professional-services.html
April 10th, 2016

Slide 2

“No sensible decision can be made any longer without taking into account not only the world as it is, but the world as it will be.” — Isaac Asimov

My own rephrasing: “No sensible decision can be made any longer without taking into account not only the computer as it is, but the computer as it will be.”

Slide 3

About Me • Physicist by training • Computer scientist by passion • Open Source enthusiast by philosophy • PyTables (2002 - 2011) • Blosc (2009 - now) • bcolz (2010 - now)

Slide 4

Why Open Source Projects? • Nice way to realize yourself while helping others

“The art is in the execution of an idea. Not in the idea. There is not much left just from an idea.” — Manuel Oltra, music composer

“Real artists ship.” — Seth Godin, writer

Slide 5

OPSI Out-of-core Expressions Indexed Queries + a Twist

Slide 6

Overview
• Recent trends in computer architecture
• The need for speed: storing and processing as much data as possible with your existing resources
• Blosc & bcolz as examples of a compressor and data containers for large datasets that follow the principles of the newer computer architectures

Slide 7

Trends in Computer Storage

Slide 8

Forthcoming Trends
The growing gap between DRAM and HDD is facilitating the introduction of new SSD devices: BGA SSD, M.2 SSD, PCIe SSD

Slide 9

Latency Numbers Every Programmer Should Know

Latency Comparison Numbers
--------------------------
L1 cache reference                        0.5 ns
Branch mispredict                           5 ns
L2 cache reference                          7 ns            14x L1 cache
Mutex lock/unlock                          25 ns
Main memory reference                     100 ns            20x L2 cache, 200x L1 cache
Read 4K randomly from memory            1,000 ns   0.001 ms
Compress 1K bytes with Zippy            3,000 ns
Send 1K bytes over 1 Gbps network      10,000 ns   0.01 ms
Read 4K randomly from SSD*            150,000 ns   0.15 ms
Read 1 MB sequentially from memory    250,000 ns   0.25 ms
Round trip within same datacenter     500,000 ns   0.5 ms
Read 1 MB sequentially from SSD*    1,000,000 ns   1 ms     4x memory
Disk seek                          10,000,000 ns   10 ms    20x datacenter roundtrip
Read 1 MB sequentially from disk   20,000,000 ns   20 ms    80x memory, 20x SSD
Send packet CA->Netherlands->CA   150,000,000 ns   150 ms

Source: Jeff Dean and Peter Norvig (Google), with some additions
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Slide 10

Reference Time vs Transmission Time
tref: time to reference a block in storage; ttrans: time to transmit it to the CPU cache
tref ~= ttrans => optimized storage access

Slide 11

Not All Storage Layers Are Created Equal
Memory: tref: 100 ns / ttrans (1 KB): ~100 ns
Solid State Disk: tref: 10 us / ttrans (4 KB): ~10 us
Mechanical Disk: tref: 10 ms / ttrans (1 MB): ~10 ms
The slower the media, the larger the block that is worth transmitting.
But essentially, blocked data access is mandatory for speed!
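The tref/ttrans tradeoff can be checked with a back-of-envelope model: assume each block costs one reference (tref) plus its transmission time, and see how effective bandwidth grows with block size. The helper and the disk figures (10 ms seek, ~100 MB/s sustained, which are consistent with the slides) are illustrative, not a benchmark.

```python
def effective_bandwidth(block_bytes, t_ref_s, bytes_per_s):
    """Effective throughput when each block costs one reference (t_ref)
    plus its transmission time (block_bytes / bytes_per_s)."""
    t_trans = block_bytes / bytes_per_s
    return block_bytes / (t_ref_s + t_trans)

# Mechanical disk: t_ref ~ 10 ms, sustained transfer ~ 100 MB/s
for kb in (4, 64, 1024, 16 * 1024):
    bw = effective_bandwidth(kb * 1024, 10e-3, 100e6)
    print(f"{kb:6d} KB blocks -> {bw / 1e6:6.1f} MB/s effective")
```

Tiny blocks are dominated by the seek (well under 1 MB/s effective), while multi-megabyte blocks approach the media's raw bandwidth: exactly why slower media demand larger blocks.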

Slide 12

We Need More Data Blocking In Our Infrastructure!
• Not many data containers are focused on blocked access
• No silver bullet: we won’t be able to find a single container that makes everybody happy; it’s all about tradeoffs
• With blocked access we can use persistent media (disk) as if it were ephemeral (memory) and the other way around -> independence from the media!

Slide 13

Can We Get Better Bandwidth Than Hardware Allows?

Slide 14

Compression for Random & Sequential Access in SSDs
• Compression does help performance!
(chart labels: 65 MB/s, 240 MB/s, 180 MB/s, 200 MB/s)

Slide 15

Compression for Random & Sequential Access in SSDs
• Compression does help performance!
• However, limited by SATA bandwidth
(chart labels: 65 MB/s, 240 MB/s, 180 MB/s, 200 MB/s)

Slide 16

Leveraging Compression Straight To The CPU
Less data needs to be transmitted to the CPU: the compressed dataset travels over the disk bus and is decompressed in the CPU cache, instead of transmitting the original dataset.
Is transmission + decompression faster than direct transfer?

Slide 17

When we have a fast enough compressor, we can get rid of the limitations of the bus bandwidth. How do we get maximum compression performance?
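The break-even condition behind this slide can be sketched numerically: compressed transfer wins when (size/ratio)/bus + size/decompression_speed is below size/bus. All figures below (a ~SATA-class 600 MB/s bus, a 3x compression ratio, 5 GB/s vs 0.4 GB/s decompressors) are assumed for illustration only.

```python
def transfer_time(size_bytes, bus_bps):
    """Time to move size_bytes uncompressed over a bus."""
    return size_bytes / bus_bps

def compressed_transfer_time(size_bytes, bus_bps, ratio, decomp_bps):
    """Move size/ratio bytes over the bus, then decompress in the CPU."""
    return (size_bytes / ratio) / bus_bps + size_bytes / decomp_bps

size = 1 << 30          # 1 GiB of logical data
bus = 600e6             # assumed ~SATA-class bus, in bytes/s
direct = transfer_time(size, bus)
fast_codec = compressed_transfer_time(size, bus, ratio=3.0, decomp_bps=5e9)
slow_codec = compressed_transfer_time(size, bus, ratio=3.0, decomp_bps=0.4e9)
print(f"direct {direct:.2f}s, fast codec {fast_codec:.2f}s, slow codec {slow_codec:.2f}s")
```

With these numbers the fast codec beats the raw transfer while the slow one loses to it, which is why the compressor itself has to be very fast.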

Slide 18

Recent Trends In Computer CPUs

Slide 19

Memory Access Time vs CPU Cycle Time
The gap is wide and still widening!

Slide 20

Computer Architecture Evolution
Figure 1. Evolution of the hierarchical memory model, with speed increasing toward the CPU and capacity toward the disk: (a) up to the end of the 80’s, the primordial (and simplest) model: CPU, main memory, mechanical disk; (b) the 90’s and 2000’s, the most common current model: CPU, L1/L2 caches, main memory, mechanical disk; (c) the 2010’s: CPU, L1/L2/L3 caches, main memory, solid state disk, mechanical disk.

Slide 21

Hierarchy of Memory By 2018 (Educated Guess)
L1 / L2 / L3 / L4 caches, RAM (addressable), XPoint (persistent), SSD PCIe (persistent), SSD SATA (persistent), HDD (persistent)
9 levels will be common!

Slide 22

Forthcoming Trends: CPU+GPU Integration

Slide 23

Blosc: Compressing Faster Than memcpy()
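Part of how Blosc reaches such speeds is its shuffle filter: regrouping the i-th byte of every element before compressing, so that the mostly-zero high bytes of small integers line up. The sketch below imitates that idea with stdlib zlib as a stand-in codec; real Blosc uses a SIMD-accelerated shuffle and much faster codecs, so the sizes here only illustrate the ratio effect, not the speed.

```python
import struct
import zlib

# 10,000 increasing 64-bit integers: the high bytes are mostly zero.
values = list(range(10_000))
raw = struct.pack(f"<{len(values)}q", *values)

def byte_shuffle(buf, typesize):
    # Regroup byte i of every element together (the shuffle-filter idea).
    return bytes(buf[j] for i in range(typesize)
                 for j in range(i, len(buf), typesize))

plain = zlib.compress(raw, 6)
shuffled = zlib.compress(byte_shuffle(raw, 8), 6)
print(len(raw), len(plain), len(shuffled))  # shuffled compresses tighter
```

After shuffling, six of the eight byte planes are pure zeros and the other two vary slowly, so the codec sees long, highly compressible runs.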

Slide 24

Improving RAM Speed?
Less data needs to be transmitted to the CPU: the compressed dataset travels over the memory bus and is decompressed in the CPU cache, instead of transmitting the original dataset.
Is transmission + decompression faster than direct transfer?

Slide 25

Query Times, Old 2012 Laptop (Intel Ivy Bridge, 2 cores)
Compression speeds things up
Source: https://github.com/Blosc/movielens-bench

Slide 26

Query Times, 2010 Laptop (Intel Core2, 2 cores)
Compression still slows things down
Source: https://github.com/Blosc/movielens-bench

Slide 27

bcolz vs pandas (size)
bcolz can store 20x more data than pandas by using compression

Slide 28

Accelerating I/O With Blosc
(diagram: the storage hierarchy from CPU and L1/L2/L3 caches through main memory, solid state disk and mechanical disk, ordered by speed vs capacity; Blosc targets the upper, faster layers, while other compressors target the slower storage layers)

Slide 29

Compression matters! “Blosc compressors are the fastest ones out there at this point; there is no better publicly available option that I'm aware of. That's not just ‘yet another compressor library’ case.” — Ivan Smirnov (advocating for Blosc inclusion in h5py)

Slide 30

Bcolz: An Example Of Data Containers Applying The Principles Of New Hardware

Slide 31

What is bcolz?
• bcolz provides data containers that can be used in a similar way to the ones in NumPy or Pandas
• The main difference is that data storage is chunked, not contiguous
• Two flavors:
• carray: homogeneous, n-dimensional data types
• ctable: heterogeneous types, columnar

Slide 32

Contiguous vs Chunked
NumPy container: contiguous memory
carray container: chunk 1, chunk 2, …, chunk N in discontiguous memory
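A toy sketch of the chunked idea, using only stdlib zlib and struct: data lives as independently compressed chunks, and reading one item only decompresses the chunk that holds it. This is purely illustrative of the carray concept; the real bcolz container uses Blosc and is far more capable.

```python
import struct
import zlib

class ChunkedArray:
    """Toy carray-like container: int64 data stored as independently
    compressed chunks instead of one contiguous buffer."""
    def __init__(self, values, chunklen=4096):
        self.chunklen = chunklen
        self.nitems = len(values)
        self.chunks = []
        for start in range(0, len(values), chunklen):
            part = values[start:start + chunklen]
            packed = struct.pack(f"<{len(part)}q", *part)
            self.chunks.append(zlib.compress(packed))

    def __getitem__(self, i):
        # Only the chunk holding item i is decompressed.
        raw = zlib.decompress(self.chunks[i // self.chunklen])
        return struct.unpack_from("<q", raw, (i % self.chunklen) * 8)[0]

ca = ChunkedArray(list(range(100_000)))
print(ca[12345], len(ca.chunks))  # 12345 25
```

Because chunks are independent, the container can grow by appending chunks and can live on disk or in memory with the same access pattern, which is the media-independence point made earlier.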

Slide 33

Why Columnar? • Because it adapts better to newer computer architectures

Slide 34

In-Memory Row-Wise Table (Structured NumPy array)
Each of the N rows interleaves all columns (String, …, String, Int32, Float64, Int16).
Interesting data (one Int32 column): N * 4 bytes. Actual data read: N * 64 bytes (one cache line per row).

Slide 35

In-Memory Column-Wise Table (bcolz ctable)
Each column (String, …, String, Int32, Float64, Int16) is stored separately across the N rows.
Interesting data (one Int32 column): N * 4 bytes. Actual data read: N * 4 bytes.
Less memory travels to the CPU! Less entropy, so much more compressible!
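The row-wise vs column-wise gap is simple arithmetic. The 64-byte padded row below is a hypothetical layout chosen to match the cache-line figure on the slides; the point is only the ratio.

```python
# Hypothetical 64-byte row: two strings, Int32, Float64, Int16, padded
# so that scanning one Int32 column drags a full cache line per row.
ROWSIZE = 64          # bytes per row (assumed, matches one cache line)
N = 100_000           # number of rows

row_wise_read = N * ROWSIZE   # row layout: every row's cache line is touched
col_wise_read = N * 4         # columnar layout: only the Int32 column
print(row_wise_read // col_wise_read)  # 16
```

So for this row shape a columnar scan moves 16x less data to the CPU, and since a column holds values of one type, what it does move compresses better too.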

Slide 36

Some Projects Using bcolz
• Visualfabriq’s bquery (out-of-core groupby’s): https://github.com/visualfabriq/bquery
• scikit-allel: http://scikit-allel.readthedocs.org/
• Quantopian: http://quantopian.github.io/talks/NeedForSpeed/slides#/

Slide 37

bquery - On-Disk GroupBy
In-memory (pandas) vs on-disk (bquery+bcolz) groupby
“Switching to bcolz enabled us to have a much better scalable architecture yet with near in-memory performance” — Carst Vaartjes, co-founder of visualfabriq

Slide 38

“The future for me clearly involves lots of block-wise processing of multidimensional bcolz carrays” — Alistair Miles, Head of Epidemiological Informatics for the Kwiatkowski group; author of scikit-allel

Slide 39

Introducing Blosc2
The next generation of Blosc
Super-chunk layout: Blosc2 header, then Chunk 1 (Blosc1), Chunk 2 (Blosc1), Chunk 3 (Blosc1), …, Chunk N (Blosc1)
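To make the header-plus-chunks idea concrete, here is a minimal length-prefixed serialization sketch. The layout (a "SCHK" magic, a chunk count, then length-prefixed zlib-compressed chunks) is entirely made up for illustration; it is NOT the real Blosc2 on-disk format.

```python
import struct
import zlib

def pack_superchunk(chunks):
    """Illustrative super-chunk: magic, chunk count, then length-prefixed
    compressed chunks. Not the real Blosc2 format."""
    out = [b"SCHK", struct.pack("<I", len(chunks))]
    for chunk in chunks:
        comp = zlib.compress(chunk)
        out.append(struct.pack("<I", len(comp)))
        out.append(comp)
    return b"".join(out)

def unpack_superchunk(buf):
    assert buf[:4] == b"SCHK", "bad magic"
    (n,) = struct.unpack_from("<I", buf, 4)
    pos, chunks = 8, []
    for _ in range(n):
        (clen,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        chunks.append(zlib.decompress(buf[pos:pos + clen]))
        pos += clen
    return chunks

data = [bytes([i]) * 1000 for i in range(4)]
assert unpack_superchunk(pack_superchunk(data)) == data
```

A single self-describing container like this is what makes the "serialized version of the super-chunk (disk, network)" feature on the next slide possible.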

Slide 40

Planned features for Blosc2
• Looking into inter-chunk redundancies (delta filter)
• Support for more codecs (Zstd is there already!)
• Serialized version of the super-chunk (disk, network)
• …
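The delta-filter idea can be sketched in a few lines: store differences from the previous value, so a smoothly varying series becomes runs of small, repetitive numbers that any codec compresses far better. zlib again stands in for the real codec; the series is synthetic.

```python
import struct
import zlib

def delta_encode(values):
    # Keep the first value, then store each difference from its predecessor.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

smooth = [1000 + 3 * i for i in range(50_000)]  # smoothly growing series
raw = struct.pack("<50000q", *smooth)
deltas = struct.pack("<50000q", *delta_encode(smooth))

plain = zlib.compress(raw)
filtered = zlib.compress(deltas)
print(len(plain), len(filtered))  # the delta-filtered stream is far smaller
```

After the filter the payload is one initial value followed by 49,999 copies of the constant 3, which is why delta filtering pays off on redundant neighboring chunks.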

Slide 41

• At 3 GB/s, Blosc2 on ARM achieves one of the best bandwidth/Watt ratios on the market
• Profound implications for the density of data storage devices (e.g. arrays of disks driven by ARM)
(chart legend: not using NEON vs using NEON)

Slide 42

Blosc2 has its own repo: https://github.com/Blosc/c-blosc2
Meant to be used only once it has been heavily tested! (bcolz2 will follow after Blosc2)

Slide 43

Closing Notes
• Due to the evolution in computer architecture, compression can be effective for two reasons:
• We can work with more data using the same resources.
• We can reduce the overhead of compression to near zero, and even beyond that!

Slide 44

“In science, one can learn the most by studying what seems the least.” — Marvin Minsky

Slide 45

¡Gracias! (Thank you!)