Handling Big Data on Modern Computers

Nowadays computers are designed quite differently than they were a decade ago; however, very little in software architecture has changed to accommodate those changes in hardware. In this talk I describe what those fundamental changes are and how to deal with them, from the point of view of a long-time developer.

Francesc Alted

August 23, 2016

Transcript

  1. Handling Big Data on Modern Computers A Developer's View Francesc

    Alted Freelance Consultant http://www.blosc.org/professional-services.html Python & HDF5 hackfest Curtin University, August 8th - 11th, 2016
  2. “No sensible decision can be made any longer without taking

    into account not only the computer as it is, but the computer as it will be.” — My own rephrasing “No sensible decision can be made any longer without taking into account not only the world as it is, but the world as it will be.” — Isaac Asimov
  3. About Me • Physicist by training • Computer scientist by

    passion • Open Source enthusiast by philosophy • PyTables (2002 - now) • Blosc (2009 - now) • bcolz (2010 - now)
  4. Why Open Source Projects? "The art is in the execution of an idea. Not in the idea. There is not much left from just an idea." –Manuel Oltra, music composer. "Real artists ship" –Seth Godin, writer • A nice way to fulfill yourself while helping others
  5. OPSI: Out-of-core Expressions, Indexed Queries + a Twist

  6. PyTables + h5py A group of people is gathering in Perth to simplify the Python stack. Thanks to Andrea Bedini at Curtin University for organizing this!
  7. Overview • Why data arrangement is critical for efficient I/O

    • Recent trends in computer architecture • Blosc / bcolz: examples of data containers for large datasets following the principles of newer computer architectures
  8. Example from Neutrino Detectors • 12 photomultipliers (PMTs) • The shape of the signal (48000 int16 values) is registered for each PMT and each event • Each event has associated metadata that should be recorded • Question: how should we store the data so that we can keep as much as possible without losing speed?
  9. Two Schemas in PyTables • Single Table: one row per event holding Event ID, Metadata, PMT ID and the Raw Data inline • Table + EArray: a Table with Event ID and Metadata, plus a separate EArray holding the Raw Data (both sketched below)
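A minimal sketch of how the two schemas might be declared with PyTables (column names, the single-float metadata stand-in, and the file name are assumptions for illustration, not the talk's actual code):

    import tables

    N_SAMPLES = 48000  # signal length registered per PMT and event

    # Schema 1: a single Table, one row per (event, PMT), raw data inline.
    class SingleTable(tables.IsDescription):
        event_id = tables.Int64Col()
        meta = tables.Float64Col()      # stand-in for the real event metadata
        pmt_id = tables.UInt8Col()
        raw_data = tables.Int16Col(shape=(N_SAMPLES,))

    # Schema 2: a Table for the metadata plus an EArray for the raw data.
    class EventMeta(tables.IsDescription):
        event_id = tables.Int64Col()
        meta = tables.Float64Col()

    with tables.open_file("events.h5", mode="w") as f:
        f.create_table("/", "single", SingleTable)
        f.create_table("/", "meta", EventMeta)
        f.create_earray("/", "raw", atom=tables.Int16Atom(),
                        shape=(0, N_SAMPLES))  # extendable along axis 0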
  10. Two Schemas in PyTables • Both approaches seem able to host the data, and apparently the Single Table wins (it's simpler) • But let's experiment and see how each one behaves… Notebook available:
 https://github.com/PyTables/PyTables/blob/develop/examples/Single_Table-vs-EArray_Table.ipynb
  11. Difference in Size

  12. Difference in Speed

  13. Data Arrangement is Critical for Performance • This happens in many cases, but especially when we want to use compression: the way in which we put data together is very important for our goals
  14. Pending Questions • Why is data arrangement so important? • Why can compression bring us performance that is close to the uncompressed scenario?
  15. Trends In Computer CPUs

  16. Memory Access Time vs CPU Cycle Time The gap is

    wide and still opening!
  17. Computer Architecture Evolution Figure: evolution of the hierarchical memory model. (a) Up to the end of the 80's, the primordial (and simplest) model: CPU, main memory, mechanical disk. (b) The 90's and 2000's, the most common current model: CPU, L1/L2 caches, main memory, mechanical disk. (c) The 2010's: CPU, L1/L2/L3 caches, main memory, solid state disk, mechanical disk. Speed grows toward the CPU, capacity toward the disk.
  18. Hierarchy of Memory By 2017 (Educated Guess), 9 levels will be common: L1 / L2 / L3 / L4 caches, RAM (addressable), XPoint (persistent), SSD PCIe (persistent), SSD SATA (persistent), HDD (persistent)
  19. Forthcoming Trends CPU+GPU Integration

  20. Trends in Computer Storage

  21. Forthcoming Trends The growing gap between DRAM and HDD is facilitating the introduction of new SSD devices: BGA SSD, M.2 SSD, PCIe SSD
  22. Latency Numbers Every Programmer Should Know
     L1 cache reference ........................... 0.5 ns
     Branch mispredict .............................. 5 ns
     L2 cache reference ............................. 7 ns  (14x L1 cache)
     Mutex lock/unlock ............................. 25 ns
     Main memory reference ........................ 100 ns  (20x L2 cache, 200x L1 cache)
     Read 4K randomly from memory ............... 1,000 ns  (0.001 ms)
     Compress 1K bytes with Zippy ............... 3,000 ns
     Send 1K bytes over 1 Gbps network ......... 10,000 ns  (0.01 ms)
     Read 4K randomly from SSD* ............... 150,000 ns  (0.15 ms)
     Read 1 MB sequentially from memory ....... 250,000 ns  (0.25 ms)
     Round trip within same datacenter ........ 500,000 ns  (0.5 ms)
     Read 1 MB sequentially from SSD* ....... 1,000,000 ns  (1 ms; 4x memory)
     Disk seek ............................ 10,000,000 ns  (10 ms; 20x datacenter roundtrip)
     Read 1 MB sequentially from disk ..... 20,000,000 ns  (20 ms; 80x memory, 20x SSD)
     Send packet CA->Netherlands->CA ..... 150,000,000 ns  (150 ms)
     Source: Jeff Dean and Peter Norvig (Google), with some additions
     http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
  23. Reference Time vs Transmission Time A block in storage is transmitted to the CPU cache; tref is the time to reference the block and ttrans the time to transmit it. Making tref ~= ttrans optimizes memory access.
  24. Not All Storage Layers Are Created Equal • Memory: tref 100 ns / ttrans (1 KB) ~100 ns • Solid State Disk: tref 10 us / ttrans (4 KB) ~10 us • Mechanical Disk: tref 10 ms / ttrans (1 MB) ~10 ms • Essentially, blocked data access is mandatory for speed: the slower the media, the larger the block that is worth transmitting (a back-of-the-envelope check follows)
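Where those block sizes come from can be checked with a quick computation; the bandwidth figures below are rough assumptions, not measurements from the talk:

    # Block size for which transmission time matches reference time,
    # i.e. ttrans ~= tref, for each storage medium.
    MEDIA = {
        # name: (tref in seconds, assumed bandwidth in bytes/second)
        "RAM": (100e-9, 10e9),   # ~100 ns latency, ~10 GB/s
        "SSD": (10e-6, 0.4e9),   # ~10 us latency, ~400 MB/s (SATA)
        "HDD": (10e-3, 100e6),   # ~10 ms seek, ~100 MB/s
    }

    for name, (tref, bw) in MEDIA.items():
        block = tref * bw  # bytes transmittable in one tref
        print(f"{name}: balanced block ~ {block / 1024:.0f} KB")

This prints roughly 1 KB for RAM, 4 KB for SSD and ~1 MB for HDD, matching the block sizes on the slide.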
  25. We Need More Data Blocking In Our Infrastructure! • Not many data containers allow for efficient blocked access yet (but e.g. HDF5 does; see the sketch below) • With blocked access we can use persistent media (disk) as if it were ephemeral (memory) and the other way around -> independence of the media! • No silver bullet: we won't be able to find a single container that makes everybody happy; it's all about tradeoffs
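As an example of a container with blocked access, HDF5 exposes it through chunked datasets; a minimal sketch with h5py (file name, shape and chunk size are arbitrary):

    import h5py
    import numpy as np

    # Each 256x256 chunk is read, written (and optionally compressed) as a unit.
    with h5py.File("blocked.h5", "w") as f:
        dset = f.create_dataset("data", shape=(10000, 10000), dtype="f4",
                                chunks=(256, 256), compression="gzip")
        dset[0:256, 0:256] = np.random.rand(256, 256).astype("f4")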
  26. Question We know that CPUs are in many cases waiting for data to arrive, mainly due to bandwidth limitations. But… could we get better bandwidth than the hardware allows?
  27. Compression for Random & Sequential Access in SSDs • Compression does help performance! [Chart: throughput figures of 65 MB/s, 240 MB/s, 180 MB/s and 200 MB/s for the different access modes]
  28. Compression for Random & Sequential Access in SSDs • Compression does help performance! • However, it is limited by the SATA bandwidth [same chart]
  29. Leveraging Compression Straight To the CPU Less data needs to be transmitted to the CPU: the compressed dataset travels from disk over the disk bus, and decompression happens right before the CPU cache, recreating the original dataset. Is transmission + decompression faster than direct transfer?
  30. When we have a fast enough compressor, we can overcome the limitations of the bus bandwidth. And by bus we mean any kind of bus (the memory bus too!)
  31. Example with actual data (satellite images): Blosc compression does not degrade I/O performance Thanks to: Rui Yang, Pablo Larraondo (NCI Australia)
  32. Reading satellite images: Blosc decompression accelerates I/O Thanks to: Rui

    Yang, Pablo Larraondo (NCI Australia)
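The satellite data above lives in HDF5; here is a sketch of how a Blosc-compressed dataset of that kind could be written with PyTables (file name, codec choice and shapes are made up for illustration):

    import numpy as np
    import tables

    # Enable Blosc (LZ4 codec plus shuffle) through PyTables filters.
    filters = tables.Filters(complevel=5, complib="blosc:lz4", shuffle=True)

    with tables.open_file("images.h5", mode="w") as f:
        images = f.create_carray("/", "images", atom=tables.Int16Atom(),
                                 shape=(100, 1024, 1024), filters=filters)
        images[0] = np.random.randint(0, 1000, (1024, 1024), dtype=np.int16)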
  33. Time to Answer Pending Questions

  34. Time to Answer Pending Questions (I) • When using no compression, the single Table takes more time than the Table + EArray • Not completely sure why this happens, but probably due to memory alignment issues • Take-home message: when you want to squeeze all the performance out of a computer, don't be afraid of experimenting. You will find surprises!
  35. Time to Answer Pending Questions (II) • For the single Table, the rows are too large to allow the shuffle filter to put similar significant bytes in groups, although Zlib can still do a good job at that • For the Table + EArray, the raw data is arranged so that shuffle can group similar bytes together, allowing Blosc to perform much better • Take-home message: using the correct schema inside the data container is critical for getting the best performance
  36. Can CPU-based Compression Alleviate The Memory Bottleneck?

  37. Blosc: Compressing Faster Than memcpy()
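This claim is easy to probe on your own machine; a toy benchmark sketch (not the talk's benchmark, and the outcome depends on the CPU, the codec and how compressible the data is):

    import time
    import numpy as np
    import blosc

    a = np.zeros(50_000_000, dtype=np.float64)  # highly compressible payload
    buf = a.tobytes()

    t0 = time.perf_counter()
    copy = bytearray(buf)              # a plain memory copy, memcpy-style
    t1 = time.perf_counter()
    packed = blosc.compress(buf, typesize=8)
    t2 = time.perf_counter()
    out = blosc.decompress(packed)
    t3 = time.perf_counter()

    print(f"copy {t1 - t0:.3f}s, compress {t2 - t1:.3f}s, "
          f"decompress {t3 - t2:.3f}s, ratio {len(buf) / len(packed):.0f}x")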

  38. Principles of Blosc • Splits data chunks into blocks internally (better cache utilization) • Supports the Shuffle and BitShuffle filters (see later) • Uses parallelism at two levels: multicore (multithreading) and SIMD, in Intel/AMD processors (SSE2, AVX2) and ARM (NEON); a small example follows
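A minimal example with the python-blosc bindings (the array contents and the parameters are just for illustration):

    import numpy as np
    import blosc

    blosc.set_nthreads(4)  # first level of parallelism: multithreading

    a = np.linspace(0, 1000, 10_000_000)  # varies smoothly: shuffle-friendly

    packed = blosc.compress(a.tobytes(), typesize=a.itemsize,
                            cname="lz4", shuffle=blosc.SHUFFLE)
    print(f"compression ratio: {a.nbytes / len(packed):.1f}x")

    restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)
    assert np.array_equal(a, restored)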
  39. The Shuffle filter • Shuffle works at the byte level and works well for integers or floats that vary smoothly • There is also support for a BitShuffle filter that works at the bit level • The regrouping idea is modeled below
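The idea behind Shuffle can be modeled with plain NumPy (this only illustrates the byte regrouping, not Blosc's actual SIMD implementation):

    import numpy as np

    a = np.arange(1000, dtype=np.int32)  # smoothly varying integers

    # View each 4-byte element as bytes, then regroup by byte position.
    planes = a.view(np.uint8).reshape(-1, a.itemsize).T.copy()

    # On a little-endian machine the two high-order byte planes are all
    # zeros: long constant runs that any codec compresses extremely well.
    print(np.unique(planes[2]), np.unique(planes[3]))  # -> [0] [0]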
  40. Improving RAM Speed? Less data needs to be transmitted to the CPU: the compressed dataset travels from RAM over the memory bus, and decompression happens right before the CPU cache, recreating the original dataset. Is transmission + decompression faster than direct transfer?
  41. Chunked Query Columns: a (String), b (Int32), c (Float64), d (String), each stored as chunk 1 … chunk N. Query: (b == 5) & (d == 'some string'). Only the chunks of the interesting columns (b and d) travel to the CPU cache, and an iterator yields just the interesting rows into the result. Very efficient when query selectivity is high and decompression is fast (see the sketch below).
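A sketch of such a chunked query with bcolz (hypothetical numeric columns; the string condition from the slide is omitted for simplicity):

    import numpy as np
    import bcolz

    N = 1_000_000
    b = np.random.randint(0, 100, N)
    c = np.random.rand(N)

    ct = bcolz.ctable(columns=[b, c], names=["b", "c"])

    # where() scans chunk by chunk, decompressing only the columns the
    # condition touches, and yields just the interesting rows.
    hits = [row.c for row in ct.where("b == 5")]
    print(len(hits), "rows selected")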
  42. Query Times in bcolz Reference: https://github.com/Blosc/movielens-bench/blob/master/querying-ep14.ipynb Recent server (Intel Xeon Skylake, 4 cores): compression speeds things up
  43. Query Times in bcolz 4-year-old laptop (Intel Ivy Bridge, 2 cores): compression speeds things up
  44. Query Times in bcolz 2010 laptop (Intel Core2, 2 cores): compression slows things down
  45. Sizes in bcolz Do not forget compression's main strength: we can store more data using the same resources
  46. Beware: Compression Does Not Always Go Faster

  47. Accelerating I/O With Blosc [Diagram: the storage hierarchy from CPU and L1/L2/L3 caches through main memory, solid state disk and mechanical disk, ordered by speed vs capacity; Blosc targets the upper, faster levels, while other compressors target the lower ones]
  48. Bcolz: An Example Of Data Containers Applying The Principles Of

    New Hardware
  49. What is bcolz? • bcolz provides data containers that can be used in a similar way to the ones in NumPy and Pandas • The main difference is that data storage is chunked, not contiguous • Two flavors: carray (homogeneous, n-dimensional data types) and ctable (heterogeneous types, columnar); a small sketch follows
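A minimal carray sketch (the parameters are illustrative):

    import numpy as np
    import bcolz

    a = np.linspace(0, 1, 10_000_000)

    # An in-memory carray; pass rootdir="..." to get a persistent one on disk.
    ca = bcolz.carray(a, cparams=bcolz.cparams(clevel=5, cname="lz4"))

    print(ca)          # its repr reports nbytes vs cbytes (compression ratio)
    print(ca[10:20])   # slicing decompresses only the chunks involved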
  50. Contiguous vs Chunked • NumPy container: one contiguous block of memory • carray container: chunk 1, chunk 2, …, chunk N in discontiguous memory
  51. Why Columnar? • Because it adapts better to newer computer

    architectures
  52. In-Memory Row-Wise Table (Structured NumPy array) N rows of (String, …, String, Int32, Float64, Int16). To read one interesting Int32 column: interesting data is N * 4 bytes (Int32), but the actual data read is N * 64 bytes (a full cache line per row).
  53. In-Memory Column-Wise Table (bcolz ctable) The same N rows stored column by column. To read the interesting Int32 column: interesting data is N * 4 bytes (Int32), and the actual data read is also N * 4 bytes. Less memory travels to the CPU! (made concrete below)
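The read-amplification argument can be made concrete with NumPy (field sizes are chosen so that one row fills a 64-byte cache line):

    import numpy as np

    N = 1_000_000
    rows = np.zeros(N, dtype=[("a", "S37"), ("b", np.int32),
                              ("c", np.float64), ("d", "S15")])

    # Row-wise: rows["b"] is a strided view; every 4-byte value sits inside
    # a 64-byte record, so whole cache lines of unrelated fields are fetched.
    print(rows.dtype.itemsize, "bytes fetched per row")   # 64

    # Column-wise: the same column stored contiguously reads 4 bytes/value.
    col_b = np.ascontiguousarray(rows["b"])
    print(col_b.itemsize, "bytes needed per value")       # 4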
  54. Some Projects Using bcolz • Visualfabriq's bquery (out-of-core groupbys): https://github.com/visualfabriq/bquery • scikit-allel: http://scikit-allel.readthedocs.org/ • Quantopian: http://quantopian.github.io/talks/NeedForSpeed/slides#/

  55. Closing Notes • Pay attention to hardware and software trends and make informed decisions for your current developments (which, by the way, will be deployed in the future) • If you need a data container that fits your needs, look at the nice libraries already out there (NumPy, DyND, Pandas, xarray, HDF5, bcolz, RDBMSs…) • Compression does help during I/O; make sure you have it among your tools for data handling
  56. –Marvin Minsky “In science, one can learn the most by

    studying what seems the least.”
  57. Thank you!