
Handling Big Data on Modern Computers

Computers today are designed quite differently from those of a decade ago, yet very little in software architecture has changed to accommodate the new hardware. In this talk I describe what those fundamental changes are and how to deal with them, from the point of view of a long-time developer.

Francesc Alted

August 23, 2016

Transcript

  1. Handling Big Data on Modern Computers. A Developer's View. Francesc Alted, Freelance Consultant (http://www.blosc.org/professional-services.html). Python & HDF5 hackfest, Curtin University, August 8th - 11th, 2016
  2. “No sensible decision can be made any longer without taking into account not only the computer as it is, but the computer as it will be.” (My own rephrasing of Isaac Asimov: “No sensible decision can be made any longer without taking into account not only the world as it is, but the world as it will be.”)
  3. About Me • Physicist by training • Computer scientist by passion • Open Source enthusiast by philosophy • PyTables (2002 - now) • Blosc (2009 - now) • bcolz (2010 - now)
  4. Why Open Source Projects? • A nice way to realize yourself while helping others. “The art is in the execution of an idea. Not in the idea. There is not much left just from an idea.” –Manuel Oltra, music composer. “Real artists ship.” –Seth Godin, writer
  5. PyTables + h5py. A group of people is gathering in Perth to simplify the Python stack. Thanks to Andrea Bedini at Curtin University for organizing this!
  6. Overview • Why data arrangement is critical for efficient I/O • Recent trends in computer architecture • Blosc / bcolz: examples of data containers for large datasets following the principles of newer computer architectures
  7. Example from Neutrino Detectors • 12 photomultipliers (PMTs) • The shape of the signal (48000 int16) for each event is registered for each PMT • Each event has associated metadata that must be recorded • Question: how do we store the data so that we can store as much as possible without losing speed?
  8. Two Schemas in PyTables. [Diagram: the Single Table schema keeps Event ID, Meta, PMT ID and Raw Data together in one table; the Table + Array schema keeps Event ID and Meta in a table, with Raw Data in a separate array.]
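The two schemas can be sketched with NumPy record dtypes (the same dtype machinery that PyTables table descriptions map onto). The field names and the `raw_idx` pointer below are illustrative assumptions, not the notebook's actual layout:

```python
import numpy as np

# Schema A, "Single Table": every row carries its 48000-sample waveform.
# Field names here are illustrative, not taken from the notebook.
single = np.dtype([("event_id", np.int32),
                   ("pmt_id", np.int16),
                   ("meta", np.float64),
                   ("raw", np.int16, (48000,))])

# Schema B, "Table + Array": small metadata rows; waveforms live in a
# separate chunked array (an EArray in PyTables), referenced by index.
table = np.dtype([("event_id", np.int32),
                  ("pmt_id", np.int16),
                  ("meta", np.float64),
                  ("raw_idx", np.int64)])   # hypothetical pointer field

print(single.itemsize)  # 96014 bytes per row: the waveform dominates
print(table.itemsize)   # 22 bytes per row: metadata stays compact
```

The row sizes alone show why the two schemas behave so differently under chunking and compression: in Schema A a single row is already larger than a typical chunk of the metadata table in Schema B.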
  9. • Both approaches seem able to host the data, and apparently the Single Table wins (it's simpler) • But let's experiment and see how each one behaves… Two Schemas in PyTables. Notebook available: https://github.com/PyTables/PyTables/blob/develop/examples/Single_Table-vs-EArray_Table.ipynb
  10. Data Arrangement is Critical for Performance • This happens in many cases, but especially when we want to use compression: the way we lay data out is very important for our goals
  11. Pending Questions • Why is data arrangement so important? • Why can compression bring us performance that is close to the uncompressed scenario?
  12. Computer Architecture Evolution. [Figure 1: Evolution of the hierarchical memory model, ordered by speed vs capacity. (a) Up to the end of the 80's, the primordial and simplest model: CPU, main memory, mechanical disk. (b) The 90's and 2000's, the most common current model: CPU, L1 and L2 caches, main memory, mechanical disk. (c) The 2010's: CPU, L1/L2/L3 caches, main memory, solid state disk, mechanical disk.]
  13. Hierarchy of Memory by 2017 (educated guess): L1, L2, L3 and L4 caches, RAM (addressable), XPoint (persistent), SSD PCIe (persistent), SSD SATA (persistent), HDD (persistent). 9 levels will be common!
  14. The growing gap between DRAM and HDD is facilitating the introduction of new SSD devices. Forthcoming trends: BGA SSD, M.2 SSD, PCIe SSD
  15. Latency Numbers Every Programmer Should Know

    L1 cache reference ..................... 0.5 ns
    Branch mispredict ...................... 5 ns
    L2 cache reference ..................... 7 ns (14x L1 cache)
    Mutex lock/unlock ...................... 25 ns
    Main memory reference .................. 100 ns (20x L2 cache, 200x L1 cache)
    Read 4K randomly from memory ........... 1,000 ns (0.001 ms)
    Compress 1K bytes with Zippy ........... 3,000 ns
    Send 1K bytes over 1 Gbps network ...... 10,000 ns (0.01 ms)
    Read 4K randomly from SSD* ............. 150,000 ns (0.15 ms)
    Read 1 MB sequentially from memory ..... 250,000 ns (0.25 ms)
    Round trip within same datacenter ...... 500,000 ns (0.5 ms)
    Read 1 MB sequentially from SSD* ....... 1,000,000 ns (1 ms, 4x memory)
    Disk seek .............................. 10,000,000 ns (10 ms, 20x datacenter roundtrip)
    Read 1 MB sequentially from disk ....... 20,000,000 ns (20 ms, 80x memory, 20x SSD)
    Send packet CA->Netherlands->CA ........ 150,000,000 ns (150 ms)

    Source: Jeff Dean and Peter Norvig (Google), with some additions. http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
  16. Reference Time vs Transmission Time. [Diagram: a block in storage is transmitted to the CPU cache; tref is the access time, ttrans the transmission time.] tref ~= ttrans => optimizes memory access
  17. Not All Storage Layers Are Created Equal • Memory: tref: 100 ns / ttrans (1 KB): ~100 ns • Solid State Disk: tref: 10 us / ttrans (4 KB): ~10 us • Mechanical Disk: tref: 10 ms / ttrans (1 MB): ~10 ms • Essentially, blocked data access is mandatory for speed! The slower the media, the larger the block that is worth transmitting
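The "block worth transmitting" rule is simple arithmetic: pick a block size whose transmission time roughly matches the medium's reference (access) time. The bandwidth figures below are ballpark assumptions chosen to reproduce the slide's numbers, not measurements:

```python
# For each medium: block_size ~ tref * bandwidth, i.e. the number of
# bytes that can be shipped during one reference (access) time.
# The (tref, bandwidth) pairs are rough assumptions, not measurements.
media = {
    "RAM": (100e-9, 10e9),   # ~100 ns access, ~10 GB/s
    "SSD": (10e-6, 400e6),   # ~10 us access, ~400 MB/s
    "HDD": (10e-3, 100e6),   # ~10 ms access, ~100 MB/s
}
for name, (tref, bandwidth) in media.items():
    block = tref * bandwidth
    print(f"{name}: block worth transmitting ~ {block / 1024:.0f} KiB")
```

With these assumed bandwidths the rule yields roughly 1 KB for RAM, 4 KB for SSD and 1 MB for HDD, matching the slide's figures.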
  18. We Need More Data Blocking In Our Infrastructure! • Not many data containers allow for efficient blocked access yet (but e.g. HDF5 does) • With blocked access we can use persistent media (disk) as if it were ephemeral (memory) and the other way around -> independence of media! • No silver bullet: we won't be able to find a single container that makes everybody happy; it's all about tradeoffs
  19. Question. We know that CPUs are in many cases waiting for data to arrive, mainly due to bandwidth limitations. But… could we get better bandwidth than the hardware allows?
  20. Compression for Random & Sequential Access in SSDs • Compression does help performance! [Chart with throughputs of 65 MB/s, 240 MB/s, 180 MB/s and 200 MB/s]
  21. Compression for Random & Sequential Access in SSDs • Compression does help performance! • However, it is limited by SATA bandwidth [same chart: 65 MB/s, 240 MB/s, 180 MB/s, 200 MB/s]
  22. Leveraging Compression Straight To the CPU. Less data needs to be transmitted to the CPU. [Diagram: the compressed dataset travels over the disk bus and is decompressed into the CPU cache, instead of the original dataset traveling uncompressed.] Is transmission + decompression faster than direct transfer?
  23. When we have a fast enough compressor we can overcome the limitations of the bus bandwidth. And by bus we mean any kind of bus (the memory bus too!)
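A minimal sketch of why this can work, using zlib on synthetic smooth data (the signal, the seed and the compression level are assumptions for illustration; Blosc itself is much faster than zlib):

```python
import zlib
import numpy as np

# Smooth, sensor-like signal: a bounded random walk stored as int32.
rng = np.random.default_rng(42)
signal = np.cumsum(rng.integers(-3, 4, 1_000_000)).astype(np.int32)
raw = signal.tobytes()

# Fast compression shrinks what must cross the bus.
packed = zlib.compress(raw, 1)
ratio = len(raw) / len(packed)

# If the bus moves `bw` bytes/s and decompression runs at `dspeed`
# bytes/s, shipping compressed data wins whenever
#   len(raw) / ratio / bw + len(raw) / dspeed  <  len(raw) / bw,
# i.e. whenever dspeed > bw * ratio / (ratio - 1).
print(f"compression ratio: {ratio:.1f}x")
```

The break-even condition in the comment is the whole story: the faster the compressor relative to the bus, the more the compressed path wins.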
  24. Example with actual data (satellite images): Blosc compression does not degrade I/O performance. Thanks to: Rui Yang, Pablo Larraondo (NCI Australia)
  25. Time to Answer Pending Questions (I) • When using no compression, the single Table takes more time than the Table + EArray • Not completely sure why this happens, but probably due to memory alignment issues • Take-home message: when you want to squeeze all the performance out of a computer, don't be afraid of experimenting. You will find surprises!
  26. Time to Answer Pending Questions (II) • For the single Table, the rows are too large to allow the shuffle filter to put bytes of similar significance in groups, although ZLib can still do a good job at that • For the Table + EArray, the raw data is arranged so that shuffle can group similar bytes together, allowing Blosc to perform much better • Take-home message: using the correct schema inside the data container is critical for getting the best performance
  27. Principles of Blosc • Split data chunks into blocks internally (better cache utilization) • Supports the Shuffle and BitShuffle filters (see later) • Uses parallelism at two levels: multicores (multithreading), and SIMD in Intel/AMD processors (SSE2, AVX2) and ARM (NEON)
  28. The Shuffle filter • Shuffle works at the byte level and works well for integers or floats that vary smoothly • There is also support for a BitShuffle filter that works at the bit level
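Shuffle is simple enough to sketch with NumPy. This stand-alone version (a plain reimplementation of the idea, not Blosc's SIMD code) shows why it helps a generic compressor like zlib on smoothly varying integers:

```python
import zlib
import numpy as np

def shuffle(arr):
    """Byte-level shuffle: gather the 1st byte of every element, then
    the 2nd, and so on, so bytes of equal significance sit together."""
    planes = arr.view(np.uint8).reshape(arr.size, arr.dtype.itemsize)
    return planes.T.copy().tobytes()

# Smoothly varying int32 values: the high-order bytes barely change,
# so after shuffling they form long, highly compressible runs.
a = np.arange(100_000, dtype=np.int32)
plain = len(zlib.compress(a.tobytes(), 6))
shuffled = len(zlib.compress(shuffle(a), 6))
print(plain, shuffled)  # the shuffled stream compresses much better
```

On random data the filter does nothing useful; its payoff is exactly the "varies smoothly" case the slide describes.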
  29. Improving RAM Speed? Less data needs to be transmitted to the CPU. [Diagram: the compressed dataset travels over the memory bus from RAM and is decompressed into the CPU cache, instead of the original dataset traveling uncompressed.] Is transmission + decompression faster than direct transfer?
  30. Chunked Query. [Diagram: a table with columns a (String), b (Int32), c (Float64), d (String), stored as chunks 1..N. For the query (b == 5) & (d == ‘some string’), only the interesting columns b and d are brought into the CPU cache, chunk by chunk, and an iterator yields the interesting rows.] Very efficient when query selectivity is high and decompression is fast
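The chunked-query pattern from the slide can be written in a few lines of NumPy. The helper name and chunk length are illustrative, and real bcolz code would additionally decompress each chunk before scanning it:

```python
import numpy as np

def chunked_query(columns, predicate, chunklen=1024):
    """Scan column chunks, yielding indices of rows matching predicate.
    Only the columns the predicate touches need to enter the CPU cache."""
    nrows = len(next(iter(columns.values())))
    for start in range(0, nrows, chunklen):
        chunk = {k: v[start:start + chunklen] for k, v in columns.items()}
        mask = predicate(chunk)
        if mask.any():
            yield start + np.flatnonzero(mask)

# Toy columnar table and the slide's query: (b == 5) & (d == 'some string')
cols = {"b": np.arange(10_000) % 7,
        "d": np.array(["x", "some string"] * 5_000)}
hits = np.concatenate(list(
    chunked_query(cols, lambda c: (c["b"] == 5) & (c["d"] == "some string"))))
print(hits[:4])  # the first few rows where both conditions hold
```

Working one chunk at a time keeps the active data inside the CPU cache, which is the point of the diagram above.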
  31. Sizes in bcolz. Do not forget compression's main strength: we can store more data using the same resources
  32. Accelerating I/O With Blosc. [Diagram: the memory hierarchy from L1/L2/L3 caches through main memory, solid state disk and mechanical disk, ordered by speed vs capacity; Blosc targets the cache/memory levels, while other compressors sit at the slower levels.]
  33. What is bcolz? • bcolz provides data containers that can be used in a similar way to the ones in NumPy or Pandas • The main difference is that data storage is chunked, not contiguous • Two flavors: carray (homogeneous, n-dim data types) and ctable (heterogeneous types, columnar)
  34. In-Memory Row-Wise Table (structured NumPy array). [Diagram: N rows, each with mixed fields (String, …, String, Int32, Float64, Int16). To scan one interesting Int32 column, the interesting data is N * 4 bytes, but the actual data read is N * 64 bytes (a full cache line per row).]
  35. In-Memory Column-Wise Table (bcolz ctable). [Diagram: the same N rows stored column by column. Scanning the interesting Int32 column reads exactly the interesting N * 4 bytes.] Less memory travels to the CPU!
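The difference between the two layouts is visible directly in NumPy. The field names and sizes below are illustrative stand-ins for the slide's row, not bcolz internals:

```python
import numpy as np

N = 1_000

# Row-wise: a structured array interleaves all fields of every row.
rows = np.zeros(N, dtype=[("a", "S8"), ("b", np.int32),
                          ("c", np.float64), ("d", "S8")])
col_view = rows["b"]               # a view into the row-wise storage
print(rows.dtype.itemsize)         # 28: a full row sits between values
print(col_view.strides)            # (28,): one row-stride per element
print(col_view.flags["C_CONTIGUOUS"])  # False: whole rows cross the bus

# Column-wise (ctable-style): each column is its own contiguous array.
col = np.zeros(N, dtype=np.int32)
print(col.strides)                 # (4,): only the wanted bytes move
```

The strided view means every 4-byte value pulled from the row-wise table drags a whole cache line of neighboring fields with it; the contiguous column does not.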
  36. Some Projects Using bcolz • Visualfabriq's bquery (out-of-core groupbys): https://github.com/visualfabriq/bquery • scikit-allel: http://scikit-allel.readthedocs.org/ • Quantopian: http://quantopian.github.io/talks/NeedForSpeed/slides#/
  37. Closing Notes • Pay attention to hardware and software trends and make informed decisions for your current developments (which, by the way, will be deployed in the future) • If you need a data container that fits your needs, look at the nice libraries already out there (NumPy, DyND, Pandas, xarray, HDF5, bcolz, RDBMs…) • Compression does help during I/O. Make sure you have it among your tools for data handling