
New Trends in Storing Large Data Silos in Python

It is increasingly important to understand the architecture of computers in order to design efficient data structures (or containers) for hosting large datasets: the bcolz case.


Francesc Alted

July 21, 2015

Transcript

  1. New Trends In Storing And Analyzing Large Data Silos With Python • Francesc Alted, Freelance Consultant (Department of Geosciences, University of Oslo) • July 20th, 2015 • francesc@blosc.org
  2. About Me • Physicist by training • Computer scientist by passion • I believe in Open Source • PyTables (2002 - 2011) • Blosc (2009 - now) • bcolz (2010 - now)
  3. Dreams And Reality • “The art is in the execution of an idea. Not in the idea. There is not much left just from an idea.” –Manuel Oltra, music composer • “Real artists ship” –Seth Godin, writer • Doing Open Source is a nice way to fulfill yourself while helping others
  4. Overview • The need for speed: the goal is analyzing as much data as possible with your existing resources • New trends in computer hardware • bcolz: an example of a data container for large datasets following the principles of newer computer architectures
  5. The Need For Speed

  6. Don’t Forget Python’s Real Strengths • Data-oriented libraries (NumPy, Pandas, Scikit-Learn…) • Performance (thanks to Cython, SWIG, f2py…) • Interactivity
  7. The Need For Speed • But interactivity without performance in Big Data is a no-go • Designing code for data storage performance depends very much on the computer’s architecture • IMO, existing Python libraries need to invest more effort in getting the most out of existing and future computer architectures
  8. The Daily Python Working Scenario • Quiz: which computer is best for interactivity?
  9. Although Modern Servers/Laptops Can Be Very Complex Beasts • We need to know them better so as to squeeze the most out of them!
  10. New Trends In Computer Hardware • “There's Plenty of Room at the Bottom: An Invitation to Enter a New Field of Physics” –Talk by Richard Feynman at Caltech, 1959
  11. Memory Access Time vs CPU Cycle Time • The gap is wide and still opening!
  12. Computer Architecture Evolution • [Figure 1: Evolution of the hierarchical memory model, with speed growing toward the CPU and capacity toward the disk. (a) Up to the late 80’s, the primordial (and simplest) model: CPU, main memory, mechanical disk. (b) 90’s and 2000’s, the most common current model: CPU, level 1 and level 2 caches, main memory, mechanical disk. (c) 2010’s: CPU, level 1, 2 and 3 caches, main memory, solid state disk, mechanical disk.]
  13. Reference Time vs Transmission Time • [Figure: a block is transmitted from memory/disk to the CPU; tref is the reference (latency) time and ttrans the transmission time.] • tref ~= ttrans => optimizes memory access
  14. Not All Storage Layers Are Created Equal • Memory: tref: 100 ns / ttrans (1 KB): ~100 ns • Solid State Disk: tref: 10 us / ttrans (4 KB): ~10 us • Mechanical Disk: tref: 10 ms / ttrans (1 MB): ~10 ms • This has profound implications for how you access storage! The slower the media, the larger the block that should be transmitted
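
To see where those block sizes come from: with tref ~= ttrans, the balanced block size is simply tref times the sustained bandwidth of the layer. A minimal sketch, assuming the rounded bandwidths implied by the figures above (~10 GB/s for DRAM, ~400 MB/s for an SSD, ~100 MB/s for a mechanical disk):

    LAYERS = {
        # name: (reference time in seconds, sustained bandwidth in bytes/s);
        # rounded figures from the slide, not measurements of a real machine
        "memory": (100e-9, 10e9),   # 100 ns, ~10 GB/s
        "ssd": (10e-6, 400e6),      # 10 us,  ~400 MB/s
        "hdd": (10e-3, 100e6),      # 10 ms,  ~100 MB/s
    }

    for name, (tref, bandwidth) in LAYERS.items():
        block = tref * bandwidth    # ttrans equals tref at this block size
        print("%-6s balanced block: ~%.0f KB" % (name, block / 1024.0))

    # -> memory ~1 KB, ssd ~4 KB, hdd ~977 KB (~1 MB), as in the slide
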
  15. Trends On Storage • The growing gap between DRAM and HDD is facilitating the introduction of new SSD devices
  16. Trends On CPUs • CPU+GPU integration

  17. bcolz: An Example Of Data Containers Applying The Principles Of New Hardware
  18. What is bcolz? • bcolz provides data containers that can be used in a similar way as the ones in NumPy, Pandas, DyND or others • In bcolz, data storage is chunked, not contiguous, and chunks can be compressed! • Two flavors: carray (homogeneous types, n-dim data) and ctable (heterogeneous types, columnar)
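
The gist of the two flavors, as a minimal sketch against the bcolz 1.x interface (the directory name "a.bcolz" is just illustrative):

    import numpy as np
    import bcolz

    # carray: a chunked, compressed container for homogeneous n-dim data.
    a = bcolz.carray(np.arange(1e7))
    print(a)  # the repr reports shape, dtype and compression ratio

    # ctable: a columnar container for heterogeneous, table-like data.
    N = 1000 * 1000
    ct = bcolz.ctable(columns=[np.arange(N, dtype=np.int32),
                               np.random.rand(N)],
                      names=["id", "value"])

    # Both flavors can live on disk instead of in memory.
    a_disk = bcolz.carray(np.arange(1e7), rootdir="a.bcolz", mode="w")
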
  19. Contiguous vs Chunked • NumPy container: one contiguous memory block • carray container: chunk 1, chunk 2, …, chunk N in discontiguous memory
  20. Why Chunking? • It is more difficult to handle data in chunks, so why bother? • Efficient enlarging and shrinking • Compression is feasible • Chunk size can be adapted to the storage layer (memory, SSD, mechanical disk), as sketched below
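
On that last point, a carray's chunk size can be tuned at construction time; a minimal sketch (the particular values are arbitrary):

    import numpy as np
    import bcolz

    # chunklen fixes the number of elements per chunk explicitly, while
    # expectedlen lets bcolz choose a chunk size suited to the expected
    # final container size.
    explicit = bcolz.carray(np.arange(1e6), chunklen=4 * 1024)
    auto = bcolz.carray(np.arange(1e6), expectedlen=int(1e9))
    print(explicit.chunklen, auto.chunklen)
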
  21. Appending Data in NumPy • [Figure: to enlarge an array, a new memory allocation is made and both the array to be enlarged and the data to append are copied into the final array object; both memory areas have to exist simultaneously.]
  22. Appending Data in bcolz • [Figure: the carray to be enlarged keeps its chunks 1..N untouched; the data to append is compressed with Blosc into new chunk(s) attached to the final carray object.] • Only compression on new data required! • Less memory travels to the CPU!
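
The contrast in one minimal sketch (the sizes are arbitrary):

    import numpy as np
    import bcolz

    new_data = np.arange(1000)

    # NumPy: np.append allocates a brand-new array and copies both
    # operands into it, so old and new buffers coexist in memory.
    a = np.arange(1000)
    a = np.append(a, new_data)

    # bcolz: carray.append compresses only the incoming data into new
    # chunks; the pre-existing chunks are neither copied nor touched.
    ca = bcolz.carray(np.arange(1000))
    ca.append(new_data)
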
  23. Why Columnar? • Because it adapts better to newer computer architectures that try to fetch blocks of data (cache lines, typically 64 bytes) for every memory reference
  24. In-Memory Row-Wise Table (Structured NumPy array) • [Figure: N rows, each laid out as String … String Int32 Int16 Float64; scanning the interesting Int32 column pulls a whole cache line per row.] • Desired data: N * 4 bytes (Int32) • Actual data read: N * 64 bytes (cache line)
  25. In-Memory Column-Wise Table (bcolz ctable) • [Figure: the same N rows stored column by column, so the interesting Int32 column is contiguous.] • Desired data: N * 4 bytes (Int32) • Actual data read: N * 4 bytes (Int32) • Less memory travels to the CPU!
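
The two layouts side by side in a minimal sketch (the column names are made up):

    import numpy as np
    import bcolz

    N = 1000 * 1000

    # Row-wise: a structured array interleaves all the fields of a row,
    # so scanning one column drags whole rows through the cache.
    rows = np.zeros(N, dtype=[("name", "S8"), ("id", np.int32),
                              ("flag", np.int16), ("value", np.float64)])
    ids = rows["id"]    # a strided view over the interleaved rows

    # Column-wise: a ctable stores each column contiguously (and
    # compressed), so the same scan reads only the bytes it needs.
    ct = bcolz.ctable(columns=[rows["name"], rows["id"],
                               rows["flag"], rows["value"]],
                      names=["name", "id", "flag", "value"])
    ids = ct["id"][:]   # contiguous copy, decompressed chunk by chunk
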
  26. Why Compression (I)? • [Figure: the compressed dataset occupies a fraction of the original dataset.] • More data can be packed using the same storage
  27. Why Compression (II)? • Less data needs to be transmitted to the CPU • [Figure: the original dataset travels in full from disk or memory (RAM) over the bus to the CPU cache, whereas the compressed dataset travels in compressed form and is decompressed on arrival.] • Transmission + decompression faster than direct transfer?
  28. Blosc: Compressing Faster Than memcpy() • bcolz is using Blosc to compress chunks
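
Blosc is also usable on its own through the python-blosc bindings; a minimal sketch (the array contents are just an example):

    import numpy as np
    import blosc  # the python-blosc bindings

    a = np.linspace(0, 100, 10 * 1000 * 1000)

    # typesize tells Blosc's shuffle filter the item width, so bytes of
    # equal significance can be grouped together before compressing.
    packed = blosc.compress(a.tobytes(), typesize=a.itemsize,
                            clevel=5, cname="blosclz")
    print(a.nbytes / float(len(packed)))  # compression ratio

    restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)
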
  29. How Blosc Works • Multithreading & SIMD at work! • [Figure by Valentin Haenel]
  30. Accelerating I/O With Blosc • [Figure: benchmark of Blosc versus other compressors for accelerating I/O.]
  31. Blosc In OpenVDB And Houdini • “Blosc compresses almost as well as ZLIB, but it is much faster” –Release Notes for OpenVDB 3.0, maintained by DreamWorks Animation
  32. Some Results Using The Movielens Dataset

  33. bcolz vs pandas (Size) • Compression means ~20x less space

  34. Query Times, 3-year-old laptop (Ivy Bridge) • Compression leads to better performance, even for in-memory bcolz data containers
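
The queries behind such timings are of this flavor; a sketch with made-up columns (bcolz evaluates the expression blockwise over the compressed columns, via numexpr when available):

    import numpy as np
    import bcolz

    N = 1000 * 1000
    ct = bcolz.ctable(columns=[np.random.randint(0, 6, N),
                               np.random.rand(N)],
                      names=["rating", "score"])

    # where() yields only the rows matching the expression, so the full
    # table is never decompressed at once.
    hits = [r.score for r in ct.where("(rating == 5) & (score > 0.9)")]
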
  35. Query Times, 5-year-old laptop (Core2) • Compression still makes things slower on old boxes
  36. Projects Using bcolz • Visualfabriq’s bquery (out-of-core groupby’s): https://github.com/visualfabriq/bquery • Continuum’s Blaze: http://blaze.pydata.org/ • Quantopian: http://quantopian.github.io/talks/NeedForSpeed/slides#/ • Many more!
  37. bquery - On-Disk GroupBy • In-memory (pandas) vs on-disk (bquery+bcolz) groupby • “Switching to bcolz enabled us to have a much better scalable architecture yet with near in-memory performance” –Carst Vaartjes, co-founder of visualfabriq
  38. Quantopian’s Use Case • “We set up a project to convert Quantopian’s production and development infrastructure to use bcolz” –Eddie Herbert
  39. Closing Notes • Chances are that there is a data container that fits your needs already out there (NumPy, DyND, Pandas, PyTables, bcolz…) • Pay attention to hardware and software trends and make informed decisions about your current development (which, btw, will be deployed in the future :) • Compression is a useful feature, not only to store more data, but also to process data faster under the right conditions.
  40. “It is change, continuing change, inevitable change, that is the dominant factor in Computer Sciences. No sensible decision can be made any longer without taking into account not only the computer as it is, but the computer as it will be.” –My own paraphrase of a quote by Isaac Asimov
  41. Thank You!