New Trends in Storing Large Data Silos in Python

Understanding computer architecture is increasingly important for designing efficient data structures (or containers) that host large datasets. The talk uses bcolz as a case study.


July 21, 2015


  1. New Trends In Storing And Analyzing Large Data Silos With Python

    Francesc Alted, Freelance Consultant (Department of Geosciences, University of Oslo). July 20th, 2015. [email protected]
  2. About Me

    • Physicist by training
    • Computer scientist by passion
    • I believe in Open Source
    • PyTables (2002 - 2011)
    • Blosc (2009 - now)
    • bcolz (2010 - now)
  3. Dreams And Reality

    “The art is in the execution of an idea. Not in the idea. There is not much left just from an idea.” –Manuel Oltra, music composer
    “Real artists ship” –Seth Godin, writer
    • Doing Open Source is a nice way to fulfill yourself while helping others
  4. Overview

    • The need for speed: the goal is to analyze as much data as possible with your existing resources
    • New trends in computer hardware
    • bcolz: an example of a data container for large datasets that follows the principles of newer computer architectures
  5. Don’t Forget Python’s Real Strengths

    • Data-oriented libraries (NumPy, Pandas, Scikit-Learn…)
    • Performance (thanks to Cython, SWIG, f2py…)
    • Interactivity
  6. The Need For Speed

    • But interactivity without performance in Big Data is a no-go
    • Designing code for data storage performance depends very much on the computer’s architecture
    • IMO, existing Python libraries need to invest more effort in getting the most out of existing and future computer architectures
  7. Although Modern Servers/Laptops Can Be Very Complex Beasts

    We need to know them better so as to squeeze the most out of them!
  8. New Trends In Computer Hardware

    “There's Plenty of Room at the Bottom”: An Invitation to Enter a New Field of Physics –Talk by Richard Feynman at Caltech, 1959
  9. Computer Architecture Evolution

    Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model; (b) the most common current …
    [Figure: three memory hierarchies, ordered by speed and capacity. (a) Up to late 80’s: CPU, main memory, mechanical disk. (b) 90’s and 2000’s: CPU, level 1 cache, level 2 cache, main memory, mechanical disk. (c) 2010’s: CPU, level 1 cache, level 2 cache, level 3 cache, main memory, solid state disk, mechanical disk.]
  10. Reference Time vs Transmission Time

    [Figure: a block is transmitted from memory / disk to the CPU; tref is the reference time, ttrans the transmission time.]
    tref ~= ttrans => optimizes memory access
  11. Not All Storage Layers Are Created Equal

    • Memory: tref: 100 ns / ttrans (1 KB): ~100 ns
    • Solid State Disk: tref: 10 us / ttrans (4 KB): ~10 us
    • Mechanical Disk: tref: 10 ms / ttrans (1 MB): ~10 ms
    This has profound implications for how you access storage! The slower the media, the larger the block that should be transmitted
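The rule of thumb on this slide (pick a block size whose transmission time roughly matches the reference time) can be checked with a little arithmetic. The sketch below uses only the figures quoted on the slide; the function names, the derived bandwidths, and the 64-byte comparison are illustrative assumptions, not measurements.

```python
# Latency (t_ref) and the block size whose transmission time (t_trans)
# roughly matches it, as quoted on the slide. Times are in seconds.
LAYERS = {
    "memory": {"t_ref": 100e-9, "block_bytes": 1024, "t_trans": 100e-9},
    "ssd": {"t_ref": 10e-6, "block_bytes": 4 * 1024, "t_trans": 10e-6},
    "mechanical disk": {"t_ref": 10e-3, "block_bytes": 1024 * 1024, "t_trans": 10e-3},
}


def implied_bandwidth(layer):
    """Bytes per second implied by the slide's block size and t_trans."""
    return layer["block_bytes"] / layer["t_trans"]


def transfer_efficiency(layer, block_bytes):
    """Fraction of the total access time spent actually moving data."""
    t_trans = block_bytes / implied_bandwidth(layer)
    return t_trans / (layer["t_ref"] + t_trans)


for name, layer in LAYERS.items():
    balanced = transfer_efficiency(layer, layer["block_bytes"])
    tiny = transfer_efficiency(layer, 64)  # hypothetical 64-byte request
    print(f"{name}: balanced block {balanced:.0%}, 64-byte block {tiny:.4%}")
```

With a balanced block, about half of each access is spent moving data; with a tiny block on a slow medium, almost all of the time goes to latency, which is why slower media call for larger blocks.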
  12. Trends On Storage

    The growing gap between DRAM and HDD is facilitating the introduction of new SSD devices
  13. What is bcolz?

    • bcolz provides data containers that can be used in a similar way to the ones in NumPy, Pandas, DyND or others
    • In bcolz, data storage is chunked, not contiguous, and chunks can be compressed!
    • Two flavors:
    • carray: homogeneous types, n-dim data
    • ctable: heterogeneous types, columnar
  14. Why Chunking?

    • It is more difficult to handle data in chunks, so why bother?
    • Efficient enlarging and shrinking
    • Compression is feasible
    • Chunk size can be adapted to the storage layer (memory, SSD, mechanical disk)
  15. Appending Data in NumPy

    [Figure: the array to be enlarged and the data to append are copied into a new memory allocation, which becomes the final array object.]
    Both memory areas have to exist simultaneously
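A minimal sketch of the copy described on this slide, using `np.append`. The array sizes are arbitrary and chosen only for illustration.

```python
import numpy as np

original = np.arange(1_000_000, dtype=np.int32)
extra = np.arange(10, dtype=np.int32)

# np.append always allocates a new buffer and copies both inputs into it.
enlarged = np.append(original, extra)

# The old and the new buffer are distinct, so while `original` is still
# referenced both allocations are alive at the same time.
separate_buffers = not np.shares_memory(original, enlarged)
peak_bytes = original.nbytes + enlarged.nbytes

print(separate_buffers, original.nbytes, enlarged.nbytes, peak_bytes)
```

Appending ten values to a four-megabyte array therefore briefly needs roughly eight megabytes, and the cost grows with the size of the existing array rather than with the size of the new data.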
  16. Appending Data in bcolz

    [Figure: the carray to be enlarged (chunk 1, chunk 2) plus the data to append, compressed with Blosc, gives the final carray object (chunk 1, chunk 2, new chunk(s)).]
    Only compression on new data required!
    Less memory travels to CPU!
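The idea that appending only compresses the new data can be illustrated with a toy container. This is a hypothetical sketch, not bcolz's implementation: `ChunkedBytes` and its attributes are invented for the example, and the standard library's `zlib` stands in for Blosc.

```python
import zlib


class ChunkedBytes:
    """Toy append-only container: fixed-size chunks, each compressed alone.

    Hypothetical illustration only; zlib stands in for Blosc and this is
    not how bcolz is implemented.
    """

    def __init__(self, chunk_size=64 * 1024):
        self.chunk_size = chunk_size
        self.chunks = []          # compressed, never modified once written
        self.leftover = b""       # uncompressed tail shorter than a chunk
        self.compress_calls = 0

    def append(self, data):
        buf = self.leftover + data
        # Only newly completed chunks are compressed; existing chunks
        # are never touched or copied.
        while len(buf) >= self.chunk_size:
            self.chunks.append(zlib.compress(buf[: self.chunk_size]))
            self.compress_calls += 1
            buf = buf[self.chunk_size:]
        self.leftover = buf

    def tobytes(self):
        """Decompress every chunk and return the full payload."""
        return b"".join(zlib.decompress(c) for c in self.chunks) + self.leftover

    @property
    def stored_bytes(self):
        """Bytes actually held: compressed chunks plus the raw tail."""
        return sum(len(c) for c in self.chunks) + len(self.leftover)


store = ChunkedBytes(chunk_size=1024)
store.append(bytes(3000))        # fills two chunks, keeps a 952-byte tail
first_chunk = store.chunks[0]
store.append(b"\x01" * 2000)     # completes two more chunks
print(len(store.chunks), store.compress_calls, store.stored_bytes)
```

Unlike the NumPy case, the cost of an append here depends only on the amount of new data, and the chunk size is a free parameter that could be tuned to the storage layer.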
  17. Why Columnar?

    • Because it adapts better to newer computer architectures that try to fetch blocks of data (cache lines, typically 64 bytes) for every memory reference
  18. In-Memory Row-Wise Table (Structured NumPy array)

    [Figure: N rows, each laid out as String … String Int32 Int16 Float64; the Int32 field is the interesting column.]
    Desired Data: N * 4 bytes (Int32)
    Actual Data Read: N * 64 bytes (cache line)
  19. In-Memory Column-Wise Table (bcolz ctable)

    [Figure: N rows with fields String … String Int32 Int16 Float64, stored column by column; the Int32 column is the interesting column.]
    Desired Data: N * 4 bytes (Int32)
    Actual Data Read: N * 4 bytes (Int32)
    Less memory travels to CPU!
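The difference between the two layouts shows up in the strides NumPy reports. The sketch below assumes a hypothetical record that happens to be exactly 64 bytes wide (field names and sizes are invented for the example), and it uses a plain contiguous NumPy array as a stand-in for a column store rather than an actual bcolz ctable.

```python
import numpy as np

n_rows = 1000

# Row-wise: one packed record per row, 50 + 4 + 2 + 8 = 64 bytes.
row_dtype = np.dtype([
    ("name", "S50"),        # 50 bytes
    ("count", np.int32),    # 4 bytes, the "interesting column"
    ("flag", np.int16),     # 2 bytes
    ("value", np.float64),  # 8 bytes
])

table = np.zeros(n_rows, dtype=row_dtype)
count_view = table["count"]  # strided view into the records, no copy

# Column-wise: the same field stored in its own contiguous buffer.
count_column = np.ascontiguousarray(count_view)

row_stride = count_view.strides[0]    # bytes between consecutive values
col_stride = count_column.strides[0]

bytes_wanted = n_rows * count_view.itemsize
# Rough estimate, assuming every record occupies its own cache line.
bytes_touched_rowwise = n_rows * row_stride
bytes_touched_colwise = n_rows * col_stride

print(row_stride, col_stride, bytes_wanted,
      bytes_touched_rowwise, bytes_touched_colwise)
```

Scanning the Int32 field in the row-wise table jumps a full record between values, so under this simplified model the memory system moves 16 times more data than the query needs; in the contiguous column every fetched byte is useful.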
  20. Why Compression (II)?

    Less data needs to be transmitted to the CPU
    [Figure: the original dataset and the compressed dataset travel from disk or memory (RAM) over the disk or memory bus to the CPU cache; the compressed dataset is decompressed there.]
    Transmission + decompression faster than direct transfer?
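The question on this slide can be turned into a simple break-even condition. The sketch below assumes a deliberately crude model in which transfer and decompression happen one after the other on a single core; the function names and the example numbers are hypothetical, not benchmarks.

```python
def direct_time(nbytes, bandwidth):
    """Seconds to move the uncompressed data over the bus."""
    return nbytes / bandwidth


def compressed_time(nbytes, bandwidth, ratio, decompress_speed):
    """Seconds to move the compressed data and then decompress it.

    ratio: uncompressed size divided by compressed size (> 1).
    decompress_speed: bytes of decompressed output produced per second.
    """
    return nbytes / ratio / bandwidth + nbytes / decompress_speed


def min_decompress_speed(bandwidth, ratio):
    """Decompression speed above which the compressed path is faster.

    Follows from n/(ratio*B) + n/D < n/B, i.e. D > B / (1 - 1/ratio).
    """
    return bandwidth / (1 - 1 / ratio)


# Hypothetical numbers: 1 GB of data, a 10 GB/s bus, 2x compression.
n, bus, ratio = 1e9, 10e9, 2.0
threshold = min_decompress_speed(bus, ratio)
fast = compressed_time(n, bus, ratio, decompress_speed=40e9)
slow = compressed_time(n, bus, ratio, decompress_speed=10e9)
print(threshold, direct_time(n, bus), fast, slow)
```

Under this model compression only pays off when the decompressor outruns the bus by a margin that shrinks as the compression ratio grows, which is the regime a compressor designed for speed aims at.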
  21. Accelerating I/O With Blosc

    [Figure: entries grouped under "Blosc" and "Other compressors".]
  22. Blosc In OpenVDB And Houdini

    “Blosc compresses almost as well as ZLIB, but it is much faster” –Release Notes for OpenVDB 3.0, maintained by DreamWorks Animation
  23. Query Times

    3-year-old laptop (Ivy Bridge)
    • Compression leads to better performance, even for in-memory bcolz data containers
  24. Projects Using bcolz

    • Visualfabriq’s bquery (out-of-core groupby’s): https://github.com/visualfabriq/bquery
    • Continuum’s Blaze: http://blaze.pydata.org/
    • Quantopian: http://quantopian.github.io/talks/NeedForSpeed/slides#/
    • Many more!
  25. bquery - On-Disk GroupBy

    In-memory (pandas) vs on-disk (bquery+bcolz) groupby
    “Switching to bcolz enabled us to have a much better scalable architecture yet with near in-memory performance” – Carst Vaartjes, co-founder visualfabriq
  26. Quantopian’s Use Case

    “We set up a project to convert Quantopian’s production and development infrastructure to use bcolz” – Eddie Herbert
  27. Closing Notes

    • Chances are that a data container that fits your needs is already out there (NumPy, DyND, Pandas, PyTables, bcolz…)
    • Pay attention to hardware and software trends and make informed decisions about your current development (which, btw, will be deployed in the future :)
    • Compression is a useful feature, not only to store more data, but also to process data faster under the right conditions.
  28. “It is change, continuing change, inevitable change, that is the dominant factor in Computer Sciences. No sensible decision can be made any longer without taking into account not only the computer as it is, but the computer as it will be.”

    – My own paraphrase of a quote by Isaac Asimov