
New Trends in Storing Large Data Silos in Python

It is increasingly important to understand the architecture of computers in order to design efficient data structures (or containers) for hosting large datasets: the bcolz case.


Francesc Alted

July 21, 2015

Transcript

  1. New Trends In Storing And Analyzing Large Data Silos With Python • Francesc Alted, Freelance Consultant (Department of Geosciences, University of Oslo) • July 20th, 2015 • francesc@blosc.org
  2. About Me • Physicist by training • Computer scientist by passion • I believe in Open Source • PyTables (2002 - 2011) • Blosc (2009 - now) • bcolz (2010 - now)
  3. Dreams And Reality • “The art is in the execution of an idea. Not in the idea. There is not much left just from an idea.” –Manuel Oltra, music composer • “Real artists ship” –Seth Godin, writer • Doing Open Source is a nice way to fulfill yourself while helping others
  4. Overview • The need for speed: the goal is analyzing as much data as possible with your existing resources • New trends in computer hardware • bcolz: an example of a data container for large datasets following the principles of newer computer architectures
  5. The Need For Speed

  6. Don’t Forget Python’s Real Strengths • Data-oriented libraries (NumPy, Pandas, Scikit-Learn…) • Performance (thanks to Cython, SWIG, f2py…) • Interactivity
  7. The Need For Speed • But interactivity without performance in Big Data is a no-go • Designing code for data storage performance depends very much on the computer’s architecture • IMO, existing Python libraries need to invest more effort in getting the most out of existing and future computer architectures
  8. The Daily Python Working Scenario • Quiz: which computer is best for interactivity?
  9. Although Modern Servers/Laptops Can Be Very Complex Beasts • We need to know them better so as to squeeze the most out of them!
  10. New Trends In Computer Hardware • “There's Plenty of Room at the Bottom: An Invitation to Enter a New Field of Physics” –Talk by Richard Feynman at Caltech, 1959
  11. Memory Access Time vs CPU Cycle Time • The gap is wide and still opening!
  12. Computer Architecture Evolution • [Figure 1: Evolution of the hierarchical memory model, with speed growing toward the CPU and capacity toward the disk. (a) Up to the late 80’s, the primordial (and simplest) model: CPU, main memory, mechanical disk. (b) 90’s and 2000’s, the most common current model: CPU, level 1 and level 2 caches, main memory, mechanical disk. (c) 2010’s: CPU, level 1, 2 and 3 caches, main memory, solid state disk, mechanical disk.]
  13. Reference Time vs Transmission Time • [Figure: a block is transmitted from memory/disk to the CPU; tref is the reference (latency) time and ttrans the transmission time.] • tref ~= ttrans => optimizes memory access
  14. Not All Storage Layers Are Created Equal • Memory: tref: 100 ns / ttrans (1 KB): ~100 ns • Solid State Disk: tref: 10 us / ttrans (4 KB): ~10 us • Mechanical Disk: tref: 10 ms / ttrans (1 MB): ~10 ms • This has profound implications for how you access storage! The slower the media, the larger the block that should be transmitted
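
To see where those block sizes come from: with tref ~= ttrans, the balanced block size is simply tref times the sustained bandwidth of the layer. A minimal sketch, assuming the rounded bandwidths implied by the figures above (~10 GB/s for DRAM, ~400 MB/s for an SSD, ~100 MB/s for a mechanical disk):

    LAYERS = {
        # name: (reference time in seconds, sustained bandwidth in bytes/s);
        # rounded figures from the slide, not measurements of a real machine
        "memory": (100e-9, 10e9),   # 100 ns, ~10 GB/s
        "ssd": (10e-6, 400e6),      # 10 us,  ~400 MB/s
        "hdd": (10e-3, 100e6),      # 10 ms,  ~100 MB/s
    }

    for name, (tref, bandwidth) in LAYERS.items():
        block = tref * bandwidth    # ttrans equals tref at this block size
        print("%-6s balanced block: ~%.0f KB" % (name, block / 1024.0))

    # -> memory ~1 KB, ssd ~4 KB, hdd ~977 KB (~1 MB), as in the slide
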
  15. Trends On Storage • The growing gap between DRAM and HDD is facilitating the introduction of new SSD devices
  16. Trends On CPUs • CPU+GPU integration

  17. bcolz: An Example Of Data Containers Applying The Principles Of New Hardware
  18. What is bcolz? • bcolz provides data containers that can be used in a similar way as the ones in NumPy, Pandas, DyND or others • In bcolz, data storage is chunked, not contiguous, and chunks can be compressed! • Two flavors: carray (homogeneous types, n-dim data) and ctable (heterogeneous types, columnar)
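
The gist of the two flavors, as a minimal sketch against the bcolz 1.x interface (the directory name "a.bcolz" is just illustrative):

    import numpy as np
    import bcolz

    # carray: a chunked, compressed container for homogeneous n-dim data.
    a = bcolz.carray(np.arange(1e7))
    print(a)  # the repr reports shape, dtype and compression ratio

    # ctable: a columnar container for heterogeneous, table-like data.
    N = 1000 * 1000
    ct = bcolz.ctable(columns=[np.arange(N, dtype=np.int32),
                               np.random.rand(N)],
                      names=["id", "value"])

    # Both flavors can live on disk instead of in memory.
    a_disk = bcolz.carray(np.arange(1e7), rootdir="a.bcolz", mode="w")
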
  19. Contiguous vs Chunked • NumPy container: one contiguous memory block • carray container: chunk 1, chunk 2, …, chunk N in discontiguous memory
  20. Why Chunking? • It is more difficult to handle data in chunks, so why bother? • Efficient enlarging and shrinking • Compression is feasible • Chunk size can be adapted to the storage layer (memory, SSD, mechanical disk), as sketched below
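
On that last point, a carray's chunk size can be tuned at construction time; a minimal sketch (the particular values are arbitrary):

    import numpy as np
    import bcolz

    # chunklen fixes the number of elements per chunk explicitly, while
    # expectedlen lets bcolz choose a chunk size suited to the expected
    # final container size.
    explicit = bcolz.carray(np.arange(1e6), chunklen=4 * 1024)
    auto = bcolz.carray(np.arange(1e6), expectedlen=int(1e9))
    print(explicit.chunklen, auto.chunklen)
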
  21. Appending Data in NumPy • [Figure: to enlarge an array, a new memory allocation is made and both the array to be enlarged and the data to append are copied into the final array object; both memory areas have to exist simultaneously.]
  22. Appending Data in bcolz • [Figure: the carray to be enlarged keeps its chunks 1..N untouched; the data to append is compressed with Blosc into new chunk(s) attached to the final carray object.] • Only compression on new data required! • Less memory travels to the CPU!
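
The contrast in one minimal sketch (the sizes are arbitrary):

    import numpy as np
    import bcolz

    new_data = np.arange(1000)

    # NumPy: np.append allocates a brand-new array and copies both
    # operands into it, so old and new buffers coexist in memory.
    a = np.arange(1000)
    a = np.append(a, new_data)

    # bcolz: carray.append compresses only the incoming data into new
    # chunks; the pre-existing chunks are neither copied nor touched.
    ca = bcolz.carray(np.arange(1000))
    ca.append(new_data)
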
  23. Why Columnar? • Because it adapts better to newer computer architectures that try to fetch blocks of data (cache lines, typically 64 bytes) for every memory reference
  24. In-Memory Row-Wise Table (Structured NumPy array) • [Figure: N rows, each laid out as String … String Int32 Int16 Float64; scanning the interesting Int32 column pulls a whole cache line per row.] • Desired data: N * 4 bytes (Int32) • Actual data read: N * 64 bytes (cache line)
  25. In-Memory Column-Wise Table (bcolz ctable) • [Figure: the same N rows stored column by column, so the interesting Int32 column is contiguous.] • Desired data: N * 4 bytes (Int32) • Actual data read: N * 4 bytes (Int32) • Less memory travels to the CPU!
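
The two layouts side by side in a minimal sketch (the column names are made up):

    import numpy as np
    import bcolz

    N = 1000 * 1000

    # Row-wise: a structured array interleaves all the fields of a row,
    # so scanning one column drags whole rows through the cache.
    rows = np.zeros(N, dtype=[("name", "S8"), ("id", np.int32),
                              ("flag", np.int16), ("value", np.float64)])
    ids = rows["id"]    # a strided view over the interleaved rows

    # Column-wise: a ctable stores each column contiguously (and
    # compressed), so the same scan reads only the bytes it needs.
    ct = bcolz.ctable(columns=[rows["name"], rows["id"],
                               rows["flag"], rows["value"]],
                      names=["name", "id", "flag", "value"])
    ids = ct["id"][:]   # contiguous copy, decompressed chunk by chunk
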
  26. Why Compression (I)? • [Figure: the compressed dataset occupies a fraction of the original dataset.] • More data can be packed using the same storage
  27. Why Compression (II)? • Less data needs to be transmitted to the CPU • [Figure: the original dataset travels in full from disk or memory (RAM) over the bus to the CPU cache, whereas the compressed dataset travels in compressed form and is decompressed on arrival.] • Transmission + decompression faster than direct transfer?
  28. Blosc: Compressing Faster Than memcpy() • bcolz is using Blosc to compress chunks
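
Blosc is also usable on its own through the python-blosc bindings; a minimal sketch (the array contents are just an example):

    import numpy as np
    import blosc  # the python-blosc bindings

    a = np.linspace(0, 100, 10 * 1000 * 1000)

    # typesize tells Blosc's shuffle filter the item width, so bytes of
    # equal significance can be grouped together before compressing.
    packed = blosc.compress(a.tobytes(), typesize=a.itemsize,
                            clevel=5, cname="blosclz")
    print(a.nbytes / float(len(packed)))  # compression ratio

    restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)
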
  29. How Blosc Works • Multithreading & SIMD at work! • [Figure by Valentin Haenel]
  30. Accelerating I/O With Blosc • [Figure: benchmark of Blosc versus other compressors for accelerating I/O.]
  31. Blosc In OpenVDB And Houdini • “Blosc compresses almost as well as ZLIB, but it is much faster” –Release Notes for OpenVDB 3.0, maintained by DreamWorks Animation
  32. Some Results Using The Movielens Dataset

  33. bcolz vs pandas (Size) • Compression means ~20x less space

  34. Query Times, 3-year-old laptop (Ivy Bridge) • Compression leads to better performance, even for in-memory bcolz data containers
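
The queries behind such timings are of this flavor; a sketch with made-up columns (bcolz evaluates the expression blockwise over the compressed columns, via numexpr when available):

    import numpy as np
    import bcolz

    N = 1000 * 1000
    ct = bcolz.ctable(columns=[np.random.randint(0, 6, N),
                               np.random.rand(N)],
                      names=["rating", "score"])

    # where() yields only the rows matching the expression, so the full
    # table is never decompressed at once.
    hits = [r.score for r in ct.where("(rating == 5) & (score > 0.9)")]
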
  35. Query Times, 5-year-old laptop (Core2) • Compression still makes things slower on old boxes
  36. Projects Using bcolz • Visualfabriq’s bquery (out-of-core groupby’s): https://github.com/visualfabriq/bquery • Continuum’s Blaze: http://blaze.pydata.org/ • Quantopian: http://quantopian.github.io/talks/NeedForSpeed/slides#/ • Many more!
  37. bquery - On-Disk GroupBy • In-memory (pandas) vs on-disk (bquery+bcolz) groupby • “Switching to bcolz enabled us to have a much better scalable architecture yet with near in-memory performance” –Carst Vaartjes, co-founder of visualfabriq
  38. Quantopian’s Use Case • “We set up a project to convert Quantopian’s production and development infrastructure to use bcolz” –Eddie Herbert
  39. Closing Notes • Chances are that there is a data container that fits your needs already out there (NumPy, DyND, Pandas, PyTables, bcolz…) • Pay attention to hardware and software trends and make informed decisions about your current development (which, btw, will be deployed in the future :) • Compression is a useful feature, not only to store more data, but also to process data faster under the right conditions.
  40. “It is change, continuing change, inevitable change, that is the dominant factor in Computer Sciences. No sensible decision can be made any longer without taking into account not only the computer as it is, but the computer as it will be.” –My own paraphrase of a quote by Isaac Asimov
  41. Thank You!