
Python for Data Science

A bestiary of the Python tools for doing data science in a distributed environment

Edoardo Vacchi

October 06, 2015

Transcript

  1. IPython » Evolution of the Interactive REPL » Notebook »

    Born with Wolfram Mathematica [citation needed]
  2. NumPy » Rather thin wrapper around high-performance C/Fortran » De-facto standard for numerical analysis in Python >>> import numpy as np >>> arr = np.random.randn(100,100) >>> arr.shape ...
  3. Pandas Dataframes >>> DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5,

    6])]) A B 0 1 4 1 2 5 2 3 6 >>> DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])], orient='index', columns=['one', 'two', 'three']) one two three A 1 2 3 B 4 5 6
  4. >>> iris = read_csv('iris.data') >>> iris.head() SepalLength SepalWidth PetalLength PetalWidth Name 0 5.1 3.5 1.4 0.2 Iris-setosa 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa 3 4.6 3.1 1.5 0.2 Iris-setosa 4 5.0 3.6 1.4 0.2 Iris-setosa array-like indexing >>> iris[iris.SepalLength < 5.0].head()
  5. Integration with matplotlib iris.query('SepalLength > 5').assign( SepalRatio = lambda x:

    x.SepalWidth / x.SepalLength, PetalRatio = lambda x: x.PetalWidth / x.PetalLength ).plot(kind='scatter', x='SepalRatio', y='PetalRatio')
  6. Out-of-Core Data Processing » Blaze » Spartan » Dask »

    DistArray » Ibis (also, Global Arrays, Numba, IPython Parallel, DPark...)
  7. Blaze » Lazy execution » Represents a data source »

    Abstracts from the many backends » Exposes a dataframe-like API » Backends implement the API
  8. >>> t = Data([(1, 'Alice', 100), ... (2, 'Bob', -200),

    ... (3, 'Charlie', 300), ... (4, 'Denis', 400), ... (5, 'Edith', -500)], ... fields=['id', 'name', 'balance']) >>> t[t.balance < 0] id name balance 0 2 Bob -200 1 5 Edith -500
  9. Blaze Expressions » Blaze expressions describe computational workflows symbolically » The data object enqueues operations until compute() is invoked » This queue (or rather, a control-flow graph) is maintained as a dictionary in memory
  10. >>> from blaze import * >>> accounts = Symbol( ...

    'accounts', ... 'var * {id: int, name: string, balance: int}') >>> deadbeats = accounts[accounts.balance < 0].name >>> deadbeats accounts[accounts.balance < 0].name >>> deadbeats.dshape dshape("var * string") >>> data = [[1, 'Alice', 100], ... [2, 'Bob', -200], ... [3, 'Charlie', 300]] >>> namespace = {accounts: data} >>> from blaze import compute >>> list(compute(deadbeats, namespace)) ['Bob']
  11. >>> to_tree(deadbeats) { 'args':[ { 'args':[ {'args':[ 'accounts', 'var * {id: int32, name: string, balance: int32}', None ], 'op':'Symbol'}, { 'args':[ { 'args':[ {'args':[ 'accounts', 'var * {id: int32, name: string, balance: int32}', None ], 'op':'Symbol'}, 'balance' ], 'op':'Field' }, 0 ], 'op':'Lt' } ], 'op':'Selection' }, 'name' ], 'op':'Field' }
  12. » Blaze is general » it is sort-of a "translation layer" from a familiar, dataframe-like API to many "backends"
  13. Spartan Spartan is a library for distributed array programming. Programmers

    build up array expressions (using Numpy-like operations). These expressions are then compiled and optimized and run on a distributed array backend across multiple machines. — github.com/spartan-array/spartan
  14. >>> x = spartan.ones((10, 10)) >>> x MapExpr { local_dag

    = None, fn_kw = DictExpr { vals = {} }, children = ListExpr { vals = [ [0] = NdArrayExpr { combine_fn = None, dtype = <type 'float'>, _shape = (10, 10), tile_hint = None, reduce_fn = None } ] }, map_fn = <function <lambda> at 0x3dbae60> }
  15. Dask A project that sprouted from Blaze — “Distributed Task” dask.array “Multi-core / on-disk NumPy arrays” dask.dataframe “Multi-core / on-disk Pandas dataframes” » Lazy execution » High-level representation of a distributed task (dask) » Standard Python APIs » Dictionary representation of the distributed task » The task is already partitioned and organized so that a scheduler may dispatch the execution to many workers
  16. dsk = {'load-1': (load, 'myfile.a.data'), 'load-2': (load, 'myfile.b.data'), 'load-3': (load,

    'myfile.c.data'), 'clean-1': (clean, 'load-1'), 'clean-2': (clean, 'load-2'), 'clean-3': (clean, 'load-3'), 'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]), 'store': (store, 'analyze')}
  17. dask.array » Dask Array implements a subset of the NumPy ndarray interface using blocked algorithms » the array is cut up into many small NumPy arrays » blocked algorithms are coordinated using dask graphs » drop-in replacement for a subset of the NumPy library
  18. >>> arr = da.random.random((100,100), chunks=(50,50)) + 1 >>> arr.dask {('wrapped_1',

    0, 0): (<random_sample of mtrand.RandomState object>, (50, 50)), ('wrapped_1', 0, 1): (<random_sample of mtrand.RandomState object>, (50, 50)), ('wrapped_1', 1, 0): (<random_sample of mtrand.RandomState object>, (50, 50)), ('wrapped_1', 1, 1): (<random_sample of mtrand.RandomState object>, (50, 50)), ('x_1', 0, 0): (<function f at 0x10e6e0410>, ('wrapped_1', 0, 0)), ('x_1', 0, 1): (<function f at 0x10e6e0410>, ('wrapped_1', 0, 1)), ('x_1', 1, 0): (<function f at 0x10e6e0410>, ('wrapped_1', 1, 0)), ('x_1', 1, 1): (<function f at 0x10e6e0410>, ('wrapped_1', 1, 1))}
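    Nothing above has actually been computed: the expression only builds the task graph shown in the dictionary. A minimal sketch of triggering the work (assuming dask.array is installed; the data is random, so only the rough magnitude of the result is indicated):
    >>> import dask.array as da
    >>> arr = da.random.random((100, 100), chunks=(50, 50)) + 1   # only builds the task graph
    >>> arr.mean().compute()   # the scheduler runs the four 50x50 chunks and reduces them; result is a plain float close to 1.5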
  19. dask.dataframe lazy, distributed dataframe >>> import dask.dataframe as dd >>>

    df = dd.read_csv('2014-*.csv.gz', compression='gzip') >>> df2 = df[df.y == 'a'].x + 1 >>> df2.compute() 0 2 3 5 Name: x, dtype: int64
  20. >>> from operator import add >>> dsk = {'x': 1,

    'y': (add, 'x', 2)} >>> c.get(dsk, 'y') # causes distributed work
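    Here c is presumably a client for a distributed scheduler. The very same dictionary-encoded graph can also be evaluated in-process with dask's local synchronous scheduler, which is handy for debugging (a sketch, assuming a plain dask installation):
    >>> from operator import add
    >>> from dask import get                  # local, single-threaded scheduler
    >>> dsk = {'x': 1, 'y': (add, 'x', 2)}
    >>> get(dsk, 'y')                         # walks the graph in-process, no workers involved
    3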
  21. Known Limitations » Not fault tolerant » Almost no data locality handling (except for linear chains) » Dynamic scheduler » hand-tuned MPI will probably always be faster -- source: dask.pydata.org/en/latest/distributed.html
  22. DistArray » Similar to the others in that it provides

    a NumPy-like API in a distributed context » Similar to Dask: » DistArrays are chunked » Operations are executed remotely by workers
  23. However 1. DistArrays are not lazy 2. Each worker works

    on its own chunk using native NumPy 3. Uses MPI for data transfer and coordination 4. Documented "Distributed Array Protocol" that describes messages (tries to minimize data-transfers) It basically does RPC between a “driver” client and the workers.
  24. nparr = numpy.random.random((4, 5)) from distarray.globalapi import Context context =

    Context() darr = context.fromarray(nparr) (darr + darr).sum() darr[:, 2::2] etc...
  25. >>> import ibis >>> int_expr = ibis.literal(5) >>> (int_expr +

    10).sqrt() Sqrt[double] Add[int8] Literal[int8] 5 Literal[int8] 10
  26. >>> con = ibis.impala.connect('bottou01.sjc.cloudera.com') >>> db = con.database('ibis_testing') >>> (t

    # DatabaseTable -> TableExpr .limit(20) # Limit -> TableExpr .tinyint_col # TableColumn -> Int8Array .add(10) # Add -> Int16Array .sqrt() # Sqrt -> DoubleArray .round(2) # Round -> DoubleArray .sum()) # Sum -> DoubleScalar
  27. ref_0 DatabaseTable[table] name: ibis_testing.`functional_alltypes` schema: id : int32 bool_col :

    boolean tinyint_col : int8 ... sum = Sum[double] Round[array(double)] Sqrt[array(double)] Add[array(int16)] tinyint_col = Column[array(int8)] 'tinyint_col' from table Limit[table] Table: ref_0 n: 20 offset: 0 Literal[int8] 10 Literal[int8] 2 None
  28. An Interesting Trend From a language design perspective » distributed computation must be lazified » Lazification happens through a meta-representation of the computation » Delayed execution through re-interpretation » The meta-representation is a tree or list encoding the AST or control flow of the program (sketched below)
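    The shared pattern (build a symbolic expression tree now, interpret it later) fits in a few lines of plain Python. The Expr class and compute function below are invented purely for illustration:
    >>> class Expr:                                  # hypothetical toy expression node
    ...     def __init__(self, op, *args):
    ...         self.op, self.args = op, args
    ...     def __add__(self, other):
    ...         return Expr('add', self, other)      # builds the tree, computes nothing
    ...
    >>> def compute(expr):                           # delayed execution = re-interpreting the tree
    ...     if not isinstance(expr, Expr):
    ...         return expr
    ...     if expr.op == 'lit':
    ...         return expr.args[0]
    ...     if expr.op == 'add':
    ...         return compute(expr.args[0]) + compute(expr.args[1])
    ...
    >>> tree = Expr('lit', 1) + Expr('lit', 2)       # just a data structure in memory
    >>> compute(tree)                                # the work happens only here
    3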
  29. » Now, I'm not saying that all these people are

    basically reimplementing LISP on top of Python. » That would be Hy github.com/hylang/hy => (print "Hy!") Hy!
  30. «Any sufficiently complicated C or Fortran program contains an ad

    hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.» — Greenspun's tenth rule
  31. ➜ ~ julia --lisp (the ASCII-art banner is printed) > (+ 1 2) ; lol i can s-exprs 3 Really, try it.
  32. CSV » PROS » human-readable » easy to produce »

    widespread » CONS » non-compressed » non-columnar » actually hard to get right "this is a valid",""csv"",line
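    Getting those quoting rules right is what the standard-library csv module is for; a minimal sketch parsing a properly escaped variant of the line above (embedded quotes doubled, per RFC 4180):
    >>> import csv, io
    >>> line = '"this is a valid","""csv""",line\r\n'
    >>> next(csv.reader(io.StringIO(line)))
    ['this is a valid', '"csv"', 'line']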
  33. NPY numpy.load(file, mmap_mode=None) Load arrays or pickled objects from .npy,

    .npz or pickled files. » Binary format » Essentially a memory snapshot » Magic number » Header (ver, len) » Dict literal (metadata) » If the dtype contains Python objects, then the data is a Python pickle of the array. » Otherwise the data is the contiguous bytes of the array. » CONS: highly Python-specific
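    A minimal round-trip with NumPy itself, including memory-mapped loading so the data is not read eagerly (file name is illustrative):
    >>> import numpy as np
    >>> np.save('arr.npy', np.random.randn(1000, 1000))   # magic number + header + raw array bytes
    >>> mm = np.load('arr.npy', mmap_mode='r')             # memory-mapped, not loaded into RAM
    >>> mm.shape, mm.dtype
    ((1000, 1000), dtype('float64'))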
  34. HDF5 » Hierarchical Data Format v5 » It can be better described as a “file-system in a file” » Groups: contain instances of zero or more groups or datasets, together with supporting metadata (basically, a directory) » Datasets: multidimensional arrays of data elements, together with supporting metadata
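    With h5py (one common Python binding, used here only as an example) the file-system analogy maps directly onto the API; a minimal sketch with invented names:
    >>> import numpy as np, h5py
    >>> f = h5py.File('example.h5', 'w')
    >>> grp = f.create_group('measurements')                      # a group, i.e. a "directory"
    >>> dset = grp.create_dataset('temps', data=np.arange(10.0))  # a dataset: array + metadata
    >>> dset.attrs['units'] = 'Celsius'                           # supporting metadata as attributes
    >>> f['measurements/temps'].shape
    (10,)
    >>> f.close()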
  35. Data types » Integer datatypes: 8-bit, 16-bit, 32-bit, and 64-bit

    » Floating-point numbers: IEEE 32-bit and 64-bit » References » Strings Also compound data types
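    Compound data types correspond to NumPy structured dtypes; a short sketch (again via h5py, with invented field names):
    >>> import numpy as np, h5py
    >>> dt = np.dtype([('id', 'i4'), ('value', 'f8'), ('name', 'S10')])   # int32 + float64 + fixed-width bytes
    >>> rows = np.array([(1, 0.5, b'alpha'), (2, 1.5, b'beta')], dtype=dt)
    >>> with h5py.File('compound.h5', 'w') as f:
    ...     ds = f.create_dataset('records', data=rows)                   # one compound dataset
    ...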
  36. Problem: not distributable At least, not in the HDFS-way hdfgroup.org/pubs/papers/Big_HDF_FAQs.pdf

    » In short: HDF5 cannot be easily chunked in a meaningful way » e.g.: CSV can be split at line boundaries » HDF5 is a file system » how do you "split" a file system ? » what if files cannot be broken ? » manual additional work is required to make the most out of HDF5+HDFS
  37. Parallel HDF5 » MPI parallel I/O; HDF5 Parallel I/O, e.g.: » Create, open and close a file » Create, open, and close a dataset » Extend a dataset (increase dimension sizes) » R/W from/to a dataset (data transfer either collective or independent) » Once a file is opened by the processes of a communicator: » All parts of the file are accessible by all processes. » All objects in the file are accessible by all processes. » Multiple processes write to the same dataset. » Each process writes to an individual dataset.
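    With h5py built against a parallel (MPI-enabled) HDF5, plus mpi4py, the collective-create / independent-write pattern looks roughly like this (a sketch; file name invented, to be launched under mpiexec):
    # run with: mpiexec -n 4 python parallel_write.py
    from mpi4py import MPI
    import h5py

    comm = MPI.COMM_WORLD
    f = h5py.File('parallel.h5', 'w', driver='mpio', comm=comm)
    dset = f.create_dataset('x', (comm.size,), dtype='i')   # created collectively by all processes
    dset[comm.rank] = comm.rank                             # each process writes its own element
    f.close()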
  38. NetCDF » Network Common Data Form » set of software

    libraries and data formats for r/w array-oriented scientific data (unidata.ucar.edu/software/netcdf/) » Parallel dev to HDF » Today, the latest NetCDF spec (v4) is basically a restricted subset of HDF5
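    A small example with the netCDF4 Python package (names are illustrative); the NETCDF4 format variant is the HDF5-based one mentioned above:
    >>> import numpy as np
    >>> from netCDF4 import Dataset
    >>> ds = Dataset('sample.nc', 'w', format='NETCDF4')
    >>> dim = ds.createDimension('time', 10)
    >>> temps = ds.createVariable('temperature', 'f8', ('time',))
    >>> temps.units = 'Celsius'                      # attributes travel with the variable
    >>> temps[:] = np.linspace(15.0, 25.0, 10)
    >>> ds.close()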
  39. BColz columnar, chunked data containers that can be compressed either in-memory or on-disk » uses Blosc for compression » not distributed # dask.dataframe >>> df = dd.from_bcolz('myfile.bcolz', chunksize=1000000)
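    A minimal bcolz sketch (path and column names are illustrative): each column is stored as a chunked, Blosc-compressed carray, and the resulting on-disk directory is what dd.from_bcolz above consumes:
    >>> import numpy as np, bcolz
    >>> ct = bcolz.ctable(columns=[np.arange(1000000), np.random.rand(1000000)],
    ...                   names=['x', 'y'], rootdir='myfile.bcolz', mode='w')
    >>> ct.flush()                   # persist the compressed chunks to disk
    >>> ct['x'][42]                  # decompresses only the chunk that holds item 42
    42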
  40. Castra Special-purpose for DataFrames » on-disk, partitioned, compressed, column store.

    » uses Blosc # dask dataframe >>> from castra import Castra >>> c = Castra(path='/my/castra/file') >>> df = c.to_dask()
  41. Bonus: Tea Files » a binary flat-file format for storing time series » multiple language support (Python included) » designed specifically for time-series data
  42. >>> tf = TeaFile.create("acme.tea", "Time Price Volume", "qdq", "ACME at NYSE", {"decimals": 2, "url": "www.acme.com"}) >>> tf.write(DateTime(2011, 3, 4, 9, 0), 45.11, 4500) >>> tf.write(DateTime(2011, 3, 4, 10, 0), 46.33, 1100) >>> tf.close() >>> tf = TeaFile.openread("acme.tea") >>> tf.read() TPV(Time=2011-03-04 09:00:00:000, Price=45.11, Volume=4500) >>> tf.read() TPV(Time=2011-03-04 10:00:00:000, Price=46.33, Volume=1100) >>> tf.read() >>> tf.close()
  43. (General) Cons » Most of these data-formats are not actually

    meant to be distributed » Rather, they are meant for out-of-core, but single-machine, multi-core data processing
  44. Conclusions » Most Python-related solutions are still under heavy development » dask seems to be the most promising and complete » Striking similarities in the way they deal with the problem » There are many data formats available, but none seems to be a definitive choice » Each comes with its pros and cons