Slide 1

Python for Data Science ❧ A Bestiary ❧

Slide 2

Motivation

Slide 3

Python is the go-to tool for the data scientist.

Slide 4

...but there is no universally recognized tool for distributed computing.

Slide 5

(end of the motivation)

Slide 6

Part I — Numerical Computing

Slide 7

IPython
» Evolution of the interactive REPL
» Notebook
  » Born with Wolfram Mathematica [citation needed]

Slide 8

NumPy
» Rather thin wrapper around high-performance C/Fortran
» De-facto standard for numerical analysis in Python

>>> import numpy as np
>>> arr = np.random.randn(100, 100)
>>> arr.shape
...

Slide 9

DataFrames
» Born in R
» Simple filtering and column access for data sets
» Pandas

Slide 10

No, not that kind of pandas

Slide 11

Pandas DataFrames

>>> DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])
   A  B
0  1  4
1  2  5
2  3  6

>>> DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])],
...                      orient='index', columns=['one', 'two', 'three'])
   one  two  three
A    1    2      3
B    4    5      6

Slide 12

>>> iris = read_csv('iris.data')
>>> iris.head()
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

Array-like indexing:

>>> iris[iris.SepalLength < 5.0].head()

Slide 13

Integration with matplotlib

iris.query('SepalLength > 5').assign(
    SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
    PetalRatio = lambda x: x.PetalWidth / x.PetalLength
).plot(kind='scatter', x='SepalRatio', y='PetalRatio')

Slide 14

Constraint: your data must fit in memory
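
To make the constraint concrete, a quick back-of-the-envelope sketch (the shape is an arbitrary example): a dense float64 array with a billion elements needs about 8 GB of RAM before any operation even starts, and many operations copy.

>>> import numpy as np
>>> shape = (50_000, 20_000)                             # 10**9 elements
>>> np.prod(shape) * np.dtype('float64').itemsize / 1e9  # 8 bytes each
8.0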

Slide 15

Out-of-Core Data Processing
» Blaze
» Spartan
» Dask
» DistArray
» Ibis
(also: Global Arrays, Numba, IPython Parallel, DPark...)

Slide 16

Blaze
» Lazy execution
» Represents a data source
» Abstracts from the many backends
» Exposes a dataframe-like API
» Backends implement the API

Slide 17

>>> t = Data([(1, 'Alice', 100),
...           (2, 'Bob', -200),
...           (3, 'Charlie', 300),
...           (4, 'Denis', 400),
...           (5, 'Edith', -500)],
...          fields=['id', 'name', 'balance'])
>>> t[t.balance < 0]
   id   name  balance
0   2    Bob     -200
1   5  Edith     -500

Slide 18

Blaze Expressions
» Blaze expressions describe computational workflows symbolically
» The data object enqueues operations until compute() is invoked
» This queue (or rather, a control-flow graph) is maintained as a dictionary in memory

Slide 19

>>> from blaze import *
>>> accounts = Symbol(
...     'accounts',
...     'var * {id: int, name: string, balance: int}')
>>> deadbeats = accounts[accounts.balance < 0].name
>>> deadbeats
accounts[accounts.balance < 0].name
>>> deadbeats.dshape
dshape("var * string")

>>> data = [[1, 'Alice', 100],
...         [2, 'Bob', -200],
...         [3, 'Charlie', 300]]
>>> namespace = {accounts: data}
>>> from blaze import compute
>>> list(compute(deadbeats, namespace))
['Bob']

Slide 20

>>> to_tree(deadbeats)
{'args': [{'args': [{'args': ['accounts',
                              'var * {id: int32, name: string, balance: int32}',
                              None],
                     'op': 'Symbol'},
                    {'args': [{'args': [{'args': ['accounts',
                                                  'var * {id: int32, name: string, balance: int32}',
                                                  None],
                                         'op': 'Symbol'},
                                        'balance'],
                               'op': 'Field'},
                              0],
                     'op': 'Lt'}],
          'op': 'Selection'},
         'name'],
 'op': 'Field'}

Slide 21

» Blaze is general
» It is sort of a "translation layer" from a familiar, dataframe-like API to many "backends"
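
To illustrate the translation-layer idea, here is a minimal sketch (assuming blaze and pandas are installed; the outputs are indicative): the same symbolic expression is computed against a plain Python list and against a pandas DataFrame.

>>> import pandas as pd
>>> from blaze import Symbol, compute
>>> accounts = Symbol('accounts',
...                   'var * {id: int, name: string, balance: int}')
>>> deadbeats = accounts[accounts.balance < 0].name
>>> data = [[1, 'Alice', 100], [2, 'Bob', -200]]
>>> list(compute(deadbeats, data))   # pure-Python backend
['Bob']
>>> df = pd.DataFrame(data, columns=['id', 'name', 'balance'])
>>> compute(deadbeats, df)           # pandas backend, same expression
1    Bob
Name: name, dtype: object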

Slide 22

Spartan

“Spartan is a library for distributed array programming. Programmers build up array expressions (using Numpy-like operations). These expressions are then compiled and optimized and run on a distributed array backend across multiple machines.”
— github.com/spartan-array/spartan

Slide 23

>>> x = spartan.ones((10, 10))
>>> x
MapExpr {
  local_dag = None,
  fn_kw = DictExpr { vals = {} },
  children = ListExpr {
    vals = [
      [0] = NdArrayExpr {
        combine_fn = None,
        dtype = <...>,
        _shape = (10, 10),
        tile_hint = None,
        reduce_fn = None
      }
    ]
  },
  map_fn = <function ... at 0x3dbae60>
}

Slide 24

Dask
A project that sprouted from Blaze — “Distributed Task”

dask.array — “Multi-core / on-disk NumPy arrays”
dask.dataframe — “Multi-core / on-disk Pandas dataframes”

» Lazy execution
» High-level representation of a distributed task (a dask)
» Standard Python APIs
» Dictionary representation of the distributed task
» The task is already partitioned and organized so that a scheduler may dispatch the execution to many workers

Slide 25

dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'load-3': (load, 'myfile.c.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'clean-3': (clean, 'load-3'),
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
       'store': (store, 'analyze')}
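
Such a graph can be executed directly with one of dask's schedulers. A minimal, self-contained sketch: the load/clean/analyze/store functions below are hypothetical stand-ins, while dask.threaded.get is the real entry point of the threaded scheduler.

from dask.threaded import get   # dask's threaded scheduler

def load(filename):  return [1, 2, 3]              # pretend to read a file
def clean(data):     return [x + 1 for x in data]  # pretend to clean it
def analyze(chunks): return sum(sum(c) for c in chunks)
def store(result):   return 'stored: %r' % result

dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'load-3': (load, 'myfile.c.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'clean-3': (clean, 'load-3'),
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
       'store': (store, 'analyze')}

print(get(dsk, 'store'))   # walks the graph in dependency order -> 'stored: 27'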

Slide 26

dask.array
» Implements a subset of the NumPy ndarray interface using blocked algorithms
» The array is cut up into many small NumPy arrays
» Blocked algorithms are coordinated using dask graphs
» Drop-in replacement for a subset of the NumPy library

Slide 27

>>> arr = da.random.random((100, 100), chunks=(50, 50)) + 1
>>> arr.dask
{('wrapped_1', 0, 0): (<function ...>, (50, 50)),
 ('wrapped_1', 0, 1): (<function ...>, (50, 50)),
 ('wrapped_1', 1, 0): (<function ...>, (50, 50)),
 ('wrapped_1', 1, 1): (<function ...>, (50, 50)),
 ('x_1', 0, 0): (<function ...>, ('wrapped_1', 0, 0)),
 ('x_1', 0, 1): (<function ...>, ('wrapped_1', 0, 1)),
 ('x_1', 1, 0): (<function ...>, ('wrapped_1', 1, 0)),
 ('x_1', 1, 1): (<function ...>, ('wrapped_1', 1, 1))}

Slide 28

>>> A = da.random.random((100, 100), chunks=(50, 50))
>>> B = da.random.random((100, 100), chunks=(50, 50))
>>> C = A.dot(B)
>>> C.dask
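
Nothing has been computed yet: C.dask merely holds the task dictionary for the blocked matrix multiply. A short sketch of forcing the result:

>>> result = C.compute()   # executes the blocked algorithm
>>> result.shape
(100, 100)
>>> type(result)
<class 'numpy.ndarray'>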

Slide 29

No content

Slide 30

dask.bag
» Stream-like API
» map, filter, etc.

Slide 31

b = db.from_sequence(
    [1, 2, 3, 4, 5, 6],
    npartitions=2).map(lambda x: x + 1)
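
As with dask.array, the bag is lazy; a quick sketch of materializing it:

>>> import dask.bag as db
>>> b = db.from_sequence([1, 2, 3, 4, 5, 6],
...                      npartitions=2).map(lambda x: x + 1)
>>> b.compute()
[2, 3, 4, 5, 6, 7]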

Slide 32

dask.dataframe
Lazy, distributed dataframe

>>> import dask.dataframe as dd
>>> df = dd.read_csv('2014-*.csv.gz', compression='gzip')
>>> df2 = df[df.y == 'a'].x + 1
>>> df2.compute()
0    2
3    5
Name: x, dtype: int64

Slide 33

dask.distributed
» Centralized scheduler
» Distributed workers
» (Potentially many) clients
Communication transport: ZMQ

Slide 34

>>> from operator import add
>>> dsk = {'x': 1, 'y': (add, 'x', 2)}
>>> c.get(dsk, 'y')   # causes distributed work
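
For context, c is a client connected to the scheduler. A sketch of the setup (the address is hypothetical, and the exact import path has varied across dask.distributed versions):

>>> from dask.distributed import Client
>>> c = Client('tcp://127.0.0.1:8786')   # connect to a running scheduler
>>> c.get(dsk, 'y')                      # ships the graph to the workers
3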

Slide 35

Known Limitations
» Not fault tolerant
» Almost no data-locality handling (except for linear chains)
» Dynamic scheduler
» Hand-tuned MPI will probably always be faster
Source: dask.pydata.org/en/latest/distributed.html

Slide 36

DistArray
» Similar to the others in that it provides a NumPy-like API in a distributed context
» Similar to Dask:
  » DistArrays are chunked
  » Operations are executed remotely by workers

Slide 37

However:
1. DistArrays are not lazy
2. Each worker works on its own chunk using native NumPy
3. Uses MPI for data transfer and coordination
4. Documented "Distributed Array Protocol" that describes messages (tries to minimize data transfers)

It basically does RPC between a “driver” client and the workers.

Slide 38

import numpy
from distarray.globalapi import Context

nparr = numpy.random.random((4, 5))
context = Context()
darr = context.fromarray(nparr)

(darr + darr).sum()
darr[:, 2::2]
etc...

Slide 39

and, of course, there is Ibis

Slide 40

No content

Slide 41

>>> import ibis
>>> int_expr = ibis.literal(5)
>>> (int_expr + 10).sqrt()
Sqrt[double]
  Add[int8]
    Literal[int8]
      5
    Literal[int8]
      10

Slide 42

>>> con = ibis.impala.connect('bottou01.sjc.cloudera.com')
>>> db = con.database('ibis_testing')
>>> (t                # DatabaseTable -> TableExpr
...   .limit(20)      # Limit       -> TableExpr
...   .tinyint_col    # TableColumn -> Int8Array
...   .add(10)        # Add         -> Int16Array
...   .sqrt()         # Sqrt        -> DoubleArray
...   .round(2)       # Round       -> DoubleArray
...   .sum())         # Sum         -> DoubleScalar

Slide 43

ref_0
DatabaseTable[table]
  name: ibis_testing.`functional_alltypes`
  schema:
    id : int32
    bool_col : boolean
    tinyint_col : int8
    ...

sum = Sum[double]
  Round[array(double)]
    Sqrt[array(double)]
      Add[array(int16)]
        tinyint_col = Column[array(int8)] 'tinyint_col' from table
          Limit[table]
            Table: ref_0
            n: 20
            offset: 0
        Literal[int8]
          10
      Literal[int8]
        2
  None

Slide 44

❧ Intermezzo ❧

Slide 45

An Interesting Trend
From a language design perspective:
» Distributed computation must be made lazy
» Laziness is achieved through a meta-representation of the computation
» Execution is delayed through re-interpretation
» The meta-representation is a tree or list encoding the AST or control flow of the program
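
The pattern fits in a few lines of plain Python. A toy sketch (all names are made up) that builds the meta-representation eagerly and evaluates it only on re-interpretation:

import operator

class Expr:
    # A node of the meta-representation: an operator plus child nodes.
    def __init__(self, op, *args):
        self.op, self.args = op, args
    def __add__(self, other):
        # building the tree; nothing is computed here
        return Expr(operator.add, self, other)

class Lit(Expr):
    # A leaf holding a concrete value.
    def __init__(self, value):
        self.value = value

def interpret(node):
    # Delayed execution: walk the tree and evaluate it.
    if isinstance(node, Lit):
        return node.value
    return node.op(*(interpret(arg) for arg in node.args))

tree = Lit(1) + Lit(2) + Lit(3)   # so far, just a data structure
print(interpret(tree))            # 6, computed only at interpretation time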

Slide 46

» Now, I'm not saying that all these people are basically reimplementing LISP on top of Python.
» That would be Hy — github.com/hylang/hy

=> (print "Hy!")
Hy!

Slide 47

Actually, yes, that's what I'm really saying

Slide 48

«Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.» — Greenspun's tenth rule

Slide 49

Conclusion

Slide 50

We should all be writing Clojure.

That is all.

Slide 51

Just Kidding.

Slide 52

Julia is perfectly fine.

Slide 53

➜ ~ julia --lisp
;  _
; |_ _ _ |_ _ | . _ _
; | (-||||_(_)|__|_)|_)
;-------------------|----------------------------------------------------------

> (+ 1 2) ; lol i can s-exprs
3

Really, try it.

Slide 54

Back to Python

Slide 55

Part II — Storage

Slide 56

Data-formats in Python land

Slide 57

CSV
» PROS
  » human-readable
  » easy to produce
  » widespread
» CONS
  » non-compressed
  » non-columnar
  » actually hard to get right:

"this is a valid",""csv"",line
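
"Hard to get right" is mostly about quoting. A small sketch with the stdlib csv module (the sample line is made up) shows why naive splitting is not enough:

import csv
import io

line = '"this is a valid","""csv""",line\n'

print(line.split(','))
# ['"this is a valid"', '"""csv"""', 'line\n']   <- naive split keeps quote noise
print(next(csv.reader(io.StringIO(line))))
# ['this is a valid', '"csv"', 'line']           <- doubled quotes are escapes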

Slide 58

(Python-Consumable) Alternatives for Array/Tabular Data

Slide 59

NPY

numpy.load(file, mmap_mode=None)
    Load arrays or pickled objects from .npy, .npz or pickled files.

» Binary format
» Essentially a memory snapshot:
  » magic number
  » header (version, length)
  » dict literal (metadata)
» If the dtype contains Python objects, the data is a Python pickle of the array
» Otherwise, the data is the contiguous bytes of the array
» CONS: highly Python-specific
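
The layout is easy to verify by hand. A short sketch (the file name is arbitrary) that writes an array and peeks at the container:

import numpy as np

np.save('arr.npy', np.arange(6).reshape(2, 3))

with open('arr.npy', 'rb') as f:
    print(f.read(6))   # b'\x93NUMPY', the magic number
    print(f.read(2))   # format version, e.g. b'\x01\x00'
    # next comes the header with a dict literal (dtype, shape, order),
    # then the contiguous bytes of the array itself

print(np.load('arr.npy', mmap_mode='r'))   # the snapshot can be memory-mapped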

Slide 60

HDF5
» Hierarchical Data Format, version 5
» Better described as a “file system in a file”
» Group: contains zero or more groups or datasets, together with supporting metadata (basically, a directory)
» Dataset: a multidimensional array of data elements, together with supporting metadata
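
A quick sketch of the group/dataset model through h5py (assuming h5py is installed; all names are arbitrary):

import h5py
import numpy as np

with h5py.File('example.h5', 'w') as f:
    grp = f.create_group('experiment1')                      # a "directory"
    dset = grp.create_dataset('readings',
                              data=np.random.rand(100, 3))   # an array inside it
    dset.attrs['units'] = 'volts'                            # supporting metadata

with h5py.File('example.h5', 'r') as f:
    print(f['experiment1/readings'][:5])                     # path-like access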

Slide 61

No content

Slide 62

Data types
» Integer: 8-bit, 16-bit, 32-bit, and 64-bit
» Floating point: IEEE 32-bit and 64-bit
» References
» Strings
Also compound data types.

Slide 63

Problem: not distributable
At least, not in the HDFS way
(hdfgroup.org/pubs/papers/Big_HDF_FAQs.pdf)

» In short: HDF5 cannot be easily chunked in a meaningful way
» e.g., a CSV file can be split at line boundaries
» HDF5 is a file system
  » how do you "split" a file system?
  » what if files cannot be broken?
» Manual additional work is required to make the most out of HDF5 + HDFS

Slide 64

Parallel HDF5
» MPI parallel I/O; HDF5 Parallel I/O supports, e.g.:
  » create, open and close a file
  » create, open, and close a dataset
  » extend a dataset (increase dimension sizes)
  » read/write from/to a dataset (data transfer, collective or independent)
» Once a file is opened by the processes of a communicator:
  » all parts of the file are accessible by all processes
  » all objects in the file are accessible by all processes
  » multiple processes may write to the same dataset
  » each process may write to an individual dataset
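
A minimal sketch of a collective parallel write (assumes h5py built with MPI support and mpi4py installed; run under something like mpiexec -n 4 python script.py):

from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD

# all processes of the communicator open the same file via the MPI-IO driver
with h5py.File('parallel.h5', 'w', driver='mpio', comm=comm) as f:
    dset = f.create_dataset('ranks', (comm.size,), dtype='i')
    dset[comm.rank] = comm.rank   # each process writes its own slot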

Slide 65

NetCDF
» Network Common Data Form
» A set of software libraries and data formats for reading/writing array-oriented scientific data (unidata.ucar.edu/software/netcdf/)
» Developed in parallel to HDF
» Today, the latest NetCDF spec (v4) is basically a restricted subset of HDF5

Slide 66

bcolz
Columnar, chunked data containers that can be compressed either in-memory or on-disk.
» uses Blosc for compression
» not distributed

# dask.dataframe
>>> df = dd.from_bcolz('myfile.bcolz', chunksize=1000000)

Slide 67

No content

Slide 68

Castra
Special-purpose for DataFrames
» on-disk, partitioned, compressed column store
» uses Blosc

# dask.dataframe
>>> from castra import Castra
>>> c = Castra(path='/my/castra/file')
>>> df = c.to_dask()

Slide 69

Bonus: TeaFiles
» A file format to store time series in binary flat files
» Multiple-language support (Python included)
» Specific to time series

Slide 70

>>> tf = TeaFile.create("acme.tea", "Time Price Volume", "qdq",
...                     "ACME at NYSE", {"decimals": 2, "url": "www.acme.com"})
>>> tf.write(DateTime(2011, 3, 4, 9, 0), 45.11, 4500)
>>> tf.write(DateTime(2011, 3, 4, 10, 0), 46.33, 1100)
>>> tf.close()

>>> tf = TeaFile.openread("acme.tea")
>>> tf.read()
TPV(Time=2011-03-04 09:00:00:000, Price=45.11, Volume=4500)
>>> tf.read()
TPV(Time=2011-03-04 10:00:00:000, Price=46.33, Volume=1100)
>>> tf.read()
>>> tf.close()

Slide 71

(General) Cons
» Most of these data formats are not actually meant to be distributed
» Rather, they are meant for out-of-core, single-machine, multi-core data processing

Slide 72

Conclusions

Slide 73

Conclusions
» Most Python-related solutions are under heavy development
» dask seems to be the most promising and complete
» There are striking similarities in the way they approach the problem
» Many data formats are available, but none seems to be a definitive choice
» Each comes with its own pros and cons

Slide 74

No content