
Python for Data Science

A bestiary of the Python tools for doing data science in a distributed environment

Edoardo Vacchi

October 06, 2015

Transcript

  1. IPython » Evolution of the Interactive REPL » Notebook »

    Born with Wolfram Mathematica [citation needed]
  2. NumPy » Rather thin wrapper around high-performance C/Fortran » De-facto standard for numerical analysis in Python >>> import numpy as np >>> arr = np.random.randn(100,100) >>> arr.shape ...
  3. Pandas Dataframes >>> DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5,

    6])]) A B 0 1 4 1 2 5 2 3 6 >>> DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])], orient='index', columns=['one', 'two', 'three']) one two three A 1 2 3 B 4 5 6
  4. >>> iris = read_csv('iris.data') >>> iris.head() SepalLength SepalWidth PetalLength PetalWidth Name 0 5.1 3.5 1.4 0.2 Iris-setosa 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa 3 4.6 3.1 1.5 0.2 Iris-setosa 4 5.0 3.6 1.4 0.2 Iris-setosa array-like indexing >>> iris[iris.SepalLength < 5.0].head()
  5. Integration with matplotlib iris.query('SepalLength > 5').assign( SepalRatio = lambda x:

    x.SepalWidth / x.SepalLength, PetalRatio = lambda x: x.PetalWidth / x.PetalLength ).plot(kind='scatter', x='SepalRatio', y='PetalRatio')
  6. Out-of-Core Data Processing » Blaze » Spartan » Dask »

    DistArray » Ibis (also, Global Arrays, Numba, IPython Parallel, DPark...)
  7. Blaze » Lazy execution » Represents a data source »

    Abstracts from the many backends » Exposes a dataframe-like API » Backends implement the API
  8. >>> t = Data([(1, 'Alice', 100), ... (2, 'Bob', -200),

    ... (3, 'Charlie', 300), ... (4, 'Denis', 400), ... (5, 'Edith', -500)], ... fields=['id', 'name', 'balance']) >>> t[t.balance < 0] id name balance 0 2 Bob -200 1 5 Edith -500
  9. Blaze Expressions » Blaze expressions describe computational workflows symbolically » The data object enqueues operations until compute() is invoked » This queue (or rather, a control-flow graph) is maintained as a dictionary in memory
  10. >>> from blaze import * >>> accounts = Symbol( ...

    'accounts', ... 'var * {id: int, name: string, balance: int}') >>> deadbeats = accounts[accounts.balance < 0].name >>> deadbeats accounts[accounts.balance < 0].name >>> deadbeats.dshape dshape("var * string") >>> data = [[1, 'Alice', 100], ... [2, 'Bob', -200], ... [3, 'Charlie', 300]] >>> namespace = {accounts: data} >>> from blaze import compute >>> list(compute(deadbeats, namespace)) ['Bob']
  11. >>> to_tree(deadbeats) { 'args':[ { 'args':[ {'args':[ 'accounts', 'var * {id: int32, name: string, balance: int32}', None ], 'op':'Symbol'}, { 'args':[ { 'args':[ {'args':[ 'accounts', 'var * {id: int32, name: string, balance: int32}', None ], 'op':'Symbol'}, 'balance' ], 'op':'Field' }, 0 ], 'op':'Lt' } ], 'op':'Selection' }, 'name' ], 'op':'Field' }
  12. » Blaze is general » it is sort-of a "translation layer" from a familiar, dataframe-like API to many "backends"
  13. Spartan Spartan is a library for distributed array programming. Programmers

    build up array expressions (using Numpy-like operations). These expressions are then compiled and optimized and run on a distributed array backend across multiple machines. — github.com/spartan-array/spartan
  14. >>> x = spartan.ones((10, 10)) >>> x MapExpr { local_dag

    = None, fn_kw = DictExpr { vals = {} }, children = ListExpr { vals = [ [0] = NdArrayExpr { combine_fn = None, dtype = <type 'float'>, _shape = (10, 10), tile_hint = None, reduce_fn = None } ] }, map_fn = <function <lambda> at 0x3dbae60> }
  15. Dask A project that sprouted from Blaze — “Distributed Task” dask.array “Multi-core / on-disk NumPy arrays” dask.dataframe “Multi-core / on-disk Pandas dataframes” » Lazy execution » High-level representation of a distributed task (dask) » Standard Python APIs » Dictionary representation of the distributed task » The task is already partitioned and organized so that a scheduler may dispatch the execution to many workers
  16. dsk = {'load-1': (load, 'myfile.a.data'), 'load-2': (load, 'myfile.b.data'), 'load-3': (load,

    'myfile.c.data'), 'clean-1': (clean, 'load-1'), 'clean-2': (clean, 'load-2'), 'clean-3': (clean, 'load-3'), 'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]), 'store': (store, 'analyze')}
  17. dask.array » Dask Array implements a subset of the NumPy ndarray interface using blocked algorithms » the array is cut up into many small NumPy arrays » blocked algorithms are coordinated using dask graphs » drop-in replacement for a subset of the NumPy library
  18. >>> arr = da.random.random((100,100), chunks=(50,50)) + 1 >>> arr.dask {('wrapped_1',

    0, 0): (<random_sample of mtrand.RandomState object>, (50, 50)), ('wrapped_1', 0, 1): (<random_sample of mtrand.RandomState object>, (50, 50)), ('wrapped_1', 1, 0): (<random_sample of mtrand.RandomState object>, (50, 50)), ('wrapped_1', 1, 1): (<random_sample of mtrand.RandomState object>, (50, 50)), ('x_1', 0, 0): (<function f at 0x10e6e0410>, ('wrapped_1', 0, 0)), ('x_1', 0, 1): (<function f at 0x10e6e0410>, ('wrapped_1', 0, 1)), ('x_1', 1, 0): (<function f at 0x10e6e0410>, ('wrapped_1', 1, 0)), ('x_1', 1, 1): (<function f at 0x10e6e0410>, ('wrapped_1', 1, 1))}
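    Nothing above has actually been computed: the expression only builds the task graph shown in the dictionary. A minimal sketch of triggering the work (assuming dask.array is installed; the data is random, so only the rough magnitude of the result is indicated):
    >>> import dask.array as da
    >>> arr = da.random.random((100, 100), chunks=(50, 50)) + 1   # only builds the task graph
    >>> arr.mean().compute()   # the scheduler runs the four 50x50 chunks and reduces them; result is a plain float close to 1.5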
  19. dask.dataframe lazy, distributed dataframe >>> import dask.dataframe as dd >>>

    df = dd.read_csv('2014-*.csv.gz', compression='gzip') >>> df2 = df[df.y == 'a'].x + 1 >>> df2.compute() 0 2 3 5 Name: x, dtype: int64
  20. >>> from operator import add >>> dsk = {'x': 1,

    'y': (add, 'x', 2)} >>> c.get(dsk, 'y') # causes distributed work
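    Here c is presumably a client for a distributed scheduler. The very same dictionary-encoded graph can also be evaluated in-process with dask's local synchronous scheduler, which is handy for debugging (a sketch, assuming a plain dask installation):
    >>> from operator import add
    >>> from dask import get                  # local, single-threaded scheduler
    >>> dsk = {'x': 1, 'y': (add, 'x', 2)}
    >>> get(dsk, 'y')                         # walks the graph in-process, no workers involved
    3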
  21. Known Limitations » Not fault tolerant » Almost no data locality handling (except for linear chains) » Dynamic scheduler » hand-tuned MPI will probably always be faster -- source: dask.pydata.org/en/latest/distributed.html
  22. DistArray » Similar to the others in that it provides

    a NumPy-like API in a distributed context » Similar to Dask: » DistArrays are chunked » Operations are executed remotely by workers
  23. However 1. DistArrays are not lazy 2. Each worker works

    on its own chunk using native NumPy 3. Uses MPI for data transfer and coordination 4. Documented "Distributed Array Protocol" that describes messages (tries to minimize data-transfers) It basically does RPC between a “driver” client and the workers.
  24. nparr = numpy.random.random((4, 5)) from distarray.globalapi import Context context =

    Context() darr = context.fromarray(nparr) (darr + darr).sum() darr[:, 2::2] etc...
  25. >>> import ibis >>> int_expr = ibis.literal(5) >>> (int_expr +

    10).sqrt() Sqrt[double] Add[int8] Literal[int8] 5 Literal[int8] 10
  26. >>> con = ibis.impala.connect('bottou01.sjc.cloudera.com') >>> db = con.database('ibis_testing') >>> (t

    # DatabaseTable -> TableExpr .limit(20) # Limit -> TableExpr .tinyint_col # TableColumn -> Int8Array .add(10) # Add -> Int16Array .sqrt() # Sqrt -> DoubleArray .round(2) # Round -> DoubleArray .sum()) # Sum -> DoubleScalar
  27. ref_0 DatabaseTable[table] name: ibis_testing.`functional_alltypes` schema: id : int32 bool_col :

    boolean tinyint_col : int8 ... sum = Sum[double] Round[array(double)] Sqrt[array(double)] Add[array(int16)] tinyint_col = Column[array(int8)] 'tinyint_col' from table Limit[table] Table: ref_0 n: 20 offset: 0 Literal[int8] 10 Literal[int8] 2 None
  28. An Interesting Trend From a language design perspective » distributed computation must be lazified » Lazification happens through a meta-representation of the computation » Delayed execution through re-interpretation » The meta-representation is a tree or list encoding the AST or control flow of the program (sketched below)
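    The shared pattern (build a symbolic expression tree now, interpret it later) fits in a few lines of plain Python. The Expr class and compute function below are invented purely for illustration:
    >>> class Expr:                                  # hypothetical toy expression node
    ...     def __init__(self, op, *args):
    ...         self.op, self.args = op, args
    ...     def __add__(self, other):
    ...         return Expr('add', self, other)      # builds the tree, computes nothing
    ...
    >>> def compute(expr):                           # delayed execution = re-interpreting the tree
    ...     if not isinstance(expr, Expr):
    ...         return expr
    ...     if expr.op == 'lit':
    ...         return expr.args[0]
    ...     if expr.op == 'add':
    ...         return compute(expr.args[0]) + compute(expr.args[1])
    ...
    >>> tree = Expr('lit', 1) + Expr('lit', 2)       # just a data structure in memory
    >>> compute(tree)                                # the work happens only here
    3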
  29. » Now, I'm not saying that all these people are

    basically reimplementing LISP on top of Python. » That would be Hy github.com/hylang/hy => (print "Hy!") Hy!
  30. «Any sufficiently complicated C or Fortran program contains an ad

    hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.» — Greenspun's tenth rule
  31. ➜ ~ julia --lisp (the ASCII-art banner is printed) > (+ 1 2) ; lol i can s-exprs 3 Really, try it.
  32. CSV » PROS » human-readable » easy to produce »

    widespread » CONS » non-compressed » non-columnar » actually hard to get right "this is a valid",""csv"",line
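    Getting those quoting rules right is what the standard-library csv module is for; a minimal sketch parsing a properly escaped variant of the line above (embedded quotes doubled, per RFC 4180):
    >>> import csv, io
    >>> line = '"this is a valid","""csv""",line\r\n'
    >>> next(csv.reader(io.StringIO(line)))
    ['this is a valid', '"csv"', 'line']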
  33. NPY numpy.load(file, mmap_mode=None) Load arrays or pickled objects from .npy,

    .npz or pickled files. » Binary format » Essentially a memory snapshot » Magic number » Header (ver, len) » Dict literal (metadata) » If the dtype contains Python objects, then the data is a Python pickle of the array. » Otherwise the data is the contiguous bytes of the array. » CONS: highly Python-specific
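    A minimal round-trip with NumPy itself, including memory-mapped loading so the data is not read eagerly (file name is illustrative):
    >>> import numpy as np
    >>> np.save('arr.npy', np.random.randn(1000, 1000))   # magic number + header + raw array bytes
    >>> mm = np.load('arr.npy', mmap_mode='r')             # memory-mapped, not loaded into RAM
    >>> mm.shape, mm.dtype
    ((1000, 1000), dtype('float64'))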
  34. HDF5 » Hierarchical Data Format v5 » It can be better described as a “file-system in a file” » Groups: contain instances of zero or more groups or datasets, together with supporting metadata (basically, a directory) » Datasets: multidimensional arrays of data elements, together with supporting metadata
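    With h5py (one common Python binding, used here only as an example) the file-system analogy maps directly onto the API; a minimal sketch with invented names:
    >>> import numpy as np, h5py
    >>> f = h5py.File('example.h5', 'w')
    >>> grp = f.create_group('measurements')                      # a group, i.e. a "directory"
    >>> dset = grp.create_dataset('temps', data=np.arange(10.0))  # a dataset: array + metadata
    >>> dset.attrs['units'] = 'Celsius'                           # supporting metadata as attributes
    >>> f['measurements/temps'].shape
    (10,)
    >>> f.close()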
  35. Data types » Integer datatypes: 8-bit, 16-bit, 32-bit, and 64-bit

    » Floating-point numbers: IEEE 32-bit and 64-bit » References » Strings Also compound data types
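    Compound data types correspond to NumPy structured dtypes; a short sketch (again via h5py, with invented field names):
    >>> import numpy as np, h5py
    >>> dt = np.dtype([('id', 'i4'), ('value', 'f8'), ('name', 'S10')])   # int32 + float64 + fixed-width bytes
    >>> rows = np.array([(1, 0.5, b'alpha'), (2, 1.5, b'beta')], dtype=dt)
    >>> with h5py.File('compound.h5', 'w') as f:
    ...     ds = f.create_dataset('records', data=rows)                   # one compound dataset
    ...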
  36. Problem: not distributable At least, not in the HDFS-way hdfgroup.org/pubs/papers/Big_HDF_FAQs.pdf

    » In short: HDF5 cannot be easily chunked in a meaningful way » e.g.: CSV can be split at line boundaries » HDF5 is a file system » how do you "split" a file system ? » what if files cannot be broken ? » manual additional work is required to make the most out of HDF5+HDFS
  37. Parallel HDF5 » MPI parallel I/O; HDF5 Parallel I/O, e.g.: » Create, open and close a file » Create, open, and close a dataset » Extend a dataset (increase dimension sizes) » R/W from/to a dataset (data transfer either collective or independent) » Once a file is opened by the processes of a communicator: » All parts of the file are accessible by all processes. » All objects in the file are accessible by all processes. » Multiple processes write to the same dataset. » Each process writes to an individual dataset.
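    With h5py built against a parallel (MPI-enabled) HDF5, plus mpi4py, the collective-create / independent-write pattern looks roughly like this (a sketch; file name invented, to be launched under mpiexec):
    # run with: mpiexec -n 4 python parallel_write.py
    from mpi4py import MPI
    import h5py

    comm = MPI.COMM_WORLD
    f = h5py.File('parallel.h5', 'w', driver='mpio', comm=comm)
    dset = f.create_dataset('x', (comm.size,), dtype='i')   # created collectively by all processes
    dset[comm.rank] = comm.rank                             # each process writes its own element
    f.close()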
  38. NetCDF » Network Common Data Form » set of software

    libraries and data formats for r/w array-oriented scientific data (unidata.ucar.edu/software/netcdf/) » Parallel dev to HDF » Today, the latest NetCDF spec (v4) is basically a restricted subset of HDF5
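    A small example with the netCDF4 Python package (names are illustrative); the NETCDF4 format variant is the HDF5-based one mentioned above:
    >>> import numpy as np
    >>> from netCDF4 import Dataset
    >>> ds = Dataset('sample.nc', 'w', format='NETCDF4')
    >>> dim = ds.createDimension('time', 10)
    >>> temps = ds.createVariable('temperature', 'f8', ('time',))
    >>> temps.units = 'Celsius'                      # attributes travel with the variable
    >>> temps[:] = np.linspace(15.0, 25.0, 10)
    >>> ds.close()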
  39. BColz columnar, chunked data containers that can be compressed either in-memory or on-disk » uses Blosc for compression » not distributed # dask.dataframe >>> df = dd.from_bcolz('myfile.bcolz', chunksize=1000000)
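    A minimal bcolz sketch (path and column names are illustrative): each column is stored as a chunked, Blosc-compressed carray, and the resulting on-disk directory is what dd.from_bcolz above consumes:
    >>> import numpy as np, bcolz
    >>> ct = bcolz.ctable(columns=[np.arange(1000000), np.random.rand(1000000)],
    ...                   names=['x', 'y'], rootdir='myfile.bcolz', mode='w')
    >>> ct.flush()                   # persist the compressed chunks to disk
    >>> ct['x'][42]                  # decompresses only the chunk that holds item 42
    42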
  40. Castra Special-purpose for DataFrames » on-disk, partitioned, compressed, column store.

    » uses Blosc # dask dataframe >>> from castra import Castra >>> c = Castra(path='/my/castra/file') >>> df = c.to_dask()
  41. Bonus: Tea Files » a binary flat-file format for storing time series » multiple language support (Python included) » designed specifically for time-series data
  42. >>> tf = TeaFile.create("acme.tea", "Time Price Volume", "qdq", "ACME at NYSE", {"decimals": 2, "url": "www.acme.com"}) >>> tf.write(DateTime(2011, 3, 4, 9, 0), 45.11, 4500) >>> tf.write(DateTime(2011, 3, 4, 10, 0), 46.33, 1100) >>> tf.close() >>> tf = TeaFile.openread("acme.tea") >>> tf.read() TPV(Time=2011-03-04 09:00:00:000, Price=45.11, Volume=4500) >>> tf.read() TPV(Time=2011-03-04 10:00:00:000, Price=46.33, Volume=1100) >>> tf.read() >>> tf.close()
  43. (General) Cons » Most of these data-formats are not actually

    meant to be distributed » Rather, they are meant for out-of-core, but single-machine, multi-core data processing
  44. Conclusions » Most Python-related solutions are still under heavy development » dask seems to be the most promising and complete » Striking similarities in the way they deal with the problem » There are many data formats available, but none seems to be a definitive choice » Each comes with its pros and cons