
Using Anaconda to light-up Dark Data

Talk given as part of the BIDS seminar.

Travis E. Oliphant

September 18, 2015

Transcript

  1. © 2015 Continuum Analytics - Confidential & Proprietary. BIDS Data Science Seminar: Using Anaconda to light-up dark data. Travis E. Oliphant, PhD. September 18, 2015
  2. Science led to Python — Raja Muthupillai, Armando Manduca, Richard Ehman, Jim Greenleaf, 1997:
     ρ0 (2πf)² U_i(a, f) = [C_ijkl(a, f) U_k,l(a, f)]_,j ,   Ξ = ∇ × U
  3. Dark Data: CSV, hdf5, npz, logs, emails, and other files in your company outside a traditional store
  5. Bring the Database to the Data: Data Sources ⟷ Blaze (datashape, dask) ⟷ Clients, with NumPy, Pandas, SciPy, sklearn, etc. for analytics
  6. Anaconda — portable environments: conda, Python & R Open Source Analytics. NumPy, SciPy, Pandas, Scikit-learn, Jupyter / IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, scikit-image, NLTK, NetworkX, IRKernel, dplyr, shiny, ggplot2, tidyr, caret, nnet and 330+ packages. • Easy to install • Intuitive to discover • Quick to analyze • Simple to collaborate • Accessible to all
  7. Key (potential) benefits of dtype: • Turns imperative code into declarative code • Should provide a solid mechanism for ufunc dispatch
  8. Imperative to Declarative: NumPyIO (June 1998), my first Python extension — reading and analyzing data went from imperative fread/fwrite calls with an explicit data-format argument to declarative data storage via dtype, e.g. arr[1:10, -5].field1
  9. Function dispatch:
     def func(*args):
         key = tuple(arg.dtype for arg in args)  # a tuple, not a generator, so it can be a dict key
         return _funcmap[key](*args)
     Highly simplified! — quite a few details to do well…
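The dispatch pattern on the slide can be sketched as a small runnable program. Plain Python types stand in for NumPy dtypes here, and the `register` decorator and `_funcmap` registry are illustrative names, not a real NumPy/Blaze API:

```python
# Sketch of type-keyed function dispatch (illustrative; Python types
# stand in for dtypes, and _funcmap/register are made-up names).
_funcmap = {}

def register(*types):
    """Register an implementation for a tuple of argument types."""
    def deco(f):
        _funcmap[types] = f
        return f
    return deco

@register(int, int)
def _add_ints(a, b):
    return a + b

@register(float, float)
def _add_floats(a, b):
    return a + b

def func(*args):
    # Build a hashable tuple key; a generator (as on the slide)
    # cannot be used as a dict key.
    key = tuple(type(a) for a in args)
    try:
        return _funcmap[key](*args)
    except KeyError:
        raise TypeError(f"no implementation for {key}")

print(func(1, 2))      # 3
print(func(1.5, 2.5))  # 4.0
```

The real machinery must also handle type promotion and mixed-type keys, which is what "quite a few details to do well" alludes to.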
  10. (image-only slide)

  11. (image-only slide)
  12. Expressions: + - / * ^ [], join, groupby, filter, map, sort, take, where, topk. Metadata: datashape, dtype, shape, stride. Formats: hdf5, json, csv, xls, protobuf, avro, ... Backends: NumPy, Pandas, R, Julia, K, SQL, Spark, Mongo, Cassandra, ...
  13. Blaze Ecosystem: Blaze — interface to query data; datashape — data description language; odo — data migration; DyND — dynamic, multidimensional arrays; dask — parallel computing; castra — column store & query; bcolz — column store. Contributors: @mrocklin @cpcloud @quasiben @jcrist @cowlicks @FrancescAlted @mwiebe @izaid @eriknw @esc
  14. Expressions: blaze. Metadata: datashape. Data: numpy, pandas, sql DB, spark. Storage/migration: odo. Runtime: dask (parallel), numba (optimized), DyND, castra, bcolz
  15. Blaze — interface to query data on different storage systems. http://blaze.pydata.org/en/latest/
      from blaze import Data
      iris = Data('iris.csv')                        # CSV
      iris = Data('sqlite:///flowers.db::iris')      # SQL
      iris = Data('mongodb://localhost/mydb::iris')  # MongoDB
      iris = Data('iris.json')                       # JSON
      iris = Data('s3://blaze-data/iris.csv')        # S3
      …
      Current focus is the "dark data" and pydata stack for run-time (dask, dynd, numpy, pandas, x-ray, etc.) + customer needs (i.e. kdb, mongo).
  16. Blaze expressions:
      Select columns: iris[['sepal_length', 'species']]
      Operate: log(iris.sepal_length * 10)
      Reduce: iris.sepal_length.mean()
      Split-apply-combine: by(iris.species, shortest=iris.petal_length.min(), longest=iris.petal_length.max(), average=iris.petal_length.mean())
      Add new columns: transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width, petal_ratio=iris.petal_length / iris.petal_width)
      Text matching: iris.like(species='*versicolor')
      Relabel columns: iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH')
      Filter: iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]
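The by(...) split-apply-combine above can be sketched in plain Python to show what Blaze computes under the hood (toy records below are made up; this is not the Blaze API):

```python
# Plain-Python sketch of split-apply-combine, mirroring the by(...) call
# on the slide (toy data; not the Blaze API).
rows = [
    {"species": "Iris-setosa", "petal_length": 1.4},
    {"species": "Iris-setosa", "petal_length": 1.3},
    {"species": "Iris-versicolor", "petal_length": 4.7},
]

# Split: group petal lengths by species.
groups = {}
for row in rows:
    groups.setdefault(row["species"], []).append(row["petal_length"])

# Apply + combine: reduce each group to shortest / longest / average.
summary = {
    sp: {"shortest": min(v), "longest": max(v), "average": sum(v) / len(v)}
    for sp, v in groups.items()
}
print(summary["Iris-setosa"]["shortest"])  # 1.3
print(summary["Iris-setosa"]["longest"])   # 1.4
```

Blaze compiles the same logical operation down to whichever backend holds the data (SQL GROUP BY, pandas groupby, etc.).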
  17. Blaze uses datashape as its type system (like DyND):
      >>> iris = Data('iris.json')
      >>> iris.dshape
      dshape("""var * {
          petal_length: float64,
          petal_width: float64,
          sepal_length: float64,
          sepal_width: float64,
          species: string
      }""")
  18. Data Shape — a structured data description language. http://datashape.pydata.org/
      Dimension types (var, 3, 4, *) combine with unit dtypes (string, int32, float64):
      var * { x: int32, y: string, z: float64 } is a tabular datashape; the record dtype { x: int32, y: string, z: float64 } is an ordered struct — a collection of types keyed by labels.
  19. datashape examples:
      # Arrays
      3 * 4 * int32
      10 * var * float64
      3 * complex[float64]
      # Arrays of Structures
      100 * { name: string, birthday: date, address: { street: string, city: string, postalcode: string, country: string } }
      # Structure of Arrays
      { x: 100 * 100 * float32, y: 100 * 100 * float32, u: 100 * 100 * float32, v: 100 * 100 * float32 }
      # Function prototype
      (3 * int32, float64) -> 3 * float64
      # Function prototype with broadcasting dimensions
      (A... * int32, A... * int32) -> A... * int32
      Discovered datashapes for the iris sources:
      { flowersdb: { iris: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } },
        iriscsv: var * { sepal_length: ?float64, sepal_width: ?float64, petal_length: ?float64, petal_width: ?float64, species: ?string },
        irisjson: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string },
        irismongo: 150 * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } }
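The "discovered" shapes above come from inspecting the data itself. A toy discovery function (hypothetical and far simpler than the real datashape library, which also handles missing values, dates, and heterogeneous lists) shows the idea:

```python
# Toy datashape-style discovery for nested Python data (illustrative
# only; the real datashape library is far more general).
def discover(obj):
    if isinstance(obj, bool):          # check bool before int: True is an int
        return "bool"
    if isinstance(obj, int):
        return "int64"
    if isinstance(obj, float):
        return "float64"
    if isinstance(obj, str):
        return "string"
    if isinstance(obj, list):
        inner = {discover(x) for x in obj}
        # A fixed dimension if all elements share a shape, else "var".
        elem = inner.pop() if len(inner) == 1 else "var"
        return f"{len(obj)} * {elem}"
    if isinstance(obj, dict):
        fields = ", ".join(f"{k}: {discover(v)}" for k, v in obj.items())
        return "{ " + fields + " }"
    raise TypeError(f"cannot discover {type(obj)}")

print(discover([{"x": 1, "y": "a"}, {"x": 2, "y": "b"}]))
# 2 * { x: int64, y: string }
```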
  20. Blaze Server — lights up your dark data. Builds off of Blaze: a uniform interface to host data remotely through a JSON web API.
      server.yaml:
        iriscsv:
          source: iris.csv
        irisdb:
          source: sqlite:///flowers.db::iris
        irisjson:
          source: iris.json
          dshape: "var * {name: string, amount: float64}"
        irismongo:
          source: mongodb://localhost/mydb::iris
      $ blaze-server server.yaml -e
      localhost:6363/compute.json
  21. Blaze Client:
      >>> from blaze import Data
      >>> t = Data('blaze://localhost:6363')
      >>> t.fields
      [u'iriscsv', u'irisdb', u'irisjson', u'irismongo']
      >>> t.iriscsv
         sepal_length  sepal_width  petal_length  petal_width      species
      0           5.1          3.5           1.4          0.2  Iris-setosa
      1           4.9          3.0           1.4          0.2  Iris-setosa
      2           4.7          3.2           1.3          0.2  Iris-setosa
      >>> t.irisdb
         petal_length  petal_width  sepal_length  sepal_width      species
      0           1.4          0.2           5.1          3.5  Iris-setosa
      1           1.4          0.2           4.9          3.0  Iris-setosa
      2           1.3          0.2           4.7          3.2  Iris-setosa
  22. Compute recipes work with existing libraries and have multiple backends: • python list • numpy arrays • dynd • pandas DataFrame • Spark, Impala • Mongo • dask
  23. • Ideally, you can layer expressions over any data. • Write once, deploy anywhere. • Practically, expressions will work better on specific data structures, formats, and engines. • You will need to copy from one format and/or engine to another.
  24. Odo: • A library for turning things into other things • Factored out from the Blaze project • Handles a huge variety of conversions • odo is cp with types, for data
  25. odo — data migration, ~ cp with types, for data. http://odo.pydata.org/en/latest/
      from odo import odo
      odo(source, target)
      odo('iris.json', 'mongodb://localhost/mydb::iris')
      odo('iris.json', 'sqlite:///flowers.db::iris')
      odo('iris.csv', 'iris.json')
      odo('iris.csv', 'hdfs://hostname:iris.csv')
      odo('hive://hostname/default::iris_csv',
          'hive://hostname/default::iris_parquet',
          stored_as='PARQUET', external=False)
  26. Each node is a type (DataFrame, list, sqlalchemy.Table, etc.); each edge is a conversion function.
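That picture — types as nodes, conversion functions as edges — can be sketched with a tiny graph and a breadth-first path search. The converters below are made-up stand-ins, not odo's internals:

```python
from collections import deque

# Nodes are type names; edges are conversion functions (hypothetical
# converters that just record their application; odo's real graph is
# much larger and weights edges by conversion cost).
EDGES = {
    ("csv", "dataframe"): lambda d: f"df({d})",
    ("dataframe", "json"): lambda d: f"json({d})",
    ("json", "mongo"): lambda d: f"mongo({d})",
}

def convert(data, src, dst):
    """BFS over the conversion graph for the shortest chain of converters."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            for f in path:          # apply the chain of conversions in order
                data = f(data)
            return data
        for (a, b), f in EDGES.items():
            if a == node and b not in seen:
                seen.add(b)
                queue.append((b, path + [f]))
    raise ValueError(f"no conversion path {src} -> {dst}")

print(convert("iris.csv", "csv", "mongo"))  # mongo(json(df(iris.csv)))
```

Adding a new format to such a system only requires edges to and from one existing node; the path search finds multi-hop routes automatically.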
  27. DASK — thanks to Christine Doig and Blake Griffith for slides
  28. dask enables parallel computing. http://dask.pydata.org/en/latest/ Single core computing → parallel computing on shared memory → distributed cluster. Gigabyte: fits in memory; Terabyte: fits on disk; Petabyte: fits on many disks.
  29. dask enables parallel computing. Single core computing (numpy, pandas) → shared-memory parallel computing (dask) → distributed cluster (dask.distributed). Gigabyte: fits in memory; Terabyte: fits on disk; Petabyte: fits on many disks.
  30. dask enables parallel computing. Single core computing (numpy, pandas) → shared memory (dask: threaded scheduler, multiprocessing scheduler) → distributed cluster (dask.distributed).
  31. dask array — numpy vs dask:
      >>> import numpy as np
      >>> np_ones = np.ones((5000, 1000))
      >>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
      >>> np_y
      array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, 693.14718056])
      >>> import dask.array as da
      >>> da_ones = da.ones((5000000, 1000000), chunks=(1000, 1000))
      >>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
      >>> np_da_y = np.array(da_y)  # fits in memory
      array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, …, 693.14718056])
      # Result doesn't fit in memory:
      >>> da_y.to_hdf5('myfile.hdf5', 'result')
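The blocked-algorithm idea behind dask.array can be sketched without dask at all: split the data into chunks, compute each chunk's partial result, then combine. This pure-Python stand-in mirrors the log(ones + 1) sum above on a small list:

```python
import math

# Pure-Python sketch of the blocked algorithm dask.array uses: operate
# on one chunk at a time and combine the partial results. No chunk ever
# needs the whole dataset in memory (no dask required for the sketch).
def chunked(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

data = [1.0] * 5000                              # stand-in for a large array of ones
partial = [sum(math.log(x + 1) for x in chunk)   # per-chunk work
           for chunk in chunked(data, 1000)]     # 5 chunks of 1000
total = sum(partial)                             # combine step
print(total)  # ~ 5000 * log(2)
```

dask builds the same chunk/combine structure as a task graph and hands it to a scheduler, so the per-chunk work can run in parallel.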
  32. dask dataframe — pandas vs dask:
      >>> import pandas as pd
      >>> df = pd.read_csv('iris.csv')
      >>> df.head()
         sepal_length  sepal_width  petal_length  petal_width      species
      0           5.1          3.5           1.4          0.2  Iris-setosa
      1           4.9          3.0           1.4          0.2  Iris-setosa
      2           4.7          3.2           1.3          0.2  Iris-setosa
      3           4.6          3.1           1.5          0.2  Iris-setosa
      4           5.0          3.6           1.4          0.2  Iris-setosa
      >>> max_sepal_length_setosa = df[df.species == 'setosa'].sepal_length.max()
      5.7999999999999998
      >>> import dask.dataframe as dd
      >>> ddf = dd.read_csv('*.csv')
      >>> ddf.head()  # same output as above
      >>> d_max_sepal_length_setosa = ddf[ddf.species == 'setosa'].sepal_length.max()
      >>> d_max_sepal_length_setosa.compute()
      5.7999999999999998
  33. dask bag — semi-structured data, like JSON blobs or log files:
      >>> import dask.bag as db
      >>> import json
      # Get tweets as a dask.bag from compressed json files
      >>> b = db.from_filenames('*.json.gz').map(json.loads)
      # Take two items from the dask.bag
      >>> b.take(2)
      ({u'contributors': None, u'coordinates': None, u'created_at': u'Fri Oct 10 17:19:35 +0000 2014', u'entities': {u'hashtags': [], u'symbols': [], u'trends': [], u'urls': [], u'user_mentions': []}, u'favorite_count': 0, u'favorited': False, u'filter_level': u'medium', u'geo': None …
      # Count the frequencies of user locations
      >>> freq = b.pluck('user').pluck('location').frequencies()
      # Get the result as a dataframe
      >>> df = freq.to_dataframe()
      >>> df.compute()
                                0      1
      0                              20916
      1                        Natal      2
      2     Planet earth. Sheffield.      1
      3                    Mad, USERA     1
      4          Brasilia DF - Brazil     2
      5              Rondonia Cacoal      1
      6             msftsrep || 4/5.      1
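The map / pluck / frequencies pipeline above can be sketched with the standard library alone — the records below are made-up toy tweets, and this is not the dask.bag API:

```python
import json
from collections import Counter

# Stdlib sketch of the dask.bag pipeline on the slide: parse JSON
# records, pluck a nested field, count frequencies (toy data).
lines = [
    '{"user": {"location": "Natal"}}',
    '{"user": {"location": ""}}',
    '{"user": {"location": "Natal"}}',
]
records = map(json.loads, lines)                      # ~ .map(json.loads)
locations = (r["user"]["location"] for r in records)  # ~ .pluck('user').pluck('location')
freq = Counter(locations)                             # ~ .frequencies()
print(freq.most_common(1))  # [('Natal', 2)]
```

What dask.bag adds is exactly the chunking and scheduling: the same pipeline runs over many compressed files in parallel without loading them all at once.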
  34. dask distributed:
      >>> import dask
      >>> from dask.distributed import Client
      # client connected to 50 nodes, 2 workers per node
      >>> dc = Client('tcp://localhost:9000')
      # or
      >>> dc = Client('tcp://ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:9000')
      >>> b = db.from_s3('githubarchive-data', '2015-*.json.gz').map(json.loads)
      # use default single node scheduler
      >>> top_commits.compute()
      # use client with distributed cluster
      >>> top_commits.compute(get=dc.get)
      [(u'mirror-updates', 1463019), (u'KenanSulayman', 235300), (u'greatfirebot', 167558), (u'rydnr', 133323), (u'markkcc', 127625)]
  35. dask can be a backend/engine for blaze — e.g. we can drive dask arrays with blaze:
      >>> x = da.from_array(...)               # Make a dask array
      >>> from blaze import Data, log, compute
      >>> d = Data(x)                          # Wrap with Blaze
      >>> y = log(d + 1)[:5].sum(axis=1)       # Do work as usual
      >>> result = compute(y)                  # Fall back to dask
  36. Space of Python Compilation — Ahead Of Time vs Just In Time:
      • Relies on CPython / libpython — AOT: Cython, Shedskin, Nuitka (today), Pythran, Numba; JIT: Numba, HOPE, Theano, Pyjion
      • Replaces CPython / libpython — AOT: Nuitka (future); JIT: Pyston, PyPy
  37. Compiler overview: frontends (parsing) for C, C++, Fortran, ObjC feed an Intermediate Representation (IR); backends (code generation) target x86, ARM, PTX.
  38. Numba: Python is parsed by Numba (the frontend) into LLVM's Intermediate Representation (IR); LLVM's backends then generate code for x86, ARM, PTX.
  39. How Numba works:
      @jit
      def do_math(a, b): …
      >>> do_math(x, y)
      Python Function → Bytecode Analysis → Type Inference (using the function arguments) → Numba IR → Rewrite IR → Lowering → LLVM IR → LLVM JIT → Machine Code → Cache → Execute!
  40. Numba Features: • Numba supports: Windows, OS X, and Linux; 32- and 64-bit x86 CPUs and NVIDIA GPUs; Python 2 and 3; NumPy versions 1.6 through 1.9 • Does not require a C/C++ compiler on the user's system • < 70 MB to install • Does not replace the standard Python interpreter (all of your existing Python libraries are still available)
  41. Numba Modes: • object mode: compiled code operates on Python objects. The only significant performance improvement is compilation of loops that can be compiled in nopython mode (see below). • nopython mode: compiled code operates on "machine native" data. Usually within 25% of the performance of equivalent C or FORTRAN.
  42. How to Use Numba:
      1. Create a realistic benchmark test case. (Do not use your unit tests as a benchmark!)
      2. Run a profiler on your benchmark. (cProfile is a good choice.)
      3. Identify hotspots that could potentially be compiled by Numba with a little refactoring. (See the rest of this talk and the online documentation.)
      4. Apply @numba.jit and @numba.vectorize as needed to critical functions. (Small rewrites may be needed to work around Numba limitations.)
      5. Re-run the benchmark to check if there was a performance improvement.
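Step 2 looks like this with the stdlib cProfile the slide recommends. The benchmark and hotspot functions are made-up stand-ins for a real workload:

```python
import cProfile
import pstats

def hotspot(n):
    # Made-up numeric stand-in; in a real project this is the function
    # you suspect dominates the runtime.
    return sum(i * i for i in range(n))

def benchmark():
    # Step 1: a realistic benchmark, not your unit tests.
    for _ in range(50):
        hotspot(10_000)

# Step 2: profile the benchmark and sort by cumulative time; the
# functions at the top of the report are the candidates for @numba.jit.
profiler = cProfile.Profile()
profiler.enable()
benchmark()
profiler.disable()
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(5)  # show the top 5 entries
```

Sorting by "cumulative" surfaces the call trees that dominate the run; "tottime" instead highlights functions whose own bodies are expensive.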
  43. A Whirlwind Tour of Numba Features: • Sometimes you can't create a simple or efficient array expression or ufunc. Use Numba to work with array elements directly. • Example: suppose you have a boolean grid and you want to find the maximum number of neighbors a cell has in the grid.
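That boolean-grid example is exactly the kind of explicit elementwise loop Numba compiles well. A pure-Python version of the idea (the function name is illustrative; with Numba installed you would decorate it with @numba.jit, ideally in nopython mode):

```python
# For each cell of a boolean grid, count how many of its 8 neighbors are
# True, and return the maximum such count over the grid. An explicit
# nested loop like this is hard to write as a clean array expression but
# is the shape of code @numba.jit handles well.
def max_neighbors(grid):
    rows, cols = len(grid), len(grid[0])
    best = 0
    for i in range(rows):
        for j in range(cols):
            count = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue  # skip the cell itself
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols and grid[ni][nj]:
                        count += 1
            best = max(best, count)
    return best

grid = [[True,  True,  False],
        [True,  True,  False],
        [False, False, False]]
print(max_neighbors(grid))  # 3
```

With Numba, the same function run on a NumPy boolean array compiles to a tight machine-code loop; no vectorization gymnastics are needed.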
  44. The Basics: array allocation, looping over an ndarray x as an iterator, using numpy math functions, returning a slice of the array — a 2.7x speedup from the Numba decorator alone (nopython=True not required).
  45. Calling Other Functions: one function is not inlined, the other is inlined — a 9.8x speedup compared to doing this with numpy functions.
  46. Case study — j0 from scipy.special: • scipy.special was one of the first libraries I wrote (in 1999) • extended the "umath" module by adding new "universal functions" to compute many scientific functions by wrapping C and Fortran libs • Bessel functions are solutions to a differential equation:
      x² d²y/dx² + x dy/dx + (x² − α²) y = 0,   y = J_α(x)
      J_n(x) = (1/π) ∫₀^π cos(nτ − x sin τ) dτ
  47. Result — equivalent to compiled code:
      In [6]: %timeit vj0(x)
      10000 loops, best of 3: 75 us per loop
      In [7]: from scipy.special import j0
      In [8]: %timeit j0(x)
      10000 loops, best of 3: 75.3 us per loop
      But! Now the code is in Python and can be experimented with more easily (and moved to the GPU / accelerator more easily)!
  48. Word starting to get out! A recent numba mailing-list report describes experiments by a SciPy author who got a 2x speed-up by removing their Cython type annotations and surrounding the function with numba.jit (with a few minor changes needed to the code). As soon as Numba's ahead-of-time compilation moves beyond the experimental stage, one can legitimately use Numba to create a library that you ship to others (who then don't need to have Numba installed — or just need a Numba run-time installed). SciPy (and NumPy) would look very different if Numba had existed 16 years ago when SciPy was getting started… — and you would all be happier.
  49. Releasing the GIL: Many fret about the GIL in Python, and with the PyData stack you often have multi-threaded code. In the PyData stack we quite often release the GIL: NumPy does it; SciPy does it (quite often); scikit-learn (now) does it; Pandas (now) does it when possible. Cython makes it easy; Numba makes it easy.
  50. CUDA Python (in open-source Numba!): CUDA development using Python syntax for optimal performance! You have to understand CUDA at least a little — writing kernels that launch in parallel on the GPU.
  51. Black-Scholes: Results — Core i7 vs GeForce GTX 560 Ti: about 9x faster on this GPU, ~ same speed as CUDA-C.
  52. Other interesting things: • CUDA Simulator to debug your code in the Python interpreter • Generalized ufuncs (@guvectorize) • Call ctypes and cffi functions directly and pass them as arguments • Preliminary support for types that understand the buffer protocol • Pickle Numba functions to run on remote execution engines • "numba annotate" to dump an HTML annotated version of compiled code • See: http://numba.pydata.org/numba-doc/0.20.0/
  53. What Doesn't Work? (A non-comprehensive list) • Sets, lists, dictionaries, user-defined classes (tuples do work!) • List, set and dictionary comprehensions • Recursion • Exceptions with non-constant parameters • Most string operations (buffer support is very preliminary!) • yield from • Closures inside a JIT function (compiling JIT functions inside a closure works…) • Modifying globals • Passing an axis argument to numpy array reduction functions • Easy debugging (you have to debug in Python mode)
  54. The (Near) Future (also a non-comprehensive list): • "JIT Classes" • Better support for strings/bytes, buffers, and parsing use-cases • More coverage of the NumPy API (advanced indexing, etc.) • Documented extension API for adding your own types, low-level function implementations, and targets • Better debug workflows
  55. Recently Added Numba Features: • A new GPU target: the Heterogeneous System Architecture, supported by AMD APUs • Support for named tuples in nopython mode • Limited support for lists in nopython mode • On-disk caching of compiled functions (opt-in) • A simulator for debugging GPU functions with the Python debugger on the CPU • Can choose to release the GIL in nopython functions • Many speed improvements
  56. New Features: • Support for ARMv7 (Raspberry Pi 2) • Python 3.5 support • NumPy 1.10 support • Faster loading of pre-compiled functions from the disk cache • ufunc compilation for multithreaded CPU and GPU targets (features only in NumbaPro previously)
  57. Conclusion: • Lots of progress in the past year! • Try out Numba on your numerical and NumPy-related projects: conda install numba • Your feedback helps us make Numba better! Tell us what you would like to see: https://github.com/numba/numba • Stay tuned for more exciting stuff this year…
  58. Thanks — September 18, 2015: • DARPA XDATA program (Chris White and Wade Shen), which helped fund Numba, Blaze, Dask and Odo • Investors of Continuum • Clients and customers of Continuum who help support these projects • NumFOCUS volunteers • PyData volunteers