
Using Anaconda to light-up Dark Data


Talk given as part of the BIDS seminar.


Travis E. Oliphant

September 18, 2015

Transcript

  1. BIDS Data Science Seminar: Using Anaconda to light-up dark data. Travis E. Oliphant, PhD. September 18, 2015
  2. Started as a Scientist / Engineer. Images from BYU CERS Lab
  3. Science led to Python. Raja Muthupillai, Armando Manduca, Richard Ehman, Jim Greenleaf, 1997: $\rho_0 (2\pi f)^2 U_i(a, f) = \left[ C_{ijkl}(a, f)\, U_{k,l}(a, f) \right]_{,j}$, $\Xi = \nabla \times U$
  4. “Distractions” led to my calling 4

  5. 5 Latest Cosmological Theory

  6. Dark Data: CSV, hdf5, npz, logs, emails, and other files in your company outside a traditional store
  7. Dark Data: CSV, hdf5, npz, logs, emails, and other files in your company outside a traditional store
  8. Database Approach (diagram): Data Sources → Data Store → Clients

  9. Bring the Database to the Data (diagram): Data Sources → Blaze (datashape, dask) → Clients, with NumPy, Pandas, SciPy, sklearn, etc. for analytics
  10. Anaconda — portable environments. conda. Python & R Open Source Analytics: NumPy, SciPy, Pandas, scikit-learn, Jupyter / IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, scikit-image, NLTK, NetworkX, IRKernel, dplyr, shiny, ggplot2, tidyr, caret, nnet and 330+ packages. • Easy to install • Intuitive to discover • Quick to analyze • Simple to collaborate • Accessible to all
  11. DTYPE INNOVATION (AN ASIDE)
  12. Key (potential) benefits of dtype: • Turns imperative code into declarative code • Should provide a solid mechanism for ufunc dispatch
  13. Imperative to Declarative. NumPyIO, June 1998 — my first Python extension: fread, fwrite (read / analyze a data format) → declarative data storage with dtype: arr[1:10,-5].field1
  14. Function dispatch:
    def func(*args):
        key = tuple(arg.dtype for arg in args)
        return _funcmap[key](*args)
    Highly simplified! — quite a few details to do well…
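    A runnable sketch of that idea (the kernel table and its float64/int64 entries are illustrative, not an actual NumPy or Numba API):
    import numpy as np

    def _add_float64(a, b):
        return a + b  # stand-in for a specialized float64 kernel

    def _add_int64(a, b):
        return a + b  # stand-in for a specialized int64 kernel

    _funcmap = {
        (np.dtype('float64'), np.dtype('float64')): _add_float64,
        (np.dtype('int64'), np.dtype('int64')): _add_int64,
    }

    def func(*args):
        key = tuple(arg.dtype for arg in args)  # dtypes form a hashable key
        return _funcmap[key](*args)

    func(np.ones(3), np.ones(3))        # dispatches to the float64 kernel
    func(np.arange(3), np.arange(3))    # dispatches to the int64 kernel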
  15. WHY BLAZE? 15 Thanks to Peter Wang for slides.

  16. 16

  17. 17 Data

  18. 18 “Math” Data

  19. 19 Math Big Data

  20. 20 Math Big Data

  21. 21 Math Big Data

  22. 22 Math Big Data Programs

  23. 23 “General Purpose Programming”

  24. 24 Analytics System Domain-Specific Query Language

  25. 25

  26. 26 ?

  27. 27 Expressions Metadata Runtime

  28. 28 + - / * ^ [] join, groupby, filter

    map, sort, take where, topk datashape, dtype, shape, stride hdf5, json, csv, xls protobuf, avro, ... NumPy, Pandas, R, Julia, K, SQL, Spark, Mongo, Cassandra, ...
  29. BLAZE ECOSYSTEM 29 Thanks to Christine Doig for slides.

  30. Blaze Ecosystem: Blaze (interface to query data), datashape (data description language), odo (data migration), DyND (dynamic, multidimensional arrays), dask (parallel computing), castra (column store & query), bcolz (column store). @mrocklin @cpcloud @quasiben @jcrist @cowlicks @FrancescAlted @mwiebe @izaid @eriknw @esc
  31. (Diagram) Expressions: blaze. Metadata: datashape. Storage: sql DB, castra, bcolz, with odo for migration. Runtime: numpy, pandas, spark; dask (parallel), numba and DyND (optimized).
  32. (Diagram) Data — metadata: datashape; storage/containers: odo. Runtime — compute: dask (parallelize; optimize, JIT). Expressions — APIs, syntax, language: blaze.
  33. BLAZE LIBRARY. Thanks to Christine Doig and Phillip Cloud for slides.
  34. Blaze — interface to query data on different storage systems. http://blaze.pydata.org/en/latest/
    from blaze import Data
    iris = Data('iris.csv')                        # CSV
    iris = Data('sqlite:///flowers.db::iris')      # SQL
    iris = Data('mongodb://localhost/mydb::iris')  # MongoDB
    iris = Data('iris.json')                       # JSON
    iris = Data('s3://blaze-data/iris.csv')        # S3
    …
    Current focus is the “dark data” and pydata stack for run-time (dask, dynd, numpy, pandas, x-ray, etc.) + customer needs (i.e. kdb, mongo).
  35. Blaze expressions:
    Select columns: iris[['sepal_length', 'species']]
    Operate: log(iris.sepal_length * 10)
    Reduce: iris.sepal_length.mean()
    Split-apply-combine: by(iris.species, shortest=iris.petal_length.min(), longest=iris.petal_length.max(), average=iris.petal_length.mean())
    Add new columns: transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width, petal_ratio=iris.petal_length / iris.petal_width)
    Text matching: iris.like(species='*versicolor')
    Relabel columns: iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH')
    Filter: iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]
  36. Blaze uses datashape as its type system (like DyND):
    >>> iris = Data('iris.json')
    >>> iris.dshape
    dshape("""var * {
        petal_length: float64,
        petal_width: float64,
        sepal_length: float64,
        sepal_width: float64,
        species: string
        }""")
  37. Data Shape — a structured data description language. http://datashape.pydata.org/
    Unit types: dimensions (var, 3, 4, …) and dtypes (string, int32, float64, …), composed with *.
    A tabular datashape: var * { x : int32, y : string, z : float64 }
    A record dtype — an ordered struct, a collection of types keyed by labels: { x : int32, y : string, z : float64 }
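    A minimal sketch of using the datashape library directly (assuming its dshape constructor and the shape/measure attributes):
    from datashape import dshape

    ds = dshape('var * { x : int32, y : string, z : float64 }')
    ds.shape    # the dimensions: (var,)
    ds.measure  # the record dtype: { x : int32, y : string, z : float64 }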
  38. datashape examples:
    { flowersdb: { iris: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } },
      iriscsv: var * { sepal_length: ?float64, sepal_width: ?float64, petal_length: ?float64, petal_width: ?float64, species: ?string },
      irisjson: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string },
      irismongo: 150 * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } }
    # Arrays
    3 * 4 * int32
    10 * var * float64
    3 * complex[float64]
    # Arrays of Structures
    100 * { name: string, birthday: date, address: { street: string, city: string, postalcode: string, country: string } }
    # Structure of Arrays
    { x: 100 * 100 * float32, y: 100 * 100 * float32, u: 100 * 100 * float32, v: 100 * 100 * float32 }
    # Function prototype
    (3 * int32, float64) -> 3 * float64
    # Function prototype with broadcasting dimensions
    (A... * int32, A... * int32) -> A... * int32
  39. Blaze Server — lights up your Dark Data. Builds off of the Blaze uniform interface to host data remotely through a JSON web API.
    server.yaml (YAML):
    iriscsv:
        source: iris.csv
    irisdb:
        source: sqlite:///flowers.db::iris
    irisjson:
        source: iris.json
        dshape: "var * {name: string, amount: float64}"
    irismongo:
        source: mongodb://localhost/mydb::iris
    $ blaze-server server.yaml -e
    localhost:6363/compute.json
  40. Blaze Server — Blaze Client:
    >>> from blaze import Data
    >>> t = Data('blaze://localhost:6363')
    >>> t.fields
    [u'iriscsv', u'irisdb', u'irisjson', u'irismongo']
    >>> t.iriscsv
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    >>> t.irisdb
       petal_length  petal_width  sepal_length  sepal_width      species
    0           1.4          0.2           5.1          3.5  Iris-setosa
    1           1.4          0.2           4.9          3.0  Iris-setosa
    2           1.3          0.2           4.7          3.2  Iris-setosa
  41. Compute recipes work with existing libraries and have multiple backends: • python list • numpy arrays • dynd • pandas DataFrame • Spark, Impala • Mongo • dask
  42. • Ideally, you can layer expressions over any data. • Write once, deploy anywhere. • Practically, expressions will work better on specific data structures, formats, and engines. • You will need to copy from one format and/or engine to another.
  43. ODO LIBRARY. Thanks to Phillip Cloud and Christine Doig for slides.
  44. Odo • A library for turning things into other things • Factored out from the blaze project • Handles a huge variety of conversions • odo is cp with types, for data
  45. odo — data migration, ~ cp with types, for data. http://odo.pydata.org/en/latest/
    from odo import odo
    odo(source, target)
    odo('iris.json', 'mongodb://localhost/mydb::iris')
    odo('iris.json', 'sqlite:///flowers.db::iris')
    odo('iris.csv', 'iris.json')
    odo('iris.csv', 'hdfs://hostname:iris.csv')
    odo('hive://hostname/default::iris_csv',
        'hive://hostname/default::iris_parquet',
        stored_as='PARQUET', external=False)
  46. How does it work? Through a network of conversions.
  47. Each node is a type (DataFrame, list, sqlalchemy.Table, etc.). Each edge is a conversion function.
  48. It’s extensible!
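    A sketch of what an extension looks like, assuming odo's convert.register decorator; the Foo container is hypothetical:
    import pandas as pd
    from odo import convert

    class Foo(object):
        # A hypothetical custom container holding rows of records.
        def __init__(self, rows):
            self.rows = rows

    @convert.register(pd.DataFrame, Foo, cost=1.0)
    def foo_to_dataframe(foo, **kwargs):
        # Registers a new edge in the graph: odo can now reach
        # DataFrame (and everything beyond it) from Foo.
        return pd.DataFrame(foo.rows)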

  49. DASK. Thanks to Christine Doig and Blake Griffith for slides.
  50. dask enables parallel computing. http://dask.pydata.org/en/latest/ From single core computing to parallel computing: shared memory, distributed cluster. Gigabyte: fits in memory. Terabyte: fits on disk. Petabyte: fits on many disks.
  51. dask enables parallel computing. http://dask.pydata.org/en/latest/ Gigabyte (fits in memory, single core): numpy, pandas. Terabyte (fits on disk, shared memory): dask. Petabyte (fits on many disks, distributed cluster): dask.distributed.
  52. dask enables parallel computing. http://dask.pydata.org/en/latest/ Single core computing: numpy, pandas. Shared memory: dask, with threaded or multiprocessing schedulers. Distributed cluster: dask.distributed.
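    A sketch of selecting a scheduler explicitly with the get= keyword dask exposed at the time (the threaded scheduler is the default for dask.array):
    import dask.array as da
    import dask.multiprocessing

    x = da.ones((10000, 1000), chunks=(1000, 1000))
    x.sum().compute()                              # threaded scheduler (default)
    x.sum().compute(get=dask.multiprocessing.get)  # multiprocessing scheduler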
  53. dask array — numpy vs dask:
    >>> import numpy as np
    >>> np_ones = np.ones((5000, 1000))
    >>> np_ones
    array([[ 1., 1., 1., ..., 1., 1., 1.],
           ...,
           [ 1., 1., 1., ..., 1., 1., 1.]])
    >>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
    >>> np_y
    array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, 693.14718056])
    >>> import dask.array as da
    >>> da_ones = da.ones((5000000, 1000000), chunks=(1000, 1000))
    >>> da_ones.compute()
    array([[ 1., 1., 1., ..., 1., 1., 1.],
           ...,
           [ 1., 1., 1., ..., 1., 1., 1.]])
    >>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
    >>> np_da_y = np.array(da_y)  # fits in memory
    array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, …, 693.14718056])
    # Result doesn’t fit in memory
    >>> da_y.to_hdf5('myfile.hdf5', 'result')
  54. dask dataframe — pandas vs dask:
    >>> import pandas as pd
    >>> df = pd.read_csv('iris.csv')
    >>> df.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    >>> max_sepal_length_setosa = df[df.species == 'setosa'].sepal_length.max()
    5.7999999999999998
    >>> import dask.dataframe as dd
    >>> ddf = dd.read_csv('*.csv')
    >>> ddf.head()
    (same five rows as above) …
    >>> d_max_sepal_length_setosa = ddf[ddf.species == 'setosa'].sepal_length.max()
    >>> d_max_sepal_length_setosa.compute()
    5.7999999999999998
  55. dask bag — semi-structured data, like JSON blobs or log files:
    >>> import dask.bag as db
    >>> import json
    # Get tweets as a dask.bag from compressed json files
    >>> b = db.from_filenames('*.json.gz').map(json.loads)
    # Take two items in dask.bag
    >>> b.take(2)
    ({u'contributors': None, u'coordinates': None,
      u'created_at': u'Fri Oct 10 17:19:35 +0000 2014',
      u'entities': {u'hashtags': [], u'symbols': [], u'trends': [], u'urls': [], u'user_mentions': []},
      u'favorite_count': 0, u'favorited': False, u'filter_level': u'medium', u'geo': None
    …
    # Count the frequencies of user locations
    >>> freq = b.pluck('user').pluck('location').frequencies()
    # Get the result as a dataframe
    >>> df = freq.to_dataframe()
    >>> df.compute()
                               0      1
    0                              20916
    1                      Natal      2
    2   Planet earth. Sheffield.      1
    3                 Mad, USERA      1
    4       Brasilia DF - Brazil      2
    5            Rondonia Cacoal      1
    6           msftsrep || 4/5.      1
  56. dask distributed:
    >>> import dask
    >>> from dask.distributed import Client
    # client connected to 50 nodes, 2 workers per node.
    >>> dc = Client('tcp://localhost:9000')
    # or
    >>> dc = Client('tcp://ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:9000')
    >>> b = db.from_s3('githubarchive-data', '2015-*.json.gz').map(json.loads)
    # ... build top_commits from b (derivation elided on the slide) ...
    # use default single node scheduler
    >>> top_commits.compute()
    # use client with distributed cluster
    >>> top_commits.compute(get=dc.get)
    [(u'mirror-updates', 1463019),
     (u'KenanSulayman', 235300),
     (u'greatfirebot', 167558),
     (u'rydnr', 133323),
     (u'markkcc', 127625)]
  57. dask + blaze — dask can be a backend/engine for blaze; e.g. we can drive dask arrays with blaze:
    >>> x = da.from_array(...)                 # Make a dask array
    >>> from blaze import Data, log, compute
    >>> d = Data(x)                            # Wrap with Blaze
    >>> y = log(d + 1)[:5].sum(axis=1)         # Do work as usual
    >>> result = compute(y)                    # Fall back to dask
  58. • Collections build task graphs • Schedulers execute task graphs • Graph specification = uniting interface
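    The graph specification is just a dict mapping keys to values or tasks (tuples of a callable and its arguments); a minimal sketch:
    from operator import add, mul
    from dask.threaded import get

    dsk = {'a': 1,
           'b': 2,
           'c': (add, 'a', 'b'),   # c = a + b
           'd': (mul, 'c', 10)}    # d = c * 10

    get(dsk, 'd')  # any scheduler can execute this graph; returns 30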
  59. Questions? http://dask.pydata.org 59

  60. NUMBA 60 Thanks to Stan Seibert for slides

  61. Space of Python Compilation:
    Relies on CPython / libpython — Ahead Of Time: Cython, Shedskin, Nuitka (today), Pythran; Just In Time: Numba, HOPE, Theano, Pyjion.
    Replaces CPython / libpython — Ahead Of Time: Nuitka (future); Just In Time: Pyston, PyPy.
  62. Compiler overview: Parsing / Frontend (C, C++, Fortran, ObjC) → Intermediate Representation (IR) → Code Generation / Backend (x86, ARM, PTX)
  63. Numba: Parsing / Frontend (Python) → LLVM Intermediate Representation (IR) → Code Generation / Backend (x86, ARM, PTX)
  64. Example 64 Numba
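    (The slide shows the code as an image; a minimal example in the same spirit, assuming only numba.jit:)
    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def sum2d(arr):
        # Plain Python loops over array elements compile to machine code.
        m, n = arr.shape
        total = 0.0
        for i in range(m):
            for j in range(n):
                total += arr[i, j]
        return total

    sum2d(np.random.rand(1000, 1000))  # first call compiles; later calls are fast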

  65. How Numba works: Python function (@jit def do_math(a, b): …) → bytecode analysis → Numba IR; function arguments (>>> do_math(x, y)) → type inference → rewrite IR → lowering → LLVM IR → LLVM JIT → machine code → cache → execute!
  66. Numba Features • Numba supports: Windows, OS X, and Linux; 32 and 64-bit x86 CPUs and NVIDIA GPUs; Python 2 and 3; NumPy versions 1.6 through 1.9 • Does not require a C/C++ compiler on the user’s system • < 70 MB to install • Does not replace the standard Python interpreter (all of your existing Python libraries are still available)
  67. Numba Modes • object mode: compiled code operates on Python objects. The only significant performance improvement is compilation of loops that can be compiled in nopython mode (see below). • nopython mode: compiled code operates on “machine native” data; usually within 25% of the performance of equivalent C or FORTRAN.
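    A sketch of the difference (nopython=True forces nopython mode and raises if the function cannot be compiled without Python objects):
    from numba import jit

    @jit                   # may fall back to object mode
    def flexible(x):
        return x.upper()   # works via Python objects, little speedup

    @jit(nopython=True)    # machine-native data only
    def fast_loop(n):
        total = 0
        for i in range(n):
            total += i * i
        return total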
  68. How to Use Numba: 1. Create a realistic benchmark test case. (Do not use your unit tests as a benchmark!) 2. Run a profiler on your benchmark. (cProfile is a good choice.) 3. Identify hotspots that could potentially be compiled by Numba with a little refactoring. (See the rest of this talk and the online documentation.) 4. Apply @numba.jit and @numba.vectorize as needed to critical functions. (Small rewrites may be needed to work around Numba limitations.) 5. Re-run the benchmark to check if there was a performance improvement.
  69. A Whirlwind Tour of Numba Features • Sometimes you can’t create a simple or efficient array expression or ufunc. Use Numba to work with array elements directly. • Example: suppose you have a boolean grid and you want to find the maximum number of neighbors a cell has in the grid (a sketch of this kernel follows slide 71 below):
  70. The Basics 70

  71. The Basics: array allocation; looping over ndarray x as an iterator; using numpy math functions; returning a slice of the array; Numba decorator (nopython=True not required). 2.7x speedup!
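    (The slide's code is an image; a sketch of the boolean-grid neighbor count from slide 69, assuming only numpy and numba.jit:)
    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def max_neighbors(grid):
        # grid: 2-D boolean array; count True neighbors of each interior cell.
        n, m = grid.shape
        best = 0
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                count = 0
                for di in range(-1, 2):
                    for dj in range(-1, 2):
                        if (di != 0 or dj != 0) and grid[i + di, j + dj]:
                            count += 1
                if count > best:
                    best = count
        return best

    grid = np.random.rand(100, 100) > 0.5
    max_neighbors(grid)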
  72. Calling Other Functions 72

  73. Calling Other Functions: this function is not inlined; this function is inlined. 9.8x speedup compared to doing this with numpy functions.
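    A sketch of the pattern — jitted functions can call other jitted functions directly, and LLVM will typically inline small ones:
    from numba import jit

    @jit(nopython=True)
    def clamp(x, lo, hi):   # a small helper, a good candidate for inlining
        if x < lo:
            return lo
        elif x > hi:
            return hi
        return x

    @jit(nopython=True)
    def clamp_sum(arr, lo, hi):
        total = 0.0
        for i in range(arr.shape[0]):
            total += clamp(arr[i], lo, hi)  # direct call, no Python overhead
        return total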
  74. Making Ufuncs 74

  75. Making Ufuncs: Monte Carlo simulating 500,000 tournaments in 50 ms.
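    The tournament code itself is not in the transcript; a minimal sketch of making a ufunc with numba.vectorize:
    import numpy as np
    from numba import vectorize

    @vectorize(['float64(float64, float64)'])
    def rel_diff(x, y):
        # Compiled into a true NumPy ufunc: broadcasting, .reduce, etc. all work.
        return 2.0 * (x - y) / (x + y)

    a = np.random.rand(1000000)
    b = np.random.rand(1000000)
    rel_diff(a, b)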
  76. Case study — j0 from scipy.special • scipy.special was one of the first libraries I wrote (in 1999) • extended the “umath” module by adding new “universal functions” to compute many scientific functions by wrapping C and Fortran libs • Bessel functions are solutions to a differential equation: $x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2) y = 0$, $y = J_\alpha(x)$, with $J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(n\tau - x \sin \tau)\, d\tau$
  77. scipy.special.j0 wraps the cephes algorithm. Don’t need this anymore!

  78. Result — equivalent to compiled code:
    In [6]: %timeit vj0(x)
    10000 loops, best of 3: 75 us per loop
    In [7]: from scipy.special import j0
    In [8]: %timeit j0(x)
    10000 loops, best of 3: 75.3 us per loop
    But! Now the code is in Python and can be experimented with more easily (and moved to the GPU / accelerator more easily)!
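    A sketch of the shape of that rewrite — not the cephes algorithm, just a toy truncated power series for small |x|, to show how a vectorized j0 is built:
    from numba import vectorize

    @vectorize(['float64(float64)'])
    def vj0_toy(x):
        # J0(x) = sum_k (-1)^k (x/2)^(2k) / (k!)^2, truncated at 20 terms.
        term = 1.0
        total = 1.0
        for k in range(1, 20):
            term *= -(x * x) / (4.0 * k * k)
            total += term
        return total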
  79. Word starting to get out! A recent numba mailing list report describes experiments by a SciPy author who got a 2x speed-up by removing their Cython type annotations and surrounding the function with numba.jit (with a few minor changes needed to the code). As soon as Numba’s ahead-of-time compilation moves beyond the experimental stage, one can legitimately use Numba to create a library that you ship to others (who then don’t need to have Numba installed — or just need a Numba run-time installed). SciPy (and NumPy) would look very different if Numba had existed 16 years ago when SciPy was getting started… — and you would all be happier.
  80. Generators 80
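    (Image slide; Numba can compile generator functions in nopython mode — a minimal sketch:)
    from numba import jit

    @jit(nopython=True)
    def countdown(n):
        while n > 0:
            yield n
            n -= 1

    list(countdown(5))  # [5, 4, 3, 2, 1]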

  81. Releasing the GIL: Many fret about the GIL in Python. With the PyData stack you often have multi-threaded code, and in the PyData stack we quite often release the GIL: NumPy does it; SciPy does it (quite often); scikit-learn (now) does it; Pandas (now) does it when possible; Cython makes it easy; Numba makes it easy.
  82. Releasing the GIL: Only nopython mode functions can release the GIL.
  83. Releasing the GIL 83 2.8x speedup with 4 cores
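    A sketch of how such a speedup is obtained — nogil=True plus ordinary Python threads (the chunking below is illustrative):
    import threading
    import numpy as np
    from numba import jit

    @jit(nopython=True, nogil=True)  # nopython mode is required to release the GIL
    def sum_squares(arr, out, idx):
        total = 0.0
        for i in range(arr.shape[0]):
            total += arr[i] * arr[i]
        out[idx] = total

    arr = np.random.rand(4000000)
    out = np.zeros(4)
    threads = [threading.Thread(target=sum_squares, args=(chunk, out, i))
               for i, chunk in enumerate(np.split(arr, 4))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    out.sum()  # threads ran concurrently because the GIL was released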

  84. CUDA Python (in open-source Numba!): CUDA development using Python syntax for optimal performance! You have to understand CUDA at least a little — writing kernels that launch in parallel on the GPU.
  85. Example: Black-Scholes 85
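    (The Black-Scholes kernel is shown as an image; a minimal CUDA Python kernel in the same style, assuming numba.cuda and an NVIDIA GPU:)
    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(x, out, alpha):
        i = cuda.grid(1)        # absolute index of this thread in the launch grid
        if i < x.shape[0]:      # guard: the grid may be larger than the array
            out[i] = alpha * x[i]

    x = np.random.rand(1000000).astype(np.float32)
    out = np.empty_like(x)
    threads_per_block = 256
    blocks = (x.size + threads_per_block - 1) // threads_per_block
    scale[blocks, threads_per_block](x, out, np.float32(2.0))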

  86. Black-Scholes: Results — core i7 vs GeForce GTX 560 Ti: about 9x faster on this GPU, ~ same speed as CUDA-C.
  87. Other interesting things • CUDA Simulator to debug your code in the Python interpreter • Generalized ufuncs (@guvectorize) • Call ctypes and cffi functions directly and pass them as arguments • Preliminary support for types that understand the buffer protocol • Pickle Numba functions to run on remote execution engines • “numba annotate” to dump an HTML annotated version of compiled code • See: http://numba.pydata.org/numba-doc/0.20.0/
  88. What Doesn’t Work? (A non-comprehensive list) • Sets, lists, dictionaries, user-defined classes (tuples do work!) • List, set and dictionary comprehensions • Recursion • Exceptions with non-constant parameters • Most string operations (buffer support is very preliminary!) • yield from • Closures inside a JIT function (compiling JIT functions inside a closure works…) • Modifying globals • Passing an axis argument to numpy array reduction functions • Easy debugging (you have to debug in Python mode).
  89. The (Near) Future (also a non-comprehensive list) • “JIT Classes” • Better support for strings/bytes, buffers, and parsing use-cases • More coverage of the NumPy API (advanced indexing, etc.) • Documented extension API for adding your own types, low-level function implementations, and targets • Better debug workflows
  90. Recently Added Numba Features • A new GPU target: the Heterogeneous System Architecture, supported by AMD APUs • Support for named tuples in nopython mode • Limited support for lists in nopython mode • On-disk caching of compiled functions (opt-in) • A simulator for debugging GPU functions with the Python debugger on the CPU • Can choose to release the GIL in nopython functions • Many speed improvements
  91. New Features • Support for ARMv7 (Raspberry Pi 2) • Python 3.5 support • NumPy 1.10 support • Faster loading of pre-compiled functions from the disk cache • ufunc compilation for multithreaded CPU and GPU targets (features previously only in NumbaPro)
  92. Conclusion • Lots of progress in the past year! • Try out Numba on your numerical and NumPy-related projects: conda install numba • Your feedback helps us make Numba better! Tell us what you would like to see: https://github.com/numba/numba • Stay tuned for more exciting stuff this year…
  93. Thanks — September 18, 2015 • DARPA XDATA program (Chris White and Wade Shen), which helped fund Numba, Blaze, Dask and Odo • Investors of Continuum • Clients and customers of Continuum who help support these projects • NumFOCUS volunteers • PyData volunteers