
Using Anaconda to light-up Dark Data


Talk given as part of the BIDS seminar.


Travis E. Oliphant

September 18, 2015

Transcript

  1. BIDS Data Science Seminar: Using Anaconda to light-up dark data. Travis E. Oliphant, PhD. September 18, 2015
  2. Started as a Scientist / Engineer. Images from BYU CERS Lab
  3. Science led to Python. Raja Muthupillai, Armando Manduca, Richard Ehman, Jim Greenleaf, 1997: $\rho_0 (2\pi f)^2 U_i(a, f) = \left[ C_{ijkl}(a, f)\, U_{k,l}(a, f) \right]_{,j}$, $\Xi = \nabla \times U$
  4. “Distractions” led to my calling 4

  5. 5 Latest Cosmological Theory

  6. Dark Data: CSV, hdf5, npz, logs, emails, and other files in your company outside a traditional store
  7. Dark Data: CSV, hdf5, npz, logs, emails, and other files in your company outside a traditional store
  8. Database Approach (diagram): Data Sources → Data Store → Clients

  9. Bring the Database to the Data (diagram): Data Sources → Blaze (datashape, dask) → Clients, with NumPy, Pandas, SciPy, sklearn, etc. for analytics
  10. Anaconda — portable environments. conda. Python & R Open Source Analytics: NumPy, SciPy, Pandas, scikit-learn, Jupyter / IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, scikit-image, NLTK, NetworkX, IRKernel, dplyr, shiny, ggplot2, tidyr, caret, nnet and 330+ packages. • Easy to install • Intuitive to discover • Quick to analyze • Simple to collaborate • Accessible to all
  11. DTYPE INNOVATION (AN ASIDE)
  12. Key (potential) benefits of dtype: • Turns imperative code into declarative code • Should provide a solid mechanism for ufunc dispatch
  13. Imperative to Declarative. NumPyIO, June 1998 — my first Python extension: fread, fwrite (read / analyze a data format) → declarative data storage with dtype: arr[1:10,-5].field1
  14. Function dispatch:
    def func(*args):
        key = tuple(arg.dtype for arg in args)
        return _funcmap[key](*args)
    Highly simplified! — quite a few details to do well…
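    A runnable sketch of that idea (the kernel table and its float64/int64 entries are illustrative, not an actual NumPy or Numba API):
    import numpy as np

    def _add_float64(a, b):
        return a + b  # stand-in for a specialized float64 kernel

    def _add_int64(a, b):
        return a + b  # stand-in for a specialized int64 kernel

    _funcmap = {
        (np.dtype('float64'), np.dtype('float64')): _add_float64,
        (np.dtype('int64'), np.dtype('int64')): _add_int64,
    }

    def func(*args):
        key = tuple(arg.dtype for arg in args)  # dtypes form a hashable key
        return _funcmap[key](*args)

    func(np.ones(3), np.ones(3))        # dispatches to the float64 kernel
    func(np.arange(3), np.arange(3))    # dispatches to the int64 kernel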
  15. WHY BLAZE? 15 Thanks to Peter Wang for slides.

  16. 16

  17. 17 Data

  18. 18 “Math” Data

  19. 19 Math Big Data

  20. 20 Math Big Data

  21. 21 Math Big Data

  22. 22 Math Big Data Programs

  23. 23 “General Purpose Programming”

  24. 24 Analytics System Domain-Specific Query Language

  25. 25

  26. 26 ?

  27. 27 Expressions Metadata Runtime

  28. 28 + - / * ^ [] join, groupby, filter

    map, sort, take where, topk datashape, dtype, shape, stride hdf5, json, csv, xls protobuf, avro, ... NumPy, Pandas, R, Julia, K, SQL, Spark, Mongo, Cassandra, ...
  29. BLAZE ECOSYSTEM 29 Thanks to Christine Doig for slides.

  30. Blaze Ecosystem: Blaze (interface to query data), datashape (data description language), odo (data migration), DyND (dynamic, multidimensional arrays), dask (parallel computing), castra (column store & query), bcolz (column store). @mrocklin @cpcloud @quasiben @jcrist @cowlicks @FrancescAlted @mwiebe @izaid @eriknw @esc
  31. (Diagram) Expressions: blaze. Metadata: datashape. Storage: sql DB, castra, bcolz, with odo for migration. Runtime: numpy, pandas, spark; dask (parallel), numba and DyND (optimized).
  32. (Diagram) Data — metadata: datashape; storage/containers: odo. Runtime — compute: dask (parallelize; optimize, JIT). Expressions — APIs, syntax, language: blaze.
  33. BLAZE LIBRARY. Thanks to Christine Doig and Phillip Cloud for slides.
  34. Blaze — interface to query data on different storage systems. http://blaze.pydata.org/en/latest/
    from blaze import Data
    iris = Data('iris.csv')                        # CSV
    iris = Data('sqlite:///flowers.db::iris')      # SQL
    iris = Data('mongodb://localhost/mydb::iris')  # MongoDB
    iris = Data('iris.json')                       # JSON
    iris = Data('s3://blaze-data/iris.csv')        # S3
    …
    Current focus is the “dark data” and pydata stack for run-time (dask, dynd, numpy, pandas, x-ray, etc.) + customer needs (i.e. kdb, mongo).
  35. Blaze expressions:
    Select columns: iris[['sepal_length', 'species']]
    Operate: log(iris.sepal_length * 10)
    Reduce: iris.sepal_length.mean()
    Split-apply-combine: by(iris.species, shortest=iris.petal_length.min(), longest=iris.petal_length.max(), average=iris.petal_length.mean())
    Add new columns: transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width, petal_ratio=iris.petal_length / iris.petal_width)
    Text matching: iris.like(species='*versicolor')
    Relabel columns: iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH')
    Filter: iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]
  36. Blaze uses datashape as its type system (like DyND):
    >>> iris = Data('iris.json')
    >>> iris.dshape
    dshape("""var * {
        petal_length: float64,
        petal_width: float64,
        sepal_length: float64,
        sepal_width: float64,
        species: string
        }""")
  37. Data Shape — a structured data description language. http://datashape.pydata.org/
    Unit types: dimensions (var, 3, 4, …) and dtypes (string, int32, float64, …), composed with *.
    A tabular datashape: var * { x : int32, y : string, z : float64 }
    A record dtype — an ordered struct, a collection of types keyed by labels: { x : int32, y : string, z : float64 }
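    A minimal sketch of using the datashape library directly (assuming its dshape constructor and the shape/measure attributes):
    from datashape import dshape

    ds = dshape('var * { x : int32, y : string, z : float64 }')
    ds.shape    # the dimensions: (var,)
    ds.measure  # the record dtype: { x : int32, y : string, z : float64 }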
  38. datashape examples:
    { flowersdb: { iris: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } },
      iriscsv: var * { sepal_length: ?float64, sepal_width: ?float64, petal_length: ?float64, petal_width: ?float64, species: ?string },
      irisjson: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string },
      irismongo: 150 * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } }
    # Arrays
    3 * 4 * int32
    10 * var * float64
    3 * complex[float64]
    # Arrays of Structures
    100 * { name: string, birthday: date, address: { street: string, city: string, postalcode: string, country: string } }
    # Structure of Arrays
    { x: 100 * 100 * float32, y: 100 * 100 * float32, u: 100 * 100 * float32, v: 100 * 100 * float32 }
    # Function prototype
    (3 * int32, float64) -> 3 * float64
    # Function prototype with broadcasting dimensions
    (A... * int32, A... * int32) -> A... * int32
  39. Blaze Server — lights up your Dark Data. Builds off of the Blaze uniform interface to host data remotely through a JSON web API.
    server.yaml (YAML):
    iriscsv:
        source: iris.csv
    irisdb:
        source: sqlite:///flowers.db::iris
    irisjson:
        source: iris.json
        dshape: "var * {name: string, amount: float64}"
    irismongo:
        source: mongodb://localhost/mydb::iris
    $ blaze-server server.yaml -e
    localhost:6363/compute.json
  40. Blaze Server — Blaze Client:
    >>> from blaze import Data
    >>> t = Data('blaze://localhost:6363')
    >>> t.fields
    [u'iriscsv', u'irisdb', u'irisjson', u'irismongo']
    >>> t.iriscsv
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    >>> t.irisdb
       petal_length  petal_width  sepal_length  sepal_width      species
    0           1.4          0.2           5.1          3.5  Iris-setosa
    1           1.4          0.2           4.9          3.0  Iris-setosa
    2           1.3          0.2           4.7          3.2  Iris-setosa
  41. Compute recipes work with existing libraries and have multiple backends: • python list • numpy arrays • dynd • pandas DataFrame • Spark, Impala • Mongo • dask
  42. • Ideally, you can layer expressions over any data. • Write once, deploy anywhere. • Practically, expressions will work better on specific data structures, formats, and engines. • You will need to copy from one format and/or engine to another.
  43. ODO LIBRARY. Thanks to Phillip Cloud and Christine Doig for slides.
  44. Odo • A library for turning things into other things • Factored out from the blaze project • Handles a huge variety of conversions • odo is cp with types, for data
  45. odo — data migration, ~ cp with types, for data. http://odo.pydata.org/en/latest/
    from odo import odo
    odo(source, target)
    odo('iris.json', 'mongodb://localhost/mydb::iris')
    odo('iris.json', 'sqlite:///flowers.db::iris')
    odo('iris.csv', 'iris.json')
    odo('iris.csv', 'hdfs://hostname:iris.csv')
    odo('hive://hostname/default::iris_csv',
        'hive://hostname/default::iris_parquet',
        stored_as='PARQUET', external=False)
  46. How does it work? Through a network of conversions.
  47. Each node is a type (DataFrame, list, sqlalchemy.Table, etc.). Each edge is a conversion function.
  48. It’s extensible!
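    A sketch of what an extension looks like, assuming odo's convert.register decorator; the Foo container is hypothetical:
    import pandas as pd
    from odo import convert

    class Foo(object):
        # A hypothetical custom container holding rows of records.
        def __init__(self, rows):
            self.rows = rows

    @convert.register(pd.DataFrame, Foo, cost=1.0)
    def foo_to_dataframe(foo, **kwargs):
        # Registers a new edge in the graph: odo can now reach
        # DataFrame (and everything beyond it) from Foo.
        return pd.DataFrame(foo.rows)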

  49. DASK. Thanks to Christine Doig and Blake Griffith for slides.
  50. dask enables parallel computing. http://dask.pydata.org/en/latest/ From single core computing to parallel computing: shared memory, distributed cluster. Gigabyte: fits in memory. Terabyte: fits on disk. Petabyte: fits on many disks.
  51. dask enables parallel computing. http://dask.pydata.org/en/latest/ Gigabyte (fits in memory, single core): numpy, pandas. Terabyte (fits on disk, shared memory): dask. Petabyte (fits on many disks, distributed cluster): dask.distributed.
  52. dask enables parallel computing. http://dask.pydata.org/en/latest/ Single core computing: numpy, pandas. Shared memory: dask, with threaded or multiprocessing schedulers. Distributed cluster: dask.distributed.
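    A sketch of selecting a scheduler explicitly with the get= keyword dask exposed at the time (the threaded scheduler is the default for dask.array):
    import dask.array as da
    import dask.multiprocessing

    x = da.ones((10000, 1000), chunks=(1000, 1000))
    x.sum().compute()                              # threaded scheduler (default)
    x.sum().compute(get=dask.multiprocessing.get)  # multiprocessing scheduler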
  53. dask array — numpy vs dask:
    >>> import numpy as np
    >>> np_ones = np.ones((5000, 1000))
    >>> np_ones
    array([[ 1., 1., 1., ..., 1., 1., 1.],
           ...,
           [ 1., 1., 1., ..., 1., 1., 1.]])
    >>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
    >>> np_y
    array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, 693.14718056])
    >>> import dask.array as da
    >>> da_ones = da.ones((5000000, 1000000), chunks=(1000, 1000))
    >>> da_ones.compute()
    array([[ 1., 1., 1., ..., 1., 1., 1.],
           ...,
           [ 1., 1., 1., ..., 1., 1., 1.]])
    >>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
    >>> np_da_y = np.array(da_y)  # fits in memory
    array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, …, 693.14718056])
    # Result doesn’t fit in memory
    >>> da_y.to_hdf5('myfile.hdf5', 'result')
  54. dask dataframe — pandas vs dask:
    >>> import pandas as pd
    >>> df = pd.read_csv('iris.csv')
    >>> df.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    >>> max_sepal_length_setosa = df[df.species == 'setosa'].sepal_length.max()
    5.7999999999999998
    >>> import dask.dataframe as dd
    >>> ddf = dd.read_csv('*.csv')
    >>> ddf.head()
    (same five rows as above) …
    >>> d_max_sepal_length_setosa = ddf[ddf.species == 'setosa'].sepal_length.max()
    >>> d_max_sepal_length_setosa.compute()
    5.7999999999999998
  55. dask bag — semi-structured data, like JSON blobs or log files:
    >>> import dask.bag as db
    >>> import json
    # Get tweets as a dask.bag from compressed json files
    >>> b = db.from_filenames('*.json.gz').map(json.loads)
    # Take two items in dask.bag
    >>> b.take(2)
    ({u'contributors': None, u'coordinates': None,
      u'created_at': u'Fri Oct 10 17:19:35 +0000 2014',
      u'entities': {u'hashtags': [], u'symbols': [], u'trends': [], u'urls': [], u'user_mentions': []},
      u'favorite_count': 0, u'favorited': False, u'filter_level': u'medium', u'geo': None
    …
    # Count the frequencies of user locations
    >>> freq = b.pluck('user').pluck('location').frequencies()
    # Get the result as a dataframe
    >>> df = freq.to_dataframe()
    >>> df.compute()
                               0      1
    0                              20916
    1                      Natal      2
    2   Planet earth. Sheffield.      1
    3                 Mad, USERA      1
    4       Brasilia DF - Brazil      2
    5            Rondonia Cacoal      1
    6           msftsrep || 4/5.      1
  56. dask distributed:
    >>> import dask
    >>> from dask.distributed import Client
    # client connected to 50 nodes, 2 workers per node.
    >>> dc = Client('tcp://localhost:9000')
    # or
    >>> dc = Client('tcp://ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:9000')
    >>> b = db.from_s3('githubarchive-data', '2015-*.json.gz').map(json.loads)
    # ... build top_commits from b (derivation elided on the slide) ...
    # use default single node scheduler
    >>> top_commits.compute()
    # use client with distributed cluster
    >>> top_commits.compute(get=dc.get)
    [(u'mirror-updates', 1463019),
     (u'KenanSulayman', 235300),
     (u'greatfirebot', 167558),
     (u'rydnr', 133323),
     (u'markkcc', 127625)]
  57. dask + blaze — dask can be a backend/engine for blaze; e.g. we can drive dask arrays with blaze:
    >>> x = da.from_array(...)                 # Make a dask array
    >>> from blaze import Data, log, compute
    >>> d = Data(x)                            # Wrap with Blaze
    >>> y = log(d + 1)[:5].sum(axis=1)         # Do work as usual
    >>> result = compute(y)                    # Fall back to dask
  58. • Collections build task graphs • Schedulers execute task graphs • Graph specification = uniting interface
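    The graph specification is just a dict mapping keys to values or tasks (tuples of a callable and its arguments); a minimal sketch:
    from operator import add, mul
    from dask.threaded import get

    dsk = {'a': 1,
           'b': 2,
           'c': (add, 'a', 'b'),   # c = a + b
           'd': (mul, 'c', 10)}    # d = c * 10

    get(dsk, 'd')  # any scheduler can execute this graph; returns 30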
  59. Questions? http://dask.pydata.org 59

  60. NUMBA 60 Thanks to Stan Seibert for slides

  61. Space of Python Compilation:
    Relies on CPython / libpython — Ahead Of Time: Cython, Shedskin, Nuitka (today), Pythran; Just In Time: Numba, HOPE, Theano, Pyjion.
    Replaces CPython / libpython — Ahead Of Time: Nuitka (future); Just In Time: Pyston, PyPy.
  62. Compiler overview: Parsing / Frontend (C, C++, Fortran, ObjC) → Intermediate Representation (IR) → Code Generation / Backend (x86, ARM, PTX)
  63. Numba: Parsing / Frontend (Python) → LLVM Intermediate Representation (IR) → Code Generation / Backend (x86, ARM, PTX)
  64. Example 64 Numba
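    (The slide shows the code as an image; a minimal example in the same spirit, assuming only numba.jit:)
    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def sum2d(arr):
        # Plain Python loops over array elements compile to machine code.
        m, n = arr.shape
        total = 0.0
        for i in range(m):
            for j in range(n):
                total += arr[i, j]
        return total

    sum2d(np.random.rand(1000, 1000))  # first call compiles; later calls are fast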

  65. How Numba works: Python function (@jit def do_math(a, b): …) → bytecode analysis → Numba IR; function arguments (>>> do_math(x, y)) → type inference → rewrite IR → lowering → LLVM IR → LLVM JIT → machine code → cache → execute!
  66. Numba Features • Numba supports: Windows, OS X, and Linux; 32 and 64-bit x86 CPUs and NVIDIA GPUs; Python 2 and 3; NumPy versions 1.6 through 1.9 • Does not require a C/C++ compiler on the user’s system • < 70 MB to install • Does not replace the standard Python interpreter (all of your existing Python libraries are still available)
  67. Numba Modes • object mode: compiled code operates on Python objects. The only significant performance improvement is compilation of loops that can be compiled in nopython mode (see below). • nopython mode: compiled code operates on “machine native” data; usually within 25% of the performance of equivalent C or FORTRAN.
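    A sketch of the difference (nopython=True forces nopython mode and raises if the function cannot be compiled without Python objects):
    from numba import jit

    @jit                   # may fall back to object mode
    def flexible(x):
        return x.upper()   # works via Python objects, little speedup

    @jit(nopython=True)    # machine-native data only
    def fast_loop(n):
        total = 0
        for i in range(n):
            total += i * i
        return total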
  68. How to Use Numba: 1. Create a realistic benchmark test case. (Do not use your unit tests as a benchmark!) 2. Run a profiler on your benchmark. (cProfile is a good choice.) 3. Identify hotspots that could potentially be compiled by Numba with a little refactoring. (See the rest of this talk and the online documentation.) 4. Apply @numba.jit and @numba.vectorize as needed to critical functions. (Small rewrites may be needed to work around Numba limitations.) 5. Re-run the benchmark to check if there was a performance improvement.
  69. A Whirlwind Tour of Numba Features • Sometimes you can’t create a simple or efficient array expression or ufunc. Use Numba to work with array elements directly. • Example: suppose you have a boolean grid and you want to find the maximum number of neighbors a cell has in the grid (a sketch of this kernel follows slide 71 below):
  70. The Basics 70

  71. The Basics: array allocation; looping over ndarray x as an iterator; using numpy math functions; returning a slice of the array; Numba decorator (nopython=True not required). 2.7x speedup!
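    (The slide's code is an image; a sketch of the boolean-grid neighbor count from slide 69, assuming only numpy and numba.jit:)
    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def max_neighbors(grid):
        # grid: 2-D boolean array; count True neighbors of each interior cell.
        n, m = grid.shape
        best = 0
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                count = 0
                for di in range(-1, 2):
                    for dj in range(-1, 2):
                        if (di != 0 or dj != 0) and grid[i + di, j + dj]:
                            count += 1
                if count > best:
                    best = count
        return best

    grid = np.random.rand(100, 100) > 0.5
    max_neighbors(grid)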
  72. Calling Other Functions 72

  73. Calling Other Functions: this function is not inlined; this function is inlined. 9.8x speedup compared to doing this with numpy functions.
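    A sketch of the pattern — jitted functions can call other jitted functions directly, and LLVM will typically inline small ones:
    from numba import jit

    @jit(nopython=True)
    def clamp(x, lo, hi):   # a small helper, a good candidate for inlining
        if x < lo:
            return lo
        elif x > hi:
            return hi
        return x

    @jit(nopython=True)
    def clamp_sum(arr, lo, hi):
        total = 0.0
        for i in range(arr.shape[0]):
            total += clamp(arr[i], lo, hi)  # direct call, no Python overhead
        return total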
  74. Making Ufuncs 74

  75. Making Ufuncs: Monte Carlo simulating 500,000 tournaments in 50 ms.
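    The tournament code itself is not in the transcript; a minimal sketch of making a ufunc with numba.vectorize:
    import numpy as np
    from numba import vectorize

    @vectorize(['float64(float64, float64)'])
    def rel_diff(x, y):
        # Compiled into a true NumPy ufunc: broadcasting, .reduce, etc. all work.
        return 2.0 * (x - y) / (x + y)

    a = np.random.rand(1000000)
    b = np.random.rand(1000000)
    rel_diff(a, b)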
  76. Case study — j0 from scipy.special • scipy.special was one of the first libraries I wrote (in 1999) • extended the “umath” module by adding new “universal functions” to compute many scientific functions by wrapping C and Fortran libs • Bessel functions are solutions to a differential equation: $x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2) y = 0$, $y = J_\alpha(x)$, with $J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(n\tau - x \sin \tau)\, d\tau$
  77. scipy.special.j0 wraps the cephes algorithm. Don’t need this anymore!

  78. Result — equivalent to compiled code:
    In [6]: %timeit vj0(x)
    10000 loops, best of 3: 75 us per loop
    In [7]: from scipy.special import j0
    In [8]: %timeit j0(x)
    10000 loops, best of 3: 75.3 us per loop
    But! Now the code is in Python and can be experimented with more easily (and moved to the GPU / accelerator more easily)!
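    A sketch of the shape of that rewrite — not the cephes algorithm, just a toy truncated power series for small |x|, to show how a vectorized j0 is built:
    from numba import vectorize

    @vectorize(['float64(float64)'])
    def vj0_toy(x):
        # J0(x) = sum_k (-1)^k (x/2)^(2k) / (k!)^2, truncated at 20 terms.
        term = 1.0
        total = 1.0
        for k in range(1, 20):
            term *= -(x * x) / (4.0 * k * k)
            total += term
        return total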
  79. Word starting to get out! A recent numba mailing list report describes experiments by a SciPy author who got a 2x speed-up by removing their Cython type annotations and surrounding the function with numba.jit (with a few minor changes needed to the code). As soon as Numba’s ahead-of-time compilation moves beyond the experimental stage, one can legitimately use Numba to create a library that you ship to others (who then don’t need to have Numba installed — or just need a Numba run-time installed). SciPy (and NumPy) would look very different if Numba had existed 16 years ago when SciPy was getting started… — and you would all be happier.
  80. Generators 80
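    (Image slide; Numba can compile generator functions in nopython mode — a minimal sketch:)
    from numba import jit

    @jit(nopython=True)
    def countdown(n):
        while n > 0:
            yield n
            n -= 1

    list(countdown(5))  # [5, 4, 3, 2, 1]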

  81. Releasing the GIL: Many fret about the GIL in Python. With the PyData stack you often have multi-threaded code, and in the PyData stack we quite often release the GIL: NumPy does it; SciPy does it (quite often); scikit-learn (now) does it; Pandas (now) does it when possible; Cython makes it easy; Numba makes it easy.
  82. Releasing the GIL: Only nopython mode functions can release the GIL.
  83. Releasing the GIL 83 2.8x speedup with 4 cores
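    A sketch of how such a speedup is obtained — nogil=True plus ordinary Python threads (the chunking below is illustrative):
    import threading
    import numpy as np
    from numba import jit

    @jit(nopython=True, nogil=True)  # nopython mode is required to release the GIL
    def sum_squares(arr, out, idx):
        total = 0.0
        for i in range(arr.shape[0]):
            total += arr[i] * arr[i]
        out[idx] = total

    arr = np.random.rand(4000000)
    out = np.zeros(4)
    threads = [threading.Thread(target=sum_squares, args=(chunk, out, i))
               for i, chunk in enumerate(np.split(arr, 4))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    out.sum()  # threads ran concurrently because the GIL was released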

  84. CUDA Python (in open-source Numba!): CUDA development using Python syntax for optimal performance! You have to understand CUDA at least a little — writing kernels that launch in parallel on the GPU.
  85. Example: Black-Scholes 85
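    (The Black-Scholes kernel is shown as an image; a minimal CUDA Python kernel in the same style, assuming numba.cuda and an NVIDIA GPU:)
    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(x, out, alpha):
        i = cuda.grid(1)        # absolute index of this thread in the launch grid
        if i < x.shape[0]:      # guard: the grid may be larger than the array
            out[i] = alpha * x[i]

    x = np.random.rand(1000000).astype(np.float32)
    out = np.empty_like(x)
    threads_per_block = 256
    blocks = (x.size + threads_per_block - 1) // threads_per_block
    scale[blocks, threads_per_block](x, out, np.float32(2.0))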

  86. Black-Scholes: Results — core i7 vs GeForce GTX 560 Ti: about 9x faster on this GPU, ~ same speed as CUDA-C.
  87. Other interesting things • CUDA Simulator to debug your code in the Python interpreter • Generalized ufuncs (@guvectorize) • Call ctypes and cffi functions directly and pass them as arguments • Preliminary support for types that understand the buffer protocol • Pickle Numba functions to run on remote execution engines • “numba annotate” to dump an HTML annotated version of compiled code • See: http://numba.pydata.org/numba-doc/0.20.0/
  88. What Doesn’t Work? (A non-comprehensive list) • Sets, lists, dictionaries, user-defined classes (tuples do work!) • List, set and dictionary comprehensions • Recursion • Exceptions with non-constant parameters • Most string operations (buffer support is very preliminary!) • yield from • Closures inside a JIT function (compiling JIT functions inside a closure works…) • Modifying globals • Passing an axis argument to numpy array reduction functions • Easy debugging (you have to debug in Python mode).
  89. The (Near) Future (also a non-comprehensive list) • “JIT Classes” • Better support for strings/bytes, buffers, and parsing use-cases • More coverage of the NumPy API (advanced indexing, etc.) • Documented extension API for adding your own types, low-level function implementations, and targets • Better debug workflows
  90. Recently Added Numba Features • A new GPU target: the Heterogeneous System Architecture, supported by AMD APUs • Support for named tuples in nopython mode • Limited support for lists in nopython mode • On-disk caching of compiled functions (opt-in) • A simulator for debugging GPU functions with the Python debugger on the CPU • Can choose to release the GIL in nopython functions • Many speed improvements
  91. New Features • Support for ARMv7 (Raspberry Pi 2) • Python 3.5 support • NumPy 1.10 support • Faster loading of pre-compiled functions from the disk cache • ufunc compilation for multithreaded CPU and GPU targets (features previously only in NumbaPro)
  92. Conclusion • Lots of progress in the past year! • Try out Numba on your numerical and NumPy-related projects: conda install numba • Your feedback helps us make Numba better! Tell us what you would like to see: https://github.com/numba/numba • Stay tuned for more exciting stuff this year…
  93. Thanks — September 18, 2015 • DARPA XDATA program (Chris White and Wade Shen), which helped fund Numba, Blaze, Dask and Odo • Investors of Continuum • Clients and customers of Continuum who help support these projects • NumFOCUS volunteers • PyData volunteers