Slide 1

Slide 1 text

© 2015 Continuum Analytics - Confidential & Proprietary. BIDS Data Science Seminar: Using Anaconda to light up dark data. Travis E. Oliphant, PhD. September 18, 2015

Slide 2

Slide 2 text

Started as a Scientist / Engineer (images from BYU CERS Lab)

Slide 3

Slide 3 text

Science led to Python: 1997, with Raja Muthupillai, Armando Manduca, Richard Ehman, and Jim Greenleaf:

    \rho_0 (2\pi f)^2 U_i(a, f) = \left[ C_{ijkl}(a, f)\, U_{k,l}(a, f) \right]_{,j}, \qquad \Xi = \nabla \times U

Slide 4

Slide 4 text

“Distractions” led to my calling

Slide 5

Slide 5 text

5 Latest Cosmological Theory

Slide 6

Slide 6 text

Dark Data: CSV, HDF5, NPZ, logs, emails, and other files in your company outside a traditional data store

Slide 7

Slide 7 text

Dark Data: CSV, HDF5, NPZ, logs, emails, and other files in your company outside a traditional data store

Slide 8

Slide 8 text

Database Approach: Data Sources → Data Store → Clients

Slide 9

Slide 9 text

Bring the Database to the Data: Data Sources → Blaze (datashape, dask) + NumPy, Pandas, SciPy, sklearn, etc. (for analytics) → Clients

Slide 10

Slide 10 text

Anaconda — portable environments

conda: Python & R Open Source Analytics. NumPy, SciPy, Pandas, scikit-learn, Jupyter / IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, scikit-image, NLTK, NetworkX, IRKernel, dplyr, shiny, ggplot2, tidyr, caret, nnet and 330+ packages.

• Easy to install
• Intuitive to discover
• Quick to analyze
• Simple to collaborate
• Accessible to all

Slide 11

Slide 11 text

© 2015 Continuum Analytics- Confidential & Proprietary DTYPE INNOVATION (AN ASIDE) 11

Slide 12

Slide 12 text

Key (potential) benefits of dtype

• Turns imperative code into declarative code
• Should provide a solid mechanism for ufunc dispatch

Slide 13

Slide 13 text

Imperative to Declarative

NumPyIO (June 1998) was my first Python extension, for reading the Analyze data format: imperative data storage with fread and fwrite, versus declarative access with a dtype and arr[1:10, -5].field1.
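
As a small illustration of the declarative style (the field names here are mine, not the slide's), a structured dtype describes the record layout once and fields are then read by name:

    import numpy as np

    # Describe the on-disk record layout declaratively, instead of hand-coding
    # fread/fwrite calls with byte offsets.
    rec = np.dtype([('field1', 'f8'), ('field2', 'i4'), ('name', 'S8')])

    arr = np.zeros((20, 10), dtype=rec)     # stand-in for data read from a file
    col = arr[1:10, -5]['field1']           # slice the array, then pull a named field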

Slide 14

Slide 14 text

Function dispatch

    def func(*args):
        key = tuple(arg.dtype for arg in args)
        return _funcmap[key](*args)

Highly simplified! — quite a few details to do well…
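
A minimal sketch of how such a dispatch table could be populated. The registered functions and dtype keys here are illustrative, not NumPy's real machinery, which also has to handle casting and mixed dtypes:

    import numpy as np

    def _add_float64(a, b):
        return a + b                  # fast path specialized for float64

    def _add_int64(a, b):
        return a + b                  # separate path for int64

    _funcmap = {
        (np.dtype('float64'), np.dtype('float64')): _add_float64,
        (np.dtype('int64'),   np.dtype('int64')):   _add_int64,
    }

    def func(*args):
        key = tuple(arg.dtype for arg in args)
        return _funcmap[key](*args)

    func(np.ones(3), np.ones(3))      # dispatches on (float64, float64)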

Slide 15

Slide 15 text

WHY BLAZE? 15 Thanks to Peter Wang for slides.

Slide 16

Slide 16 text

16

Slide 17

Slide 17 text

17 Data

Slide 18

Slide 18 text

18 “Math” Data

Slide 19

Slide 19 text

19 Math Big Data

Slide 20

Slide 20 text

20 Math Big Data

Slide 21

Slide 21 text

21 Math Big Data

Slide 22

Slide 22 text

22 Math Big Data Programs

Slide 23

Slide 23 text

23 “General Purpose Programming”

Slide 24

Slide 24 text

24 Analytics System Domain-Specific Query Language

Slide 25

Slide 25 text

25

Slide 26

Slide 26 text

26 ?

Slide 27

Slide 27 text

27 Expressions Metadata Runtime

Slide 28

Slide 28 text

Expressions: + - / * ^ [], join, groupby, filter, map, sort, take, where, topk
Metadata: datashape, dtype, shape, stride; hdf5, json, csv, xls, protobuf, avro, ...
Runtime: NumPy, Pandas, R, Julia, K, SQL, Spark, Mongo, Cassandra, ...

Slide 29

Slide 29 text

BLAZE ECOSYSTEM 29 Thanks to Christine Doig for slides.

Slide 30

Slide 30 text

Blaze Ecosystem

• Blaze: interface to query data
• datashape: data description language
• odo: data migration
• DyND: dynamic, multidimensional arrays
• dask: parallel computing
• castra: column store & query
• bcolz: column store

Contributors: @mrocklin, @cpcloud, @quasiben, @jcrist, @cowlicks, @FrancescAlted, @mwiebe, @izaid, @eriknw, @esc

Slide 31

Slide 31 text

[Architecture diagram mapping the Expressions / Runtime / Data layers onto blaze, datashape, numpy, pandas, sql DB, spark, dask (parallel), numba and DyND (optimized), odo, castra, and bcolz]

Slide 32

Slide 32 text

[Same diagram, labeled by role: Expressions, the APIs, syntax, and language (blaze); Runtime, the compute layer (dask to parallelize, optimize, JIT); Data, the metadata and storage/containers (datashape, odo)]

Slide 33

Slide 33 text

BLAZE LIBRARY 33 Thanks to Christine Doig and Phillip Cloud for slides.

Slide 34

Slide 34 text

Blaze: interface to query data on different storage systems (http://blaze.pydata.org/en/latest/)

    from blaze import Data

    iris = Data('iris.csv')                        # CSV
    iris = Data('sqlite:///flowers.db::iris')      # SQL
    iris = Data('mongodb://localhost/mydb::iris')  # MongoDB
    iris = Data('iris.json')                       # JSON
    iris = Data('s3://blaze-data/iris.csv')        # S3
    …

Current focus is the “dark data” and the PyData stack for run-time (dask, dynd, numpy, pandas, x-ray, etc.) plus customer needs (i.e. kdb, mongo).

Slide 35

Slide 35 text

Blaze

Select columns:       iris[['sepal_length', 'species']]
Filter:               iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]
Operate:              log(iris.sepal_length * 10)
Reduce:               iris.sepal_length.mean()
Split-apply-combine:  by(iris.species, shortest=iris.petal_length.min(),
                         longest=iris.petal_length.max(),
                         average=iris.petal_length.mean())
Add new columns:      transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width,
                                petal_ratio=iris.petal_length / iris.petal_width)
Text matching:        iris.like(species='*versicolor')
Relabel columns:      iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH')

Slide 36

Slide 36 text

Blaze uses datashape as its type system (like DyND):

    >>> iris = Data('iris.json')
    >>> iris.dshape
    dshape("""var * {
        petal_length: float64,
        petal_width: float64,
        sepal_length: float64,
        sepal_width: float64,
        species: string
        }""")

Slide 37

Slide 37 text

Data Shape: a structured data description language (http://datashape.pydata.org/)

A datashape combines dimensions with a dtype, built from unit types such as var, 3, 4, string, int32, float64:

    var * { x : int32, y : string, z : float64 }    # tabular datashape

    { x : int32, y : string, z : float64 }          # record dtype: an ordered struct,
                                                    # a collection of types keyed by labels
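
A quick way to see the dimensions/dtype split is to parse a datashape with the datashape package; a small sketch, assuming the library's DataShape attributes:

    from datashape import dshape

    ds = dshape("var * { x: int32, y: string, z: float64 }")
    ds.shape       # the dimensions, e.g. (var,)
    ds.measure     # the record dtype: { x: int32, y: string, z: float64 }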

Slide 38

Slide 38 text

datashape

    # Registered datasets and their datashapes
    {
      flowersdb: {
        iris: var * {
          petal_length: float64,
          petal_width: float64,
          sepal_length: float64,
          sepal_width: float64,
          species: string
        }
      },
      iriscsv: var * {
        sepal_length: ?float64,
        sepal_width: ?float64,
        petal_length: ?float64,
        petal_width: ?float64,
        species: ?string
      },
      irisjson: var * {
        petal_length: float64,
        petal_width: float64,
        sepal_length: float64,
        sepal_width: float64,
        species: string
      },
      irismongo: 150 * {
        petal_length: float64,
        petal_width: float64,
        sepal_length: float64,
        sepal_width: float64,
        species: string
      }
    }

    # Arrays
    3 * 4 * int32
    10 * var * float64
    3 * complex[float64]

    # Arrays of Structures
    100 * {
      name: string,
      birthday: date,
      address: {
        street: string,
        city: string,
        postalcode: string,
        country: string
      }
    }

    # Structure of Arrays
    {
      x: 100 * 100 * float32,
      y: 100 * 100 * float32,
      u: 100 * 100 * float32,
      v: 100 * 100 * float32,
    }

    # Function prototype
    (3 * int32, float64) -> 3 * float64

    # Function prototype with broadcasting dimensions
    (A... * int32, A... * int32) -> A... * int32

Slide 39

Slide 39 text

Blaze Server — Lights up your Dark Data

Builds off of the Blaze uniform interface to host data remotely through a JSON web API.

server.yaml:

    iriscsv:
      source: iris.csv
    irisdb:
      source: sqlite:///flowers.db::iris
    irisjson:
      source: iris.json
      dshape: "var * {name: string, amount: float64}"
    irismongo:
      source: mongodb://localhost/mydb::iris

    $ blaze-server server.yaml -e
    localhost:6363/compute.json

Slide 40

Slide 40 text

Blaze Server: Blaze Client

    >>> from blaze import Data
    >>> t = Data('blaze://localhost:6363')
    >>> t.fields
    [u'iriscsv', u'irisdb', u'irisjson', u'irismongo']
    >>> t.iriscsv
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    >>> t.irisdb
       petal_length  petal_width  sepal_length  sepal_width      species
    0           1.4          0.2           5.1          3.5  Iris-setosa
    1           1.4          0.2           4.9          3.0  Iris-setosa
    2           1.3          0.2           4.7          3.2  Iris-setosa

Slide 41

Slide 41 text

© 2015 Continuum Analytics - Confidential & Proprietary

Compute recipes work with existing libraries and have multiple backends:
• python list
• numpy arrays
• dynd
• pandas DataFrame
• Spark, Impala
• Mongo
• dask

Slide 42

Slide 42 text

© 2015 Continuum Analytics - Confidential & Proprietary

• Ideally, you can layer expressions over any data.
• Write once, deploy anywhere.
• Practically, expressions will work better on specific data structures, formats, and engines.
• You will need to copy from one format and/or engine to another.

Slide 43

Slide 43 text

ODO LIBRARY 43 Thanks to Phillip Cloud and Christine Doig for slides.

Slide 44

Slide 44 text

© 2015 Continuum Analytics - Confidential & Proprietary

Odo
• A library for turning things into other things
• Factored out from the Blaze project
• Handles a huge variety of conversions
• odo is cp with types, for data

Slide 45

Slide 45 text

odo: data migration, ~ cp with types, for data (http://odo.pydata.org/en/latest/)

    from odo import odo
    odo(source, target)

    odo('iris.json', 'mongodb://localhost/mydb::iris')
    odo('iris.json', 'sqlite:///flowers.db::iris')
    odo('iris.csv', 'iris.json')
    odo('iris.csv', 'hdfs://hostname:iris.csv')
    odo('hive://hostname/default::iris_csv',
        'hive://hostname/default::iris_parquet',
        stored_as='PARQUET', external=False)

Slide 46

Slide 46 text

© 2015 Continuum Analytics - Confidential & Proprietary. How Does It Work? Through a network of conversions.

Slide 47

Slide 47 text

© 2015 Continuum Analytics - Confidential & Proprietary. Each node is a type (DataFrame, list, sqlalchemy.Table, etc.); each edge is a conversion function.
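
This is not odo's actual implementation, just a toy sketch of the same idea: types as nodes, converters as edges, and a migration as the composition of converters along a shortest path.

    import networkx as nx

    graph = nx.DiGraph()
    # Hypothetical converters; real odo edges carry costs and keyword arguments.
    graph.add_edge('csv', 'dataframe', convert=lambda x: ('parsed', x))
    graph.add_edge('dataframe', 'sql_table', convert=lambda x: ('loaded', x))

    def migrate(data, src, dst):
        path = nx.shortest_path(graph, src, dst)
        for a, b in zip(path, path[1:]):
            data = graph[a][b]['convert'](data)   # apply each conversion in turn
        return data

    migrate('iris.csv', 'csv', 'sql_table')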

Slide 48

Slide 48 text

© 2015 Continuum Analytics- Confidential & Proprietary It’s extensible! 48

Slide 49

Slide 49 text

© 2015 Continuum Analytics- Confidential & Proprietary DASK 49 Thanks to Christine Doig and Blake Griffith for slides

Slide 50

Slide 50 text

dask: enables parallel computing (http://dask.pydata.org/en/latest/)

single-core computing → parallel computing (shared memory → distributed cluster)
Gigabyte: fits in memory · Terabyte: fits on disk · Petabyte: fits on many disks

Slide 51

Slide 51 text

dask: enables parallel computing (http://dask.pydata.org/en/latest/)

single-core computing (numpy, pandas) → parallel computing: shared memory (dask) → distributed cluster (dask.distributed)
Gigabyte: fits in memory · Terabyte: fits on disk · Petabyte: fits on many disks

Slide 52

Slide 52 text

dask: enables parallel computing (http://dask.pydata.org/en/latest/)

single-core computing (numpy, pandas) → parallel computing: shared memory (dask, with threaded and multiprocessing schedulers) → distributed cluster (dask.distributed)

Slide 53

Slide 53 text

numpy vs. dask array

    >>> import numpy as np
    >>> np_ones = np.ones((5000, 1000))
    >>> np_ones
    array([[ 1.,  1.,  1., ...,  1.,  1.,  1.],
           ...,
           [ 1.,  1.,  1., ...,  1.,  1.,  1.]])
    >>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
    >>> np_y
    array([ 693.14718056,  693.14718056,  693.14718056,  693.14718056,
            693.14718056])

    >>> import dask.array as da
    >>> da_ones = da.ones((5000000, 1000000), chunks=(1000, 1000))
    >>> da_ones.compute()
    array([[ 1.,  1.,  1., ...,  1.,  1.,  1.],
           ...,
           [ 1.,  1.,  1., ...,  1.,  1.,  1.]])
    >>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
    >>> np_da_y = np.array(da_y)   # fits in memory
    array([ 693.14718056,  693.14718056,  693.14718056,  693.14718056, …,
            693.14718056])

    # Result doesn't fit in memory
    >>> da_y.to_hdf5('myfile.hdf5', 'result')

Slide 54

Slide 54 text

pandas vs. dask dataframe

    >>> import pandas as pd
    >>> df = pd.read_csv('iris.csv')
    >>> df.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    >>> max_sepal_length_setosa = df[df.species == 'setosa'].sepal_length.max()
    5.7999999999999998

    >>> import dask.dataframe as dd
    >>> ddf = dd.read_csv('*.csv')
    >>> ddf.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    …
    >>> d_max_sepal_length_setosa = ddf[ddf.species == 'setosa'].sepal_length.max()
    >>> d_max_sepal_length_setosa.compute()
    5.7999999999999998

Slide 55

Slide 55 text

dask bag: semi-structured data, like JSON blobs or log files

    >>> import dask.bag as db
    >>> import json

    # Get tweets as a dask.bag from compressed json files
    >>> b = db.from_filenames('*.json.gz').map(json.loads)

    # Take two items in dask.bag
    >>> b.take(2)
    ({u'contributors': None,
      u'coordinates': None,
      u'created_at': u'Fri Oct 10 17:19:35 +0000 2014',
      u'entities': {u'hashtags': [], u'symbols': [], u'trends': [],
                    u'urls': [], u'user_mentions': []},
      u'favorite_count': 0,
      u'favorited': False,
      u'filter_level': u'medium',
      u'geo': None
      …

    # Count the frequencies of user locations
    >>> freq = b.pluck('user').pluck('location').frequencies()

    # Get the result as a dataframe
    >>> df = freq.to_dataframe()
    >>> df.compute()
                              0      1
    0                              20916
    1                     Natal       2
    2  Planet earth. Sheffield.       1
    3                 Mad, USERA      1
    4       Brasilia DF - Brazil      2
    5            Rondonia Cacoal      1
    6           msftsrep || 4/5.      1

Slide 56

Slide 56 text

dask distributed

    >>> import dask
    >>> from dask.distributed import Client

    # client connected to 50 nodes, 2 workers per node
    >>> dc = Client('tcp://localhost:9000')
    # or
    >>> dc = Client('tcp://ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:9000')

    >>> b = db.from_s3('githubarchive-data', '2015-*.json.gz').map(json.loads)

    # use default single node scheduler
    >>> top_commits.compute()

    # use client with distributed cluster
    >>> top_commits.compute(get=dc.get)
    [(u'mirror-updates', 1463019),
     (u'KenanSulayman', 235300),
     (u'greatfirebot', 167558),
     (u'rydnr', 133323),
     (u'markkcc', 127625)]

Slide 57

Slide 57 text

blaze + dask: dask can be a backend/engine for blaze, e.g. we can drive dask arrays with blaze:

    >>> x = da.from_array(...)                 # Make a dask array
    >>> from blaze import Data, log, compute
    >>> d = Data(x)                            # Wrap with Blaze
    >>> y = log(d + 1)[:5].sum(axis=1)         # Do work as usual
    >>> result = compute(y)                    # Fall back to dask

Slide 58

Slide 58 text

• Collections build task graphs
• Schedulers execute task graphs
• Graph specification = uniting interface
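
Concretely, a dask graph is just a dict mapping keys to values or to task tuples, and any scheduler that understands that spec can run it; a minimal sketch:

    from operator import add, mul
    from dask.threaded import get          # the shared-memory threaded scheduler

    dsk = {
        'x': 1,
        'y': 2,
        'z': (add, 'x', 'y'),               # a task: (function, arg, arg)
        'w': (mul, 'z', 10),
    }

    get(dsk, 'w')                           # executes the graph -> 30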

Slide 59

Slide 59 text

Questions? http://dask.pydata.org 59

Slide 60

Slide 60 text

NUMBA 60 Thanks to Stan Seibert for slides

Slide 61

Slide 61 text

Space of Python Compilation

Relies on CPython / libpython:
  Ahead Of Time: Cython, Shedskin, Nuitka (today), Pythran, Numba
  Just In Time:  Numba, HOPE, Theano, Pyjion
Replaces CPython / libpython:
  Ahead Of Time: Nuitka (future)
  Just In Time:  Pyston, PyPy

Slide 62

Slide 62 text

Compiler overview: parsing frontend (C, C++, Fortran, ObjC) → Intermediate Representation (IR) → code generation backend (x86, ARM, PTX)

Slide 63

Slide 63 text

Numba: parsing frontend (Python) → LLVM Intermediate Representation (IR) → code generation backend via LLVM (x86, ARM, PTX)

Slide 64

Slide 64 text

Example 64 Numba

Slide 65

Slide 65 text

How Numba works

    @jit
    def do_math(a, b):
        …
    >>> do_math(x, y)

Python function + function arguments → Bytecode Analysis → Type Inference → Numba IR → Rewrite IR → Lowering → LLVM IR → LLVM JIT → Machine Code → Cache → Execute!

Slide 66

Slide 66 text

Numba Features

• Numba supports: Windows, OS X, and Linux; 32- and 64-bit x86 CPUs and NVIDIA GPUs; Python 2 and 3; NumPy versions 1.6 through 1.9
• Does not require a C/C++ compiler on the user's system.
• < 70 MB to install.
• Does not replace the standard Python interpreter (all of your existing Python libraries are still available).

Slide 67

Slide 67 text

Numba Modes

• object mode: compiled code operates on Python objects. The only significant performance improvement is compilation of loops that can be compiled in nopython mode (see below).
• nopython mode: compiled code operates on “machine native” data. Usually within 25% of the performance of equivalent C or FORTRAN.
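
A small illustration of the distinction (function names are mine): forcing nopython mode makes Numba raise a compile error instead of silently falling back to object mode.

    import numpy as np
    from numba import jit

    @jit(nopython=True)                    # nopython mode: native data, or a compile error
    def dot(a, b):
        s = 0.0
        for i in range(a.shape[0]):
            s += a[i] * b[i]
        return s

    @jit                                   # plain @jit may fall back to object mode
    def dot_flexible(a, b):                # if type inference fails
        return dot(a, b)

    dot(np.arange(3.0), np.arange(3.0))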

Slide 68

Slide 68 text

How to Use Numba

1. Create a realistic benchmark test case. (Do not use your unit tests as a benchmark!)
2. Run a profiler on your benchmark. (cProfile is a good choice.)
3. Identify hotspots that could potentially be compiled by Numba with a little refactoring. (See the rest of this talk and the online documentation.)
4. Apply @numba.jit and @numba.vectorize as needed to critical functions. (Small rewrites may be needed to work around Numba limitations.)
5. Re-run the benchmark to check if there was a performance improvement.

Slide 69

Slide 69 text

A Whirlwind Tour of Numba Features

• Sometimes you can't create a simple or efficient array expression or ufunc. Use Numba to work with array elements directly.
• Example: suppose you have a boolean grid and you want to find the maximum number of neighbors a cell has in the grid (a sketch follows below):
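
The slide's code image isn't in this transcript; below is a hedged reconstruction of that kind of element-wise loop (function and variable names are mine):

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def max_neighbors(grid):
        # Maximum number of True neighbors (8-connected) that any cell has.
        n, m = grid.shape
        best = 0
        for i in range(n):
            for j in range(m):
                count = 0
                for di in range(-1, 2):
                    for dj in range(-1, 2):
                        if di == 0 and dj == 0:
                            continue
                        ii = i + di
                        jj = j + dj
                        if ii >= 0 and ii < n and jj >= 0 and jj < m and grid[ii, jj]:
                            count += 1
                if count > best:
                    best = count
        return best

    grid = np.random.rand(200, 200) > 0.5
    max_neighbors(grid)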

Slide 70

Slide 70 text

The Basics 70

Slide 71

Slide 71 text

The Basics

Annotations on the slide's code: array allocation · looping over ndarray x as an iterator · using numpy math functions · returning a slice of the array · Numba decorator (nopython=True not required) · 2.7x speedup! (A reconstruction sketch follows below.)
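
The annotated code image itself is missing from this transcript; here is a plausible reconstruction with the same ingredients (the exact function on the slide may differ):

    import numpy as np
    from numba import jit

    @jit                                    # Numba decorator (nopython=True not required)
    def transform_head(x):
        out = np.empty_like(x)              # array allocation
        i = 0
        for v in x:                         # looping over ndarray x as an iterator
            out[i] = np.sqrt(v) + np.log(v + 1.0)   # using numpy math functions
            i += 1
        return out[:10]                     # returning a slice of the array

    transform_head(np.linspace(1.0, 2.0, 100000))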

Slide 72

Slide 72 text

Calling Other Functions 72

Slide 73

Slide 73 text

Calling Other Functions

Slide annotations: this function is not inlined · this function is inlined · 9.8x speedup compared to doing this with numpy functions. (A hedged sketch of the pattern follows below.)
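
The code image is likewise missing; a sketch of the pattern, one jitted function calling another, which LLVM can then inline:

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def clamp(x, lo, hi):                   # small helper, a candidate for inlining
        if x < lo:
            return lo
        elif x > hi:
            return hi
        return x

    @jit(nopython=True)
    def clamp_all(arr, lo, hi):
        out = np.empty_like(arr)
        for i in range(arr.shape[0]):
            out[i] = clamp(arr[i], lo, hi)  # call into another jitted function
        return out

    clamp_all(np.random.randn(1000000), -1.0, 1.0)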

Slide 74

Slide 74 text

Making Ufuncs 74

Slide 75

Slide 75 text

Making Ufuncs: Monte Carlo simulation of 500,000 tournaments in 50 ms. (A sketch of the @vectorize pattern follows below.)
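
The tournament code itself isn't reproduced in this transcript; the general @vectorize pattern it relies on looks like this (a sketch with an illustrative element-wise function):

    import numpy as np
    from numba import vectorize

    @vectorize(['float64(float64, float64)'])
    def rel_diff(a, b):
        return 2.0 * (a - b) / (a + b)

    x = np.random.rand(1000000)
    y = np.random.rand(1000000)
    rel_diff(x, y)        # behaves like any other NumPy ufunc (broadcasting, reductions, ...)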

Slide 76

Slide 76 text

Case study: j0 from scipy.special

• scipy.special was one of the first libraries I wrote (in 1999)
• It extended the “umath” module by adding new “universal functions” to compute many scientific functions by wrapping C and Fortran libs.
• Bessel functions are solutions to a differential equation:

    x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + \left(x^2 - \alpha^2\right) y = 0, \qquad y = J_\alpha(x)

    J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(n\tau - x \sin\tau)\, d\tau
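
As a hedged illustration (this is not the cephes algorithm the next slide refers to), the integral representation above can be evaluated directly with a jitted midpoint rule:

    import math
    import numpy as np
    from numba import vectorize

    @vectorize(['float64(int64, float64)'])
    def jn(n, x):
        # Midpoint rule for J_n(x) = (1/pi) * integral_0^pi cos(n*tau - x*sin(tau)) dtau
        m = 1000
        h = math.pi / m
        total = 0.0
        for k in range(m):
            tau = (k + 0.5) * h
            total += math.cos(n * tau - x * math.sin(tau))
        return total * h / math.pi

    jn(0, np.linspace(0.0, 10.0, 5))      # approximates j0 at a few points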

Slide 77

Slide 77 text

scipy.special.j0 wraps the cephes algorithm. Don't need this anymore!

Slide 78

Slide 78 text

Result: equivalent to compiled code

    In [6]: %timeit vj0(x)
    10000 loops, best of 3: 75 us per loop

    In [7]: from scipy.special import j0
    In [8]: %timeit j0(x)
    10000 loops, best of 3: 75.3 us per loop

But! Now the code is in Python and can be experimented with more easily (and moved to the GPU / accelerator more easily)!

Slide 79

Slide 79 text

Word starting to get out!

Recent numba mailing list reports describe experiments of a SciPy author who got a 2x speed-up by removing their Cython type annotations and surrounding the function with numba.jit (with a few minor changes needed to the code).

As soon as Numba's ahead-of-time compilation moves beyond the experimental stage, one can legitimately use Numba to create a library that you ship to others (who then don't need to have Numba installed, or just need a Numba run-time installed).

SciPy (and NumPy) would look very different if Numba had existed 16 years ago when SciPy was getting started… and you would all be happier.

Slide 80

Slide 80 text

Generators 80

Slide 81

Slide 81 text

Releasing the GIL

Many fret about the GIL in Python, but with the PyData stack you often have multi-threaded code anyway, because in the PyData stack we quite often release the GIL:
• NumPy does it
• SciPy does it (quite often)
• Scikit-learn (now) does it
• Pandas (now) does it when possible
• Cython makes it easy
• Numba makes it easy

Slide 82

Slide 82 text

Releasing the GIL 82 Only nopython mode functions can release the GIL
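
A small sketch of what that enables, assuming the nogil=True flag and the standard-library thread pool:

    import numpy as np
    from numba import jit
    from concurrent.futures import ThreadPoolExecutor

    @jit(nopython=True, nogil=True)        # nopython code can release the GIL
    def total(x):
        s = 0.0
        for i in range(x.shape[0]):
            s += x[i]
        return s

    chunks = np.array_split(np.random.rand(4000000), 4)
    with ThreadPoolExecutor(max_workers=4) as pool:   # threads now run in parallel
        partials = list(pool.map(total, chunks))
    result = sum(partials)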

Slide 83

Slide 83 text

Releasing the GIL 83 2.8x speedup with 4 cores

Slide 84

Slide 84 text

CUDA Python (in open-source Numba!)

CUDA development using Python syntax for optimal performance! You have to understand CUDA at least a little, since you are writing kernels that launch in parallel on the GPU. (A minimal kernel sketch follows below.)
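
A minimal sketch of the kernel style meant here, assuming numba.cuda and a CUDA-capable GPU (names are mine):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(out, x, alpha):
        i = cuda.grid(1)                    # absolute index of this thread
        if i < x.shape[0]:
            out[i] = alpha * x[i]

    x = np.arange(1000000, dtype=np.float64)
    out = np.empty_like(x)
    threads_per_block = 256
    blocks = (x.size + threads_per_block - 1) // threads_per_block
    scale[blocks, threads_per_block](out, x, 2.0)     # launch configuration in brackets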

Slide 85

Slide 85 text

Example: Black-Scholes 85
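
The slide's code isn't captured in this transcript; here is a hedged CPU sketch of the same computation as a Numba ufunc (the talk's version targeted the GPU, which @vectorize also supports via target='cuda'):

    import math
    import numpy as np
    from numba import vectorize

    @vectorize(['float64(float64, float64, float64, float64, float64)'])
    def bs_call(S, K, T, r, sigma):
        # Black-Scholes price of a European call option
        d1 = (math.log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * math.sqrt(T))
        d2 = d1 - sigma * math.sqrt(T)
        nd1 = 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))
        nd2 = 0.5 * (1.0 + math.erf(d2 / math.sqrt(2.0)))
        return S * nd1 - K * math.exp(-r * T) * nd2

    S = np.random.uniform(80.0, 120.0, 1000000)
    bs_call(S, 100.0, 1.0, 0.02, 0.2)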

Slide 86

Slide 86 text

Black-Scholes: Results. Core i7 vs. GeForce GTX 560 Ti: about 9x faster on this GPU, ~ same speed as CUDA-C.

Slide 87

Slide 87 text

Other interesting things

• CUDA Simulator to debug your code in the Python interpreter
• Generalized ufuncs (@guvectorize)
• Call ctypes and cffi functions directly and pass them as arguments
• Preliminary support for types that understand the buffer protocol
• Pickle Numba functions to run on remote execution engines
• “numba annotate” to dump an HTML annotated version of compiled code
• See: http://numba.pydata.org/numba-doc/0.20.0/

Slide 88

Slide 88 text

What Doesn’t Work? 88 (A non-comprehensive list) • Sets, lists, dictionaries, user defined classes (tuples do work!) • List, set and dictionary comprehensions • Recursion • Exceptions with non-constant parameters • Most string operations (buffer support is very preliminary!) • yield from • closures inside a JIT function (compiling JIT functions inside a closure works…) • Modifying globals • Passing an axis argument to numpy array reduction functions • Easy debugging (you have to debug in Python mode).

Slide 89

Slide 89 text

The (Near) Future (also a non-comprehensive list)

• “JIT Classes”
• Better support for strings/bytes, buffers, and parsing use-cases
• More coverage of the NumPy API (advanced indexing, etc.)
• Documented extension API for adding your own types, low-level function implementations, and targets
• Better debug workflows

Slide 90

Slide 90 text

Recently Added Numba Features

• A new GPU target: the Heterogeneous System Architecture, supported by AMD APUs
• Support for named tuples in nopython mode
• Limited support for lists in nopython mode
• On-disk caching of compiled functions (opt-in)
• A simulator for debugging GPU functions with the Python debugger on the CPU
• Can choose to release the GIL in nopython functions
• Many speed improvements

Slide 91

Slide 91 text

© 2015 Continuum Analytics - Confidential & Proprietary

New Features
• Support for ARMv7 (Raspberry Pi 2)
• Python 3.5 support
• NumPy 1.10 support
• Faster loading of pre-compiled functions from the disk cache
• ufunc compilation for multithreaded CPU and GPU targets (features previously only in NumbaPro)

Slide 92

Slide 92 text

Conclusion

• Lots of progress in the past year!
• Try out Numba on your numerical and NumPy-related projects: conda install numba
• Your feedback helps us make Numba better! Tell us what you would like to see: https://github.com/numba/numba
• Stay tuned for more exciting stuff this year…

Slide 93

Slide 93 text

© 2015 Continuum Analytics - Confidential & Proprietary

Thanks (September 18, 2015)

• DARPA XDATA program (Chris White and Wade Shen), which helped fund Numba, Blaze, Dask and Odo
• Investors of Continuum
• Clients and customers of Continuum who help support these projects
• NumFOCUS volunteers
• PyData volunteers