Numba and Dask at the PyData Foundations Workshop

An overview of the PyData workshop held at IBM's INDEX conference plus information about Conda, Numba, Dask

Travis E. Oliphant

February 20, 2018

Transcript

  1. Scaling PyData: PyData Workshop at IBM Index Conference. Travis E. Oliphant, Founder, Quansight. Credit to Anaconda, Inc. for some of these slides.
  2. PyData Foundations Workshop: Jake VanderPlas, Jim Bednar, Phillip Cloud, Steven Silvester, Jason Grout, Travis Oliphant
  3. • MS/BS degrees in Elec. Comp. Engineering
    • PhD from Mayo Clinic in Biomedical Engineering (Ultrasound and MRI)
    • Creator and Developer of SciPy (1998-2009)
    • Professor at BYU (2001-2007), Inverse Problems
    • Creator and Developer of NumPy (2005-2012)
    • Started Numba (2012)
    • Founder of NumFOCUS / PyData
    • Python Software Foundation Director (2012)
    • Co-founder of Continuum Analytics => Anaconda, Inc.
    • CEO (2012) => Chief Data Scientist (2017)
    • Founder (2018) of Quansight
  4. Empower domain experts with high-level tools that exploit modern hardware (Array-Oriented Computing expertise)
  5. Why array-oriented computing
    • Express domain knowledge directly in arrays (tensors, matrices, vectors), which makes it easier to teach programming in the specific domain
    • Can take advantage of parallelism and accelerators
    • Array expressions
    [Diagram: six objects, each holding Attr1, Attr2, Attr3, contrasted with three attribute arrays (Attr1, Attr2, Attr3) spanning Object1 to Object6]
  6. More reasons for array-oriented computing
    • Today’s vector machines (and vector co-processors, or GPUs) were made for array-oriented computing.
    • The software stack has just not caught up, which is unfortunate because APL came out in 1963.
    • There is a reason Fortran remains popular.
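
The layout contrast on the two slides above can be sketched as follows, assuming NumPy is installed. The `Particle` class, the field names, and the kinetic-energy formula are illustrative choices and do not come from the talk.

```python
import numpy as np

# Array-of-objects layout: one Python object per record.
class Particle:
    def __init__(self, mass, velocity):
        self.mass = mass
        self.velocity = velocity

particles = [Particle(m, v) for m, v in [(1.0, 2.0), (2.0, 3.0), (4.0, 0.5)]]
energy_loop = [0.5 * p.mass * p.velocity ** 2 for p in particles]

# Array-oriented layout: one array per attribute.
mass = np.array([1.0, 2.0, 4.0])
velocity = np.array([2.0, 3.0, 0.5])

# A single array expression covers every record at once.
energy_array = 0.5 * mass * velocity ** 2
```

Both layouts give the same numbers; the array form states the formula once and leaves the looping to compiled code.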
  7. Python, and in particular PyData, is growing

  8. Data Science Workflow
    [Diagram labels: New Data, Notebooks, Understand Data, Getting Data, Understand World, Reports, Microservices, Dashboards, Applications, Decisions and Actions, Models, Exploratory Data Analysis and Viz, Data Products]
  9. Machine Learning Explosion
    Scikit-Learn, Tensorflow, Keras, XGBoost, Torch, MxNet, theano, lasagne, caffe/caffe2, minpy, neon, CNTK, DAAL, Chainer, Dynet, Apache Singa, Shogun, CuPy
    https://github.com/josephmisiti/awesome-machine-learning#python-general-purpose
    http://deeplearning.net/software_links/
    http://scikit-learn.org/stable/related_projects.html
  10. A Representation of Packages for ML (© 2017 Anaconda, Inc. - Confidential & Proprietary)
  11. NumPy and Packages that Depend on It
  12. pandas Depends on NumPy (and other packages depend on pandas)
  13. Caffe Depends on pandas and NumPy
  14. Embrace Innovation Without Anarchy: Reproducibility. From http://www.slideshare.net/RevolutionAnalytics/r-at-microsoft

  15. • Conda: a cross-platform and language-agnostic package and environment manager
    • Conda Forge: a community-led collection of recipes, build infrastructure, and packages for conda
    • Conda Environments: custom isolated software sandboxes to allow easy reproducibility and sharing of data-science work
    • Anaconda Project: reproducible, executable project directories
  16. “conda – package everything”
    • Language independent
    • Platform independent
    • No special privileges required
    • No VMs or containers
    • Enables: Reproducibility, Collaboration, Scaling
    [Diagram: Conda Sandboxing Technology. Three isolated environments A, B, and C managed by conda. Labels: Python v2.7, Python v3.4, Pandas v0.18, Pandas v0.16, NumPy v1.11, NumPy v1.10, Jupyter, R, R Essentials]
  17. Basic Conda Usage
    • Install a package: conda install sympy
    • List all installed packages: conda list
    • Search for packages: conda search llvm
    • Create a new environment: conda create -n py3k python=3
    • Remove a package: conda remove nose
    • Get help: conda install --help
  18. Advanced Conda Usage
    • Install a package in an environment: conda install -n py3k sympy
    • Update all packages: conda update --all
    • Export list of packages: conda list --export > packages.txt
    • Install packages from an export: conda install --file packages.txt
    • See package history: conda list --revisions
    • Revert to a revision: conda install --revision 23
    • Remove unused packages and cached tarballs: conda clean -pt
  19. Conda eases rapid deployment of ML (Development, Deployment)

  20. Scaling Up and Out with Numba and Dask

  21. Scale Up vs Scale Out
    • Scale Up (Bigger Nodes): Big Memory & Many Cores / GPU Box. Numba
    • Scale Out (More Nodes): Many commodity nodes in a cluster. Dask
    • Best of Both (e.g. GPU Cluster): Dask with Numba
  22. Development
        Name      Latest Release   Number of Releases   GitHub Stars   Contributors   Downloads in 2017
        numba     0.37             101                  2907           77             ~2m
        dask      0.17.0           41                   2444           136            ~1.5m
        dask-ml   0.4.1            9                    188            8              New
    Numba: http://numba.pydata.org, http://github.com/numba
    Dask: http://dask.pydata.org, http://github.com/dask
    Dask-ml: http://dask-ml.readthedocs.io/en/latest/index.html, http://github.com/dask/dask-ml
  23. Numba. Credit: Stan Seibert for many of these slides

  24. Numba (compile Python to CPUs and GPUs)
        conda install numba
    [Diagram labels: Python, Numba, Parsing Frontend, Intermediate Representation (IR), LLVM, Code Generation Backend, targets x86, ARM, PTX]
  25. Image Processing (~1500x speed-up)
        from numba import jit

        @jit('void(f8[:,:],f8[:,:],f8[:,:])')
        def filter(image, filt, output):
            M, N = image.shape
            m, n = filt.shape
            for i in range(m//2, M-m//2):
                for j in range(n//2, N-n//2):
                    result = 0.0
                    for k in range(m):
                        for l in range(n):
                            result += image[i+k-m//2,j+l-n//2]*filt[k, l]
                    output[i,j] = result
  26. Numba Features: works with and does not replace the standard Python interpreter (all of your existing Python libraries are still available)
  27. 7 things about Numba you may not know
    1. Numba is 100% Open Source
    2. Numba + Jupyter = Rapid CUDA Prototyping
    3. Numba can compile for the CPU and the GPU at the same time
    4. Numba makes array processing easy with @(gu)vectorize
    5. Numba comes with a CUDA Simulator
    6. You can send Numba functions over the network
    7. Numba developers are working on a GPU DataFrame (pygdf)
  28. Other Numba topics
    • CUDA Python: write general GPU kernels with Python
    • Device Arrays: manage memory transfer from host to GPU
    • Streaming: manage asynchronous and parallel GPU compute streams
    • CUDA Simulator in Python: to help debug your kernels
    • HSA Support: early support for HSA-based GPUs and APUs
    • Pyculib: access to cuFFT, cuBLAS, cuSPARSE, cuRAND, CUDA Sorting
    • Parallel Acceleration: prange, parallel functions, and more
    https://github.com/ContinuumIO/gtc2017-numba
  29. Numba Features
    • Detects CPU model during compilation and optimizes for that target
    • Automatic type inference: no need to give type signatures for functions
    • Dispatches to multiple type-specializations for the same function
    • Call out to C libraries with CFFI and types
    • Special "callback" mode for creating C callbacks to use with external libraries
    • Optional caching to disk, and ahead-of-time creation of shared libraries
    • Compiler is extensible with new data types and functions
  30. Parallel Computing
    • Three main technologies for parallelism: SIMD, Multi-threading, Distributed Computing
    [Diagram: elements x0, x1, x2, x3 shown under each technology]
  31. SIMD: Single Instruction Multiple Data
    • Numba's CPU detection will enable LLVM to autovectorize for the appropriate SIMD instruction set:
        • SSE, AVX, AVX2, AVX-512
        • Will become even more important as AVX-512 is now available on both Xeon Phi and Skylake Xeon processors
  32. Manual Multithreading: Release the GIL
    • Option to release the GIL
    • Using Python concurrent.futures
    [Chart: Speedup Ratio versus Number of Threads (1, 2, 4)]
  33. Universal Functions (Ufuncs)
    Ufuncs are a core concept in NumPy for array-oriented computing.
    ◦ A function with scalar inputs is broadcast across the elements of the input arrays:
        • np.add([1,2,3], 3) == [4, 5, 6]
        • np.add([1,2,3], [10, 20, 30]) == [11, 22, 33]
    ◦ Parallelism is present, by construction. Numba will generate loops and can automatically multi-thread if requested.
    ◦ Before Numba, creating fast ufuncs required writing C. No longer!
  34. Universal Functions (Ufuncs): Different decorator! 1.8x speedup!
  35. Multi-threaded Ufuncs: Specify type signature. Select parallel target. Automatically uses all CPU cores!
  36. ParallelAccelerator
    • ParallelAccelerator is a special compiler pass contributed by Intel Labs
        • Todd A. Anderson, Ehsan Totoni, Paul Liu
        • Based on similar contribution to Julia
    • Automatically generates multithreaded code in a Numba-compiled function:
        • Array expressions and reductions
        • Random functions
        • Dot products
        • Explicit loops indicated with prange() call
  37. ParallelAccelerator: Example #1
    [Bar chart: Time (ms) for NumPy, Numba, and Numba+PA; speedup labels 1.8x and 3.6x; 1000000x10 input, Core i7 Quad Core CPU]
  38. ParallelAccelerator: prange()
    [Bar chart: Time (ms) for NumPy, Numba, and Numba+PA; speedup labels 4.3x and 50x; 1000000x10 input, Core i7 Quad Core CPU]
  39. ParallelAccelerator: prange()
    [Bar chart: Time (ms) for NumPy, Numba, and Numba+PA; speedup labels 2x and 3.6x; 1000000x10 input, Core i7 Quad Core CPU]
  40. ParallelAccelerator: Image Resampling
    • Interactive image resampling with Holoviews + Datashader
    • Datashader resampling implemented with Numba + prange()
    https://github.com/bokeh/datashader/blob/master/examples/landsat.ipynb
  41. ParallelAccelerator: Stencils
    730x547 image w/ 21x21 pixel blur: half the lines of code and 4x faster on a quad core CPU than equivalent non-stencil Numba code
  42. Dask. Credit: Matthew Rocklin for many of these slides

  43. • Designed to parallelize the Python ecosystem
        • Handles complex algorithms
        • Co-developed with Pandas/SKLearn/Jupyter teams
    • Familiar APIs for Python users
    • Scales
        • Scales from multicore to 1000-node clusters
        • Resilient, responsive, and real-time
  44. • Parallelizes NumPy, Pandas, SKLearn
        • Satisfies subset of these APIs
        • Uses these libraries internally
        • Co-developed with these teams
    • Task scheduler supports custom algorithms
        • Parallelize existing code
        • Build novel real-time systems
        • Arbitrary task graphs with data dependencies
        • Same scalability
  45. Dask: From User Interaction to Execution (delayed)
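
A small sketch of how `dask.delayed` turns ordinary function calls into a task graph that runs only on request, assuming Dask is installed. The functions `load` and `total` are hypothetical stand-ins for real work.

```python
from dask import delayed

@delayed
def load(part):
    # Stand-in for reading one partition of data.
    return list(range(part * 3, part * 3 + 3))

@delayed
def total(values):
    return sum(values)

# Calling delayed functions records tasks instead of running them.
partials = [total(load(part)) for part in range(4)]
grand_total = delayed(sum)(partials)

# The scheduler executes the recorded graph here.
result = grand_total.compute()
```

The four `load`/`total` chains are independent of each other, so the scheduler is free to run them in parallel before the final sum.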

  46. Dask DataFrame is like Pandas
        >>> import pandas as pd
        >>> df = pd.read_csv('iris.csv')
        >>> df.head()
           sepal_length  sepal_width  petal_length  petal_width      species
        0           5.1          3.5           1.4          0.2  Iris-setosa
        1           4.9          3.0           1.4          0.2  Iris-setosa
        2           4.7          3.2           1.3          0.2  Iris-setosa
        3           4.6          3.1           1.5          0.2  Iris-setosa
        4           5.0          3.6           1.4          0.2  Iris-setosa
        >>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
        >>> max_sepal_length_setosa
        5.7999999999999998

        >>> import dask.dataframe as dd
        >>> ddf = dd.read_csv('*.csv')
        >>> ddf.head()
           sepal_length  sepal_width  petal_length  petal_width      species
        0           5.1          3.5           1.4          0.2  Iris-setosa
        1           4.9          3.0           1.4          0.2  Iris-setosa
        2           4.7          3.2           1.3          0.2  Iris-setosa
        3           4.6          3.1           1.5          0.2  Iris-setosa
        4           5.0          3.6           1.4          0.2  Iris-setosa
        …
        >>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
        >>> d_max_sepal_length_setosa.compute()
        5.7999999999999998
  47. Example 1: Using Dask DataFrames on a cluster with CSV data
    • Built from Pandas DataFrames
    • Match Pandas interface
    • Access data from HDFS, S3, local, etc.
    • Fast, low latency
    • Responsive user interface
  48. Dask Array is like NumPy
        >>> import numpy as np
        >>> np_ones = np.ones((5000, 1000))
        >>> np_ones
        array([[ 1., 1., 1., ..., 1., 1., 1.],
               [ 1., 1., 1., ..., 1., 1., 1.],
               [ 1., 1., 1., ..., 1., 1., 1.],
               ...,
               [ 1., 1., 1., ..., 1., 1., 1.],
               [ 1., 1., 1., ..., 1., 1., 1.],
               [ 1., 1., 1., ..., 1., 1., 1.]])
        >>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
        >>> np_y
        array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, 693.14718056])

        >>> import dask.array as da
        >>> da_ones = da.ones((5000000, 1000000), chunks=(1000, 1000))
        >>> da_ones.compute()
        array([[ 1., 1., 1., ..., 1., 1., 1.],
               [ 1., 1., 1., ..., 1., 1., 1.],
               [ 1., 1., 1., ..., 1., 1., 1.],
               ...,
               [ 1., 1., 1., ..., 1., 1., 1.],
               [ 1., 1., 1., ..., 1., 1., 1.],
               [ 1., 1., 1., ..., 1., 1., 1.]])
        >>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
        >>> np_da_y = np.array(da_y)  # fits in memory
        >>> np_da_y
        array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, …, 693.14718056])
        # If result doesn’t fit in memory
        >>> da_y.to_hdf5('myfile.hdf5', 'result')
  49. Example 3: Using Dask Arrays with global temperature data
    • Built from NumPy n-dimensional arrays
    • Matches NumPy interface (subset)
    • Solve medium-large problems
    • Complex algorithms
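
A laptop-sized sketch of the chunked-array idea, assuming Dask and NumPy are installed. It uses a 5000x1000 array rather than a cluster-scale one, so the whole result fits in memory; the sizes are illustrative assumptions.

```python
import numpy as np
import dask.array as da

# 5000x1000 array of ones stored as five 1000x1000 NumPy chunks.
ones = da.ones((5000, 1000), chunks=(1000, 1000))

# Builds a task graph; nothing is computed yet.
lazy_rows = da.log(ones + 1)[:5].sum(axis=1)

row_sums = lazy_rows.compute()    # returns a regular NumPy array
expected = 1000 * np.log(2.0)     # each row sums 1000 copies of ln(2)
```

Only the first five rows are requested, so only the chunk that holds them needs to be evaluated.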
  50. Dask Schedulers: Distributed Scheduler

  51. Scheduler Visualization with Bokeh

  52. Ten Reasons People Choose Dask

  53. Scalable Pandas DataFrames
    • Same API
        import dask.dataframe as dd
        df = dd.read_parquet('s3://bucket/accounts/2017')
        df.groupby(df.name).value.mean().compute()
    • Efficient Timeseries Operations
        df.loc['2017-01-01']             # Uses the Pandas index…
        df.value.rolling(10).std()       # for efficient…
        df.value.resample('10m').mean()  # operations.
    • Co-developed with Pandas and by the Pandas developer community
  54. Scalable NumPy Arrays
    • Same API
        import dask.array as da
        x = da.from_array(my_hdf5_file)
        y = x.dot(x.T)
    • Applications
        • Atmospheric science
        • Satellite imagery
        • Biomedical imagery
    • Optimization algorithms: check out dask-glm
  55. Parallelize Scikit-Learn/Joblib
    • Scikit-Learn parallelizes with Joblib
        estimator = RandomForest(…)
        estimator.fit(train_data, train_labels, njobs=8)
    • Joblib can use Dask
        from sklearn.externals.joblib import parallel_backend
        with parallel_backend('dask', scheduler='…'):
            estimator.fit(train_data, train_labels)
    https://pythonhosted.org/joblib/
    http://distributed.readthedocs.io/en/latest/joblib.html
    [Diagram: Joblib running on a thread pool]
  56. Parallelize Scikit-Learn/Joblib
    • Scikit-Learn parallelizes with Joblib
        estimator = RandomForest(…)
        estimator.fit(train_data, train_labels, njobs=8)
    • Joblib can use Dask
        from sklearn.externals.joblib import parallel_backend
        with parallel_backend('dask', scheduler='…'):
            estimator.fit(train_data, train_labels)
    https://pythonhosted.org/joblib/
    http://distributed.readthedocs.io/en/latest/joblib.html
    [Diagram: Joblib running on Dask]
  57. Many Other Libraries in Anaconda • Scikit-Image uses dask to

    break down images and speed up algorithms with overlapping regions • Geopandas can use Dask to partition data spatially and accelerate spatial joins
  58. Dask Scales Up
    • Thousand node clusters
    • Cloud computing
    • Super computers
    • Gigabyte/s bandwidth
    • 200 microsecond task overhead
    Dask Scales Down (the median cluster size is one)
    • Can run in a single Python thread pool
    • Almost no performance penalty (microseconds)
    • Lightweight
    • Few dependencies
    • Easy install
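
Scaling down can be sketched by running one task graph on two local schedulers, assuming a recent Dask release in which `compute()` accepts a `scheduler=` keyword (older releases, including the one current at the time of the talk, selected the scheduler differently). The array and the computation are illustrative.

```python
import dask.array as da

x = da.arange(1_000_000, chunks=100_000)
mean_of_squares = (x ** 2).mean()  # lazy: only a task graph so far

# Same graph, two local execution strategies.
single_threaded = mean_of_squares.compute(scheduler="synchronous")
thread_pool = mean_of_squares.compute(scheduler="threads")
```

The synchronous scheduler runs every task in the calling thread, which is convenient for debugging; the threaded scheduler uses a local thread pool with no cluster setup.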
  59. Parallelize Web Backends
    • Web servers process thousands of small computations asynchronously for web pages or REST endpoints
    • Dask provides dynamic, heterogeneous computation
        • Supports small data
        • 10ms roundtrip times
        • Dynamic scaling for different loads
    • Supports asynchronous Python (like GoLang)
        async def serve(request):
            future = dask_client.submit(process, request)
            result = await future
            return result
  60. Debugging support
    • Clean Python tracebacks when user code breaks
    • Connect to remote workers with IPython sessions for advanced debugging
  61. Resource constraints
    • Define limited hardware resources for workers
    • Specify resource constraints when submitting tasks
        $ dask-worker … --resources GPU=2
        $ dask-worker … --resources GPU=2
        $ dask-worker … --resources special-db=1
        future = client.submit(my_function, resources={'GPU': 1})
    • Used for GPUs, big-memory machines, special hardware, database connections, I/O machines, etc.
  62. Collaboration
    • Many users can share the same cluster simultaneously
    • Define public datasets
    • Repeated computation and data use is shared among everyone
        df = dd.read_parquet(…).persist()
        client.publish_dataset(accounts=df)
        df = client.get_dataset('accounts')
  63. Beautiful Diagnostic Dashboards • Fast responsive dashboards • Provide users

    performance insight • Powered by Bokeh
  64. Some Reasons not to Choose Dask

  65. Dask’s limitations
    • Dask is not a SQL database. Does Pandas well, but won’t optimize complex queries.
    • Dask is not MPI. Very fast, but does leave some performance on the table: 200us task overhead, a couple copies in the network stack.
    • Dask is not a JVM technology. It’s a Python library (although Julia bindings available).
    • Dask is not always necessary. You may not need parallelism.
  66. dask.pydata.org conda install dask