Slide 1

State of AI / ML in Python
Travis E. Oliphant, PhD
August 2018
[email protected] | @teoliphant
© 2017 Continuum Analytics, © 2018 Quansight - Confidential & Proprietary

Slide 2

• MS/BS degrees in Elec. Comp. Engineering
• PhD from Mayo Clinic in Biomedical Engineering (Ultrasound and MRI)
• Creator and Developer of SciPy (1998-2009)
• Professor at BYU (2001-2007), Inverse Problems
• Creator and Developer of NumPy (2005-2012)
• Started Numba and Conda (2012- )
• Founder of NumFOCUS / PyData
• Python Software Foundation Director (2012)
• Co-founder of Continuum Analytics => Anaconda, Inc.; CEO (2012) => Chief Data Scientist (2017)
• Founder (2018) of Quansight

Slide 3

Quansight — continuing Continuum momentum
Key members of the founding team of Continuum Analytics. Anaconda can be seen as our first “spin-out” company.
(Timeline graphic: 2012 → 2015 → 2018 → 2020 and beyond…)

Slide 4

We grow talent, build technology, and discover products while helping companies connect with open-source communities to organize and analyze their data using the latest advances in machine learning and AI. Three main areas:
• Create more Data Scientists/ML Engineers: we grow practitioners by connecting them with experienced mentors on real-world problems.
• Open Source Development: we build teams of talented people and connect them to open source: JupyterLab, XND, Arrow, Numba, Dask, Dask-ML, Uarray, SymPy, …
• General Services: we help our clients with Python projects, cloud projects, data-engineering projects, visualization projects and custom GUIs, and machine-learning/AI projects.

Slide 5

Sustainable Open Source Subscription
• Prioritize your needs in open source (save $$$ by leveraging open source in a way that keeps engaging the OSS community instead of bypassing or fighting it)
• Hire from the community (good people flock to good projects — we help you attract and retain them)
• Get open-source support (help selecting projects to depend on, SLAs for security and bug fixes, community-health monitoring, expert help and support)

Slide 6

Data Science Workflow (diagram): Getting Data → New Data → Understand Data (Exploratory Data Analysis and Viz, Models) → Understand World (Data Products: Reports, Microservices, Dashboards, Applications) → Decisions and Actions

Slide 7

AI is everywhere

Slide 8

Neural network with several layers trained with ~130,000 images. Matched trained dermatologists with 91% area under the sensitivity-specificity curve.
Keys:
• Access to Data
• Access to Software
• Access to Compute

Slide 9

Python, and in particular PyData, keeps growing

Slide 10

Google Search Trends: Python now most popular

Slide 11

Python’s Scientific Ecosystem (figure: Jake Vanderplas, PyCon 2017 Keynote)

Slide 12

Python Data Analysis and Machine Learning Time-Line, 1991-2018 (timeline figure with milestones in 1991, 2001, 2003, 2005, 2006, 2009, 2010, 2011, 2012, 2014, 2015, 2016, 2018, …)

Slide 13

Explosion of ML Frameworks and Libraries (e.g., TVM/NNVM)
https://github.com/josephmisiti/awesome-machine-learning#python-general-purpose
http://deeplearning.net/software_links/
http://scikit-learn.org/stable/related_projects.html

Slide 14

ML Framework Overview

Slide 15

Key Features Needed for any ML Library
For Training:
• Ability to create chains of functions on n-dimensional arrays
• Ability to derive the derivative of the loss function quickly (automatic differentiation)
• Key loss functions implemented
• Cross-validation methods
• An optimization library with several useful methods
• Ability to compute functions on n-dimensional arrays on multiple hardware with highly parallel execution
For Inference:
• Ability to create chains of functions on n-dimensional arrays
• Ability to compute functions on n-dimensional arrays on multiple hardware
(Missing from NumPy / SciPy and scikit-learn, but added by CuPy and Autograd.)
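The automatic-differentiation requirement is the key gap: given a chain of array functions, the library must produce gradients of the loss without hand-derived math. A minimal reverse-mode sketch (illustrative scalar code, not any framework's actual API; real systems such as Autograd or Chainer do this over n-dimensional arrays):

```python
# Minimal reverse-mode automatic differentiation (illustrative only).
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # (parent_var, local_gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Accumulate d(output)/d(self) into each ancestor via the chain rule.
        self.grad += seed
        for parent, local_grad in self.parents:
            parent.backward(seed * local_grad)

x, y = Var(3.0), Var(4.0)
loss = x * y + x      # loss = x*y + x = 15
loss.backward()       # d(loss)/dx = y + 1 = 5,  d(loss)/dy = x = 3
print(x.grad, y.grad)
```

Each operation records its local derivatives as the graph is built (define-by-run), and `backward` replays the chain rule from the loss to the inputs.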

Slide 16

Most libraries (other than Chainer) chose to re-implement NumPy and SciPy as they needed. Possible reasons:
• Needed the stack to work in other languages too (Node, Java, C++, Lua, etc.)
• Had legacy code to integrate with
• Needed only a subset of NumPy / SciPy functionality to build ML
• Lacked familiarity with the NumPy / SciPy communities and how to engage with them

Slide 17

Stats on some of these projects (August 15th, 2018):

Project      | Primary Sponsor             | Stars   | Forks  | Contributors | Releases | Participants
TensorFlow   | Google                      | 107,703 | 66,622 | 1614         | 65       | 11677
PyTorch      | Facebook                    | 17,983  | 4,250  | 742          | 17       | 3258
MXNet        | Amazon                      | 15,023  | 5,449  | 581          | 52       | 1342
Chainer      | Preferred Networks (Toyota) | 4,030   | 1,074  | 179          | 71       | 206
PaddlePaddle | Baidu                       | 7,434   | 2,030  | 150          | 14       | 792
CNTK         | Microsoft                   | 14,986  | 4,003  | 190          | 37       | 1038
Theano       | University of Montreal      | 8,422   | 2,448  | 327          | 31       | 556

Slide 18

(Figure courtesy Preferred Networks! Last update: 11 May, 2018.)

Slide 19

Chainer – a deep learning framework. Chainer is a Python framework that lets researchers quickly implement, train, and evaluate deep learning models.
(Diagram: Data set → Designing a network → Training, evaluation)

Slide 20

Chainer features
• Written in pure Python and well-documented. No need to learn a new tensor API since Chainer uses NumPy and CuPy (NumPy-like API).
• User-friendly error messages. Easy to debug using pure Python debuggers.
• Easy and intuitive to write a network. Supports dynamic graphs.

Fast: ☑ CUDA ☑ cuDNN ☑ NCCL
• Supports GPU acceleration using CUDA with CuPy
• High-speed training/inference with cuDNN’s optimized deep learning functions with CuPy
• Supports fast multi-GPU learning using NCCL with CuPy

Full featured: ☑ Convolutional Networks ☑ Recurrent Networks ☑ Backprop of backprop
• N-dimensional Convolution, Deconvolution, Pooling, BN, etc.
• RNN components such as LSTM, Bi-directional LSTM, GRU and Bi-directional GRU
• Higher-order derivatives (a.k.a. gradient of gradient) are supported

Intuitive: ☑ Define-by-Run ☑ High debuggability ☑ Easy-to-use APIs ☑ Low learning curve ☑ Maintainable codebase
• Well-abstracted common tools for various NN learning; easy to write a set of learning flows
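The "NumPy-like API" point is what makes CuPy nearly drop-in: code written against the shared API surface runs on either library. A sketch (NumPy only here, since a GPU is not assumed; with CuPy installed, `cupy.get_array_module(x)` selects the right module automatically):

```python
import numpy as np

def logsumexp(xp, x):
    # Numerically stable log(sum(exp(x))), written against the API
    # surface shared by NumPy and CuPy; xp is either module.
    m = xp.max(x)
    return m + xp.log(xp.sum(xp.exp(x - m)))

# Works on the CPU with NumPy; passing cupy and a cupy.ndarray
# would run the same code on the GPU.
x = np.array([1000.0, 1000.0])
print(logsumexp(np, x))  # ~1000.6931, with no overflow
```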

Slide 21

Add-on packages for Chainer: distributed deep learning, deep reinforcement learning, computer vision.
• ChainerMN (Multi-Node): additional package for distributed deep learning. High scalability (100 times faster with 128 GPUs).
• ChainerRL: deep reinforcement learning library. DQN, DDPG, A3C, ACER, NSQ, PCL, etc.; OpenAI Gym support.
• ChainerCV: provides image recognition algorithms and dataset wrappers. Faster R-CNN, Single Shot Multibox Detector (SSD), SegNet, etc.
• ChainerUI: a visualization and experiment management tool for Chainer. Loss-curve visualization, hyperparameter comparison in tables, etc.

Slide 22

ChainerMN was the fastest in a comparison of elapsed time to train ResNet-50 on the ImageNet dataset for 100 epochs (May 2017). Recently we achieved 15 minutes to train ResNet-50 on ImageNet with an 8-times-larger cluster (1,024 GPUs over 128 nodes). See the details in this paper: “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes”, https://arxiv.org/abs/1711.04325

Slide 23

Explore Communities Around these Projects

Get a weighted score for each participant in the GitHub community. Combine the average monthly score for 2017 with the (current) average monthly score for 2018. Weights = Comments: 1, PRs: 5, ReviewComments: 5, Releases: 50, ClosedRenamedAndLabeledIssues: 5.

WITH ProjectData AS (
  SELECT * FROM `githubarchive.day.2017*`
  WHERE repo.name LIKE 'Theano/Theano'),
Actors AS (
  SELECT DISTINCT(actor.login) AS login FROM ProjectData)
SELECT * FROM (
  SELECT
    actors.login,
    (SELECT COUNT(*) FROM ProjectData
     WHERE type = 'IssueCommentEvent' AND actor.login = actors.login) AS Comments,
    (SELECT COUNT(*) FROM ProjectData
     WHERE type = 'PullRequestEvent' AND actor.login = actors.login) AS PRs,
    (SELECT COUNT(*) FROM ProjectData
     WHERE type = 'PullRequestReviewCommentEvent' AND actor.login = actors.login) AS ReviewComments,
    (SELECT COUNT(*) FROM ProjectData
     WHERE type = 'ReleaseEvent' AND actor.login = actors.login) AS Releases,
    (SELECT COUNT(*) FROM ProjectData
     WHERE type = 'IssuesEvent' AND actor.login = actors.login) AS ClosedRenamedAndLabeledIssues
  FROM Actors AS actors)
WHERE PRs > 0 OR Comments > 0
ORDER BY PRs DESC, Comments DESC;
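The scoring step is then a simple weighted sum over each participant's event counts. A sketch in Python (assuming query results fetched into dicts keyed by the column aliases above; the record shown is made up):

```python
# Weights taken from the slide's scoring scheme.
WEIGHTS = {"Comments": 1, "PRs": 5, "ReviewComments": 5,
           "Releases": 50, "ClosedRenamedAndLabeledIssues": 5}

def score(row):
    # row: one query-result record, e.g. {"login": ..., "Comments": 10, ...}
    return sum(WEIGHTS[k] * row.get(k, 0) for k in WEIGHTS)

rows = [{"login": "alice", "Comments": 10, "PRs": 4, "ReviewComments": 2,
         "Releases": 1, "ClosedRenamedAndLabeledIssues": 3}]
print(score(rows[0]))  # 10*1 + 4*5 + 2*5 + 1*50 + 3*5 = 105
```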

Slide 24

Empirical CDF of Raw Scores

Slide 25

Empirical CDF of Normalized Scores

Slide 26

The Python scientific ecosystem started as “organic” growth. Pre-2016, it is understandable as the personal journeys of a few people.

Slide 27

1996-2001: Analyze 12.0 (https://analyzedirect.com/), Richard Robb (retired in 2015). “Bringing ‘SciFi’ Medicine to Life since 1971.”

Slide 28

Science led to Python (1997): Raja Muthupillai, Armando Manduca, Richard Ehman, Jim Greenleaf.

Slide 29

Python origins (http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html):

Version | Date
0.9.0   | Feb. 1991
0.9.4   | Dec. 1991
0.9.6   | Apr. 1992
0.9.8   | Jan. 1993
1.0.0   | Jan. 1994
1.2     | Apr. 1995
1.4     | Oct. 1996
1.5.2   | Apr. 1999

Slide 30

First problem: efficient data input. “It’s Always About the Data”: the first step is to get the data right.
• Reference Counting Essay, May 1998, Guido van Rossum (http://www.python.org/doc/essays/refcnt/)
• TableIO, April 1998, Michael A. Miller
• NumPyIO, June 1998

Slide 31

Early pieces of SciPy:
• cephesmodule, June 1998
• fftw wrappers, November 1998
• stats.py, December 1998 (Gary Strangman)

Slide 32

1999: early SciPy emerges. Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999. In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.
• Gaussian quadrature: 5 Jan 1999
• cephes 1.0: 30 Jan 1999
• sigtools 0.40: 23 Feb 1999
• Numeric docs: March 1999
• cephes 1.1: 9 Mar 1999
• multipack 0.3: 13 Apr 1999
• Helper routines: 14 Apr 1999
• multipack 0.6 (leastsq, ode, fsolve, quad): 29 Apr 1999
• sparse plan described: 30 May 1999
• multipack 0.7: 14 Jun 1999
• SparsePy 0.1: 5 Nov 1999
• cephes 1.2 (vectorize): 29 Dec 1999
Plotting?? Gist, XPLOT, DISLIN, Gnuplot. Helping with f2py.

Slide 33

SciPy 2001 (diagram of subpackages and contributors):
• Eric Jones: weave, cluster, GA*
• Pearu Peterson: linalg, interpolate, f2py
• Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc

Slide 34

Brief History of NumPy:

Person | Package | Year
Jim Fulton | Matrix Object | 1994
Jim Hugunin | Numeric | 1995
Perry Greenfield, Rick White, Todd Miller | Numarray | 2001
Travis Oliphant | NumPy | 2005

Slide 35

NumPy was created to unify array objects in Python and unify the PyData community (Numeric + Numarray → NumPy). I essentially sacrificed tenure at a university to write NumPy and unify array objects.

Slide 36

Now a large community effort: SciPy ~636 contributors, NumPy ~679 contributors.

Slide 37

Now array-like objects everywhere: sparse arrays, Neon, CUDArray, …

Slide 38

We have a “divided” community again! (Echoing the earlier Numeric / Numarray split that NumPy resolved.)

Slide 39

Examples of packages being built on differing standards: FastAI, skorch, Pyro, Edward, anyrl, Braid, Sonnet, Gluon.

Slide 40

Some unification efforts: high-level shared APIs like Gluon and Keras. But then we also have…

Slide 41

Example of Gluon

Slide 42

Other unification efforts (diagram): train the model once, then deploy the model to Platform 1, Platform 2, and Platform 3.

Slide 43

NNVM / TVM — Ambitious Plan at UW

Slide 44

PEP 3118 — A solution for the community
• Back in 2006 when I wrote NumPy, I also spent time improving the Python buffer protocol, creating an interface for array-like objects in memory to share data with each other easily.
• A “fix-it-twice” solution.
• All the array objects in Python could export and consume it to make zero-copy interoperability seamless.
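The zero-copy sharing PEP 3118 enables can be demonstrated with the standard library alone: `array.array` exports the protocol and `memoryview` consumes it (NumPy's `numpy.frombuffer` and `numpy.asarray` consume the same protocol):

```python
import array

# array.array exports the PEP 3118 buffer protocol; memoryview consumes it.
a = array.array('d', [1.0, 2.0, 3.0])
view = memoryview(a)            # zero-copy view of a's memory
print(view.format, view.shape)  # d (3,)

view[0] = 99.0   # writes through to the exporter: no copy was made
print(a[0])      # 99.0
```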

Slide 45

Opportunity Exists for Organic Community. By expanding the previously defined array interface into a formal abstract uarray object with a multiple-dispatch mechanism for specializing functions on different implementations, we can provide a firm foundation for NumPy dependencies to move into the modern “differentiable array computing” world and avoid a lot of library rewrites and silos that will exist otherwise.
(Diagram: an Array Interface implemented by MXNet Tensor, THTensor, NumPy, Dask, Pandas, …, and consumed by Gluon, SciPy, Scikit-Image, Scikit-Learn, PyMC4, …)
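The dispatch idea can be sketched with the standard library's single-dispatch decorator, specializing one generic function per array implementation (a deliberate simplification: the uarray design uses backends and multimethods, and the `asum` function here is hypothetical):

```python
from functools import singledispatch

# A generic reduction that specializes on the array implementation;
# list and tuple stand in for NumPy, Dask, MXNet tensors, etc.
@singledispatch
def asum(x):
    raise TypeError(f"no asum implementation for {type(x).__name__}")

@asum.register(list)    # "implementation A"
def _(x):
    return sum(x)

@asum.register(tuple)   # "implementation B"
def _(x):
    return sum(x)

print(asum([1, 2, 3]), asum((4, 5)))  # 6 9
```

Libraries written against the generic `asum` would then work unchanged on any registered array type, which is the silo-avoidance argument made above.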

Slide 46

Work started at Quansight Labs. Finding sponsors for our work!
[email protected] | @teoliphant