The State of AI/ML in Python

State of AI / ML in Python
Travis E. Oliphant, PhD
August 2018
[email protected] @teoliphant
© 2017 Continuum Analytics - Confidential & Proprietary
© 2018 Quansight - Confidential & Proprietary

• MS/BS degrees in Elec. Comp. Engineering
• PhD from Mayo Clinic in Biomedical Engineering (Ultrasound and MRI)
• Creator and Developer of SciPy (1998-2009)
• Professor at BYU (2001-2007), Inverse Problems
• Creator and Developer of NumPy (2005-2012)
• Started Numba and Conda (2012 - )
• Founder of NumFOCUS / PyData
• Python Software Foundation Director (2012)
• Co-founder of Continuum Analytics => Anaconda, Inc.
• CEO (2012) => Chief Data Scientist (2017)
• Founder (2018) of Quansight

Quansight: continuing Continuum momentum
Key members of the founding team of Continuum Analytics. Anaconda can be seen as our first “spin-out” company.
[Timeline figure: 2012, 2015, 2018, 2020 and beyond; labels: Replaced by, Spin Out, Spin Out, ?, ?]

We grow talent, build technology, and discover products while helping companies connect with open-source communities to organize and analyze their data using the latest advances in machine learning and AI.
Three main areas:
• Create more Data Scientists/ML Engineers: We connect people with experienced mentors on real-world problems.
• Open Source Development: We build teams of talented people and connect them to open-source projects: JupyterLab, XND, Arrow, Numba, Dask, Dask-ML, Uarray, SymPy, …
• General Services: We help our clients with Python projects, cloud projects, data-engineering projects, visualization projects and custom GUIs, and machine-learning/AI projects.

Sustainable Open Source Subscription
• Prioritize Your Needs in Open Source (save $$$ by leveraging open source in a way that keeps using the OSS community instead of bypassing it or fighting it)
• Hire from the Community (good people flock to good projects; we help you attract and retain them)
• Get Open Source Support (help selecting projects to depend on, SLAs for security and bug fixes, community health monitoring, expert help and support)

Data Science Workflow
[Workflow diagram; labels: New Data, Getting Data, Understand Data, Exploratory Data Analysis and Viz, Notebooks, Models, Understand World, Data Products, Reports, Microservices, Dashboards, Applications, Decisions and Actions]

AI is everywhere

Neural network with several layers trained with ~130,000 images. Matched trained dermatologists with 91% area under the sensitivity-specificity curve.
Keys:
• Access to Data
• Access to Software
• Access to Compute

Python, and in particular PyData, keeps growing

Google Search Trends: Python now most popular

Python’s Scientific Ecosystem
[Figure from Jake Vanderplas, PyCon 2017 Keynote; visible label: Bokeh]

Python Data Analysis and Machine Learning Time-Line
[Timeline figure; year labels: 1991, 2001, 2003, 2005, 2006, 2009, 2010, 2011, 2012, 2014, 2015, 2016, 2018]

Explosion of ML Frameworks and libraries
https://github.com/josephmisiti/awesome-machine-learning#python-general-purpose
http://deeplearning.net/software_links/
http://scikit-learn.org/stable/related_projects.html
[Figure: framework logos, including TVM/NNVM]

ML Framework Overview

Key Features Needed for any ML Library
For Training
• Ability to create chains of functions on n-dimensional arrays
• Ability to compute the derivative of the loss function quickly (Automatic Differentiation)
• Key loss functions implemented
• Cross-validation methods
• An optimization library with several useful methods
• Ability to compute functions on n-dimensional arrays on multiple kinds of hardware with highly parallel execution
For Inference
• Ability to create chains of functions on n-dimensional arrays
• Ability to compute functions on n-dimensional arrays on multiple kinds of hardware
Missing from NumPy / SciPy and Scikit-Learn, but added by CuPy and Autograd
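The automatic differentiation requirement listed above can be illustrated with a minimal sketch of reverse-mode differentiation. This is not how Autograd, Chainer, or any other framework named in this talk is implemented; the names `Var`, `tanh`, `backward`, and `loss` are hypothetical, and plain Python scalars stand in for n-dimensional arrays to keep the example short.

```python
import math


class Var:
    """A scalar that remembers how it was computed (hypothetical helper)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def tanh(x):
    """Elementary function with a known local derivative."""
    t = math.tanh(x.value)
    return Var(t, ((x, 1.0 - t * t),))


def backward(output):
    """Apply the chain rule from the output back to every input."""
    order, seen = [], set()

    def visit(node):
        # Post-order walk: every parent is listed before its children.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Reverse order guarantees a node's gradient is complete before it is
    # pushed on to that node's parents.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


def loss(w, x, y):
    """A chain of functions: squared error of tanh(w * x) against y."""
    err = tanh(w * x) - y
    return err * err


w, x, y = Var(0.5), Var(2.0), Var(1.0)
out = loss(w, x, y)
backward(out)
print("loss:", out.value, "dloss/dw:", w.grad)
```

The user only writes `loss` as an ordinary chain of function calls; the derivative falls out of the recorded chain, which is the property the training list asks for.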

Most Libraries (other than Chainer) chose to re-implement NumPy and SciPy as they needed.
Possible Reasons:
• Needed the stack to work in other languages too (Node, Java, C++, Lua, etc.)
• Had legacy code to integrate with
• Needed only a subset of functionality of NumPy / SciPy to build ML
• Lacked familiarity with the NumPy / SciPy communities and how to engage with them

Stats on some of these Projects (August 15th 2018)
Project       Primary Sponsor              Stars    Forks   Contributors  Releases  Participants
TensorFlow    Google                       107,703  66,622  1,614         65        11,677
PyTorch       Facebook                     17,983   4,250   742           17        3,258
MXNet         Amazon                       15,023   5,449   581           52        1,342
Chainer       Preferred Networks (Toyota)  4,030    1,074   179           71        206
PaddlePaddle  Baidu                        7,434    2,030   150           14        792
CNTK          Microsoft                    14,986   4,003   190           37        1,038
Theano        University of Montreal       8,422    2,448   327           31        556

Last update: 11 May, 2018 Courtesy Preferred Networks!

Chainer: a deep learning framework
Chainer is a Python framework that lets researchers quickly implement, train, and evaluate deep learning models.
[Diagram labels: Data set, Designing a network, Training, evaluation]

Chainer features
Fast
☑ CUDA: Supports GPU acceleration using CUDA with CuPy
☑ cuDNN: High-speed training/inference with cuDNN’s optimized deep learning functions with CuPy
☑ NCCL: Supports fast multi-GPU learning using NCCL with CuPy
Full featured
☑ Convolutional Networks: N-dimensional Convolution, Deconvolution, Pooling, BN, etc.
☑ Recurrent Networks: RNN components such as LSTM, Bi-directional LSTM, GRU and Bi-directional GRU
☑ Backprop of backprop: Higher-order derivatives (a.k.a. gradient of gradient) are supported
Intuitive
☑ Define-by-Run: Easy and intuitive to write a network. Supports dynamic graphs.
☑ High debuggability: User-friendly error messages. Easy to debug using pure Python debuggers.
Well-abstracted common tools for various NN learning, easy to write a set of learning flows
☑ Easy to use APIs
☑ Low learning curve
☑ Maintainable codebase
Written in pure Python and well-documented. No need to learn a new tensor API since Chainer uses NumPy and CuPy (NumPy-like API).

Add-on packages for Chainer
Distributed deep learning, deep reinforcement learning, computer vision
• ChainerMN (Multi-Node): additional package for distributed deep learning. High scalability (100 times faster with 128 GPUs)
• ChainerRL: deep reinforcement learning library. DQN, DDPG, A3C, ACER, NSQ, PCL, etc. OpenAI Gym support
• ChainerCV: provides image recognition algorithms and dataset wrappers. Faster R-CNN, Single Shot Multibox Detector (SSD), SegNet, etc.
• ChainerUI: a visualization and experiment management tool for Chainer. Loss curve visualization, hyperparameter comparison in tables, etc.

ChainerMN
In a comparison of elapsed time to train ResNet-50 on the ImageNet dataset for 100 epochs, ChainerMN was the fastest (May 2017).
Recently we trained ResNet-50 on the ImageNet dataset in 15 minutes with an 8 times larger cluster (1024 GPUs over 128 nodes).
See the details in this paper: “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes” https://arxiv.org/abs/1711.04325

Explore Communities Around these Projects
Get a weighted score for each participant in the GitHub community
    -- Per-user activity counts for one repository across the 2017 GitHub Archive tables
    WITH ProjectData AS (
      SELECT *
      FROM `githubarchive.day.2017*`
      WHERE repo.name LIKE 'Theano/Theano'
    ),
    Actors AS (
      SELECT DISTINCT(actor.login) AS login
      FROM ProjectData
    )
    SELECT * FROM (
      SELECT actors.login,
        (SELECT COUNT(*) FROM ProjectData WHERE type = 'IssueCommentEvent' AND actor.login = actors.login) AS Comments,
        (SELECT COUNT(*) FROM ProjectData WHERE type = 'PullRequestEvent' AND actor.login = actors.login) AS PRs,
        (SELECT COUNT(*) FROM ProjectData WHERE type = 'PullRequestReviewCommentEvent' AND actor.login = actors.login) AS ReviewComments,
        (SELECT COUNT(*) FROM ProjectData WHERE type = 'ReleaseEvent' AND actor.login = actors.login) AS Releases,
        (SELECT COUNT(*) FROM ProjectData WHERE type = 'IssuesEvent' AND actor.login = actors.login) AS ClosedRenamedAndLabeledIssues
      FROM Actors AS actors
    )
    -- Keep only users with at least one PR or comment, most active first
    WHERE PRs > 0 OR Comments > 0
    ORDER BY PRs DESC, Comments DESC;
Weights = Comments: 1, PRs: 5, ReviewComments: 5, Releases: 50, ClosedRenamedAndLabeledIssues: 5
Combine average monthly score for 2017 with (current) average monthly score for 2018
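As a rough illustration of the weighting step, the sketch below applies the stated weights to rows shaped like the query output. The rows and logins are made up for the example, `weighted_score` is a hypothetical helper, and the step that combines the 2017 and 2018 monthly averages is left out because the slide does not say how they are combined.

```python
# Weights as stated on the slide.
WEIGHTS = {
    "Comments": 1,
    "PRs": 5,
    "ReviewComments": 5,
    "Releases": 50,
    "ClosedRenamedAndLabeledIssues": 5,
}


def weighted_score(row):
    """Weighted sum of one participant's activity counts."""
    return sum(weight * row.get(name, 0) for name, weight in WEIGHTS.items())


# Made-up rows in the shape returned by the query (not real data).
rows = [
    {"login": "alice", "Comments": 40, "PRs": 12, "ReviewComments": 7,
     "Releases": 1, "ClosedRenamedAndLabeledIssues": 3},
    {"login": "bob", "Comments": 5, "PRs": 1, "ReviewComments": 0,
     "Releases": 0, "ClosedRenamedAndLabeledIssues": 2},
]

scores = {row["login"]: weighted_score(row) for row in rows}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for login, score in ranked:
    print(login, score)
```

With these weights a single release counts as much as fifty comments, so the score favors maintainers over commenters.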

Empirical CDF of Raw Scores

Empirical CDF of Normalized Scores

The Python scientific ecosystem started as “organic”
Before 2016 it is best understood as the personal journeys of a few people

1996 - 2001
Analyze 12.0 https://analyzedirect.com/
Richard Robb, retired in 2015
Bringing “SciFi” Medicine to Life since 1971

Science led to Python
1997
Raja Muthupillai, Armando Manduca, Richard Ehman, Jim Greenleaf

Python origins.
Version  Date
0.9.0    Feb. 1991
0.9.4    Dec. 1991
0.9.6    Apr. 1992
0.9.8    Jan. 1993
1.0.0    Jan. 1994
1.2      Apr. 1995
1.4      Oct. 1996
1.5.2    Apr. 1999
http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html

First problem: Efficient Data Input
The first step is to get the data right. “It’s Always About the Data”
• TableIO, April 1998, Michael A. Miller
• Reference Counting Essay, May 1998, Guido van Rossum http://www.python.org/doc/essays/refcnt/
• NumPyIO, June 1998

Early pieces of SciPy
• cephesmodule, June 1998
• fftw wrappers, November 1998
• stats.py, December 1998, Gary Strangman

1999: Early SciPy emerges
Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999.
In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.
Gaussian quadrature                         5 Jan 1999
cephes 1.0                                  30 Jan 1999
sigtools 0.40                               23 Feb 1999
Numeric docs                                March 1999
cephes 1.1                                  9 Mar 1999
multipack 0.3                               13 Apr 1999
Helper routines                             14 Apr 1999
multipack 0.6 (leastsq, ode, fsolve, quad)  29 Apr 1999
sparse plan described                       30 May 1999
multipack 0.7                               14 Jun 1999
SparsePy 0.1                                5 Nov 1999
cephes 1.2 (vectorize)                      29 Dec 1999
Plotting?? Gist, XPLOT, DISLIN, Gnuplot
Helping with f2py

SciPy 2001
• Eric Jones: weave, cluster, GA*
• Pearu Peterson: linalg, interpolate, f2py
• Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc

Brief History of NumPy
Person                                     Package        Year
Jim Fulton                                 Matrix Object  1994
Jim Hugunin                                Numeric        1995
Perry Greenfield, Rick White, Todd Miller  Numarray       2001
Travis Oliphant                            NumPy          2005

NumPy was created to unify array objects in Python and unify the PyData community
[Diagram labels: Numeric, Numarray, NumPy]
I essentially sacrificed tenure at a university to write NumPy and unify array objects.

Now a large community effort
SciPy ~ 636 contributors
NumPy ~ 679 contributors

Now array-like objects everywhere
[Figure labels: Sparse Arrays, Neon, CUDArray]

We have a “divided” community again!
[Diagram labels: Numeric, Numarray, NumPy]

Examples of packages being built on differing standards: FastAI, skorch, Pyro, Edward, anyrl, Braid, Sonnet, Gluon

Some Unification Efforts
High-level shared APIs like Gluon and Keras
But then we also have…

Example of Gluon

Other Unification Efforts
[Diagram labels: Train the Model, Platform 1; Deploy the Model, Platform 2; Deploy the Model, Platform 3]

NNVM / TVM — Ambitious Plan at UW

PEP 3118: A solution for the community
• Back in 2006, when I wrote NumPy, I also spent time improving the Python buffer protocol, creating an interface that lets array-like objects in memory share data with each other easily.
• A “fix-it-twice” solution.
• All the array objects in Python could export and consume it to make zero-copy interoperability seamless.
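The zero-copy sharing described above can be seen with the standard library alone. In this small sketch `array.array` plays the exporter and `memoryview` the consumer; the variable names are illustrative only.

```python
from array import array

# Exporter: a typed array of four C doubles owns the memory.
producer = array("d", [1.0, 2.0, 3.0, 4.0])

# Consumer: a memoryview reads the exporter's buffer description
# (format, item size, shape) without copying any data.
view = memoryview(producer)
print(view.format, view.itemsize, view.shape)

# Slicing the view is also copy-free: it is a window onto the same memory.
window = view[1:3]
window[0] = 20.0  # write through the view ...
print(producer)   # ... and the exporter sees the change

# The same bytes can be reinterpreted as raw unsigned bytes, still with no copy.
raw = view.cast("B")
print(len(raw))
```

NumPy arrays export this same protocol and `numpy.asarray` can consume it, which is what lets array-like objects from different libraries exchange data without copying.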

Opportunity Exists for Organic Community
By expanding the previously defined Array Interface into a formal abstract uarray object with a multiple-dispatch mechanism for specializing functions on different implementations, we can provide a firm foundation for NumPy dependencies to move into the modern “Differentiable Array Computing” world and avoid a lot of the library re-writes and silos that will exist otherwise.
[Diagram: Array Interface connecting MXNET Tensor, THTensor, NumPy, Dask, Pandas, Gluon, SciPy, Scikit-Image, Scikit-Learn, PyMC4, …]
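To make the multiple-dispatch idea concrete, here is a minimal sketch in plain Python. It is not the uarray API; `implements`, `dispatch`, `DenseVector`, and `SparseVector` are hypothetical names invented for this illustration, and a real mechanism would also need to handle subclasses and fallbacks.

```python
# Registry mapping (function name, argument types) to an implementation.
_IMPLEMENTATIONS = {}


def implements(name, *types):
    """Register a specialization of `name` for the given argument types."""
    def decorator(func):
        _IMPLEMENTATIONS[(name, types)] = func
        return func
    return decorator


def dispatch(name, *args):
    """Pick the specialization based on the types of all arguments."""
    key = (name, tuple(type(arg) for arg in args))
    try:
        impl = _IMPLEMENTATIONS[key]
    except KeyError:
        type_names = ", ".join(type(arg).__name__ for arg in args)
        raise TypeError(f"no implementation of {name} for ({type_names})")
    return impl(*args)


class DenseVector:
    def __init__(self, values):
        self.values = list(values)


class SparseVector:
    def __init__(self, size, entries):
        self.size = size
        self.entries = dict(entries)  # index -> nonzero value


@implements("dot", DenseVector, DenseVector)
def _dot_dense_dense(a, b):
    return sum(x * y for x, y in zip(a.values, b.values))


@implements("dot", SparseVector, DenseVector)
def _dot_sparse_dense(a, b):
    # Only the stored nonzeros contribute.
    return sum(value * b.values[index] for index, value in a.entries.items())


@implements("dot", DenseVector, SparseVector)
def _dot_dense_sparse(a, b):
    return _dot_sparse_dense(b, a)


def dot(a, b):
    """The single abstract function that downstream libraries would call."""
    return dispatch("dot", a, b)


dense = DenseVector([1.0, 2.0, 3.0])
sparse = SparseVector(3, {2: 4.0})
print(dot(dense, dense), dot(sparse, dense), dot(dense, sparse))
```

A library written against `dot` never names a concrete array type, so a new array implementation only has to register its own specializations rather than trigger a rewrite of the library.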

Work started at Quansight Labs
Finding Sponsors for our work!
[email protected] @teoliphant
