The State of AI/ML in Python

AI and Machine Learning have catapulted Python into the number-one programming language. The same trend brought new frameworks and new contributors who were not always aware of the existing systems and libraries, resulting in a divergence of fundamental array objects and diverging downstream functional stacks. I review some of the popular machine-learning frameworks, along with a brief history of NumPy and SciPy, to provide context for a newly proposed project: a general array interface that connects downstream computations with multiple backend implementations of logical arrays.

Travis E. Oliphant

August 19, 2018

Transcript

  1. State of AI / ML in Python. Travis E. Oliphant, PhD. August 2018. [email protected], @teoliphant. © 2018 Quansight.
  2. • MS/BS degrees in Elec. Comp. Engineering
    • PhD from Mayo Clinic in Biomedical Engineering (Ultrasound and MRI)
    • Creator and Developer of SciPy (1998-2009)
    • Professor at BYU (2001-2007), Inverse Problems
    • Creator and Developer of NumPy (2005-2012)
    • Started Numba and Conda (2012- )
    • Founder of NumFOCUS / PyData
    • Python Software Foundation Director (2012)
    • Co-founder of Continuum Analytics => Anaconda, Inc.
    • CEO (2012) => Chief Data Scientist (2017)
    • Founder (2018) of Quansight
  3. Quansight: continuing Continuum momentum. Key members of the founding team of Continuum Analytics. Anaconda can be seen as our first "spin-out" company. [Timeline diagram: 2012, 2015, 2018, 2020 and beyond; "Replaced by" and "Spin Out" arrows, with future spin-outs marked "?".]
  4. We grow talent, build technology, and discover products while helping companies connect with open-source communities to organize and analyze their data using the latest advances in machine learning and AI. Three main areas:
    • Create more Data Scientists/ML Engineers: We mentor people by connecting them with experienced mentors on real-world problems.
    • Open Source Development: We build teams of talented people and connect them to open source: JupyterLab, XND, Arrow, Numba, Dask, Dask-ML, Uarray, SymPy, …
    • General Services: We help our clients with Python projects, cloud projects, data-engineering projects, visualization projects and custom GUIs, and machine-learning/AI projects.
  5. Sustainable Open Source Subscription
    • Prioritize Your Needs in Open Source (save $$$ by leveraging open source in a way that keeps using the OSS community instead of by-passing it or fighting it)
    • Hire from the Community (good people flock to good projects; we help you attract and retain them)
    • Get Open Source Support (help selecting projects to depend on, SLAs for security and bug fixes, community health monitoring, expert help and support)
  6. [Workflow diagram: Data Science Workflow. Labels: Getting Data, New Data, Notebooks, Exploratory Data Analysis and Viz, Understand Data, Understand World, Models, Data Products (Reports, Microservices, Dashboards, Applications), Decisions and Actions.]
  7. A neural network with several layers, trained on ~130,000 images, matched trained dermatologists with 91% area under the sensitivity-specificity curve. Keys:
    • Access to Data
    • Access to Software
    • Access to Compute
  8. [Timeline figure: Python Data Analysis and Machine Learning Time-Line, with milestones at 1991, 2001, 2003, 2005, 2006, 2009, 2010, 2011, 2012, 2014, 2015, 2016, and 2018.]
  9. Key Features Needed for any ML Library
    For Training:
    • Ability to create chains of functions on n-dimensional arrays
    • Ability to derive the derivative of the loss function quickly (Automatic Differentiation)
    • Key loss functions implemented
    • Cross-validation methods
    • An optimization library with several useful methods
    • Ability to compute functions on n-dimensional arrays on multiple kinds of hardware with highly parallel execution
    For Inference:
    • Ability to create chains of functions on n-dimensional arrays
    • Ability to compute functions on n-dimensional arrays on multiple kinds of hardware
    These capabilities are missing from NumPy / SciPy and Scikit-Learn, but are added by CuPy and Autograd; see the sketch below.
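    To make the training requirements concrete, here is a minimal sketch (an illustration with assumed data shapes, not code from the talk) of how the Autograd library adds automatic differentiation on top of NumPy-style chained array functions:

        # Sketch: automatic differentiation of a chained array function with Autograd.
        import autograd.numpy as np   # NumPy-compatible API that records operations
        from autograd import grad

        def loss(w, x, y):
            # A chain of n-dimensional array functions: linear model + squared error.
            pred = np.dot(x, w)
            return np.mean((pred - y) ** 2)

        dloss_dw = grad(loss)         # derivative w.r.t. the first argument (the weights)

        x = np.random.randn(100, 3)   # hypothetical data
        y = np.random.randn(100)
        w = np.zeros(3)
        print(dloss_dw(w, x, y))      # gradient computed without hand-derived math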
  10. Most libraries (other than Chainer) chose to re-implement NumPy and SciPy as they needed. Possible reasons:
    • Needed the stack to work in other languages too (Node, Java, C++, Lua, etc.)
    • Had legacy code to integrate with
    • Needed only a subset of the functionality of NumPy / SciPy to build ML
    • Lacked familiarity with the NumPy / SciPy communities and how to engage with them
  11. Stats on some of these Projects (as of August 15, 2018)

    Project       Primary Sponsor              Stars    Forks   Contributors  Releases  Participants
    TensorFlow    Google                       107,703  66,622  1614          65        11677
    PyTorch       Facebook                     17,983   4,250   742           17        3258
    MXNet         Amazon                       15,023   5,449   581           52        1342
    Chainer       Preferred Networks (Toyota)  4,030    1,074   179           71        206
    PaddlePaddle  Baidu                        7,434    2,030   150           14        792
    CNTK          Microsoft                    14,986   4,003   190           37        1038
    Theano        University of Montreal       8,422    2,448   327           31        556
  12. Chainer: a deep learning framework. Chainer is a Python framework that lets researchers quickly implement, train, and evaluate deep learning models. [Diagram: Data set → Designing a network → Training, evaluation]
  13. Chainer features
    Intuitive: ☑ Define-by-Run ☑ High debuggability ☑ Easy-to-use APIs ☑ Low learning curve ☑ Maintainable codebase
    • Written in pure Python and well-documented. No need to learn a new tensor API since Chainer uses NumPy and CuPy (a NumPy-like API).
    • User-friendly error messages. Easy to debug using pure Python debuggers.
    • Easy and intuitive to write a network. Supports dynamic graphs (see the define-by-run sketch below).
    • Well-abstracted common tools for various NN learning; easy to write a set of learning flows.
    Fast: ☑ CUDA ☑ cuDNN ☑ NCCL
    • Supports GPU acceleration using CUDA with CuPy.
    • High-speed training/inference with cuDNN's optimized deep learning functions with CuPy.
    • Supports fast multi-GPU learning using NCCL with CuPy.
    Full featured: ☑ Convolutional Networks ☑ Recurrent Networks ☑ Backprop of backprop
    • N-dimensional Convolution, Deconvolution, Pooling, BN, etc.
    • RNN components such as LSTM, Bi-directional LSTM, GRU, and Bi-directional GRU.
    • Higher-order derivatives (a.k.a. gradient of gradient) are supported.
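    As a concrete illustration of define-by-run, here is a minimal sketch (assuming a Chainer 4/5-era API; not code from the talk) in which the computational graph is recorded as ordinary Python executes:

        import numpy as np
        import chainer
        import chainer.functions as F
        import chainer.links as L

        class MLP(chainer.Chain):
            def __init__(self):
                super().__init__()
                with self.init_scope():
                    self.l1 = L.Linear(None, 100)  # input size inferred at first call
                    self.l2 = L.Linear(100, 10)

            def __call__(self, x):
                # The graph is defined by running this code (define-by-run),
                # so ordinary Python control flow and debuggers work on it.
                h = F.relu(self.l1(x))
                return self.l2(h)

        model = MLP()
        x = np.random.rand(4, 784).astype(np.float32)  # hypothetical batch
        t = np.zeros(4, dtype=np.int32)                # hypothetical labels

        loss = F.softmax_cross_entropy(model(x), t)
        loss.backward()  # backprop through the graph recorded during the forward pass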
  14. Add-on packages for Chainer: distributed deep learning, deep reinforcement learning, computer vision.
    • ChainerMN (Multi-Node): additional package for distributed deep learning. High scalability (100 times faster with 128 GPUs).
    • ChainerRL: deep reinforcement learning library. DQN, DDPG, A3C, ACER, NSQ, PCL, etc. OpenAI Gym support.
    • ChainerCV: provides image recognition algorithms and dataset wrappers. Faster R-CNN, Single Shot Multibox Detector (SSD), SegNet, etc.
    • ChainerUI: a visualization and experiment management tool for Chainer. Loss curve visualization, hyperparameter comparison in tables, etc.
  15. ChainerMN was the fastest in a comparison of elapsed time to train ResNet-50 on the ImageNet dataset for 100 epochs (May 2017). Recently we achieved 15 minutes to train ResNet-50 on ImageNet with an 8-times-larger cluster (1024 GPUs over 128 nodes). See the details in this paper: "Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes" https://arxiv.org/abs/1711.04325
  16. Explore Communities Around these Projects
    Get a weighted score for each participant in the GitHub community: combine the average monthly score for 2017 with the (current) average monthly score for 2018, with weights Comments = 1, PRs = 5, ReviewComments = 5, Releases = 50, ClosedRenamedAndLabeledIssues = 5. The BigQuery query below collects the per-actor event counts (a scoring sketch follows it):

    WITH ProjectData AS (
      SELECT * FROM `githubarchive.day.2017*`
      WHERE repo.name LIKE 'Theano/Theano'),
    Actors AS (
      SELECT DISTINCT(actor.login) AS login FROM ProjectData)
    SELECT * FROM (
      SELECT
        actors.login,
        (SELECT COUNT(*) FROM ProjectData
         WHERE type = 'IssueCommentEvent'
           AND actor.login = actors.login) AS Comments,
        (SELECT COUNT(*) FROM ProjectData
         WHERE type = 'PullRequestEvent'
           AND actor.login = actors.login) AS PRs,
        (SELECT COUNT(*) FROM ProjectData
         WHERE type = 'PullRequestReviewCommentEvent'
           AND actor.login = actors.login) AS ReviewComments,
        (SELECT COUNT(*) FROM ProjectData
         WHERE type = 'ReleaseEvent'
           AND actor.login = actors.login) AS Releases,
        (SELECT COUNT(*) FROM ProjectData
         WHERE type = 'IssuesEvent'
           AND actor.login = actors.login) AS ClosedRenamedAndLabeledIssues
      FROM Actors AS actors
    )
    WHERE PRs > 0 OR Comments > 0
    ORDER BY PRs DESC, Comments DESC;
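    A minimal sketch (assumed, not from the slides) of the scoring step in pandas, using the column names and weights from the query above on hypothetical rows:

        import pandas as pd

        # Hypothetical rows, as exported from the BigQuery result above.
        df = pd.DataFrame([
            {"login": "alice", "Comments": 40, "PRs": 12, "ReviewComments": 8,
             "Releases": 1, "ClosedRenamedAndLabeledIssues": 5},
            {"login": "bob", "Comments": 15, "PRs": 2, "ReviewComments": 0,
             "Releases": 0, "ClosedRenamedAndLabeledIssues": 1},
        ])

        weights = {"Comments": 1, "PRs": 5, "ReviewComments": 5,
                   "Releases": 50, "ClosedRenamedAndLabeledIssues": 5}

        # Weighted score per participant; dividing by 12 gives an average monthly score.
        df["score"] = sum(df[col] * w for col, w in weights.items())
        print(df.sort_values("score", ascending=False))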
  17. The Python scientific ecosystem started as "organic": pre-2016 it is best understood as the personal journeys of a few people.
  18. 1996-2001: Analyze 12.0 (https://analyzedirect.com/), Richard Robb (retired in 2015). "Bringing 'SciFi' Medicine to Life since 1971."
  19. Python origins.

    Version  Date
    0.9.0    Feb. 1991
    0.9.4    Dec. 1991
    0.9.6    Apr. 1992
    0.9.8    Jan. 1993
    1.0.0    Jan. 1994
    1.2      Apr. 1995
    1.4      Oct. 1996
    1.5.2    Apr. 1999

    http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html
  20. First problem: efficient data input. The first step is to get the data right. "It's Always About the Data."
    • Reference Counting Essay, May 1998, Guido van Rossum (http://www.python.org/doc/essays/refcnt/)
    • TableIO, April 1998, Michael A. Miller
    • NumPyIO, June 1998
  21. Early pieces of SciPy:
    • cephesmodule, June 1998
    • fftw wrappers, November 1998
    • stats.py, December 1998, Gary Strangman
  22. 1999: Early SciPy emerges. Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999. In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.
    • Gaussian quadrature, 5 Jan 1999
    • cephes 1.0, 30 Jan 1999
    • sigtools 0.40, 23 Feb 1999
    • Numeric docs, March 1999
    • cephes 1.1, 9 Mar 1999
    • multipack 0.3, 13 Apr 1999
    • Helper routines, 14 Apr 1999
    • multipack 0.6 (leastsq, ode, fsolve, quad), 29 Apr 1999
    • sparse plan described, 30 May 1999
    • multipack 0.7, 14 Jun 1999
    • SparsePy 0.1, 5 Nov 1999
    • cephes 1.2 (vectorize), 29 Dec 1999
    Plotting?? Gist, XPLOT, DISLIN, Gnuplot. Also helping with f2py.
  23. SciPy 2001 [package diagram]: Eric Jones (weave, cluster, GA*), Pearu Peterson (linalg, interpolate, f2py), Travis Oliphant (optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc).
  24. Brief History of NumPy

    Person                                     Package        Year
    Jim Fulton                                 Matrix Object  1994
    Jim Hugunin                                Numeric        1995
    Perry Greenfield, Rick White, Todd Miller  Numarray       2001
    Travis Oliphant                            NumPy          2005
  25. NumPy was created to unify array objects in Python and unify the PyData community. [Diagram: Numeric + Numarray → NumPy] I essentially sacrificed tenure at a university to write NumPy and unify array objects.
  26. PEP 3118: a solution for the community.
    • Back in 2006 when I wrote NumPy, I also spent time improving the Python buffer protocol, creating an interface for array-like objects in memory to share data with each other easily.
    • A "fix-it-twice" solution.
    • All the array objects in Python could export and consume it to make zero-copy interoperability seamless (a small demonstration follows below).
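    A small demonstration (a sketch, not from the talk) of PEP 3118 in action: NumPy both exports and consumes the buffer protocol, so a memoryview round-trip shares memory with zero copies:

        import numpy as np

        a = np.arange(10, dtype=np.int32)
        m = memoryview(a)         # NumPy exports its data via the PEP 3118 buffer protocol
        print(m.format, m.shape)  # 'i' (10,): dtype and shape travel with the buffer

        b = np.asarray(m)         # consuming the buffer yields a zero-copy view
        b[0] = 99
        assert a[0] == 99         # both arrays share the same underlying memory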
  27. An opportunity exists for the organic community. By expanding the previously defined Array Interface into a formal abstract uarray object with a multiple-dispatch mechanism for specializing functions on different implementations, we can provide a firm foundation for NumPy dependencies to move into the modern "Differentiable Array Computing" world and avoid a lot of library re-writes and silos that will exist otherwise. [Diagram: an Array Interface connecting backends (MXNet Tensor, THTensor, NumPy, Dask, …) with downstream consumers (Pandas, Gluon, SciPy, Scikit-Image, Scikit-Learn, PyMC4, …).] A dispatch sketch follows below.
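    To illustrate the multiple-dispatch idea, here is a hypothetical sketch (this is not the actual uarray API): a logical array function is specialized on the type of the array it receives, so one call site routes to whichever backend registered an implementation:

        import functools
        import numpy as np

        _REGISTRY = {}  # (function name, array type) -> backend implementation

        def array_dispatch(func):
            """Dispatch a logical array function to a backend based on argument type."""
            @functools.wraps(func)
            def wrapper(arr, *args, **kwargs):
                impl = _REGISTRY.get((func.__name__, type(arr)))
                if impl is None:
                    raise TypeError(f"no backend for {func.__name__} on {type(arr).__name__}")
                return impl(arr, *args, **kwargs)
            def register(arr_type, impl):
                _REGISTRY[(func.__name__, arr_type)] = impl
            wrapper.register = register
            return wrapper

        @array_dispatch
        def total(arr):
            """Logical 'sum' that downstream code calls without knowing the backend."""

        total.register(np.ndarray, np.sum)        # NumPy backend
        # total.register(cupy.ndarray, cupy.sum)  # a GPU backend would register the same way

        print(total(np.arange(5)))  # 10, dispatched to the NumPy implementation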