
The State of AI/ML in Python


AI and machine learning have catapulted Python to the position of number one programming language. The same trend brought new frameworks and new contributors who were not always aware of the existing systems and libraries. The result is a divergence of fundamental array objects and of the downstream functional stacks built on them. I review some of the popular machine learning frameworks, along with a brief history of NumPy and SciPy, to provide context for a newly proposed project: a general array interface that connects downstream computations with multiple backend implementations of logical arrays.

Travis E. Oliphant

August 19, 2018



Transcript

  1. © 2017 Continuum Analytics - Confidential & Proprietary
    © 2018 Quansight - Confidential & Proprietary
    State of AI / ML in Python
    Travis E. Oliphant, PhD
    August 2018
    [email protected]
    @teoliphant


  2. • MS/BS degrees in Elec. Comp. Engineering
    • PhD from Mayo Clinic in Biomedical Engineering
    (Ultrasound and MRI)
    • Creator and Developer of SciPy (1998-2009)
    • Professor at BYU (2001-2007) Inverse Problems
    • Creator and Developer of NumPy (2005-2012)
    • Started Numba and Conda (2012 - )
    • Founder of NumFOCUS / PyData
    • Python Software Foundation Director (2012)
    • Co-founder of Continuum Analytics => Anaconda, Inc.
    • CEO (2012) => Chief Data Scientist (2017)
    • Founder (2018) of Quansight


  3. Quansight: continuing Continuum momentum
    Key members of the founding team of Continuum Analytics.
    Anaconda can be seen as our first “spin-out” company.
    [Timeline diagram. Years: 2012, 2015, 2018, 2020 and beyond… Labels: “Replaced by”, “Spin Out” (twice), “?” (twice)]


  4. We grow talent, build technology, and discover products while helping
    companies connect with open-source communities to organize and analyze
    their data using the latest advances in machine learning and AI.
    Three main areas:
    Create more Data Scientists/ML Engineers: We mentor people by
    connecting them with experienced mentors on real-world problems.
    Open Source Development: We build teams of talented people and connect
    them to open-source: JupyterLab, XND, Arrow, Numba, Dask, Dask-ML,
    Uarray, SymPy, …
    General Services: We help our clients with Python projects, cloud projects,
    data-engineering projects, visualization projects and custom GUIs, and
    machine-learning/AI projects.


  5. Sustainable Open Source Subscription
    Prioritize Your Needs in Open Source
    (save $$$ by leveraging open-source in a way that keeps using the OSS
    community instead of by-passing it or fighting it)
    Hire from the Community
    (good people flock to good projects — we help you attract and retain them)
    Get Open Source Support
    (Help selecting projects to depend on, SLAs for security and bug fixes, community
    health monitoring, expert help and support)


  6. Data Science Workflow
    [Workflow diagram. Labels: New Data, Getting Data, Understand Data, Understand World, Exploratory Data Analysis and Viz, Notebooks, Models, Data Products, Reports, Microservices, Dashboards, Applications, Decisions and Actions]


  7. AI is everywhere


  8. Neural network with
    several layers trained
    with ~130,000 images.
    Matched trained
    dermatologists with 91%
    area under sensitivity-
    specificity curve.
    Keys:
    • Access to Data
    • Access to Software
    • Access to Compute


  9. Python, and in particular PyData, keeps growing


  10. Google
    Search
    Trends
    Python now most popular


  11. Python’s Scientific Ecosystem
    Bokeh
    Jake Vanderplas PyCon 2017 Keynote


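  12. Python Data Analysis and Machine Learning Time-Line
    [Timeline figure spanning 1991 to 2018. Years marked: 1991, 2001, 2003, 2005, 2006, 2009, 2010, 2011, 2012, 2014, 2015, 2016, 2018. Project labels did not survive extraction.]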


  13. Explosion of ML Frameworks and libraries
    TVM/NNVM
    https://github.com/josephmisiti/awesome-machine-learning#python-general-purpose
    http://deeplearning.net/software_links/
    http://scikit-learn.org/stable/related_projects.html


  14. ML Framework Overview


  15. Key Features Needed for any ML Library
    For Training
    • Ability to create chains of functions on n-dimensional arrays
    • Ability to derive the derivative of the Loss-Function quickly (Automatic Differentiation)
    • Key Loss Functions implemented
    • Cross-validation methods
    • An Optimization library with several useful methods
    • Ability to compute functions on n-dimensional arrays on multiple hardware with highly parallel execution
    For Inference
    • Ability to create chains of functions on n-dimensional arrays
    • Ability to compute functions on n-dimensional arrays on multiple hardware
    Missing from NumPy / SciPy and Scikit-Learn, but added by CuPy and Autograd
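To make the training requirements above concrete, here is a minimal sketch of forward-mode automatic differentiation using dual numbers, followed by a few steps of gradient descent. It works on plain Python scalars instead of n-dimensional arrays, and the names `Dual`, `squared_error_loss`, and `derivative` are hypothetical, chosen only for illustration; this is not how Autograd or any of the frameworks discussed here is implemented.

```python
class Dual:
    """A value paired with its derivative (forward-mode autodiff)."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        # Treat plain numbers as constants (derivative 0).
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def squared_error_loss(w, xs, ys):
    """Sum of squared errors for the one-parameter model y = w * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = w * x - y
        total = total + residual * residual
    return total


def derivative(f, x):
    """Evaluate df/dx at x by seeding the dual part with 1."""
    return f(Dual(x, 1.0)).deriv


# Illustrative data generated by y = 2 * x, so the best w is 2.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

def loss(w):
    return squared_error_loss(w, xs, ys)

grad_at_one = derivative(loss, 1.0)

# Plain gradient descent using the automatically derived gradient.
w = 1.0
for _ in range(50):
    w -= 0.01 * derivative(loss, w)
```

The loss is written once as an ordinary chain of arithmetic; the derivative falls out of operator overloading instead of hand-written calculus, which is the property the slide asks of an ML library.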


  16. Most Libraries (other than Chainer) chose
    to re-implement NumPy and SciPy as they
    needed.
    Possible Reasons:
    • Needed the stack to work in other languages too (Node, Java, C++, Lua, etc.)
    • Had legacy code to integrate with
    • Needed only a subset of functionality of NumPy / SciPy to build ML
    • Lacked familiarity with the NumPy / SciPy communities and how to engage
    with them


  17. Stats on some of these Projects (August 15th, 2018)
    Project       Primary Sponsor              Stars    Forks   Contributors  Releases  Participants
    TensorFlow    Google                       107,703  66,622  1,614         65        11,677
    PyTorch       Facebook                     17,983   4,250   742           17        3,258
    MXNet         Amazon                       15,023   5,449   581           52        1,342
    Chainer       Preferred Networks (Toyota)  4,030    1,074   179           71        206
    PaddlePaddle  Baidu                        7,434    2,030   150           14        792
    CNTK          Microsoft                    14,986   4,003   190           37        1,038
    Theano        University of Montreal       8,422    2,448   327           31        556


  18. Last update: 11 May, 2018
    Courtesy Preferred Networks!


  19. Chainer – a deep learning framework
    Chainer is a Python framework that lets researchers quickly
    implement, train, and evaluate deep learning models.
    [Diagram labels: Dataset; Designing a network; Training, evaluation]


  20. Chainer features
    ☑ Easy to use APIs: Well-abstracted common tools for various NN learning, easy to write a set of learning flows
    ☑ Low learning curve: No need to learn a new tensor API since Chainer uses NumPy and CuPy (NumPy-like API)
    ☑ Maintainable codebase: Written in pure Python and well-documented
    Fast
    ☑ CUDA: Supports GPU acceleration using CUDA with CuPy
    ☑ cuDNN: High-speed training/inference with cuDNN’s optimized deep learning functions with CuPy
    ☑ NCCL: Supports fast multi-GPU learning using NCCL with CuPy
    Full featured
    ☑ Convolutional Networks: N-dimensional Convolution, Deconvolution, Pooling, BN, etc.
    ☑ Recurrent Networks: RNN components such as LSTM, Bi-directional LSTM, GRU and Bi-directional GRU
    ☑ Backprop of backprop: Higher order derivatives (a.k.a. gradient of gradient) are supported
    Intuitive
    ☑ Define-by-Run: Easy and intuitive to write a network. Supports dynamic graphs.
    ☑ High debuggability: User-friendly error messages. Easy to debug using pure Python debuggers.
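The “Define-by-Run” idea can be illustrated with a small reverse-mode sketch in which the computation graph is recorded while ordinary Python code executes, so loops and branches shape the graph on every call. This is a pure-Python toy on scalars; the `Var` class and `power_by_loop` function are hypothetical names for illustration and are not Chainer's API.

```python
class Var:
    """A scalar that records how it was computed while the code runs."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent Var, local derivative of self w.r.t. parent).
        self.parents = parents

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        """Propagate gradients from this node back to every input."""
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # parents are appended before children

        visit(self)
        self.grad = 1.0
        for node in reversed(order):  # children first
            for parent, local in node.parents:
                parent.grad += local * node.grad


def power_by_loop(x, steps):
    """Multiply by x `steps` times; the loop length decides the graph."""
    y = x
    for _ in range(steps):
        y = y * x
    return y


x = Var(2.0)
y = power_by_loop(x, 2)  # builds the graph for x**3 as it runs
y.backward()
```

Because the graph is whatever the Python code happened to execute, a standard debugger or `print` call can inspect any intermediate `Var`, which is the debuggability benefit the slide lists.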


  21. Add-on packages for Chainer
    Distributed deep learning, deep reinforcement learning, computer vision
    ChainerMN (Multi-Node): additional package for distributed deep learning
      High scalability (100 times faster with 128 GPUs)
    ChainerRL: deep reinforcement learning library
      DQN, DDPG, A3C, ACER, NSQ, PCL, etc. OpenAI Gym support
    ChainerCV: provides image recognition algorithms, dataset wrappers
      Faster R-CNN, Single Shot Multibox Detector (SSD), SegNet, etc.
    ChainerUI: a visualization and experiment management tool for Chainer
      Loss curve visualization, hyperparameter comparison in tables, etc.


  22. ChainerMN
    In a comparison of elapsed time to train ResNet-50 on the ImageNet dataset
    for 100 epochs (May 2017), ChainerMN was the fastest.
    Recently we trained ResNet-50 on the ImageNet dataset in 15 minutes using
    a cluster 8 times larger (1024 GPUs over 128 nodes).
    See the details in this paper:
    “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes”
    https://arxiv.org/abs/1711.04325


  23. Explore Communities Around these Projects
    Get a weighted score for each participant in the GitHub community
    WITH
      -- All 2017 GitHub Archive events for one repository
      ProjectData AS (
        SELECT * FROM `githubarchive.day.2017*`
        WHERE repo.name LIKE 'Theano/Theano'
      ),
      -- Every distinct account that produced at least one event
      Actors AS (SELECT DISTINCT(actor.login) AS login FROM ProjectData)
    -- One row per account, with event counts broken down by event type
    SELECT * FROM (
      SELECT
        actors.login,
        (SELECT COUNT(*) FROM ProjectData WHERE type = 'IssueCommentEvent' AND actor.login = actors.login) AS Comments,
        (SELECT COUNT(*) FROM ProjectData WHERE type = 'PullRequestEvent' AND actor.login = actors.login) AS PRs,
        (SELECT COUNT(*) FROM ProjectData WHERE type = 'PullRequestReviewCommentEvent' AND actor.login = actors.login) AS ReviewComments,
        (SELECT COUNT(*) FROM ProjectData WHERE type = 'ReleaseEvent' AND actor.login = actors.login) AS Releases,
        (SELECT COUNT(*) FROM ProjectData WHERE type = 'IssuesEvent' AND actor.login = actors.login) AS ClosedRenamedAndLabeledIssues
      FROM Actors AS actors
    )
    -- Keep only accounts with pull-request or comment activity
    WHERE PRs > 0 OR Comments > 0
    ORDER BY PRs DESC, Comments DESC;
    Weights = Comments: 1, PRs: 5, ReviewComments: 5, Releases: 50, ClosedRenamedAndLabeledIssues: 5
    Combine average monthly score for 2017 with (current) average monthly score for 2018
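The weighting on this slide can be sketched in a few lines of Python. The weights are the ones stated on the slide; the participant counts below are made up for illustration, and the function names are hypothetical. The slide does not say how the 2017 and 2018 monthly averages are combined, so that step is deliberately left out.

```python
# Weights as stated on the slide.
WEIGHTS = {
    "Comments": 1,
    "PRs": 5,
    "ReviewComments": 5,
    "Releases": 50,
    "ClosedRenamedAndLabeledIssues": 5,
}


def weighted_score(counts):
    """Weighted sum of one participant's event counts (missing keys count as 0)."""
    return sum(weight * counts.get(name, 0) for name, weight in WEIGHTS.items())


def average_monthly_score(counts, months):
    """Weighted score divided by the number of months the counts cover."""
    return weighted_score(counts) / months


# Hypothetical counts for one participant over all of 2017.
example_counts = {
    "Comments": 120,
    "PRs": 10,
    "ReviewComments": 30,
    "Releases": 1,
    "ClosedRenamedAndLabeledIssues": 4,
}

score_2017 = weighted_score(example_counts)
monthly_2017 = average_monthly_score(example_counts, months=12)
```

Each key matches a column alias in the query, so a row of the query result could be passed to `weighted_score` as a dictionary.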


  24. Empirical CDF of
    Raw Scores


  25. Empirical CDF of
    Normalized Scores
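For readers unfamiliar with the plots, an empirical CDF gives, for each threshold, the fraction of participants whose score is at or below it. A minimal sketch follows, assuming the scores are held in a plain list; the sample scores are invented, and the slides do not state how scores were normalized, so only the CDF itself is shown.

```python
from bisect import bisect_right


def ecdf_points(scores):
    """Return (score, cumulative fraction) pairs suitable for a step plot."""
    ordered = sorted(scores)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


def ecdf_at(scores, threshold):
    """Fraction of scores less than or equal to `threshold`."""
    ordered = sorted(scores)
    return bisect_right(ordered, threshold) / len(ordered)


# Hypothetical participant scores.
sample_scores = [1, 5, 5, 20, 390]
points = ecdf_points(sample_scores)
share_at_or_below_5 = ecdf_at(sample_scores, 5)
```

The same functions apply unchanged to normalized scores once a normalization has been chosen.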


  26. Python Scientific ecosystem
    started as “organic”
    Before 2016, it can be understood as the
    personal journeys of a few people


  27. 1996 - 2001
    Analyze 12.0
    https://analyzedirect.com/
    Richard Robb
    Retired in 2015
    Bringing “SciFi”
    Medicine to Life
    since 1971


  28. Science led to Python
    Raja Muthupillai
    Armando Manduca
    Richard Ehman
    1997
    Jim Greenleaf


  29. Python origins.
    Version Date
    0.9.0 Feb. 1991
    0.9.4 Dec. 1991
    0.9.6 Apr. 1992
    0.9.8 Jan. 1993
    1.0.0 Jan. 1994
    1.2 Apr. 1995
    1.4 Oct. 1996
    1.5.2 Apr. 1999
    http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html


  30. First problem: Efficient Data Input
    The first step is to get the data right
    “It’s Always About the Data”
    Reference Counting Essay, May 1998, Guido van Rossum
    http://www.python.org/doc/essays/refcnt/
    TableIO, April 1998, Michael A. Miller
    NumPyIO, June 1998


  31. Early pieces of SciPy
    cephesmodule: June 1998
    fftw wrappers: November 1998
    stats.py: December 1998 (Gary Strangman)


  32. 1999 : Early SciPy emerges
    Discussions on the matrix-sig from 1997 to 1999 called for a complete data analysis environment: Paul Barrett, Joe Harrington,
    Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999.
    In response, on 15 Jan 1999 I posted to matrix-sig a list of routines I felt needed to be present and began wrapping and writing in
    earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.
    Gaussian quadrature                          5 Jan 1999
    cephes 1.0                                   30 Jan 1999
    sigtools 0.40                                23 Feb 1999
    Numeric docs                                 March 1999
    cephes 1.1                                   9 Mar 1999
    multipack 0.3                                13 Apr 1999
    Helper routines                              14 Apr 1999
    multipack 0.6 (leastsq, ode, fsolve, quad)   29 Apr 1999
    sparse plan described                        30 May 1999
    multipack 0.7                                14 Jun 1999
    SparsePy 0.1                                 5 Nov 1999
    cephes 1.2 (vectorize)                       29 Dec 1999
    Plotting?? Gist, XPLOT, DISLIN, Gnuplot
    Helping with f2py


  33. SciPy 2001
    Eric Jones
    weave
    cluster
    GA*
    Pearu Peterson
    linalg
    interpolate
    f2py
    optimize
    sparse
    interpolate
    integrate
    special
    signal
    stats
    fftpack
    misc
    Travis Oliphant


  34. Brief History of NumPy
    Person                                      Package        Year
    Jim Fulton                                  Matrix Object  1994
    Jim Hugunin                                 Numeric        1995
    Perry Greenfield, Rick White, Todd Miller   Numarray       2001
    Travis Oliphant                             NumPy          2005


  35. NumPy was created to unify array objects
    in Python and unify the PyData community
    Numeric
    Numarray
    NumPy
    I essentially sacrificed tenure at a University to write NumPy and
    unify array objects.


  36. Now a large community effort
    SciPy ~ 636 contributors
    NumPy ~ 679 contributors


  37. Now array-like objects everywhere
    Sparse Arrays
    Neon
    CUDArray


  38. We have a “divided” community again!
    Numeric
    Numarray
    NumPy


  39. Examples of packages being built on
    differing standards
    FastAI
    skorch
    Pyro
    Edward
    anyrl
    Braid
    Sonnet
    Gluon


  40. Some Unification Efforts
    High-level shared APIs like Gluon and Keras
    But then we also have…


  41. Example of Gluon


  42. Other Unification Efforts
    [Diagram labels: Train the Model; Deploy the Model (twice); Platform 1; Platform 2; Platform 3]


  43. NNVM / TVM — Ambitious Plan at UW


  44. PEP 3118 — A solution for the community
    • Back in 2006, when I wrote NumPy, I also spent time improving
    the Python buffer protocol, creating an interface for array-like
    objects in memory to share data with each other easily.
    • A “fix-it-twice” solution.
    • All the array objects in Python could export and consume it to
    make zero-copy interoperability seamless.
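The zero-copy sharing described above can be seen with nothing but the standard library, since `array.array` exports the buffer protocol and `memoryview` consumes it. This is a minimal sketch with illustrative values; the variable names are arbitrary.

```python
import array

# Exporter: a flat block of four C doubles.
data = array.array("d", [1.0, 2.0, 3.0, 4.0])

# Consumer: a view onto the same memory, with no copy made.
view = memoryview(data)
description = (view.format, view.itemsize, view.shape)

# Writing through the view changes the exporter's memory.
view[0] = 10.0

# Reinterpret the same bytes as a 2 x 2 grid, still without copying.
# memoryview.cast requires going through a byte format when reshaping.
grid = view.cast("B").cast("d", shape=[2, 2])
rows = grid.tolist()

# Release the views so the array can be resized again later.
grid.release()
view.release()
```

The view carries the element format, item size, and shape alongside the raw pointer, which is the metadata PEP 3118 added so that array libraries can interpret each other's memory.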


  45. Opportunity Exists for Organic Community
    By expanding the previously defined Array Interface into a formal abstract uarray object, with a multiple-dispatch
    mechanism for specializing functions on different implementations, we can provide a firm
    foundation for NumPy dependencies to move into the modern “Differentiable Array Computing” world
    and avoid many of the library rewrites and silos that will otherwise exist.
    Array Interface
    MXNET Tensor THTensor NumPy Dask
    Pandas
    Gluon SciPy Scikit-Image Scikit-Learn
    PyMC4 …
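The multiple-dispatch idea can be sketched as a registry that selects an implementation from the types of all arguments, so downstream code calls one `add` regardless of backend. Everything here is hypothetical and for illustration only: `ListArray`, `SparseArray`, `implements`, and `dispatch` are toy names, not the uarray API or any existing library's interface.

```python
_IMPLEMENTATIONS = {}


def implements(name, *types):
    """Register a function as the implementation of `name` for these types."""
    def decorator(func):
        _IMPLEMENTATIONS[(name, types)] = func
        return func
    return decorator


def dispatch(name, *args):
    """Pick the implementation that matches the types of all arguments."""
    key = (name, tuple(type(a) for a in args))
    if key not in _IMPLEMENTATIONS:
        raise TypeError(f"no implementation of {name!r} for {key[1]}")
    return _IMPLEMENTATIONS[key](*args)


class ListArray:
    """Toy dense backend: stores every element in a list."""
    def __init__(self, values):
        self.values = list(values)


class SparseArray:
    """Toy sparse backend: stores only non-zero entries by index."""
    def __init__(self, size, items):
        self.size = size
        self.items = dict(items)


@implements("add", ListArray, ListArray)
def _add_dense(a, b):
    return ListArray(x + y for x, y in zip(a.values, b.values))


@implements("add", SparseArray, SparseArray)
def _add_sparse(a, b):
    merged = dict(a.items)
    for index, value in b.items.items():
        merged[index] = merged.get(index, 0) + value
    return SparseArray(a.size, merged)


@implements("add", ListArray, SparseArray)
def _add_dense_sparse(a, b):
    values = list(a.values)
    for index, value in b.items.items():
        values[index] += value
    return ListArray(values)


def add(a, b):
    """The single entry point that downstream libraries would call."""
    return dispatch("add", a, b)


dense_sum = add(ListArray([1, 2, 3]), ListArray([10, 20, 30]))
sparse_sum = add(SparseArray(3, {1: 5}), SparseArray(3, {1: 1, 2: 2}))
mixed_sum = add(ListArray([1, 2, 3]), SparseArray(3, {0: 4}))
```

Dispatching on every argument, not just the first, is what lets mixed-backend calls such as dense plus sparse get their own specialized implementation.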


  46. Work started at Quansight Labs
    Finding Sponsors for our work!
    [email protected] @teoliphant
