The State of the Stack (SciPy 2015 Keynote)

The scientific Python community relies not just on the Python language, but on an entire ecosystem of open-source tools built on top of it. A key strength of the community is that these tools are not static: they are constantly evolving and improving through the volunteer effort of thousands of community members around the world. I'll provide a high-level view of the current state of the scientific Python software ecosystem, discuss its strengths and weaknesses relative to other available tools, and highlight some specific areas where I see room for growth and improvement.

Video Link: http://y2u.be/5GlNDD7qbP4

Jake VanderPlas

July 10, 2015

Transcript

  1. #SciPy2015
    Jake VanderPlas
    The State of the Stack
    Jake VanderPlas @jakevdp
    SciPy Keynote; July 10, 2015

  3. #SciPy2015
    Jake VanderPlas
    This Talk . . .
    Where we are
    Where we’ve come from
    Where we’re going

  4. #SciPy2015
    Jake VanderPlas
    Python’s Scientific Ecosystem

  8. #SciPy2015
    Jake VanderPlas
    Python’s Scientific Ecosystem (and many, many more)

  9. #SciPy2015
    Jake VanderPlas
    Many more tools:
    Performance:
    Numba, Weave, Numexpr, Theano . . .
    Visualization:
    Bokeh, Seaborn, Plotly, Chaco, mpld3, ggplot,
    MayaVi, vincent, toyplot, HoloViews . . .
    Data Structures & Computation:
    Blaze, Dask, DistArray, XRay,
    Graphlab, SciDBpy, pySpark . . .
    Packaging & distribution:
    pip/wheels, conda, EPD, Canopy, Anaconda . . .

  10. #SciPy2015
    Jake VanderPlas
    Recent-ish Developments

  11. #SciPy2015
    Jake VanderPlas
    Recent Developments:
    Core Language

  12. #SciPy2015
    Jake VanderPlas
    Python 3
    If you haven’t switched, it’s time.
    I’ll just leave this right here . . .

  13. #SciPy2015
    Jake VanderPlas
    Recent Developments:
    Visualization

  14. #SciPy2015
    Jake VanderPlas
    Matplotlib:
    Evolving into a more Modern Package
    Matplotlib 2.0 will break backward compatibility
    to provide new plot style defaults!
    (See State of Matplotlib talk for more details)
    Matplotlib 1.4 features stylesheets,
    with several very nice built-in styles.

  15. #SciPy2015
    Jake VanderPlas
    New Matplotlib Styles:
    In [3]: make_plots()

  16. #SciPy2015
    Jake VanderPlas
    New Matplotlib Styles:
    In [4]: plt.style.use('ggplot')
            make_plots()

  17. #SciPy2015
    Jake VanderPlas
    New Matplotlib Styles:
    In [5]: plt.style.use('fivethirtyeight')
            make_plots()

  18. #SciPy2015
    Jake VanderPlas
    New Matplotlib Styles:
    In [6]: plt.style.use('bmh')
            make_plots()
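
A minimal sketch of switching between the three built-in stylesheets named on these slides. The talk's `make_plots()` is not shown in the deck, so the version here is a hypothetical stand-in that draws a few sine curves; the `Agg` backend is an assumption made only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import numpy as np


def make_plots(ax):
    """Hypothetical stand-in for the talk's make_plots(): a few sine curves."""
    x = np.linspace(0, 10, 200)
    for phase in range(4):
        ax.plot(x, np.sin(x + phase), label="phase %d" % phase)
    ax.legend()


styles = ["ggplot", "fivethirtyeight", "bmh"]
rendered = {}
for name in styles:
    # style.context applies the stylesheet only inside the with-block,
    # whereas plt.style.use(name) changes the defaults globally.
    with plt.style.context(name):
        fig, ax = plt.subplots()
        make_plots(ax)
        rendered[name] = len(ax.lines)  # number of curves drawn under this style
        plt.close(fig)
```

Using `plt.style.context` rather than `plt.style.use` is a design choice for this sketch: it keeps each style from leaking into the next figure.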

  19. #SciPy2015
    Jake VanderPlas
    Seaborn: Matplotlib + Pandas +
    Statistical Visualization
    http://stanford.edu/~mwaskom/software/seaborn/
    - built on top of matplotlib: able to use any of
    its backends & output formats
    - pandas-aware: quick plotting of labeled data
    - provides beautiful, well-thought-out default
    plot styles

  20. #SciPy2015
    Jake VanderPlas
    Seaborn’s Matplotlib Style:
    (style available natively in next matplotlib release)
    In [7]: import seaborn; seaborn.set()
            make_plots()

  21. #SciPy2015
    Jake VanderPlas
    Bokeh: Powerful Interactive Viz
    http://bokeh.pydata.org/
    - HTML5 output, both server and client-side
    - Flexible in-browser interactivity
    - Fundamentally a Javascript library with
    Python bindings

  22. #SciPy2015
    Jake VanderPlas
    Recent Developments:
    Arrays & Data Structures

  23. #SciPy2015
    Jake VanderPlas
    Arrays and Data Structures
    xray implements numpy-style ND arrays with
    Pandas-style labels & indices.
    http://xray.readthedocs.org/
    Modern data is heterogeneous, noisy, and
    complicated. Anonymous dense arrays are no
    longer enough!

  24. #SciPy2015
    Jake VanderPlas
    Arrays and Data Structures
    Dask: a lightweight tool for general
    parallelized array storage and computation.
    http://dask.pydata.org/
    The project is still young, but the possibilities
    are very exciting!

  25. #SciPy2015
    Jake VanderPlas
    Recent Developments:
    Computation & Performance

  26. #SciPy2015
    Jake VanderPlas
    Computation & Performance:
    Numba: with a simple decorator, Python
    JIT compiles to LLVM and executes at near
    C/Fortran speed!
    http://numba.pydata.org/
    Still some features missing, but very promising
    (see my blog posts for some examples).

  27. #SciPy2015
    Jake VanderPlas
    Computation & Performance:
    Numba: with a simple decorator, Python
    JIT compiles to LLVM and executes at near
    C/Fortran speed!
    http://numba.pydata.org/
    Still some features missing, but very promising
    (see my blog posts for some examples).
    20x speedup!
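
A rough sketch of the "simple decorator" idea, assuming a pairwise-distance function as the workload (the deck does not show the code behind its speedup figure). The fallback `jit` defined in the `except` branch is a hypothetical shim so the example still runs, without any speedup, where Numba is not installed.

```python
import numpy as np

try:
    from numba import jit  # the real Numba decorator
except ImportError:
    def jit(*args, **kwargs):
        """Hypothetical no-op fallback: returns the function unchanged."""
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@jit(nopython=True)
def pairwise_distances(points):
    """Euclidean distance matrix written with explicit loops."""
    n, dim = points.shape
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(dim):
                diff = points[i, k] - points[j, k]
                total += diff * diff
            out[i, j] = np.sqrt(total)
    return out


points = np.random.RandomState(0).rand(50, 3)
distances = pairwise_distances(points)  # first call triggers compilation under Numba
```

Explicit nested loops like these are slow in the plain interpreter, which is why they are the kind of code where a JIT tends to help most; the actual speedup depends on the machine and the Numba version.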

  28. #SciPy2015
    Jake VanderPlas
    Recent Developments:
    Distribution & Packaging

  29. #SciPy2015
    Jake VanderPlas
    Distribution & Packaging:
    The conda distribution & packaging
    tool has changed the way many
    use, develop, & teach Python.
    - like pip, but better management of Python
    & non-Python dependencies
    - like virtualenv, but allows different
    versions of compiled libraries
    - Similar to yum / apt / macports / brew,
    but platform-independent!
    http://conda.pydata.org/

  30. #SciPy2015
    Jake VanderPlas
    And of course . . .

  31. #SciPy2015
    Jake VanderPlas
    IPython & Jupyter
    So much happening . . .
    - The IPython/Jupyter split
    - Widgets = awesome
    - Docker-based backends
    - Jupyter Hub
    - new $6M grant this week!
    Python stack is branching out
    to benefit other languages!

  32. #SciPy2015
    Jake VanderPlas
    And so much more . . .

  33. #SciPy2015
    Jake VanderPlas
    A Historical Perspective . . .

  34. #SciPy2015
    Jake VanderPlas
    Why Python?
    Python was created in
    the 1980s as a teaching
    language, and to “bridge
    the gap between the
    shell and C” [1]
    [1] Guido van Rossum, The Making of Python

  35. #SciPy2015
    Jake VanderPlas
    “I thought we'd write
    small Python programs,
    maybe 10 lines, maybe 50,
    maybe 500 lines — that
    would be a big one”
    Guido van Rossum, The Making of Python

  36. #SciPy2015
    Jake VanderPlas
    . . . why did a “toy language” become
    the core of a scientific stack?
    Python is not a scientific
    programming language!

  37. #SciPy2015
    Jake VanderPlas
    Pre-Python workflows:
    “I had a hodge-podge of work processes. I would have
    Perl scripts that called C++ numerical routines that would
    dump data files, and I would load them up into MatLab
    to plot them. After a while I got tired of the MatLab
    dependency… so I started loading them up in GnuPlot.”
    -John Hunter
    creator of Matplotlib
    SciPy 2012 Keynote

  38. #SciPy2015
    Jake VanderPlas
    Pre-Python workflows:
    “My advisor had a heavily customized awk/sed/bash
    workflow to manage job submissions and
    postprocessing of C codes for supercomputing runs…
    So I used her scripts to run my jobs, and on top of that
    had added my own layer of Perl, plus a hefty amount
    of Gnuplot, IDL and Mathematica.”
    - Fernando Perez
    creator of IPython
    via email

  39. #SciPy2015
    Jake VanderPlas
    Pre-Python workflows:
    “Prior to Python, I used Perl (for a year) and then
    Matlab and shell scripts & Fortran & C/C++ libraries.
    When I discovered Python, I really liked the
    language... But, it was very nascent and lacked a lot of
    libraries. I felt like I could add value to the world by
    connecting low-level libraries to high-level usage in
    Python.”
    - Travis Oliphant
    creator of NumPy & SciPy
    via email

  40. #SciPy2015
    Jake VanderPlas
    Python is not a scientific
    programming language!

  41. #SciPy2015
    Jake VanderPlas
    Python is not a scientific
    programming language!
    . . . Python is a glue.

  42. #SciPy2015
    Jake VanderPlas
    Python glues together this hodge-podge
    of scientific tools.

  43. #SciPy2015
    Jake VanderPlas
    Python glues together this hodge-podge
    of scientific tools.
    High-level syntax wraps low-level
    C/Fortran libraries, which is (mostly)
    where the computation happens.

  44. #SciPy2015
    Jake VanderPlas
    Python glues together this hodge-podge
    of scientific tools.
    High-level syntax wraps low-level
    C/Fortran libraries, which is (mostly)
    where the computation happens.
    It is speed of development, not
    necessarily speed of execution,
    that has driven Python’s popularity.
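
To make the "glue" point concrete, here is a small sketch assuming NumPy as the wrapper around compiled code. Both versions compute the same inner product; `dot_pure_python` is a hypothetical helper written for this illustration, and the array sizes are arbitrary.

```python
import numpy as np


def dot_pure_python(a, b):
    """Inner product computed element by element in the interpreter."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


rng = np.random.RandomState(42)
a = rng.rand(10000)
b = rng.rand(10000)

slow = dot_pure_python(a, b)  # every multiply-add goes through the Python interpreter
fast = np.dot(a, b)           # one high-level call; the loop runs in compiled code
```

The two results agree to floating-point precision. The `np.dot` call is usually far faster, but the point of the slide is that it is also a single readable line.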

  45. #SciPy2015
    Jake VanderPlas
    Why don’t you use C instead
    of Python? It’s so much faster!

  46. #SciPy2015
    Jake VanderPlas
    Why don’t you commute by
    airplane instead of by car? It’s
    so much faster!
    Why don’t you use C instead
    of Python? It’s so much faster!

  47. #SciPy2015
    Jake VanderPlas
    But this efficiency depends on the
    Scientific Stack . . .
    [Timeline figure, 1995 to 2015]

  48. #SciPy2015
    Jake VanderPlas
    [Timeline figure, 1995 to 2015: Numeric]
    1995: Numeric was an early
    Python scientific array library,
    largely written by Jim Hugunin

  49. #SciPy2015
    Jake VanderPlas
    [Timeline figure, 1995 to 2015: Numeric, Multipack]
    1998: Multipack, built on
    Numeric, was a set of wrappers
    of Fortran packages written by
    Travis Oliphant.

  50. #SciPy2015
    Jake VanderPlas
    [Timeline figure, 1995 to 2015: Numeric, Numarray, Multipack]
    2002: Numarray was created by Perry
    Greenfield, Paul Dubois, and others to
    address fundamental deficiencies in
    Numeric for larger datasets.

  51. #SciPy2015
    Jake VanderPlas
    [Timeline figure, 1995 to 2015: Numeric, Numarray, Multipack, Numpy]
    2006: In a Herculean effort to head off
    this split in the community, Travis
    Oliphant incorporated the best parts of
    Numeric + Numarray into NumPy

  52. #SciPy2015
    Jake VanderPlas
    [Timeline figure, 1995 to 2015: Numeric, Numarray, Numpy, Multipack, SciPy]
    2000: Eric Jones, Travis Oliphant, Pearu
    Peterson, and others spun multipack
    into the SciPy package, aiming for a full
    Python MatLab replacement.

  53. #SciPy2015
    Jake VanderPlas
    [Timeline figure, 1995 to 2015: Numeric, Numarray, Numpy, Multipack, SciPy, IPython]
    2001: Fernando Perez started the
    IPython project, aiming for a
    Mathematica-style environment
    for Scientific Python.

  54. #SciPy2015
    Jake VanderPlas
    [Timeline figure, 1995 to 2015: Numeric, Numarray, Numpy, Multipack, SciPy, IPython, Matplotlib]
    2002: John Hunter wanted an
    open MatLab replacement, and
    started Matplotlib as an effort at
    MatLab-style visualization.

  55. #SciPy2015
    Jake VanderPlas
    [Timeline figure, 1995 to 2015: Numeric, Numarray, Numpy, Multipack, SciPy, IPython, Notebook, Matplotlib]
    2012: The IPython team
    released the IPython
    Notebook, and the world has
    never been the same

  56. #SciPy2015
    Jake VanderPlas
    [Timeline figure, 1995 to 2015: Numeric, Numarray, Numpy, Multipack, SciPy, IPython, Notebook, Matplotlib, Pandas]
    2009: Wes McKinney began Pandas,
    eventually drawing in a much larger
    Python user base, especially in
    industry data science.

  57. #SciPy2015
    Jake VanderPlas
    [Timeline figure, 1995 to 2015: Numeric, Numarray, Numpy, Multipack, SciPy, IPython, Notebook, Matplotlib, Scikits, Pandas]
    2009: With SciPy’s sheer size making fast
    development difficult, the community
    decided to promote “scikits” as an
    avenue for more specialized algorithms.

  58. #SciPy2015
    Jake VanderPlas
    [Timeline figure, 1995 to 2015: Numeric, Numarray, Numpy, Multipack, SciPy, IPython, Notebook, Matplotlib, Scikits, Pandas, Conda]
    2012: Continuum releases
    conda, a package manager
    for scientific computing.

  59. #SciPy2015
    Jake VanderPlas
    (Python as glue: you’re not
    doing scientific computing in
    Python, you’re using python to
    glue together tools in Fortran
    or C)
    Scientific Python is Federated . . .

  61. #SciPy2015
    Jake VanderPlas

  62. #SciPy2015
    Jake VanderPlas

  63. #SciPy2015
    Jake VanderPlas
    Early on, developers were simply
    writing tools to solve their problems . . .
    [Diagram: Computation, Visualization, Shell]

  64. #SciPy2015
    Jake VanderPlas
    It took deliberate collaboration to arrive
    at today’s coherent ecosystem:
    [Diagram: Shell, Computation, Visualization]

  65. #SciPy2015
    Jake VanderPlas
    Toward the Future

  66. #SciPy2015
    Jake VanderPlas
    Lessons Learned
    1. No centralized leadership! What is “core” in the
    ecosystem evolves & is up to the community.

  67. #SciPy2015
    Jake VanderPlas
    Evolving Computational Core: Numba?

  68. #SciPy2015
    Jake VanderPlas
    Evolving Computational Core: Numba?
    Just as Cython matured to become a core piece,
    perhaps Numba might as well? How might a
    JIT-enabled scipy, sklearn, pandas, etc. look?

  69. #SciPy2015
    Jake VanderPlas
    Lessons Learned
    1. No centralized leadership! What is “core” in the
    ecosystem evolves & is up to the community.
    2. To be most useful as an ecosystem, we must be willing
    for packages to adapt to the changing landscape.

  70. #SciPy2015
    Jake VanderPlas
    Evolving Computational Core: Pandas?
    Modern data is sparse, heterogeneous, and labeled, and
    NumPy arrays don’t measure up: Let’s make Pandas a
    core dependency!

  72. #SciPy2015
    Jake VanderPlas
    Imagining Pandas-enabled Matplotlib:
    Old Way:
    plt.plot(data['time'], data['temperature'])
    plt.xlabel('time')
    plt.legend(['temperature'])

  73. #SciPy2015
    Jake VanderPlas
    Imagining Pandas-enabled Matplotlib:
    Old Way:
    plt.plot(data['time'], data['temperature'])
    plt.xlabel('time')
    plt.legend(['temperature'])
    New Way?
    plt.plot('time', 'temperature', data=data)
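
Matplotlib releases after this talk added a `data` keyword that behaves much like the "New Way?" line. The sketch below assumes such a release and uses a plain dict with made-up values as the labeled container, to keep the example free of a pandas dependency; a pandas DataFrame can be passed as `data` in the same way.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt

# Illustrative values only; any mapping with string keys works as `data`.
data = {"time": [0, 1, 2, 3], "temperature": [20.5, 21.0, 19.8, 22.3]}

fig, ax = plt.subplots()
# The strings are looked up as keys in `data` rather than passed as arrays.
(line,) = ax.plot("time", "temperature", data=data)
ax.set_xlabel("time")
ax.legend()

x_values = list(line.get_xdata())
y_values = list(line.get_ydata())
plt.close(fig)
```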

  74. #SciPy2015
    Jake VanderPlas
    Evolving Computational Core
    With Pandas as a core dependency, what elements of
    Seaborn & Pandas could be moved into matplotlib?

  75. #SciPy2015
    Jake VanderPlas
    Evolving the core: SciPy
    SciPy’s monolithic design was driven by
    packaging & distribution difficulties.
    With conda, do we still need a single SciPy
    package? Should it be split up into smaller
    packages?
    “In 2001… we came out with what we called the
    SciPy library, but it was really the SciPy
    distribution. I didn’t realize it at the time, but that’s
    really what it was: a collection of tools with a
    single installer, so you get everything up and
    running quickly.”
    - Travis Oliphant, EuroSciPy 2014 Keynote

  76. #SciPy2015
    Jake VanderPlas
    Lessons Learned
    1. No centralized leadership! What is “core” in the
    ecosystem evolves & is up to the community.
    2. To be most useful as an ecosystem, we must be willing
    for packages to adapt to the changing landscape.
    3. Interoperability with core pieces of other languages has
    been key to the success of the SciPy stack (e.g.
    C/Fortran libraries, new Jupyter framework).

  77. #SciPy2015
    Jake VanderPlas
    Universal Plotting Serialization?
    Much of modern interactive plotting (d3, HTML5, Bokeh,
    ggvis, mpld3, etc.) involves generating & processing
    plot serializations.

  78. #SciPy2015
    Jake VanderPlas
    Universal Plotting Serialization?
    Much of modern interactive plotting (d3, HTML5, Bokeh,
    ggvis, mpld3, etc.) involves generating & processing
    plot serializations.
    Doing this natively in matplotlib
    would open up extensibility!
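
One way to picture a "plot serialization" is a plain-data description of a figure that another renderer could consume. The sketch below is hypothetical: `figure_to_dict` and its JSON layout are invented for illustration and are not a Matplotlib feature or the format used by mpld3 or Bokeh.

```python
import json

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt


def figure_to_dict(fig):
    """Hypothetical serializer: capture axis labels and line data as plain dicts."""
    return {
        "axes": [
            {
                "xlabel": ax.get_xlabel(),
                "ylabel": ax.get_ylabel(),
                "lines": [
                    {
                        "label": line.get_label(),
                        "x": [float(v) for v in line.get_xdata()],
                        "y": [float(v) for v in line.get_ydata()],
                    }
                    for line in ax.get_lines()
                ],
            }
            for ax in fig.axes
        ]
    }


fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0.0, 1.0, 4.0], label="quadratic")
ax.set_xlabel("x")
ax.set_ylabel("y")

serialized = json.dumps(figure_to_dict(fig))  # text that could be sent to a browser
restored = json.loads(serialized)             # what a JavaScript or other client would read
plt.close(fig)
```

A real format would also need to carry styles, transforms, and non-line artists, which is where most of the difficulty lies.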

  79. #SciPy2015
    Jake VanderPlas
    Universal DataFrames?
    C/Fortran
    Memory Blocks

  80. #SciPy2015
    Jake VanderPlas
    Universal DataFrames?
    C/Fortran
    Memory Blocks
    Dataframe.jl
    DataFrame

  81. #SciPy2015
    Jake VanderPlas
    Universal DataFrames?
    C/Fortran
    Memory Blocks
    Über Dataframe?

  82. #SciPy2015
    Jake VanderPlas
    Lessons Learned
    1. No centralized leadership! What is “core” in the
    ecosystem evolves & is up to the community.
    2. To be most useful as an ecosystem, we must be willing
    for packages to adapt to the changing landscape.
    3. Interoperability with core pieces of other languages has
    been key to the success of the SciPy stack (esp.
    C/Fortran libraries, new Jupyter framework).
    4. The stack was built from both continuity (e.g.
    Numeric/Numarray → NumPy) and brand-new efforts (e.g.
    matplotlib, Pandas). Don’t discount either approach!

  83. #SciPy2015
    Jake VanderPlas
    Considering the Future of Matplotlib
    Usual complaints about Matplotlib:
    - Non-optimal stylistic defaults
    - Non-optimal API
    - Difficulty exporting interactive plots
    - Difficulty with large datasets

  84. #SciPy2015
    Jake VanderPlas
    Considering the Future of Matplotlib
    Usual complaints about Matplotlib:
    - Non-optimal stylistic defaults
    - Non-optimal API
    - Difficulty exporting interactive plots
    - Difficulty with large datasets
    Matplotlib 2.0!

  85. #SciPy2015
    Jake VanderPlas
    Considering the Future of Matplotlib
    Usual complaints about Matplotlib:
    - Non-optimal stylistic defaults
    - Non-optimal API
    - Difficulty exporting interactive plots
    - Difficulty with large datasets
    Matplotlib 2.0!
    Seaborn, ggplot!

  86. #SciPy2015
    Jake VanderPlas
    Considering the Future of Matplotlib
    Usual complaints about Matplotlib:
    - Non-optimal stylistic defaults
    - Non-optimal API
    - Difficulty exporting interactive plots
    - Difficulty with large datasets
    Matplotlib 2.0!
    Seaborn, ggplot!
    Serialization to
    mpld3/Bokeh/etc?

  88. #SciPy2015
    Jake VanderPlas
    Lesson from Numeric/Numarray, etc.:
    Stick with matplotlib & modify it!
    (e.g. serialization to VisPy? Numba-driven
    backend? new backend architecture? etc.)
    Lesson from Pandas & Matplotlib, etc.:
    Start something from scratch; features
    will draw users! (e.g. VisPy, Bokeh,
    Something new?)

  89. #SciPy2015
    Jake VanderPlas
    The State of the Stack is up to You.

  90. #SciPy2015
    Jake VanderPlas
    ~ Thank You! ~
    Email: [email protected]
    Twitter: @jakevdp
    Github: jakevdp
    Web: http://vanderplas.com/
    Blog: http://jakevdp.github.io/

  91. #SciPy2015
    Jake VanderPlas

  92. #SciPy2015
    Jake VanderPlas
    (Python as glue: you’re not
    doing scientific computing in
    Python, you’re using python to
    glue together tools in Fortran
    or C)
    “We were trying to plug the major holes in the Python
    world that made it inferior to Matlab for scientific
    computing... [Plotting] was probably the biggest obvious
    gap between Python and Matlab, so we thought that this
    would be a central piece of SciPy.”
    - Eric Jones via email
    “[SciPy] was intended to be a Matlab replacement and
    so needed libraries, plus, plotting, and a shell to work in,
    along with tools to integrate with C/C++.”
    - Travis Oliphant via email
    “... from the very beginning, Mathematica and its notebooks
    (and the Maple worksheets before) were in my mind as the
    ideal environment for daily scientific work.”
    - Fernando Perez
    The IPython Notebook: A Historical Retrospective

  93. #SciPy2015
    Jake VanderPlas
    Photo courtesy of Fernando Perez
