Slide 1

Slide 1 text

#SciPy2015 Jake VanderPlas The State of the Stack Jake VanderPlas @jakevdp SciPy Keynote; July 10, 2015

Slide 3

Slide 3 text

This Talk . . .
- Where we are
- Where we’ve come from
- Where we’re going

Slide 4

Slide 4 text

Python’s Scientific Ecosystem

Slide 8

Slide 8 text

Python’s Scientific Ecosystem (and many, many more)

Slide 9

Slide 9 text

Many more tools:
- Performance: Numba, Weave, Numexpr, Theano . . .
- Visualization: Bokeh, Seaborn, Plotly, Chaco, mpld3, ggplot, MayaVi, vincent, toyplot, HoloViews . . .
- Data Structures & Computation: Blaze, Dask, DistArray, XRay, Graphlab, SciDBpy, pySpark . . .
- Packaging & distribution: pip/wheels, conda, EPD, Canopy, Anaconda . . .

Slide 10

Slide 10 text

Recent-ish Developments

Slide 11

Slide 11 text

Recent Developments: Core Language

Slide 12

Slide 12 text

Python 3: If you haven’t switched, it’s time. I’ll just leave this right here . . .

Slide 13

Slide 13 text

Recent Developments: Visualization

Slide 14

Slide 14 text

Matplotlib: Evolving into a More Modern Package
- Matplotlib 2.0 will break backward compatibility to provide new plot style defaults! (See the State of Matplotlib talk for more details.)
- Matplotlib 1.4 features stylesheets, with several very nice built-in styles.

Slide 15

Slide 15 text

New Matplotlib Styles:
In [3]: make_plots()

Slide 16

Slide 16 text

New Matplotlib Styles:
In [4]: plt.style.use('ggplot')
        make_plots()

Slide 17

Slide 17 text

New Matplotlib Styles:
In [5]: plt.style.use('fivethirtyeight')
        make_plots()

Slide 18

Slide 18 text

New Matplotlib Styles:
In [6]: plt.style.use('bmh')
        make_plots()
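The style switches shown on these slides can be tried with a short script. This is only a sketch: the talk never shows the body of `make_plots()`, so the version here is a hypothetical stand-in, and the `Agg` backend is an assumption made so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np


def make_plots():
    """Hypothetical stand-in for the talk's make_plots() helper."""
    x = np.linspace(0, 10, 200)
    fig, ax = plt.subplots(1, 2, figsize=(8, 3))
    for phase in range(4):
        ax[0].plot(x, np.sin(x + phase), label=f"phase {phase}")
    ax[0].legend()
    ax[1].hist(np.random.RandomState(0).randn(1000), bins=30)
    return fig


# plt.style.use() changes the defaults globally; plt.style.context() scopes
# the change to a with-block, which is convenient for comparing styles.
figures = {}
for style in ["ggplot", "fivethirtyeight", "bmh"]:
    with plt.style.context(style):
        figures[style] = make_plots()
```

Calling `figures["ggplot"].savefig("ggplot.png")` (the file name is arbitrary) writes one of the styled figures to disk for inspection.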

Slide 19

Slide 19 text

Seaborn: Matplotlib + Pandas + Statistical Visualization
http://stanford.edu/~mwaskom/software/seaborn/
- built on top of matplotlib: able to use any of its backends & output formats
- pandas-aware: quick plotting of labeled data
- provides beautiful, well-thought-out default plot styles

Slide 20

Slide 20 text

Seaborn’s Matplotlib Style:
In [7]: import seaborn; seaborn.set()
        make_plots()
(style available natively in next matplotlib release)

Slide 21

Slide 21 text

Bokeh: Powerful Interactive Viz
http://bokeh.pydata.org/
- HTML5 output, both server and client-side
- Flexible in-browser interactivity
- Fundamentally a JavaScript library with Python bindings

Slide 22

Slide 22 text

Recent Developments: Arrays & Data Structures

Slide 23

Slide 23 text

Arrays and Data Structures
xray implements NumPy-style ND arrays with Pandas-style labels & indices.
http://xray.readthedocs.org/
Modern data is heterogeneous, noisy, and complicated. Anonymous dense arrays are no longer enough!

Slide 24

Slide 24 text

Arrays and Data Structures
Dask: a lightweight tool for general parallelized array storage and computation.
http://dask.pydata.org/
The project is still young, but the possibilities are very exciting!

Slide 25

Slide 25 text

Recent Developments: Computation & Performance

Slide 26

Slide 26 text

Computation & Performance:
Numba: with a simple decorator, Python code is JIT-compiled via LLVM and executes at near C/Fortran speed!
http://numba.pydata.org/
Still some features missing, but very promising (see my blog posts for some examples).

Slide 27

Slide 27 text

Computation & Performance:
Numba: with a simple decorator, Python code is JIT-compiled via LLVM and executes at near C/Fortran speed!
http://numba.pydata.org/
Still some features missing, but very promising (see my blog posts for some examples).
20x speedup!
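To make the "simple decorator" concrete, here is a minimal sketch of a loop-heavy function compiled with Numba. The function `pairwise_distances` and the test data are illustrative assumptions, not the benchmark behind the slide's speedup figure, and the `try`/`except` fallback is added here only so the script still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for portability: without Numba, fall back to plain Python.
    def njit(func):
        return func


@njit
def pairwise_distances(X):
    """Euclidean distance between every pair of rows, written as plain loops."""
    n, d = X.shape
    D = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(d):
                diff = X[i, k] - X[j, k]
                total += diff * diff
            D[i, j] = np.sqrt(total)
    return D


X = np.random.RandomState(0).rand(50, 3)
D = pairwise_distances(X)  # the first call triggers compilation when Numba is present
```

Any speedup depends heavily on the function and the data, so it is worth timing the decorated and undecorated versions on your own workload.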

Slide 28

Slide 28 text

Recent Developments: Distribution & Packaging

Slide 29

Slide 29 text

Distribution & Packaging:
The conda distribution & packaging tool has changed the way many use, develop, & teach Python.
- like pip, but with better management of Python & non-Python dependencies
- like virtualenv, but allows different versions of compiled libraries
- similar to yum / apt / macports / brew, but platform-independent!
http://conda.pydata.org/

Slide 30

Slide 30 text

And of course . . .

Slide 31

Slide 31 text

IPython & Jupyter
So much happening . . .
- The IPython/Jupyter split
- Widgets = awesome
- Docker-based backends
- Jupyter Hub
- new $6M grant this week!
Python stack is branching out to benefit other languages!

Slide 32

Slide 32 text

And so much more . . .

Slide 33

Slide 33 text

A Historical Perspective . . .

Slide 34

Slide 34 text

Why Python?
Python was created in the 1980s as a teaching language, and to “bridge the gap between the shell and C”
- Guido van Rossum, The Making of Python

Slide 35

Slide 35 text

“I thought we'd write small Python programs, maybe 10 lines, maybe 50, maybe 500 lines — that would be a big one”
- Guido van Rossum, The Making of Python

Slide 36

Slide 36 text

Python is not a scientific programming language!
. . . why did a “toy language” become the core of a scientific stack?

Slide 37

Slide 37 text

Pre-Python workflows:
“I had a hodge-podge of work processes. I would have Perl scripts that called C++ numerical routines that would dump data files, and I would load them up into MatLab to plot them. After a while I got tired of the MatLab dependency… so I started loading them up in GnuPlot.”
- John Hunter, creator of Matplotlib, SciPy 2012 Keynote

Slide 38

Slide 38 text

Pre-Python workflows:
“My advisor had a heavily customized awk/sed/bash workflow to manage job submissions and postprocessing of C codes for supercomputing runs… So I used her scripts to run my jobs, and on top of that had added my own layer of Perl, plus a hefty amount of Gnuplot, IDL and Mathematica.”
- Fernando Perez, creator of IPython, via email

Slide 39

Slide 39 text

Pre-Python workflows:
“Prior to Python, I used Perl (for a year) and then Matlab and shell scripts & Fortran & C/C++ libraries. When I discovered Python, I really liked the language... But, it was very nascent and lacked a lot of libraries. I felt like I could add value to the world by connecting low-level libraries to high-level usage in Python.”
- Travis Oliphant, creator of NumPy & SciPy, via email

Slide 40

Slide 40 text

Python is not a scientific programming language!

Slide 41

Slide 41 text

Python is not a scientific programming language!
. . . Python is a glue.

Slide 42

Slide 42 text

Python glues together this hodge-podge of scientific tools.

Slide 43

Slide 43 text

Python glues together this hodge-podge of scientific tools.
High-level syntax wraps low-level C/Fortran libraries, which is (mostly) where the computation happens.

Slide 44

Slide 44 text

Python glues together this hodge-podge of scientific tools.
High-level syntax wraps low-level C/Fortran libraries, which is (mostly) where the computation happens.
It is speed of development, not necessarily speed of execution, that has driven Python’s popularity.
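A small sketch of the "glue" idea, using NumPy as the wrapper: the same dot product written as an interpreted Python loop and as a single call that hands the work to compiled code (typically a BLAS routine, depending on how NumPy was built). The helper name `dot_pure_python` and the input arrays are illustrative assumptions.

```python
import numpy as np


def dot_pure_python(a, b):
    """Dot product computed element by element in the Python interpreter."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


a = np.linspace(0.0, 1.0, 10_000)
b = np.linspace(1.0, 2.0, 10_000)

slow = dot_pure_python(a, b)   # every multiply-add goes through the interpreter
fast = float(np.dot(a, b))     # one high-level call; the loop runs in compiled code
```

Both paths give the same number up to floating-point rounding; the difference is where the loop executes.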

Slide 45

Slide 45 text

Why don’t you use C instead of Python? It’s so much faster!

Slide 46

Slide 46 text

Why don’t you use C instead of Python? It’s so much faster!
Why don’t you commute by airplane instead of by car? It’s so much faster!

Slide 47

Slide 47 text

But this efficiency depends on the Scientific Stack . . .
[Timeline, 1995 to 2015]

Slide 48

Slide 48 text

1995: Numeric was an early Python scientific array library, largely written by Jim Hugunin.
[Timeline, 1995 to 2015: Numeric]

Slide 49

Slide 49 text

1998: Multipack, built on Numeric, was a set of wrappers of Fortran packages written by Travis Oliphant.
[Timeline, 1995 to 2015: Numeric, Multipack]

Slide 50

Slide 50 text

2002: Numarray was created by Perry Greenfield, Paul Dubois, and others to address fundamental deficiencies in Numeric for larger datasets.
[Timeline, 1995 to 2015: Numeric, Numarray, Multipack]

Slide 51

Slide 51 text

2006: In a Herculean effort to head off this split in the community, Travis Oliphant incorporated the best parts of Numeric + Numarray into NumPy.
[Timeline, 1995 to 2015: Numeric, Numarray, NumPy, Multipack]

Slide 52

Slide 52 text

2000: Eric Jones, Travis Oliphant, Pearu Peterson, and others spun Multipack into the SciPy package, aiming for a full Python MATLAB replacement.
[Timeline, 1995 to 2015: Numeric, Numarray, NumPy, Multipack, SciPy]

Slide 53

Slide 53 text

2001: Fernando Perez started the IPython project, aiming for a Mathematica-style environment for Scientific Python.
[Timeline, 1995 to 2015: Numeric, Numarray, NumPy, Multipack, SciPy, IPython]

Slide 54

Slide 54 text

2002: John Hunter wanted an open MATLAB replacement, and started Matplotlib as an effort at MATLAB-style visualization.
[Timeline, 1995 to 2015: Numeric, Numarray, NumPy, Multipack, SciPy, IPython, Matplotlib]

Slide 55

Slide 55 text

2012: The IPython team released the IPython Notebook, and the world has never been the same.
[Timeline, 1995 to 2015: Numeric, Numarray, NumPy, Multipack, SciPy, IPython, Notebook, Matplotlib]

Slide 56

Slide 56 text

2009: Wes McKinney began Pandas, eventually drawing in a much larger Python user base, especially in industry data science.
[Timeline, 1995 to 2015: Numeric, Numarray, NumPy, Multipack, SciPy, IPython, Notebook, Matplotlib, Pandas]

Slide 57

Slide 57 text

2009: With SciPy’s sheer size making fast development difficult, the community decided to promote “scikits” as an avenue for more specialized algorithms.
[Timeline, 1995 to 2015: Numeric, Numarray, NumPy, Multipack, SciPy, IPython, Notebook, Matplotlib, Scikits, Pandas]

Slide 58

Slide 58 text

2012: Continuum releases conda, a package manager for scientific computing.
[Timeline, 1995 to 2015: Numeric, Numarray, NumPy, Multipack, SciPy, IPython, Notebook, Matplotlib, Scikits, Pandas, Conda]

Slide 59

Slide 59 text

Scientific Python is Federated . . .
(Python as glue: you’re not doing scientific computing in Python, you’re using Python to glue together tools in Fortran or C)

Slide 63

Slide 63 text

Early on, developers were simply writing tools to solve their problems . . .
[Diagram labels: Computation, Visualization, Shell]

Slide 64

Slide 64 text

It took deliberate collaboration to arrive at today’s coherent ecosystem:
[Diagram labels: Shell, Computation, Visualization]

Slide 65

Slide 65 text

Toward the Future

Slide 66

Slide 66 text

Lessons Learned
1. No centralized leadership! What is “core” in the ecosystem evolves & is up to the community.

Slide 67

Slide 67 text

Evolving Computational Core: Numba?

Slide 68

Slide 68 text

Evolving Computational Core: Numba?
Just as Cython matured to become a core piece, perhaps Numba might as well?
How might a JIT-enabled scipy, sklearn, pandas, etc. look?

Slide 69

Slide 69 text

Lessons Learned
1. No centralized leadership! What is “core” in the ecosystem evolves & is up to the community.
2. To be most useful as an ecosystem, we must be willing for packages to adapt to the changing landscape.

Slide 70

Slide 70 text

Evolving Computational Core: Pandas?
Modern data is sparse, heterogeneous, and labeled, and NumPy arrays don’t measure up: Let’s make Pandas a core dependency!

Slide 72

Slide 72 text

Imagining Pandas-enabled Matplotlib:
Old Way:
plt.plot(data['time'], data['temperature'])
plt.xlabel('time')
plt.legend(['temperature'])

Slide 73

Slide 73 text

Imagining Pandas-enabled Matplotlib:
Old Way:
plt.plot(data['time'], data['temperature'])
plt.xlabel('time')
plt.legend(['temperature'])
New Way?
plt.plot('time', 'temperature', data=data)
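Matplotlib releases after this talk accept a `data` keyword in `plot`, so the proposed call can be compared directly with the old one. The sketch below assumes such a release plus pandas; the small `data` frame and the `Agg` backend are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical labeled data standing in for the slide's `data` object.
data = pd.DataFrame({
    "time": np.arange(10),
    "temperature": 20.0 + 0.5 * np.arange(10),
})

fig, (ax_old, ax_new) = plt.subplots(1, 2, figsize=(8, 3))

# Old way: pull the columns out by hand and repeat their names for the legend.
ax_old.plot(data["time"], data["temperature"])
ax_old.set_xlabel("time")
ax_old.legend(["temperature"])

# New way: pass column names as strings and let matplotlib look them up in `data`.
(line,) = ax_new.plot("time", "temperature", data=data)
ax_new.set_xlabel("time")
ax_new.legend()  # the legend entry comes from the column name
```

The axis label still has to be set by hand in both versions; only the data lookup and the legend label are taken from the column names.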

Slide 74

Slide 74 text

Evolving Computational Core
With Pandas as a core dependency, what elements of Seaborn & Pandas could be moved into matplotlib?

Slide 75

Slide 75 text

Evolving the core: SciPy
SciPy’s monolithic design was driven by packaging & distribution difficulties. With conda, do we still need a single SciPy package? Should it be split up into smaller packages?
“In 2001… we came out with what we called the SciPy library, but it was really the SciPy distribution. I didn’t realize it at the time, but that’s really what it was: a collection of tools with a single installer, so you get everything up and running quickly.”
- Travis Oliphant, EuroSciPy 2014 Keynote

Slide 76

Slide 76 text

Lessons Learned
1. No centralized leadership! What is “core” in the ecosystem evolves & is up to the community.
2. To be most useful as an ecosystem, we must be willing for packages to adapt to the changing landscape.
3. Interoperability with core pieces of other languages has been key to the success of the SciPy stack (e.g. C/Fortran libraries, new Jupyter framework).

Slide 77

Slide 77 text

Universal Plotting Serialization?
Much of modern interactive plotting (d3, HTML5, Bokeh, ggvis, mpld3, etc.) involves generating & processing plot serializations.

Slide 78

Slide 78 text

Universal Plotting Serialization?
Much of modern interactive plotting (d3, HTML5, Bokeh, ggvis, mpld3, etc.) involves generating & processing plot serializations.
Doing this natively in matplotlib would open up extensibility!

Slide 79

Slide 79 text

Universal DataFrames?
[Diagram labels: C/Fortran Memory Blocks]

Slide 80

Slide 80 text

Universal DataFrames?
[Diagram labels: C/Fortran Memory Blocks, Dataframe.jl, DataFrame]

Slide 81

Slide 81 text

Universal DataFrames?
[Diagram labels: C/Fortran Memory Blocks, Über Dataframe?]

Slide 82

Slide 82 text

Lessons Learned
1. No centralized leadership! What is “core” in the ecosystem evolves & is up to the community.
2. To be most useful as an ecosystem, we must be willing for packages to adapt to the changing landscape.
3. Interoperability with core pieces of other languages has been key to the success of the SciPy stack (esp. C/Fortran libraries, new Jupyter framework).
4. The stack was built from both continuity (e.g. Numeric/Numarray → NumPy) and brand-new efforts (e.g. matplotlib, Pandas). Don’t discount either approach!

Slide 83

Slide 83 text

Considering the Future of Matplotlib
Usual complaints about Matplotlib:
- Non-optimal stylistic defaults
- Non-optimal API
- Difficulty exporting interactive plots
- Difficulty with large datasets

Slide 84

Slide 84 text

Considering the Future of Matplotlib
Usual complaints about Matplotlib:
- Non-optimal stylistic defaults
- Non-optimal API
- Difficulty exporting interactive plots
- Difficulty with large datasets
Matplotlib 2.0!

Slide 85

Slide 85 text

Considering the Future of Matplotlib
Usual complaints about Matplotlib:
- Non-optimal stylistic defaults
- Non-optimal API
- Difficulty exporting interactive plots
- Difficulty with large datasets
Matplotlib 2.0!
Seaborn, ggplot!

Slide 86

Slide 86 text

Considering the Future of Matplotlib
Usual complaints about Matplotlib:
- Non-optimal stylistic defaults
- Non-optimal API
- Difficulty exporting interactive plots
- Difficulty with large datasets
Matplotlib 2.0!
Seaborn, ggplot!
Serialization to mpld3/Bokeh/etc?

Slide 88

Slide 88 text

Lesson from Numeric/Numarray, etc.: Stick with matplotlib & modify it! (e.g. serialization to VisPy? Numba-driven backend? new backend architecture? etc.)
Lesson from Pandas & Matplotlib, etc.: Start something from scratch; features will draw users! (e.g. VisPy, Bokeh, something new?)

Slide 89

Slide 89 text

The State of the Stack is up to You.

Slide 90

Slide 90 text

~ Thank You! ~
Email: [email protected]
Twitter: @jakevdp
Github: jakevdp
Web: http://vanderplas.com/
Blog: http://jakevdp.github.io/

Slide 92

Slide 92 text

(Python as glue: you’re not doing scientific computing in Python, you’re using Python to glue together tools in Fortran or C)
“We were trying to plug the major holes in the Python world that made it inferior to Matlab for scientific computing... [Plotting] was probably the biggest obvious gap between Python and Matlab, so we thought that this would be a central piece of SciPy.”
- Eric Jones, via email
“[SciPy] was intended to be a Matlab replacement and so needed libraries, plus, plotting, and a shell to work in, along with tools to integrate with C/C++.”
- Travis Oliphant, via email
“... from the very beginning, Mathematica and its notebooks (and the Maple worksheets before) were in my mind as the ideal environment for daily scientific work.”
- Fernando Perez, The IPython Notebook: A Historical Retrospective

Slide 93

Slide 93 text

Photo courtesy of Fernando Perez