Slide 1

Slide 1 text

A journey through the scientific python ecosystem David Cournapeau @cournape

Slide 2

Slide 2 text

• Notes: • This presentation draws heavily on “The unexpected effectiveness of Python in science” by Jake VanderPlas

Slide 3

Slide 3 text

Who am I • I am David Cournapeau, cournape on Twitter/GitHub/Stack Overflow

Slide 4

Slide 4 text

Where I come from • Strasbourg, France

Slide 5

Slide 5 text

Me on the internet • Code • (mostly in the past)

Slide 6

Slide 6 text

Me at work • Cogent Labs: https://www.cogent.co.jp • We apply AI/Deep Learning to difficult business problems: • Handwriting recognition (tegaki.ai) • Language understanding (kaidoku) • Time series analysis (finance, etc.) • We are hiring: experienced software engineers, ML engineers, and research scientists in DL/statistics

Slide 7

Slide 7 text

A bit of history

Slide 8

Slide 8 text

My journey to Python • Started using Python around 2005 for audio processing • Heavy Matlab user at that time • Hit the limitations of Matlab/C integration • Built a hodgepodge of Matlab, C and Python, with HDF5 for data transfer • Python was easy to integrate with C and had libraries to parse XML, read audio files, build complex GUIs, etc.

Slide 10

Slide 10 text

This was typical “Scientists... work with a wide variety of systems ranging from simulation codes, data analysis packages, databases, visualization tools, and home-grown software, each of which presents the user with a different set of interfaces and file formats. As a result, a scientist may spend a considerable amount of time simply trying to get all of these components to work together in some manner...” David Beazley, “Scientific Computing with Python”, in ASP Conf. Ser., Vol. 216, ADASS

Slide 11

Slide 11 text

Python as a glue language • Because Python could replace bash and sed/awk, and also call into other programs, it became an increasingly popular glue language in the 1990s • It was also “easy” to interface with C and Fortran libraries • But Python was not the only such language: Perl, Tcl/Tk, GNU Guile and Ruby filled the same role • Something else needed to happen
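
As a rough illustration of the glue role described above, the sketch below launches another program, captures its output, and filters it with a regular expression, the kind of job that might otherwise go to a shell pipeline. The child command (simply another Python process, chosen so the example runs anywhere), its output format, and the helper name `run_and_extract` are assumptions made for illustration, not something from the talk.

```python
import re
import subprocess
import sys


def run_and_extract(command, pattern):
    """Run an external command and return all regex matches found in its stdout."""
    completed = subprocess.run(
        command, capture_output=True, text=True, check=True
    )
    return re.findall(pattern, completed.stdout)


# Hypothetical "external tool": a child Python process printing some results.
child = [sys.executable, "-c", "print('run=1 loss=0.52'); print('run=2 loss=0.31')"]

# Roughly what a shell pipeline with sed or awk would do: keep only the numbers.
losses = [float(value) for value in run_and_extract(child, r"loss=([0-9.]+)")]
print(losses)
```

Passing the command as a list avoids shell quoting issues, and `check=True` turns a failing child process into a Python exception instead of a silent error.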

Slide 12

Slide 12 text

Array computing • At the lowest level, much scientific work is numerical computation • These computations need to be efficient • In the 1990s, people worked on array computing in Python (the matrix-sig) • The Matrix package by Jim Fulton was extended by Jim Hugunin and became Numeric • Paul Dubois, Konrad Hinsen, David Ascher, Travis Oliphant and others continued that work • “Grand unification” into NumPy around 2005

Slide 13

Slide 13 text

“Exploratory computing” • IPython was started around 2000 by Fernando Perez: a Python shell optimized for exploratory scientific work • Matplotlib was started around 2000 by the late John Hunter

Slide 14

Slide 14 text

Mentions of software in astronomy publications From The unexpected effectiveness of python in science by Jake VanderPlas

Slide 15

Slide 15 text

Python as a language for science • Its main strengths come from being a general-purpose programming language • It benefits from a large community beyond scientists • These are also its main weaknesses: • Not integrated (no “Python IDE with everything in it”) • Can be confusing for newcomers

Slide 16

Slide 16 text

Python in science today • [Figure: Python’s scientific ecosystem: Bokeh and many, many more] • From “The unexpected effectiveness of Python in science” by Jake VanderPlas

Slide 17

Slide 17 text

A brief tour

Slide 18

Slide 18 text

Installing Python 1. Use what your colleagues use 2. Otherwise, use one of the available binary distributions: Anaconda, Canopy, Python(x,y), etc. 3. People with more experience at the command line: `python -m pip install --user …`
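
Whichever installation route is used, a quick sanity check is to ask Python which of the packages covered in this talk can be imported. The sketch below is one way to do that with the standard library only; the package list and the helper name `available_packages` are assumptions made for illustration.

```python
import importlib.util


def available_packages(names):
    """Map each package name to True if it can be found on the import path."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names, which can differ from the pip names (scikit-learn -> sklearn).
packages = ["numpy", "pandas", "matplotlib", "sklearn", "numba"]
status = available_packages(packages)
for name, found in status.items():
    print("{:<12} {}".format(name, "ok" if found else "missing"))
```

This only checks that a package can be located, not that it imports cleanly or which version is installed.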

Slide 19

Slide 19 text

Pandas: “Excel in Python”
• pandas is a library for labeled data: ideal for time series, CSV files, data cleaning, etc.

    import numpy as np
    import pandas as pd

    # Two normally distributed columns; the second is shifted by 5
    df = pd.DataFrame({
        "normal_1": np.random.randn(1024),
        "normal_2": np.random.randn(1024) + 5,
    })
    # One histogram per column
    df.hist(bins=50)

Slide 20

Slide 20 text

Pandas: example

Sample rows of the input table:

    ED     normalized_ED  count  Error_distribution                                 field_type  field_name
    1.057  0.150          174    [0.575 0.144 0.132 0.046 0.034 0.040 0.029 0.000]  sentence    form1/fields/69
    0.914  0.344          174    [0.316 0.500 0.155 0.017 0.006 0.006 0.000 0.000]  sentence    form1/fields/31

Code:

    import pandas as pd

    # read_table expects tab-separated columns by default
    df = pd.read_table("report.txt")
    print("columns: {}".format(", ".join(df.columns)))
    print("Total count: {}".format(df["count"].sum()))
    # Average of normalized_ED, weighted by the per-field count
    weighted_mean = (df["count"] * df["normalized_ED"]).sum() / df["count"].sum()
    print("Average of normalized ED: {}".format(weighted_mean))

Output:

    columns: ED, normalized_ED, count, Error_distribution, field_type, field_name
    Total count: 7479
    Average of normalized ED: 0.17594825511432008
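
The average printed above is weighted by `count`, so fields with more samples contribute more. A minimal sketch of the same calculation on just the two rows shown on the slide (the full `report.txt` is not reproduced here) follows; `np.average` with `weights` is an alternative spelling assumed for illustration, not the code from the slide.

```python
import numpy as np
import pandas as pd

# Only the two rows visible on the slide, not the full report.
df = pd.DataFrame({
    "normalized_ED": [0.150, 0.344],
    "count": [174, 174],
})

# Explicit formula, as on the slide: sum(count * value) / sum(count).
manual = (df["count"] * df["normalized_ED"]).sum() / df["count"].sum()

# Same quantity via NumPy's weighted average.
weighted = np.average(df["normalized_ED"], weights=df["count"])

print(manual, weighted)
```

Because both rows have the same count here, the weighted average coincides with the plain mean of the two values; with unequal counts the two would differ.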

Slide 21

Slide 21 text

When to use pandas • Use pandas when you need to munge / plot data quickly • It can often replace simple uses of Excel (plotting, pivoting, aggregation, etc.) in a more manageable way
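
To make the spreadsheet comparison concrete, here is a small sketch of aggregation and pivoting, two of the use cases listed above. The sales table, its column names, and its values are made up for illustration.

```python
import pandas as pd

# Made-up data standing in for a small spreadsheet.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [10.0, 12.0, 7.0, 9.0],
})

# Aggregation: total revenue per region.
per_region = sales.groupby("region")["revenue"].sum()

# Pivoting: one row per region, one column per quarter.
pivot = sales.pivot_table(
    index="region", columns="quarter", values="revenue", aggfunc="sum"
)

print(per_region)
print(pivot)
```

Unlike a spreadsheet, the whole transformation is a script that can be rerun on new data and kept under version control.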

Slide 22

Slide 22 text

Numerical computations
• NumPy: the backbone of scientific computing in Python
• Provides the ndarray object for efficient manipulation of data arrays

    import numpy as np

    x = np.random.randn(1024)
    y = 0.1 * np.random.randn(1024) + 5
    # for every 0 <= i < 1024, z[i] = x[i] + y[i]
    z = x + y
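
A short sketch of what element-wise arithmetic on `ndarray` objects looks like in practice, including broadcasting of a scalar and of a 1-D array against a 2-D array. The array contents are arbitrary illustrative values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

# Element-wise addition: z[i] = x[i] + y[i], with no explicit Python loop.
z = x + y

# Broadcasting: the scalar is applied to every element.
scaled = 0.5 * x

# Broadcasting a 1-D array against each row of a 2-D array.
grid = np.zeros((2, 3)) + x

print(z.dtype, z.shape)
print(grid)
```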

Slide 23

Slide 23 text

Vectorization
• The key to good performance in NumPy is vectorization
• If vectorization is too difficult, look at numba or cython

    import numpy as np

    def naive_version(x, y):
        # Explicit Python loop: one interpreted iteration per element
        s = 0
        for i in range(len(x)):
            s += x[i] * y[i]
        return s

    def numpy_version(x, y):
        # Vectorized: the multiply and the sum run in compiled code
        return np.sum(x * y)

    x = np.random.randn(int(1e6))
    y = np.random.randn(int(1e6))

    In [6]: %timeit naive_version(x, y)
    276 ms ± 8.95 ms per loop (mean ± std. dev.)

    In [7]: %timeit numpy_version(x, y)
    3.01 ms ± 51.5 µs per loop (mean ± std. dev.)

NumPy is ~90x faster!
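
As a further sketch of the same idea, the loop and the vectorized expression can be checked against each other, and the reduction can also be written as a dot product. The smaller array size and the fixed seed are choices made here so the check runs quickly and reproducibly; no timings are claimed.

```python
import numpy as np


def naive_version(x, y):
    """Sum of products computed with an explicit Python loop."""
    s = 0.0
    for i in range(len(x)):
        s += x[i] * y[i]
    return s


rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = rng.standard_normal(10_000)

loop_result = naive_version(x, y)
vectorized_result = np.sum(x * y)
dot_result = np.dot(x, y)  # same reduction, without the temporary x * y array

# The three results agree up to floating-point rounding.
print(np.allclose([vectorized_result, dot_result], loop_result))
```

Checking a vectorized rewrite against the straightforward loop on a small input is a cheap way to catch indexing or axis mistakes before trusting the faster version.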

Slide 24

Slide 24 text

When to use NumPy • It is the common array data structure used by most scientific libraries • If you are new to Python, deal with time series, come from R, or use Excel a lot: start with pandas • If you are more experienced and/or doing numerical computing, machine learning, etc.: maybe start with NumPy
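
The two starting points are not exclusive: for plain numeric data it is easy to move between a pandas DataFrame and a NumPy array. A minimal sketch, with made-up column names and values:

```python
import numpy as np
import pandas as pd

# Made-up labeled data.
df = pd.DataFrame({"height": [1.0, 2.0, 3.0], "width": [4.0, 5.0, 6.0]})

# pandas -> NumPy: drop the labels, keep the numbers.
values = df.to_numpy()

# NumPy functions also accept pandas columns directly.
mean_height = np.mean(df["height"])

# NumPy -> pandas: attach labels to a plain array.
doubled = pd.DataFrame(2 * values, columns=df.columns)

print(values.shape, mean_height)
print(doubled)
```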

Slide 25

Slide 25 text

Matplotlib
• Initially designed as a replacement for Matlab's plotting

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 10, 1000)
    # Noisy sinusoid
    y = np.sin(x) + 0.1 * np.random.randn(len(x))
    plt.plot(x, y)

Slide 26

Slide 26 text

Visualization with pandas
• pandas provides shortcuts for simple plots through matplotlib

    import pandas as pd

    data = pd.read_csv('iris.csv')
    # Scatter plot of two columns of the DataFrame
    data.plot.scatter('PetalLength', 'PetalWidth')

Slide 27

Slide 27 text

Seaborn
• Seaborn is built on top of matplotlib, for statistical plots

    import pandas as pd
    import seaborn

    data = pd.read_csv('iris.csv')
    # Pairwise scatter plots of all numeric columns, colored by the 'Name' column
    seaborn.pairplot(data, hue='Name')

Slide 28

Slide 28 text

Other visualization libraries • Visualization landscape is changing rapidly • I am not a specialist in viz • Recent libraries focus on: • visualization of large datasets • web-based interfaces • Examples: bokeh, plotly, plotnine

Slide 29

Slide 29 text

Bokeh: interactive plotting in the browser From bokeh website

Slide 30

Slide 30 text

Plotly: modern platform for data science From plotly website

Slide 31

Slide 31 text

scikit-learn: machine learning in Python • Provides many recent machine learning algorithms under a common API • Appropriate for many classification problems • Can be used for unsupervised learning (e.g. clustering) as well • Some algorithms also have online versions (for out-of-core computation, see also dask) • But: • Purposely does not handle complex neural networks (no GPU support, etc.) • The API does not fit every ML problem
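
The “common API” point can be sketched as follows: different estimators are constructed differently but trained and queried through the same `fit` / `predict` / `score` methods. The choice of estimators, the synthetic data, and the seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = 10 * rng.random(200)
y = np.sin(x) + 0.1 * rng.standard_normal(200)
X = x[:, np.newaxis]  # scikit-learn expects a 2-D (n_samples, n_features) array

models = {
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "k-nearest neighbors": KNeighborsRegressor(n_neighbors=5),
    "support vector": SVR(),
}

scores = {}
for name, model in models.items():
    model.fit(X, y)                   # same training call for every estimator
    scores[name] = model.score(X, y)  # R^2 on the training data
    print("{:<20} R^2 = {:.3f}".format(name, scores[name]))
```

Scoring on the training data is done here only to show the shared interface; a real comparison would use held-out data.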

Slide 32

Slide 32 text

Scikit-learn: example

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestRegressor

    # Noisy sinusoid sampled at 100 random points
    x = 10 * np.random.rand(100)
    y = np.sin(x) + 0.1 * np.random.randn(100)

    model = RandomForestRegressor()
    # scikit-learn expects a 2-D input of shape (n_samples, n_features)
    model.fit(x[:, np.newaxis], y)

    xfit = np.linspace(-1, 11, 1000)
    yfit = model.predict(xfit[:, np.newaxis])

    plt.plot(x, y, '.k')
    plt.plot(xfit, yfit)

Slide 33

Slide 33 text

Scikit-learn: example

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.svm import SVR

    # Same data as before; only the model changes
    x = 10 * np.random.rand(100)
    y = np.sin(x) + 0.1 * np.random.randn(100)

    model = SVR()
    model.fit(x[:, np.newaxis], y)

    xfit = np.linspace(-1, 11, 1000)
    yfit = model.predict(xfit[:, np.newaxis])

    plt.plot(x, y, '.k')
    plt.plot(xfit, yfit)

Slide 34

Slide 34 text

Numba: accelerate numeric Python
• Numba can be used to optimize Python code
• More specialized than general JIT Python interpreters (e.g. PyPy)
• Designed to run within standard CPython

    import numpy as np
    import numba

    @numba.jit
    def naive_version(x, y):
        # Same explicit loop as before, now compiled by numba
        s = 0
        for i in range(len(x)):
            s += x[i] * y[i]
        return s

    def numpy_version(x, y):
        return np.sum(x * y)

    x = np.random.randn(int(1e6))
    y = np.random.randn(int(1e6))

    In [2]: %timeit numpy_version(x, y)
    1.73 ms ± 46.2 µs per loop (mean ± std. dev.)

    In [3]: %timeit naive_version(x, y)
    1.22 ms ± 29.1 µs per loop (mean ± std. dev.)

Faster than NumPy!

Slide 35

Slide 35 text

Misc • For performance, look at cython and numba • For statistics, look at statsmodels (e.g. time series models such as ARIMA) • For Deep Learning: a very rapidly changing ecosystem (tensorflow, keras, pytorch, dynet, etc.) • For image processing: scikit-image

Slide 36

Slide 36 text

Thank you! • On GitHub: https://github.com/cournape • On Twitter: https://twitter.com/cournape • I will be at the party tonight if you want to chat!