A journey through the scientific python ecosystem
David Cournapeau @cournape

• Notes:
• This presentation took a lot of inspiration from “the unexpected effectiveness of python in science” by Jake VanderPlas

Who am I
• I am David Cournapeau, cournape on twitter/github/stackoverflow

Where I come from
• Strasbourg, France

Me on the internet
• Code
• (mostly in the past)

Me at work
• Cogent Labs: https://www.cogent.co.jp
• We are applying AI/Deep Learning to difficult business problems:
• Handwriting recognition (tegaki.ai)
• Language understanding (kaidoku)
• Time series analysis (finance, etc.)
• We are hiring: experienced software engineers, ML engineers, Research Scientists in DL/statistics

A bit of history

My journey to python
• Started using python around 2005 for audio processing
• Heavy Matlab user at that time
• Hit limitations of matlab/C integration
• Built a hodgepodge of Matlab, C, python and hdf5 for data transfer
• Python was easy to integrate with C, and had libraries to parse XML, read audio files, build complex GUIs, etc.


This was typical
“Scientists... work with a wide variety of systems ranging from simulation codes, data analysis packages, databases, visualization tools, and home-grown software, each of which presents the user with a different set of interfaces and file formats. As a result, a scientist may spend a considerable amount of time simply trying to get all of these components to work together in some manner...”
David Beazley, Scientific Computing with Python, in ASP Conf. Ser., Vol. 216, ADASS

Python as a glue language
• As python could replace bash and sed/awk, and also call into other programs, it became an increasingly popular choice as a glue language in the 90s
• It was also “easy” to interface with C and Fortran libraries
• But python was not the only such language: Perl, Tcl/Tk, GNU Guile or Ruby
• Something else needed to happen

Array computing
• At the lowest level, lots of scientific work is about numerical computation
• It needs to be efficient
• People in the 90s worked on array computing in python (matrix-sig)
• Matrix package by Jim Fulton, extended by Jim Hugunin -> became Numeric
• Paul Dubois, Konrad Hinsen, David Ascher, Travis Oliphant and others continued that work
• “grand unification” into NumPy around 2005

“Exploratory computing”
• IPython started around 2000 by Fernando Perez: a python shell optimized for exploratory scientific work
• Matplotlib started around 2000 by the late John Hunter

Mentions of software in astronomy publications
From The unexpected effectiveness of python in science by Jake VanderPlas

Python as a language for
science
• Its main strengths come from being a general programming language
• Benefits from a large community outside of science
• These are also its main weaknesses:
• Not integrated (no “python IDE with everything in it”)
• Can be confusing for newcomers

Python in science today
[Figure: Python’s Scientific Ecosystem (and many, many more), including Bokeh]
From The unexpected effectiveness of python in science by Jake VanderPlas

A brief tour

Installing python
1. Use what your colleagues use
2. Otherwise, use one of the binary distributions available: anaconda, canopy, python(x, y), etc.
3. For people with more experience at the command line: `python -m pip install --user …`

Pandas: “excel in python”
• Pandas is a library for labeled data: ideal for time series, csv files, data cleaning, etc.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"normal_1": np.random.randn(1024),
     "normal_2": np.random.randn(1024) + 5})
df.hist(bins=50)

Pandas: example
ED     normalized_ED  count  Error_distribution                                 field_type  field_name
1.057  0.150          174    [0.575 0.144 0.132 0.046 0.034 0.040 0.029 0.000]  sentence    form1/fields/69
0.914  0.344          174    [0.316 0.500 0.155 0.017 0.006 0.006 0.000 0.000]  sentence    form1/fields/31

import pandas as pd

df = pd.read_table("report.txt")
print("columns: {}".format(", ".join(df.columns)))
print("Total count: {}".format(df["count"].sum()))
print("Average of normalized ED: {}".format(
    (df["count"] * df["normalized_ED"]).sum() / df["count"].sum()))

columns: ED, normalized_ED, count, Error_distribution, field_type, field_name
Total count: 7479
Average of normalized ED: 0.17594825511432008

When to use pandas
• Use pandas when you need to munge / plot data quickly
• Can often replace simple use cases of excel (plot,
pivoting, aggregation, etc.), but in a more manageable
manner
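To illustrate the excel-style use cases above (aggregation and pivoting), here is a minimal sketch; the column names and sales figures are made up for the example:

```python
import pandas as pd

# Hypothetical data: sales per region and quarter
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales": [100, 120, 90, 130],
})

# Aggregation: total sales per region
totals = df.groupby("region")["sales"].sum()

# Pivoting: regions as rows, quarters as columns
table = df.pivot(index="region", columns="quarter", values="sales")
```

A `table.plot.bar()` on the pivoted result then gives the kind of chart you would build with an excel pivot table.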

Numerical computations
• NumPy: the backbone of scientific computing in python
• Provides the ndarray object for efficient manipulation of data arrays

import numpy as np

x = np.random.randn(1024)
y = 0.1 * np.random.randn(1024) + 5
# for every 0 <= i < 1024, z[i] = x[i] + y[i]
z = x + y

Vectorization
• The key to good performance in NumPy is to use vectorization
• If vectorization is too difficult: look at numba or cython

import numpy as np

def naive_version(x, y):
    s = 0
    for i in range(len(x)):
        s += x[i] * y[i]
    return s

def numpy_version(x, y):
    return np.sum(x * y)

x = np.random.randn(int(1e6))
y = np.random.randn(int(1e6))

In [6]: %timeit naive_version(x, y)
276 ms ± 8.95 ms per loop (mean ± std. dev.)
In [7]: %timeit numpy_version(x, y)
3.01 ms ± 51.5 µs per loop (mean ± std. dev.)
NumPy version ~90x faster!

When to use NumPy
• The common data array structure used by most scientific libraries
• If you are new to python, deal with time series, come from R, or use excel a lot: start with pandas
• If you are more experienced, and/or doing numerical computing, machine learning, etc.: maybe start with NumPy

Matplotlib
• Was initially designed as a plotting replacement for Matlab

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000)
# Noisy sinusoid
y = np.sin(x) + 0.1 * np.random.randn(len(x))
plt.plot(x, y)

Visualization with pandas
• Pandas provides shortcuts for simple plots through matplotlib

import pandas as pd

data = pd.read_csv('iris.csv')
data.plot.scatter('PetalLength', 'PetalWidth')

Seaborn
• Seaborn is built on top of matplotlib, for statistical plots
import pandas as pd
import seaborn

data = pd.read_csv('iris.csv')
seaborn.pairplot(data, hue='Name')

Other visualization libraries
• Visualization landscape is changing rapidly
• I am not a specialist in viz
• Recent libraries focus on:
• visualization of large datasets
• web-based interfaces
• Examples: bokeh, plotly, plotnine

Bokeh: interactive plotting
in the browser
From bokeh website

Plotly: modern platform for
data science
From plotly website

scikit-learn: machine
learning in python
• Provides many recent Machine Learning algorithms under a common API
• Appropriate for many classification problems
• Can be used for unsupervised learning (e.g. clustering) as well
• Some algorithms also have online versions (for out-of-core computation; see also dask)
• But:
• Purposely does not handle complex neural networks (no GPU support, etc.)
• The API does not fit every ML problem

Scikit-learn: example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
x = 10 * np.random.rand(100)
y = np.sin(x) + 0.1 * np.random.randn(100)
model = RandomForestRegressor()
model.fit(x[:, np.newaxis], y)
xfit = np.linspace(-1, 11, 1000)
yfit = model.predict(xfit[:, np.newaxis])
plt.plot(x, y, '.k')
plt.plot(xfit, yfit)

Scikit-learn: example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR
x = 10 * np.random.rand(100)
y = np.sin(x) + 0.1 * np.random.randn(100)
model = SVR()
model.fit(x[:, np.newaxis], y)
xfit = np.linspace(-1, 11, 1000)
yfit = model.predict(xfit[:, np.newaxis])
plt.plot(x, y, '.k')
plt.plot(xfit, yfit)

Numba: accelerate numeric python
• Numba can be used to optimize python code
• More specialized than general JIT python interpreters (e.g. PyPy)
• Designed to run within standard CPython

import numpy as np
import numba

@numba.jit
def naive_version(x, y):
    s = 0
    for i in range(len(x)):
        s += x[i] * y[i]
    return s

def numpy_version(x, y):
    return np.sum(x * y)

x = np.random.randn(int(1e6))
y = np.random.randn(int(1e6))
In [2]: %timeit numpy_version(x, y)
1.73 ms ± 46.2 µs per loop (mean ± std. dev.)
In [3]: %timeit naive_version(x, y)
1.22 ms ± 29.1 µs per loop (mean ± std. dev.)
Faster than NumPy!

Misc
• For performance, look at cython and numba
• For statistics, look at statsmodels (e.g. time series models like ARIMA)
• For Deep Learning: a very rapidly changing ecosystem (tensorflow, keras, pytorch, dynet, etc.)
• For image processing: scikit-image

Thank you !
• On github: https://github.com/cournape
• On Twitter: https://twitter.com/cournape
• I will be there at the party tonight if you want to chat !