Introduction to scientific programming in python

Olivier Hervieu - PyCon JP - 2014/09/13

Why this talk?

• born on twitter
• not my initial proposal for pyconjp*!
* speakerdeck.com/ohe/shit-happens-dot-dot-dot-v2

You already are scientific programmers! What can I teach you?

What to expect?

• a tour of tools that can (must?) be used by the everyday scientific programmer
• some guidelines on how to industrialize your stack

A little about me

• software engineer at @tinyclues
• 10 years of experience, working with python every day for 6 years (I know, I’m old)
• first conference in Japan (yeah!)
• more at about.me/ohe
• slides can be found on speakerdeck.com/ohe

ipython

• ipython is a must-use (for every pythonista)
• if you don’t use it, install it now (pip install ipython)
• ipython provides:
  • a powerful interactive shell
  • a browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media (as described on their website)
  • easy-to-use, high-performance tools for parallel computing

• notebook mode supports literate programming and reproducible sessions
• a notebook stores chunks of python alongside their results and additional comments (HTML, LaTeX, Markdown)
• a notebook can be exported to various file formats

• ipython is the de facto standard for sharing python sessions -> see nbviewer.ipython.org
• the project is well maintained and very stable (no surprises when you upgrade your version)

numpy

• numpy provides a powerful N-dimensional array object
• methods on these arrays are fast because they rely on well-optimised libraries for linear algebra (BLAS, ATLAS, MKL)
• numpy is tolerant of python’s lists: most functions accept them where an array is expected
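The tolerance for plain python lists can be illustrated with a minimal sketch. The matrices and variable names below are made up for illustration; the only numpy calls used are np.dot and np.array.

```python
import numpy as np

# Plain nested python lists; the caller does no explicit conversion.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# np.dot converts list arguments to arrays internally and returns an ndarray.
product = np.dot(a, b)

# Mixing an existing array with a list also works:
# the list is broadcast across each row of the array.
shifted = np.array(a) + [10, 20]

print(type(product).__name__, product.tolist())
print(shifted.tolist())
```

The result is always an ndarray, so later steps get numpy's speed even if the input started life as a list.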

python vs numpy

    def matmult(a, b):
        # materialize the columns of b once: in Python 3, zip() is a one-shot iterator
        zip_b = list(zip(*b))
        return [[sum(ele_a * ele_b for ele_a, ele_b in zip(row_a, col_b))
                 for col_b in zip_b] for row_a in a]

    shape           matmult        np.dot     speedup
    (10, 10)        936µs          2µs        450x
    (100, 100)      693000µs       53µs       13000x
    (1000, 1000)    744000000µs    13900µs    53000x
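A comparison like the one above could be reproduced along these lines. This is only a sketch: the compare helper, the matrix sizes and the use of timeit are my assumptions, not the setup used for the figures on the slide, and the timings will differ on every machine.

```python
import timeit

import numpy as np


def matmult(a, b):
    """Pure-python matrix product of two nested lists."""
    cols_b = list(zip(*b))  # columns of b, reused for every row of a
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def compare(n, repeat=3):
    """Return (pure_python_seconds, numpy_seconds) for one n x n product."""
    rng = np.random.default_rng(0)
    a, b = rng.random((n, n)), rng.random((n, n))
    a_list, b_list = a.tolist(), b.tolist()
    # Both implementations must agree before timing them.
    assert np.allclose(matmult(a_list, b_list), np.dot(a, b))
    t_py = min(timeit.repeat(lambda: matmult(a_list, b_list), number=1, repeat=repeat))
    t_np = min(timeit.repeat(lambda: np.dot(a, b), number=1, repeat=repeat))
    return t_py, t_np


for n in (10, 50):
    t_py, t_np = compare(n)
    print(f"({n}, {n}): matmult {t_py * 1e6:.0f}µs, np.dot {t_np * 1e6:.0f}µs")
```

Taking the minimum over a few repeats reduces noise from other processes; larger sizes are left out here only because the pure-python version becomes very slow.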

• you don’t want to implement your own matrix multiplication method :-)
• numpy inherits from years of computer-based numerical analysis problem solving
• don’t believe benchmarks about python performance (who said Julia?)

scipy

• provides numerous numerical routines that run efficiently on top of numpy arrays, for:
  • optimization
  • signal processing
  • linear algebra
  • …
• also provides some convenient data structures, such as compressed sparse matrices and spatial data structures
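A small sketch of three of the areas listed above: optimization, a compressed sparse matrix and a spatial data structure. The toy function, matrix and points are invented for illustration; the scipy entry points used are optimize.minimize_scalar, sparse.csr_matrix and spatial.cKDTree.

```python
import numpy as np
from scipy import optimize, sparse
from scipy.spatial import cKDTree

# Optimization: minimize (x - 3)^2, whose minimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

# Compressed sparse row matrix: only the non-zero entries are stored.
dense = np.array([[0, 0, 1], [2, 0, 0], [0, 0, 3]])
csr = sparse.csr_matrix(dense)

# Spatial data structure: a k-d tree answering nearest-neighbour queries.
points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
tree = cKDTree(points)
distance, index = tree.query([0.9, 1.2])

print("argmin:", result.x)
print("stored non-zeros:", csr.nnz)
print("nearest point:", points[index], "at distance", distance)
```

All three routines take and return numpy arrays (or scalars), which is what makes them easy to chain together.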

• if you have already used some scikits (scikit-learn, scikit-image), you have already used scipy extensively
• in other words, scipy is a toolbox for mathematicians; it contains many hidden treasures for them
• for the programmer, the APIs are a bit harsh, as is the naming of methods (but this naming is totally explicit for mathematicians)

matplotlib

• The ultimate plotting library, rendering high-quality 2D and 3D plots for python (I think other languages are a bit jealous too ;)
• The API mimics the MATLAB one in many ways, easing the transition to python for MATLAB users
• Once again, no surprises: matplotlib is a very stable and mature project (expect one major release per year)
• I recommend watching “Introduction to Numpy and Matplotlib” (4 hours!) on youtube*
* https://www.youtube.com/watch?v=3Fp1zn5ao2M
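To give an idea of the MATLAB-like style, here is a minimal sketch using the pyplot interface. The curves, labels and the in-memory PNG buffer are illustrative choices; the Agg backend is selected only so the example runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

# figure, plot, xlabel, ylabel, title and legend all have MATLAB namesakes,
# and the "r-" / "b--" format strings follow the same convention.
plt.figure()
plt.plot(x, np.sin(x), "r-", label="sin(x)")
plt.plot(x, np.cos(x), "b--", label="cos(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.title("MATLAB-style plotting with pyplot")
plt.legend()

# Render to an in-memory PNG instead of a file on disk.
buffer = io.BytesIO()
plt.savefig(buffer, format="png")
plt.close()

print(len(buffer.getvalue()), "bytes of PNG data")
```

In a notebook the savefig step is usually unnecessary, since inline plots are displayed directly under the cell.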

scikit-learn

• scikit-learn is one of the numerous scikits that have been developed in recent years (there’s also scikit-image, scikits.statsmodels, etc.)
• it provides a ready-to-use environment to play with standard machine learning algorithms
• expect a very clean API
• the project is very active and has an awesome community
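The "very clean API" boils down to a few methods shared by every estimator: fit, predict and score. The sketch below assumes a recent scikit-learn (the model_selection module did not exist at the time of the talk); the choice of the iris dataset and of LogisticRegression is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every estimator follows the same pattern: construct, fit, predict.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# score() returns the mean accuracy on the held-out data.
accuracy = model.score(X_test, y_test)
print(f"accuracy: {accuracy:.2f}")
```

Swapping in another classifier normally means changing only the line that constructs the model.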

pandas

• fairly “new” project (open-sourced in 2009), but development has been really active since 2012
• data manipulation library based on numpy
• provides a DataFrame data structure with methods for easily accessing, merging/grouping and indexing data
• doesn’t play well (yet?) with scikits (there are some attempts, like sklearn-pandas)
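The DataFrame operations mentioned above (accessing, merging, grouping, indexing) can be sketched on two toy tables. The data and column names are invented for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cid"],
    "amount": [10.0, 25.0, 5.0, 40.0],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob", "cid"],
    "country": ["FR", "JP", "JP"],
})

# Accessing: boolean filtering on a column.
big_orders = orders[orders["amount"] > 8]

# Merging: SQL-like left join on the shared "customer" column.
merged = orders.merge(customers, on="customer", how="left")

# Grouping: total amount per country.
per_country = merged.groupby("country")["amount"].sum()

# Indexing: label-based lookup once a column is promoted to the index.
by_customer = customers.set_index("customer")
cid_country = by_customer.loc["cid", "country"]

print(per_country.to_dict(), cid_country, len(big_orders))
```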

DEMO TIME! (nbviewer.ipython.org/gist/ohe/9f8b0e5b872e217c2fd4)

Industrial-grade scientific python (lessons learned)

• numpy/scipy/scikit-learn rely on many low-level Fortran/C libraries such as BLAS, ATLAS, the Intel MKL…
• most of these libraries are shipped unoptimized by your favorite OS (well, this is not the case for Mac OS)
• you may want to re-compile these libraries
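Before re-compiling anything, it can help to check which BLAS/LAPACK build your numpy is actually linked against. A minimal sketch, using numpy's own show_config; the blas_report helper is a hypothetical name, and the exact layout of the report varies between numpy versions.

```python
import contextlib
import io

import numpy as np


def blas_report():
    """Capture the text printed by np.show_config() and return it as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()


report = blas_report()
print(report)
```

Names such as openblas, mkl or atlas in the report hint at an optimized build, but the wording depends on how numpy was packaged.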

• re-compiling is the (very) long way!
• we did that at tinyclues for two years; we’re now using a packaged python distribution. Some of them:
  • anaconda (powered by continuum analytics)
  • canopy (powered by enthought)

• sadly, these distributions come with yet another package management tool (conda, enpkg), which sometimes doesn’t play nicely with pip and/or virtualenv
• this adds a new step to this famous tweet about python package managers :)

We’re not done

• libraries for performance: numba, cython, …
• domain-specific libraries: sympy, nltk, …
• bindings: rpy2, …
• storage: pytables, …

Free (as in free beer)

• All these libraries come for free and are developed by passionate developers.
• Please, be grateful; help them!
  • by finding and filing bugs (we always love to see that our code is really used by someone)
  • by fixing bugs or giving a beer to developers
  • by supporting them financially
  • by hosting one of their sprints (if your office is big enough)

scikit-learn sprint hosted at tinyclues july 2014

ありがとう! (Thank you!)

Recommended

• API design for machine learning software: experiences from the scikit-learn project: http://arxiv.org/abs/1309.0238
• Programming Collective Intelligence: http://shop.oreilly.com/product/9780596529321.do
• PyData Channel on Vimeo: http://vimeo.com/pydata
