Slide 1

Slide 1 text

Introduction to scientific programming in python
Olivier Hervieu - Pycon JP - 2014/09/13

Slide 2

Slide 2 text

Why this talk?
• born on twitter
• not my initial proposal for pyconjp*!
* speakerdeck.com/ohe/shit-happens-dot-dot-dot-v2

Slide 3

Slide 3 text

You are already scientific programmers! What can I teach you?

Slide 4

Slide 4 text

What to expect?
• a tour of tools that can (must?) be used by the everyday scientific programmer
• some guidelines on how to industrialize your stack

Slide 5

Slide 5 text

A little about me
• software engineer at @tinyclues
• 10 years of experience, working every day with python for 6 years (I know, I’m old)
• first conference in Japan (yeah!)
• more: about.me/ohe
• slides can be found on speakerdeck.com/ohe

Slide 6

Slide 6 text

ipython

Slide 7

Slide 7 text

• ipython is a must-use (for every pythonista)
• if you don’t use it, install it now (pip install ipython)
• ipython provides:
  • a powerful interactive shell
  • a browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media (as described on their website)
  • easy-to-use, high-performance tools for parallel computing

Slide 8

Slide 8 text

• notebook mode supports literate programming and reproducible sessions
• a notebook stores chunks of python alongside their results and additional comments (HTML, LaTeX, Markdown)
• a notebook can be exported to various file formats

Slide 9

Slide 9 text

• ipython is the de facto standard for sharing python sessions (see nbviewer.ipython.org)
• the project is well maintained and very stable (no surprises when you upgrade your version)

Slide 10

Slide 10 text

numpy

Slide 11

Slide 11 text

• numpy provides a powerful N-dimensional array object
• methods on these arrays are fast because they rely on well-optimised linear algebra libraries (BLAS, ATLAS, MKL)
• numpy is tolerant of python’s lists: functions that expect an array also accept a list
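To make the points above concrete, here is a minimal sketch (not taken from the talk; the variable names are illustrative) showing a nested python list turned into an N-dimensional array, and a numpy function accepting plain lists directly.

```python
import numpy as np

# A plain nested python list...
rows = [[1.0, 2.0], [3.0, 4.0]]

# ...converts to a 2-dimensional array.
a = np.array(rows)
print(a.shape)  # (2, 2)
print(a.ndim)   # 2

# numpy functions also accept lists directly and convert them on the fly.
# Multiplying by this permutation matrix swaps the two columns.
product = np.dot(rows, [[0.0, 1.0], [1.0, 0.0]])
print(product.tolist())  # [[2.0, 1.0], [4.0, 3.0]]

# Arrays are not limited to two dimensions.
cube = np.zeros((2, 3, 4))
print(cube.ndim)  # 3
```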

Slide 12

Slide 12 text

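python vs numpy

    def matmult(a, b):
        # materialise the columns of b once: in python 3, zip() returns a
        # one-shot iterator that would be exhausted after the first row of a
        zip_b = list(zip(*b))
        return [[sum(ele_a * ele_b for ele_a, ele_b in zip(row_a, col_b))
                 for col_b in zip_b] for row_a in a]

shape           matmult         np.dot      speedup
(10, 10)        936µs           2µs         450x
(100, 100)      693000µs        53µs        13000x
(1000, 1000)    744000000µs     13900µs     53000x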
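A comparison of this kind can be reproduced with a sketch like the one below. The `compare` helper, the matrix sizes and the use of `timeit` are assumptions made for illustration, not the author's benchmark setup. Absolute timings depend heavily on the machine and on the BLAS library numpy is linked against, so no expected numbers are given.

```python
import timeit

import numpy as np


def matmult(a, b):
    """Pure-python matrix product on nested lists."""
    cols_b = list(zip(*b))  # columns of b, materialised once
    return [[sum(x * y for x, y in zip(row_a, col_b)) for col_b in cols_b]
            for row_a in a]


def compare(n, repeats=3):
    """Return the best-of-`repeats` time in seconds for matmult and np.dot
    on two random (n, n) matrices. Hypothetical helper for illustration."""
    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    a_list, b_list = a.tolist(), b.tolist()

    # Both implementations must agree before timing them.
    assert np.allclose(matmult(a_list, b_list), np.dot(a, b))

    t_python = min(timeit.repeat(lambda: matmult(a_list, b_list),
                                 number=1, repeat=repeats))
    t_numpy = min(timeit.repeat(lambda: np.dot(a, b),
                                number=1, repeat=repeats))
    return t_python, t_numpy


# Small sizes only, so the pure-python version finishes quickly.
for n in (10, 50):
    t_python, t_numpy = compare(n)
    print(f"({n}, {n}): matmult {t_python * 1e6:.0f}µs, "
          f"np.dot {t_numpy * 1e6:.0f}µs")
```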

Slide 13

Slide 13 text

• you don’t want to implement your own matrix multiplication :-)
• numpy inherits from years of experience in computer-based numerical analysis
• don’t believe benchmarks about python performance (who says Julia?)

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

scipy

Slide 16

Slide 16 text

• provides numerous numerical routines that run efficiently on top of numpy arrays, for:
  • optimization
  • signal processing
  • linear algebra
  • …
• also provides some convenient data structures, such as compressed sparse matrices and spatial data structures
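As a small illustration of two of the items above, the sketch below minimizes a simple function with `scipy.optimize.minimize` and stores a mostly-zero matrix in compressed sparse row (CSR) form. The objective function and the data are made up for the example.

```python
import numpy as np
from scipy import optimize, sparse


# Optimization: f(x, y) = (x - 1)^2 + (y + 2)^2 has its minimum at (1, -2).
def f(p):
    x, y = p
    return (x - 1.0) ** 2 + (y + 2.0) ** 2


result = optimize.minimize(f, x0=np.zeros(2))
print(result.x)  # close to [1, -2]

# Sparse data structure: only the non-zero entries are stored.
dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 0]])
m = sparse.csr_matrix(dense)
print(m.nnz)  # 2 stored values

# Sparse matrices interoperate with numpy arrays.
v = m.dot(np.array([1, 1, 1]))
print(v.tolist())  # [3, 4, 0]
```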

Slide 17

Slide 17 text

• if you have already used some scikits (scikit-learn, scikit-image), you have already used scipy extensively
• in other words, scipy is a toolbox for mathematicians; it contains many hidden treasures for them
• for the programmer, the APIs are a bit harsh, as is the naming of methods (but this naming is totally explicit for mathematicians)

Slide 18

Slide 18 text

matplotlib

Slide 19

Slide 19 text

• The ultimate plotting library, rendering high-quality 2D and 3D plots for python (I think other languages are a bit jealous too ;)
• The API mimics the MATLAB one in many ways, easing the transition to python for MATLAB users
• Once again, no surprises: matplotlib is a very stable and mature project (expect one major release per year)
• I recommend watching “Introduction to Numpy and Matplotlib” (4 hours!) on youtube*
* https://www.youtube.com/watch?v=3Fp1zn5ao2M
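A minimal plotting sketch, assuming a headless machine (hence the non-interactive "Agg" backend) and rendering to an in-memory PNG rather than a window. The curves and labels are arbitrary choices for the example.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no display required
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

# Create a figure with a single set of axes and draw two curves on it.
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), "--", label="cos(x)")  # "--" is a MATLAB-style format string
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("sine and cosine")
ax.legend()

# Render to PNG in memory; fig.savefig("plot.png") would write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
print(len(png_bytes), "bytes of PNG data")
```

In an ipython notebook the figure would typically be displayed inline instead of being saved to a buffer.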

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

scikit-learn

Slide 22

Slide 22 text

• scikit-learn is one of the numerous scikits developed over the last few years (there’s also scikit-image, scikits.statsmodels, etc.)
• it provides a ready-to-use environment to play with standard machine learning algorithms
• expect a very clean API
• the project is very active and has an awesome community
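The "very clean API" mostly means that every estimator exposes the same `fit` / `predict` / `score` methods, so models can be swapped without changing the surrounding code. The sketch below illustrates this on synthetic data; the two classifiers and the dataset are arbitrary choices, not examples from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: two well-separated clusters of 2D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(2.0, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Simple hold-out split: even rows for training, odd rows for testing.
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

# Two very different algorithms, driven through the same interface.
scores = {}
for model in (LogisticRegression(), KNeighborsClassifier(n_neighbors=3)):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)

print(scores)  # accuracy per model on the held-out rows
```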

Slide 23

Slide 23 text

pandas

Slide 24

Slide 24 text

• fairly “new” project (open-sourced in 2009), but development has been really active since 2012
• data manipulation library based on numpy
• provides a DataFrame data structure with methods for easily accessing, merging, grouping and indexing data
• doesn’t play well (yet?) with scikits (there are some attempts, like sklearn-pandas)
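The DataFrame operations listed above (indexing, grouping, merging) can be sketched as follows. The two tables and their column names are invented for the example.

```python
import pandas as pd

# Two small hypothetical tables sharing a "user" column.
visits = pd.DataFrame({"user": ["ann", "bob", "ann", "cid"],
                       "pages": [3, 5, 4, 1]})
users = pd.DataFrame({"user": ["ann", "bob", "cid"],
                      "country": ["JP", "FR", "JP"]})

# Indexing: keep only the rows matching a boolean condition.
heavy = visits[visits["pages"] >= 4]
print(len(heavy))  # 2

# Grouping: total pages per user.
pages_per_user = visits.groupby("user")["pages"].sum()
print(pages_per_user["ann"])  # 7

# Merging: join the two tables on "user", then aggregate per country.
merged = visits.merge(users, on="user")
pages_per_country = merged.groupby("country")["pages"].sum()
print(pages_per_country.to_dict())  # {'FR': 5, 'JP': 8}
```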

Slide 25

Slide 25 text

DEMO TIME! (nbviewer.ipython.org/gist/ohe/9f8b0e5b872e217c2fd4)

Slide 26

Slide 26 text

Industrial-grade scientific python (lessons learned)

Slide 27

Slide 27 text

• numpy/scipy/scikit-learn rely on many low-level Fortran/C libraries such as BLAS, ATLAS, the Intel MKL…
• most of these libraries are shipped unoptimized by your favorite OS (well, this is not the case for Mac OS)
• you may want to re-compile these libraries
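Before re-compiling anything, it can help to check which BLAS/LAPACK libraries your numpy build is linked against. A minimal sketch using `numpy.show_config()`, which prints the build configuration; the exact output format varies between numpy versions, so it is captured here as plain text rather than parsed.

```python
import contextlib
import io

import numpy as np

# show_config() prints to stdout; capture it so it can be logged or searched.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    np.show_config()
config_text = buffer.getvalue()

print(config_text)

# A rough, case-insensitive check for some well-known optimized libraries.
# The list of names is an assumption for illustration, not an exhaustive one.
lowered = config_text.lower()
for name in ("mkl", "openblas", "atlas", "accelerate"):
    print(name, "mentioned" if name in lowered else "not mentioned")
```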

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

• re-compiling is the (very) long way!
• we did that at tinyclues for two years; we’re now using a packaged python distribution. Some of them:
  • anaconda (powered by continuum analytics)
  • canopy (powered by enthought)

Slide 30

Slide 30 text

• sadly, these distributions come with their own package management tools (conda, enpkg), which sometimes don’t play nice with pip and/or virtualenv
• this adds a new step to this famous tweet about python package managers :)

Slide 31

Slide 31 text

We’re not done
• libraries for performance: numba, cython, …
• domain-specific libraries: sympy, nltk, …
• bindings: rpy2, …
• storage: pytables, …

Slide 32

Slide 32 text

Free (as in free beer)
• All these libraries come for free and are developed by passionate developers.
• Please, be grateful; help them!
  • by finding and filing bugs (we always love to see that our code is really used by someone)
  • by fixing bugs or giving a beer to developers
  • by supporting them financially
  • by hosting one of their sprints (if your office is big enough)

Slide 33

Slide 33 text

scikit-learn sprint hosted at tinyclues, July 2014

Slide 34

Slide 34 text

ありがとう! (Thank you!)

Slide 35

Slide 35 text

Recommended
• API design for machine learning software: experiences from the scikit-learn project: http://arxiv.org/abs/1309.0238
• Programming Collective Intelligence: http://shop.oreilly.com/product/9780596529321.do
• PyData Channel on Vimeo: http://vimeo.com/pydata