Introduction to scientific programming in python

Olivier Hervieu
September 13, 2014

Talk given at PyCon JP, 2014/09/13

Transcript

  1. 2.

    Why this talk?
    • born on twitter
    • not my initial proposal for pyconjp* !
    * speakerdeck.com/ohe/shit-happens-dot-dot-dot-v2
  2. 4.

    What to expect?
    • tour of tools that can (must?) be used by the everyday scientific programmer
    • some guidelines on how to industrialize your stack
  3. 5.

    A little about me
    • software engineer at @tinyclues
    • 10 years of experience, working with python every day for 6 years (I know, I’m old)
    • first conference in japan (yeah!)
    • more: about.me/ohe
    • slides can be found on speakerdeck.com/ohe
  4. 6.
  5. 7.

    • ipython is a must-use (for every pythonista)
    • if you don’t use it, install it now (pip install ipython)
    • ipython provides:
      • a powerful interactive shell
      • a browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media (as described on their website)
      • easy to use, high performance tools for parallel computing
  6. 8.

    • notebook mode supports literate programming and reproducible sessions
    • a notebook lets you store chunks of python alongside their results and additional comments (HTML, LaTeX, Markdown)
    • a notebook can be exported to various file formats
  7. 9.

    • ipython is the de facto standard for sharing python sessions -> see nbviewer.ipython.org
    • the project is well maintained and very stable (no surprises when you upgrade your version)
  8. 10.
  9. 11.

    • numpy provides a powerful N-dimensional array object
    • methods on these arrays are fast because they rely on well-optimised libraries for linear algebra (BLAS, ATLAS, MKL)
    • numpy is tolerant of python’s lists (most functions accept them in place of arrays)
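
A minimal sketch of the three points on this slide, assuming numpy is installed. The array values and variable names are made up for illustration.

```python
import numpy as np

# an N-dimensional array, here 2-D, built from nested python lists
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape, a.ndim)

# numpy functions accept plain python lists and convert them on the fly
product = np.dot([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)

# whole-array operations run in compiled code, without explicit python loops
doubled = a * 2
column_sums = a.sum(axis=0)
print(doubled)
print(column_sums)
```

The call to `np.dot` on two lists returns a numpy array, which is what "tolerant of python's lists" looks like in practice.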
  10. 12.

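    python vs numpy

    def matmult(a, b):
        # build the columns of b once; list() is needed on Python 3,
        # where zip() returns an iterator that is exhausted after one pass
        zip_b = list(zip(*b))
        return [[sum(ele_a * ele_b for ele_a, ele_b in zip(row_a, col_b))
                 for col_b in zip_b] for row_a in a]

    shape           matmult        np.dot     speedup
    (10, 10)        936µs          2µs        450x
    (100, 100)      693000µs       53µs       13000x
    (1000, 1000)    744000000µs    13900µs    53000x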
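
The comparison on this slide can be re-measured with a sketch like the one below. It is not the benchmark used for the talk: the matrix sizes, the random inputs, the `compare` helper and the use of `timeit` are assumptions chosen to keep the run short, and the timings will depend on the machine and on the BLAS that numpy is linked against.

```python
import timeit

import numpy as np


def matmult(a, b):
    """Pure-python matrix product of two lists of lists."""
    cols_b = list(zip(*b))  # columns of b, built once and reused for every row
    return [[sum(x * y for x, y in zip(row_a, col_b)) for col_b in cols_b]
            for row_a in a]


def compare(n, repeat=3):
    """Return (python_seconds, numpy_seconds) for one n x n product."""
    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    a_list, b_list = a.tolist(), b.tolist()

    # both implementations must agree before the timings mean anything
    assert np.allclose(matmult(a_list, b_list), np.dot(a, b))

    t_python = min(timeit.repeat(lambda: matmult(a_list, b_list),
                                 number=1, repeat=repeat))
    t_numpy = min(timeit.repeat(lambda: np.dot(a, b),
                                number=1, repeat=repeat))
    return t_python, t_numpy


for n in (10, 50):
    t_python, t_numpy = compare(n)
    print(f"({n}, {n}): matmult {t_python * 1e6:.0f}µs, "
          f"np.dot {t_numpy * 1e6:.0f}µs")
```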
  11. 13.

    • you don’t want to implement your own matrix multiplication method :-)
    • numpy builds on years of computer-based numerical analysis problem solving
    • don’t believe benchmarks about python performance (who said Julia?)
  12. 14.
  13. 15.
  14. 16.

    • provides numerous numerical routines that run efficiently on top of numpy arrays for:
      • optimization
      • signal processing
      • linear algebra
      • …
    • also provides some convenient data structures, such as compressed sparse matrices and spatial data structures
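
A short, hedged tour of three of the areas named on this slide, assuming scipy is installed. The toy function, matrix and points are invented for illustration; the calls used (`optimize.minimize_scalar`, `sparse.csr_matrix`, `spatial.cKDTree`) are only a small sample of what scipy offers.

```python
import numpy as np
from scipy import optimize, sparse
from scipy.spatial import cKDTree

# optimization: find the minimum of a one-dimensional function
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)
print("minimum found near x =", result.x)

# compressed sparse row matrix: only the non-zero entries are stored
dense = np.array([[0, 0, 1], [0, 2, 0], [0, 0, 0]])
csr = sparse.csr_matrix(dense)
print("stored non-zero entries:", csr.nnz)

# spatial data structure: nearest-neighbour lookup with a k-d tree
points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
tree = cKDTree(points)
distance, index = tree.query([0.9, 1.2])
print("nearest point:", points[index], "at distance", distance)
```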
  15. 17.

    • if you have already used some scikits (scikit-learn, scikit-image), you have already used scipy extensively
    • in other words, scipy is a toolbox for mathematicians; it contains many hidden treasures for them
    • for the programmer, the APIs are a bit harsh, as is the naming of methods (but this naming is totally explicit for mathematicians)
  16. 19.

    • The ultimate plotting library that renders high-quality 2D and 3D plots for python (I think other languages are a bit jealous too ;)
    • The API mimics the MATLAB one in many ways, easing the transition of MATLAB users to python
    • Once again, no surprises: matplotlib is a very stable and mature project (expect one major release per year)
    • I recommend watching “Introduction to Numpy and Matplotlib” (4 hours!) on youtube*
    * https://www.youtube.com/watch?v=3Fp1zn5ao2M
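
To give an idea of the MATLAB-like API mentioned on this slide, here is a minimal sketch using the `pyplot` interface. It assumes matplotlib and numpy are installed; the `Agg` backend is selected only so the script runs without a display, and the output file name is hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

# pyplot keeps a "current figure", much like MATLAB does
plt.figure()
plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), "--", label="cos(x)")
plt.xlabel("x")
plt.ylabel("value")
plt.title("sin and cos")
plt.legend()

# hypothetical output location, chosen for this example
output_path = os.path.join(tempfile.gettempdir(), "sin_cos.png")
plt.savefig(output_path)
plt.close()
print("plot written to", output_path)
```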
  17. 20.
  18. 22.

    • scikit-learn is one of the numerous scikits that have been developed in recent years (there’s also scikit-image, statsmodels (formerly scikits.statsmodels) etc…)
    • it provides a ready-to-use environment to play with standard machine learning algorithms
    • expect a very clean API
    • the project is very active and has an awesome community
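
The "very clean API" boils down to estimators that share the same `fit` / `predict` / `score` methods. A minimal sketch, assuming a reasonably recent scikit-learn is installed; the dataset, the choice of `LogisticRegression` and the split parameters are arbitrary picks for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# a small dataset bundled with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# every estimator follows the same pattern: construct, fit, then predict/score
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Swapping in another classifier usually means changing only the import and the constructor line.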
  19. 23.
  20. 24.

    • fairly “new” project (open-sourced in 2009), but development has been really active since 2012
    • data manipulation library based on Numpy
    • provides a DataFrame data structure that offers methods for accessing, merging/grouping and indexing data easily
    • doesn’t play well (yet?) with scikits (there are some attempts, like sklearn-pandas)
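
The DataFrame operations listed on this slide (accessing, merging, grouping, indexing) can be sketched as follows, assuming pandas is installed. The two small tables and their column names are invented for the example.

```python
import pandas as pd

# two small, made-up tables
sales = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Paris", "Tokyo"],
    "amount": [10.0, 20.0, 30.0, 40.0],
})
cities = pd.DataFrame({
    "city": ["Paris", "Tokyo"],
    "country": ["France", "Japan"],
})

# accessing: boolean selection of rows
tokyo_sales = sales[sales["city"] == "Tokyo"]

# grouping: total amount per city
totals = sales.groupby("city")["amount"].sum()

# merging: attach the country to every sale
merged = sales.merge(cities, on="city")

# indexing: look a value up by label
country = cities.set_index("city").loc["Tokyo", "country"]

print(tokyo_sales)
print(totals)
print(merged)
print(country)
```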
  21. 27.

    • numpy/scipy/scikit-learn rely on many low-level Fortran/C libraries such as BLAS, ATLAS, the Intel MKL…
    • most of these libraries are shipped unoptimized by your favorite OS (well, this is not the case for Mac OS)
    • you may want to re-compile these libraries
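
Before re-compiling anything, it can help to check which BLAS/LAPACK a given numpy build is linked against. A minimal sketch using `numpy.show_config()`; the `numpy_build_report` helper is hypothetical, and the report format differs between numpy versions, so no output is shown here.

```python
import contextlib
import io

import numpy as np


def numpy_build_report():
    """Capture what np.show_config() prints and return it as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()


report = numpy_build_report()
print(report)
```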
  22. 28.
  23. 29.

    • re-compiling is the (very) long way!
    • we did that at tinyclues for two years; we’re now using a packaged python distribution. Some of them:
      • anaconda (powered by continuum analytics)
      • canopy (powered by enthought)
  24. 30.

    • sadly, these distributions come with yet another package management tool (conda, enpkg), which sometimes doesn’t play nicely with pip and/or virtualenv
    • adds a new step to this famous tweet about python package managers :)
  25. 31.

    We’re not done
    • libraries for performance: numba, cython, …
    • domain-specific libraries: sympy, nltk, …
    • bindings: rpy2, …
    • storage: pytables, …
  26. 32.

    Free (as in free beer)
    • All these libraries come for free and are developed by passionate developers.
    • Please, be grateful; help them!
      • by finding and filing bugs (we always love to see that our code is really used by someone)
      • by fixing bugs or giving a beer to developers
      • by supporting them financially
      • by hosting one of their sprints (if your office is big enough)
  27. 35.

    Recommended
    • API design for machine learning software: experiences from the scikit-learn project: http://arxiv.org/abs/1309.0238
    • Programming Collective Intelligence: http://shop.oreilly.com/product/9780596529321.do
    • PyData Channel on Vimeo: http://vimeo.com/pydata