Introduction to scientific programming in python

Olivier Hervieu
September 13, 2014

Talk given at the Pycon JP - 2014/09/13


Transcript

  1. Why this talk?
    • born on Twitter
    • not my initial proposal for PyCon JP*!
    * speakerdeck.com/ohe/shit-happens-dot-dot-dot-v2
  2. What to expect?
    • a tour of tools that can (must?) be used by the everyday scientific programmer
    • some guidelines on how to industrialize your stack
  3. A little about me
    • software engineer at @tinyclues
    • 10 years of experience, working with Python every day for 6 years (I know, I’m old)
    • first conference in Japan (yeah!)
    • more at about.me/ohe
    • slides can be found on speakerdeck.com/ohe
  4. • ipython is a must-use (for every pythonista)
    • if you don’t use it, install it now (pip install ipython)
    • ipython provides:
      • a powerful interactive shell
      • a browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media (as described on their website)
      • easy-to-use, high-performance tools for parallel computing
  5. • notebook mode supports literate programming and reproducible sessions
    • a notebook stores chunks of Python alongside their results and additional comments (HTML, LaTeX, Markdown)
    • a notebook can be exported to various file formats
  6. • ipython is the de facto standard for sharing Python sessions (see nbviewer.ipython.org)
    • the project is well maintained and very stable (no surprises when you upgrade your version)
  7. • numpy provides a powerful N-dimensional array object
    • methods on these arrays are fast because they rely on well-optimised libraries for linear algebra (BLAS, ATLAS, MKL)
    • numpy is tolerant of Python’s lists: most functions accept them in place of arrays
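As a small illustration of the points above (a sketch, not taken from the talk; the matrices are made-up values), nested Python lists can be turned into arrays or passed straight to numpy functions:

```python
import numpy as np

# Plain nested Python lists (illustrative values, not from the slides).
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# np.array builds an N-dimensional array from the nested lists.
arr = np.array(a)
print(arr.ndim, arr.shape)

# np.dot accepts the lists directly and returns an ndarray.
product = np.dot(a, b)
print(product)

# A 3-dimensional array, to show that "N" is not limited to 2.
cube = np.zeros((2, 3, 4))
print(cube.ndim)
```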
  8. python vs numpy

    def matmult(a, b):
        # columns of b; list() so they can be reused for every row under Python 3
        zip_b = list(zip(*b))
        return [[sum(ele_a * ele_b for ele_a, ele_b in zip(row_a, col_b))
                 for col_b in zip_b] for row_a in a]

    shape          matmult        np.dot     speedup
    (10, 10)       936µs          2µs        450x
    (100, 100)     693000µs       53µs       13000x
    (1000, 1000)   744000000µs    13900µs    53000x
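A rough way to reproduce this kind of comparison on your own machine is sketched below. The matrix size `n = 50`, the random inputs and the repeat counts are assumptions chosen so that the script finishes quickly; they are not the settings behind the numbers on the slide, and absolute timings will differ from one machine to another.

```python
import timeit

import numpy as np


def matmult(a, b):
    # Pure-Python matrix product, as on the slide (list() added for Python 3).
    zip_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row_a, col_b))
             for col_b in zip_b] for row_a in a]


n = 50  # assumed size, kept small so the script finishes quickly
rng = np.random.default_rng(0)
a = rng.random((n, n))
b = rng.random((n, n))
a_list, b_list = a.tolist(), b.tolist()

# Both implementations should agree up to floating-point rounding.
same_result = np.allclose(matmult(a_list, b_list), np.dot(a, b))

# Best of three runs for each implementation, in seconds per call.
t_python = min(timeit.repeat(lambda: matmult(a_list, b_list), number=1, repeat=3))
t_numpy = min(timeit.repeat(lambda: np.dot(a, b), number=10, repeat=3)) / 10

print("same result:", same_result)
print("pure python: %.6f s, np.dot: %.6f s" % (t_python, t_numpy))
```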
  9. • you don’t want to implement your own matrix multiplication method :-)
    • numpy builds on years of computer-based numerical analysis problem solving
    • don’t believe benchmarks about Python performance (who said Julia?)
  10. • scipy provides numerous numerical routines that run efficiently on top of numpy arrays for:
      • optimization
      • signal processing
      • linear algebra …
    • it also provides some convenient data structures, such as compressed sparse matrices and spatial data structures
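A minimal sketch of three of these areas, assuming a reasonably recent scipy release; the quadratic function, the small matrix and the points are made-up illustrations, not examples from the talk:

```python
import numpy as np
from scipy import optimize, sparse, spatial

# Optimization: minimize a simple quadratic whose minimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

# Sparse matrices: CSR format stores only the non-zero entries.
dense = np.array([[0, 0, 1], [2, 0, 0], [0, 0, 3]])
csr = sparse.csr_matrix(dense)

# Spatial data structures: a k-d tree for nearest-neighbour queries.
points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
tree = spatial.KDTree(points)
distance, index = tree.query([0.9, 1.2])

print(result.x, csr.nnz, index)
```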
  11. • if you have already used some scikits (scikit-learn, scikit-image), you have already used scipy extensively
    • in other words, scipy is a toolbox for mathematicians; it contains many hidden treasures for them
    • for the programmer, the APIs are a bit harsh, as is the naming of methods (but this naming is totally explicit for mathematicians)
  12. • matplotlib: the ultimate plotting library, rendering high-quality 2D and 3D plots for Python (I think other languages are a bit jealous too ;)
    • the API mimics the MATLAB one in many ways, easing the transition to Python for MATLAB users
    • once again, no surprises: matplotlib is a very stable and mature project (expect one major release per year)
    • I recommend watching “Introduction to Numpy and Matplotlib” (4 hours!) on YouTube*
    * https://www.youtube.com/watch?v=3Fp1zn5ao2M
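To give an idea of the MATLAB-like interface, here is a minimal sketch using `matplotlib.pyplot`. The curves and labels are arbitrary, and the non-interactive `Agg` backend plus the in-memory buffer are assumptions made so the example runs without a display and without writing a file:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# plot/xlabel/ylabel/title/legend mirror the MATLAB function names.
fig = plt.figure()
plt.plot(x, np.sin(x), "r-", label="sin(x)")
plt.plot(x, np.cos(x), "b--", label="cos(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.title("MATLAB-style plotting")
plt.legend()

# Render to an in-memory PNG; pass a file name instead to save to disk.
buffer = io.BytesIO()
plt.savefig(buffer, format="png")
```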
  13. • scikit-learn is one of the numerous scikits that have been developed in recent years (there are also scikit-image, statsmodels (formerly scikits.statsmodels), etc.)
    • it provides a ready-to-use environment to play with standard machine learning algorithms
    • expect a very clean API
    • the project is very active and has an awesome community
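The "clean API" mostly comes down to every estimator exposing the same `fit` / `predict` / `score` methods. A minimal sketch, assuming a recent scikit-learn release (the `sklearn.model_selection` module postdates this talk) and using the bundled iris dataset with a k-nearest-neighbours classifier as arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small dataset bundled with scikit-learn: features X, class labels y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Every estimator follows the same pattern: construct, fit, predict/score.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)

print("accuracy on held-out data:", accuracy)
```

Swapping in another classifier generally means changing only the line that constructs `model`.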
  14. • pandas is a fairly “new” project (open-sourced in 2009), but development has been really active since 2012
    • data manipulation library based on numpy
    • provides a DataFrame data structure with methods for accessing, merging/grouping and indexing data easily
    • doesn’t play well (yet?) with scikits (there are some attempts, like sklearn-pandas)
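A short sketch of the DataFrame operations mentioned above (accessing, merging, grouping). The two tables and their column names are invented for illustration:

```python
import pandas as pd

# Two small, made-up tables.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cid"],
    "amount": [10.0, 20.0, 5.0, 7.5],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob", "cid"],
    "country": ["FR", "JP", "JP"],
})

# Accessing: boolean indexing keeps the rows matching a condition.
big_orders = orders[orders["amount"] > 8.0]

# Merging: SQL-like join on a shared column.
merged = orders.merge(customers, on="customer", how="left")

# Grouping: total amount per country.
totals = merged.groupby("country")["amount"].sum()
print(totals)
```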
  15. • numpy/scipy/scikit-learn rely on many low-level Fortran/C libraries such as BLAS, ATLAS, the Intel MKL…
    • most of these libraries are shipped unoptimized by your favorite OS (well, this is not the case for Mac OS)
    • you may want to recompile these libraries
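Before recompiling anything, one way to see which BLAS/LAPACK implementation your numpy build is linked against is `np.show_config()`. This is only a quick check, and the layout of the printed output varies between numpy versions:

```python
import numpy as np

# Print the build configuration, including the BLAS/LAPACK libraries
# that this numpy installation was compiled against.
np.show_config()
```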
  16. • recompiling is the (very) long way!
    • we did that at tinyclues for two years; we’re now using a packaged Python distribution. Some of them:
      • anaconda (powered by continuum analytics)
      • canopy (powered by enthought)
  17. • sadly, these distributions come with yet another package management tool (conda, enpkg), which sometimes doesn’t play nicely with pip and/or virtualenv
    • this adds a new step to this famous tweet about Python package managers :)
  18. We’re not done
    • libraries for performance: numba, cython, …
    • domain-specific libraries: sympy, nltk, …
    • bindings: rpy2, …
    • storage: pytables, …
  19. Free (as in free beer)
    • All these libraries come for free and are developed by passionate developers.
    • Please, be grateful; help them!
      • by finding and filing bugs (we always love to see that our code is really used by someone)
      • by fixing bugs or giving a beer to developers
      • by supporting them financially
      • by hosting one of their sprints (if your office is big enough)
  20. Recommended
    • API design for machine learning software: experiences from the scikit-learn project: http://arxiv.org/abs/1309.0238
    • Programming Collective Intelligence: http://shop.oreilly.com/product/9780596529321.do
    • PyData Channel on Vimeo: http://vimeo.com/pydata