PyData Paris 2016 keynote: reasons for the success of Scientific Python

Emmanuelle Gouillart

June 14, 2016

Transcript

  1. Simple APIs and innovative documentation: reasons for the success of Scientific Python

    Emmanuelle Gouillart, joint unit CNRS/Saint-Gobain SVI and the scikit-image team, @EGouillart
  2. NumPy: Python objects for numerical arrays

    Multi-dimensional numerical data container (based on compiled code) + utility functions to create/manipulate them

        >>> a = np.random.random_integers(0, 1, (2, 2, 2))
        >>> a
        array([[[0, 1],
                [1, 0]],
               [[0, 0],
                [0, 1]]])
        >>> a.shape, a.dtype
        ((2, 2, 2), dtype('int64'))
  3. NumPy: Python objects for numerical arrays

    Multi-dimensional numerical data container (based on compiled code) + utility functions to create/manipulate them

        >>> a = np.random.random_integers(0, 1, (2, 2, 2))
        >>> a
        array([[[0, 1],
                [1, 0]],
               [[0, 0],
                [0, 1]]])
        >>> a.shape, a.dtype
        ((2, 2, 2), dtype('int64'))

    Efficient and versatile data access:
    indexing and slicing
    fancy indexing
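
A minimal sketch of the access patterns named on this slide: basic indexing, slicing, and fancy indexing with integer lists or boolean masks. The array contents and variable names are illustrative and do not come from the talk. `Generator.integers` is used here as the current replacement for the deprecated `random_integers`.

```python
import numpy as np

# Reproducible random 0/1 array; Generator.integers excludes the upper bound.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=(2, 2, 2))

b = np.arange(12).reshape(3, 4)   # rows: [0..3], [4..7], [8..11]

element = b[1, 2]                 # basic indexing: a single element
column = b[:, 1]                  # slicing: second column, a view on b
corner = b[:2, -2:]               # slicing: top-right 2x2 block
picked = b[[0, 2], [1, 3]]        # fancy indexing: elements b[0, 1] and b[2, 3]
large = b[b > 8]                  # fancy indexing with a boolean mask

# Slices are views: writing through one modifies the original array.
column[0] = 100
```

Fancy indexing returns copies, so `picked` and `large` are unaffected by the final assignment, whereas `b[0, 1]` changes.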
  4. What is scikit-image?

    An open-source (BSD) generic image processing library for the Python language (and NumPy data arrays)
  5. What is scikit-image?

    An open-source (BSD) generic image processing library for the Python language (and NumPy data arrays)
    for 2D & 3D images
    simple API & gentle learning curve
  6. A flood of images

    Hundreds of terabytes of scientific data for scientific experiments
    http://sdo.gsfc.nasa.gov/
  7. Datasheet

    Package statistics
    http://scikit-image.org/
    Release 0.12 (1-2 releases per year)
    Among the 1000 best-ranked packages on PyPI
    20,000 unique visitors / month
  8. The people

    A quite healthy curve... we can do better!
    Fernando Perez & Aaron Meurer, Gist 5843625
  9. The people

    Origin & diversity
    Different fields of application
    10 largest contributors: 4 continents and 7 countries of origin
    Where we could do better:
    Academic / business / industry
    Gender balance
    Africa, South America, ...
  10. We code when(ever) we can

    [Figure: two plots, "Coding hours" (commits by time of day, 00:00 to 24:00) and "Number of commits per day" (Monday to Sunday)]
  11. Development model

    Mature algorithms
    Only Python + Cython code for easier maintainability
    Focus on good practices: testing, documentation, version control
    Hosted on GitHub: thorough code review + continuous integration
    Core team of 5-10 persons (close to applications)
  12. Who is your typical user?

    Windows 54%, Linux 26%, OS X 20%
    Not a lot of hardcore geeks
    Not a lot of time on her plate
    Learning / finding information is hard
  13. Manipulating images as numerical (NumPy) arrays

    Pixels are array elements

        import numpy as np
        image = np.ones((5, 5))
        image[0, 0] = 0
        image[2, :] = 0
  14. Manipulating images as numerical (NumPy) arrays

    Pixels are array elements

        import numpy as np
        image = np.ones((5, 5))
        image[0, 0] = 0
        image[2, :] = 0

        >>> coffee.shape
        (400, 600, 3)
        >>> red_channel = coffee[..., 0]
        >>> image_3d = np.ones((100, 100, 100))
  15. NumPy-native: images as NumPy arrays

    NumPy arrays as arguments and outputs

        >>> from skimage import io, filters
        >>> camera_array = io.imread('camera_image.png')
        >>> type(camera_array)
        <type 'numpy.ndarray'>
        >>> camera_array.dtype
        dtype('uint8')
        >>> filtered_array = filters.gaussian(camera_array, sigma=5)
        >>> type(filtered_array)
        <type 'numpy.ndarray'>
        >>> import matplotlib.pyplot as plt
        >>> plt.imshow(filtered_array, cmap='gray')
  16. How we simplified the API

    Before 2013

        >>> from skimage import io, filters
        >>> camera_array = io.imread('camera_image.png')
        >>> type(camera_array)
        Image
        ...
        >>> camera_array.max()
        Image(255, dtype=uint8)
  17. Versatile use for 2D, 2D-RGB, 3D...

        >>> from skimage import measure
        >>> labels_2d = measure.label(image_2d)
        >>> labels_3d = measure.label(image_3d)
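
A small runnable sketch of the same idea, assuming scikit-image is installed: `measure.label` is called unchanged on a 2D image and on a 3D volume. The synthetic arrays below are illustrative stand-ins for `image_2d` and `image_3d`, which the slide does not define.

```python
import numpy as np
from skimage import measure

# Synthetic 2D binary image with two separate square blobs.
image_2d = np.zeros((10, 10), dtype=np.uint8)
image_2d[1:4, 1:4] = 1
image_2d[6:9, 6:9] = 1

# Synthetic 3D binary volume with two separate cubic blobs.
image_3d = np.zeros((10, 10, 10), dtype=np.uint8)
image_3d[1:4, 1:4, 1:4] = 1
image_3d[6:9, 6:9, 6:9] = 1

# The same function handles both dimensionalities.
labels_2d = measure.label(image_2d)
labels_3d = measure.label(image_3d)
```

Each connected blob receives its own integer label, and the output has the same shape as the input.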
  18. Versatile use for 2D, 2D-RGB, 3D...

        def quickshift(image, ratio=1.0, kernel_size=5, max_dist=10,
                       sigma=0, random_seed=42):
            """Segments image using quickshift clustering in Color-(x,y) space.
            ...
            """
            image = img_as_float(np.atleast_3d(image))
            ...
  19. An API relying mostly on functions

        skimage.filters.gaussian(image, sigma, output=None, mode='nearest', cval=0, multichannel=None)

        Multi-dimensional Gaussian filter

        Parameters
        ----------
        image : array-like
            input image (grayscale or color) to filter.
        sigma : scalar or sequence of scalars
            standard deviation for Gaussian kernel. The standard
            deviations of the Gaussian filter are given for each axis as a
            sequence, or as a single number, in which case it is equal for
            all axes.
        output : array, optional
            The ``output`` parameter passes an array in which to store the
            filter output.
        mode : {'reflect', 'constant', 'nearest', 'mirror', 'wrap'}, optional

    One filter = one function
    Use keyword arguments for parameter tuning
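
To illustrate "one filter = one function" with keyword tuning, here is a hedged sketch assuming scikit-image is installed. It calls `filters.gaussian` twice on a synthetic impulse image (an illustrative input, not from the talk), once with a scalar `sigma` and once with one value per axis, as the docstring above describes.

```python
import numpy as np
from skimage import filters

# A single bright pixel on a dark float image makes the kernel shape visible.
image = np.zeros((41, 41), dtype=float)
image[20, 20] = 1.0

isotropic = filters.gaussian(image, sigma=2)          # same sigma on every axis
anisotropic = filters.gaussian(image, sigma=(1, 4))   # one sigma per axis (rows, columns)
```

With `sigma=(1, 4)` the blurred spot is stretched along the column axis, while the scalar form gives a symmetric spot.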
  20. Denoising tomography images

    In-situ imaging of phase separation in silicate melts
    From basic (generic) to advanced (specific) filters
  21. Denoising tomography images

    Histogram of pixel values
    From basic (generic) to advanced (specific) filters

        bilateral = restoration.denoise_bilateral(dat)
        bilateral = restoration.denoise_bilateral(dat, sigma_range=2.5, sigma_spatial=2)
        tv = restoration.denoise_tv_chambolle(dat, weight=0.5)
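
A minimal sketch of total-variation denoising on synthetic data, assuming scikit-image is installed. The two-phase test image, the noise level and the `weight` value are illustrative choices, not the tomography data or settings from the talk.

```python
import numpy as np
from skimage import restoration

rng = np.random.default_rng(0)

# Synthetic two-phase image (dark left half, bright right half) plus Gaussian noise.
clean = np.zeros((64, 64), dtype=float)
clean[:, 32:] = 1.0
noisy = clean + 0.2 * rng.standard_normal(clean.shape)

# Total-variation denoising; a larger weight gives a smoother result.
tv = restoration.denoise_tv_chambolle(noisy, weight=0.2)

# Root-mean-square error with respect to the noise-free image.
rmse_noisy = np.sqrt(np.mean((noisy - clean) ** 2))
rmse_tv = np.sqrt(np.mean((tv - clean) ** 2))
```

Comparing `rmse_tv` with `rmse_noisy` gives a quick check that the filter helps on piecewise-constant data like this.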
  22. Example: segmentation of low-contrast regions

    In-situ imaging of glass batch reactive melting
    Non-local means denoising to preserve texture
    Histogram-based markers extraction
    Random walker segmentation
    Non-local means: average similar patches
    Random walker: anisotropic diffusion from markers
    Random walker less sensitive to noise than watershed, but slower
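
The marker-then-diffuse workflow can be sketched as follows, assuming scikit-image is installed. The synthetic image, the threshold values used to pick markers and the `beta` setting are all assumptions made for illustration, and the non-local means step is left out to keep the example short.

```python
import numpy as np
from skimage import segmentation

rng = np.random.default_rng(0)

# Synthetic noisy two-phase image: left half near 0, right half near 1.
data = np.zeros((40, 40), dtype=float)
data[:, 20:] = 1.0
data += 0.3 * rng.standard_normal(data.shape)

# Histogram-based markers: only pixels in the tails of the histogram
# are labelled; everything in between stays 0 (unknown).
markers = np.zeros(data.shape, dtype=np.uint8)
markers[data < 0.0] = 1
markers[data > 1.0] = 2

# Random walker: labels diffuse from the markers into the unknown pixels.
# mode='bf' solves the linear system directly, which is fine for a small image.
labels = segmentation.random_walker(data, markers, beta=10, mode='bf')
```

Marked pixels keep their label, and each unmarked pixel receives the label whose markers a random walk starting there is most likely to reach first.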
  23. Feature extraction followed by classification

    Combining scikit-image and scikit-learn
    Extract features (skimage.feature)
    Pixel intensity values (R, G, B)
    Local gradients
    More advanced descriptors: HOGs, Gabor, ...
    Train classifier with known regions (here, a random forest classifier)
    Classify pixels
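
The extract-train-classify steps above can be sketched as follows, assuming scikit-learn is installed. To stay short, the features are computed with plain NumPy (colour values and a gradient magnitude) rather than with `skimage.feature`, and the image and the "known regions" are synthetic, hypothetical stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic RGB image: reddish left half, bluish right half, plus mild noise.
image = np.zeros((40, 40, 3), dtype=float)
image[:, :20] = (0.8, 0.2, 0.2)
image[:, 20:] = (0.2, 0.2, 0.8)
image += 0.05 * rng.standard_normal(image.shape)

# Per-pixel features: the three colour intensities and a local gradient magnitude.
gray = image.mean(axis=2)
grad_rows, grad_cols = np.gradient(gray)
gradient = np.hypot(grad_rows, grad_cols)
features = np.dstack([image, gradient]).reshape(-1, 4)

# "Known regions": two small hand-labelled patches (1 = left phase, 2 = right phase).
training_labels = np.zeros(gray.shape, dtype=int)
training_labels[5:10, 2:8] = 1
training_labels[30:35, 32:38] = 2
known = training_labels.ravel() > 0

classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(features[known], training_labels.ravel()[known])

# Classify every pixel and restore the image shape.
result = classifier.predict(features).reshape(gray.shape)
```

Only the pixels inside the two labelled patches are used for training; every pixel of the image is then classified from its feature vector.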
  24. API of scikit-image

    [Diagram: module (skimage) > submodule (filters, restoration, segmentation, ...) > function (denoise_bilateral) > variables: input array + optional parameters, output (array)]
  25. What is good documentation?

    "Documenting code is like writing 'Tasty!' on the side of a coffee cup. If the code isn't readable on a grey Monday morning before coffee, chuck it out and start again. What you document are APIs (...). That is fine. Explaining what this funky loop does is not fine."
    Pieter Hintjens
  26. Docstrings now and then

    docstring in 2008

        Definition: np.diff(a, n=1, axis=-1)
        Docstring:
            Calculate the n-th order discrete difference along given axis.
  27. Docstrings now and then

        Definition: np.diff(a, n=1, axis=-1)
        Docstring:
            Calculate the n-th order discrete difference along given axis.

            The first order difference is given by ``out[n] = a[n+1] - a[n]``
            along the given axis, higher order differences are calculated by
            using `diff` recursively.

            Parameters
            ----------
            a : array_like
                Input array
            n : int, optional
                The number of times values are differenced.
            axis : int, optional
                The axis along which the difference is taken, default is the
                last axis.

            Returns
            -------
            diff : ndarray
                The `n` order differences. The shape of the output is the same
                as `a` except along `axis` where the dimension is smaller by `n`.

            See Also
            --------
            gradient, ediff1d, cumsum

            Examples
            --------
            >>> x = np.array([1, 2, 4, 7, 0])
            >>> np.diff(x)
            array([ 1,  2,  3, -7])
            >>> np.diff(x, n=2)
            array([  1,   1, -10])

    Much better now!
    Parameters and their type
    Suggestion of other functions
    Simple example
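
As a quick check of what the docstring describes, the sketch below reuses its example array and adds a small 2D array (an illustrative addition) to show the `axis` parameter.

```python
import numpy as np

x = np.array([1, 2, 4, 7, 0])

first = np.diff(x)            # out[i] = x[i + 1] - x[i]
second = np.diff(x, n=2)      # same as differencing twice

# Differences along a chosen axis of a 2D array.
m = np.array([[1, 3, 6],
              [0, 5, 5]])
along_rows = np.diff(m, axis=0)   # shape (1, 3)
along_cols = np.diff(m, axis=1)   # shape (2, 2)
```

Each differencing step shortens the chosen axis by one, which matches the "smaller by `n`" statement in the Returns section.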
  28. pydocweb and NumPy documentation marathon

    Tools by Pauli Virtanen, with enthusiastic cheering from my side
    Documentation effort led by Stéfan van der Walt
    Easy as Wikipedia
    A wiki to improve the docs
    We didn't have GitHub!
  29. NumPy documentation standard

    https://github.com/numpy/numpy/blob/master/doc/example.py

        def foo(var1, var2, long_var_name='hi'):
            r"""A one-line summary that does not use variable names or the
            function name.

            Several sentences providing an extended description. Refer to
            variables using back-ticks, e.g. `var`.

            Parameters
            ----------
            var1 : array_like
                Array_like means all those objects -- lists, nested lists, etc. --
                that can be converted to an array. We can also refer to
                variables like `var1`.
            var2 : int
                The type above can either refer to an actual Python type
                (e.g. ``int``), or describe the type of the variable in more
                detail, e.g. ``(N,) ndarray`` or ``array_like``.
            long_var_name : {'hi', 'ho'}, optional
                Choices in brackets, default first when optional.

            Returns
            -------
            type
                Explanation of anonymous return value of type ``type``.
            describe : type
                Explanation of return value named `describe`.
            out : type
                Explanation of `out`.

            Other Parameters
            ----------------
            only_seldom_used_keywords : type
                Explanation
            common_parameters_listed_above : type
                Explanation
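
For comparison with the template, here is a sketch of the same docstring layout applied to a small working function. `rescale` is a hypothetical helper written for this illustration; it is not part of NumPy or scikit-image.

```python
import numpy as np


def rescale(image, low=0.0, high=1.0):
    """Linearly rescale the values of an array to a given range.

    The minimum of the input is mapped to `low` and the maximum to `high`.
    A constant input is mapped to `low` everywhere.

    Parameters
    ----------
    image : array_like
        Input data of any shape.
    low : float, optional
        Value assigned to the minimum of `image`.
    high : float, optional
        Value assigned to the maximum of `image`.

    Returns
    -------
    rescaled : ndarray
        Array of floats with the same shape as `image`.

    Examples
    --------
    >>> rescale([0, 5, 10]).tolist()
    [0.0, 0.5, 1.0]
    """
    image = np.asarray(image, dtype=float)
    span = image.max() - image.min()
    if span == 0:
        # Constant input: avoid dividing by zero.
        return np.full_like(image, low)
    return low + (high - low) * (image - image.min()) / span
```

The sections appear in the order the standard prescribes (summary, extended description, Parameters, Returns, Examples), so `help(rescale)` reads like the `np.diff` docstring shown earlier.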
  30. Outcome and impact of documentation marathon

    Number of words in NumPy reference: 8,600 → 140,000
    New contributors: 250 accounts
    Lower entry barrier to contribute
    Increased the standard for other packages
    Made people proud about docs
  32. My first experience of programming...

        >>> cd new_experiment
        >>> acquire_temperature()
        >>> name_exp = 'convection'
        >>> control_parameter()
        >>> ... and other magical spells
  34. EuroSciPy conferences

    Every August: Leipzig, Paris, Brussels, Cambridge
    2016: Erlangen
    2 days of tutorials, beginners and advanced
    2 days of conference
    Help from volunteers always welcome!
  35. SciPy lecture notes

    Train a lot of people: need tools that scale
    Several weeks of tutorials!
    Beginners: the core of Scientific Python
    Advanced: learn more tricks
    Packages: specific applications and packages
    Developed and used for EuroSciPy conferences
    Curated and enriched over the years
  36. Achieving sustainable growth

    Balance users' and contributors' goals: robustness and smooth learning curve vs cool factor and bleeding-edge tools
    Feature development should not be faster than quality improvement
    Documentation and training for users
    Low entry barriers for contributors
  37. Massive data processing and parallelization

    Competitive environment: some other tools use GPUs, Spark, etc. scikit-image uses NumPy!
    I/O: large images might not fit into memory; use memory mapping of different file formats (raw binary with NumPy, hdf5 with pytables)
    Divide into blocks: use util.view_as_blocks to iterate conveniently over blocks
    Parallel processing: use joblib or dask
    Better integration desirable
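
A minimal NumPy-only sketch of the I/O and blocking steps: a raw binary file is memory-mapped with `np.memmap` and processed one block at a time. The file name, image size and block size are hypothetical, and explicit slicing stands in for scikit-image's `util.view_as_blocks` to keep the example free of extra dependencies.

```python
import os
import tempfile

import numpy as np

shape = (64, 64)
block = (16, 16)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "image.raw")   # hypothetical file name

    # Write a raw binary image to disk.
    np.arange(shape[0] * shape[1], dtype=np.float32).reshape(shape).tofile(path)

    # Memory-map it: data are read from disk only when accessed.
    image = np.memmap(path, dtype=np.float32, mode="r", shape=shape)

    # Iterate over non-overlapping blocks and reduce each one.
    block_means = np.empty((shape[0] // block[0], shape[1] // block[1]))
    for i in range(block_means.shape[0]):
        for j in range(block_means.shape[1]):
            tile = image[i * block[0]:(i + 1) * block[0],
                         j * block[1]:(j + 1) * block[1]]
            block_means[i, j] = tile.mean()

    # Release the memory map before the temporary file is removed.
    del tile, image
```

Only one block needs to be resident in memory at a time, which is the point of combining memory mapping with block-wise iteration.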
  39. joblib: easy, simple parallel computing + lazy re-evaluation

        >>> from skimage import data, util
        >>> hubble = data.hubble_deep_field()
        >>> width = 10
        >>> pics = util.view_as_windows(hubble, (width, hubble.shape[1], hubble.shape[2]), step=width)
        >>> from joblib import Parallel, delayed
        >>> # task is an image processing function
        >>> Parallel(n_jobs=4)(delayed(task)(pic) for pic in pics)
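
A self-contained variant of the pattern on this slide, assuming joblib is installed. The slide leaves `task` undefined, so the version below is a hypothetical placeholder, and a synthetic array split with plain slicing replaces the Hubble image and `view_as_windows`.

```python
import numpy as np
from joblib import Parallel, delayed


def task(strip):
    """Hypothetical per-strip processing: here, simply the mean intensity."""
    return float(strip.mean())


# Synthetic stand-in for a large image: 100 rows x 50 columns x 3 channels.
image = np.arange(100 * 50 * 3, dtype=float).reshape(100, 50, 3)

# Split into non-overlapping horizontal strips of 10 rows each.
width = 10
strips = [image[start:start + width] for start in range(0, image.shape[0], width)]

# Process the strips in parallel; results come back in input order.
results = Parallel(n_jobs=2)(delayed(task)(strip) for strip in strips)
```

`delayed(task)(strip)` only records the call; `Parallel` then dispatches the recorded calls to the workers and collects the return values in a list.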
  40. A platform to build an ecosystem upon

    Tool for users, platform for other tools

        $ apt-cache rdepends python-matplotlib
        ...
        96 Python packages & applications

    Specific applications that could build on scikit-image
    Imaging techniques: microscopy, tomography, ...
    Fields: cell biology, astronomy, ...
    Requirements: stable API, good docs
  41. No need to be a programming genius to contribute to OSS

    Social and pedagogical skills useful and welcome
    You will learn a lot and make friends. P. Hintjens
  42. No need to be a programming genius to contribute to OSS

    Social and pedagogical skills useful and welcome
    You will learn a lot and make friends. P. Hintjens
    Try it out! http://scikit-image.org/
    Feedback welcome: github.com/scikit-image/scikit-image
    Please cite the paper
    Let's talk about scikit-image: @EGouillart