PyData Italy 2016 keynote: scikit-image, APIs and documentation

Emmanuelle Gouillart

April 16, 2016

Transcript

  1. Simple APIs and innovative documentation: reasons for the success of Scientific Python

    Emmanuelle Gouillart, joint unit CNRS/Saint-Gobain SVI, and the scikit-image team. @EGouillart
  2. NumPy: Python objects for numerical arrays

    Multi-dimensional numerical data container (based on compiled code) + utility functions to create/manipulate them
    >>> a = np.random.random_integers(0, 1, (2, 2, 2))
    >>> a
    array([[[0, 1],
            [1, 0]],
           [[0, 0],
            [0, 1]]])
    >>> a.shape, a.dtype
    ((2, 2, 2), dtype('int64'))
  3. NumPy: Python objects for numerical arrays

    Multi-dimensional numerical data container (based on compiled code) + utility functions to create/manipulate them
    >>> a = np.random.random_integers(0, 1, (2, 2, 2))
    >>> a
    array([[[0, 1],
            [1, 0]],
           [[0, 0],
            [0, 1]]])
    >>> a.shape, a.dtype
    ((2, 2, 2), dtype('int64'))
    Efficient and versatile data access: indexing and slicing, fancy indexing
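The access styles named on slide 3 (indexing, slicing, fancy indexing) can be illustrated on a small array. This is a minimal sketch that is not part of the talk; the array `a` and its values are made up for illustration.

```python
import numpy as np

# Hypothetical 3x4 array, used only for illustration.
a = np.arange(12).reshape(3, 4)

# Indexing: a single element, addressed by (row, column).
element = a[1, 2]

# Slicing: every row, every second column. A slice is a view on the same data.
columns = a[:, ::2]

# Fancy indexing with a list of integers: rows 2 and 0, in that order (a copy).
rows = a[[2, 0]]

# Fancy indexing with a boolean mask: all elements greater than 8.
large = a[a > 8]

print(element, columns.shape, rows[:, 0], large)
```

Slices share memory with the original array, whereas fancy indexing returns a copy, which matters when the result is modified in place.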
  4. A flood of images

    Hundreds of terabytes of scientific data for a scientific experiment
    http://sdo.gsfc.nasa.gov/
  5. A flood of images

    Hundreds of terabytes of scientific data for a scientific experiment
    http://sdo.gsfc.nasa.gov/
    Image processing: manipulating images in order to retrieve new images or image characteristics (features, measurements, ...)
    Often combined with machine learning
  6. What is scikit-image?

    An open-source (BSD) generic image processing library for the Python language (and NumPy data arrays)
  7. What is scikit-image?

    An open-source (BSD) generic image processing library for the Python language (and NumPy data arrays)
    For 2D & 3D images
    Simple API & gentle learning curve
  8. Datasheet

    Package statistics: http://scikit-image.org/
    Release 0.12 (1-2 releases per year)
    Among the 1000 best-ranked packages on PyPI
    20,000 unique visitors / month
  9. Manipulating images as numerical (NumPy) arrays

    Pixels are array elements
    import numpy as np
    image = np.ones((5, 5))
    image[0, 0] = 0
    image[2, :] = 0
  10. Manipulating images as numerical (NumPy) arrays

    Pixels are array elements
    import numpy as np
    image = np.ones((5, 5))
    image[0, 0] = 0
    image[2, :] = 0
    >>> coffee.shape
    (400, 600, 3)
    >>> red_channel = coffee[..., 0]
    >>> image_3d = np.ones((100, 100, 100))
  11. NumPy-native: images as NumPy arrays

    NumPy arrays as arguments and outputs
    >>> from skimage import io, filters
    >>> camera_array = io.imread('camera_image.png')
    >>> type(camera_array)
    <type 'numpy.ndarray'>
    >>> camera_array.dtype
    dtype('uint8')
    >>> filtered_array = filters.gaussian(camera_array, sigma=5)
    >>> type(filtered_array)
    <type 'numpy.ndarray'>
    >>> import matplotlib.pyplot as plt
    >>> plt.imshow(filtered_array, cmap='gray')
  12. An API relying mostly on functions

    skimage.filters.gaussian(image, sigma, output=None, mode='nearest', cval=0, multichannel=None)

    Multi-dimensional Gaussian filter

    Parameters
    ----------
    image : array-like
        input image (grayscale or color) to filter.
    sigma : scalar or sequence of scalars
        standard deviation for Gaussian kernel. The standard deviations of the Gaussian filter are given for each axis as a sequence, or as a single number, in which case it is equal for all axes.
    output : array, optional
        The ``output`` parameter passes an array in which to store the filter output.
    mode : {'reflect', 'constant', 'nearest', 'mirror', 'wrap'}, optional

    One filter = one function
    Use keyword arguments for parameter tuning
  13. How we simplified the API

    Before 2013
    >>> from skimage import io, filters
    >>> camera_array = io.imread('camera_image.png')
    >>> type(camera_array)
    Image
    ...
    >>> camera.max()
    Image(255, dtype=uint8)
  14. Denoising tomography images

    In-situ imaging of phase separation in silicate melts
    From basic (generic) to advanced (specific) filters
  15. Denoising tomography images

    Histogram of pixel values
    From basic (generic) to advanced (specific) filters
    bilateral = restoration.denoise_bilateral(dat)
    bilateral = restoration.denoise_bilateral(dat, sigma_range=2.5, sigma_spatial=2)
    tv = restoration.denoise_tv_chambolle(dat, weight=0.5)
  16. Example: segmentation of low-contrast regions

    In-situ imaging of glass batch reactive melting
    Non-local means denoising to preserve texture
    Histogram-based markers extraction
    Random walker segmentation
    Non-local means: average similar patches
    Random walker: anisotropic diffusion from markers
    Random walker is less sensitive to noise than watershed, but slower
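The three steps listed on slide 16 (non-local means denoising, histogram-based markers, random walker) can be chained as in the sketch below. It is an illustration only: the synthetic disk image, the marker thresholds (0.42 and 0.58) and the filter parameters are assumptions chosen for this toy case, not the data or settings used in the talk.

```python
import numpy as np
from skimage import restoration, segmentation

# Synthetic low-contrast image (an assumption, standing in for tomography data):
# a slightly brighter disk on a darker background, plus Gaussian noise.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[:64, :64]
truth = (yy - 32) ** 2 + (xx - 32) ** 2 < 15 ** 2
image = 0.4 + 0.2 * truth + 0.1 * rng.standard_normal(truth.shape)

# 1. Non-local means denoising: averages patches that look alike.
denoised = restoration.denoise_nl_means(image, patch_size=5, patch_distance=6, h=0.1)

# 2. Histogram-based markers: only pixels with unambiguous values get a label;
#    0 means "unknown, to be decided by the random walker".
markers = np.zeros(image.shape, dtype=np.int32)
markers[denoised < 0.42] = 1   # background (hypothetical threshold)
markers[denoised > 0.58] = 2   # disk (hypothetical threshold)

# 3. Random walker: diffuses the marker labels into the unlabeled pixels.
labels = segmentation.random_walker(denoised, markers, beta=130)

print("fraction of pixels labeled correctly:", (labels == truth + 1).mean())
```

On real data the thresholds would be read off the histogram of the denoised image rather than hard-coded.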
  17. Mathematical morphology

    skimage.morphology: binary + grayscale morphology
    dilation, erosion, closing, opening
    several structuring elements
    remove small objects
    watershed
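To make the operations named on slide 17 concrete, here is a NumPy-only sketch of binary dilation, erosion and opening with a 3x3 square structuring element. It is not the implementation found in skimage.morphology; the function names and the test image are illustrative assumptions.

```python
import numpy as np

def binary_dilation(image):
    """Dilate a 2D boolean image with a 3x3 square structuring element."""
    padded = np.pad(image, 1, mode="constant", constant_values=False)
    out = np.zeros_like(image)
    rows, cols = image.shape
    # A pixel is set if any pixel of its 3x3 neighborhood is set.
    for dr in range(3):
        for dc in range(3):
            out |= padded[dr:dr + rows, dc:dc + cols]
    return out

def binary_erosion(image):
    """Erode: a pixel survives only if its whole 3x3 neighborhood is set."""
    # Erosion is the dual of dilation. With this definition, pixels outside
    # the image count as foreground for erosion.
    return ~binary_dilation(~image)

def binary_opening(image):
    """Opening = erosion followed by dilation; removes small objects."""
    return binary_dilation(binary_erosion(image))

# Hypothetical test image: a 4x4 square plus one isolated pixel.
image = np.zeros((10, 10), dtype=bool)
image[2:6, 2:6] = True
image[8, 8] = True

opened = binary_opening(image)
print(image.sum(), opened.sum())
```

The opening keeps the square intact and discards the isolated pixel, which is the "remove small objects" effect in its simplest form.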
  18. Feature extraction followed by classification

    Combining scikit-image and scikit-learn
    Extract features (skimage.feature): pixel intensity values (R, G, B), local gradients, more advanced descriptors (HOGs, Gabor, ...)
    Train classifier with known regions (here, a random forest classifier)
    Classify pixels
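The workflow of slide 18 (extract per-pixel features, train on known regions, classify every pixel) can be sketched as follows. This is a toy illustration under stated assumptions: the image is synthetic, the features are computed with plain NumPy (`np.gradient`) as a stand-in for skimage.feature descriptors, and the "known regions" are two hand-placed patches.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic grayscale image (an assumption): dark left half, bright right half.
truth = np.zeros((40, 40), dtype=int)
truth[:, 20:] = 1
image = 0.2 + 0.6 * truth + 0.05 * rng.standard_normal(truth.shape)

# Features per pixel: intensity and local gradient magnitude.
grad_rows, grad_cols = np.gradient(image)
features = np.stack([image, np.hypot(grad_rows, grad_cols)], axis=-1).reshape(-1, 2)

# "Known regions": two small patches whose class is assumed to be known.
known = np.zeros(truth.shape, dtype=bool)
known[5:10, 2:8] = True       # inside the dark region (class 0)
known[30:35, 30:36] = True    # inside the bright region (class 1)
known = known.ravel()

# Train on the known pixels only; max_features=None lets every split
# consider both features.
clf = RandomForestClassifier(n_estimators=20, max_features=None, random_state=0)
clf.fit(features[known], truth.ravel()[known])

# Classify all pixels and compare with the ground truth of the toy image.
predicted = clf.predict(features).reshape(truth.shape)
accuracy = (predicted == truth).mean()
print("pixel accuracy:", accuracy)
```

With real images the feature stack would typically grow (color channels, texture descriptors), while the training and prediction steps stay the same.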
  19. API of scikit-image

    module: skimage
    submodules: filters, restoration, segmentation, ...
    function: denoise_bilateral
    variables: input array + optional parameters → output (array)
  20. Versatile use for 2D, 2D-RGB, 3D...

    >>> from skimage import measure
    >>> labels_2d = measure.label(image_2d)
    >>> labels_3d = measure.label(image_3d)
  21. Versatile use for 2D, 2D-RGB, 3D...

    def quickshift(image, ratio=1.0, kernel_size=5, max_dist=10, sigma=0, random_seed=42):
        """Segments image using quickshift clustering in Color-(x,y) space.
        ...
        """
        image = img_as_float(np.atleast_3d(image))
        ...
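The `np.atleast_3d` call in the quickshift excerpt on slide 21 is what lets a single code path accept both grayscale and color input. The sketch below shows the same pattern on a made-up helper; `channel_means` is hypothetical and is not a scikit-image function.

```python
import numpy as np

def channel_means(image):
    """Mean intensity per channel, for grayscale (M, N) or color (M, N, C) input."""
    # atleast_3d turns an (M, N) array into (M, N, 1) and leaves (M, N, C)
    # unchanged, so the rest of the function only handles one layout.
    image = np.atleast_3d(image).astype(float)
    return image.mean(axis=(0, 1))

# Hypothetical inputs: a constant grayscale image and a pure-red RGB image.
gray = np.full((4, 5), 2.0)
rgb = np.zeros((4, 5, 3))
rgb[..., 0] = 1.0

print(channel_means(gray))
print(channel_means(rgb))
```

The grayscale image yields a single mean and the RGB image yields one mean per channel, without any `if image.ndim == 2` branch in the function body.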
  22. Development model

    Mature algorithms
    Only Python + Cython code for easier maintainability
    Focus on good practices: testing, documentation, version control
    Hosted on GitHub: thorough code review + continuous integration
    Core team of 5-10 people (close to applications)
  23. Who is your typical user?

    Windows 54%, Linux 26%, OS X 20%
    Not a lot of hardcore geeks
    Not a lot of time on her plate
    Learning / finding information is hard
  24. My first experience of programming...

    >>> cd new_experiment
    >>> acquire_temperature()
    >>> name_exp = 'convection'
    >>> control_parameter()
    >>> ... and other magical spells
  26. What is good documentation?

    "Documenting code is like writing 'Tasty!' on the side of a coffee cup. If the code isn't readable on a grey Monday morning before coffee, chuck it out and start again. What you document are APIs (...). That is fine. Explaining what this funky loop does is not fine."
    Pieter Hintjens
  27. Docstrings now and then

    Docstring in 2008
    Definition: np.diff(a, n=1, axis=-1)
    Docstring: Calculate the n-th order discrete difference along given axis.
  28. Docstrings now and then

    Definition: np.diff(a, n=1, axis=-1)
    Docstring:
    Calculate the n-th order discrete difference along given axis.

    The first order difference is given by ``out[n] = a[n+1] - a[n]`` along the given axis, higher order differences are calculated by using `diff` recursively.

    Parameters
    ----------
    a : array_like
        Input array
    n : int, optional
        The number of times values are differenced.
    axis : int, optional
        The axis along which the difference is taken, default is the last axis.

    Returns
    -------
    diff : ndarray
        The `n` order differences. The shape of the output is the same as `a` except along `axis` where the dimension is smaller by `n`.

    See Also
    --------
    gradient, ediff1d, cumsum

    Examples
    --------
    >>> x = np.array([1, 2, 4, 7, 0])
    >>> np.diff(x)
    array([ 1,  2,  3, -7])
    >>> np.diff(x, n=2)
    array([  1,   1, -10])

    Much better now!
    Parameters and their type
    Suggestion of other functions
    Simple example
  29. pydocweb and NumPy documentation marathon

    Tools by Pauli Virtanen, with enthusiastic cheering from my side
    Documentation effort led by Stéfan van der Walt
    Easy as Wikipedia
    A wiki to improve the docs
    We didn't have GitHub!
  30. NumPy documentation standard

    https://github.com/numpy/numpy/blob/master/doc/example.py

    def foo(var1, var2, long_var_name='hi'):
        r"""A one-line summary that does not use variable names or the
        function name.

        Several sentences providing an extended description. Refer to
        variables using back-ticks, e.g. `var`.

        Parameters
        ----------
        var1 : array_like
            Array_like means all those objects -- lists, nested lists, etc. --
            that can be converted to an array. We can also refer to
            variables like `var1`.
        var2 : int
            The type above can either refer to an actual Python type
            (e.g. ``int``), or describe the type of the variable in more
            detail, e.g. ``(N,) ndarray`` or ``array_like``.
        long_var_name : {'hi', 'ho'}, optional
            Choices in brackets, default first when optional.

        Returns
        -------
        type
            Explanation of anonymous return value of type ``type``.
        describe : type
            Explanation of return value named `describe`.
        out : type
            Explanation of `out`.

        Other Parameters
        ----------------
        only_seldom_used_keywords : type
            Explanation
        common_parameters_listed_above : type
            Explanation
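Applied to a concrete function, the standard shown on slide 30 might look like the sketch below. The `rescale` function is a made-up example written for this illustration; it does not come from NumPy or scikit-image.

```python
import numpy as np

def rescale(image, low=0.0, high=1.0):
    """Linearly rescale the values of an array to a given range.

    Parameters
    ----------
    image : array_like
        Input data. It must contain at least two distinct values.
    low : float, optional
        Smallest value of the output.
    high : float, optional
        Largest value of the output.

    Returns
    -------
    out : ndarray
        Array of floats with the same shape as `image`.

    See Also
    --------
    numpy.interp

    Examples
    --------
    >>> rescale([0, 5, 10]).tolist()
    [0.0, 0.5, 1.0]
    """
    image = np.asarray(image, dtype=float)
    span = image.max() - image.min()
    return low + (high - low) * (image - image.min()) / span

print(rescale([0, 5, 10]))
```

The sections appear in the same order as in the template: summary line, Parameters, Returns, See Also, Examples. The example in the docstring doubles as a doctest.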
  31. Outcome and impact of documentation marathon

    # of words in NumPy reference: 8,600 → 140,000
    New contributors: 250 accounts
    Lower entry barrier to contribute
    Increased the standard for other packages
    Made people proud about docs
  33. Scipy lecture notes

    Train a lot of people: need tools that scale
    Several weeks of tutorials!
    Beginners: the core of Scientific Python
    Advanced: learn more tricks
    Packages: specific applications and packages
    Developed and used for Euroscipy conferences
    Curated and enriched over the years
  34. Euroscipy conferences

    Every August: Leipzig, Paris, Brussels, Cambridge
    Next time: Erlangen
    2 days of tutorials, beginners and advanced
    2 days of conference
    Help from volunteers always welcome!
  35. Achieving sustainable growth

    Balance users' and contributors' goals: robustness and smooth learning curve vs cool factor and bleeding-edge tools
    Feature development should not be faster than quality improvement
    Documentation and training for users
    Low entry barriers for contributors
  36. Massive data processing and parallelization

    Competitive environment: some other tools use GPUs, Spark, etc. scikit-image uses NumPy!
    I/O: large images might not fit into memory; use memory mapping of different file formats (raw binary with NumPy, hdf5 with pytables)
    Divide into blocks: use util.view_as_blocks to iterate conveniently over blocks
    Parallel processing: use joblib or dask
    Better integration desirable
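The memory-mapping and block-processing advice on the "Massive data processing" slide can be sketched with NumPy alone. The file name, image size and block size below are assumptions for illustration; the blocks are cut with explicit slicing here so that the example depends only on NumPy, where the slide suggests `util.view_as_blocks` for the same purpose.

```python
import os
import tempfile
import numpy as np

# Write a raw binary file standing in for a large image (hypothetical data).
shape = (256, 256)
path = os.path.join(tempfile.mkdtemp(), "large_image.raw")
np.arange(shape[0] * shape[1], dtype=np.float32).reshape(shape).tofile(path)

# Memory mapping: the file is opened without being read entirely into memory;
# data is fetched from disk only for the parts that are accessed.
image = np.memmap(path, dtype=np.float32, mode="r", shape=shape)

# Divide into non-overlapping 64x64 blocks and reduce each block separately.
block = 64
block_means = np.empty((shape[0] // block, shape[1] // block))
for i in range(block_means.shape[0]):
    for j in range(block_means.shape[1]):
        tile = image[i * block:(i + 1) * block, j * block:(j + 1) * block]
        block_means[i, j] = tile.mean(dtype=np.float64)

print(block_means.shape, block_means[0, 0])
```

Each block is processed independently of the others, so the double loop is also a natural place to plug in joblib or dask for parallel execution.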
  38. joblib: easy simple parallel computing + lazy re-evaluation

    >>> from skimage import data, util
    >>> hubble = data.hubble_deep_field()
    >>> width = 10
    >>> pics = util.view_as_windows(hubble, (width, hubble.shape[1], hubble.shape[2]), step=width)
    >>> from joblib import Parallel, delayed
    >>> # task is an image processing function
    >>> Parallel(n_jobs=4)(delayed(task)(pic) for pic in pics)
  39. joblib: easy simple parallel computing + lazy re-evaluation

    Familiar with this mess?
    from skimage import filters
    # Comment to save some time
    # filter_im = filters.median(im)
    # binary_im = filters.threshold_otsu(filter_im)
    values = np.unique(im)
  40. joblib: easy simple parallel computing + lazy re-evaluation

    Familiar with this mess?
    from skimage import filters
    # Comment to save some time
    # filter_im = filters.median(im)
    # binary_im = filters.threshold_otsu(filter_im)
    values = np.unique(im)

    >>> from joblib import Memory
    >>> mem = Memory(cachedir='/tmp/joblib')
    >>> square = mem.cache(np.square)
    >>> b = square(a)
    [Memory] Calling square...
    square(array([[ 0.,  0.,  1.],
                  [ 1.,  1.,  1.],
                  [ 4.,  2.,  1.]]))
    square - 0...s, 0.0 min
    >>> c = square(a)
    >>> # The above call did not trigger an evaluation
  41. A platform to build an ecosystem upon

    Tool for users, platform for other tools
    $ apt-cache rdepends python-matplotlib ... 96 Python packages & applications
    Specific applications that could build on scikit-image
    Imaging techniques: microscopy, tomography, ...
    Fields: cell biology, astronomy, ...
    Requirements: stable API, good docs
  42. No need to be a programming genius to contribute to OSS

    Social and pedagogical skills useful and welcome
    You will learn a lot and make friends. P. Hintjens
  43. No need to be a programming genius to contribute to OSS

    Social and pedagogical skills useful and welcome
    You will learn a lot and make friends. P. Hintjens
    Try it out! http://scikit-image.org/
    Feedback welcome: github.com/scikit-image/scikit-image
    Please cite the paper
    Let's talk about scikit-image: @EGouillart
  44. Python African Tour

    Dakar 2009
    Project started by Kamon Ayeva
    Python for IT students (web development, etc.)
    Scientific Python for engineering/science students
  45. A shortage of developers

    Fernando Perez & Aaron Meurer, Gist 5843625
    A low bus factor
    A few people do most of the work