Upgrade to Pro — share decks privately, control downloads, hide ads and more …

PyData London: How I learnt programming and science with Scientific Python

PyData London: How I learnt programming and science with Scientific Python

PyData London 2018 keynote

Emmanuelle Gouillart

April 29, 2018
Tweet

More Decks by Emmanuelle Gouillart

Other Decks in Programming

Transcript

  1. Learning programming & science with Scientific Python Emmanuelle Gouillart joint

    Unit CNRS/Saint-Gobain SVI and the scikit-image team @EGouillart
  2. Some problems I want to solve Scientific / development problems

    Finding image processing pipelines for extracting scientific information Running these operations on very large datasets (> 1 Tb) in reasonable time Removing bugs in scikit-image
  3. Some problems I want to solve Scientific / development problems

    Finding image processing pipelines for extracting scientific information Running these operations on very large datasets (> 1 Tb) in reasonable time Removing bugs in scikit-image Social and learning problems Working with people with little experience in programming and image processing Helping people to learn programming and image processing How can we have more developers and users of scikit-image?
  4. What is image processing? Manipulations of images to Transform them

    for other purposes: color balance, enhancement, filtering, inpainting, ... Extract information from them: classification, measurements... a) flltering b) feature extraction c) segmentation d) measures non-local means denoising total variation denoising blob detection super-pixels skeleton & local diameter particle properties segmentation from markers ridge detection
  5. Traditional image processing: no future? Is deep learning the universal

    solution? Image classification From Keras blog Image segmentation FCN, from Daniil Pakhomov’s blog
  6. Still room for other tools! Non-natural images Avoid sledgehammer approach

    simple operations: image resiz- ing, histogram equalization, ... pre-processing for machine learning
  7. What is scikit-image? An open-source (BSD) generic image processing library

    for the Python language (and NumPy data arrays)
  8. What is scikit-image? An open-source (BSD) generic image processing library

    for the Python language (and NumPy data arrays) for 2D & 3D images simple API & gentle learning curve
  9. Feature extraction followed by classification Combining scikit-image and scikit-learn Daisy

    features Random Forest classifier Back to image processing: Gaussian filtering, thresholding and mathematical morphology.
  10. Datasheet Package statistics http://scikit-image.org/ Release 0.13 (1 - 2 release

    per year) Among 1000 best ranked packages on PyPi 20000 unique visitors / month
  11. Who is your typical user? Windows 54% Linux 26% OS

    X 20% Not a lot of hardcore geeks Not a lot of time on her plate Learning / finding information is hard
  12. The people 2 4 6 8 10 12 Individual Committer

    0.0 0.2 0.4 0.6 0.8 1.0 Commit rate Normalized commit rates since 2014-06 matplotlib numpy pandas scikit-image scikit-learn scipy A quite healthy curve... we can do better! Fernando Perez & Aaron Meurer, Gist 5843625
  13. My first experience of programming... >>> cd new_experiment >>> acquire_temperature

    () >>> name_exp = ’← convection ’ >>> control_parameter () >>> ... and other magical← spells x
  14. My first experience of programming... >>> cd new_experiment >>> acquire_temperature

    () >>> name_exp = ’← convection ’ >>> control_parameter () >>> ... and other magical← spells x
  15. NumpPy: Python objects for numerical arrays Multi-dimensional numerical data container

    (based on compiled code) + utility functions to create/manipulate them >>> a = np.random. random_integers (0, 1, (2, 2, 2)) >>> a array ([[[0 , 1], [1, 0]], [[0, 0], [0, 1]]]) >>> a.shape , a.dtype ((2, 2, 2), dtype(’int64 ’)) x
  16. NumpPy: Python objects for numerical arrays Multi-dimensional numerical data container

    (based on compiled code) + utility functions to create/manipulate them >>> a = np.random. random_integers (0, 1, (2, 2, 2)) >>> a array ([[[0 , 1], [1, 0]], [[0, 0], [0, 1]]]) >>> a.shape , a.dtype ((2, 2, 2), dtype(’int64 ’)) x Efficient and versatile data access indexing and slicing fancy indexing
  17. Manipulating images as numerical (numpy) arrays Pixels are arrays elements

    import numpy as np image = np.ones ((5, 5)) image [0, 0] = 0 image [2, :] = 0 x
  18. Manipulating images as numerical (numpy) arrays Pixels are arrays elements

    import numpy as np image = np.ones ((5, 5)) image [0, 0] = 0 image [2, :] = 0 x >>> coffee.shape (400, 600, 3) >>> red channel = coffee[..., 0] >>> image 3d =
  19. NumPy-native: images as NumPy arrays NumPy arrays as arguments and

    outputs >>> from skimage import io , filters >>> camera_array = io.imread(’camera_image .png’) >>> type(camera_array ) <type ’numpy.ndarray ’> >>> camera_array .dtype dtype(’uint8 ’) >>> filtered_array = filters.gaussian(camera_array , ← sigma =5) >>> type( filtered_array ) <type ’numpy.ndarray ’> >>> import matplotlib.pyplot as plt >>> plt.imshow(filtered_array , cmap=’gray ’) x
  20. How we simplified the API Before 2013 >>> from skimage

    import io , filters >>> camera_array = io.imread(’camera_image .png’) >>> type(camera_array ) Image ... >> camera.max() Image (255 , dtype=uint8) x
  21. Versatile use for 2D, 2D-RGB, 3D... >>> from skimage import

    measure >>> labels_2d = measure.label(image_2d) >>> labels_3d = measure.label(image_3d) x
  22. Versatile use for 2D, 2D-RGB, 3D... def quickshift(image , ratio

    =1.0, kernel_size =5, ← max_dist =10, sigma =0, random_seed =42): """ Segments image using quickshift clustering in← Color -(x,y) space. ... """ image = img_as_float(np.atleast_3d(image)) ... x
  23. Versatile use for 2D, 2D-RGB, 3D... def quickshift(image , ratio

    =1.0, kernel_size =5, ← max_dist =10, sigma =0, random_seed =42): """ Segments image using quickshift clustering in← Color -(x,y) space. ... """ image = img_as_float(np.atleast_3d(image)) ... x One filter = one function Use keyword argument for parameter tuning
  24. API of scikit-image skimage filters restoration segmentation ... denoise_bilateral input

    array + optional parameters output (array) submodule module function variables
  25. Consistent names filters . median ( image , selem=None ,

    ← out=None , . . . ) morphology . binary_erosion ( image , ← selem=None , out=None ) restoration . denoise_bilateral (← image , win_size=None , . . . )
  26. Consistent names filters . median ( image , selem=None ,

    ← out=None , . . . ) morphology . binary_erosion ( image , ← selem=None , out=None ) restoration . denoise_bilateral (← image , win_size=None , . . . )
  27. Learning from mistakes: confusing behavior im = img_as_float ( data

    . camera ( ) ) hist_1 = exposure . histogram ( im ) im_med = filters . median ( im ) hist_2 = exposure . histogram (← im_med ) 0 50 100 150 200 250 im im_med
  28. Docstrings now and then docstring in 2008 Definition: np.diff(a, n=1,

    axis =-1) Docstring: Calculate the n-th order discrete difference along ← given axis. x
  29. Docstrings now and then Definition: np.diff(a, n=1, axis =-1) Docstring:

    Calculate the n-th order discrete difference along given axis. The first order difference is given by ‘‘out[n] = a[n+1] - a[n]‘‘ along the given axis , higher order differences are calculated by using ‘diff ‘ recursively . Parameters ---------- a : array_like Input array n : int , optional The number of times values are differenced. axis : int , optional The axis along which the difference is taken , default is the last axis. Returns ------- diff : ndarray The ‘n‘ order differences . The shape of the output is the same as ‘a‘ except along ‘axis ‘ where the dimension is smaller by ‘n‘. See Also -------- gradient , ediff1d , cumsum Examples -------- >>> x = np.array ([1, 2, 4, 7, 0]) >>> np.diff(x) much better now! Parameters and their type Suggestion of other functions Simple example
  30. pydocweb and NumPy documentation Marathon Tools by Pauli Virtnanen, with

    enthusiastic cheering from my side Documentation effort led by St´ efan van der Walt Easy as Wikipedia A wiki to improve the docs We didn’t have Github!
  31. NumPy documentation standard https://github.com/numpy/numpy/blob/master/doc/example.py def foo(var1 , var2 , long_var_name

    =’hi’) : r"""A one -line summary that does not use variable names or the function name. Several sentences providing an extended description . Refer to variables using back -ticks , e.g. ‘var ‘. Parameters ---------- var1 : array_like Array_like means all those objects -- lists , nested lists , etc. -- that can be converted to an array. We can also refer to variables like ‘var1 ‘. var2 : int The type above can either refer to an actual Python type (e.g. ‘‘int ‘‘), or describe the type of the variable in more detail , e.g. ‘‘(N,) ndarray ‘‘ or ‘‘array_like ‘‘. Long_variable_name : {’hi ’, ’ho ’}, optional Choices in brackets , default first when optional. Returns ------- type Explanation of anonymous return value of type ‘‘type ‘‘. describe : type Explanation of return value named ‘describe ‘. out : type Explanation of ‘out ‘. Other Parameters ---------------- only_seldom_used_keywords : type Explanation
  32. Outcome and impact of documentation marathon # of words in

    Numpy reference: 8600 → 140,000 New contributors: 250 accounts Lower entry barrier to contribute Increased the standard for other packages Made people proud about docs
  33. Outcome and impact of documentation marathon # of words in

    Numpy reference: 8600 → 140,000 New contributors: 250 accounts Lower entry barrier to contribute Increased the standard for other packages Made people proud about docs
  34. Example: segmentation of low-constrast regions In-situ imaging of glass batch

    reactive melting Non-local means denoising to preserve texture Histogram-based markers extraction Random walker segmentation Non-local means: average similar patches Random walker:anisotropic diffusion from markers Random walker less sensitive to noise than watershed, but slower
  35. Keeping interaction easy for large data from joblib import Memory

    memory = Memory ( cachedir=’ . / c a c h e d i r ’ , verbose=0) @memory . cache def mem_label ( x ) : r e t u r n measure . label ( x ) @memory . cache def mem_threshold_otsu ( x ) : r e t u r n filters . threshold_otsu ( x ) [ . . . ] val = mem_threshold_otsu ( dat ) objects = dat > val median_dat = mem_median_filter ( dat , 3) val2 = mem_threshold_otsu ( median_dat [ objects ] ) liquid = median_dat > val2 segmentation_result = np . copy ( objects ) . astype ( np . uint8 ) segmentation_result [ liquid ] = 2 aggregates = mem_binary_fill_holes ( objects ) aggregates_ds = np . copy ( aggregates [ : : 4 , : : 4 , : : 4 ] ) cores = mem_binary_erosion ( aggregates_ds , np . ones ((10 , 10 ,← 10) ) )
  36. Euroscipy conferences Leipzig, Paris, Brussels, Cambridge, Erlangen 2018: Trento 2

    days of tutorials, beginners and advanced 2 days of conference Help from volunteers always welcome!
  37. Scipy lecture notes Train a lot of people: need tools

    that scale Several weeks of tutorials! Beginners: the core of Scientific Python Advanced: learn more tricks Packages: specific applications and packages Developed and used for Euroscipy conferences Curated and enriched over the years
  38. Denoising tomography images In-situ imaging of phase separation in silicate

    melts From basic (generic) to advanced (specific) filters
  39. Denoising tomography images Histogram of pixel values From basic (generic)

    to advanced (specific) filters bilateral = restoration.denoise_bilateral(dat) bilateral = restoration.denoise_bilateral(dat, sigma_range← =2.5, sigma_spatial=2) tv = restoration.denoise_tv_chambolle(dat, weight=0.5)
  40. Achieving a sustainable growth Balance users’ and contributors’ goals: robustness

    and smooth learning curve vs cool factor and bleeding-edge tools Feature development should not be faster than quality improvement Documentation and training for users Low entry barriers for contributors
  41. Massive data processing and parallelization Competitive environment: some other tools

    use GPUs, Spark, etc. scikit-image uses NumPy! I/O: large images might not fit into memory use memory mapping of different file formats (raw binary with NumPy, hdf5 with pytables). Divide into blocks: use util.view as blocks to iterate conveniently over blocks Parallel processing: use joblib or dask Better integration desirable
  42. Massive data processing and parallelization Competitive environment: some other tools

    use GPUs, Spark, etc. scikit-image uses NumPy! I/O: large images might not fit into memory use memory mapping of different file formats (raw binary with NumPy, hdf5 with pytables). Divide into blocks: use util.view as blocks to iterate conveniently over blocks Parallel processing: use joblib or dask Better integration desirable
  43. joblib: easy simple parallel computing + lazy re-evaluation import numpy

    as np from sklearn . externals . joblib import Parallel , delayed def apply_parallel ( func , data , ∗args , chunk =100, overlap =10, n_jobs=4, ∗∗kwargs ) : ””” Apply a f u n c t i o n i n p a r a l l e l to o v e r l a p p i n g chunks of an a r r a y . j o b l i b i s used f o r p a r a l l e l p r o c e s s i n g . [ . . . ] Examples − − − − − − − − > > > from skimage import data , f i l t e r s > > > c o i n s = data . c o i n s () > > > r e s = a p p l y p a r a l l e l ( f i l t e r s . gaussian , coins , 2) ””” sh0 = data . shape [ 0 ] nb_chunks = sh0 // chunk end_chunk = sh0 % chunk arg_list = [ data [ max (0 , i∗chunk − overlap ) : min (( i+1)∗chunk + overlap , sh0 ) ] f o r i i n range (0 , nb_chunks ) ] i f end_chunk > 0 : arg_list . append ( data[−end_chunk − overlap : ] ) res_list = Parallel ( n_jobs=n_jobs ) ( delayed ( func ) ( sub_im , ∗args , ∗∗kwargs ) f o r sub_im i n arg_list ) output_dtype = res_list [ 0 ] . dtype out_data = np . empty ( data . shape , dtype=output_dtype ) f o r i i n range (1 , nb_chunks ) : out_data [ i∗chunk : ( i+1)∗chunk ] = res_list [ i ] [ overlap : overlap+chunk ] out_data [ : chunk ] = res_list [0][: − overlap ] i f end_chunk > 0 : out_data[−end_chunk : ] = res_list [ −1][ overlap : ] r e t u r n out_data
  44. A platform to build an ecosystem upon Tool for users,

    platform for other tools $ apt-cache rdepends python-matplotlib ... 96 Python packages & applications Specific applications that could build on scikit-image Imaging techniques; microscopy, tomography, ... Fields: cell biology, astronomy, ... Requirements: stable API, good docs
  45. Where Do We Come From? What Are We? Where Are

    We Going? What is great with our tools? What can we learn about them to create new tools? Do we need other tools?
  46. We’ve got great tools Can we make them even more

    ”self-learning”? Consider contributing to the ecosystem: you don’t have to be a programming genius. Which tools are we lacking yet? For which problems are we at the pre-numpy / pre-pandas stage? User interfaces? Data retrieval and cleaning?