The scientific Python ecosystem: open source tools for better computing in science

A talk given at the CU Boulder BioFrontiers Institute, giving a general overview of the scientific Python landscape.

Fernando Perez

April 02, 2012

Transcript

  1. Changes Two vignettes Scientific Python IPython Lessons from the open

    source world Where is this going? The scientific Python ecosystem: open source tools for better computing in science Fernando Pérez http://fperez.org Fernando.Perez@berkeley.edu Helen Wills Neuroscience Institute, UC Berkeley BioFrontiers, CU Boulder April 2, 2012
  2. Outline

    1 Changes in Science & Computing
    2 Two vignettes
    3 Scientific Python
    4 IPython
    5 Lessons from the open source world
    6 Where is this going?
    FP (UC Berkeley) Python for science 4/2/12 2 / 54
  4. Computing: part of the DNA of science

    Much more than “the third branch” of science.
    An avalanche of experimental quantitative data: biology, genetics, neuroscience, astronomy, climate modeling...
    All scientists must now do real computing.
    “Big Data”, “Cloud computing”, etc.: lots of buzzwords... They will NOT automatically produce good science.
    Good computing is now a necessary (though not sufficient!) condition for good science.
    The rigor, openness, culture of validation, collaboration and other aspects of science must also become part of scientific computing.
  6. Not all clouds are necessarily good...

  7. A crisis of credibility and real issues

    The Duke clinical trials scandal (Potti/Nevins):
    A compounding of (common and otherwise) data analysis errors.
    No materials allowing validation/reproduction of results.
    Patients were harmed. Lawsuits, resignations.
    Major policy reviews and changes: NCI, IOM, ...
    More: see K. Baggerly’s "starter set" page.
    The Duke situation is more common than we’d like to believe!
    Begley & Ellis, Nature, 3/28/12: Drug development: Raise standards for preclinical cancer research. 47 out of 53 “landmark papers” could not be replicated.
    Ince et al., Nature, Feb 2012: The case for open computer programs.
    “The scientific community places more faith in computation than is justified”
    “anything less than the release of actual source code is an indefensible approach for any scientific results that depend on computation”
  10. Related changes: Open *

    Internet: interactions for humans, code and data.
    Open Source Software: development akin to scientific culture; viable alternatives to proprietary software; tools and lessons for improving the scientific process: GitHub.
    Open Access: thecostofknowledge.org: Elsevier boycott. FRPAA House hearing on March 29th.
    Open Education: MIT OpenCourseWare, Khan Academy... Stanford CS 221 in fall 2011: ~160,000 students. Spring 2012: Sebastian Thrun leaves Stanford: Udacity. Stanford: Coursera. MITx, TED-Ed...
  15. #1: Adaptive, multiwavelet PDE tools

    Gregory Beylkin, Vani Cheruvu, FP. Applied Math, CU Boulder.
    Fast application of integral kernels. (Partial Differential Equations)
    Implementation went from 1 to 3 dimensions directly (extremely unusual).
    Complex algorithm: beyond pure numerics.
    Very good performance, thanks to NumPy, F2PY and weave.
    Dynamically generated C++ sources: code as a run-time resource.
    Nnod = 10, ε = 1.0e-10, Nblocks = 445
  16. #2: Mining the literature on macaque brain connectivity

    Mark D’Esposito, Rob Blumenfeld, Daniel Bliss, FP; UC Berkeley.
    Anatomical brain connectivity experiments: difficult and expensive.
    Invaluable dataset in a web server.
  17. From messy data to graph descriptions

    Programmatically query web server.
    Parse XML into rich graphs.
    NetworkX graph library: Aric Hagberg at LANL.
    [Figure: Full Graph - Sagittal view; nodes labeled by brain area]
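
The "parse XML into rich graphs" step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the project: the XML layout, the `connection` element, and its `source`, `target` and `strength` attributes are all hypothetical, since the talk does not describe the server's actual schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical server response; the real schema is not given in the talk.
SAMPLE_XML = """
<connections>
  <connection source="V1" target="V2" strength="3"/>
  <connection source="V2" target="V1" strength="2"/>
  <connection source="V2" target="TEO" strength="1"/>
</connections>
"""


def parse_connectivity(xml_text):
    """Return a directed graph as {source: {target: edge_attributes}}."""
    graph = defaultdict(dict)
    root = ET.fromstring(xml_text.strip())
    for conn in root.iter("connection"):
        src, dst = conn.get("source"), conn.get("target")
        graph[src][dst] = {"strength": int(conn.get("strength"))}
        graph.setdefault(dst, {})  # nodes with no outgoing edges still appear
    return dict(graph)


graph = parse_connectivity(SAMPLE_XML)
# Number of outgoing connections per brain area.
out_degree = {node: len(targets) for node, targets in graph.items()}
```

A dict-of-dicts like this is one of the input forms that NetworkX graph constructors accept, so in a real pipeline the result could be handed to the library for analysis and plotting.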
  19. Beyond (floating point) number crunching

    Hardware floating point; arbitrary precision integers; rationals; interval arithmetic; symbolic manipulation; FORTRAN; extended precision floating point.
    Text processing; databases; graphical user interfaces; web interfaces; hardware control; multi-language integration.
    Data formats: HDF5, XML, ...
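
A few of these number types ship with Python itself. The sketch below is an illustration added here, not taken from the slides; it uses only the standard library (`fractions`, `decimal`), and the variable names are arbitrary.

```python
from decimal import Decimal, getcontext
from fractions import Fraction

# Arbitrary precision integers: Python ints grow as needed, no overflow.
big = 2 ** 200

# Hardware floating point: binary rounding makes this comparison fail.
float_is_exact = (0.1 + 0.2 == 0.3)

# Rationals: the same sum is exact with Fraction.
rational_is_exact = (Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))
half = Fraction(1, 3) + Fraction(1, 6)

# Extended precision floating point: 50 significant decimal digits.
getcontext().prec = 50
sqrt2 = Decimal(2).sqrt()
```

Interval arithmetic and symbolic manipulation are not in the standard library; those need third-party packages.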
  20. Python in this context

    Open Source, free, highly portable.
    Extremely readable: “executable pseudo-code”.
    Simple: “fits your brain”.
    Rich types and library: “batteries included”.
    Easy to wrap C, C++ and FORTRAN.
    NumPy: IDL/Matlab-like arrays.
  21. The scientific Python ecosystem (incomplete view) IPython NetworkX

  22. NumPy: the foundation for array processing

    A flexible, efficient, multidimensional array object.
    Convenient syntax: c = a+b.
    Math library that operates on arrays: y = sin(k*t).
    Basic scientific functionality: linear algebra, FFTs, random number generation.
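
The features on this slide can be tried in a few lines. This is a small sketch assuming NumPy is installed; the array values, the 2x2 system and the test signal are made up for illustration.

```python
import numpy as np

# Convenient syntax: arithmetic on whole arrays is elementwise.
a = np.arange(4.0)          # [0., 1., 2., 3.]
b = np.ones(4)
c = a + b

# Math library that operates on arrays.
k = 2 * np.pi
t = np.linspace(0.0, 1.0, 5)  # 0, 0.25, 0.5, 0.75, 1
y = np.sin(k * t)

# Linear algebra: solve the system A x = rhs.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
rhs = np.array([3.0, 5.0])
x = np.linalg.solve(A, rhs)

# FFTs: a sine with 3 cycles over the window peaks at frequency bin 3.
n = 64
signal = np.sin(2 * np.pi * 3 * np.arange(n) / n)
peak_bin = int(np.argmax(np.abs(np.fft.rfft(signal))))

# Random number generation, seeded so runs are repeatable.
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)
```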
  23. SciPy: numerical algorithms galore

    linalg: Linear algebra routines (including BLAS/LAPACK)
    sparse: Sparse Matrices (including UMFPACK, ARPACK,...)
    fftpack: Discrete Fourier Transform algorithms
    cluster: Vector Quantization / Kmeans
    odr: Orthogonal Distance Regression
    special: Special Functions (Airy, Bessel, etc).
    stats: Statistical Functions
    optimize: Optimization Tools
    maxentropy: Routines for fitting maximum entropy models
    integrate: Numerical Integration routines
    ndimage: n-dimensional image package
    interpolate: Interpolation Tools
    signal: Signal Processing Tools
    io: Data input and output
    Lots more...
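
As a taste of three of the subpackages listed above (`integrate`, `optimize`, `special`), here is a short sketch assuming SciPy is installed. The integrand and the equation being solved are arbitrary examples chosen because their answers are known in closed form.

```python
import math

from scipy import integrate, optimize, special

# integrate: area under sin(x) on [0, pi]; the exact value is 2.
area, abs_err = integrate.quad(math.sin, 0.0, math.pi)

# optimize: root of x^2 - 2 bracketed in [0, 2]; the exact value is sqrt(2).
root = optimize.brentq(lambda x: x * x - 2.0, 0.0, 2.0)

# special: Bessel function of the first kind, order 0, at x = 0 (equals 1).
j0_at_zero = float(special.jv(0, 0.0))
```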
  24. Matplotlib: high-quality data visualization

  25. MayaVi: 3d visualization (VTK)

  26. FluidLab: a MayaVi based CFD visualization tool

    K. Julien, P. Schmitt (now NCAR), B. Barrow, F. Pérez (App. Math, CU).
  27. SymPy: symbolic and multiprecision computing

  28. NetworkX: tools for complex networks

    Aric Hagberg, Pieter Swart et al., Los Alamos Theory Division
  29. Scikits Learn: (easy to use) machine learning

    [Figure panels: FastICA on 2D point clouds (True Independent Sources, Observations, PCA scores); Linear Discr. Analysis and Quadratic Discr. Analysis (versicolor, virginica); SVM with non-linear kernel (RBF) (inliers, outliers); SVM: Weighted samples]
  31. IPython: Interactive Scientific Computing

    A CU Boulder project:
    Started when I was a graduate student in Physics (2001).
    Continued as a postdoc in Applied Mathematics.
    Brian Granger: CU Physics.
    In brief:
    1 A better Python shell
    2 Embeddable Kernel and powerful interactive clients: Terminal, Qt console, Web notebook
    3 Flexible parallel computing
  33. IPython: Matlab/IDL-like interactive use

  34. Qt console: inline plots, html, multiline editing, ...

    Evan Patterson (Enthought)
  35. Microsoft Visual Studio 2010 integrated console

    Dino Viehland and Shahrokh Mortazavi; http://pytools.codeplex.com
  36. Browser-based notebook: rich text, code, plots, ...

    Brian Granger, James Gao (Berkeley), rest of the team
  37. Interactive and high-level parallel APIs

    Min Ragan-Kelley, Brian Granger
  38. A mid-size project by now

  39. Other projects using IPython

    Scientific:
    EPD: Enthought Python Distribution.
    Sage: open source mathematics.
    PyRAF: Space Telescope Science Institute
    CASA: Nat. Radio Astronomy Observatory
    Ganga: CERN
    PyMAD: neutron spectrom., Laue Langevin
    Sardana: European Synchrotron Radiation
    ASCEND: eng. modeling (Carnegie Mellon).
    JModelica: dynamical systems.
    DASH: Denver Aerosol Sources and Health.
    Trilinos: Sandia National Lab.
    DoD: baseline configuration.
    Mayavi: 3d visualization, Enthought.
    NiPype: computational pipelines, MIT.
    PyIMSL Studio, by Visual Numerics.
    ...
    Web/Other:
    Visual Studio 2010: MS.
    Django.
    Turbo Gears.
    Pylons web framework
    Zope and Plone CMS.
    Axon Shell, BBC Kamaelia.
    Schevo database.
    Pitz: distributed task/bug tracking.
    iVR (interactive Virtual Reality).
    Movable Python (portable Python environment).
    ...
  40. Support

    Enthought, Austin, TX: Lots!
    Tech-X Corporation, Boulder, CO: Parallel/notebook (previous versions)
    Microsoft: WinHPC support, Visual Studio integration
    NIH: via NiPy grant
    NSF: via Sage compmath grant
    Google: summer of code 2005, 2010.
    DoD/HPTi.
  41. (Incomplete) Cast of Characters

    Brian Granger - Cal State San Luis Obispo Physics
    Min Ragan-Kelley - UC Berkeley Nuclear engineering.
    Thomas Kluyver - U. Sheffield Plant biology
    Jörgen Stenarson - SP Technical Research Institute of Sweden
    Paul Ivanov - UC Berkeley neuroscience
    Robert Kern - Enthought
    Evan Patterson - Caltech Physics/Enthought
    Stefan van der Walt - UC Berkeley
    John Hunter - TradeLink Securities, Chicago.
    Prabhu Ramachandran - Aerospace Engineering, IIT Bombay
    Satra Ghosh - MIT Neuroscience
    Gaël Varoquaux - Neurospin (Orsay, France)
    Ville Vainio - CS, Tampere University of Technology, Finland
    Barry Wark - Neuroscience, U. Washington.
    Ondrej Certik - Physics, U Nevada Reno
    Darren Dale - Cornell
    Justin Riley - MIT
    Mark Voorhies - UC San Francisco
    Nicholas Rougier - INRIA Nancy Grand Est
    Thomas Spura - Fedora project
    Julian Taylor - Debian/Ubuntu
    Many more! (~140 commit authors)
  43. What does it take to get reproducible research results?

    Reproducible research practices!
    Reproducibility at publication time? It’s already too late.
    Learn from a community (open source) where reproducibility is an everyday practice (by necessity).
  46. FOSS better than scientific research?

    FOSS: Free and Open Source Software
    Public distributed version control: provenance tracking
  47. Pull requests: ongoing peer review

  48. Pull requests: back and forth discussion

  49. Branches: exploratory work with control

  50. Automated tests: validation

    The IPython build dashboard: immediate feedback
  51. Public bug trackers

  52. Versioned science

    Git: the tool you didn’t know you needed.
    Reproducibility? Tracking and recreating every step of your work.
    In the software world: it’s called Version Control!
    Git: an enabling technology.
    Use version control for everything:
    Paper/grant writing (never get paper_v5_john.tex by email again!)
    Everyday research: track your results
    Teaching (never accept an emailed homework assignment again!)
  54. Git for running a course?

  55. One student’s work

  56. Details

  57. Benefits of teaching with Git

    Automatic timestamping of all work.
    Distributed backup: the dog cannot eat their homework!
    They can work from any computer.
    Easy downloading of all class materials without a million clicks.
    The end of the email attachment madness.
    Version control as a natural tool, as common as email.
  59. A brief demo of the IPython notebook

  60. IPython and the lifecycle of scientific ideas

    Individual exploration
    Collaboration: “Google docs with a brain”
    Large-scale parallel production work: IPython notebook on Amazon EC2 (MIT’s StarCluster)
    Publication: generation of HTML/PDF/EPub... “Executable papers”
    Education: workshops and bootcamps (UC Berkeley, elsewhere)
  61. The executable paper: Titus Brown (MSU), 3/21/12

    http://arxiv.org/abs/1203.4802
  62. Titus’ IPython notebook, runs on Amazon Cloud

  63. Next steps...

    IPython:
    Executable examples in books (with a large US publisher).
    A full book on brain imaging and statistics (JB Poline - Neurospin).
    DoD - classic HPC environments.
    Notebook: a format beyond Python (R, Matlab, etc...)
    UK: Python in education and the Raspberry Pi.
    Numfocus.org: a foundation to interface with industry, support open source scientific Python, and produce educational materials.
    Github.com: collaborations on ‘versioned science’.
  64. Things are changing...

    Journal policies...
    Funding agencies...
    Needs of everyday science...
    So we must also change:
    Improve our computational praxis.
    Better educate our students.
    Acknowledge computational work alongside other metrics of academic work.