Slide 1

Slide 1 text

Changes Two vignettes Scientific Python IPython Lessons from the open source world Where is this going?
The scientific Python ecosystem: open source tools for better computing in science
Fernando Pérez
http://fperez.org
Fernando.Perez@berkeley.edu
Helen Wills Neuroscience Institute, UC Berkeley
BioFrontiers, CU Boulder
April 2, 2012

Slide 2

Slide 2 text

Outline
1 Changes in Science & Computing
2 Two vignettes
3 Scientific Python
4 IPython
5 Lessons from the open source world
6 Where is this going?
FP (UC Berkeley) Python for science 4/2/12 2 / 54

Slide 4

Slide 4 text

Computing: part of the DNA of science
Much more than “the third branch” of science
An avalanche of experimental quantitative data
  Biology, genetics, neuroscience, astronomy, climate modeling...
All scientists must now do real computing
“Big Data”, “Cloud computing”, etc.: lots of buzzwords...
  They will NOT automatically produce good science
Good computing is now a necessary (though not sufficient!) condition for good science.
The rigor, openness, culture of validation, collaboration and other aspects of science must also become part of scientific computing.

Slide 6

Slide 6 text

Not all clouds are necessarily good...

Slide 7

Slide 7 text

A crisis of credibility and real issues
The Duke clinical trials scandal - Potti/Nevin
  A compounding of (common and otherwise) data analysis errors.
  No materials allowing validation/reproduction of results.
  Patients were harmed. Lawsuits, resignations.
  Major policy reviews and changes: NCI, IOM, ...
  More: see K. Baggerly’s "starter set" page.
The Duke situation is more common than we’d like to believe!
Begley & Ellis, Nature, 3/28/12: Drug development: Raise standards for preclinical cancer research.
  47 out of 53 “landmark papers” could not be replicated.
Nature, Feb 2012, Ince et al.: The case for open computer programs
  “The scientific community places more faith in computation than is justified”
  “anything less than the release of actual source code is an indefensible approach for any scientific results that depend on computation”

Slide 10

Slide 10 text

Related changes: Open *
Internet: interactions for humans, code and data
Open Source Software
  development akin to scientific culture
  viable alternatives to proprietary software
  tools and lessons for improving the scientific process: Github
Open Access
  thecostofknowledge.org: Elsevier boycott
  FRPAA House hearing on March 29th.
Open Education
  MIT Open Courseware, Khan Academy...
  Stanford CS 221 in fall 2011: ~160,000 students.
  Spring 2012:
    Sebastian Thrun leaves Stanford: Udacity.
    Stanford: Coursera.
    MITx, TED-Ed...

Slide 15

Slide 15 text

#1: Adaptive, multiwavelet PDE tools
Gregory Beylkin, Vani Cheruvu, FP. Applied Math, CU Boulder.
Fast application of integral kernels. (Partial Differential Equations)
Implementation went from 1 to 3 dimensions directly (extremely unusual).
Complex algorithm: beyond pure numerics.
Very good performance, thanks to NumPy, F2PY and weave.
Dynamically generated C++ sources: code as a run-time resource.
Nnod = 10, ε = 1.0e-10, Nblocks = 445

Slide 16

Slide 16 text

#2: Mining the literature on macaque brain connectivity
Mark D’Esposito, Rob Blumenfeld, Daniel Bliss, FP; UC Berkeley.
Anatomical brain connectivity experiments: difficult and expensive.
Invaluable dataset in a web server.

Slide 17

Slide 17 text

From messy data to graph descriptions
Programmatically query web server.
Parse XML into rich graphs.
NetworkX graph library: Aric Hagberg at LANL.
[Figure: Full Graph - Sagittal view]
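
The XML-to-graph step on this slide can be sketched roughly as follows. This is only an illustration: the XML schema, the element and attribute names (`projection`, `source`, `target`, `strength`) and the helper functions are assumptions, since the slides do not show the real server's format. The sketch uses only the standard library; the resulting edge list has the `(source, target, attributes)` shape that NetworkX's `DiGraph.add_edges_from` accepts.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical XML payload; the real server's schema is not shown in the slides.
SAMPLE_XML = """
<connectivity>
  <projection source="V1" target="V2" strength="strong"/>
  <projection source="V2" target="MST" strength="moderate"/>
  <projection source="V1" target="MST" strength="weak"/>
</connectivity>
"""


def parse_projections(xml_text):
    """Return a list of (source, target, attributes) edges from the XML text."""
    root = ET.fromstring(xml_text.strip())
    edges = []
    for node in root.iter("projection"):
        attrs = {"strength": node.get("strength")}
        edges.append((node.get("source"), node.get("target"), attrs))
    return edges


def adjacency(edges):
    """Build a directed adjacency mapping: source -> {target: attributes}."""
    adj = defaultdict(dict)
    for source, target, attrs in edges:
        adj[source][target] = attrs
    return dict(adj)


edges = parse_projections(SAMPLE_XML)
graph = adjacency(edges)
print(sorted(graph["V1"]))
```

With NetworkX installed, the same `edges` list could be loaded with `networkx.DiGraph().add_edges_from(edges)` to get a graph object with edge attributes.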

Slide 19

Slide 19 text

Beyond (floating point) number crunching
Hardware floating point
Arbitrary precision integers
Rationals
Interval arithmetic
Symbolic manipulation
FORTRAN
Extended precision floating point
Text processing
Databases
Graphical user interfaces
Web interfaces
Hardware control
Multi-language integration
Data formats: HDF5, XML, ...
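
A few of these needs are covered by Python's standard library alone. The sketch below is illustrative only: the specific numbers are arbitrary, and `decimal` is used as a stand-in for higher-precision arithmetic rather than as the tool the talk had in mind.

```python
from decimal import Decimal, getcontext
from fractions import Fraction

# Arbitrary precision integers: Python ints never overflow.
big = 2 ** 200

# Rationals: exact arithmetic with no rounding.
total = Fraction(1, 3) + Fraction(1, 6)

# Hardware floating point rounds; the rational version of the same sum does not.
float_exact = (0.1 + 0.2 == 0.3)
rational_exact = (Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))

# Higher-precision decimal arithmetic: 50 significant digits.
getcontext().prec = 50
root2 = Decimal(2).sqrt()

print(big)
print(total, float_exact, rational_exact)
print(root2)
```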

Slide 20

Slide 20 text

Python in this context
Open Source, free, highly portable.
Extremely readable: “executable pseudo-code”.
Simple: “fits your brain”.
Rich types and library: “batteries included”
Easy to wrap C, C++ and FORTRAN.
NumPy: IDL/Matlab-like arrays.

Slide 21

Slide 21 text

The scientific Python ecosystem (incomplete view) IPython NetworkX

Slide 22

Slide 22 text

NumPy: the foundation for array processing
A flexible, efficient, multidimensional array object.
Convenient syntax: c = a+b.
Math library that operates on arrays: y = sin(k*t).
Basic scientific functionality:
  Linear algebra
  FFTs
  Random number generation
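
The features listed on this slide can be tried in a few lines. This is a minimal sketch with made-up values; the array contents, the small linear system and the random seed are illustrative assumptions, not taken from the talk.

```python
import numpy as np

# Elementwise array arithmetic: c = a + b
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
c = a + b

# Math functions operate on whole arrays: y = sin(k*t)
k = 2.0 * np.pi
t = np.linspace(0.0, 1.0, 8, endpoint=False)  # one period, 8 samples
y = np.sin(k * t)

# Linear algebra: solve 3x + y = 9, x + 2y = 8
A = np.array([[3.0, 1.0], [1.0, 2.0]])
rhs = np.array([9.0, 8.0])
x = np.linalg.solve(A, rhs)

# FFT: a single sine cycle over the window peaks in frequency bin 1
spectrum = np.fft.rfft(y)
peak = int(np.argmax(np.abs(spectrum)))

# Random number generation with a fixed seed for repeatability
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)

print(c, x, peak, samples.shape)
```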

Slide 23

Slide 23 text

SciPy: numerical algorithms galore
linalg : Linear algebra routines (including BLAS/LAPACK)
sparse : Sparse Matrices (including UMFPACK, ARPACK,...)
fftpack : Discrete Fourier Transform algorithms
cluster : Vector Quantization / Kmeans
odr : Orthogonal Distance Regression
special : Special Functions (Airy, Bessel, etc).
stats : Statistical Functions
optimize : Optimization Tools
maxentropy : Routines for fitting maximum entropy models
integrate : Numerical Integration routines
ndimage : n-dimensional image package
interpolate : Interpolation Tools
signal : Signal Processing Tools
io : Data input and output
Lots more...
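
As a small taste of three of these subpackages working together, the sketch below integrates a Gaussian with `integrate`, and finds the first zero of a Bessel function by combining `special` and `optimize`. The choice of problems is an illustrative assumption, not an example from the talk.

```python
import numpy as np
from scipy import integrate, optimize, special

# integrate: the Gaussian integral over the real line equals sqrt(pi).
area, abs_err = integrate.quad(lambda x: np.exp(-x ** 2), -np.inf, np.inf)

# special + optimize: J0 changes sign between 2 and 3, so a bracketing
# root finder can locate its first zero there.
root = optimize.brentq(special.j0, 2.0, 3.0)

print(area, np.sqrt(np.pi))
print(root)
```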

Slide 24

Slide 24 text

Matplotlib: high-quality data visualization

Slide 25

Slide 25 text

MayaVi: 3d visualization (VTK)

Slide 26

Slide 26 text

FluidLab: a MayaVi based CFD visualization tool K. Julien, P. Schmitt (now NCAR), B. Barrow, F. Pérez (App. Math, CU).

Slide 27

Slide 27 text

Sympy: symbolic and multiprecision computing

Slide 28

Slide 28 text

NetworkX: tools for complex networks
Aric Hagberg, Pieter Swart et al., Los Alamos Theory Division

Slide 29

Slide 29 text

Scikits Learn: (easy to use) machine learning
[Figure panels: FastICA on 2D point clouds (True Independent Sources, Observations, PCA scores; axes x, y; legend PCA, ICA); Linear Discr. Analysis and Quadratic Discr. Analysis (versicolor, virginica); SVM with non-linear kernel (RBF); inliers, outliers; SVM: Weighted samples]

Slide 31

Slide 31 text

IPython: Interactive Scientific Computing
A CU Boulder project
  Started when I was a graduate student in Physics (2001).
  Continued as a postdoc in Applied Mathematics.
  Brian Granger: CU Physics.
In brief
1 A better Python shell
2 Embeddable Kernel and powerful interactive clients
  1 Terminal
  2 Qt console
  3 Web notebook
3 Flexible parallel computing

Slide 33

Slide 33 text

IPython: Matlab/IDL-like interactive use

Slide 34

Slide 34 text

Qt console: inline plots, html, multiline editing, ... Evan Patterson (Enthought)

Slide 35

Slide 35 text

Microsoft Visual Studio 2010 integrated console Dino Viehland and Shahrokh Mortazavi; http://pytools.codeplex.com

Slide 36

Slide 36 text

Browser-based notebook: rich text, code, plots, ... Brian Granger, James Gao (Berkeley), rest of the team

Slide 37

Slide 37 text

Interactive and high-level parallel APIs Min Ragan-Kelley, Brian Granger

Slide 38

Slide 38 text

A mid-size project by now

Slide 39

Slide 39 text

Other projects using IPython
Scientific
  EPD: Enthought Python Distribution.
  Sage: open source mathematics.
  PyRAF: Space Telescope Science Institute
  CASA: Nat. Radio Astronomy Observatory
  Ganga: CERN
  PyMAD: neutron spectrom., Laue Langevin
  Sardana: European Synchrotron Radiation
  ASCEND: eng. modeling (Carnegie Mellon).
  JModelica: dynamical systems.
  DASH: Denver Aerosol Sources and Health.
  Trilinos: Sandia National Lab.
  DoD: baseline configuration.
  Mayavi: 3d visualization, Enthought.
  NiPype: computational pipelines, MIT.
  PyIMSL Studio, by Visual Numerics.
  ...
Web/Other
  Visual Studio 2010: MS.
  Django.
  Turbo Gears.
  Pylons web framework
  Zope and Plone CMS.
  Axon Shell, BBC Kamaelia.
  Schevo database.
  Pitz: distributed task/bug tracking.
  iVR (interactive Virtual Reality).
  Movable Python (portable Python environment).
  ...

Slide 40

Slide 40 text

Support
Enthought, Austin, TX: Lots!
Tech-X Corporation, Boulder, CO: Parallel/notebook (previous versions)
Microsoft: WinHPC support, Visual Studio integration
NIH: via NiPy grant
NSF: via Sage compmath grant
Google: summer of code 2005, 2010.
DoD/HPTi.

Slide 41

Slide 41 text

(Incomplete) Cast of Characters
Brian Granger - Cal State San Luis Obispo Physics
Min Ragan-Kelley - UC Berkeley Nuclear engineering.
Thomas Kluyver - U. Sheffield Plant biology
Jörgen Stenarson - SP Technical Research Institute of Sweden
Paul Ivanov - UC Berkeley neuroscience
Robert Kern - Enthought
Evan Patterson - Caltech Physics/Enthought
Stefan van der Walt - UC Berkeley
John Hunter - TradeLink Securities, Chicago.
Prabhu Ramachandran - Aerospace Engineering, IIT Bombay
Satra Ghosh - MIT Neuroscience
Gaël Varoquaux - Neurospin (Orsay, France)
Ville Vainio - CS, Tampere University of Technology, Finland
Barry Wark - Neuroscience, U. Washington.
Ondrej Certik - Physics, U Nevada Reno
Darren Dale - Cornell
Justin Riley - MIT
Mark Voorhies - UC San Francisco
Nicholas Rougier - INRIA Nancy Grand Est
Thomas Spura - Fedora project
Julian Taylor - Debian/Ubuntu
Many more! (~140 commit authors)

Slide 43

Slide 43 text

What does it take to get reproducible research results?
Reproducible research practices!
Reproducibility at publication time? It’s already too late.
Learn from a community (open source) where reproducibility is an everyday practice (by necessity)

Slide 46

Slide 46 text

FOSS better than scientific research?
FOSS: Free and Open Source Software
Public distributed version control: provenance tracking

Slide 47

Slide 47 text

Pull requests: ongoing peer review

Slide 48

Slide 48 text

Pull requests: back and forth discussion

Slide 49

Slide 49 text

Branches: exploratory work with control

Slide 50

Slide 50 text

Automated tests: validation
The IPython build Dashboard: immediate feedback
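
To make the idea concrete, here is a minimal sketch of what an automated validation test for a piece of scientific code might look like. The `trapezoid` function and its test cases are hypothetical examples, not code from IPython or its test suite.

```python
import math
import unittest


def trapezoid(f, a, b, n=1000):
    """Composite trapezoid rule on [a, b] with n intervals (hypothetical example)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


class TestTrapezoid(unittest.TestCase):
    def test_linear_function_is_exact(self):
        # The trapezoid rule is exact for linear integrands.
        self.assertAlmostEqual(trapezoid(lambda x: 2 * x, 0.0, 1.0), 1.0)

    def test_sine_over_half_period(self):
        # The integral of sin over [0, pi] is 2.
        self.assertAlmostEqual(trapezoid(math.sin, 0.0, math.pi), 2.0, places=5)


# Run the tests programmatically so the script keeps going afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTrapezoid)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Running such tests automatically on every change is what gives the immediate feedback the slide refers to.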

Slide 51

Slide 51 text

Public bug trackers

Slide 52

Slide 52 text

Versioned science
Git: the tool you didn’t know you needed
Reproducibility?
  Tracking and recreating every step of your work
  In the software world: it’s called Version Control!
Git: an enabling technology. Use version control for everything
  Paper/grant writing (never get paper_v5_john.tex by email again!)
  Everyday research: track your results
  Teaching (never accept an emailed homework assignment again!)
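
One possible way to "track your results" in everyday work is to stamp each result with the Git commit that produced it. The sketch below is an assumption about how that might be done, not a practice described in the talk: the helper names and the `mean_error` value are hypothetical, and the only Git command used is `git rev-parse HEAD`.

```python
import json
import subprocess
from datetime import datetime, timezone


def current_commit():
    """Return the current Git commit hash, or "unknown" outside a repository."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or this directory is not a repository.
        return "unknown"
    return out.stdout.strip()


def tag_result(result):
    """Attach the code version and a UTC timestamp to a result dictionary."""
    return {
        "result": result,
        "commit": current_commit(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical result value, for illustration only.
record = tag_result({"mean_error": 0.012})
print(json.dumps(record, indent=2))
```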

Slide 54

Slide 54 text

Git for running a course?

Slide 55

Slide 55 text

One student’s work

Slide 56

Slide 56 text

Details

Slide 57

Slide 57 text

Benefits of teaching with Git
Automatic timestamping of all work.
Distributed backup: the dog cannot eat their homework!
They can work from any computer.
Easy downloading of all class materials without a million clicks.
The end of the email attachment madness.
Version control as a natural tool, as common as email.

Slide 59

Slide 59 text

A brief demo of the IPython notebook

Slide 60

Slide 60 text

IPython and the lifecycle of scientific ideas
Individual exploration
Collaboration
  “Google docs with a brain”
Large-scale parallel production work
  IPython notebook on Amazon EC2: MIT’s StarCluster
Publication
  Generation of HTML/PDF/EPub...
  “Executable papers”
Education
  Workshops and bootcamps (UC Berkeley, elsewhere)

Slide 61

Slide 61 text

The executable paper: Titus Brown (MSU), 3/21/12
http://arxiv.org/abs/1203.4802

Slide 62

Slide 62 text

Titus’ IPython notebook, runs on Amazon Cloud

Slide 63

Slide 63 text

Next steps...
IPython
  Executable examples in books (with a large US publisher)
  A full book on brain imaging and statistics (JB Poline - Neurospin).
  DoD - classic HPC environments.
  Notebook: a format beyond Python (R, matlab, etc...)
UK: Python in education and the Raspberry Pi.
Numfocus.org: a foundation
  interface with industry.
  support open source scientific Python
  produce educational materials
Github.com: collaborations on ’versioned science’.

Slide 64

Slide 64 text

Things are changing...
  Journal policies...
  Funding agencies...
  Needs of everyday science...
So we must also change:
  Improve our computational praxis
  Better educate our students
  Acknowledge computational work alongside other metrics of academic work.