Openness, Reproducibility, Interactivity: a Biased View on the Relation between Science and Computing
Slides for my keynote at the Michigan State University Cyber-Infrastructure days. Note that a good part of the talk consisted of interactive demos of various aspects of IPython.
Changes, Open Source, SciPy, IPython, Where Next? Fernando Pérez (http://fperez.org, @fperez_org, [email protected]). Helen Wills Neuroscience Institute, UC Berkeley. CI Days, BEACON Center, MSU, East Lansing. October 26, 2012.
Computing: part of the DNA of science. Much more than “the third branch” of science. An avalanche of experimental quantitative data: biology, genetics, neuroscience, astronomy, climate modeling... All scientists must now do real computing. Good computing is now a necessary (though not sufficient!) condition for good science. Computing in science must improve drastically before we can really call it scientific. FP (UC Berkeley) Openness, Reproducibility, Interactivity 10/26/12 4 / 42
A crisis of credibility and real issues. The Duke clinical trials scandal (Potti/Nevins): a compounding of (common and otherwise) data analysis errors; no materials allowing validation/reproduction of results; lawsuits, resignations, careers destroyed. More importantly: patients were harmed. Major policy reviews and changes: NCI, IOM, ... More: see K. Baggerly’s "starter set" page. The Duke situation is more common than we’d like to believe! Begley & Ellis, Nature, 3/28/12: "Drug development: Raise standards for preclinical cancer research". 47 out of 53 “landmark papers” could not be replicated. Ince et al., Nature, Feb 2012: "The case for open computer programs". “The scientific community places more faith in computation than is justified”; “anything less than the release of actual source code is an indefensible approach for any scientific results that depend on computation”.
What does it take to get reproducible research results? Reproducible research practices! Reproducibility at publication time? It’s already too late. Learn from a community (open source) where reproducibility is an everyday practice (by necessity)
Versioned science. Git: the tool you didn’t know you needed. Reproducibility? Tracking and recreating every step of your work. In the software world, it’s called version control! Git: an enabling technology. Use version control for everything. Paper/grant writing (never get paper_v5_john.tex by email again!):
    git clone [email protected]:/my/grant/repo.git
    cd repo
    make nsf-fastlane
Everyday research: track your results. Collaboration: synchronize multi-author work. Teaching!
Beyond (Floating Point) Number Crunching: hardware floating point, arbitrary precision integers, rationals, interval arithmetic, symbolic manipulation, FORTRAN, extended precision floating point, text processing, databases, graphical user interfaces, web interfaces, hardware control, multi-language integration, data formats (HDF5, XML, ...).
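A few of the items on this slide can be illustrated with the Python standard library alone. The following is a minimal sketch, not taken from the talk; the variable names and the choice of 50 digits of precision are illustrative assumptions.

```python
from decimal import Decimal, getcontext
from fractions import Fraction

# Arbitrary precision integers: Python ints grow as needed and never overflow.
big = 2 ** 100

# Rationals: exact arithmetic, no rounding error.
third_plus_sixth = Fraction(1, 3) + Fraction(1, 6)

# Hardware floating point: 0.1 and 0.2 have no exact binary representation,
# so this comparison is False.
float_sum_is_exact = (0.1 + 0.2 == 0.3)

# Extended precision floating point: 50 significant decimal digits
# (the precision value is an arbitrary choice for this sketch).
getcontext().prec = 50
sqrt2 = Decimal(2).sqrt()

print(big)
print(third_plus_sixth)
print(float_sum_is_exact)
print(sqrt2)
```

Interval arithmetic and symbolic manipulation are not covered here because they need third-party packages.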
NumPy: the Foundation. Modern array processing. A flexible, efficient, multidimensional array object. Convenient syntax: c = a+b. Math library that operates on arrays: y = sin(k*t). Basic scientific functionality: linear algebra, FFTs, random number generation.
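The features listed on this slide can be sketched in a few lines. This example is an illustration and not from the talk: the array sizes, the 4-cycle sine wave, the 2x2 linear system, and the seed are all assumed values, and it requires NumPy 1.17 or later for `np.random.default_rng`.

```python
import numpy as np

# Convenient syntax: elementwise arithmetic on whole arrays.
a = np.arange(6.0).reshape(2, 3)
b = np.ones((2, 3))
c = a + b

# Math library that operates on arrays: y = sin(k*t).
t = np.linspace(0.0, 1.0, 64, endpoint=False)
k = 2 * np.pi * 4  # 4 full cycles over the unit interval (assumed value)
y = np.sin(k * t)

# FFTs: the spectrum of y should peak at frequency bin 4.
spectrum = np.abs(np.fft.rfft(y))
peak_bin = int(np.argmax(spectrum))

# Linear algebra: solve the small system M x = v.
M = np.array([[2.0, 1.0], [1.0, 3.0]])
v = np.array([3.0, 5.0])
x = np.linalg.solve(M, v)

# Random number generation with a fixed seed, so runs are repeatable.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)

print(c)
print(peak_bin)
print(x)
print(samples.mean())
```

Fixing the seed is a small instance of the reproducibility theme of the talk: the same script gives the same random samples on every run.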
The Lifecycle of a Scientific Idea (schematically): 1. Individual exploratory work. 2. Collaborative development. 3. Production work (HPC, cloud, parallel). 4. Publication (with reproducible results!). 5. Education. 6. Goto 1. The problem with most tools: barriers and discontinuities in the workflow between all the steps.
Pillar #2: the Notebook Format. JSON, but version control-friendly. Easy for machine processing, fixable by hand if need be. Lots of hooks for metadata. Not Python-specific (R and Ruby notebooks exist, Julia planned). Produce Markdown, reST, LaTeX, HTML, etc. An open format for sharing, publishing and archiving executable computational work.
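To make "easy for machine processing" concrete, here is a minimal sketch using only the standard `json` module. The dictionary layout below is a simplified, hypothetical structure invented for illustration; it is not the actual IPython notebook schema, and `to_markdown` is a hypothetical helper, not part of IPython.

```python
import json

# Hypothetical, simplified notebook-like document (NOT the real schema).
notebook = {
    "metadata": {"title": "demo"},
    "cells": [
        {"cell_type": "markdown", "source": "# A tiny analysis"},
        {"cell_type": "code", "source": "x = 2 + 2"},
        {"cell_type": "code", "source": "print(x)"},
    ],
}

# Indented output with sorted keys yields stable, line-oriented text,
# which keeps version control diffs small and readable.
text = json.dumps(notebook, indent=1, sort_keys=True)

# Machine processing: load the text back and pull out the code cells.
loaded = json.loads(text)
code_cells = [c["source"] for c in loaded["cells"] if c["cell_type"] == "code"]


def to_markdown(nb):
    """Convert the toy notebook to Markdown (hypothetical helper).

    Code cells become indented code blocks; other cells pass through.
    """
    parts = []
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            lines = cell["source"].splitlines()
            parts.append("\n".join("    " + line for line in lines))
        else:
            parts.append(cell["source"])
    return "\n\n".join(parts)


print(code_cells)
print(to_markdown(loaded))
```

Because the document is plain JSON text, the same file can be diffed, fixed by hand in an editor, or fed to a converter for another output format.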
(Incomplete) Cast of Characters: Brian Granger - Physics, Cal State San Luis Obispo; Min Ragan-Kelley - Nuclear Engineering, UC Berkeley; Matthias Bussonnier - Physics, Institut Curie, Paris; Jonathan March - Enthought; Thomas Kluyver - Biology, U. Sheffield; Jörgen Stenarson - Elect. Engineering, Sweden; Paul Ivanov - Neuroscience, UC Berkeley; Robert Kern - Enthought; Evan Patterson - Physics, Caltech/Enthought; Brad Froehle - Mathematics, UC Berkeley; Stefan van der Walt - UC Berkeley; John Hunter - TradeLink Securities, Chicago; Prabhu Ramachandran - Aerospace Engineering, IIT Bombay; Satra Ghosh - MIT Neuroscience; Gaël Varoquaux - Neurospin (Orsay, France); Ville Vainio - CS, Tampere University of Technology, Finland; Barry Wark - Neuroscience, U. Washington; Ondrej Certik - Physics, U Nevada Reno; Darren Dale - Cornell; Justin Riley - MIT; Mark Voorhies - UC San Francisco; Nicholas Rougier - INRIA Nancy Grand Est; Thomas Spura - Fedora project. Many more! (~150 commit authors)
Support. Thank you! Enthought, Austin, TX: Lots! Microsoft: WinHPC support, Visual Studio integration, Azure (thanks to Shahrokh Mortazavi). DoD/DRC Inc: funding through Sept. 2012 (thanks to Jose Unpingco and Chris Kees). NIH: via NiPy grant. NSF: via Sage compmath grant. Google: Summer of Code 2005, 2010. Tech-X Corp., Boulder, CO: parallel/notebook (previous versions).
The executable paper: Titus Brown (MSU), 3/21/12. http://arxiv.org/abs/1203.4802
One day, from code to paper. CU Boulder, 4/3/12. ISME Journal, http://dx.doi.org/10.1038/ismej.2012.123