Fernando Perez
October 27, 2012

Openness, Reproducibility, Interactivity: a Biased View on the Relation between Science and Computing

Slides for my keynote at the Michigan State University Cyber-Infrastructure Days. Note that a good part of the talk consisted of interactive demos of various aspects of IPython.

Transcript

  1. Changes Open Source SciPy IPython Where Next? Openness, Reproducibility, Interactivity: a Biased View on the Relation between Science and Computing

    Fernando Pérez, http://fperez.org, @fperez_org, [email protected]. Helen Wills Neuroscience Institute, UC Berkeley. CI Days, BEACON Center, MSU, East Lansing, October 26, 2012.
  2. Outline

    1. Changes in Science & Computing
    2. Lessons from the Open Source World
    3. Scientific Python
    4. IPython: Interactive Python
    5. Where Next?

    FP (UC Berkeley), Openness, Reproducibility, Interactivity, 10/26/12
  4. Computing: part of the DNA of science

    - Much more than “the third branch” of science
    - An avalanche of experimental quantitative data
      - Biology, genetics, neuroscience, astronomy, climate modeling...
    - All scientists must now do real computing
    - Good computing is now a necessary (though not sufficient!) condition for good science.
    - Computing in science must improve drastically before we can really call it scientific.
  6. A crisis of credibility and real issues

    The Duke clinical trials scandal - Potti/Nevins
    - A compounding of (common and otherwise) data analysis errors.
    - No materials allowing validation/reproduction of results.
    - Lawsuits, resignations, careers destroyed.
    - More importantly: patients were harmed.
    - Major policy reviews and changes: NCI, IOM, ...
    - More: see K. Baggerly’s "starter set" page.

    The Duke situation is more common than we’d like to believe!
    - Begley & Ellis, Nature, 3/28/12: Drug development: Raise standards for preclinical cancer research. 47 out of 53 “landmark papers” could not be replicated.
    - Nature, Feb 2012, Ince et al: The case for open computer programs
      - “The scientific community places more faith in computation than is justified”
      - “anything less than the release of actual source code is an indefensible approach for any scientific results that depend on computation”
  10. What does it take to get reproducible research results? Reproducible research practices!

    - Reproducibility at publication time? It’s already too late.
    - Learn from a community (open source) where reproducibility is an everyday practice (by necessity).
  13. FOSS better than scientific research?

    - FOSS: Free and Open Source Software
    - Public distributed version control: provenance tracking
  14. Versioned science. Git: the tool you didn’t know you needed

    - Reproducibility? Tracking and recreating every step of your work
    - In the software world: it’s called Version Control!
    - Git: an enabling technology. Use version control for everything
      - Paper/grant writing (never get paper_v5_john.tex by email again!): git clone git@server:/my/grant/repo.git; cd repo; make nsf-fastlane
      - Everyday research: track your results
      - Collaboration: synchronize multi-author work.
      - Teaching!
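
One way to read "Everyday research: track your results" is to stamp each saved result with the commit of the code that produced it. The following is a minimal sketch under that assumption, not something shown in the talk; the helper names `current_commit` and `tag_result` and the sample value are hypothetical, and the only Git command relied on is `git rev-parse HEAD`.

```python
import subprocess
from typing import Optional


def current_commit(repo_dir: str = ".") -> Optional[str]:
    """Return the commit hash of HEAD in repo_dir, or None if unavailable."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a repository
        # (or the repository has no commits yet).
        return None
    return completed.stdout.strip()


def tag_result(result: dict, repo_dir: str = ".") -> dict:
    """Return a copy of result annotated with the code version that produced it."""
    tagged = dict(result)
    tagged["git_commit"] = current_commit(repo_dir)
    return tagged


# Hypothetical analysis output; the number is made up for illustration.
record = tag_result({"mean_error": 0.12})
print(record)
```

A hash alone only identifies committed code; a stricter workflow would also refuse to save results when the working tree has uncommitted changes.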
  17. Beyond (Floating Point) Number Crunching

    - Hardware floating point
    - Arbitrary precision integers
    - Rationals
    - Interval arithmetic
    - Symbolic manipulation
    - FORTRAN
    - Extended precision floating point
    - Text processing
    - Databases
    - Graphical user interfaces
    - Web interfaces
    - Hardware control
    - Multi-language integration
    - Data formats: HDF5, XML, ...
  18. NumPy: the Foundation. Modern Array Processing

    - A flexible, efficient, multidimensional array object.
    - Convenient syntax: c = a+b.
    - Math library that operates on arrays: y = sin(k*t).
    - Basic scientific functionality:
      - Linear algebra
      - FFTs
      - Random number generation
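
The points on this slide can be sketched in a few lines of NumPy. The array values, the 2x2 system and the seed below are illustrative assumptions, not taken from the talk; `c = a + b` and `y = sin(k*t)` mirror the slide's own expressions.

```python
import numpy as np

# Convenient syntax: arithmetic on arrays is elementwise.
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
c = a + b

# Math library that operates on whole arrays: one cycle of a sine wave.
k = 2 * np.pi
t = np.linspace(0.0, 1.0, 8, endpoint=False)
y = np.sin(k * t)

# Linear algebra: solve the system A @ x = rhs.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
rhs = np.array([3.0, 5.0])
x = np.linalg.solve(A, rhs)

# FFT: the single sine cycle should dominate frequency bin 1.
spectrum = np.fft.rfft(y)
peak_bin = int(np.argmax(np.abs(spectrum)))

# Random number generation with a fixed seed so the run is repeatable.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)

print(c, x, peak_bin, samples.shape)
```

Fixing the seed is a small instance of the talk's larger theme: the same script yields the same numbers on every run.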
  20. The Lifecycle of a Scientific Idea (schematically)

    1. Individual exploratory work
    2. Collaborative development
    3. Production work (HPC, cloud, parallel)
    4. Publication (with reproducible results!)
    5. Education
    6. Goto 1.

    The Problem with most tools: barriers and discontinuities in workflow in between all the steps.
  22. Pillar #1: An architecture for interactive computing
  23. Pillar #2: the Notebook Format

    - JSON but version control-friendly
    - Easy for machine processing, fixable by hand if need be.
    - Lots of hooks for metadata
    - Not Python-specific (R and Ruby notebooks exist, Julia planned)
    - Produce Markdown, reST, LaTeX, HTML, etc...

    An open format for sharing, publishing and archiving executable computational work
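
To illustrate "easy for machine processing" and "produce Markdown", here is a minimal sketch that reads a notebook as plain JSON and emits Markdown using only the standard library. It assumes the later nbformat 4 layout (a top-level `cells` list), which differs from the format in use at the time of the talk; the sample notebook and the function name `notebook_to_markdown` are hypothetical.

```python
import json

# A tiny hand-written notebook in the nbformat 4 layout (assumed here).
notebook_json = """
{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {},
     "source": ["# Demo\\n", "Some text."]},
    {"cell_type": "code", "metadata": {}, "execution_count": null,
     "outputs": [], "source": ["x = 1 + 1\\n", "print(x)"]}
  ]
}
"""


def notebook_to_markdown(nb: dict) -> str:
    """Render markdown cells as-is and code cells as indented code blocks."""
    parts = []
    for cell in nb["cells"]:
        # "source" is stored as a list of lines (or a single string).
        text = "".join(cell["source"])
        if cell["cell_type"] == "markdown":
            parts.append(text)
        elif cell["cell_type"] == "code":
            # Four-space indentation marks a code block in Markdown.
            parts.append("\n".join("    " + line for line in text.splitlines()))
    return "\n\n".join(parts)


nb = json.loads(notebook_json)
markdown = notebook_to_markdown(nb)
print(markdown)
```

Because the file is ordinary JSON, the same few lines could just as well target reST or HTML; real conversions would normally go through a dedicated tool rather than a hand-rolled function like this one.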
  24. Microsoft Visual Studio 2010 integrated console

    Dino Viehland and Shahrokh Mortazavi (Microsoft); http://pytools.codeplex.com
  25. A vim client to control an IPython kernel/console

    Paul Ivanov (Berkeley), https://github.com/ivanov/vim-ipython
  26. Other projects using IPython

    Scientific
    - EPD: Enthought Python Distribution.
    - Sage: open source mathematics.
    - PyRAF: Space Telescope Science Institute
    - CASA: Nat. Radio Astronomy Observatory
    - Ganga: CERN
    - PyMAD: neutron spectrom., Laue Langevin
    - Sardana: European Synchrotron Radiation
    - ASCEND: eng. modeling (Carnegie Mellon).
    - JModelica: dynamical systems.
    - DASH: Denver Aerosol Sources and Health.
    - Trilinos: Sandia National Lab.
    - DoD: baseline configuration.
    - Mayavi: 3d visualization, Enthought.
    - NiPype: computational pipelines, MIT.
    - PyIMSL Studio, by Visual Numerics.
    - ...

    Web/Other
    - Visual Studio 2010: MS.
    - Django.
    - Turbo Gears.
    - Pylons web framework
    - Zope and Plone CMS.
    - Axon Shell, BBC Kamaelia.
    - Schevo database.
    - Pitz: distributed task/bug tracking.
    - iVR (interactive Virtual Reality).
    - Movable Python (portable Python environment).
    - ...
  27. (Incomplete) Cast of Characters

    - Brian Granger - Physics, Cal State San Luis Obispo
    - Min Ragan-Kelley - Nuclear Engineering, UC Berkeley
    - Matthias Bussonnier - Physics, Institut Curie, Paris
    - Jonathan March - Enthought
    - Thomas Kluyver - Biology, U. Sheffield
    - Jörgen Stenarson - Elect. Engineering, Sweden.
    - Paul Ivanov - Neuroscience, UC Berkeley.
    - Robert Kern - Enthought
    - Evan Patterson - Physics, Caltech/Enthought
    - Brad Froehle - Mathematics, UC Berkeley
    - Stefan van der Walt - UC Berkeley
    - John Hunter - TradeLink Securities, Chicago.
    - Prabhu Ramachandran - Aerospace Engineering, IIT Bombay.
    - Satra Ghosh - MIT Neuroscience
    - Gaël Varoquaux - Neurospin (Orsay, France)
    - Ville Vainio - CS, Tampere University of Technology, Finland
    - Barry Wark - Neuroscience, U. Washington.
    - Ondrej Certik - Physics, U Nevada Reno
    - Darren Dale - Cornell
    - Justin Riley - MIT
    - Mark Voorhies - UC San Francisco
    - Nicholas Rougier - INRIA Nancy Grand Est
    - Thomas Spura - Fedora project
    - Many more! (~150 commit authors)
  28. Support. Thank you!

    - Enthought, Austin, TX: Lots!
    - Microsoft: WinHPC support, Visual Studio integration, Azure (thanks to Shahrokh Mortazavi).
    - DoD/DRC Inc: funding through Sept. 2012 (thanks to Jose Unpingco and Chris Kees).
    - NIH: via NiPy grant
    - NSF: via Sage compmath grant
    - Google: summer of code 2005, 2010.
    - Tech-X Corp., Boulder, CO: Parallel/notebook (previous versions)
  30. The executable paper: Titus Brown (MSU), 3/21/12

    http://arxiv.org/abs/1203.4802
  31. One day, from code to paper.

    CU Boulder, 4/3/12. ISME Journal, http://dx.doi.org/10.1038/ismej.2012.123
  32. Supplementary materials: execute the paper!