Slide 1

Slide 1 text

Changes Open Source SciPy IPython Where Next?
Openness, Reproducibility, Interactivity: a Biased View on the Relation between Science and Computing
Fernando Pérez
http://fperez.org, @fperez_org
[email protected]
Helen Wills Neuroscience Institute, UC Berkeley
CI Days, BEACON Center
MSU, East Lansing
October 26, 2012

Slide 2

Slide 2 text

Outline
1. Changes in Science & Computing
2. Lessons from the Open Source World
3. Scientific Python
4. IPython: Interactive Python
5. Where Next?
FP (UC Berkeley) Openness, Reproducibility, Interactivity 10/26/12 2 / 42

Slide 4

Slide 4 text

Computing: part of the DNA of science
Much more than “the third branch” of science
- An avalanche of experimental quantitative data
- Biology, genetics, neuroscience, astronomy, climate modeling...
- All scientists must now do real computing
Good computing is now a necessary (though not sufficient!) condition for good science.
Computing in science must improve drastically before we can really call it scientific.

Slide 6

Slide 6 text

A crisis of credibility and real issues
The Duke clinical trials scandal - Potti/Nevins
- A compounding of (common and otherwise) data analysis errors.
- No materials allowing validation/reproduction of results.
- Lawsuits, resignations, careers destroyed. More importantly: Patients were harmed.
- Major policy reviews and changes: NCI, IOM, ...
- More: see K. Baggerly’s "starter set" page.
The Duke situation is more common than we’d like to believe!
- Begley & Ellis, Nature, 3/28/12: Drug development: Raise standards for preclinical cancer research.
- 47 out of 53 “landmark papers” could not be replicated.
Nature, Feb 2012, Ince et al: The case for open computer programs
- “The scientific community places more faith in computation than is justified”
- “anything less than the release of actual source code is an indefensible approach for any scientific results that depend on computation”

Slide 10

Slide 10 text

What does it take to get reproducible research results?
- Reproducible research practices!
- Reproducibility at publication time? It’s already too late.
- Learn from a community (open source) where reproducibility is an everyday practice (by necessity)

Slide 13

Slide 13 text

FOSS better than scientific research?
FOSS: Free and Open Source Software
Public distributed version control: provenance tracking

Slide 14

Slide 14 text

Pull requests: ongoing peer review

Slide 15

Slide 15 text

Pull requests: back and forth discussion

Slide 16

Slide 16 text

Branches: exploratory work with control

Slide 17

Slide 17 text

Automated tests: validation
The IPython build Dashboard: immediate feedback
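
As a minimal sketch of what "automated tests as validation" can look like for research code, the example below checks a small numerical helper with Python's standard unittest module. The function normalize and its test cases are hypothetical illustrations, not taken from the IPython test suite.

```python
import unittest


def normalize(values):
    """Scale a list of non-negative numbers so that they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


class NormalizeTest(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([1, 2, 3, 4])), 1.0)

    def test_preserves_ratios(self):
        result = normalize([1, 3])
        self.assertAlmostEqual(result[1] / result[0], 3.0)

    def test_rejects_all_zero_input(self):
        with self.assertRaises(ValueError):
            normalize([0, 0])


# Run the tests programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

A build dashboard of the kind shown on this slide would typically run such a suite on every change and report whether it passed.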

Slide 18

Slide 18 text

Public bug trackers

Slide 19

Slide 19 text

Versioned science
Git: the tool you didn’t know you needed
Reproducibility?
- Tracking and recreating every step of your work
- In the software world: it’s called Version Control!
Git: an enabling technology. Use version control for everything
- Paper/grant writing (never get paper_v5_john.tex by email again!)
    git clone git@server:/my/grant/repo.git
    cd repo
    make nsf-fastlane
- Everyday research: track your results
- Collaboration: synchronize multi-author work.
- Teaching!
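
One possible way to "track your results" in everyday research is to store, next to every computed value, the Git commit of the code that produced it. The sketch below assumes the script runs inside a Git working copy; the helper names current_commit and tag_result are hypothetical, and the code falls back to "unknown" when Git or a repository is not available.

```python
import json
import subprocess
from datetime import datetime, timezone


def current_commit():
    """Return the current Git commit hash, or "unknown" outside a repository."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or this directory is not a repository.
        return "unknown"
    return out.stdout.strip()


def tag_result(value):
    """Bundle a result with the code version and time that produced it."""
    return {
        "value": value,
        "commit": current_commit(),
        "computed_at": datetime.now(timezone.utc).isoformat(),
    }


record = tag_result(3.14159)
print(json.dumps(record, indent=2, sort_keys=True))
```

With a record like this, checking out the stored commit is usually enough to recover the exact code behind a given number.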

Slide 22

Slide 22 text

Beyond (Floating Point) Number Crunching
- Hardware floating point
- Arbitrary precision integers
- Rationals
- Interval arithmetic
- Symbolic manipulation
- FORTRAN
- Extended precision floating point
- Text processing
- Databases
- Graphical user interfaces
- Web interfaces
- Hardware control
- Multi-language integration
- Data formats: HDF5, XML, ...
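
A few of the numeric items in this list can be illustrated with nothing but the Python standard library. The sketch below is only an illustration chosen for this transcript: it covers arbitrary precision integers, rationals, and extended precision (decimal) floating point, and the variable names are arbitrary.

```python
from decimal import Decimal, getcontext
from fractions import Fraction

# Arbitrary-precision integers: Python ints grow as needed and never overflow.
big = 2 ** 200

# Exact rationals: three thirds add up to exactly one.
third = Fraction(1, 3)
exact_sum = third + third + third

# Hardware floating point, for contrast: 0.1 + 0.2 is not exactly 0.3.
float_sum = 0.1 + 0.2

# Extended-precision decimal floating point (50 significant digits here).
getcontext().prec = 50
root_two = Decimal(2).sqrt()

print(big)
print(exact_sum == 1, float_sum == 0.3)
print(root_two)
```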

Slide 23

Slide 23 text

NumPy: the Foundation. Modern Array Processing
- A flexible, efficient, multidimensional array object.
- Convenient syntax: c = a+b.
- Math library that operates on arrays: y = sin(k*t).
- Basic scientific functionality:
  - Linear algebra
  - FFTs
  - Random number generation
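
The features listed on this slide can be sketched in a few lines. The example assumes a reasonably recent NumPy (np.random.default_rng was added after this talk was given); the array values and the 1 Hz test signal are arbitrary choices for illustration.

```python
import numpy as np

# Elementwise arithmetic on whole arrays: c = a + b
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
c = a + b

# Math library operating on arrays: y = sin(k*t)
k = 2.0 * np.pi  # one full cycle over the unit interval
t = np.linspace(0.0, 1.0, 64, endpoint=False)
y = np.sin(k * t)

# Linear algebra: solve the system A x = rhs
A = np.array([[3.0, 1.0], [1.0, 2.0]])
rhs = np.array([9.0, 8.0])
x = np.linalg.solve(A, rhs)

# FFT: the magnitude spectrum of y peaks at the bin matching its frequency
spectrum = np.abs(np.fft.rfft(y))
peak_bin = int(np.argmax(spectrum))

# Random number generation with a fixed seed, so that runs are repeatable
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)

print(c, x, peak_bin, samples.mean())
```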

Slide 24

Slide 24 text

Scientific Python: a Rich Ecosystem
IPython, NetworkX

Slide 26

Slide 26 text

The Lifecycle of a Scientific Idea (schematically)
1. Individual exploratory work
2. Collaborative development
3. Production work (HPC, cloud, parallel)
4. Publication (with reproducible results!)
5. Education
6. Goto 1.
The Problem with most tools
Barriers and discontinuities in workflow in between all the steps

Slide 28

Slide 28 text

IPython’s goal: Fluid transitions in all these steps

Slide 29

Slide 29 text

Demo

Slide 30

Slide 30 text

Pillar #1: An architecture for interactive computing

Slide 31

Slide 31 text

Pillar #2: the Notebook Format
- JSON but version control-friendly
- Easy for machine processing, fixable by hand if need be.
- Lots of hooks for metadata
- Not Python-specific (R and Ruby notebooks exist, Julia planned)
- Produce Markdown, reST, LaTeX, HTML, etc...
An open format for sharing, publishing and archiving executable computational work
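
To make "JSON but version control-friendly" and "easy for machine processing" concrete, here is a rough sketch using only the standard json module. The document structure and key names below are a simplified, hypothetical stand-in and not the actual notebook schema, which has more fields and varies between format versions.

```python
import json

# Hypothetical, simplified notebook-like document (NOT the real schema).
notebook = {
    "metadata": {"name": "example", "language": "python"},
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some prose.\n"]},
        {
            "cell_type": "code",
            "source": ["x = 1 + 1\n", "print(x)\n"],
            "metadata": {"tags": ["demo"]},
        },
    ],
}


def dump_diff_friendly(doc):
    """Serialize with stable key order and one item per line, so diffs stay small."""
    return json.dumps(doc, indent=1, sort_keys=True) + "\n"


def code_sources(doc):
    """Extract the source text of every code cell."""
    return [
        "".join(cell["source"])
        for cell in doc["cells"]
        if cell["cell_type"] == "code"
    ]


text = dump_diff_friendly(notebook)
roundtrip = json.loads(text)
print(code_sources(roundtrip)[0])
```

Storing source as a list of lines and writing keys in a fixed order means that editing one line of a cell changes one line of the file, which is what makes such a format pleasant to keep under version control.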

Slide 32

Slide 32 text

Text console with visualization

Slide 33

Slide 33 text

Qt console: inline plots, html, multiline editing, ...
Evan Patterson (Enthought)

Slide 34

Slide 34 text

Browser-based notebook
Brian Granger, James Gao (Berkeley), rest of the team

Slide 35

Slide 35 text

High-level parallel computing
Min Ragan-Kelley, Brian Granger

Slide 36

Slide 36 text

Documented protocols and formats: a growing ecosystem around IPython

Slide 37

Slide 37 text

Microsoft Visual Studio 2010 integrated console
Dino Viehland and Shahrokh Mortazavi (Microsoft); http://pytools.codeplex.com

Slide 38

Slide 38 text

A vim client to control an IPython kernel/console
Paul Ivanov (Berkeley), https://github.com/ivanov/vim-ipython

Slide 39

Slide 39 text

Notebooks on Windows Azure Cloud
Shahrokh Mortazavi (Microsoft), B.G., F.P.: http://bit.ly/JQeojD.

Slide 40

Slide 40 text

Star Cluster: IPython parallel+Notebook on Amazon EC2
Justin Riley (MIT): http://web.mit.edu/star/cluster

Slide 41

Slide 41 text

One-click single notebook on Amazon EC2
Carl Smith (UK): https://notebookcloud.appspot.com.

Slide 42

Slide 42 text

Other projects using IPython
Scientific
- EPD: Enthought Python Distribution.
- Sage: open source mathematics.
- PyRAF: Space Telescope Science Institute
- CASA: Nat. Radio Astronomy Observatory
- Ganga: CERN
- PyMAD: neutron spectrom., Laue Langevin
- Sardana: European Synchrotron Radiation
- ASCEND: eng. modeling (Carnegie Mellon).
- JModelica: dynamical systems.
- DASH: Denver Aerosol Sources and Health.
- Trilinos: Sandia National Lab.
- DoD: baseline configuration.
- Mayavi: 3d visualization, Enthought.
- NiPype: computational pipelines, MIT.
- PyIMSL Studio, by Visual Numerics.
- ...
Web/Other
- Visual Studio 2010: MS.
- Django.
- Turbo Gears.
- Pylons web framework
- Zope and Plone CMS.
- Axon Shell, BBC Kamaelia.
- Schevo database.
- Pitz: distributed task/bug tracking.
- iVR (interactive Virtual Reality).
- Movable Python (portable Python environment).
- ...

Slide 43

Slide 43 text

(Incomplete) Cast of Characters
- Brian Granger - Physics, Cal State San Luis Obispo
- Min Ragan-Kelley - Nuclear Engineering, UC Berkeley
- Matthias Bussonnier - Physics, Institut Curie, Paris
- Jonathan March - Enthought
- Thomas Kluyver - Biology, U. Sheffield
- Jörgen Stenarson - Elect. Engineering, Sweden.
- Paul Ivanov - Neuroscience, UC Berkeley.
- Robert Kern - Enthought
- Evan Patterson - Physics, Caltech/Enthought
- Brad Froehle - Mathematics, UC Berkeley
- Stefan van der Walt - UC Berkeley
- John Hunter - TradeLink Securities, Chicago.
- Prabhu Ramachandran - Aerospace Engineering, IIT Bombay.
- Satra Ghosh - MIT Neuroscience
- Gaël Varoquaux - Neurospin (Orsay, France)
- Ville Vainio - CS, Tampere University of Technology, Finland
- Barry Wark - Neuroscience, U. Washington.
- Ondrej Certik - Physics, U Nevada Reno
- Darren Dale - Cornell
- Justin Riley - MIT
- Mark Voorhies - UC San Francisco
- Nicholas Rougier - INRIA Nancy Grand Est
- Thomas Spura - Fedora project
Many more! (~150 commit authors)

Slide 44

Slide 44 text

Support
Thank you!
- Enthought, Austin, TX: Lots!
- Microsoft: WinHPC support, Visual Studio integration, Azure (thanks to Shahrokh Mortazavi).
- DoD/DRC Inc: funding through Sept. 2012 (thanks to Jose Unpingco and Chris Kees).
- NIH: via NiPy grant
- NSF: via Sage compmath grant
- Google: summer of code 2005, 2010.
- Tech-X Corp., Boulder, CO: Parallel/notebook (previous versions)

Slide 46

Slide 46 text

The executable paper: Titus Brown (MSU), 3/21/12
http://arxiv.org/abs/1203.4802

Slide 47

Slide 47 text

One day, from code to paper. CU Boulder, 4/3/12
ISME Journal, http://dx.doi.org/10.1038/ismej.2012.123

Slide 48

Slide 48 text

Supplementary materials: execute the paper!

Slide 49

Slide 49 text

Plug: Reproducibility in Computational and Experimental Mathematics
December 10-14, 2012, Brown, Rhode Island.
http://icerm.brown.edu/tw12-5-rcem