Openness, Reproducibility, Interactivity: a Biased View on the Relation between Science and Computing

Fernando Perez
October 27, 2012

Slides for my keynote at the Michigan State University Cyber-Infrastructure Days. Note that a good part of the talk consisted of interactive demos of various aspects of IPython.

Transcript

  1. Changes Open Source SciPy IPython Where Next?
    Openness, Reproducibility, Interactivity:
    a Biased View on the Relation
    between Science and Computing
    Fernando Pérez
    http://fperez.org, @fperez_org
    [email protected]
    Helen Wills Neuroscience Institute, UC Berkeley
    CI Days, BEACON Center
    MSU, East Lansing
    October 26, 2012

  2. Outline
    1 Changes in Science & Computing
    2 Lessons from the Open Source World
    3 Scientific Python
    4 IPython: Interactive Python
    5 Where Next?
    FP (UC Berkeley) Openness, Reproducibility, Interactivity 10/26/12 2 / 42

  4. Computing: part of the DNA of science
    Much more than “the third branch” of science
    An avalanche of experimental quantitative data
    Biology, genetics, neuroscience, astronomy, climate modeling...
    All scientists must now do real computing
    Good computing is now a necessary (though not sufficient!) condition
    for good science.
    Computing in science must improve drastically before we can really
    call it scientific.

  6. A crisis of credibility and real issues
    The Duke clinical trials scandal - Potti/Nevins
    A compounding of (common and otherwise) data analysis errors.
    No materials allowing validation/reproduction of results.
    Lawsuits, resignations, careers destroyed.
    More importantly: Patients were harmed.
    Major policy reviews and changes: NCI, IOM, ...
    More: see K. Baggerly’s "starter set" page.
    The Duke situation is more common than we’d like to believe!
    Begley & Ellis, Nature, 3/28/12: Drug development: Raise standards
    for preclinical cancer research.
    47 out of 53 “landmark papers” could not be replicated.
    Nature, Feb 2012, Ince et al: The case for open computer programs
    “The scientific community places more faith in computation than is
    justified”
    “anything less than the release of actual source code is an indefensible
    approach for any scientific results that depend on computation”

  10. What does it take to get reproducible research results?
    Reproducible research practices!
    Reproducibility at publication time?
    It’s already too late.
    Learn from a community (open source) where
    reproducibility is an everyday practice
    (by necessity)

  13. FOSS better than scientific research?
    FOSS: Free and Open Source Software
    Public distributed version control: provenance tracking

  14. Pull requests: ongoing peer review

  15. Pull requests: back and forth discussion

  16. Branches: exploratory work with control

  17. Automated tests: validation
    The IPython build Dashboard: immediate feedback
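
As a rough illustration of automated tests acting as validation, here is a minimal sketch using Python's standard `unittest` module. The `trapezoid` function and the test names are hypothetical examples, not code from IPython or from the talk; the idea is only that each numerical routine ships with checks against known answers, which a build dashboard can then run on every change.

```python
import math
import unittest


def trapezoid(f, a, b, n=1000):
    """Integrate f over [a, b] with the composite trapezoid rule (n panels)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


class TestTrapezoid(unittest.TestCase):
    def test_linear_is_exact(self):
        # The trapezoid rule is exact for linear integrands: int_0^1 2x dx = 1.
        self.assertAlmostEqual(trapezoid(lambda x: 2 * x, 0.0, 1.0), 1.0)

    def test_sine_over_half_period(self):
        # Known analytic answer: int_0^pi sin(x) dx = 2.
        self.assertAlmostEqual(trapezoid(math.sin, 0.0, math.pi), 2.0, places=5)


def run_tests():
    """Run the test case programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTrapezoid)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

Running the file executes both checks; a continuous-integration service would do the same on every commit.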

  18. Public bug trackers

  19. Versioned science
    Git: the tool you didn’t know you needed
    Reproducibility?
    Tracking and recreating every step of your work
    In the software world: it’s called Version Control!
    Git: an enabling technology. Use version control for everything
    Paper/grant writing (never get paper_v5_john.tex by email again!)
    git clone [email protected]:/my/grant/repo.git
    cd repo
    make nsf-fastlane
    Everyday research: track your results
    Collaboration: synchronize multi-author work.
    Teaching!
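
One possible way to "track your results" in everyday research is to store the Git commit hash next to every computed result, so the exact code version can be checked out again later. The sketch below assumes the analysis runs inside a Git working copy; the function names and the record layout are hypothetical, and only `git rev-parse HEAD` is an actual Git command.

```python
import json
import subprocess


def current_commit(repo_dir="."):
    """Return the current Git commit hash, or None if it cannot be determined."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or repo_dir is not a repository with commits.
        return None
    return out.stdout.strip()


def provenance_record(params, result, commit):
    """Bundle a result with the parameters and code version that produced it."""
    return {"commit": commit, "params": params, "result": result}


def save_record(record, path):
    """Write the record as sorted, indented JSON so that diffs stay small."""
    with open(path, "w") as fh:
        json.dump(record, fh, indent=1, sort_keys=True)
        fh.write("\n")


if __name__ == "__main__":
    record = provenance_record({"n_samples": 100}, 0.42, current_commit())
    save_record(record, "result.json")
```

The saved file can itself be committed, which ties each number in a paper back to a specific revision of the code.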

  22. Beyond (Floating Point) Number Crunching
    Hardware floating point
    Arbitrary precision integers
    Rationals
    Interval arithmetic
    Symbolic manipulation
    FORTRAN
    Extended precision floating point
    Text processing
    Databases
    Graphical user interfaces
    Web interfaces
    Hardware control
    Multi-language integration
    Data formats: HDF5, XML, ...
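
A few of the items above (arbitrary precision integers, rationals, extended precision) can be tried with nothing but the Python standard library. This is a small sketch for illustration, not material from the talk; the variable names are arbitrary.

```python
from decimal import Decimal, getcontext
from fractions import Fraction

# Arbitrary precision integers: Python ints never overflow.
big = 2 ** 200

# Hardware floating point: 0.1 and 0.2 are not exactly representable.
float_sum = 0.1 + 0.2
float_is_exact = float_sum == 0.3

# Rationals: exact arithmetic on fractions.
rational_sum = Fraction(1, 10) + Fraction(2, 10)
rational_is_exact = rational_sum == Fraction(3, 10)

# Extended precision decimal floating point: 50 significant digits.
getcontext().prec = 50
one_seventh = Decimal(1) / Decimal(7)

if __name__ == "__main__":
    print(big)
    print(float_sum, float_is_exact)
    print(rational_sum, rational_is_exact)
    print(one_seventh)
```

Symbolic manipulation and interval arithmetic need third-party packages and are left out of this sketch.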

  23. NumPy: the Foundation. Modern Array Processing
    A flexible, efficient, multidimensional array object.
    Convenient syntax: c = a+b.
    Math library that operates on arrays: y = sin(k*t).
    Basic scientific functionality:
    Linear algebra
    FFTs
    Random number generation
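
The features listed on this slide can be sketched in a few lines. The snippet assumes NumPy is installed and imported as `np`; the specific arrays and values are made up for illustration.

```python
import numpy as np

# Convenient syntax: elementwise arithmetic on whole arrays.
a = np.arange(6.0).reshape(2, 3)
b = np.ones((2, 3))
c = a + b

# Math library that operates on arrays: y = sin(k*t).
t = np.linspace(0.0, 1.0, 64, endpoint=False)
k = 2 * np.pi * 4  # four full cycles over the interval [0, 1)
y = np.sin(k * t)

# Linear algebra: solve the 2x2 system A x = rhs.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
rhs = np.array([9.0, 8.0])
x = np.linalg.solve(A, rhs)

# FFTs: the spectrum of y peaks at the bin matching its frequency.
spectrum = np.abs(np.fft.rfft(y))
peak_bin = int(np.argmax(spectrum))

# Random number generation with a seeded generator for repeatability.
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)

if __name__ == "__main__":
    print(c)
    print(x)
    print(peak_bin)
    print(samples.mean())
```

Seeding the generator is a small reproducibility habit in its own right: the same seed yields the same samples on every run with the same NumPy version.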

  24. Scientific Python: a Rich Ecosystem
    IPython
    NetworkX

  26. The Lifecycle of a Scientific Idea (schematically)
    1 Individual exploratory work
    2 Collaborative development
    3 Production work (HPC, cloud, parallel)
    4 Publication (with reproducible results!)
    5 Education
    6 Goto 1.
    The Problem with most tools
    Barriers and discontinuities in the workflow between all of these steps

  28. IPython’s goal:
    Fluid transitions in all these steps

  29. Demo

  30. Pillar #1: An architecture for interactive computing

  31. Pillar #2: the Notebook Format
    JSON but version control-friendly
    Easy for machine processing, fixable by hand if need be.
    Lots of hooks for metadata
    Not Python-specific (R and Ruby notebooks exist, Julia planned)
    Produce Markdown, reST, LaTeX, HTML, etc...
    An open format for sharing, publishing and
    archiving executable computational work
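
To make the "JSON but version control-friendly" idea concrete, here is a minimal sketch that builds a notebook-like document as plain Python dictionaries and serializes it deterministically. The field names are an assumption modeled on the later nbformat 4 layout (`cells`, `metadata`, `nbformat`), not necessarily the format in use at the time of the talk, and the result is not validated against any official schema. The helper names are hypothetical.

```python
import json


def markdown_cell(text):
    """A text cell; 'metadata' is one of the hooks for extra information."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """A code cell with no stored outputs (assumed nbformat-4-style keys)."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


def make_notebook(cells, metadata=None):
    """Assemble the top-level document; version numbers are assumptions."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": metadata or {},
        "cells": cells,
    }


def serialize(nb):
    """Sorted keys and one item per line keep diffs small and hand-editable."""
    return json.dumps(nb, indent=1, sort_keys=True) + "\n"


if __name__ == "__main__":
    nb = make_notebook(
        [markdown_cell("# Analysis"), code_cell("print(1 + 1)")],
        metadata={"language": "python"},
    )
    print(serialize(nb))
```

Because the output is ordinary JSON, any language can read or generate it, which is what makes the format usable beyond Python.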

  32. Text console with visualization

  33. Qt console: inline plots, html, multiline editing, ...
    Evan Patterson (Enthought)

  34. Browser-based notebook
    Brian Granger, James Gao (Berkeley), rest of the team

  35. High-level parallel computing
    Min Ragan-Kelley, Brian Granger

  36. Documented protocols and formats:
    a growing ecosystem around IPython

  37. Microsoft Visual Studio 2010 integrated console
    Dino Viehland and Shahrokh Mortazavi (Microsoft); http://pytools.codeplex.com

  38. A vim client to control an IPython kernel/console
    Paul Ivanov (Berkeley), https://github.com/ivanov/vim-ipython

  39. Notebooks on Windows Azure Cloud
    Shahrokh Mortazavi (Microsoft), B.G., F.P.: http://bit.ly/JQeojD.

  40. Star Cluster: IPython parallel+Notebook on Amazon EC2
    Justin Riley (MIT): http://web.mit.edu/star/cluster

  41. One-click single notebook on Amazon EC2
    Carl Smith (UK): https://notebookcloud.appspot.com.

  42. Other projects using IPython
    Scientific
    EPD: Enthought Python Distribution.
    Sage: open source mathematics.
    PyRAF: Space Telescope Science Institute
    CASA: Nat. Radio Astronomy Observatory
    Ganga: CERN
    PyMAD: neutron spectrom., Laue Langevin
    Sardana: European Synchrotron Radiation
    ASCEND: eng. modeling (Carnegie Mellon).
    JModelica: dynamical systems.
    DASH: Denver Aerosol Sources and Health.
    Trilinos: Sandia National Lab.
    DoD: baseline configuration.
    Mayavi: 3d visualization, Enthought.
    NiPype: computational pipelines, MIT.
    PyIMSL Studio, by Visual Numerics.
    ...
    Web/Other
    Visual Studio 2010: MS.
    Django.
    TurboGears.
    Pylons web framework
    Zope and Plone CMS.
    Axon Shell, BBC
    Kamaelia.
    Schevo database.
    Pitz: distributed task/bug tracking.
    iVR (interactive Virtual Reality).
    Movable Python (portable Python environment).
    ...

  43. (Incomplete) Cast of Characters
    Brian Granger - Physics, Cal State San Luis Obispo
    Min Ragan-Kelley - Nuclear Engineering, UC Berkeley
    Matthias Bussonnier - Physics, Institut Curie, Paris
    Jonathan March - Enthought
    Thomas Kluyver - Biology, U. Sheffield
    Jörgen Stenarson - Elect. Engineering, Sweden
    Paul Ivanov - Neuroscience, UC Berkeley
    Robert Kern - Enthought
    Evan Patterson - Physics, Caltech/Enthought
    Brad Froehle - Mathematics, UC Berkeley
    Stefan van der Walt - UC Berkeley
    John Hunter - TradeLink Securities, Chicago
    Prabhu Ramachandran - Aerospace Engineering, IIT Bombay
    Satra Ghosh - MIT Neuroscience
    Gaël Varoquaux - Neurospin (Orsay, France)
    Ville Vainio - CS, Tampere University of Technology, Finland
    Barry Wark - Neuroscience, U. Washington
    Ondrej Certik - Physics, U Nevada Reno
    Darren Dale - Cornell
    Justin Riley - MIT
    Mark Voorhies - UC San Francisco
    Nicholas Rougier - INRIA Nancy Grand Est
    Thomas Spura - Fedora project
    Many more! (~150 commit authors)

  44. Support
    Thank you!
    Enthought, Austin, TX: Lots!
    Microsoft: WinHPC support, Visual Studio integration, Azure
    (thanks to Shahrokh Mortazavi).
    DoD/DRC Inc: funding through Sept. 2012 (thanks to Jose
    Unpingco and Chris Kees).
    NIH: via NiPy grant
    NSF: via Sage compmath grant
    Google: Summer of Code 2005, 2010.
    Tech-X Corp., Boulder, CO: Parallel/notebook (previous versions)

  46. The executable paper: Titus Brown (MSU), 3/21/12
    http://arxiv.org/abs/1203.4802

  47. One day, from code to paper. CU Boulder, 4/3/12
    ISME Journal, http://dx.doi.org/10.1038/ismej.2012.123

  48. Supplementary materials: execute the paper!

  49. Plug: Reproducibility
    in Computational and Experimental Mathematics
    December 10-14 2012, Brown, Rhode Island. http://icerm.brown.edu/tw12-5-rcem
