Scientific Open Source Software: meat and bits but not papers. Is it real work?

Fernando Perez
October 08, 2019

A discussion on the role of open source software in science, its sustainability and current outlook.

Co-authored with Lindsey Heagy (https://lindseyjh.ca).

Video of the presentation available at: https://cdac.uchicago.edu/insights/fernando-perez-scientific-open-source-software

Transcript

  1. Fernando Pérez
    Lindsey Heagy
    Scientific Open Source Software: meat
    and bits but not papers. Is it real work?

  2. A really odd career
    Physics PhD: Lattice QCD
    Simulations
    Applied Math Postdoc:
    numerical algorithms

  5. –Hamming'62
    “The purpose of computing is insight,
    not numbers”

  6. Maslow’s hierarchy of OSS
    Services and content
    Software
    Standards and Protocols
    Community

  7. Services/Software

  8. JupyterLab: a grand unified theory of Jupyter
    Huge Team Effort!
    C. Colbert, S. Corlay, A. Darian, B. Granger, J. Grout, P.
    Ivanov, I. Rose, S. Silvester, C. Willing, J. Zosa-Forde …

  9. Standards and Protocols

  10. Core ideas of the web: HTTP & HTML
    HTML: format to represent content
    HyperText Markup Language
    HTTP: protocol to connect clients and servers
    HyperText Transfer Protocol
    Image credit: eviltester.com

  11. Core ideas of Jupyter
    Document Format
    https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
    Interactive Computing Protocol
    [Diagram: clients (SUB and DEALER sockets) connected to a kernel (PUB and ROUTER sockets)]
    ØMQ + JSON
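
    To make the two ideas on this slide concrete, here is a minimal sketch in Python (standard library only) of a notebook document and of one message in the interactive computing protocol. The field names are intended to follow the public nbformat 4 and Jupyter messaging specifications; treat that as an assumption to check against those documents. The helper names `make_notebook` and `make_execute_request`, the username, and the session id are hypothetical. A real client would also sign the message and send it over ØMQ, which this sketch leaves out.

```python
import json
import uuid
from datetime import datetime, timezone


def make_notebook(source):
    """Build a minimal notebook document (nbformat 4 layout) with one code cell.

    Hypothetical helper for illustration; real tools use the nbformat library.
    """
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,  # not executed yet
                "metadata": {},
                "outputs": [],
                "source": source,
            }
        ],
    }


def make_execute_request(code, session):
    """Build an execute_request message as plain JSON-serializable data.

    Hypothetical helper: it only assembles the four dictionaries of a
    message (header, parent_header, metadata, content) and does no I/O.
    """
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "session": session,
            "username": "demo-user",  # hypothetical value
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",
            "version": "5.3",
        },
        "parent_header": {},  # empty: this message starts a new exchange
        "metadata": {},
        "content": {
            "code": code,
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
            "stop_on_error": True,
        },
    }


if __name__ == "__main__":
    notebook = make_notebook("print('hello')")
    message = make_execute_request("print('hello')", session="demo-session")
    # Both structures are ordinary JSON, which is what makes them language agnostic.
    print(json.dumps(notebook, indent=2))
    print(json.dumps(message, indent=2))
```

    Because both structures are plain JSON, any language with a JSON library and ØMQ bindings can read the document or speak the protocol, which is the property the following slides rely on.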

  12. A language agnostic protocol

  15. A language agnostic protocol
    ~100 different kernels: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels

  16. Community

  17. IPython: an afternoon hack, 2001

  18. Plus ~1,500 more open source contributors!
    A true team effort

  19. Formalized governance
    Formal fiscal sponsorship
    Brian Granger
    Cal Poly, Amazon
    Me :)

  20. Educational impact:
    a view from Berkeley

  21. Fall 2018
    Data 100: ~800 students
    Data 8: ~1,300 students

  22. With these tools, we provide:
    ❖ Broad disciplinary reach and impact of statistical
    thinking.
    ❖ Drastically lowered barriers to student access -
    intellectual and economic.
    ❖ Lowered barriers for faculty* to engage with
    statistical and computational ideas.
    (*) Typically from non-computational/statistical
    domains.
    Organizational and intellectual leadership:
    Cathryn Carson, Ani Adhikari, John DeNero, … (many more)

  23. Not just teaching toys:
    real research tools

  24. Reproducible Research
    An article about computational science in a scientific
    publication is not the scholarship itself, it is merely
    advertising of the scholarship. The actual scholarship is the
    complete software development environment and the
    complete set of instructions which generated the figures.
    Buckheit and Donoho, WaveLab and Reproducible Research,
    1995

  25. mybinder.org: shareable reproducibility
    github.com/freeman-lab
    Explicit Dependencies
    Origins: Jeremy Freeman’s lab at Janelia Farm.
    That “incentives” business…

  26. LIGO: September 14, 2015

  27. The song of the universe
    http://bit.ly/black-holes-woop

  30. April 18/19, 2019: Shep Doeleman & Katie Bouman

  31. Geosciences: research & education
    Lindsey Heagy, Berkeley
    2019 GWH Career Achievement
    Award for outstanding junior scientist
    SimPEG: https://simpeg.xyz http://geosci.xyz

  32. Pangeo: open geosciences (and more!)
    Harnessing the power of cloud
    computing to study the whole
    earth interactively.
    https://pangeo.io
    Ryan Abernathey Joe Hamman

  35. Jupyter meets the Earth: newly funded NSF grant - $2M/3y
    ● CMIP6 Climate data analysis
    ● Large scale hydrological modelling
    ● Geophysical simulations and
    inversions
    ● Data discovery through JupyterLab
    ● Interactivity: Widgets & Dashboards
    ● JupyterHub: Using and managing
    shared computational infrastructure
    Fernando Perez, Joe Hamman, Laurel Larsen, Kevin Paul, Lindsey Heagy, Chris Holdgraf, Yuvi Panda
    Research use-cases Tech developments

  37. Jupyter - funding and resources

  38. So you want to build Data Science tools
    in academia…

  39. Career paths?

  40. John Hunter
    Pediatric
    Neurology

  43. Scientific Open Source: Despite (direct) federal $$ support
    ❖ “Indirectly”, lots of $ have supported Scientific OSS
    projects/tools.
    ❖ Under the cover of domain-focused work.

  44. Traditional software
    infrastructure
    funding
    Yes, it’s true, the budget is gone
    again… But you can’t deny that now,
    we get here in an instant!
    Quino (Argentinian cartoonist)

  45. Contrasts in culture and incentives
    Aspect | Open Source | Academia
    Credit | Distributed | PI & hierarchy
    Output/artifacts | Continuous & project-specific | Discrete papers
    Collaborators | Fluid: professionals, volunteers, … | Structured, funding-dependent
    Governance/decision making | Open, community based | Top-down, PI
    Authorship | Fluid, roles can evolve, no clear “first/senior” author | Need to say more?
    Peer review | Continuous, open, pervasive, friendly | The opposite
    Value metric | Utility, need, impact | “Novel and transformative”

  46. “The Stack”: a complete ecosystem

  48. “The Stack”: a complete ecosystem
    Domain-agnostic backbone/trunk

  49. “The Stack”: a complete ecosystem
    Domain-agnostic backbone/trunk
    • Not “real CS”
    • Not “real research”
    • Nobody’s problem
    • Yet critical to everybody else

  50. Organizations that fill current gaps

  51. Skills in education
    The Carpentries
    Tracy Teal
    Executive Director

  52. Skills in education
    The Carpentries
    Tracy Teal
    Executive Director
    The Society of Research Software Engineering was
    founded on the belief that a world which relies on
    software must recognise the people who develop it.
    https://society-rse.org
    The Society of Research Software Engineering
    Career paths

  53. Open communities & industry
    Leah Silen
    Executive Director
    Andy Terrel
    President

  54. leadership, management, organization building

  55. JOSS/JOSE: academic publication credit
    Arfon Smith
    STScI, Baltimore
    Lorena Barba
    GWU

  56. HackWeeks: training and hacking meet domain research
    Training?

    From undergrads to senior PIs
    In the same room!!

  61. HackWeeks - a reproducible model
    https://doi.org/10.1073/pnas.1717196115
    Daniela Huppenkothen
    http://huppenkothen.org/

  62. An economic and organizational problem

  63. Catastrophic Success: an economic problem
    (2015 data) https://arxiv.org/abs/1507.03989

  64. Catastrophic Success: an economic problem
    (2015 data) https://arxiv.org/abs/1507.03989
    ❖ MathWorks: 4,000+ employees
    ❖ Wolfram: 800 employees
    ❖ IDL/Harris: 17,000 employees

  65. Investing to hedge strategic risks
    ❖ It takes investment to have
    a seat at the table.
    ❖ Scientists (and their
    funders) want a voice?
    ❖ The code is already out -
    whose voices will shape it?

  66. Bang for the buck?
    ❖ Federal 2018 R&D budget: $176.8B (AAAS analysis)
    ❖ What fraction of R&D today depends critically on
    computing? 10%? 30%? 50%?
    ❖ $200M is ~0.1% of that.
    ❖ $200M annually (well spent) would have major
    impact.
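
    The arithmetic behind the ~0.1% figure can be checked with a short Python sketch. The two dollar amounts come from the slide. The function name `percent_of` is hypothetical, and the 10%/30%/50% scenarios simply reuse the slide's own open question as illustrative assumptions, not as measured data.

```python
FEDERAL_RD_BUDGET_2018 = 176.8e9  # USD, figure quoted on the slide (AAAS analysis)
PROPOSED_ANNUAL_SPEND = 200e6     # USD, figure quoted on the slide


def percent_of(amount, total):
    """Return amount expressed as a percentage of total."""
    return 100.0 * amount / total


# Share of the whole federal R&D budget.
share_of_all_rd = percent_of(PROPOSED_ANNUAL_SPEND, FEDERAL_RD_BUDGET_2018)
print(f"Share of total federal R&D: {share_of_all_rd:.2f}%")  # prints 0.11%

# Illustrative scenarios only: the slide asks what fraction of R&D
# depends critically on computing without giving an answer.
for fraction in (0.10, 0.30, 0.50):
    computing_dependent = fraction * FEDERAL_RD_BUDGET_2018
    share = percent_of(PROPOSED_ANNUAL_SPEND, computing_dependent)
    print(f"If {fraction:.0%} of R&D depends on computing: "
          f"{share:.2f}% of that portion")
```

    Even under the most conservative scenario, the proposed spend stays close to one percent of the computing-dependent portion.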

  67. “Well spent” That should be easy…
    ❖ Some features of successful, resilient projects
    ❖ Broad community engagement
    ❖ Actively managed pipeline for new contributions
    ❖ Capacity for short and long-term planning
    ❖ Writing code is only a small part of the job
    ❖ Treat OSS projects like real, complex organizations

  68. It’s in the air…
    “many projects of immense
    infrastructural importance are
    simultaneously fundamental to multiple
    business models and also chronically
    underfunded”

  69. Ford Foundation report, authored by Nadia Eghbal
    @nayafia

  70. Multi-stakeholder governance
    @nayafia

  71. ❖ Economic incentives and sustainability
    ❖ Governance models
    ❖ Roles and professional career paths
    ❖ Multi-stakeholder organizational structures
    OSS is a lot more than software

  72. Thank you (Bay Area team)
    Current (Berkeley, LBNL, Bloomberg)
    Stacey Dorton, Lindsey Heagy, Chris Holdgraf, Yuvi
    Panda, Ryan Lovett, Shreyas Cholia, Shane Canon,
    Rollin Thomas, Jason Grout
    Former Berkeley
    Min Ragan-Kelley, Paul Ivanov, Thomas Kluyver, M
    Pacer, Matthias Bussonnier, Jessica Hamrick, Ian
    Rose, Jamie Whitacre.

  73. In Memoriam - John Hunter, 1968-2012
