Open and reproducible scientific programming

A lunch talk I gave, based on a recent editorial in Nature, on why computer code used in scientific research should be open and why scientists should be better educated in programming techniques.

Philip Chimento

April 05, 2012

Transcript

  1. OPEN AND REPRODUCIBLE
    SCIENTIFIC PROGRAMMING
    Philip Chimento
    Lunch talk — April 5, 2012

  2. Philip. You wrote a nice paper,
    but I require access to your data
    analysis methods. I am trying to
    reproduce your results. Without
    success so far.
    “mad scientist” by “linsepatron” on Flickr, CC-BY-NC-ND

  3. After all, you don’t want your
    research to be irreproducible,
    and die alone and unloved, like
    Pons and Fleischmann of “cold
    fusion” fame… DO YOU?

  4. Well, I don’t know… my code is
    really messy and I’m not sure if
    I should release it…

  5. I SAID COLD FUSION

    View full-size slide

  6. OK, OK, let’s take a look…

  7. How did I do that again? I think
    you were supposed to run
    “fit_poling_period_data” and then
    “test_efficiency_data”…

  8. Darn, it was
    “generate_data_set” and then
    “remove_bad_data_points”…

  9. Oh, right, you had to go into
    Excel and switch the columns
    around, because OOIBase
    didn’t export the data
    properly.

  10. Okay, that should do it.
    Wait, the results are different.

  11. AAAAAAARGH
    There was a bug in that function
    in Matlab and now they fixed it
    and my results are different??!!

  12. COLD FUSION FAIL

  13. Should journals require all
    computer code to be released as
    supplementary material?
    Should scientists be using non-
    open programs at all?
    Should scientists even be
    programming?

  14. Code release policies
    LENIENT
    “Nature does not require
    authors to make code
    available, but we do expect a
    description detailed enough to
    allow others to write their own
    code to do similar analysis.”
    (Editorial: Devil in the Details, Nature 470, p. 305,
    Feb. 16 2011)
    RIGOROUS
    “To address the growing
    complexity of data and
    analyses, Science is extending
    our data access requirement
    […] to include computer codes
    involved in the creation or
    analysis of data.”
    (Editorial: Making Data Maximally Available, Science
    331, p. 649, Feb. 11 2011)
    This section is a summary of: Ince, Hatton, Graham-Cumming. The case for open computer
    programs. Nature 482, p. 485. (2012).

  15. Why release code?
    Code descriptions are ambiguous. A description
    in natural or mathematical language can be
    turned into any number of different programs.
    [Equation: an integral involving a sinc function; the expression did not survive transcription]
    Adaptive Simpson
    quadrature won’t work on
    that, because it has a pole
    near the real axis. It’ll give
    you a garbage result.
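
The integrand on this slide did not survive transcription, so the sketch below uses a stand-in: a Lorentzian with poles at x = ±iε, close to the real axis. The function names, the value of ε, and the panel counts are all assumptions made for illustration, and it uses the plain composite Simpson rule rather than the adaptive variant the slide mentions. It shows how one description ("integrate with Simpson's rule") can become two programs whose answers disagree badly.

```python
import math

EPS = 1e-3  # assumed distance of the poles (at x = +/- i*EPS) from the real axis


def integrand(x):
    """Stand-in integrand: a narrow Lorentzian peak centred on x = 0."""
    return 1.0 / (x * x + EPS * EPS)


def composite_simpson(f, a, b, n):
    """Composite Simpson's rule on [a, b] with n equal subintervals (n even)."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior nodes alternate between weight 4 (odd i) and weight 2 (even i).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


# Analytic value of the integral over [-1, 1], for comparison.
exact = 2.0 / EPS * math.atan(1.0 / EPS)

# Two programs that both match the description "Simpson's rule on [-1, 1]".
coarse = composite_simpson(integrand, -1.0, 1.0, 10)
fine = composite_simpson(integrand, -1.0, 1.0, 200_000)

print(f"exact  = {exact:.6f}")
print(f"coarse = {coarse:.6f}")
print(f"fine   = {fine:.6f}")
```

On the coarse grid a single node lands on the narrow peak and dominates the sum, so that estimate overshoots the analytic value by a large factor, while the fine grid agrees with it closely. Neither choice is ruled out by the prose description alone.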

  16. Why release code?
    Algorithm descriptions are imperfect, even if
    they could be entirely unambiguous.
    One data set (6 significant figures), processed by nine different commercial
    implementations of the same algorithm, gave nine different results with only
    1 or 2 digits of agreement.
    Hatton, Roberts (1994). How accurate is scientific software? IEEE Trans Softw Eng. 20, 785.
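
A small illustration of how this can happen, which is not taken from the Hatton and Roberts study: two textbook formulas for the sample variance are algebraically identical, yet give different answers in double-precision arithmetic when the data carry a large offset. The data values and function names below are assumptions chosen for the example.

```python
def variance_one_pass(data):
    """Sample variance via sum(x^2) - n*mean^2 (algebraically correct, numerically fragile)."""
    n = len(data)
    total = 0.0
    total_sq = 0.0
    for x in data:
        total += x
        total_sq += x * x
    mean = total / n
    return (total_sq - n * mean * mean) / (n - 1)


def variance_two_pass(data):
    """Sample variance via the sum of squared deviations from the mean."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


# Four readings with a large common offset; the true sample variance is 30.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]

one_pass = variance_one_pass(data)
two_pass = variance_two_pass(data)

print("one-pass:", one_pass)
print("two-pass:", two_pass)
```

The one-pass version subtracts two numbers near 4e18 whose rounding errors are far larger than the true difference, so its result is nowhere near 30. A written description such as "compute the sample variance" does not say which formula was used.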

  17. Why release code?
    Algorithm descriptions are imperfect, even if
    they could be entirely unambiguous.
    Image: “Bulford Dolphin in dry dock” by “Jetset” from Wikimedia Commons, CC-BY-SA
    “These data, however, were
    used by geologists to site
    extremely expensive marine
    drilling rigs”

  18. Why release code?
    “Given enough eyeballs, all bugs are shallow”
    (Eric S. Raymond, The Cathedral and the Bazaar)
    One-third of all software failures
    in a large-scale IBM study first
    occurred only after 5000
    execution-years.
    (Adams, IBM J Res. Develop. 28, p. 2, 1984)

  19. Journals should
    implement categories
    of code openness
    This signals potential
    reproducibility issues
    Full source code | Partial | Marginal | No source code
    By Caroline Madigan for opensource.com, CC-BY-SA

  20. Research funding
    bodies should commission
    R&D on tools that enable code to
    be integrated with data,
    graphs, and article
    text
    By “dmpop”, from sxc.hu

  21. We should educate
    our students on
    reproducibility
    “The Bulls Eye” by “Aaron S.” on Flickr, CC-BY-ND

  22. Should journals require all
    computer code to be released as
    supplementary material?
    Should scientists be using non-
    open programs at all?
    Should scientists even be
    programming?

  23. Ultracal: a case study
    Source: Ophir-Spiricon website

  24. Ultracal: a case study
    What is actually going on when
    you use Ultracal?

  25. YOU LOSE CONTROL
    OF YOUR OWN DATA

  26. Should journals require all
    computer code to be released as
    supplementary material?
    Should scientists be using non-
    open programs at all?
    Should scientists even be
    programming?

  27. Source: Z. Merali, Error, Nature 467, p. 775 (2010). Data from G. Wilson, Software Carpentry.

  28. Scientists are not programmers.
    Image by “djayo”, from sxc.hu

  29. Scientists are not programmers.
    Greg Wilson: "There are terrifying
    statistics showing that almost all of
    what scientists know about coding
    is self-taught. They just don't know
    how bad they are."

  30. Use a version
    control system
    It’s a sort of database that
    records every change you
    make.
    Put your raw data,
    processing code, and other
    primary material into it, to
    keep a record of what you
    did, how, and when.
    “Wooden file cabinet” by “Pptudela” on Wikimedia Commons, CC-BY-SA

  31. Track your process
    Make sure you can always
    reproduce your results from
    your sources automatically.
    Write everything in scripts so
    you don’t do any manipulation
    by hand.
    “tracks” by “hbakkh” on Flickr, CC-BY-NC
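
As a minimal sketch of what "everything in scripts" could look like, the example below replaces a manual column swap in a spreadsheet with code, then filters bad points. The column names, the sample values, and the rule that a negative intensity marks a bad reading are all hypothetical; the real export format of any particular instrument software is not known here.

```python
import csv
import io

# Hypothetical raw export in which the columns come out in the wrong order.
RAW_EXPORT = """intensity,wavelength_nm
0.12,800.0
-1.00,800.5
0.15,801.0
"""


def load_points(text):
    """Parse the export by column name and return (wavelength, intensity) pairs.

    Reading by name makes the column order irrelevant, so no manual swap is needed.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [(float(row["wavelength_nm"]), float(row["intensity"])) for row in reader]


def remove_bad_points(points):
    """Drop readings with negative intensity (assumed here to mark a bad point)."""
    return [(wavelength, intensity) for wavelength, intensity in points if intensity >= 0.0]


def run_pipeline(text):
    """Run every processing step in a fixed order, from raw text to cleaned data."""
    return remove_bad_points(load_points(text))


cleaned = run_pipeline(RAW_EXPORT)
print(cleaned)
```

Because the order of the steps lives in `run_pipeline`, nobody has to remember which function to run first, and rerunning the script on the same raw file repeats exactly the same processing.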

  32. Write testable software
    Don’t rely on validation
    testing.
    Instead, build large codes
    from easily testable chunks,
    and test how they react to
    broken input.
    Get other people to review
    your code.
    “Checklist” by “adesigna” on Flickr, CC-BY-NC-SA
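
One way to read this advice, sketched with a hypothetical helper: keep each calculation in a small function that rejects nonsensical input, and test the rejection paths as well as the normal one. The function name and its validity rules are assumptions for illustration.

```python
import math


def relative_efficiency(output_power, input_power):
    """Return output_power / input_power, refusing values that cannot be real measurements."""
    for name, value in (("output_power", output_power), ("input_power", input_power)):
        if not math.isfinite(value):
            raise ValueError(f"{name} must be finite, got {value!r}")
        if value < 0:
            raise ValueError(f"{name} must be non-negative, got {value!r}")
    if input_power == 0:
        raise ValueError("input_power must be nonzero")
    return output_power / input_power


def raises_value_error(func, *args):
    """Return True if calling func(*args) raises ValueError, else False."""
    try:
        func(*args)
    except ValueError:
        return True
    return False


# Normal input.
assert relative_efficiency(1.0, 4.0) == 0.25

# Broken input: each of these should be rejected rather than silently processed.
assert raises_value_error(relative_efficiency, 1.0, 0.0)
assert raises_value_error(relative_efficiency, float("nan"), 1.0)
assert raises_value_error(relative_efficiency, -1.0, 2.0)
print("all checks passed")
```

A chunk this small can be checked in seconds, and a reviewer can see at a glance which inputs it accepts.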

  33. Encourage sharing
    Make the code that you use in your research
    freely available when possible.

  34. Encourage sharing

  35. Scientists shouldn’t have to be
    programmers.
