Open and reproducible scientific programming

A lunch talk I gave, based on a recent editorial in Nature. Why computer code used in scientific research should be open, and why scientists should be better educated in programming techniques.

Philip Chimento

April 05, 2012

Transcript

  1. Philip. You wrote a nice paper, but I require access to your data analysis methods. I am trying to reproduce your results. Without success so far. (Image: “mad scientist” by “linsepatron” on Flickr, CC-BY-NC-ND)
  2. After all, you don’t want your research to be irreproducible,

    and die alone and unloved, like Pons and Fleischmann of “cold fusion” fame… DO YOU?
  3. Well, I don’t know… my code is really messy and

    I’m not sure if I should release it…
  4. How did I do that again? I think you were

    supposed to run “fit_poling_period_data” and then “test_efficiency_data”…
  5. Oh, right, you had to go into Excel and switch

    the columns around, because OOIBase didn’t export the data properly.
  6. AAAAAAARGH! There was a bug in that function in Matlab, and now they fixed it and my results are different??!!
  7. Should journals require all computer code to be released as

    supplementary material? Should scientists be using non-open programs at all? Should scientists even be programming?
  8. Code release policies

    LENIENT: “Nature does not require authors to make code available, but we do expect a description detailed enough to allow others to write their own code to do similar analysis.” (Editorial: Devil in the Details, Nature 470, p. 305, Feb. 16 2011)

    RIGOROUS: “To address the growing complexity of data and analyses, Science is extending our data access requirement […] to include computer codes involved in the creation or analysis of data.” (Editorial: Making Data Maximally Available, Science 331, p. 649, Feb. 11 2011)

    This section is a summary of: Ince, Hatton, Graham-Cumming. The case for open computer programs. Nature 482, p. 485. (2012).
  9. Why release code? Code descriptions are ambiguous. A description in natural or mathematical language can be turned into any number of different programs.

    [Equation not recoverable from the transcript: an integral from −∞ to ∞ whose integrand involves sinc]

    Adaptive Simpson quadrature won’t work on that, because it has a pole near the real axis. It’ll give you a garbage result.
  10. Why release code? Algorithm descriptions are imperfect, even if they could be entirely unambiguous. One data set with 6 significant figures, run through nine different commercial implementations of the same algorithm, gave nine different results with only 1 or 2 digits of agreement. (Hatton, Roberts (1994). How accurate is scientific software? IEEE Trans Softw Eng. 20, 785.)
  11. Why release code? Algorithm descriptions are imperfect, even if they could be entirely unambiguous. “These data, however, were used by geologists to site extremely expensive marine drilling rigs” (Image: “Bulford Dolphin in dry dock” by “Jetset” from Wikimedia Commons, CC-BY-SA)
  12. Why release code? “Given enough eyeballs, all bugs are shallow”

    (Eric S. Raymond, The Cathedral and the Bazaar) One-third of all software failures in a large-scale IBM study only occurred for the first time after 5000 execution-years. (Adams, IBM J Res. Develop. 28, p. 2, 1984)
  13. Journals should implement categories of code openness. This signals potential reproducibility issues. Categories: full source code, partial, marginal, no source code. (Image by Caroline Madigan for opensource.com, CC-BY-SA)
  14. Research funding bodies should commission R&D on tools that enable code to be integrated with data, graphs, and article text. (Image by “dmpop”, from sxc.hu)
  15. Should journals require all computer code to be released as

    supplementary material? Should scientists be using non-open programs at all? Should scientists even be programming?
  16. Should journals require all computer code to be released as

    supplementary material? Should scientists be using non-open programs at all? Should scientists even be programming?
  17. Source: Z. Merali, Error, Nature 467, p. 775 (2010). Data

    from G. Wilson, Software Carpentry.
  18. Scientists are not programmers. Greg Wilson: "There are terrifying statistics

    showing that almost all of what scientists know about coding is self-taught. They just don't know how bad they are."
  19. Use a version control system. It’s a sort of database that records every change you make. Put your raw data, processing code, and other primary material into it, to keep a record of what you did, how, and when. (Image: “Wooden file cabinet” by “Pptudela” on Wikimedia Commons, CC-BY-SA)
  20. Track your process. Make sure you can always reproduce your results from your sources automatically. Write everything in scripts so you don’t do any manipulation by hand. (Image: “tracks” by “hbakkh” on Flickr, CC-BY-NC)
  21. Write testable software. Don’t rely on validation testing. Instead, build large codes from easily testable chunks, and test how they react to broken input. Get other people to review your code. (Image: “Checklist” by “adesigna” on Flickr, CC-BY-NC-SA)
  22. Encourage sharing. Make the code that you use in your research freely available when possible.
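
A minimal sketch of what slides 20 and 21 recommend, in Python: the column swap that slide 5 describes doing by hand in Excel is written as a small script function, and each piece fails loudly on broken input so it can be tested in isolation. The function names `swap_columns` and `parse_measurements`, the two-column CSV layout, and the sample values are assumptions made for illustration; none of them come from the talk.

```python
import csv
import io


def swap_columns(rows, i, j):
    """Return a copy of rows with columns i and j exchanged.

    Hypothetical helper: stands in for the manual column swap in Excel.
    """
    if i < 0 or j < 0:
        raise ValueError("column indices must be non-negative")
    swapped = []
    for line_no, row in enumerate(rows, start=1):
        if max(i, j) >= len(row):
            raise ValueError(
                f"row {line_no} has {len(row)} columns, "
                f"need at least {max(i, j) + 1}"
            )
        row = list(row)  # copy, so the caller's data is left untouched
        row[i], row[j] = row[j], row[i]
        swapped.append(row)
    return swapped


def parse_measurements(text):
    """Parse two-column CSV text into (x, y) float pairs.

    Blank lines are skipped; anything else that is malformed raises
    ValueError instead of being silently dropped.
    """
    pairs = []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        try:
            pairs.append((float(row[0]), float(row[1])))
        except ValueError as exc:
            raise ValueError(f"line {line_no}: non-numeric value in {row!r}") from exc
    return pairs


if __name__ == "__main__":
    # Made-up sample export with the columns in the wrong order.
    raw = "532.1,1.0\n533.4,2.0\n"
    rows = swap_columns(list(csv.reader(io.StringIO(raw))), 0, 1)
    fixed = "\n".join(",".join(row) for row in rows)
    print(parse_measurements(fixed))
```

Because both functions are pure (text or rows in, values out), they can be checked with a handful of assertions, including deliberately broken input, and the whole script can be committed to version control next to the raw data.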