Open and reproducible scientific programming

A lunch talk I gave, based on a recent editorial in Nature, on why computer code used in scientific research should be open and why scientists should be better educated in programming techniques.

Philip Chimento

April 05, 2012

Transcript

  1. OPEN AND REPRODUCIBLE SCIENTIFIC PROGRAMMING
    Philip Chimento
    Lunch talk, April 5, 2012
  2. Philip. You wrote a nice paper, but I require access to your data analysis methods. I am trying to reproduce your results. Without success so far.
    Image: “mad scientist” by “linsepatron” on Flickr, CC-BY-NC-ND
  3. After all, you don’t want your research to be irreproducible, and die alone and unloved, like Pons and Fleischmann of “cold fusion” fame… DO YOU?
  4. Well, I don’t know… my code is really messy and I’m not sure if I should release it…
  5. I SAID COLD FUSION

  6. OK, OK, let’s take a look…

  7. How did I do that again? I think you were supposed to run “fit_poling_period_data” and then “test_efficiency_data”…
  8. Darn, it was “generate_data_set” and then “remove_bad_data_points”…

  9. Oh, right, you had to go into Excel and switch the columns around, because OOIBase didn’t export the data properly.
  10. Okay, that should do it. Wait, the results are different.

  11. AAAAAAARGH There was a bug in that function in Matlab and now they fixed it and my results are different??!!
  12. COLD FUSION FAIL

  13. Should journals require all computer code to be released as supplementary material?
    Should scientists be using non-open programs at all?
    Should scientists even be programming?
  14. Code release policies
    LENIENT: “Nature does not require authors to make code available, but we do expect a description detailed enough to allow others to write their own code to do similar analysis.” (Editorial: Devil in the Details, Nature 470, p. 305, Feb. 16 2011)
    RIGOROUS: “To address the growing complexity of data and analyses, Science is extending our data access requirement […] to include computer codes involved in the creation or analysis of data.” (Editorial: Making Data Maximally Available, Science 331, p. 649, Feb. 11 2011)
    This section is a summary of: Ince, Hatton, Graham-Cumming. The case for open computer programs. Nature 482, p. 485. (2012).
  15. Why release code? Code descriptions are ambiguous.
    A description in natural or mathematical language can be turned into any number of different programs.
    [Equation: an integral from −∞ to ∞ of an expression involving a sinc function]
    Adaptive Simpson quadrature won’t work on that, because the integrand has a pole near the real axis. It’ll give you a garbage result.
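To make the ambiguity concrete, here is a minimal Python sketch. It does not use the integral from the slide: the integrand `lorentzian` is a hypothetical stand-in whose poles sit at ±i·eps, just off the real axis, and fixed-step composite Simpson stands in for the adaptive variant. Both runs match the description "integrate with Simpson's rule", and they differ only in the step count, a detail a prose description would rarely state.

```python
import math

def lorentzian(x, eps=1e-3):
    # Hypothetical integrand: poles at x = +/- i*eps, very close to the real axis.
    return 1.0 / (x * x + eps * eps)

def exact_integral(a, b, eps=1e-3):
    # Closed form: the antiderivative of 1/(x^2 + eps^2) is atan(x/eps)/eps.
    return (math.atan(b / eps) - math.atan(a / eps)) / eps

def composite_simpson(f, a, b, n):
    # Fixed-step composite Simpson's rule on n equal subintervals (n must be even).
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

exact = exact_integral(-1.0, 1.0)
coarse = composite_simpson(lorentzian, -1.0, 1.0, 100)      # step much wider than the peak
fine = composite_simpson(lorentzian, -1.0, 1.0, 200_000)    # step much narrower than the peak

print(f"exact  = {exact:.6f}")
print(f"coarse = {coarse:.6f}")
print(f"fine   = {fine:.6f}")
```

The coarse run lands far from the closed-form value while the fine run agrees with it closely, even though neither program contains a bug in the usual sense.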
  16. Why release code? Algorithm descriptions are imperfect, even if they could be entirely unambiguous.
    One data set with 6 significant figures, run through nine different commercial implementations of the same algorithm, gave nine different results with only 1 or 2 digits of agreement.
    Hatton, Roberts (1994). How accurate is scientific software? IEEE Trans Softw Eng. 20, 785.
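A toy illustration of how two faithful implementations of one formula can disagree, offered as a sketch and not as a reconstruction of the Hatton and Roberts study (which compared seismic processing packages). Both functions below compute the sample variance; they are algebraically identical but round differently in floating point. The function names and the data values are made up for this example.

```python
def variance_one_pass(values):
    # Textbook shortcut: (sum of squares - (sum)^2 / n) / (n - 1).
    # Subtracts two huge, nearly equal numbers, so rounding error dominates.
    n = len(values)
    total = 0.0
    total_sq = 0.0
    for v in values:
        total += v
        total_sq += v * v
    return (total_sq - total * total / n) / (n - 1)

def variance_two_pass(values):
    # Same quantity, computed from deviations about the mean.
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

# Hypothetical measurements: a large offset with a small spread.
# The deviations from the mean are -6, -3, 3, 6, so the true sample variance is 30.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]

one_pass = variance_one_pass(data)
two_pass = variance_two_pass(data)
print("one-pass:", one_pass)
print("two-pass:", two_pass)
```

A paper that says only "we computed the variance" does not tell a reader which of these two programs was run.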
  17. Why release code? Algorithm descriptions are imperfect, even if they could be entirely unambiguous.
    “These data, however, were used by geologists to site extremely expensive marine drilling rigs”
    Image: “Bulford Dolphin in dry dock” by “Jetset” from Wikimedia Commons, CC-BY-SA
  18. Why release code? “Given enough eyeballs, all bugs are shallow” (Eric S. Raymond, The Cathedral and the Bazaar)
    One-third of all software failures in a large-scale IBM study only occurred for the first time after 5000 execution-years. (Adams, IBM J Res. Develop. 28, p. 2, 1984)
  19. Journals should implement categories of code openness
    This signals potential reproducibility issues.
    Categories: full source code, partial, marginal, no source code.
    Image by Caroline Madigan for opensource.com, CC-BY-SA
  20. Research funding bodies should commission R&D on tools that enable code to be integrated with data, graphs, and article text
    Image by “dmpop”, from sxc.hu
  21. We should educate our students on reproducibility
    Image: “The Bulls Eye” by “Aaron S.” on Flickr, CC-BY-ND
  22. Should journals require all computer code to be released as supplementary material?
    Should scientists be using non-open programs at all?
    Should scientists even be programming?
  23. Ultracal: a case study
    Source: Ophir-Spiricon website

  24. Ultracal: a case study
    What is actually going on when you use Ultracal?
  25. YOU LOSE CONTROL OF YOUR OWN DATA

  26. Should journals require all computer code to be released as supplementary material?
    Should scientists be using non-open programs at all?
    Should scientists even be programming?
  27. Source: Z. Merali, Error, Nature 467, p. 775 (2010). Data from G. Wilson, Software Carpentry.
  30. Scientists are not programmers.
    Image by “djayo”, from sxc.hu

  31. Scientists are not programmers.
    Greg Wilson: "There are terrifying statistics showing that almost all of what scientists know about coding is self-taught. They just don't know how bad they are."
  32. Use a version control system
    It’s a sort of database that records every change you make. Put your raw data, processing code, and other primary material into it, to keep a record of what you did, how, and when.
    Image: “Wooden file cabinet” by “Pptudela” on Wikimedia Commons, CC-BY-SA
  33. Track your process
    Make sure you can always reproduce your results from your sources automatically. Write everything in scripts so you don’t do any manipulation by hand.
    Image: “tracks” by “hbakkh” on Flickr, CC-BY-NC
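As a sketch of what "no manipulation by hand" could look like, the Excel column swap from earlier in the talk can be written as a small Python script. Everything specific here is an assumption for illustration: the column names `intensity` and `wavelength`, the helper names `swap_columns` and `process`, and the use of an in-memory string in place of the real export file.

```python
import csv
import io

def swap_columns(rows, i, j):
    """Return new rows with columns i and j exchanged."""
    swapped = []
    for row in rows:
        row = list(row)
        row[i], row[j] = row[j], row[i]
        swapped.append(row)
    return swapped

def process(raw_text):
    """Parse CSV text, swap the first two columns, and return CSV text."""
    rows = list(csv.reader(io.StringIO(raw_text)))
    fixed = swap_columns(rows, 0, 1)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(fixed)
    return out.getvalue()

# Hypothetical export with the two columns in the wrong order.
raw = "intensity,wavelength\n0.12,780.0\n0.15,780.5\n"
cleaned = process(raw)
print(cleaned)
```

Once the fix lives in a script, it is recorded, repeatable, and can go into version control next to the raw data.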
  34. Write testable software
    Don’t rely on validation testing. Instead, build large codes from easily testable chunks, and test how they react to broken input. Get other people to review your code.
    Image: “Checklist” by “adesigna” on Flickr, CC-BY-NC-SA
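A minimal sketch of one "easily testable chunk" with tests for broken input. The name `remove_bad_data_points` is borrowed from earlier in the talk, but the body is hypothetical: it assumes that "bad" means a non-finite value (NaN or infinity) in either coordinate, which may not match the original function.

```python
import math

def remove_bad_data_points(xs, ys):
    """Drop every (x, y) pair in which either value is NaN or infinite."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    good = [(x, y) for x, y in zip(xs, ys)
            if math.isfinite(x) and math.isfinite(y)]
    return [x for x, _ in good], [y for _, y in good]

def test_keeps_clean_data():
    assert remove_bad_data_points([1.0, 2.0], [3.0, 4.0]) == ([1.0, 2.0], [3.0, 4.0])

def test_drops_non_finite_points():
    xs = [1.0, float("nan"), 3.0]
    ys = [2.0, 5.0, float("inf")]
    assert remove_bad_data_points(xs, ys) == ([1.0], [2.0])

def test_rejects_mismatched_lengths():
    try:
        remove_bad_data_points([1.0, 2.0], [1.0])
    except ValueError:
        return
    raise AssertionError("mismatched lengths should raise ValueError")

# Run the tests directly; a test runner such as pytest would also collect them.
test_keeps_clean_data()
test_drops_non_finite_points()
test_rejects_mismatched_lengths()
print("all tests passed")
```

The broken-input tests (NaN, infinity, mismatched lengths) are the part that validation against known-good results would never exercise.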
  35. Encourage sharing
    Make the code that you use in your research freely available when possible.
  36. Encourage sharing

  37. Learn more

  38. Scientists shouldn’t have to be programmers.