Open and reproducible scientific programming

OPEN AND REPRODUCIBLE SCIENTIFIC PROGRAMMING
Philip Chimento
Lunch talk, April 5, 2012

Philip. You wrote a nice paper, but I require access to your data analysis methods. I am trying to reproduce your results. Without success so far.
Image: “mad scientist” by “linsepatron” on Flickr, CC-BY-NC-ND

After all, you don’t want your research to be irreproducible,
and die alone and unloved, like Pons and Fleischmann of “cold fusion” fame… DO YOU?

Well, I don’t know… my code is really messy and
I’m not sure if I should release it…

I SAID COLD FUSION

OK, OK, let’s take a look…

How did I do that again? I think you were
supposed to run “fit_poling_period_data” and then “test_efficiency_data”…

Darn, it was “generate_data_set” and then “remove_bad_data_points”…

Oh, right, you had to go into Excel and switch
the columns around, because OOIBase didn’t export the data properly.

Okay, that should do it. Wait, the results are different.

AAAAAAARGH There was a bug in that function in Matlab
and now they fixed it and my results are different??!!

COLD FUSION FAIL

Should journals require all computer code to be released as supplementary material?
Should scientists be using non-open programs at all?
Should scientists even be programming?

Code release policies
LENIENT: “Nature does not require authors to make code available, but we do expect a description detailed enough to allow others to write their own code to do similar analysis.” (Editorial: Devil in the Details, Nature 470, p. 305, Feb. 16 2011)
RIGOROUS: “To address the growing complexity of data and analyses, Science is extending our data access requirement […] to include computer codes involved in the creation or analysis of data.” (Editorial: Making Data Maximally Available, Science 331, p. 649, Feb. 11 2011)
This section is a summary of: Ince, Hatton, Graham-Cumming. The case for open computer programs. Nature 482, p. 485. (2012).

Why release code?
Code descriptions are ambiguous. A description in natural or mathematical language can be turned into any number of different programs.
[Equation on slide, garbled in this transcript: an expression involving a sinc function with limits −∞ and ∞]
Adaptive Simpson quadrature won’t work on that, because it has a pole near the real axis. It’ll give you a garbage result.
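
To make the ambiguity concrete, here is a minimal Python sketch. The slide's integrand is not recoverable from this transcript, so the sketch assumes a stand-in, 1/(x² + ε²) with ε = 0.01, whose poles at x = ±iε sit close to the real axis. It also uses a fixed-step composite Simpson rule rather than the adaptive variant named on the slide, to keep the code short. All function names are illustrative.

```python
import math

EPS = 0.01  # assumed distance of the poles (at x = +/- i*EPS) from the real axis


def integrand(x):
    """Stand-in integrand (an assumption, not the one on the slide)."""
    return 1.0 / (x * x + EPS * EPS)


def simpson(f, a, b, n):
    """Composite Simpson rule on [a, b] with n equal subintervals (n even)."""
    if n <= 0 or n % 2:
        raise ValueError("n must be a positive even integer")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior nodes alternate between weight 4 (odd i) and weight 2 (even i).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


# Closed form of the integral over [-1, 1], used only for comparison.
exact = (2.0 / EPS) * math.atan(1.0 / EPS)

for n in (10, 12, 20000):
    approx = simpson(integrand, -1.0, 1.0, n)
    print(f"n = {n:5d}: Simpson = {approx:10.3f}   exact = {exact:.3f}")
```

Both coarse runs are legitimate readings of “integrate it with Simpson's rule”, yet they disagree with each other and with the closed form by a wide margin. Only the fine grid resolves the narrow peak. A written description that omits the step size omits the part that decides the answer.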

Why release code?
Algorithm descriptions are imperfect, even if they could be entirely unambiguous.
One data set, 6 significant figures
Nine different commercial implementations of the same algorithm
Nine different results, with only 1 or 2 digits of agreement
Hatton, Roberts (1994). How accurate is scientific software? IEEE Trans Softw Eng. 20, 785.
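
It does not take nine commercial packages to see this kind of disagreement. As a toy illustration (not the algorithm Hatton and Roberts studied), the Python sketch below computes the sample variance of one small data set with two textbook formulas that are algebraically identical. The data values and function names are made up for the example.

```python
def variance_one_pass(xs):
    """Sample variance as (sum(x^2) - (sum x)^2 / n) / (n - 1)."""
    n = len(xs)
    s = sum(xs)
    sq = sum(x * x for x in xs)
    return (sq - s * s / n) / (n - 1)


def variance_two_pass(xs):
    """Sample variance as sum((x - mean)^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Made-up data: a large offset with small scatter. The exact variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("one-pass:", variance_one_pass(data))
print("two-pass:", variance_two_pass(data))
```

In double precision the two-pass version returns 30.0, while the one-pass version subtracts two numbers near 4 × 10^18 and loses the digits that matter. Both are faithful implementations of the same written definition.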

Why release code?
Algorithm descriptions are imperfect, even if they could be entirely unambiguous.
“These data, however, were used by geologists to site extremely expensive marine drilling rigs”
Image: “Bulford Dolphin in dry dock” by “Jetset” from Wikimedia Commons, CC-BY-SA

Why release code?
“Given enough eyeballs, all bugs are shallow” (Eric S. Raymond, The Cathedral and the Bazaar)
One-third of all software failures in a large-scale IBM study only occurred for the first time after 5000 execution-years. (Adams, IBM J Res. Develop. 28, p. 2, 1984)

Journals should implement categories of code openness
This signals potential reproducibility issues.
Categories: full source code, partial, marginal, no source code
Image by Caroline Madigan for opensource.com, CC-BY-SA

Research funding bodies should commission R&D on tools that enable code to be integrated with data, graphs, and article text
Image by “dmpop”, from sxc.hu

We should educate our students on reproducibility
Image: “The Bulls Eye” by “Aaron S.” on Flickr, CC-BY-ND

Ultracal: a case study
Source: Ophir-Spiricon website

Ultracal: a case study What is actually going on when
you use Ultracal?

YOU LOSE CONTROL OF YOUR OWN DATA

Source: Z. Merali, Error, Nature 467, p. 775 (2010). Data
from G. Wilson, Software Carpentry.

Scientists are not programmers.
Image by “djayo”, from sxc.hu

Scientists are not programmers. Greg Wilson: "There are terrifying statistics
showing that almost all of what scientists know about coding is self-taught. They just don't know how bad they are."

Use a version control system
It’s a sort of database that records every change you make. Put your raw data, processing code, and other primary material into it, to keep a record of what you did, how, and when.
Image: “Wooden file cabinet” by “Pptudela” on Wikimedia Commons, CC-BY-SA

Track your process
Make sure you can always reproduce your results from your sources automatically. Write everything in scripts so you don’t do any manipulation by hand.
Image: “tracks” by “hbakkh” on Flickr, CC-BY-NC
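
As a sketch of what “no manipulation by hand” can look like, the Python below replaces a manual column swap in a spreadsheet (like the Excel step earlier in the talk) with a function that can be rerun on every export. The two-column comma-separated layout, the sample values, and the function names are assumptions for illustration. They do not describe OOIBase's actual output.

```python
import csv
import io


def swap_columns(rows, i, j):
    """Return a copy of rows with columns i and j exchanged."""
    fixed = []
    for row in rows:
        row = list(row)  # copy, so the caller's data is left untouched
        row[i], row[j] = row[j], row[i]
        fixed.append(row)
    return fixed


def fix_export(text, i=0, j=1):
    """Parse CSV text, swap two columns, and return the corrected CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(swap_columns(rows, i, j))
    return out.getvalue()


# Hypothetical export with the two columns in the wrong order.
raw = "0.12,632.8\n0.15,633.1\n"
print(fix_export(raw), end="")
```

Once the fix lives in a script under version control, the step is recorded, repeatable, and visible to anyone trying to reproduce the result.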

Write testable software
Don’t rely on validation testing. Instead, build large codes from easily testable chunks, and test how they react to broken input. Get other people to review your code.
Image: “Checklist” by “adesigna” on Flickr, CC-BY-NC-SA
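
A minimal Python sketch of “easily testable chunks” might look like the following. The two functions and their tests are hypothetical examples, not code from the talk. The point is that each chunk is small enough to check in isolation, and that the tests feed it broken input on purpose.

```python
import math


def drop_non_finite(values):
    """Return the finite numbers in values; raise TypeError on non-numeric entries."""
    cleaned = []
    for v in values:
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise TypeError(f"expected a number, got {v!r}")
        if math.isfinite(v):
            cleaned.append(v)
    return cleaned


def mean(values):
    """Arithmetic mean; refuses to guess a value for an empty data set."""
    if not values:
        raise ValueError("cannot average an empty data set")
    return sum(values) / len(values)


def test_drops_nan_and_infinity():
    assert drop_non_finite([1.0, float("nan"), 2.0, float("inf")]) == [1.0, 2.0]


def test_rejects_text_that_looks_like_a_number():
    try:
        drop_non_finite([1.0, "2.0"])
    except TypeError:
        return
    raise AssertionError("a string slipped through as data")


def test_mean_of_nothing_is_an_error():
    try:
        mean([])
    except ValueError:
        return
    raise AssertionError("mean([]) should have raised ValueError")


if __name__ == "__main__":
    test_drops_nan_and_infinity()
    test_rejects_text_that_looks_like_a_number()
    test_mean_of_nothing_is_an_error()
    print("all tests passed")
```

Failing loudly on bad input is a design choice: a silent fallback would let a malformed export flow straight into a published figure.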

Encourage sharing Make the code that you use in your
research freely available when possible.

Encourage sharing

Learn more

Scientists shouldn’t have to be programmers.
