Improve your workflow for reproducible science

For data analysis to be reproducible, the data and code should be assembled in such a way that results (e.g. tables and figures) can be re-created. The scientific community by and large agrees that reproducibility is a minimal standard by which data analyses should be evaluated, and a myriad of software tools for reproducible computing exist. Even so, it is still not trivial to reproduce someone's (sometimes your own!) results without fiddling with unavailable analysis data, external dependencies, missing packages, out-of-date software, etc. In this workshop we will demonstrate a workflow for reproducible data science with R, R Markdown, Git, and GitHub. Experience with R is expected but familiarity with the other tools is not required.

Mine Çetinkaya-Rundel

October 28, 2021

Transcript

  1. Improve your workflow for reproducible science

    Mine Çetinkaya-Rundel, Duke University + RStudio
    mine-cetinkaya-rundel, cetinkaya.mine@gmail.com, @minebocek
    🔗 bit.ly/improve-repro-workflow
  2. The results in Table 1 don’t seem to correspond to

    those in Figure 2!
  4. More than 70 percent of researchers have tried and failed to reproduce another scientist's experiments

    Monya Baker. "1,500 scientists lift the lid on reproducibility." Nature News 533.7604 (2016): 452.
  5. More than 50 percent of researchers have tried and failed to reproduce their own experiments

    Monya Baker. "1,500 scientists lift the lid on reproducibility." Nature News 533.7604 (2016): 452.
  6. Google Scholar yields 1070 results containing the term "reproducibility crisis" just in 2021

    Google Scholar Search, 28 October 2021.
  7. 1992: earliest reference to reproducibility research* that I could find…

    Jon Claerbout and Martin Karrenbach. "Electronic documents give reproducible research a new meaning." SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists, 1992. 601-604.
  8. Jon Claerbout and Martin Karrenbach. "Electronic documents give reproducible research

    a new meaning." SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists, 1992. 601-604.
  10. Michelle Lewis, et al. "Replication Study: Transcriptional amplification in tumor

    cells with elevated c-Myc." Elife 7 (2018): e30274.
  11. setting the stage [Photo by Alexander Dummer on Unsplash]

  12. replicability vs. reproducibility

    replicability: same research question, new data, same results
    reproducibility: same research question, same data, same results
  13. e.g.

    term                estimate   std.error   statistic   p.value
    (Intercept)         33.6       1.25        27.0        1.39e-86
    flipper_length_mm   -0.0820    0.00618     -13.3       1.23e-32

    Table 1. Regression output for predicting bill depth from flipper length.
    Figure 2. Relationship between bill depth and flipper length.
  15. analysis, report

    Table 1. Regression output for predicting bill depth from flipper length (same values as on the earlier slide).
  16. analysis, report

    Table 1. Regression output for predicting bill depth from flipper length. Figure 2. Relationship between bill depth and flipper length.
  17. analysis, report

    Figure 2. Relationship between bill depth and flipper length. Table 1. Regression output for predicting bill depth from flipper length.
  18. Table 1. Regression output for predicting bill depth from flipper length.

    Figure 2. Relationship between bill depth and flipper length.
  19. making research reproducible

  20. make available and accessible

    ‣ raw data
    ‣ code & documentation to reproduce the analysis
    ‣ specifications of your computational environment

    Peng, Roger. "The reproducibility crisis in science: A statistical counterattack." Significance 12.3 (2015): 30-32.
    Gentleman, Robert, and Duncan Temple Lang. "Statistical analyses and reproducible research." Journal of Computational and Graphical Statistics 16.1 (2007): 1-23.
  21. “The most important tool is the mindset, when starting, that the end product will be reproducible.”

    – Keith Baggerly
  22. nobody, not even yourself, can recreate any part of your analysis

    push-button reproducibility in published work 💃 🎯
  23. “There’s no one-size-fits-all solution for computational reproducibility.”

    Perkel, Jeffrey M. "A toolkit for data transparency takes shape." Nature 560 (2018): 513-515.
  24. but the following might help… 8 principles

  25. principle 1: organize your project

  26. level of organization

  27. stick with the conventions of your peers

    simpler analysis:
      raw-data
      processed-data
      manuscript
      |- manuscript.Rmd

    more complex analysis:
      raw-data
      processed-data
      scripts
      manuscript
      |- manuscript.Rmd
      figures
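As a rough illustration, the "more complex analysis" layout on this slide could be set up with a few lines of base R. This is only a sketch: the helper name `create_project_skeleton()` and the `my-analysis` project name are hypothetical, and the demo writes into a temporary directory rather than a real project.

```r
# Hypothetical helper (not from the talk): create the folder layout
# shown on the slide under a given project root.
create_project_skeleton <- function(root) {
  folders <- c("raw-data", "processed-data", "scripts", "manuscript", "figures")
  for (folder in folders) {
    dir.create(file.path(root, folder), recursive = TRUE, showWarnings = FALSE)
  }
  # Add an empty placeholder manuscript if none exists yet
  manuscript <- file.path(root, "manuscript", "manuscript.Rmd")
  if (!file.exists(manuscript)) file.create(manuscript)
  invisible(file.path(root, folders))
}

# Demo in a temporary directory; "my-analysis" is an assumed project name
project_root <- file.path(tempdir(), "my-analysis")
create_project_skeleton(project_root)
list.files(project_root, recursive = TRUE, include.dirs = TRUE)
```

The folder names are taken from the slide; adapt them to whatever conventions your peers already use.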
  28. principle 2: write READMEs liberally

  29. raw-data
    |- README.md
    |- airlines.csv
    |- airports.csv
    |- flights.csv
    |- planes.csv
    |- weather.csv
    processed-data
    scripts
    manuscript
    figures

    # README
    This folder contains the raw data for the project. All datasets were downloaded from openflights.org/data.html on 2019-04-01.
    - airlines: Airline names
    - airports: Airports metadata
    - flights: Flight data
    - planes: Plane metadata
    - weather: Hourly weather data
  30. principle 3: keep data tidy & machine readable

  31. record code + document non-code steps + write tests

    Student Name       Exam Grade 1   Exam Grade 2   Major
    Barney Donaldson   89             76             Data Science, Public Policy
    Clay Whelan        67             83             Public Policy
    Simran Bass        82             90             Statistics
    Chante Munro       45             72             Political Science, Statistics
    Gabrielle Cherry   32             79             .
    Kush Piper         98             sick           Statistics
    Faizan Ratliff     82             75             Data Science
    Torin Ruiz         70             80             Sociology, Statistics
    Reiss Richardson   missed exam    34             Neuroscience
    Ajwa Cochran       50             65             Data Science

    Low participation

    name               exam_1   exam_2   first_major         second_major    participation
    Barney Donaldson   89       76       Data Science        Public Policy   ok
    Clay Whelan        67       83       Public Policy       NA              ok
    Simran Bass        82       90       Statistics          NA              ok
    Chante Munro       45       72       Political Science   Statistics      low
    Gabrielle Cherry   32       79       NA                  NA              ok
    Kush Piper         98       NA       Statistics          NA              ok
    Faizan Ratliff     82       75       Data Science        NA              ok
    Torin Ruiz         70       80       Sociology           Statistics      ok
    Reiss Richardson   NA       34       Neuroscience        NA              low
    Ajwa Cochran       50       65       Data Science        NA              low

    Broman, Karl W., and Kara H. Woo. "Data organization in spreadsheets." The American Statistician 72.1 (2018): 2-10.
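To make the step from the first table to the second concrete, here is a minimal base R sketch that tidies a few of the rows. It is an illustration rather than the method used in the talk: the object names (`untidy`, `tidy`, `to_grade`) are made up, only four rows are typed in by hand, and the participation column is left out because that information is not recoverable from the text of the first table.

```r
# A few rows of the untidy table, typed in by hand for illustration
untidy <- data.frame(
  name   = c("Barney Donaldson", "Clay Whelan", "Kush Piper", "Reiss Richardson"),
  exam_1 = c("89", "67", "98", "missed exam"),
  exam_2 = c("76", "83", "sick", "34"),
  major  = c("Data Science, Public Policy", "Public Policy",
             "Statistics", "Neuroscience"),
  stringsAsFactors = FALSE
)

# Convert free-text grades to numbers; non-numeric notes become NA
to_grade <- function(x) suppressWarnings(as.numeric(x))

# Split "first, second" majors into separate pieces
majors <- strsplit(untidy$major, ",\\s*")

tidy <- data.frame(
  name         = untidy$name,
  exam_1       = to_grade(untidy$exam_1),
  exam_2       = to_grade(untidy$exam_2),
  first_major  = vapply(majors, function(m) m[1], character(1)),
  second_major = vapply(majors, function(m) m[2], character(1)),  # NA if absent
  stringsAsFactors = FALSE
)

print(tidy)
```

Doing the cleaning in code, rather than by editing the spreadsheet, means the step is recorded and can be re-run when the raw data change.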
  32. principle 4: comment your code

  33. 🤷

  34. principle 5: use literate programming

  37. demo rmarkdown

  38. more resources…

    ‣ Learn more about R Markdown:
      ‣ Documentation: rmarkdown.rstudio.com
      ‣ Book: bookdown.org/yihui/rmarkdown
      ‣ Book: bookdown.org/yihui/rmarkdown-cookbook
    ‣ Learn more about the visual editor:
      ‣ Documentation: rstudio.github.io/visual-markdown-editing
  39. principle 6: use version control

  40. changes tracked by Git, hosted on GitHub

  41. 2 Git workflows

    GitHub first, Local first
  42. GitHub first

    Today I start a new project! So I’ll do the right thing and create a repo first.
    ‣ Step 1: Create a new repo on GitHub
    ‣ Step 2: Copy the repo URL
    ‣ Step 3: Clone it using RStudio
    ‣ Step 4: Make changes locally
    ‣ Step 6: Commit and push to GitHub
    ‣ Step 7: Confirm your changes have propagated to GitHub
  43. Local first

    I have been working on a project for a while, and now I’m realising I should have been tracking it with git.
    ‣ Step 1: Create an RStudio Project from existing directory (if an .Rproj file doesn’t already exist)
    ‣ Step 2: usethis::use_git() and follow instructions
    ‣ Step 3: usethis::use_github() and follow instructions
  44. demo git & github

  45. more resources…

    ‣ Learn more about using Git and GitHub with R:
      ‣ Book: happygitwithr.com
    ‣ Learn more about Git setup:
      ‣ Documentation: usethis.r-lib.org/articles/articles/usethis-setup.html
  46. principle 7: automate your process

  47. raw-data
    processed-data
    scripts
    |- 00-analyse.R
    |- 01-load-packages.R
    |- 02-load-data.R
    |- 03-clean-data.R
    |- 04-explore.R
    |- 05-model.R
    |- 06-summarise.R
    manuscript
    figures
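One plausible reading of this slide is that 00-analyse.R acts as a master script that runs the numbered scripts in order. The sketch below shows that idea in base R under that assumption; `run_pipeline()` is a hypothetical helper, and the demo scripts contain placeholder code rather than anything from the talk.

```r
# Hypothetical helper: run every numbered script in a folder, in order,
# skipping the master script itself.
run_pipeline <- function(script_dir, master = "00-analyse.R") {
  scripts <- sort(list.files(script_dir, pattern = "^[0-9]{2}-.*\\.R$",
                             full.names = TRUE))
  scripts <- scripts[basename(scripts) != master]
  for (script in scripts) {
    message("Running ", basename(script))
    source(script, local = FALSE)  # evaluate in the global environment
  }
  invisible(basename(scripts))
}

# Demo with placeholder scripts in a temporary folder
demo_dir <- file.path(tempdir(), "scripts-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("# master script (skipped by run_pipeline)",
           file.path(demo_dir, "00-analyse.R"))
writeLines("x <- 1", file.path(demo_dir, "01-load-packages.R"))
writeLines("x <- x + 1", file.path(demo_dir, "02-load-data.R"))

ran <- run_pipeline(demo_dir)
print(ran)
```

A loop like this re-runs everything each time. Tools such as Make or the targets package, covered in the recommended reading, can skip steps whose inputs have not changed.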
  48. 📖 recommended reading: Karl Broman. “Minimal Make”, kbroman.org/minimal_make.

  49. 📖 recommended reading: Will Landau. “The targets R Package User Manual”, books.ropensci.org/targets.
  50. principle 8: share computing environment
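A lightweight way to start sharing your computing environment, shown here only as a sketch, is to save the output of `sessionInfo()` next to your analysis. The file name `session-info.txt` is an assumption, and the demo writes to a temporary directory.

```r
# Record the R version, platform, and attached packages in a plain-text file
env_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(sessionInfo()), env_file)

# Inspect the first few lines of what was recorded
head(readLines(env_file))
```

This documents the environment but does not recreate it. Packages such as renv go further by recording exact package versions in a lockfile that collaborators can restore from.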

  52. 1 organize your project
    2 write READMEs liberally
    3 keep data tidy & machine readable
    4 comment your code
    5 use literate programming
    6 use version control
    7 automate your process
    8 share computing environment
  53. Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, Tracy K. Teal. "Good enough practices in scientific computing." PLoS Computational Biology 13.6 (2017): e1005510.
  54. Improve your workflow for reproducible science

    mine-cetinkaya-rundel, cetinkaya.mine@gmail.com, @minebocek
    🔗 bit.ly/improve-repro-workflow