
Workflows for reproducible data science

Mine Cetinkaya-Rundel

September 27, 2021

Transcript

  1. Workflows for reproducible data science

    Mine Çetinkaya-Rundel, Duke University + RStudio. mine-cetinkaya-rundel, cetinkaya.mine@gmail.com, @minebocek 🔗 bit.ly/repro-ds-21
  2. The results in Table 1 don’t seem to correspond to

    those in Figure 2!
  3. 61 3 44 94 12 4 45 20

  4. More than 70 percent of scientists surveyed have tried and failed to reproduce another scientist's experiments.

    Monya Baker. "1,500 scientists lift the lid on reproducibility." Nature News 533.7604 (2016): 452.
  5. More than 50 percent of scientists surveyed have tried and failed to reproduce their own experiments.

    Monya Baker. "1,500 scientists lift the lid on reproducibility." Nature News 533.7604 (2016): 452.
  6. Google Scholar yields 965 results containing the term "reproducibility crisis" just in 2021.

    Google Scholar Search, September 2021.
  7. 1992: earliest reference to reproducibility research* that I could find…

    Jon Claerbout and Martin Karrenbach. "Electronic documents give reproducible research a new meaning." SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists, 1992. 601-604.
  10. Michelle Lewis, et al. "Replication Study: Transcriptional amplification in tumor

    cells with elevated c-Myc." Elife 7 (2018): e30274.
  11. setting the stage. Photo by Alexander Dummer on Unsplash.

  12. replicability: same research question, new data, same results

    reproducibility: same research question, same data, same results
  13. “Your closest collaborator is you six months ago, but you don’t reply to emails.”

    – Mark Holder
  14. e.g.

    Term         Estimate  Std. Error  Statistic  p-value
    (Intercept)  9.06      0.929       9.76       1.13e-17
    Sepal.Width  -1.74     0.301       -5.77      4.51e-8

    Table 1. Regression output for predicting petal length from sepal width.
    Figure 2. Relationship between petal length and sepal width.
  16. analysis report

    Term         Estimate  Std. Error  Statistic  p-value
    (Intercept)  9.06      0.929       9.76       1.13e-17
    Sepal.Width  -1.74     0.301       -5.77      4.51e-8

    Table 1. Regression output for predicting petal length from sepal width.
  17. analysis report

    Term         Estimate  Std. Error  Statistic  p-value
    (Intercept)  9.06      0.929       9.76       1.13e-17
    Sepal.Width  -1.74     0.301       -5.77      4.51e-8

    Table 1. Regression output for predicting petal length from sepal width.
    Figure 2. Relationship between petal length and sepal width.
  20. making research reproducible

  21. “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”

    – David Donoho, paraphrasing Jon Claerbout
    Jonathan Buckheit and David Donoho. "Wavelab and reproducible research." Wavelets and statistics. Springer, New York, NY, 1995. 55-81.
  22. make available and accessible:

    raw data
    code & documentation to reproduce the analysis
    specifications of your computational environment
    Peng, Roger. "The reproducibility crisis in science: A statistical counterattack." Significance 12.3 (2015): 30-32.
    Gentleman, Robert, and Duncan Temple Lang. "Statistical analyses and reproducible research." Journal of Computational and Graphical Statistics 16.1 (2007): 1-23.
  23. “The most important tool is the mindset, when starting, that the end product will be reproducible.”

    – Keith Baggerly
  24. from "nobody, not even yourself, can recreate any part of your analysis"

    to "push button reproducibility in published work" 💃 🎯
  25. “There’s no one-size-fits-all solution for computational reproducibility.”

    Perkel, Jeffrey M. "A toolkit for data transparency takes shape." Nature 560 (2018): 513-515.
  26. but the following might help… 8 principles

  27. Principle 1: organize your project

  28. “File organization and naming are powerful weapons against chaos.”

    – Jenny Bryan
  29. level of organization

  30. simpler analysis

    raw-data
    processed-data
    manuscript
      |- manuscript.Rmd

    more complex analysis

    raw-data
    processed-data
    scripts
    manuscript
      |- manuscript.Rmd
    figures

    stick with the conventions of your peers
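
A layout like the "more complex analysis" one can be scaffolded with a few lines of code, which keeps every project starting from the same skeleton. The following is a minimal sketch in Python, not something shown in the talk; the function name `create_project`, the project name, and the placement of `manuscript.Rmd` inside `manuscript/` are assumptions made for illustration.

```python
from pathlib import Path
import tempfile

# Folder names follow the "more complex analysis" layout on the slide.
FOLDERS = ["raw-data", "processed-data", "scripts", "manuscript", "figures"]


def create_project(root: Path) -> list[Path]:
    """Create the project folders and an empty manuscript file under `root`."""
    created = []
    for name in FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run on an existing project
        created.append(folder)
    # Assumption: manuscript.Rmd lives inside the manuscript folder.
    (root / "manuscript" / "manuscript.Rmd").touch()
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "my-analysis"  # hypothetical project name
        create_project(project)
        for path in sorted(project.rglob("*")):
            print(path.relative_to(project))
```

Because `mkdir` is called with `exist_ok=True`, running the scaffold twice does not fail or overwrite existing work.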
  31. Lucy D’Agostino McGowan. “One year to dissertate”, livefreeordichotomize.com/2018/09/14/one-year-to-dissertate. 📖 recommended

    reading
  32. Principle 2: write READMEs liberally

  33. raw-data

      |- README.md
      |- airlines.csv
      |- airports.csv
      |- flights.csv
      |- planes.csv
      |- weather.csv
    processed-data
    scripts
    manuscript
    figures

    # README
    This folder contains the raw data for the project. All datasets were downloaded from openflights.org/data.html on 2019-04-01.
    - airlines: Airline names
    - airports: Airports metadata
    - flights: Flight data
    - planes: Plane metadata
    - weather: Hourly weather data
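
A README only helps if it keeps up with the folder it describes. As a rough sketch (not part of the talk), the check below lists CSV files whose name is never mentioned in the folder's README. The function name `undocumented_files` and the file `routes.csv` are hypothetical, and the substring match is deliberately naive.

```python
from pathlib import Path
import tempfile


def undocumented_files(folder: Path, readme_name: str = "README.md") -> list[str]:
    """Return CSV files in `folder` whose base name never appears in the README."""
    readme_text = (folder / readme_name).read_text(encoding="utf-8").lower()
    return [
        path.name
        for path in sorted(folder.glob("*.csv"))
        if path.stem.lower() not in readme_text  # naive substring match
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw-data"
        raw.mkdir()
        (raw / "README.md").write_text(
            "# README\n- airlines: Airline names\n- airports: Airports metadata\n",
            encoding="utf-8",
        )
        # routes.csv is a hypothetical file that nobody documented.
        for name in ["airlines.csv", "airports.csv", "routes.csv"]:
            (raw / name).touch()
        print(undocumented_files(raw))  # prints ['routes.csv']
```

A check like this could run as part of a test suite so that adding a data file without describing it is noticed early.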
  34. Principle 3: keep data tidy & machine readable

  35. Student Name      Exam Grade 1  Exam Grade 2  Major

    Barney Donaldson  89            76            Data Science, Public Policy
    Clay Whelan       67            83            Public Policy
    Simran Bass       82            90            Statistics
    Chante Munro      45            72            Political Science, Statistics
    Gabrielle Cherry  32            79            .
    Kush Piper        98            sick          Statistics
    Faizan Ratliff    82            75            Data Science
    Torin Ruiz        70            80            Sociology, Statistics
    Reiss Richardson  missed exam   34            Neuroscience
    Ajwa Cochran      50            65            Data Science
    Low participation

    name              exam_1  exam_2  first_major        second_major   participation
    Barney Donaldson  89      76      Data Science       Public Policy  ok
    Clay Whelan       67      83      Public Policy      NA             ok
    Simran Bass       82      90      Statistics         NA             ok
    Chante Munro      45      72      Political Science  Statistics     low
    Gabrielle Cherry  32      79      NA                 NA             ok
    Kush Piper        98      NA      Statistics         NA             ok
    Faizan Ratliff    82      75      Data Science       NA             ok
    Torin Ruiz        70      80      Sociology          Statistics     ok
    Reiss Richardson  NA      34      Neuroscience       NA             low
    Ajwa Cochran      50      65      Data Science       NA             low

    record code + document non-code steps + write tests
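
The step from the first table to the second can itself be recorded as code and tested. Below is a minimal sketch in Python, assuming each messy row arrives as four strings; the helper names (`parse_score`, `split_majors`, `tidy_row`) are hypothetical. The participation column is left out because, in this transcript, the messy table does not carry it as a cell value.

```python
from typing import Optional


def parse_score(cell: str) -> Optional[int]:
    """Return the score as an int, or None when the cell holds a note such as 'sick'."""
    cell = cell.strip()
    return int(cell) if cell.isdigit() else None


def split_majors(cell: str) -> tuple[Optional[str], Optional[str]]:
    """Split 'A, B' into (first_major, second_major); '.' or an empty cell means missing."""
    parts = [p.strip() for p in cell.split(",") if p.strip() not in ("", ".")]
    first = parts[0] if len(parts) > 0 else None
    second = parts[1] if len(parts) > 1 else None
    return first, second


def tidy_row(name: str, exam_1: str, exam_2: str, major: str) -> dict:
    """Turn one messy row into a record with one value per field."""
    first_major, second_major = split_majors(major)
    return {
        "name": name,
        "exam_1": parse_score(exam_1),
        "exam_2": parse_score(exam_2),
        "first_major": first_major,
        "second_major": second_major,
    }


if __name__ == "__main__":
    # Rows copied from the messy table on the slide.
    messy = [
        ("Barney Donaldson", "89", "76", "Data Science, Public Policy"),
        ("Gabrielle Cherry", "32", "79", "."),
        ("Kush Piper", "98", "sick", "Statistics"),
        ("Reiss Richardson", "missed exam", "34", "Neuroscience"),
    ]
    for row in messy:
        print(tidy_row(*row))
```

Missing values come out as `None` here, which plays the role that `NA` plays in the tidy table.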
  36. Mark Ziemann, Yotam Eren, and Assam El-Osta. "Gene name errors

    are widespread in the scientific literature." Genome biology 17.1 (2016): 177. doi.org/10.1186/s13059-016-1044-7. 📖 recommended reading
  37. Karl Broman and Kara Woo. "Data organization in spreadsheets." The

    American Statistician 72.1 (2018): 2-10. doi.org/10.1080/00031305.2017.1375989. 📖 recommended reading
  38. Principle 4: comment your code

  39. 🤷

  40. Principle 5: use literate programming

  44. Yihui Xie, JJ Allaire, and Garrett Grolemund. “R Markdown: The

    Definitive Guide”, bookdown.org/yihui/rmarkdown. 📖 recommended reading
  45. Yihui Xie, Christophe Dervieux, and Emily Riederer. “R Markdown Cookbook”,

    bookdown.org/yihui/rmarkdown-cookbook. 📖 recommended reading
  46. Principle 6: use version control

  47. changes tracked by hosted on

  48. Jenny Bryan, et al. “Happy Git with R”, happygitwithr.com. 📖

    recommended reading
  49. Principle 7: automate your process

  50. raw-data

    processed-data
    scripts
      |- 00-analyse.R
      |- 01-load-packages.R
      |- 02-load-data.R
      |- 03-clean-data.R
      |- 04-explore.R
      |- 05-model.R
      |- 06-summarise.R
    manuscript
    figures
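
Numbered script names make the run order explicit, so a single driver can execute every step in sequence. The sketch below, in Python, derives that order from the file names on the slide. Treating `00-analyse.R` as the driver that runs the other scripts is an assumption, and `ordered_steps` and `run_pipeline` are hypothetical names.

```python
import re
from typing import Callable, Iterable

# Matches names such as "03-clean-data.R" and captures the numeric prefix.
STEP_PATTERN = re.compile(r"^(\d+)-.+\.R$")


def ordered_steps(filenames: Iterable[str], driver: str = "00-analyse.R") -> list[str]:
    """Return step scripts sorted by numeric prefix, excluding the driver script."""
    steps = []
    for name in filenames:
        match = STEP_PATTERN.match(name)
        if match and name != driver:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def run_pipeline(filenames: Iterable[str], run_step: Callable[[str], None]) -> list[str]:
    """Run every step in order; an exception from `run_step` stops the pipeline."""
    steps = ordered_steps(filenames)
    for script in steps:
        run_step(script)
    return steps


if __name__ == "__main__":
    scripts = [
        "01-load-packages.R", "03-clean-data.R", "04-explore.R", "05-model.R",
        "06-summarise.R", "02-load-data.R", "00-analyse.R",
    ]
    # Here each step is only printed; a real runner would execute the script.
    run_pipeline(scripts, print)
```

To actually execute the scripts, `run_step` could wrap something like `subprocess.run(["Rscript", script], check=True)`, assuming `Rscript` is available on the machine.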
  51. Karl Broman. “Minimal Make”, kbroman.org/minimal_make. 📖 recommended reading

  52. Will Landau. “The targets R Package User Manual”, books.ropensci.org/targets. 📖

    recommended reading
  53. Principle 8: share computing environment
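
One lightweight way to share an environment is to write its specification to a file that travels with the project. The following is a minimal sketch in Python, not the tooling shown in the talk; `environment_snapshot` is a hypothetical helper and the package names passed to it are placeholders. In R, `sessionInfo()` reports similar information.

```python
import json
import platform
from importlib import metadata


def environment_snapshot(packages: list[str]) -> dict:
    """Collect interpreter, operating system, and package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Placeholder package names; list whatever the analysis actually imports.
    snapshot = environment_snapshot(["numpy", "pandas"])
    print(json.dumps(snapshot, indent=2))
```

A snapshot like this documents the environment but does not recreate it.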

  59. Carl Boettiger. "An introduction to Docker for reproducible research." ACM

    SIGOPS Operating Systems Review 49.1 (2015): 71-79. Ben Marwick, Carl Boettiger, and Lincoln Mullen. "Packaging data analytical work reproducibly using R (and friends)." The American Statistician 72.1 (2018): 80-88. 📖 recommended reading
  60. 1 organize your project

    2 write READMEs liberally
    3 keep data tidy & machine readable
    4 comment your code
    5 use literate programming
    6 use version control
    7 automate your process
    8 share computing environment
  61. Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K. Teal. "Good enough practices in scientific computing." PLoS Computational Biology 13.6 (2017): e1005510. 📖

    recommended reading
  62. painful bits

  63. > coming up with good names

    > stages of data cleaning
    > going back and redoing stuff
    > adding interim steps
    > keeping track of the order of things
    > clutter of unneeded old stuff
    Karl Broman, tools4RR. kbroman.org/Tools4RR
  64. looking into the future Photo by Sweet Ice Cream Photography

    on Unsplash.
  65. “The most important tool is the mindset, when starting, that the end product will be reproducible.”

    – Keith Baggerly
  66. “The second most important tool is training.” – Karl Broman

  67. #1: Convince data scientists to adopt a reproducible data analysis workflow

    #2: Train new data scientists who don’t have any other workflow
  68. statistics and data science educators who teach data analysis should

    be instilling best practices in students before they set out to do research
  71. Workflows for reproducible data science

    mine-cetinkaya-rundel, cetinkaya.mine@gmail.com, @minebocek 🔗 bit.ly/repro-ds-21