
Improve your workflow for reproducible science


For data analysis to be reproducible, the data and code should be assembled so that results (e.g. tables and figures) can be re-created. While the scientific community largely agrees that reproducibility is a minimal standard by which data analyses should be evaluated, and a myriad of software tools for reproducible computing exist, it is still not trivial to reproduce someone's (sometimes your own!) results without wrestling with unavailable analysis data, external dependencies, missing packages, out-of-date software, etc. In this workshop we will demonstrate a workflow for reproducible data science with R, R Markdown, Git, and GitHub. Experience with R is expected, but familiarity with the other tools is not required.

Mine Cetinkaya-Rundel

October 28, 2021


Transcript

  1. Improve your workflow for reproducible science
    Mine Çetinkaya-Rundel
    Duke University + RStudio
    mine-cetinkaya-rundel
    [email protected]
    @minebocek
    🔗 bit.ly/improve-repro-workflow

  2. The results in Table 1 don’t seem to correspond to those in Figure 2!

  3. 61
    3
    44
    94 12
    4
    45
    20


  4. More than 70 percent have tried and failed to reproduce another scientist's experiments
    Monya Baker. "1,500 scientists lift the lid on reproducibility." Nature News 533.7604 (2016): 452.

  5. More than 50 percent have tried and failed to reproduce their own experiments
    Monya Baker. "1,500 scientists lift the lid on reproducibility." Nature News 533.7604 (2016): 452.

  6. Google Scholar yields 1070 results containing the term reproducibility crisis just in 2021
    Google Scholar Search, 28 October 2021.

  7. 1992: earliest reference to reproducibility research*
    *that I could find…
    Jon Claerbout and Martin Karrenbach. "Electronic documents give reproducible research a new meaning." SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists, 1992. 601-604.

  8. Jon Claerbout and Martin Karrenbach. "Electronic documents give reproducible research a new meaning." SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists, 1992. 601-604.

  9. Jon Claerbout and Martin Karrenbach. "Electronic documents give reproducible research a new meaning." SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists, 1992. 601-604.

  10. Michelle Lewis, et al. "Replication Study: Transcriptional amplification in tumor cells with elevated c-Myc." Elife 7 (2018): e30274.


  11. setting the stage
    Photo by Alexander Dummer on Unsplash.

  12. replicability            reproducibility
    same research question   same research question
    same results             same results
    new data                 same data

  13. e.g.
    term               estimate  std.error  statistic  p.value
    (Intercept)        33.6      1.25       27.0       1.39e-86
    flipper_length_mm  -0.0820   0.00618    -13.3      1.23e-32
    Table 1. Regression output for predicting bill depth from flipper length.
    Figure 2. Relationship between bill depth and flipper length.

  14. e.g.
    term               estimate  std.error  statistic  p.value
    (Intercept)        33.6      1.25       27.0       1.39e-86
    flipper_length_mm  -0.0820   0.00618    -13.3      1.23e-32
    Table 1. Regression output for predicting bill depth from flipper length.
    Figure 2. Relationship between bill depth and flipper length.

  15. analysis report
    term               estimate  std.error  statistic  p.value
    (Intercept)        33.6      1.25       27.0       1.39e-86
    flipper_length_mm  -0.0820   0.00618    -13.3      1.23e-32
    Table 1. Regression output for predicting bill depth from flipper length.

  16. analysis report
    term               estimate  std.error  statistic  p.value
    (Intercept)        33.6      1.25       27.0       1.39e-86
    flipper_length_mm  -0.0820   0.00618    -13.3      1.23e-32
    Table 1. Regression output for predicting bill depth from flipper length.
    Figure 2. Relationship between bill depth and flipper length.

  17. analysis report
    Figure 2. Relationship between bill depth and flipper length.
    term               estimate  std.error  statistic  p.value
    (Intercept)        33.6      1.25       27.0       1.39e-86
    flipper_length_mm  -0.0820   0.00618    -13.3      1.23e-32
    Table 1. Regression output for predicting bill depth from flipper length.

  18. term               estimate  std.error  statistic  p.value
    (Intercept)        33.6      1.25       27.0       1.39e-86
    flipper_length_mm  -0.0820   0.00618    -13.3      1.23e-32
    Table 1. Regression output for predicting bill depth from flipper length.
    Figure 2. Relationship between bill depth and flipper length.
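
One way to keep a table and a figure like these from drifting apart is to derive both from a single fitted model object instead of copying numbers by hand. The sketch below is only an illustration: it uses simulated stand-in data (the values are made up and are not the data behind the slides), a temporary output file, and base R only.

```r
# Simulated stand-in data (hypothetical values, NOT the data from the slides)
set.seed(2021)
flipper_length_mm <- runif(100, min = 170, max = 230)
bill_depth_mm <- 33 - 0.08 * flipper_length_mm + rnorm(100, sd = 1.5)
birds <- data.frame(flipper_length_mm, bill_depth_mm)

# Fit the model once; the table and the figure both read from this object
fit <- lm(bill_depth_mm ~ flipper_length_mm, data = birds)

# "Table 1": coefficient table taken directly from the fitted model
table_1 <- as.data.frame(summary(fit)$coefficients)

# "Figure 2": scatterplot with the fitted line drawn from the same model
figure_path <- file.path(tempdir(), "figure-2.pdf")  # hypothetical file name
pdf(figure_path)
plot(bill_depth_mm ~ flipper_length_mm, data = birds,
     xlab = "Flipper length (mm)", ylab = "Bill depth (mm)")
abline(fit)
dev.off()

print(round(table_1, 4))
```

Because the line in the figure and the estimates in the table come from the same `fit`, rerunning the script after a data change updates both together.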

  19. making research reproducible

  20. make available and accessible:
    ‣ raw data
    ‣ code & documentation to reproduce the analysis
    ‣ specifications of your computational environment
    Peng, Roger. "The reproducibility crisis in science: A statistical counterattack." Significance 12.3 (2015): 30-32.
    Gentleman, Robert, and Duncan Temple Lang. "Statistical analyses and reproducible research." Journal of Computational and Graphical Statistics 16.1 (2007): 1-23.

  21. “The most important tool is the mindset, when starting, that the end product will be reproducible.”
    – Keith Baggerly

  22. nobody, not even yourself, can recreate any part of your analysis
    push button reproducibility in published work
    💃 🎯

  23. “There’s no one-size-fits-all solution for computational reproducibility.”
    Perkel, Jeffrey M. "A toolkit for data transparency takes shape." Nature 560 (2018): 513-515.

  24. but the following might help…
    8 principles


  25. Principle 1: organize your project

  26. level of organization


  27. simpler analysis
      raw-data
      processed-data
      manuscript
      |- manuscript.Rmd
    more complex analysis
      raw-data
      processed-data
      scripts
      manuscript
      |- manuscript.Rmd
      figures
    stick with the conventions of your peers
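
The "more complex analysis" layout above can be scaffolded with a few lines of base R. This is a minimal sketch, not a prescribed setup: the project name `my-project` is hypothetical, the project is created in a temporary directory, and placing `manuscript.Rmd` inside `manuscript/` is an assumption about the slide's layout.

```r
# Create the layout in a throwaway location (hypothetical project name)
project_dir <- file.path(tempdir(), "my-project")
folders <- c("raw-data", "processed-data", "scripts", "manuscript", "figures")

for (folder in folders) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# Empty placeholder for the manuscript source (assumed to live in manuscript/)
file.create(file.path(project_dir, "manuscript", "manuscript.Rmd"))

# Show what was created
print(list.files(project_dir, recursive = TRUE, include.dirs = TRUE))
```

If your field already has a conventional layout, swapping the folder names in `folders` is likely all that needs to change.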

  28. Principle 2: write READMEs liberally

  29. raw-data
      |- README.md
      |- airlines.csv
      |- airports.csv
      |- flights.csv
      |- planes.csv
      |- weather.csv
    processed-data
    scripts
    manuscript
    figures
    # README
    This folder contains the raw data for the project.
    All datasets were downloaded from openflights.org/data.html on 2019-04-01.
    - airlines: Airline names
    - airports: Airports metadata
    - flights: Flight data
    - planes: Plane metadata
    - weather: Hourly weather data
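
A README is only useful while it matches the folder it describes, so it can help to check that automatically. The sketch below writes the README shown above into a temporary `raw-data` folder containing empty placeholder CSV files, then lists any CSV that the README does not mention. The check itself and the variable names are illustrative assumptions, not part of the original workflow.

```r
raw_dir <- file.path(tempdir(), "raw-data")
dir.create(raw_dir, showWarnings = FALSE)

# Empty placeholder files standing in for the downloaded datasets
datasets <- c("airlines", "airports", "flights", "planes", "weather")
file.create(file.path(raw_dir, paste0(datasets, ".csv")))

readme <- c(
  "# README",
  "",
  "This folder contains the raw data for the project.",
  "",
  "All datasets were downloaded from openflights.org/data.html on 2019-04-01.",
  "",
  "- airlines: Airline names",
  "- airports: Airports metadata",
  "- flights: Flight data",
  "- planes: Plane metadata",
  "- weather: Hourly weather data"
)
writeLines(readme, file.path(raw_dir, "README.md"))

# Which CSV files in the folder have no "- name:" entry in the README?
csv_names <- sub("\\.csv$", "", list.files(raw_dir, pattern = "\\.csv$"))
documented <- vapply(
  csv_names,
  function(name) any(grepl(paste0("- ", name, ":"), readme, fixed = TRUE)),
  logical(1)
)
undocumented <- csv_names[!documented]
print(undocumented)
```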

  30. Principle 3: keep data tidy & machine readable

  31. Student            Exam Grade
    Name               1            2     Major
    Barney Donaldson   89           76    Data Science, Public Policy
    Clay Whelan        67           83    Public Policy
    Simran Bass        82           90    Statistics
    Chante Munro       45           72    Political Science, Statistics
    Gabrielle Cherry   32           79    .
    Kush Piper         98           sick  Statistics
    Faizan Ratliff     82           75    Data Science
    Torin Ruiz         70           80    Sociology, Statistics
    Reiss Richardson   missed exam  34    Neuroscience
    Ajwa Cochran       50           65    Data Science
    Low participation

    name              exam_1  exam_2  first_major        second_major   participation
    Barney Donaldson  89      76      Data Science       Public Policy  ok
    Clay Whelan       67      83      Public Policy      NA             ok
    Simran Bass       82      90      Statistics         NA             ok
    Chante Munro      45      72      Political Science  Statistics     low
    Gabrielle Cherry  32      79      NA                 NA             ok
    Kush Piper        98      NA      Statistics         NA             ok
    Faizan Ratliff    82      75      Data Science       NA             ok
    Torin Ruiz        70      80      Sociology          Statistics     ok
    Reiss Richardson  NA      34      Neuroscience       NA             low
    Ajwa Cochran      50      65      Data Science       NA             low

    record code + document non-code steps + write tests
    Broman, Karl W., and Kara H. Woo. "Data organization in spreadsheets." The American Statistician 72.1 (2018): 2-10.
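
The step from the first table to the second can itself be recorded as code. Below is a minimal base R sketch using a few rows from the tables above; the helper `to_score()` and the variable names are illustrative assumptions, and the `participation` column is left out because the original table does not store it as a column.

```r
# Messy version: score columns mix numbers with notes, majors are comma-separated
messy <- data.frame(
  name   = c("Barney Donaldson", "Clay Whelan", "Kush Piper",
             "Reiss Richardson", "Gabrielle Cherry"),
  exam_1 = c("89", "67", "98", "missed exam", "32"),
  exam_2 = c("76", "83", "sick", "34", "79"),
  major  = c("Data Science, Public Policy", "Public Policy", "Statistics",
             "Neuroscience", "."),
  stringsAsFactors = FALSE
)

# Convert a score column to numeric; anything that is not a number becomes NA
to_score <- function(x) suppressWarnings(as.numeric(x))

# Treat "." as a missing major, then split "A, B" into separate entries
major_clean <- ifelse(messy$major == ".", NA_character_, messy$major)
majors <- strsplit(major_clean, ",\\s*")

tidy <- data.frame(
  name         = messy$name,
  exam_1       = to_score(messy$exam_1),
  exam_2       = to_score(messy$exam_2),
  first_major  = vapply(majors, function(m) m[1], character(1)),
  second_major = vapply(majors, function(m) m[2], character(1)),
  stringsAsFactors = FALSE
)

print(tidy)
```

Notes such as "sick" are dropped in this sketch; if the reason for a missing score matters, it would be better kept in its own column than discarded.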

  32. Principle 4: comment your code

  33. 🤷


  34. Principle 5: use literate programming


  37. demo
    rmarkdown


  38. more resources…
    ‣ Learn more about R Markdown:
      ‣ Documentation: rmarkdown.rstudio.com
      ‣ Book: bookdown.org/yihui/rmarkdown
      ‣ Book: bookdown.org/yihui/rmarkdown-cookbook
    ‣ Learn more about the visual editor:
      ‣ Documentation: rstudio.github.io/visual-markdown-editing

  39. Principle 6: use version control

  40. changes tracked by [Git]
    hosted on [GitHub]

  41. 2 Git workflows
    GitHub first
    Local first

  42. GitHub first
    Today I start a new project! So I’ll do the right thing and create a repo first.
    ‣ Step 1: Create a new repo on GitHub
    ‣ Step 2: Copy the repo URL
    ‣ Step 3: Clone it using RStudio
    ‣ Step 4: Make changes locally
    ‣ Step 6: Commit and push to GitHub
    ‣ Step 7: Confirm your changes have propagated to GitHub

  43. Local first
    I have been working on a project for a while, and now I’m realising I should have been tracking it with git.
    ‣ Step 1: Create an RStudio Project from existing directory (if an .Rproj file doesn’t already exist)
    ‣ Step 2: usethis::use_git() and follow instructions
    ‣ Step 3: usethis::use_github() and follow instructions

  44. demo
    git & github


  45. more resources…
    ‣ Learn more about using Git and GitHub with R:
      ‣ Book: happygitwithr.com
    ‣ Learn more about Git setup:
      ‣ Documentation: usethis.r-lib.org/articles/articles/usethis-setup.html

  46. Principle 7: automate your process

  47. raw-data
    processed-data
    scripts
      |- 00-analyse.R
      |- 01-load-packages.R
      |- 02-load-data.R
      |- 03-clean-data.R
      |- 04-explore.R
      |- 05-model.R
      |- 06-summarise.R
    manuscript
    figures
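
A driver script such as `00-analyse.R` could simply run the numbered scripts in order. What follows is a sketch of that idea, not the actual contents of the script: it builds tiny stand-in scripts in a temporary `scripts` folder (their contents are hypothetical and only record that they ran) and then sources every file whose name starts with `01` to `09`.

```r
# Throwaway "scripts" folder with stand-in scripts (hypothetical contents)
scripts_dir <- file.path(tempdir(), "scripts")
dir.create(scripts_dir, showWarnings = FALSE)

steps <- c("01-load-packages", "02-load-data", "03-clean-data",
           "04-explore", "05-model", "06-summarise")
for (step in steps) {
  # Each stand-in script only appends its own name to a log vector
  writeLines(sprintf("steps_run <- c(steps_run, '%s')", step),
             file.path(scripts_dir, paste0(step, ".R")))
}

# What a driver like 00-analyse.R might do: run every numbered step in order.
# The pattern skips 00-*.R so the driver never sources itself.
run_env <- new.env()
assign("steps_run", character(0), envir = run_env)

step_files <- sort(list.files(scripts_dir, pattern = "^0[1-9]-.*\\.R$",
                              full.names = TRUE))
for (step_file in step_files) {
  source(step_file, local = run_env)
}

print(run_env$steps_run)
```

The numeric prefixes do the ordering here. For larger projects where rerunning every step is too slow, tools in the spirit of Make track which steps are out of date.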

  48. 📖 recommended reading
    Karl Broman. “Minimal Make”, kbroman.org/minimal_make.

  49. 📖 recommended reading
    Will Landau. “The targets R Package User Manual”, books.ropensci.org/targets.

  50. Principle 8: share computing environment
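
A lightweight first step toward sharing a computing environment is to save the output of `sessionInfo()` next to the analysis. The sketch below assumes a hypothetical file name, `session-info.txt`, written to a temporary directory. It records the R version, platform, and loaded packages, but it does not by itself recreate the environment on another machine.

```r
# Record R version, platform, and loaded packages alongside the analysis
env_file <- file.path(tempdir(), "session-info.txt")  # hypothetical file name
writeLines(capture.output(sessionInfo()), env_file)

# Inspect what was recorded
cat(readLines(env_file), sep = "\n")
```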


  52. 1 organize your project
    2 write READMEs liberally
    3 keep data tidy & machine readable
    4 comment your code
    5 use literate programming
    6 use version control
    7 automate your process
    8 share computing environment

  53. Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, Tracy K. Teal. "Good enough practices in scientific computing." PLoS Computational Biology 13.6 (2017): e1005510.

  54. Improve your workflow for reproducible science
    mine-cetinkaya-rundel
    [email protected]
    @minebocek
    🔗 bit.ly/improve-repro-workflow