Workflows for reproducible data science

Mine Cetinkaya-Rundel

September 27, 2021
Transcript

  1. Workflows for reproducible data science
    Mine Çetinkaya-Rundel
    Duke University + RStudio
    mine-cetinkaya-rundel
    [email protected]
    @minebocek
    🔗 bit.ly/repro-ds-21

  2. The results in Table 1 don’t seem to correspond to those in Figure 2!

  4. more than 70 percent have tried and failed to reproduce another scientist's experiments
    Monya Baker. "1,500 scientists lift the lid on reproducibility." Nature News 533.7604 (2016): 452.

  5. more than 50 percent have tried and failed to reproduce their own experiments
    Monya Baker. "1,500 scientists lift the lid on reproducibility." Nature News 533.7604 (2016): 452.

  6. Google Scholar yields 965 results containing the term reproducibility crisis just in 2021
    Google Scholar Search, September 2021.

  7. 1992: earliest reference to reproducibility research* that I could find…
    Jon Claerbout and Martin Karrenbach. "Electronic documents give reproducible research a new meaning." SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists, 1992. 601-604.

  8. Jon Claerbout and Martin Karrenbach. "Electronic documents give reproducible research a new meaning." SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists, 1992. 601-604.

  10. Michelle Lewis, et al. "Replication Study: Transcriptional amplification in tumor cells with elevated c-Myc." Elife 7 (2018): e30274.

  11. setting the stage
    Photo by Alexander Dummer on Unsplash

  12. replicability             reproducibility
    same research question      same research question
    same results                same results
    new data                    same data

  13. “Your closest collaborator is you six months ago, but you don’t reply to emails.”
    – Mark Holder

  14. e.g.
    Term         Estimate  Std. Error  Statistic  p-value
    (Intercept)      9.06       0.929       9.76  1.13e-17
    Sepal.Width     -1.74       0.301      -5.77  4.51e-8
    Table 1. Regression output for predicting petal length from sepal width.
    Figure 2. Relationship between petal length and sepal width

  16. analysis report
    Term         Estimate  Std. Error  Statistic  p-value
    (Intercept)      9.06       0.929       9.76  1.13e-17
    Sepal.Width     -1.74       0.301      -5.77  4.51e-8
    Table 1. Regression output for predicting petal length from sepal width.

  17. analysis report
    Term         Estimate  Std. Error  Statistic  p-value
    (Intercept)      9.06       0.929       9.76  1.13e-17
    Sepal.Width     -1.74       0.301      -5.77  4.51e-8
    Table 1. Regression output for predicting petal length from sepal width.
    Figure 2. Relationship between petal length and sepal width

  19. Table 1. Regression output for predicting petal length from sepal width.
    Term         Estimate  Std. Error  Statistic  p-value
    (Intercept)      9.06       0.929       9.76  1.13e-17
    Sepal.Width     -1.74       0.301      -5.77  4.51e-8
    Figure 2. Relationship between petal length and sepal width
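
A minimal sketch of how Table 1 and Figure 2 could both be generated from one script, so the two cannot drift apart. It assumes the slide's numbers come from R's built-in `iris` data (the variable names suggest this, but the deck does not say so); if that assumption holds, the estimates printed here should agree with Table 1.

```r
# Fit the model behind Table 1, assuming the built-in iris data.
fit <- lm(Petal.Length ~ Sepal.Width, data = iris)

# Coefficient table: estimate, standard error, t statistic, p-value.
table_1 <- summary(fit)$coefficients
print(signif(table_1, 3))

# Figure 2: scatterplot with the fitted line from the same model object,
# so the table and the figure always describe the same fit.
plot(Petal.Length ~ Sepal.Width, data = iris,
     xlab = "Sepal width", ylab = "Petal length")
abline(fit)
```

Because both outputs come from the single object `fit`, changing the data or the model updates the table and the figure together.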

  20. making research reproducible

  21. “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”
    – David Donoho, paraphrasing Jon Claerbout
    Jonathan Buckheit and David Donoho. "Wavelab and reproducible research." Wavelets and statistics. Springer, New York, NY, 1995. 55-81.

  22. make
    raw data
    code & documentation to reproduce the analysis
    specifications of your computational environment
    available and accessible
    Peng, Roger. "The reproducibility crisis in science: A statistical counterattack." Significance 12.3 (2015): 30-32.
    Gentleman, Robert, and Duncan Temple Lang. "Statistical analyses and reproducible research." Journal of Computational and Graphical Statistics 16.1 (2007): 1-23.

  23. “The most important tool is the mindset, when starting, that the end product will be reproducible.”
    – Keith Baggerly

  24. nobody, not even yourself, can recreate any part of your analysis
    push button reproducibility in published work
    💃 🎯

  25. “There’s no one-size-fits-all solution for computational reproducibility.”
    Perkel, Jeffrey M. "A toolkit for data transparency takes shape." Nature 560 (2018): 513-515.

  26. but the following might help…
    8 principles

  27. 1 organize your project

  28. “File organization and naming are powerful weapons against chaos.”
    – Jenny Bryan

  29. level of organization

  30. simpler analysis
    raw-data
    processed-data
    manuscript
    |- manuscript.Rmd
    more complex analysis
    raw-data
    processed-data
    scripts
    manuscript
    |- manuscript.Rmd
    figures
    stick with the conventions of your peers
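
As a sketch, the "more complex analysis" layout above could be created with a few lines of base R. The helper name `create_project()` and the project name `my-analysis` are hypothetical, and the example writes to a temporary directory rather than a real project.

```r
# Hypothetical helper: create the folder layout for a "more complex analysis".
create_project <- function(root) {
  folders <- c("raw-data", "processed-data", "scripts", "manuscript", "figures")
  for (folder in folders) {
    dir.create(file.path(root, folder), recursive = TRUE, showWarnings = FALSE)
  }
  # Start the manuscript as an empty R Markdown file if it does not exist yet.
  manuscript <- file.path(root, "manuscript", "manuscript.Rmd")
  if (!file.exists(manuscript)) file.create(manuscript)
  invisible(root)
}

# Try it in a temporary directory so nothing in a real project is touched.
project <- create_project(file.path(tempdir(), "my-analysis"))
print(list.files(project, recursive = TRUE, include.dirs = TRUE))
```

The folder names are the ones on the slide; if your peers use different conventions, swap them in.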

  31. 📖 recommended reading
    Lucy D’Agostino McGowan. “One year to dissertate”, livefreeordichotomize.com/2018/09/14/one-year-to-dissertate.

  32. 2 write READMEs liberally

  33. raw-data
    |- README.md
    |- airlines.csv
    |- airports.csv
    |- flights.csv
    |- planes.csv
    |- weather.csv
    processed-data
    scripts
    manuscript
    figures

    # README
    This folder contains the raw data for the project.
    All datasets were downloaded from openflights.org/data.html on 2019-04-01.
    - airlines: Airline names
    - airports: Airports metadata
    - flights: Flight data
    - planes: Plane metadata
    - weather: Hourly weather data

  34. 3 keep data tidy & machine readable

  35. Student         Exam Grade
    Name              1            2     Major
    Barney Donaldson  89           76    Data Science, Public Policy
    Clay Whelan       67           83    Public Policy
    Simran Bass       82           90    Statistics
    Chante Munro      45           72    Political Science, Statistics
    Gabrielle Cherry  32           79    .
    Kush Piper        98           sick  Statistics
    Faizan Ratliff    82           75    Data Science
    Torin Ruiz        70           80    Sociology, Statistics
    Reiss Richardson  missed exam  34    Neuroscience
    Ajwa Cochran      50           65    Data Science
    Low participation

    name              exam_1  exam_2  first_major        second_major   participation
    Barney Donaldson  89      76      Data Science       Public Policy  ok
    Clay Whelan       67      83      Public Policy      NA             ok
    Simran Bass       82      90      Statistics         NA             ok
    Chante Munro      45      72      Political Science  Statistics     low
    Gabrielle Cherry  32      79      NA                 NA             ok
    Kush Piper        98      NA      Statistics         NA             ok
    Faizan Ratliff    82      75      Data Science       NA             ok
    Torin Ruiz        70      80      Sociology          Statistics     ok
    Reiss Richardson  NA      34      Neuroscience       NA             low
    Ajwa Cochran      50      65      Data Science       NA             low

    record code + document non-code steps + write tests
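
One possible way to record the cleaning step in code rather than doing it by hand is sketched below, using four rows from the slide. The column names of the messy input (`exam1`, `exam2`, `major`) and the helper `to_grade()` are assumptions made for illustration; the participation column is left out because on the slide it is not stored as a cell value. The final `stopifnot()` call is a small example of the "write tests" advice.

```r
# Four rows of the messy grade sheet, typed in as text (values from the slide).
messy <- data.frame(
  name  = c("Barney Donaldson", "Kush Piper", "Reiss Richardson", "Gabrielle Cherry"),
  exam1 = c("89", "98", "missed exam", "32"),
  exam2 = c("76", "sick", "34", "79"),
  major = c("Data Science, Public Policy", "Statistics", "Neuroscience", "."),
  stringsAsFactors = FALSE
)

# Convert a grade column to numeric; notes such as "sick" become NA.
to_grade <- function(x) suppressWarnings(as.numeric(x))

# Treat "." as missing, then split "A, B" into first and second major.
major <- messy$major
major[major == "."] <- NA
parts <- strsplit(major, ",\\s*")

tidy <- data.frame(
  name         = messy$name,
  exam_1       = to_grade(messy$exam1),
  exam_2       = to_grade(messy$exam2),
  first_major  = vapply(parts, function(p) p[1], character(1)),
  second_major = vapply(parts, function(p) p[2], character(1)),
  stringsAsFactors = FALSE
)

# A small test: grades must be numeric and, where present, between 0 and 100.
stopifnot(is.numeric(tidy$exam_1), is.numeric(tidy$exam_2),
          all(tidy$exam_1 >= 0 & tidy$exam_1 <= 100, na.rm = TRUE),
          all(tidy$exam_2 >= 0 & tidy$exam_2 <= 100, na.rm = TRUE))
print(tidy)
```

Steps that cannot be coded, such as deciding who counts as "low participation", are the ones worth documenting in a README.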

  36. 📖 recommended reading
    Mark Ziemann, Yotam Eren, and Assam El-Osta. "Gene name errors are widespread in the scientific literature." Genome biology 17.1 (2016): 177. doi.org/10.1186/s13059-016-1044-7.

  37. 📖 recommended reading
    Karl Broman and Kara Woo. "Data organization in spreadsheets." The American Statistician 72.1 (2018): 2-10. doi.org/10.1080/00031305.2017.1375989.

  38. 4 comment your code

  39. 🤷

  40. 5 use literate programming
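
The slides that follow this one were images, so the sketch below is only a guess at the kind of document they showed. It writes a minimal R Markdown file in which the reported number is computed by inline code instead of being typed by hand. The file name matches the `manuscript.Rmd` used earlier in the deck; the title and the sentence are made up for illustration.

```r
# A minimal R Markdown document: prose, one code chunk, one inline result.
fence <- strrep("`", 3)  # three backticks, built here to keep the example readable
rmd <- c(
  "---",
  "title: \"Petal length and sepal width\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r model}"),
  "fit <- lm(Petal.Length ~ Sepal.Width, data = iris)",
  fence,
  "",
  "The estimated slope is `r round(coef(fit)[2], 2)`."
)

# Written to a temporary directory so no real manuscript is overwritten.
path <- file.path(tempdir(), "manuscript.Rmd")
writeLines(rmd, path)

# Rendering needs the rmarkdown package and pandoc, so it is left commented out:
# rmarkdown::render(path)
```

When the document is rendered, the inline expression is replaced by the value computed from `fit`, so the text cannot disagree with the analysis.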

  44. 📖 recommended reading
    Yihui Xie, JJ Allaire, and Garrett Grolemund. “R Markdown: The Definitive Guide”, bookdown.org/yihui/rmarkdown.

  45. 📖 recommended reading
    Yihui Xie, Christophe Dervieux, and Emily Riederer. “R Markdown Cookbook”, bookdown.org/yihui/rmarkdown-cookbook.

  46. 6 use version control

  47. changes tracked by
    hosted on

  48. 📖 recommended reading
    Jenny Bryan, et al. “Happy Git with R”, happygitwithr.com.

  49. 7 automate your process

  50. raw-data
    processed-data
    scripts
    |- 00-analyse.R
    |- 01-load-packages.R
    |- 02-load-data.R
    |- 03-clean-data.R
    |- 04-explore.R
    |- 05-model.R
    |- 06-summarise.R
    manuscript
    figures
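
The deck does not show what `00-analyse.R` contains. One plausible reading, sketched here as an assumption, is a driver script that runs the numbered scripts in order. The function `run_all()` is a hypothetical name, and the demo uses empty placeholder scripts in a temporary folder.

```r
# Scripts from the slide, in the order their numeric prefixes imply.
steps <- c("01-load-packages.R", "02-load-data.R", "03-clean-data.R",
           "04-explore.R", "05-model.R", "06-summarise.R")

# Run every step in order, stopping with a clear message if a file is missing.
run_all <- function(dir = "scripts", files = steps) {
  for (f in files) {
    path <- file.path(dir, f)
    if (!file.exists(path)) stop("Missing script: ", path)
    message("Running ", f)
    source(path, local = FALSE)
  }
  invisible(files)
}

# Demo: placeholder scripts in a temporary folder stand in for the real ones.
demo_dir <- file.path(tempdir(), "scripts")
dir.create(demo_dir, showWarnings = FALSE)
for (f in steps) writeLines("invisible(NULL)", file.path(demo_dir, f))
run_all(demo_dir)
```

A driver like this reruns everything every time; the Make and targets references that follow describe tools that rerun only the steps whose inputs changed.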

  51. 📖 recommended reading
    Karl Broman. “Minimal Make”, kbroman.org/minimal_make.

  52. 📖 recommended reading
    Will Landau. “The targets R Package User Manual”, books.ropensci.org/targets.

  53. 8 share computing environment
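
The slides for this principle were images, so the tools they showed are not recoverable from the transcript. As a lightweight starting point, and only as an assumption about what "sharing" could mean at its simplest, base R can record the R version, platform, and package versions next to the analysis. The output file name is hypothetical.

```r
# Record the R version, platform, and attached packages in a text file.
info_file <- file.path(tempdir(), "session-info.txt")  # hypothetical location
writeLines(capture.output(sessionInfo()), info_file)

# Versions of specific packages the analysis depends on
# (here just "stats", which ships with R, as a stand-in).
pkgs <- c("stats")
versions <- vapply(pkgs, function(p) as.character(packageVersion(p)), character(1))
print(versions)
```

A text record like this documents the environment but does not recreate it; container-based approaches such as the Docker workflow in the recommended reading go further.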

  59. 📖 recommended reading
    Carl Boettiger. "An introduction to Docker for reproducible research." ACM SIGOPS Operating Systems Review 49.1 (2015): 71-79.
    Ben Marwick, Carl Boettiger, and Lincoln Mullen. "Packaging data analytical work reproducibly using R (and friends)." The American Statistician 72.1 (2018): 80-88.

  60. 1 organize your project
    2 write READMEs liberally
    3 keep data tidy & machine readable
    4 comment your code
    5 use literate programming
    6 use version control
    7 automate your process
    8 share computing environment

  61. 📖 recommended reading
    Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, Tracy K. Teal. “Good enough practices in scientific computing.” PLoS computational biology 13.6 (2017): e1005510.

  62. painful bits

  63. > coming up with good names
    > stages of data cleaning
    > going back and redoing stuff
    > adding interim steps
    > keeping track of the order of things
    > clutter of unneeded old stuff
    Karl Broman, tools4RR. kbroman.org/Tools4RR

  64. looking into the future
    Photo by Sweet Ice Cream Photography on Unsplash.

  65. “The most important tool is the mindset, when starting, that the end product will be reproducible.”
    – Keith Baggerly

  66. “The second most important tool is training.”
    – Karl Broman

  67. #1: Convince data scientists to adopt a reproducible data analysis workflow
    #2: Train new data scientists who don’t have any other workflow

  68. statistics and data science educators who teach data analysis should be instilling best practices in students before they set out to do research

  71. Workflows for reproducible data science
    mine-cetinkaya-rundel
    [email protected]
    @minebocek
    🔗 bit.ly/repro-ds-21