FISH 6002: Week 3 - Markdown and workflow

Updated 23 Sept 2019

MI Fisheries Science

November 24, 2017

Transcript

  1. Week 3: Markdown and Workflow

    FISH 6000: Science Communication for Fisheries
    Brett Favaro, 2017
    This work is licensed under a Creative Commons Attribution 4.0 International License
  2. Workflow

    Recall:
    • Collect data on temporary media
    • Archive and manage data
    • Write down everything that didn’t make it into the paper
    • Share code and data
    • Transfer to permanent storage

    Goal: Each step should be frictionless
  3. Arranging a project folder

    FISH6002_Week3
    |
    |- FISH6002_Week3.Rproj    # The project file
    |
    |- Description.txt         # You're reading it.
    |
    |- data/
    |  +- DiceData.csv         # Dataset from above paper
    |
    |- analysis/
    |  +- 001_FISH6002Week2Code_Part1.R     # Code from Week 2 lecture, first half
    |  +- 002_FISH6002Week2Code_Part2.R     # Code from Week 2 lecture, second half
    |  +- Minimal_Reproducible_Example.R    # A Reprex example
    |
    |- Resources/
    |  +- Fitzjohn (2013) - Directory structures in R.pdf            # A short blog post explaining project folder layout
    |  +- Xie et al (2018) - The Definitive Guide to Markdown.pdf    # A full-length book about Markdown
  4. FISH6002_Week3
    |
    |- FISH6002_Week3.Rproj    # The project file

    The project file contains everything needed to run the project
  5. Where and how should I store meta-data?

    • There are many options. Pick the one most appropriate for your project
    • Principles:
      • Be clear and consistent, at least within a project
      • Store in plain text
      • Have a plan before beginning, which all team members agree on
    • Three acceptable options:
      1. Store in description.txt
      2. Store directly in R code
      3. Create an additional text file in the data folder and store information there
  6. Option 3: Create an additional text file in the data folder and store information there

    FISH6002_Week3
    |
    |- data/                       # raw data, not changed once created
    |  +- DiceData.csv             # List each file
    |  +- DiceData_Metadata.txt    # A text file showing another way to store metadata
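As a rough illustration of option 3, the sketch below writes a plain-text metadata file beside a data file using base R only. The field names and placeholder descriptions are assumptions for illustration, not the contents of the real DiceData_Metadata.txt, and a temporary folder stands in for the project root.

```r
# Hypothetical sketch: store metadata as plain text next to the data file.
# tempdir() stands in for the real project root so nothing real is touched.
project_root <- file.path(tempdir(), "FISH6002_Week3")
data_dir <- file.path(project_root, "data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Field names and placeholder text are illustrative assumptions.
metadata <- c(
  "File: DiceData.csv",
  "Description: <one-line description of the dataset>",
  "Columns:",
  "  <column name>: <meaning and units>",
  paste("Metadata written:", format(Sys.Date()))
)

metadata_path <- file.path(data_dir, "DiceData_Metadata.txt")
writeLines(metadata, metadata_path)

# Read it back: the file is ordinary text that any editor can open.
readLines(metadata_path)
```

Because the file is plain text, it stays readable without R and can be tracked alongside the data it describes.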
  7. Fancy: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0188966

    Agencies are increasingly called upon to implement their natural resource management programs within an adaptive management (AM) framework. This article provides the background and motivation for the R package, AMModels. ... The overall goal of AMModels is simple: To codify knowledge in the form of models and to store it, along with models generated from numerous analyses and datasets that may come our way, so that it can be used or recalled in the future. AMModels facilitates this process by storing all models and datasets in a single object that can be saved to an .RData file and routinely augmented to track changes in knowledge through time. … AMModels is the only package dedicated not to the mechanics of analysis but to organizing analysis inputs, analysis outputs, and preserving descriptive metadata. We anticipate that this package will assist users hoping to preserve the key elements of an analysis so they may be more confidently revisited at a later date.
  8. FISH6002_Week3
    |
    |- analysis/    # Code pertaining to analysis
    |  +- 001_FISH6002Week2Code_Part1.R     # Code from Week 2 lecture, first half
    |  +- 002_FISH6002Week2Code_Part2.R     # Code from Week 2 lecture, second half
    |  +- Minimal_Reproducible_Example.R    # A Reprex example

    Number your scripts so you remember the order in which to run them, e.g.
    001_DataSetup.R
    002_DataExploration.R
    003_PlotsForPublication.R
    004_StatisticalAnalysis.R
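One payoff of zero-padded numeric prefixes is that alphabetical order matches run order, so the scripts can be run in sequence programmatically. The sketch below is one possible way to do that in base R; the temporary folder and the two throwaway scripts are hypothetical stand-ins for a real analysis/ folder.

```r
# Sketch: run numbered scripts in order.
# A temporary folder with two throwaway scripts stands in for analysis/.
analysis_dir <- file.path(tempdir(), "analysis_demo")
dir.create(analysis_dir, showWarnings = FALSE)
writeLines("x <- 1:10", file.path(analysis_dir, "001_DataSetup.R"))
writeLines("x_mean <- mean(x)", file.path(analysis_dir, "002_DataExploration.R"))

# Keep only files that start with a three-digit prefix, sorted by name.
scripts <- sort(list.files(analysis_dir,
                           pattern = "^[0-9]{3}_.*\\.R$",
                           full.names = TRUE))

# Source every script into one shared environment so that later scripts
# can use objects created by earlier ones.
run_env <- new.env()
for (script in scripts) {
  source(script, local = run_env)
}

run_env$x_mean
```

Scripts without a numeric prefix, such as Minimal_Reproducible_Example.R, are skipped by the pattern, which keeps one-off examples out of the main pipeline.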
  9. Why are we doing all this?

    • Data are read-only
    • Output is disposable
    • Relative folder structure improves reproducibility
    • Layout is sensible
  10. From this week onward, I will include one R Project .zip file per week on the course website
  11. Minimal Reproducible Example

    • “How do I…”
    • “My code isn’t running as expected”
    • “Why isn’t this working?”

    Solution: Reprex
  12. From: https://stackoverflow.com/help/minimal-reproducible-example

    Minimal: Use as little code as possible that still produces the same problem. Create a new program, adding in only what is needed to see the problem. Use simple, descriptive names for functions and variables – don’t copy the names you’re using in your existing code.

    Complete: Provide all parts someone else needs to reproduce your problem in the question itself. When asking for help – send *actual code, formatted as code*. Not pictures.

    Reproducible: Give a complete explanation of all steps to reproduce the bug AS WELL AS a description of what the problem is.
  13. Example: How do I ask for help?

    # Minimal reproducible example
    # Here is a very simple script that has a bug:
    require(tidyr)
    cod <- c(1,1,3,1,2,5,2,3,2,1)
    salmon <- c(2,3,5,1,3,5,2,3,1,4)
    fishcatch <- data.frame(cod, salmon)
    fishlong <- gather(fishcatch, species, number, cod:salmon)
    plot(fishlong$number ~ fishlong$species)

    Result:

    Error in plot.window(...) : need finite 'xlim' values
    In addition: Warning messages:
    1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
    2: In min(x) : no non-missing arguments to min; returning Inf
    3: In max(x) : no non-missing arguments to max; returning -Inf
  14. The error could be due to anything in the code. Before asking for help, we must minimize: try to narrow it down.

    Run each line one at a time.

    require(tidyr)
    cod <- c(1,1,3,1,2,5,2,3,2,1)
    salmon <- c(2,3,5,1,3,5,2,3,1,4)
    fishcatch <- data.frame(cod, salmon)
    fishlong <- gather(fishcatch, species, number, cod:salmon)    # Each step up to here appears to work.
    plot(fishlong$number ~ fishlong$species)                      # Problem is here.

    But we want to share the same dataset
  15. Data preparation is not the problem – but we need it to be complete

    require(tidyr)
    cod <- c(1,1,3,1,2,5,2,3,2,1)
    salmon <- c(2,3,5,1,3,5,2,3,1,4)
    fishcatch <- data.frame(cod, salmon)
    fishlong <- gather(fishcatch, species, number, cod:salmon)
    plot(fishlong$number ~ fishlong$species)

    How can we share this without the bits that we know are not bugged?
    Solution: Use dput()
  16. dput(fishlong)

    structure(list(species = c("cod", "cod", "cod", "cod", "cod", "cod",
    "cod", "cod", "cod", "cod", "salmon", "salmon", "salmon", "salmon", "salmon",
    "salmon", "salmon", "salmon", "salmon", "salmon"), number = c(1, 1, 3, 1, 2,
    5, 2, 3, 2, 1, 2, 3, 5, 1, 3, 5, 2, 3, 1, 4)), row.names = c(NA, -20L),
    .Names = c("species", "number"), class = "data.frame")

    dput(head(fishlong))

    structure(list(species = c("cod", "cod", "cod", "cod", "cod", "cod"),
    number = c(1, 1, 3, 1, 2, 5)), .Names = c("species", "number"),
    row.names = c(NA, 6L), class = "data.frame")

    df <- dput(fishlong)
    df
  17. Code is now minimal and complete. You can copy and paste this right into R.

    df <- structure(list(species = c("cod", "cod", "cod", "cod", "cod", "cod",
    "cod", "cod", "cod", "cod", "salmon", "salmon", "salmon", "salmon", "salmon",
    "salmon", "salmon", "salmon", "salmon", "salmon"), number = c(1, 1, 3, 1, 2,
    5, 2, 3, 2, 1, 2, 3, 5, 1, 3, 5, 2, 3, 1, 4)), row.names = c(NA, -20L),
    .Names = c("species", "number"), class = "data.frame")
    plot(df$number ~ df$species)

    Now: You need to state clearly what you are trying to do and what happened.
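Before posting dput() output, it can be worth confirming that the text really does rebuild the same object. The sketch below is a minimal check under one assumption: the long-format data frame is built with base R (rep() and c()) rather than tidyr::gather(), so it runs without extra packages. The object names dput_text and rebuilt are hypothetical.

```r
# Base-R stand-in for the long-format data (no tidyr needed).
cod <- c(1, 1, 3, 1, 2, 5, 2, 3, 2, 1)
salmon <- c(2, 3, 5, 1, 3, 5, 2, 3, 1, 4)
fishlong <- data.frame(
  species = rep(c("cod", "salmon"), each = 10),
  number = c(cod, salmon),
  stringsAsFactors = FALSE
)

# capture.output() collects the text that dput() would print to the console.
dput_text <- capture.output(dput(fishlong))

# Parsing and evaluating that text mimics a reader pasting it into R.
rebuilt <- eval(parse(text = dput_text))

# TRUE means the pasted text recreates an equivalent data frame.
isTRUE(all.equal(fishlong, rebuilt))
```

If the data are large, wrapping the object in head() before dput(), as on the previous slide, keeps the posted text short while preserving column types.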
  18. I have a dataset with the number of individual cod and salmon caught as part of a study. I want to make a boxplot showing the median number of cod and salmon caught across replicates of the study. This is what the sample looks like:

       species number
    1      cod      1
    2      cod      1
    3      cod      3
    4      cod      1
    5      cod      2
    6      cod      5
    7      cod      2
    8      cod      3
    9      cod      2
    10     cod      1
    11  salmon      2
    12  salmon      3
    13  salmon      5
    14  salmon      1
    15  salmon      3
    16  salmon      5
    17  salmon      2
    18  salmon      3
    19  salmon      1
    20  salmon      4

    The graph should look something like this:
    [Sketch of a boxplot: Cod and Salmon on the x-axis, number on the y-axis]
  19. Here is my code:

    df <- structure(list(species = c("cod", "cod", "cod", "cod", "cod", "cod",
    "cod", "cod", "cod", "cod", "salmon", "salmon", "salmon", "salmon", "salmon",
    "salmon", "salmon", "salmon", "salmon", "salmon"), number = c(1, 1, 3, 1, 2,
    5, 2, 3, 2, 1, 2, 3, 5, 1, 3, 5, 2, 3, 1, 4)), row.names = c(NA, -20L),
    .Names = c("species", "number"), class = "data.frame")
    plot(df$number ~ df$species)

    This outputs an error message:

    Error in plot.window(...) : need finite 'xlim' values
    In addition: Warning messages:
    1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
    2: In min(x) : no non-missing arguments to min; returning Inf
    3: In max(x) : no non-missing arguments to max; returning -Inf

    After extensive troubleshooting I can’t figure out what the problem is. Thanks in advance for assistance.
  20. This allows people to assist you:

    sapply(df, class)
        species      number
    "character"   "numeric"

    In their response to you: “The problem was df$species should have been a factor, not a character. See solution below:”

    df <- structure(list(species = c("cod", "cod", "cod", "cod", "cod", "cod",
    "cod", "cod", "cod", "cod", "salmon", "salmon", "salmon", "salmon", "salmon",
    "salmon", "salmon", "salmon", "salmon", "salmon"), number = c(1, 1, 3, 1, 2,
    5, 2, 3, 2, 1, 2, 3, 5, 1, 3, 5, 2, 3, 1, 4)), row.names = c(NA, -20L),
    .Names = c("species", "number"), class = "data.frame")
    df$species <- as.factor(df$species)
    plot(df$number ~ df$species)
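An asker applying this answer might want to confirm that the conversion did what was intended before re-plotting. The following is a small sketch of such a check; it rebuilds the same 20 rows with base R as a stand-in for the structure() call and uses tapply() to compute the per-species medians that the boxplot's centre lines should show. The object names before, after and medians are hypothetical.

```r
# Same 20 rows as the posted data, built with base R for brevity.
df <- data.frame(
  species = rep(c("cod", "salmon"), each = 10),
  number = c(1, 1, 3, 1, 2, 5, 2, 3, 2, 1,
             2, 3, 5, 1, 3, 5, 2, 3, 1, 4),
  stringsAsFactors = FALSE
)

# Column classes before the fix: species is character.
before <- sapply(df, class)

# Apply the suggested fix.
df$species <- as.factor(df$species)

# Column classes after the fix: species is now a factor with two levels.
after <- sapply(df, class)
levels(df$species)

# Median catch per species, for comparison against the plotted boxes.
medians <- tapply(df$number, df$species, median)
medians
```

Comparing the printed medians with the finished plot is a quick way to catch a grouping variable that was converted incorrectly.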
  21. Activity 1: Structure your Major Assignment Part 1

    • One of the deliverables of your Major Assignment Part 1 is to lay out an R Project Folder
    • You already know most of what you need to do this
    • Activity:
      • Create an R Project for Major Assignment Part 1, with subfolders for:
        • data, analysis, figures, tables
      • Create a description.txt file and draw out its folder structure
      • Try to do this yourself based on the lecture notes. If need be, I’ve uploaded a sample project to help.
    • Time = 20-30 min
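The folders and description file for this activity can be created by hand, but they can also be scripted. The sketch below is one hedged way to do it in base R: the project name MajorAssignment_Part1 is hypothetical, tempdir() stands in for wherever the project should live, and the .Rproj file itself is assumed to be created separately through RStudio.

```r
# Sketch: build a project skeleton with the four subfolders from the activity.
# "MajorAssignment_Part1" is a hypothetical name; tempdir() is a stand-in root.
project_root <- file.path(tempdir(), "MajorAssignment_Part1")
subfolders <- c("data", "analysis", "figures", "tables")

for (folder in subfolders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Start description.txt with a drawing of the folder structure.
description <- c(
  "MajorAssignment_Part1",
  "|",
  "|- description.txt    # This file",
  paste0("|- ", subfolders, "/")
)
writeLines(description, file.path(project_root, "description.txt"))

# List what was created.
list.files(project_root)
```

Comments describing each folder and file can then be added to description.txt by hand as the project fills up.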
  22. Markdown advantages

    1. Publish in multiple formats
    2. Embedded equations and math notation
    3. Embed code in text
      • No need to re-make figures from scratch when data changes.
      • You still need to export publication-quality figures, though
    4. Easily publish online
  23. What can you do in Markdown?

    See the cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf
    Code for Greek letters for math notation: https://csrgxtu.github.io/2015/03/20/Writing-Mathematic-Fomulars-in-Markdown/
  24. Save your .bib file and any .csl files you want to use right in the /doc subfolder of your R project
  25. Activity 2: Minimal Reproducible Example

    • Task: You have a broken piece of code.
    • You can’t figure out the problem!
    • You want to ask the class for help
    • Download 6002-MinimalExample1.R from the course website

    http://derekogle.com/fishR/
  26. # Load the data into memory
    library(fishkirkko2015)
    library(dplyr)

    # data() enables use of data from pre-loaded packages
    data(fishkirkkojarvi2015)
    data(fishnames)

    # data are:
    # fishname    Fish name in Finnish
    # fishID      Fish unique identifier for this dataset
    # sl          Standard Length in mm
    # fl          Fork Length in mm
    # tl          Total Length in mm
    # wt          Weight in g

    https://www.fishbase.se/Images/Glospic/G_Fig13a6181_TL.jpg
  27. In the Minimal Example R script, I’ve embedded five pieces of code with bugs

    Pick one, and work in pairs to submit a minimal reproducible example on Teams to ask for help! Everyone should ask a question, and everyone should solve a question.

    Askers
    • State your goal
    • Use a code snippet to show the *minimal* amount of code to reproduce the issue (without having to download a package)
    • Sketch the intended behaviour
    • Show the error text, again as a code snippet

    Solvers
    • State in plain text what the problem is
    • Input a code snippet showing the solution, and paste in any output

    Askers
    • Thank the solver. Apply the solution to your full dataset and show results

    See my sample. Formatting: three backticks (```) = code snippet; one backtick (`) = single word as code
  28. Concept: Version Control

    As you move through your work, be orderly about how you name and organize old files
  29. https://amstat.tandfonline.com/doi/abs/10.1080/00031305.2017.1399928?journalCode=utas20#.W6lrNGhKhPY

    Data analysis, statistical research, and teaching statistics have at least one thing in common: these activities all produce many files! There are data files, source code, figures, tables, prepared reports, and much more. Most of these files evolve over the course of a project and often need to be shared with others, for reading or edits, as a project unfolds. Without explicit and structured management, project organization can easily descend into chaos, taking time away from the primary work and reducing the quality of the final product. … This article describes the use of the version control system Git and the hosting site GitHub for statistical and data scientific workflows. Special attention is given to projects that use the statistical language R and, optionally, R Markdown documents. Supplementary materials include an annotated set of links to step-by-step tutorials, real world examples, and other useful learning resources. Supplementary materials for this article are available online.