NYC R Conference 2018: Doing Data Science

Daniel Chen
April 20, 2018

A set of good practices while doing data science.
Let's do better science so we can make the world a better place.

Transcript

  1. Hi!

  2. What is it? Tools Used

    Journey to Open Data Science, Anaconda (Continuum Analytics)
  3. What is it? Tasks Performed

    Journey to Open Data Science, Anaconda (Continuum Analytics)
  4. Script it!

    · Reproducible-Science-Curriculum/rr-init
    · chendaniely/computational-project-cookie-cutter

        cd "$1"
        mkdir doc data src bin results
        cd doc
        echo "Doc directory with one subdirectory per manuscript" > README
        touch .gitkeep
        cd ../data
        echo "Data directory for storing fixed data sets" > README
        touch .gitkeep
        cd ../src
        echo "src for source code" > README
        touch .gitkeep
        cd ../bin
        echo "bin for compiled binaries or scripts" > README
        touch .gitkeep
        cd ../results
        echo "Results directory for tracking computational experiments performed on data" > README
        touch .gitkeep
        echo "Folders created."
        cd ..
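The same skeleton can also be created from R. The following is a minimal sketch, not part of the original deck: the function name `create_project_skeleton` and the demo directory under `tempdir()` are hypothetical, and only base R is assumed.

```r
# Hypothetical helper mirroring the shell script above: one README and one
# .gitkeep per top-level project directory.
create_project_skeleton <- function(root) {
  readmes <- c(
    doc     = "Doc directory with one subdirectory per manuscript",
    data    = "Data directory for storing fixed data sets",
    src     = "src for source code",
    bin     = "bin for compiled binaries or scripts",
    results = "Results directory for tracking computational experiments performed on data"
  )
  for (name in names(readmes)) {
    dir_path <- file.path(root, name)
    dir.create(dir_path, recursive = TRUE, showWarnings = FALSE)
    writeLines(readmes[[name]], file.path(dir_path, "README"))
    file.create(file.path(dir_path, ".gitkeep"))
  }
  invisible(file.path(root, names(readmes)))
}

# Demo in a throwaway location (an assumption for illustration only).
demo_root <- file.path(tempdir(), "demo-project")
created <- create_project_skeleton(demo_root)
print(basename(created))
```

Keeping the directory names and README text in one named vector means a new folder is a one-line change.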
  5. Best Practices / Good Enough: tl;dr

    Best Practices
    1. Write programs for people, not computers
    2. Let the computer do the work
    3. Make incremental changes
    4. Don’t repeat yourself (or others)
    5. Plan for mistakes
    6. Optimize software only after it works correctly
    7. Document design and purpose, not mechanics
    8. Collaborate

    Good Enough
    1. Data management
    2. Software
    3. Collaboration
    4. Project Organization
    5. Keeping track of changes
    6. Manuscripts
  6. Summer Program

    + ~20 people into the lab … to stress test!
    https://www.bi.vt.edu/sdal/projects/data-science-for-the-public-good-program
  7. Blog post: From VMs to LXC Containers to Docker Containers

    · fully document the setup process
    · tear down and spin up the container if something goes wrong
    · system libraries for R packages
    · easy to try out a new technology before full integration
  8. Installing R Packages

    · Separate Docker Container for installing R packages (rpkgs)
    · This installs R packages into a persistent docker volume
    · Everyone has /rpkgs mounted in their rstudio container
    · Add /rpkgs to everyone’s .libPaths()

        site_path = R.home(component = "home")
        fname = file.path(site_path, "etc", "Rprofile.site")
        # prepend to .libPaths
        write(".libPaths(c('/rpkgs', .libPaths()))", file = fname, append = TRUE)
        write('local({r <- getOption("repos"); r["CRAN"] <- "https://cloud.r-project.org/"; options(repos=r)})', file = fname, append = TRUE)
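To see what the line appended to Rprofile.site does, here is a small sketch of prepending a directory to `.libPaths()` in a running session. It uses a temporary directory as a stand-in for `/rpkgs` (an assumption for illustration) and restores the original search path afterwards.

```r
# Stand-in for the shared /rpkgs volume; the directory name is hypothetical.
shared_lib <- file.path(tempdir(), "rpkgs-demo")
dir.create(shared_lib, showWarnings = FALSE)

old_paths <- .libPaths()

# Same call the Rprofile.site line makes, with the stand-in directory.
.libPaths(c(shared_lib, old_paths))

# The first entry is where install.packages() installs by default.
first_lib <- .libPaths()[1]
print(first_lib)

# Restore the session's original library search path.
.libPaths(old_paths)
```

Note that `.libPaths()` silently drops directories that do not exist, so the volume needs to be mounted before R starts for the entry to take effect.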
  9. Installing R Packages (Development Server)

    · In rpkgs: install.packages()

    Unless you need system libraries (e.g., CentOS)
    https://github.com/bi-sdal/infrastructure_submodules#installing-the-system-library-for-all-docker-images

    1. Add the installation in the mro Dockerfile, e.g.,
        yum install -y jq-devel && \
    2. Build the base mro, rpkgs, rstudio, and shiny images
    3. Push the images to dockerhub, docker push

        docker push sdal/mro-c7sd_auth
        docker push sdal/rss-mro-c7sd_auth
        docker push sdal/rpkgs-mro-c7sd_auth
        docker push sdal/shy-mro-c7sd_auth
  10. Installing R Packages (Production Server)

    1. Pull the images from dockerhub, docker pull

        docker pull sdal/mro-c7sd_auth
        docker pull sdal/rss-mro-c7sd_auth
        docker pull sdal/rpkgs-mro-c7sd_auth
        docker pull sdal/shy-mro-c7sd_auth

    2. Start up the RStudio containers

        docker-compose -f rstudio-compose.yml up -d --no-recreate
  11. RStudio Server

    · Open Source Edition great for individual use
    · Exploring RStudio Pro… much better suited for parallel projects, groups, and teams (?)
    · Pretty much building the Pro stack…
      - nginx
        - web server: reverse proxy, load balancer, and HTTP cache
  12. RStudio Server for Everyone

    docker-compose.yml/rstudio-compose.yml:

        volumes:
          rpkgs:
        services:
          rstudio_chend:
            image: sdal/rss-mro-c7sd_auth
            container_name: rstudio_chend
            volumes:
              - /sys/fs/cgroup:/sys/fs/cgroup:ro
              - /etc/group:/etc/group
              - /home:/home
              - rpkgs:/rpkgs
              - checkpoint:/checkpoint
            cap_add:
              - SYS_ADMIN
            ports:
              - 3125:8787
  13. Authorship

    Virginia Tech Libraries provides free Overleaf Pro+ accounts

    · Collaborative LaTeX with real-time rendering… with Git!
    · Try it with Docker!
    · docker-compose.yml
      - ShareLaTeX
      - MongoDB
      - Redis
  14. Greg Wilson

    Head of Instructor Training @ DataCamp
    Co-founder of Software Carpentry

    Git is hard & Shell (Bash) is the glue

    Free DataCamp Courses:
    · Introduction to Shell for Data Science
    · Introduction to Git for Data Science

    Software-Carpentry Lessons:
    · Shell
    · Git/Mercurial
    · SQL
    · Python
    · R
    · MATLAB
    · Make
  15. Balancing best/good practices… with getting work done

    Titus Brown, Associate Professor at UC Davis

    Write Functions! in your projects
  16. No setwd()!

    · Use rprojects!
      - Open them: rstudioapi::openProject(...)
    · here package, “A simple interface to rprojroot”: https://github.com/r-lib/here
      - rprojroot: https://github.com/r-lib/rprojroot
    · No more if (interactive()){...}

        print(R.utils::sourceDirectory(here::here('shiny', 'functions')))
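The idea behind project-relative paths can be sketched in base R: walk up from a starting directory until a marker file (here, a `.Rproj` file) is found, then build paths from that root. This is a simplified illustration only, not how `here` or `rprojroot` are implemented; the function names `find_project_root` and `project_path` and the demo directory are hypothetical.

```r
# Walk upward from `start` until a directory containing a marker file is found.
find_project_root <- function(start = getwd(), marker = "\\.Rproj$") {
  path <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    if (length(list.files(path, pattern = marker)) > 0) {
      return(path)
    }
    parent <- dirname(path)
    if (identical(parent, path)) {
      # Reached the filesystem root without finding a marker.
      stop("No project root found above: ", start)
    }
    path <- parent
  }
}

# Build a path relative to the project root, regardless of the working directory.
project_path <- function(..., start = getwd()) {
  file.path(find_project_root(start), ...)
}

# Demo project in a temporary directory (an assumption for illustration).
demo_root <- file.path(tempdir(), "demo-rproj")
nested <- file.path(demo_root, "shiny", "functions")
dir.create(nested, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, "demo.Rproj"))

found <- find_project_root(start = nested)
helpers <- project_path("shiny", "functions", start = nested)
print(found)
print(helpers)
```

Because the root is discovered rather than hardcoded, the same script works from any subdirectory and on any machine, which is the property that makes `setwd()` unnecessary.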
  17. Saving things

    · Save out long calculations into intermediate datasets
    · Use base::saveRDS() and base::readRDS()
      - vs base::save() and base::load()

        v <- 1:10 # I want to save this...

        save(v, file = 'awesome_datascience.RData')
        rm(v)
        load(file = 'awesome_datascience.RData')
        v
        ##  [1]  1  2  3  4  5  6  7  8  9 10

        saveRDS(v, file = 'super_awesome_datascience.RDS')
        loaded <- readRDS(file = 'super_awesome_datascience.RDS')
        loaded
        ##  [1]  1  2  3  4  5  6  7  8  9 10
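One practical difference between the two pairs: `load()` restores objects under their saved names and will overwrite anything in the workspace with the same name, while `readRDS()` returns a value that you assign yourself. A minimal sketch using temporary files (the file locations and variable names are chosen for illustration):

```r
rdata_path <- tempfile(fileext = ".RData")
rds_path <- tempfile(fileext = ".rds")

v <- 1:10
save(v, file = rdata_path)   # stores the object together with its name "v"
saveRDS(v, file = rds_path)  # stores only the value

# Later in the session, `v` is reused for something else.
v <- "something new"

# readRDS() returns the value; the current `v` is left alone.
restored <- readRDS(rds_path)
v_after_readRDS <- v

# load() puts the saved `v` back, replacing the current one without warning.
load(rdata_path)
v_after_load <- v

print(restored)
print(v_after_readRDS)
print(v_after_load)
```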
  18. Secrets

    · Hardcoding secrets (e.g., passwords, API keys) in your code
    · Reverting lines before a commit
    · Sourcing a special ignored file

        # uses console or rstudio to do password prompt
        getPass::getPass("database username")
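One way to keep secrets out of source files is to look them up at run time and prompt only as a fallback. The sketch below is an illustration, not the approach from the slides: it reads an environment variable first (a technique the deck does not mention) and falls back to `getPass::getPass()` only in an interactive session where that package is installed. The function name `read_secret` and the variable name `DEMO_DB_PASSWORD` are hypothetical.

```r
# Hypothetical helper: environment variable first, interactive prompt second.
read_secret <- function(env_var, prompt = paste0(env_var, ": ")) {
  value <- Sys.getenv(env_var, unset = NA_character_)
  if (!is.na(value) && nzchar(value)) {
    return(value)
  }
  if (interactive() && requireNamespace("getPass", quietly = TRUE)) {
    # Masked prompt in the console or RStudio, as on the slide.
    return(getPass::getPass(msg = prompt))
  }
  stop("Secret '", env_var, "' is not set and no interactive prompt is available.")
}

# Demo only: a real secret would be set outside the script (e.g., in the shell).
Sys.setenv(DEMO_DB_PASSWORD = "not-a-real-password")
db_password <- read_secret("DEMO_DB_PASSWORD")
Sys.unsetenv("DEMO_DB_PASSWORD")
```

The script itself never contains the secret, so there is nothing to revert before a commit.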
  19. Secret Library

        # add_user(), add_secret(), get_secret(), and local_key() come from the
        # secret package; getPass() comes from the getPass package.
        .secret_to_keep <- function(user, pass) {
          if (is.null(pass)) {
            pass <- getPass("LDAP Password (the one you use to login to Lightfoot and RStudio):")
          }
          secret_to_keep <- c(password = pass, username = user)
          return(secret_to_keep)
        }

        setup_user_pass <- function(username = unname(Sys.info()['user']),
                                    password = NULL,
                                    public_key = '~/.ssh/id_rsa.pub',
                                    vault = '/home/sdal/projects/sdal/vault',
                                    secret_name = unname(Sys.info()['user']),
                                    verbose = FALSE) {
          add_user(username, public_key, vault)
          secret_to_keep <- .secret_to_keep(user = username, pass = password)
          add_secret(secret_name, secret_to_keep, users = username, vault = vault)
        }

        get_my_password <- function(secret_name = unname(Sys.info()['user']),
                                    key = local_key(),
                                    vault = '/home/sdal/projects/sdal/vault') {
          return(unname(get_secret(secret_name, key, vault)['password']))
        }
  20. Package!

    I learned from last year, we should all probably just create packages for ourselves/group/lab/company

    sdalr, https://github.com/bi-sdal/sdalr/blob/master/R/user_pass.R
  21. RMarkdown

    · This presentation is written in it!
    · I’ve given Meetup Talks about it: NYC, DC
    · Websites, Books, Presentations, Dashboards, Reports…
      - https://rmarkdown.rstudio.com/
  22. Graphics Cookbook

    · How to do something
    · Gallery of plots
    · Similar to: http://www.cookbook-r.com/Graphs/
  23. A Better Default Colormap for Matplotlib

    SciPy 2015 | Nathaniel Smith and Stéfan van der Walt

    · https://www.youtube.com/watch?v=xAoljeRJ3lU
    · “Perceptually uniform”, sequential, works well in black-and-white, Colorblind friendly
    · Matlab: parula
  24. Viridis

    viscm, a tool to see how “good” your colormap is: http://bids.github.io/colormap/

    · R Package!
      - https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html
      - by: Simon Garnier, Noam Ross, Bob Rudis, Marco Sciaini, Cédric Scherer
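As a rough check of the "works well in black-and-white" property, the sketch below takes a viridis-like palette and confirms that an approximate brightness value rises steadily from the first color to the last. It assumes base R's `hcl.colors(..., palette = "viridis")` (available since R 3.6), which approximates the palette rather than reproducing the viridis package exactly, and it uses the Rec. 601 luma weights as a simple stand-in for perceived lightness.

```r
# Five colors from base R's approximation of the viridis palette.
colors_hex <- hcl.colors(5, palette = "viridis")

# Convert to 0-255 RGB channels; rows are named "red", "green", "blue".
rgb_values <- col2rgb(colors_hex)

# Approximate grayscale brightness (Rec. 601 luma weights).
luma <- 0.299 * rgb_values["red", ] +
  0.587 * rgb_values["green", ] +
  0.114 * rgb_values["blue", ]

# A sequential colormap that survives grayscale printing should be monotonic.
is_monotonic <- all(diff(luma) > 0)

print(colors_hex)
print(round(luma))
print(is_monotonic)
```

With the viridis package installed, `viridis::viridis(5)` could be substituted for the `hcl.colors()` call.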
  25. Perceptual Color Maps in matplotlib for Oceanography

    SciPy 2015 | Kristen Thyng

    · Domain specific color maps (Oceanography): https://www.youtube.com/watch?v=XjHzLUnHeM0
  26. Other things

    Flight Rules are the hard-earned body of knowledge recorded in manuals that list, step-by-step, what to do if X occurs, and why. Essentially, they are extremely detailed, scenario-specific standard operating procedures. […] NASA has been capturing our missteps, disasters and solutions since the early 1960s, when Mercury-era ground teams first started gathering “lessons learned” into a compendium that now lists thousands of problematic situations […] and their solutions.
    -- Chris Hadfield, An Astronaut’s Guide to Life

    · Cookbooks for your own use cases (e.g., GIS)
    · Make your own “flight rules”
  27. Why?

    Our planet needs our help, and we need (good) science to fix it.
    -- Greg Wilson