Slide 1

Slide 1 text

Doing Data Science
NYC R Conference 2018
Daniel Chen (@chendaniely)
April 20, 2018

Slide 2

Slide 2 text

Hi!

Slide 3

Slide 3 text

I’m Daniel

Slide 4

Slide 4 text

Thanks

Slide 5

Slide 5 text

Community
· Y’all
· #rstatsnyc
· DataCommunityDC
· PyData
· SciPy
· The Carpentries
  - Software-Carpentry
  - Data Carpentry

Slide 6

Slide 6 text

I’m an Author! :O
Pandas For Everyone

Slide 7

Slide 7 text

Doing Data Science

Slide 8

Slide 8 text

R! and Python!

Slide 9

Slide 9 text

We’re all friends

Slide 10

Slide 10 text

What is it?
Tools Used
Journey to Open Data Science, Anaconda (Continuum Analytics)

Slide 11

Slide 11 text

What is it?
Tasks Performed
Journey to Open Data Science, Anaconda (Continuum Analytics)

Slide 12

Slide 12 text

Last year…

Slide 13

Slide 13 text

Structure your projects!

Slide 14

Slide 14 text

(Computational Biology) Project Structure

Slide 15

Slide 15 text

Script it!
· Reproducible-Science-Curriculum/rr-init
· chendaniely/computational-project-cookie-cutter

    cd "$1"
    mkdir doc data src bin results
    cd doc
    echo "Doc directory with one subdirectory per manuscript" > README
    touch .gitkeep
    cd ../data
    echo "Data directory for storing fixed data sets" > README
    touch .gitkeep
    cd ../src
    echo "src for source code" > README
    touch .gitkeep
    cd ../bin
    echo "bin for compiled binaries or scripts" > README
    touch .gitkeep
    cd ../results
    echo "Results directory for tracking computational experiments performed on data" > README
    touch .gitkeep
    echo "Folders created."
    cd ..
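Since the talk treats R and Python as friends, the same scaffold can be sketched in Python. This is a minimal sketch, not the rr-init script itself: the function name `init_project` is hypothetical, and unlike the shell version it creates the root folder if it does not exist yet.

```python
from pathlib import Path

# README text per directory, copied from the shell script on this slide.
READMES = {
    "doc": "Doc directory with one subdirectory per manuscript",
    "data": "Data directory for storing fixed data sets",
    "src": "src for source code",
    "bin": "bin for compiled binaries or scripts",
    "results": "Results directory for tracking computational experiments performed on data",
}


def init_project(root):
    """Create the project skeleton under `root` and return the created folders.

    `init_project` is a hypothetical name used for illustration only.
    """
    root = Path(root)
    created = []
    for name, readme in READMES.items():
        folder = root / name
        # Assumption: create missing parents, which the shell script does not do.
        folder.mkdir(parents=True, exist_ok=True)
        (folder / "README").write_text(readme + "\n")
        (folder / ".gitkeep").touch()  # lets git track an otherwise empty folder
        created.append(folder)
    return created
```

Calling `init_project("my-analysis")` would give the five folders, each with a README and a .gitkeep, and running it a second time is harmless because existing folders are left in place.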

Slide 16

Slide 16 text

Best Practices … (2014)

Slide 17

Slide 17 text

Good enough practices … (2017)

Slide 18

Slide 18 text

tl;dr

Best Practices
1. Write programs for people, not computers
2. Let the computer do the work
3. Make incremental changes
4. Don’t repeat yourself (or others)
5. Plan for mistakes
6. Optimize software only after it works correctly
7. Document design and purpose, not mechanics
8. Collaborate

Good Enough
1. Data management
2. Software
3. Collaboration
4. Project Organization
5. Keeping track of changes
6. Manuscripts

Slide 19

Slide 19 text

Summer Program
+ ~20 people into the lab … to stress test!
https://www.bi.vt.edu/sdal/projects/data-science-for-the-public-good-program

Slide 20

Slide 20 text

Blog post: From VMs to LXC Containers to Docker Containers
· fully document the setup process
· tear down and spin up the container if something goes wrong
· system libraries for R packages
· easy to try out a new technology before full integration

Slide 21

Slide 21 text

Infrastructure
https://github.com/bi-sdal/dockerimages
https://github.com/bi-sdal/infrastructure_submodules

Slide 22

Slide 22 text

Project Template

Slide 23

Slide 23 text

Installing R Packages
· Separate Docker Container for installing R packages (rpkgs)
· This installs R packages into a persistent docker volume
· Everyone has /rpkgs mounted in their rstudio container
· Add /rpkgs to everyone’s .libPaths()

    site_path = R.home(component = "home")
    fname = file.path(site_path, "etc", "Rprofile.site")
    write(".libPaths(c('/rpkgs', .libPaths()))", file = fname, append = TRUE) # prepend to .libPaths
    write('local({r <- getOption("repos"); r["CRAN"] <- "https://cloud.r-project.org/"; options(repos=r)})', file = fname, append = TRUE)

Slide 24

Slide 24 text

Installing R Packages (Development Server)
· In rpkgs: install.packages()
Unless you need system libraries (e.g., CentOS): https://github.com/bi-sdal/infrastructure_submodules#installing-the-system-library-for-all-docker-images
1. Add the installation in the mro Dockerfile, e.g., yum install -y jq-devel && \
2. Build the base mro, rpkgs, rstudio, and shiny images
3. Push the images to dockerhub, docker push

    docker push sdal/mro-c7sd_auth
    docker push sdal/rss-mro-c7sd_auth
    docker push sdal/rpkgs-mro-c7sd_auth
    docker push sdal/shy-mro-c7sd_auth

Slide 25

Slide 25 text

Installing R Packages (Production Server)
1. Pull the images from dockerhub, docker pull

    docker pull sdal/mro-c7sd_auth
    docker pull sdal/rss-mro-c7sd_auth
    docker pull sdal/rpkgs-mro-c7sd_auth
    docker pull sdal/shy-mro-c7sd_auth

2. Start up the RStudio containers

    docker-compose -f rstudio-compose.yml up -d --no-recreate

Slide 26

Slide 26 text

RStudio Server
· Open Source Edition great for individual use
  - Exploring RStudio Pro… much better suited for parallel projects, groups, and teams (?)
· Pretty much building the Pro stack…
  - nginx
  - web server: reverse proxy, load balancer, and HTTP cache

Slide 27

Slide 27 text

RStudio Server for Everyone
docker-compose.yml/rstudio-compose.yml:

    volumes:
      rpkgs:
    services:
      rstudio_chend:
        image: sdal/rss-mro-c7sd_auth
        container_name: rstudio_chend
        volumes:
          - /sys/fs/cgroup:/sys/fs/cgroup:ro
          - /etc/group:/etc/group
          - /home:/home
          - rpkgs:/rpkgs
          - checkpoint:/checkpoint
        cap_add:
          - SYS_ADMIN
        ports:
          - 3125:8787

Slide 28

Slide 28 text

Authorship
· Collaborative LaTeX with real-time rendering… with Git!
· Try it with Docker! docker-compose.yml
  · ShareLaTeX
  · MongoDB
  · Redis
Virginia Tech Libraries provides free Overleaf Pro+ accounts

Slide 29

Slide 29 text

Sharing Documenting Projects

Slide 30

Slide 30 text

The Skills

Slide 31

Slide 31 text

Greg Wilson
Head of Instructor Training @ DataCamp
Co-founder of Software Carpentry
Git is hard & Shell (Bash) is the glue
Free DataCamp Courses:
· Introduction to Shell for Data Science
· Introduction to Git for Data Science
Software-Carpentry Lessons:
· Shell
· Git/Mercurial
· SQL
· Python
· R
· MATLAB
· Make

Slide 32

Slide 32 text

Balancing best/good practices…
… with getting work done
Titus Brown
Associate Professor at UC Davis
Write Functions! in your projects

Slide 33

Slide 33 text

No setwd()!
· Use rprojects!
  - Open them: rstudioapi::openProject(...)
· here package, “A simple interface to rprojroot”: https://github.com/r-lib/here
  - rprojroot: https://github.com/r-lib/rprojroot
· No more if (interactive()){...}

    print(R.utils::sourceDirectory(here::here('shiny', 'functions')))
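The idea behind here and rprojroot, resolving paths from a project root instead of the current working directory, carries over to Python. The following is a rough sketch of that idea and not a port of either package: the helper names `find_root` and `here`, and the list of marker files, are assumptions made for illustration.

```python
from pathlib import Path

# Files or folders that mark a project root. This particular list is an assumption.
MARKERS = (".git", ".here", "DESCRIPTION")


def find_root(start=None, markers=MARKERS):
    """Walk upward from `start` until a folder containing a marker is found."""
    start = Path(start or Path.cwd()).resolve()
    for folder in (start, *start.parents):
        if any((folder / marker).exists() for marker in markers):
            return folder
    raise FileNotFoundError(f"no project root found above {start}")


def here(*parts, start=None):
    """Build a path relative to the project root, in the spirit of here::here()."""
    return find_root(start).joinpath(*parts)
```

With this in place, `here("shiny", "functions")` points at the same folder whether a script is run from the project root or from a subfolder, which is the property that makes setwd() unnecessary.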

Slide 34

Slide 34 text

Saving things
· Save out long calculations into intermediate datasets
· Use base::saveRDS() and base::readRDS()
  - vs base::save() and base::load()

    v <- 1:10
    # I want to save this...
    save(v, file = 'awesome_datascience.RData')
    rm(v)
    load(file = 'awesome_datascience.RData')
    v
    ## [1] 1 2 3 4 5 6 7 8 9 10
    saveRDS(v, file = 'super_awesome_datascience.RDS')
    loaded <- readRDS(file = 'super_awesome_datascience.RDS')
    loaded
    ## [1] 1 2 3 4 5 6 7 8 9 10
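On the Python side, a comparable pattern for caching one intermediate object is the standard-library pickle module. This is a sketch under that assumption, with hypothetical helper names `save_object` and `load_object`. Like readRDS(), the load step returns the object, so the caller picks the variable name instead of having one restored silently.

```python
import pickle
from pathlib import Path


def save_object(obj, path):
    """Serialize a single object to `path` (roughly what saveRDS does in R)."""
    with Path(path).open("wb") as handle:
        pickle.dump(obj, handle)


def load_object(path):
    """Read the object back and return it (roughly what readRDS does in R)."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

For example, after `save_object(list(range(1, 11)), "super_awesome_datascience.pkl")`, a later session can run `loaded = load_object("super_awesome_datascience.pkl")`. Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.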

Slide 35

Slide 35 text

Secrets
· Hardcoding secrets (e.g., passwords, API keys) in your code
· Reverting lines before a commit
· Sourcing a special ignored file

    # uses console or rstudio to do password prompt
    getPass::getPass("database username")
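A Python counterpart to the getPass prompt could look like the sketch below. It uses the standard-library getpass module and, as an added assumption that is not on the slide, checks an environment variable first so that non-interactive runs also work. The helper name `get_secret` and the variable name `DB_PASSWORD` are hypothetical.

```python
import os
from getpass import getpass


def get_secret(name, prompt=None):
    """Return the secret stored in environment variable `name`.

    Falls back to an interactive, non-echoing prompt when the variable is
    unset or empty, so the value never has to appear in the source code.
    """
    value = os.environ.get(name)
    if value:
        return value
    return getpass(prompt or f"{name}: ")
```

A script would then call `password = get_secret("DB_PASSWORD")` and nothing secret ends up in version control.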

Slide 36

Slide 36 text

Secret Library

    .secret_to_keep <- function(user, pass) {
      if (is.null(pass)) {
        pass <- getPass("LDAP Password (the one you use to login to Lightfoot and RStudio):")
      }
      secret_to_keep <- c(password = pass, username = user)
      return(secret_to_keep)
    }

    setup_user_pass <- function(username = unname(Sys.info()['user']),
                                password = NULL,
                                public_key = '~/.ssh/id_rsa.pub',
                                vault = '/home/sdal/projects/sdal/vault',
                                secret_name = unname(Sys.info()['user']),
                                verbose = FALSE) {
      add_user(username, public_key, vault)
      secret_to_keep <- .secret_to_keep(user = username, pass = password)
      add_secret(secret_name, secret_to_keep, users = username, vault = vault)
    }

    get_my_password <- function(secret_name = unname(Sys.info()['user']),
                                key = local_key(),
                                vault = '/home/sdal/projects/sdal/vault') {
      return(unname(get_secret(secret_name, key, vault)['password']))
    }

Slide 37

Slide 37 text

Package!
I learned from last year, we should all probably just create packages for ourselves/group/lab/company
sdalr, https://github.com/bi-sdal/sdalr/blob/master/R/user_pass.R

Slide 38

Slide 38 text

Testing
RStatsNYC 2016: Data Testing
· Repo: https://github.com/chendaniely/2016-04-08-rstatsnyc_testing
· Video: https://www.youtube.com/watch?v=CAy0udiWwmg

Slide 39

Slide 39 text

Cookbooks

Slide 40

Slide 40 text

RMarkdown
· This presentation is written in it!
· I’ve given Meetup Talks about it: NYC, DC
· Websites, Books, Presentations, Dashboards, Reports…
  - https://rmarkdown.rstudio.com/

Slide 41

Slide 41 text

Graphics Cookbook
· How to do something
· Gallery of plots
· Similar to: http://www.cookbook-r.com/Graphs/

Slide 42

Slide 42 text

A Better Default Colormap for Matplotlib
SciPy 2015 | Nathaniel Smith and Stéfan van der Walt
· https://www.youtube.com/watch?v=xAoljeRJ3lU
· “Perceptually uniform”, sequential, works well in black-and-white, colorblind friendly
· Matlab: parula

Slide 43

Slide 43 text

Viridis
viscm, a tool to see how “good” your colormap is: http://bids.github.io/colormap/
· R Package!
  - https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html
  - by: Simon Garnier, Noam Ross, Bob Rudis, Marco Sciaini, Cédric Scherer

Slide 44

Slide 44 text

Perceptual Color Maps in matplotlib for Oceanography
SciPy 2015 | Kristen Thyng
· Domain specific color maps (Oceanography): https://www.youtube.com/watch?v=XjHzLUnHeM0

Slide 45

Slide 45 text

Colorbrewer

Slide 46

Slide 46 text

Other things
Flight Rules are the hard-earned body of knowledge recorded in manuals that list, step-by-step, what to do if X occurs, and why. Essentially, they are extremely detailed, scenario-specific standard operating procedures. […] NASA has been capturing our missteps, disasters and solutions since the early 1960s, when Mercury-era ground teams first started gathering “lessons learned” into a compendium that now lists thousands of problematic situations […] and their solutions.
(Chris Hadfield, An Astronaut’s Guide to Life)
· Cookbooks for your own use cases (e.g., GIS)
· Make your own “flight rules”

Slide 47

Slide 47 text

Why?
Our planet needs our help, and we need (good) science to fix it.
(Greg Wilson)

Slide 48

Slide 48 text

Thanks, again!

Slide 49

Slide 49 text

#hobbestheblueheelermix #rdogladies #bowtiesarecool
chendaniely:
· twitter
· github
· github.io
· instagram
Slides: https://github.com/chendaniely/rstatsnyc_2018-data_science :)
#rstatsnyc #nycdatamafia