Should all statistics students 
be programmers?

Hadley Wickham
July 12, 2018

A presentation at ICOTS 10 (Kyoto, Japan)

Transcript

  1. Should all statistics students be programmers? No!
    Hadley Wickham (@hadleywickham), Chief Scientist, RStudio. July 2018.
  2. Should all statistics students program? Yes!
    Hadley Wickham (@hadleywickham), Chief Scientist, RStudio. July 2018.
  3. What should a statistics student be able to do? Model

  4. What should a statistics student be able to do?
    Import
    Tidy: store data consistently
    Transform: create new variables & new summaries
    Visualise: surprises, but doesn't scale
    Model: scales, but doesn't (fundamentally) surprise
    Communicate
    Automate
    (Transform, Visualise and Model together: Understand)
  5. Two primary computer interfaces: point & click GUI

  6. Or: program with a CLI

  7. Why is programming preferable for statistics?
    1. Code is text
    2. Code is read-able
    3. Code is shareable
    4. Code is open
  8. Code is text. And this provides for two extremely powerful techniques
  9. ⌘C ⌘V

  10. None
  11. By Jenny Bryan ⌘⇧R

  12. ⌘⇧K

  13. Code is read-able

  14. Much of my work is developing tools

  15. A small example

      library(tidycensus)
      geo <- get_acs(
        geography = "metropolitan statistical area...",
        variables = "DP03_0021PE",
        summary_var = "B01003_001",
        survey = "acs1",
        endyear = 2016
      )
      # Thanks to Kyle Walker (@kyle_e_walker)
      # For package and example
  16. None
  17. Followed by data munging

      big_metro <- geo %>%
        filter(summary_est > 2e6) %>%
        select(-variable) %>%
        mutate(
          NAME = gsub(" Metro Area", "", NAME)
        ) %>%
        separate(NAME, c("city", "state"), ", ") %>%
        mutate(
          city = str_extract(city, "^[A-Za-z ]+"),
          state = str_extract(state, "^[A-Za-z ]+"),
          name = paste0(city, ", ", state),
          summary_moe = na_if(summary_moe, -555555555)
        )
  18. None
  19. big_metro %>%
        ggplot(aes(
          x = estimate,
          y = reorder(name, estimate))
        ) +
        geom_errorbarh(
          aes(
            xmin = estimate - moe,
            xmax = estimate + moe
          ),
          width = 0.1
        ) +
        geom_point(color = "navy")
  20. [Plot: one point with a horizontal error bar per large metro area (Indianapolis, IN through New York, NY). x axis: estimate (0 to 30); y axis: reorder(name, estimate).]
  21. And hence you can read unfamiliar code. What does this code do?

      library(tidyverse)
      library(magick)
      dir(pattern = ".png") %>%
        map(image_read) %>%
        image_join() %>%
        image_animate(fps = 1, loop = 25) %>%
        image_write("my_animation.gif")

      https://twitter.com/ricardokriebel/status/849626401611411458
  22. https://twitter.com/ricardokriebel/status/849626401611411458

  23. I live in fear of clicking the wrong thing

  24. Code is shareable

  25. None
  26. None
  27. None
  28. None
  29. None
  30. None
  31. None
  32. Why is sharing so important?
    Learn from others
    Open science
    Easily critique
  33. Code is open

  34. All modern programming languages are open source
    Free: Students can use same tools as practitioners. Anyone can use best tools regardless of wealth. Anyone can re-run your analysis.
    Fluid: You can fix problems. You can build your own tools.
  35. Conclusion

  36. Why is programming preferable for statistics?
    1. Code is text
    2. Code is read-able
    3. Code is shareable
    4. Code is open