Slide 1

Slide 1 text

Should all statistics students be programmers? No!

Hadley Wickham
@hadleywickham
Chief Scientist, RStudio
July 2018

Slide 2

Slide 2 text

Should all statistics students program? Yes!

Hadley Wickham
@hadleywickham
Chief Scientist, RStudio
July 2018

Slide 3

Slide 3 text

What should a statistics student be able to do?

Model

Slide 4

Slide 4 text

What should a statistics student be able to do?

Import
Tidy: store data consistently
Understand:
  Transform: create new variables & new summaries
  Visualise: surprises, but doesn't scale
  Model: scales, but doesn't (fundamentally) surprise
Communicate
Automate
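
As a rough illustration of the transform and model steps named on this slide, here is a minimal sketch in base R. The built-in mtcars data, the kpl variable, and the choice of a linear model are assumptions made for illustration; none of them come from the talk.

```r
# Transform: create a new variable (fuel economy in km per litre).
# 1 mile per US gallon is about 0.425144 km per litre.
cars <- transform(mtcars, kpl = mpg * 0.425144)

# Transform: create a new summary (mean kpl for each cylinder count).
by_cyl <- aggregate(kpl ~ cyl, data = cars, FUN = mean)

# Model: a simple linear fit of fuel economy on weight.
fit <- lm(kpl ~ wt, data = cars)

print(by_cyl)
print(coef(fit))
```

Because every step is text, the whole analysis can be re-run, copied, or adapted to another dataset by editing a few names.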

Slide 5

Slide 5 text

Two primary computer interfaces: point & click GUI

Slide 6

Slide 6 text

Or: program with a CLI

Slide 7

Slide 7 text

Why is programming preferable for statistics?

1. Code is text
2. Code is read-able
3. Code is shareable
4. Code is open

Slide 8

Slide 8 text

Code is text

And this provides for two extremely powerful techniques

Slide 9

Slide 9 text

⌘C ⌘V

Slide 11

Slide 11 text

By Jenny Bryan ⌘⇧R

Slide 12

Slide 12 text

⌘⇧K

Slide 13

Slide 13 text

Code is read-able

Slide 14

Slide 14 text

Much of my work is developing tools

Slide 15

Slide 15 text

A small example

library(tidycensus)

geo <- get_acs(
  geography = "metropolitan statistical area...",
  variables = "DP03_0021PE",
  summary_var = "B01003_001",
  survey = "acs1",
  endyear = 2016
)

# Thanks to Kyle Walker (@kyle_e_walker)
# For package and example

Slide 17

Slide 17 text

Followed by data munging

big_metro <- geo %>%
  filter(summary_est > 2e6) %>%
  select(-variable) %>%
  mutate(
    NAME = gsub(" Metro Area", "", NAME)
  ) %>%
  separate(NAME, c("city", "state"), ", ") %>%
  mutate(
    city = str_extract(city, "^[A-Za-z ]+"),
    state = str_extract(state, "^[A-Za-z ]+"),
    name = paste0(city, ", ", state),
    summary_moe = na_if(summary_moe, -555555555)
  )
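
To see what the string-cleaning steps do, here is a minimal sketch that applies the same calls to a two-row toy table. The rows and numbers are made up for illustration and are not real ACS values; only the column names follow the slide.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical rows shaped like the get_acs() result; values are invented.
toy <- tibble(
  NAME = c("St. Louis, MO-IL Metro Area", "Austin-Round Rock, TX Metro Area"),
  estimate = c(1, 2),
  summary_moe = c(-555555555, 100)
)

munged <- toy %>%
  # Drop the " Metro Area" suffix.
  mutate(NAME = gsub(" Metro Area", "", NAME)) %>%
  # Split "cities, states" into two columns at the comma.
  separate(NAME, c("city", "state"), ", ") %>%
  mutate(
    # Keep only the leading run of letters and spaces.
    city = str_extract(city, "^[A-Za-z ]+"),
    state = str_extract(state, "^[A-Za-z ]+"),
    name = paste0(city, ", ", state),
    # Treat the sentinel value as missing.
    summary_moe = na_if(summary_moe, -555555555)
  )

print(munged$name)
```

Note that the pattern stops at the first character that is not a letter or a space, so "St. Louis" is cut to "St" and "Austin-Round Rock" to "Austin".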

Slide 19

Slide 19 text

big_metro %>%
  ggplot(aes(
    x = estimate,
    y = reorder(name, estimate)
  )) +
  geom_errorbarh(
    aes(
      xmin = estimate - moe,
      xmax = estimate + moe
    ),
    width = 0.1
  ) +
  geom_point(color = "navy")

Slide 20

Slide 20 text

[Plot: one point with a horizontal error bar per metro area, 35 areas listed from Indianapolis, IN to New York, NY. x axis: estimate (ticks at 0, 10, 20, 30). y axis: reorder(name, estimate).]

Slide 21

Slide 21 text

And hence you can read unfamiliar code

What does this code do?

library(tidyverse)
library(magick)

dir(pattern = ".png") %>%
  map(image_read) %>%
  image_join() %>%
  image_animate(fps = 1, loop = 25) %>%
  image_write("my_animation.gif")

https://twitter.com/ricardokriebel/status/849626401611411458
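
One way to read a pipeline like this: each %>% passes the result on its left as the first argument of the call on its right, so the steps run in the order they are written (list the PNG files, read each one, join them, animate, write the GIF). The sketch below shows the same idea with made-up numbers that are not from the talk, comparing the piped form with the equivalent nested calls.

```r
library(magrittr)

# Piped form: read left to right, one step at a time.
piped <- c(1, 4, 9) %>%
  sqrt() %>%
  sum()

# Nested form: the same computation, read inside out.
nested <- sum(sqrt(c(1, 4, 9)))

print(piped)
print(identical(piped, nested))
```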

Slide 22

Slide 22 text

https://twitter.com/ricardokriebel/status/849626401611411458

Slide 23

Slide 23 text

I live in fear of clicking the wrong thing

Slide 24

Slide 24 text

Code is shareable

Slide 32

Slide 32 text

Why is sharing so important?

Learn from others
Open science
Easily critique

Slide 33

Slide 33 text

Code is open

Slide 34

Slide 34 text

All modern programming languages are open source

Free
  Students can use same tools as practitioners.
  Anyone can use best tools regardless of wealth.
  Anyone can re-run your analysis

Fluid
  You can fix problems
  You can build your own tools

Slide 35

Slide 35 text

Conclusion

Slide 36

Slide 36 text

Why is programming preferable for statistics?

1. Code is text
2. Code is read-able
3. Code is shareable
4. Code is open