Slide 1

You can’t do data science in a GUI

Hadley Wickham (@hadleywickham)
Chief Scientist, RStudio
March 2018

Slide 2

Data science is the process by which data becomes understanding, knowledge and insight

Slide 3

Data science is the process by which data becomes understanding, knowledge and insight

Slide 4

[Diagram: Import → Tidy. Tidy: store data consistently]

Slide 5

[Diagram: Import → Tidy → Understand. Tidy: store data consistently]

Slide 6

[Diagram: Import → Tidy → Understand (Transform ↔ Visualise ↔ Model). Tidy: store data consistently. Transform: create new variables & new summaries. Visualise: surprises, but doesn't scale. Model: scales, but doesn't (fundamentally) surprise]

Slide 7

[Diagram: Import → Tidy → Understand (Transform ↔ Visualise ↔ Model) → Communicate, all wrapped in Automate. Tidy: store data consistently. Transform: create new variables & new summaries. Visualise: surprises, but doesn't scale. Model: scales, but doesn't (fundamentally) surprise]

Slide 8

[Diagram: the same workflow with packages at each stage. Import: readr, readxl, haven, xml2. Tidy: tibble, tidyr. Transform: dplyr, forcats, hms, lubridate, stringr. Visualise: ggplot2. Model: modelr, broom. Program: purrr, magrittr. tidyverse.org · r4ds.had.co.nz]
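
An aside not on the slide: the core of this toolbox attaches with a single call, which is all the setup the later code slides assume.

library(tidyverse)
# Attaches the core packages: ggplot2, dplyr, tidyr, readr,
# purrr, tibble, stringr, forcats. The rest (readxl, haven,
# lubridate, ...) are loaded explicitly when needed.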

Slide 9

Why program?

Slide 10

[Diagram: Think it → Describe it (precisely) → Do it, moving from the cognitive to the computational]

Slide 11

No content

Slide 12

No content

Slide 13

Programming languages are languages

library(dplyr)
library(tidyr)

table %>%
  rename(player = X1, team = X2, position = X3) %>%
  filter(player != 'PLAYER') %>%
  mutate(college = ifelse(player == position, player, NA)) %>%
  fill(college) %>%
  filter(player != college)

Slide 14

It’s just text! And this gives you access to two extremely powerful techniques

Slide 15

⌘C ⌘V

Slide 16

No content

Slide 17

And provides provenance

Reproducible
Readable
Diffable
Open

Slide 18

No content

Slide 19

No content

Slide 20

No content

Slide 21

No content

Slide 22

No content

Slide 23

No content

Slide 24

No content

Slide 25

No content

Slide 26

I live in fear of clicking the wrong thing

Slide 27

Why program in R?

Slide 28

R is a vector language

x <- sample(100, 10)
x > 50
#> [1]  TRUE FALSE FALSE  TRUE  TRUE
#> [6]  TRUE  TRUE FALSE FALSE  TRUE
sum(x > 50)
#> [1] 6
# (There are no scalars!)

Slide 29

Missing values are baked in

y <- sample(c(1:5, NA))
y
#> [1]  1 NA  2  3  5  4
y > 2
#> [1] FALSE    NA FALSE  TRUE  TRUE  TRUE
y == NA
#> [1] NA NA NA NA NA NA

Slide 30

An example makes this clearer

john_age <- NA
mary_age <- NA
john_age == mary_age
#> [1] NA

Slide 31

Missing values are baked in

y <- sample(c(1:5, NA))
y
#> [1]  1 NA  2  3  5  4
y > 2
#> [1] FALSE    NA FALSE  TRUE  TRUE  TRUE
is.na(y)
#> [1] FALSE  TRUE FALSE FALSE FALSE FALSE

Slide 32

So are relational tables (aka data frames/tibbles)

data.frame(
  x = 1:4,
  y = sample(letters[1:4]),
  z = runif(4)
)
#>   x y         z
#> 1 1 c 0.1189635
#> 2 2 a 0.0518956
#> 3 3 b 0.4471441
#> 4 4 d 0.0818547
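
A small aside, not on the slide: the tidyverse analogue is tibble(), which builds the same table but prints more compactly.

library(tibble)

# Same columns as above, as a tibble
tibble(
  x = 1:4,
  y = sample(letters[1:4]),
  z = runif(4)
)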

Slide 33

Functional programming

# It's well suited to data science but I
# can't (yet) articulate why
# Something about having a standard
# container for 80% of problems, and
# needing to do something to each element
# of that container
# Whole object thinking?
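
A minimal sketch of that "do something to each element of the container" idea, using purrr (my example, not from the slide):

library(purrr)

# A data frame is a list of columns, so mapping over it applies
# the same function to every column and returns one value each
map_dbl(mtcars, mean)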

Slide 34

Metaprogramming

x <- seq(0, 2 * pi, length = 100)
plot(x, sin(x), type = "l")

[Plot: a sine curve; the axes are labelled "x" and "sin(x)", taken from the code itself]
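
The mechanism behind that axis labelling, sketched with base R (my example, not from the slide): code can be captured as data, evaluated later, and turned back into text.

e <- quote(x + y)        # capture the expression, don't evaluate it
eval(e, list(x = 1, y = 2))
#> [1] 3
deparse(e)               # back to text, e.g. for an axis label
#> [1] "x + y"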

Slide 35

No content

Slide 36

Which makes it a great place to write DSLs
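
For example (an illustration I've added, not from the talk), dbplyr treats dplyr pipelines as a DSL and translates them to SQL:

library(dplyr)
library(dbplyr)

# lazy_frame() simulates a remote table, so show_query()
# prints the SQL the pipeline would generate, without a database
lazy_frame(x = 1, n = 1) %>%
  filter(n > 100) %>%
  summarise(avg = mean(x)) %>%
  show_query()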

Slide 37

Why program in R with the tidyverse?

Slide 38

Solve complex problems by combining simple pieces

https://unsplash.com/photos/tjX_sniNzgQ

Slide 39

A small example

library(tidycensus)

geo <- get_acs(
  geography = "metropolitan statistical area...",
  variables = "DP03_0021PE",
  summary_var = "B01003_001",
  survey = "acs1",
  endyear = 2016
)
# Thanks to Kyle Walker (@kyle_e_walker)
# for package and example

Slide 40

No content

Slide 41

Followed by data munging

library(tidyverse)

big_metro <- geo %>%
  filter(summary_est > 2e6) %>%
  select(-variable) %>%
  mutate(NAME = gsub(" Metro Area", "", NAME)) %>%
  separate(NAME, c("city", "state"), ", ") %>%
  mutate(
    city = str_extract(city, "^[A-Za-z ]+"),
    state = str_extract(state, "^[A-Za-z ]+"),
    name = paste0(city, ", ", state),
    summary_moe = na_if(summary_moe, -555555555)
  )

Slide 42

No content

Slide 43

big_metro %>%
  ggplot(aes(
    x = estimate,
    y = reorder(name, estimate)
  )) +
  geom_errorbarh(
    aes(
      xmin = estimate - moe,
      xmax = estimate + moe
    ),
    height = 0.1  # geom_errorbarh takes height, not width
  ) +
  geom_point(color = "navy")

Slide 44

[Plot: point estimates with horizontal error bars for 35 large metro areas, ordered from Indianapolis, IN at the bottom to New York, NY at the top; x axis "estimate" (0 to 30), y axis "reorder(name, estimate)"]

Slide 45

[Plot: the same chart, polished. Title: "Residents who take public transportation to work". Subtitle: 2016 1-year ACS estimates. x axis formatted as 0%-30%. Source: ACS Data Profile variable DP03_0021P / tidycensus]

Slide 46

No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system. — Hal Abelson

Slide 47

My goal is to make a pit of success http://blog.codinghorror.com/falling-into-the-pit-of-success/

Slide 48

But

Slide 49

[Scatterplot: hwy (20-40) vs displ (2-7), many overplotted points]

Slide 50

But this is painful!

df %>%
  select(
    date = `Date Created`,
    name = Name,
    plays = `Total Plays`,
    loads = `Total Loads`,
    apv = `Average Percent Viewed`
  )
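
One partial remedy that exists today (my aside, not in the talk): janitor::clean_names() mechanises the backtick-escaping part, though choosing the shorter names is still manual.

library(janitor)

# `Date Created` -> date_created, `Total Plays` -> total_plays, ...
df %>% clean_names()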

Slide 51

No content

Slide 52

What next?

df %>%
  filter(n > 1e6) %>%
  mutate(x = f(y)) %>%
  ???
# How predictable is the next step from
# the previous steps?

Slide 53

Can we do more with autocomplete? Where do dialogs and autocomplete intersect?

Slide 54

Learning from examples http://vis.stanford.edu/papers/wrangler

Slide 55

What about deep learning? https://twitter.com/carroll_jono/status/914254139873361920

Slide 56

Conclusion

Slide 57

I believe that:

1. There are huge advantages to code
2. R provides a great environment
3. DSLs help express your thoughts
4. Code should be the primary artefact (but might be generated other than by typing)