Slide 1

Slide 1 text

The tidyverse
Hadley Wickham (@hadleywickham)
Chief Scientist, RStudio
October 2016

Slide 2

Slide 2 text

Import
Tidy: consistent way of storing data
Transform: create new variables & new summaries
Visualise: surprises, but doesn't scale
Model: scales, but doesn't (fundamentally) surprise
Communicate
Program

Slide 3

Slide 3 text

No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system. — Hal Abelson

Slide 4

Slide 4 text

Tidy Import Visualise Transform Model Communicate Program

Slide 5

Slide 5 text

Tidy Import Visualise Transform Model Communicate Program

Slide 6

Slide 6 text

The tidy tools manifesto

Slide 7

Slide 7 text

Import: readr, readxl, haven, httr, jsonlite, DBI, rvest, xml2
Tidy: tibble, tidyr
Transform: dplyr, forcats, hms, lubridate, stringr
Visualise: ggplot2
Model: broom, modelr
Program: purrr, magrittr
tidyverse
http://r4ds.had.co.nz

Slide 8

Slide 8 text

1. Share data structures.
2. Compose simple pieces.
3. Embrace FP.
4. Write for humans.

Slide 9

Slide 9 text

1 Share data structures

Slide 10

Slide 10 text

Tidy data
1. Put each dataset in a data frame.
2. Put each variable in a column.

Slide 11

Slide 11 text

Messy data has a varied shape
# A tibble: 5,769 × 22
   iso2 year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514  f014 f1524
 1 AD   1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2 AD   1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 3 AD   1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 4 AD   1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5 AD   1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6 AD   1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7 AD   1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA     0     1
 8 AD   1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA     0     1
 9 AD   1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA    NA    NA
10 AD   1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA     0     0
11 AD   2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA    NA    NA
12 AD   2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA    NA    NA
13 AD   2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA     0     1
14 AD   2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA     0     1
15 AD   2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA     0     0
16 AD   2005     0     0     0     0     1     1     0     0     0     0     0     0     0     1
17 AD   2006     0     0     0     1     1     2     0     1     1     0     0     0     0     0
# ... with 5,752 more rows, and 6 more variables: f2534, f3544, f4554,
#   f5564, f65, fu
What are the variables in this dataset? (Hint: f = female, u = unknown, 1524 = 15-24)

Slide 12

Slide 12 text

Tidy data has a uniform shape
# A tibble: 35,750 × 5
   country year sex age  n
 1 AD      1996 f   014  0
 2 AD      1996 f   1524 1
 3 AD      1996 f   2534 1
 4 AD      1996 f   3544 0
 5 AD      1996 f   4554 0
 6 AD      1996 f   5564 1
 7 AD      1996 f   65   0
 8 AD      1996 m   014  0
 9 AD      1996 m   1524 0
10 AD      1996 m   2534 0
# ... with 35,740 more rows
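The reshaping from the messy layout to the tidy one can be sketched as follows. This is a minimal illustration rather than the code used in the talk: it assumes a current version of tidyr that provides pivot_longer(), and it uses only two rows and four of the count columns (values copied from the 1996 and 1997 rows of the messy table). The names tb_wide and tb_tidy are made up for the example.

```r
library(tidyr)

# A small slice of the messy table: one column per sex/age combination.
tb_wide <- data.frame(
  iso2  = c("AD", "AD"),
  year  = c(1996, 1997),
  m014  = c(0, 0),
  m1524 = c(0, 0),
  f014  = c(0, 0),
  f1524 = c(1, 1)
)

# Split each column name into its first letter (sex) and the rest (age),
# and stack the counts into a single column n.
tb_tidy <- pivot_longer(
  tb_wide,
  cols = -c(iso2, year),
  names_to = c("sex", "age"),
  names_pattern = "^([mf])(.*)$",
  values_to = "n"
)

print(tb_tidy)
```

Each variable (country code, year, sex, age, count) now lives in its own column, which is the uniform shape the slide describes.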

Slide 13

Slide 13 text

Happy families are all alike; every unhappy family is unhappy in its own way — Leo Tolstoy

Slide 14

Slide 14 text

Tidy datasets are all alike; every messy dataset is messy in its own way — Hadley Wickham

Slide 15

Slide 15 text

Tidy data is ~80% of data structures
Other data structures: strings, dates, factors, matrices, vectors, xml, HTTP requests, HTTP response
http://simplystatistics.org/2016/02/17/non-tidy-data/

Slide 16

Slide 16 text

What if you have a mix?
Cross-validation
Training data: data frame
Test data: data frame
Model: lm
Predictions: vector
RMSE: scalar

Slide 17

Slide 17 text

Use a tibble with list-columns!
# A tibble: 100 x 5
   train test .id mod      rmse
 1            001     0.5661605
 2            002     0.2399357
 3            003     3.5482986
 4            004     0.2396810
 5            005     0.1591336
 6            006     0.1934869
 7            007     0.2697834
 8            008     0.4910886
 9            009     1.7002645
10            010     0.2047787
# ... with 90 more rows
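A minimal sketch of the list-column idea, assuming the tibble and purrr packages are installed. It is not the cross-validation shown on the slide: as a simpler stand-in it fits one model per cylinder group of mtcars. The names by_cyl and models, and the choice of mpg ~ wt, are illustrative assumptions.

```r
library(tibble)
library(purrr)

# One data frame per value of cyl.
by_cyl <- split(mtcars, mtcars$cyl)

# Data frames, fitted models and a scalar summary sit side by side,
# one row per group.
models <- tibble(
  cyl  = names(by_cyl),
  data = unname(by_cyl),                               # list-column of data frames
  mod  = map(data, ~ lm(mpg ~ wt, data = .x)),         # list-column of lm objects
  rmse = map_dbl(mod, ~ sqrt(mean(residuals(.x)^2)))   # ordinary numeric column
)

print(models)
```

Because everything stays in one data frame, the related objects for a row cannot drift out of sync with each other.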

Slide 18

Slide 18 text

Your turn!
df <- data.frame(xyz = "a")
# What does this return?
df$x

Slide 19

Slide 19 text

Your turn!
df <- data.frame(xyz = "a")
# What does this return?
df$x
#> [1] a
#> Levels: a
Two surprises: partial name matching & stringsAsFactors
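The two surprises can be checked separately with base R alone. One caveat when reproducing the slide today: since R 4.0.0 the default is stringsAsFactors = FALSE, so the sketch below sets it explicitly to get the factor shown above. The variable names partial and exact are illustrative.

```r
# Request the 2016-era default explicitly so the result is a factor.
df <- data.frame(xyz = "a", stringsAsFactors = TRUE)

# Surprise 1: `$` partially matches "x" to the column "xyz".
partial <- df$x

# `[[` matches names exactly by default, so it finds nothing.
exact <- df[["x"]]

print(partial)
print(exact)
```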

Slide 20

Slide 20 text

Two important tensions for understanding base R
Interactive exploration vs. Programming
Conservative vs. Utopian

Slide 21

Slide 21 text

Tibbles are data frames that are lazy & surly
df <- tibble(xyz = "a")
df$xyz
#> [1] "a"
is.data.frame(df[, "xyz"])
#> [1] TRUE
df$x
#> Warning: Unknown column 'x'
#> NULL
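For comparison, a short sketch of the "lazy" half of that claim, assuming the tibble package is installed: single-column subsetting with `[` drops a base data frame to a vector but leaves a tibble as a tibble. The names base_df and tbl are illustrative.

```r
library(tibble)

base_df <- data.frame(xyz = "a")
tbl     <- tibble(xyz = "a")

# Base data frames simplify a one-column result to a vector...
base_keeps_frame <- is.data.frame(base_df[, "xyz"])

# ...while tibbles always return another tibble from `[`.
tbl_keeps_frame <- is.data.frame(tbl[, "xyz"])

print(c(base = base_keeps_frame, tibble = tbl_keeps_frame))
```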

Slide 22

Slide 22 text

And work better with list-columns
data.frame(x = list(1:2, 3:5))
#> Error: arguments imply differing number
#> of rows: 2, 3

Slide 23

Slide 23 text

And work better with list-columns
data.frame(x = list(1:2, 3:5))
#> Error: arguments imply differing number
#> of rows: 2, 3
data.frame(x = I(list(1:2, 3:5)))
#>         x
#> 1    1, 2
#> 2 3, 4, 5

Slide 24

Slide 24 text

And work better with list-columns
data.frame(x = list(1:2, 3:5))
#> Error: arguments imply differing number
#> of rows: 2, 3
data.frame(x = I(list(1:2, 3:5)))
#>         x
#> 1    1, 2
#> 2 3, 4, 5
tibble(x = list(1:2, 3:5))
#> # A tibble: 2 x 1
#>   x
#>   <list>
#> 1 <int [2]>
#> 2 <int [3]>

Slide 25

Slide 25 text

2 Compose simple pieces

Slide 26

Slide 26 text

Goal: Solve complex problems by combining uniform pieces.

Slide 27

Slide 27 text

https://www.flickr.com/photos/brunurb/13129057003

Slide 28

Slide 28 text

http://brickartist.com/gallery/pc-magazine-computer/. CC-BY-NC

Slide 29

Slide 29 text

%>% magrittr::

Slide 30

Slide 30 text

foo_foo <- little_bunny()
bop_on(
  scoop_up(
    hop_through(foo_foo, forest),
    field_mouse
  ),
  head
)
# vs
foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)
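The bunny functions above are illustrative and not defined anywhere, so here is a runnable sketch of the same rewrite using base functions, assuming the magrittr package is installed. The pipe passes its left-hand side as the first argument of the call on its right, so both forms compute the same value.

```r
library(magrittr)

# Nested form: read from the inside out.
nested <- round(sqrt(sum(c(1, 4, 9))), 1)

# Piped form: read from left to right, one step per call.
piped <- c(1, 4, 9) %>%
  sum() %>%
  sqrt() %>%
  round(1)

print(identical(nested, piped))
```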

Slide 31

Slide 31 text

Consistency across packages is important
library(nycflights13)
library(dplyr)
library(ggplot2)
flights %>%
  group_by(date) %>%
  summarise(n = n()) %>%
  ggplot(aes(date, n)) +
  geom_line()
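To run this pipeline against the flights table in the CRAN release of nycflights13, a date column has to be built first, because that table stores year, month and day as separate columns. The sketch below is one way to do that, not the author's code; the name daily and the use of sprintf() to assemble the date are assumptions made for the example.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

# Build a Date from the separate year/month/day columns, then count per day.
daily <- flights %>%
  mutate(date = as.Date(sprintf("%d-%02d-%02d", year, month, day))) %>%
  group_by(date) %>%
  summarise(n = n())

# The hand-off from %>% to + is the inconsistency the slide points at.
p <- daily %>%
  ggplot(aes(date, n)) +
  geom_line()

print(p)
```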

Slide 32

Slide 32 text

And ggplot2 is not even internally consistent
ggplot(mtcars, aes(mpg, wt)) +
  geom_point() +
  geom_line() +
  ggsave("mtcars.pdf")
x

Slide 33

Slide 33 text

And ggplot2 is not even internally consistent
ggsave(
  "mtcars.pdf",
  ggplot(mtcars, aes(mpg, wt)) +
    geom_point() +
    geom_line()
)

Slide 34

Slide 34 text

ggplot1 had a tidier API than ggplot2!
# devtools::install_github("hadley/ggplot1")
library(ggplot1)
ggsave(
  ggpoint(
    ggplot(
      mtcars,
      list(x = mpg, y = wt)
    )
  ),
  "mtcars.pdf",
  width = 8,
  height = 6
)

Slide 35

Slide 35 text

So you can use the pipe with ggplot1
library(ggplot1)
mtcars %>%
  ggplot(list(x = mpg, y = wt)) %>%
  ggpoint() %>%
  ggsave("mtcars.pdf", width = 8, height = 6)
ggplot2 never would have existed if I’d discovered the pipe 10 years earlier!

Slide 36

Slide 36 text

[Figure: scatterplot of mtcars, wt (ticks 2 to 5) against mpg (ticks 10 to 35)]

Slide 37

Slide 37 text

One small example from Bob Rudis
https://rud.is/b/2016/07/26
library(rvest)
library(purrr)
library(readr)
library(dplyr)
library(lubridate)
read_html("https://www.massshootingtracker.org/data") %>%
  html_nodes("a[href^='https://docs.goo']") %>%
  html_attr("href") %>%
  map_df(read_csv) %>%
  mutate(date = mdy(date)) -> shootings

Slide 38

Slide 38 text

3 Embrace FP
Why are for loops “bad”?
Answered with cupcakes

Slide 39

Slide 39 text

Vanilla cupcakes
The hummingbird bakery cookbook
1 cup flour
a scant ¾ cup sugar
1 ½ t baking powder
3 T unsalted butter
½ cup whole milk
1 egg
¼ t pure vanilla extract
Preheat oven to 350°F. Put the flour, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.

Slide 40

Slide 40 text

Chocolate cupcakes
The hummingbird bakery cookbook
¾ cup + 2T flour
2 ½ T cocoa powder
a scant ¾ cup sugar
1 ½ t baking powder
3 T unsalted butter
½ cup whole milk
1 egg
¼ t pure vanilla extract
Preheat oven to 350°F. Put the flour, cocoa, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.

Slide 43

Slide 43 text

Vanilla cupcakes
The hummingbird bakery cookbook
1. Convert units
120g flour
140g sugar
1.5 t baking powder
40g unsalted butter
120ml milk
1 egg
0.25 t pure vanilla extract
Preheat oven to 170°C. Put the flour, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.

Slide 44

Slide 44 text

Vanilla cupcakes
The hummingbird bakery cookbook
2. Rely on domain knowledge
120g flour
140g sugar
1.5 t baking powder
40g butter
120ml milk
1 egg
0.25 t vanilla
Beat flour, sugar, baking powder, salt, and butter until sandy. Whisk milk, egg, and vanilla. Mix half into flour mixture until smooth (use high speed). Beat in remaining half. Mix until smooth. Bake 20-25 min at 170°C.

Slide 45

Slide 45 text

Vanilla cupcakes
The hummingbird bakery cookbook
3. Use variables
120g flour
140g sugar
1.5 t baking powder
40g butter
120ml milk
1 egg
0.25 t vanilla
Beat dry ingredients + butter until sandy. Whisk together wet ingredients. Mix half into dry until smooth (use high speed). Beat in remaining half. Mix until smooth. Bake 20-25 min at 170°C.

Slide 46

Slide 46 text

Cupcakes
4. Extract out common code
Vanilla: 120g flour, 140g sugar, 1.5t baking powder, 40g butter, 120ml milk, 1 egg, 0.25 t vanilla
Chocolate: 100g flour, 20g cocoa, 140g sugar, 1.5t baking powder, 40g butter, 120ml milk, 1 egg, 0.25 t vanilla
Beat dry ingredients + butter until sandy. Whisk together wet ingredients. Mix half into dry until smooth (use high speed). Beat in remaining half. Mix until smooth. Bake 20-25 min at 170°C.

Slide 47

Slide 47 text

What do these for loops do?
out1 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out1[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}
out2 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out2[[i]] <- median(mtcars[[i]], na.rm = TRUE)
}

Slide 48

Slide 48 text

For loops emphasise the objects
out1 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out1[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}
out2 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out2[[i]] <- median(mtcars[[i]], na.rm = TRUE)
}

Slide 49

Slide 49 text

Not the actions
out1 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out1[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}
out2 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out2[[i]] <- median(mtcars[[i]], na.rm = TRUE)
}

Slide 50

Slide 50 text

Functional programming emphasises the actions
library(purrr)
means <- map_dbl(mtcars, mean)
medians <- map_dbl(mtcars, median)
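The step from the two loops to map_dbl() mirrors the cupcake refactoring: keep the shared recipe and pass in the part that varies. A minimal sketch, assuming purrr is installed; col_summary is a hypothetical helper written for this example, not a purrr function.

```r
library(purrr)

# Hypothetical helper: the loop "recipe", with the summary function
# (the only thing that differed between the two loops) as an argument.
col_summary <- function(df, fun, ...) {
  out <- vector("double", ncol(df))
  for (i in seq_along(df)) {
    out[[i]] <- fun(df[[i]], ...)
  }
  names(out) <- names(df)
  out
}

loop_means   <- col_summary(mtcars, mean, na.rm = TRUE)
loop_medians <- col_summary(mtcars, median, na.rm = TRUE)

# map_dbl() packages the same pattern, so only the action is left visible.
map_means   <- map_dbl(mtcars, mean, na.rm = TRUE)
map_medians <- map_dbl(mtcars, median, na.rm = TRUE)

print(all.equal(loop_means, map_means))
```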

Slide 51

Slide 51 text

Teaser: simulation
sim <- tribble(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)
sim %>%
  mutate(sim = invoke_map(f, params, n = 10))
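invoke_map() has been deprecated in later purrr releases (1.0.0 onwards), so here is a sketch of the same simulation without it, assuming tibble, dplyr and purrr are installed. It pairs each function name with its parameter list using map2() and calls it through base do.call(). The name sims is illustrative, and the draws are random, so no output is shown.

```r
library(tibble)
library(dplyr)
library(purrr)

sim <- tribble(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)

# For each row, call the function named in f with n = 10 plus that
# row's own parameters; the results are stored in a list-column.
sims <- sim %>%
  mutate(sim = map2(f, params, ~ do.call(.x, c(list(n = 10), .y))))

print(sims)
```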

Slide 52

Slide 52 text

Teaser: saving parameterised reports
reports <- tibble(
  class = unique(mpg$class),
  filename = paste0("fuel-economy-", class, ".html"),
  params = map(class, ~ list(my_class = .))
)
reports %>%
  select(output_file = filename, params) %>%
  pwalk(rmarkdown::render, input = "fuel-economy.Rmd")
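What pwalk() does here can be seen without rendering anything. In the sketch below, which assumes tibble and purrr are installed, fake_render is a hypothetical stand-in for rmarkdown::render() that only records how it was called; jobs and calls are likewise made-up names. Each row of the tibble becomes one call, with the column names used as argument names and input supplied as a fixed extra argument.

```r
library(tibble)
library(purrr)

jobs <- tibble(
  output_file = c("a.html", "b.html"),
  params = list(list(my_class = "compact"), list(my_class = "suv"))
)

# Hypothetical stand-in for rmarkdown::render(): logs its arguments.
calls <- character()
fake_render <- function(input, output_file, params) {
  calls <<- c(
    calls,
    paste(input, "->", output_file, "with my_class =", params$my_class)
  )
  invisible(output_file)
}

# One call per row: columns are matched to arguments by name.
pwalk(jobs, fake_render, input = "fuel-economy.Rmd")

print(calls)
```

This is why the slide renames filename to output_file before calling pwalk(): the column names have to match the argument names of the function being called.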

Slide 53

Slide 53 text

4 Write for humans

Slide 54

Slide 54 text

Programs must be written for people to read, and only incidentally for machines to execute. — Hal Abelson

Slide 55

Slide 55 text

tibble lubridate forcats filter mutate summarise arrange select magrittr

Slide 56

Slide 56 text

tibble lubridate forcats filter mutate summarise arrange select magrittr
Embrace existing language
Have fun!
Pack animals

Slide 57

Slide 57 text

Conclusion

Slide 58

Slide 58 text

1. Share data structures.
2. Compose simple pieces.
3. Embrace FP.
4. Write for humans.

Slide 59

Slide 59 text

My goal is to make a pit of success http://blog.codinghorror.com/falling-into-the-pit-of-success/

Slide 60

Slide 60 text

Gotta install them all
install.packages("tidyverse")
library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

Slide 61

Slide 61 text

Import: readr, readxl, haven, httr, jsonlite, rvest, xml2
Tidy: tibble, tidyr
Transform: dplyr, forcats, hms, lubridate, stringr, vctrs
Visualise: ggplot2
Model: broom, modelr, ???
Program: purrr, magrittr
tidyverse
http://r4ds.had.co.nz

Slide 62

Slide 62 text

Gotta catch’em all
Tidyverse theme song, by Sean Kross
I want to R the very best
like no one ever R'ed
To tidy them is my real test
To model them is my cause
I'll import from across the web
Searching GitHub and CRAN
Each Tidyverse package gets me closer
to results I'll understand
Tidyverse!
Gotta wrangle
It's R and me
I know it's my destiny
Tidyverse!
Hadley - you're my best friend
With tidy data from end to end
Tidyverse!
Gotta wrangle
A package so new
Pipes and data first will pull us through
You teach me and I'll teach you
Tidyverse!

Slide 63

Slide 63 text

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/