Slide 1

Slide 1 text

Expressing yourself with R
Hadley Wickham, Chief Scientist, RStudio
July 2017

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

[Diagram: workflow stages and packages]
Stages: Tidy, Import, Visualise, Transform, Model, Program
Packages: tibble, tidyr, purrr, magrittr, dplyr, forcats, hms, ggplot2, broom, modelr, readr, readxl, haven, xml2, lubridate, stringr

Slide 4

Slide 4 text

@aaronwolen, @aghaynes, @ajdamico, @ajschumacher, @alberthkcheng, @alyst, @andrew, @andrewjlm, @apjanke, @arneschillert, @artemklevtsov, @arunsrinivasan, @asnr, @astamm, @austenhead, @baptiste, @bbolker, @bearloga, @benmarwick, @bhive01, @BioStatMatt, @bpbond, @bquast, @BrianDiggs, @briatte, @burchill, @casallas, @cb4ds, @cboettig, @cderv, @christophergandrud, @cmartin, @colinbrislawn, @coolbutuseless, @cosinequanon, @craigcitro, @csgillespie, @ctbrown, @daattali, @dandermotj, @danliIDEA, @DanRuderman, @davharris, @davidmorrison, @dchiu911, @dchudz, @dewittpe, @dgromer, @dgrtwo, @dhimmel, @dickoa, @diogocp, @djmurphy420, @dlebauer, @dmedri, @dmenne, @dougmitarotonda, @dpastoor, @dpocock, @dtelad11, @earino, @echasnovski, @ecortens, @eddelbuettel, @edgararuiz, @edwindj, @egnha, @ehrlinger, @eibanez, @eipi10, @ekstroem, @emojiencoding, @etiennebr, @evanmiller, @fpinter, @FvD, @gaborcsardi, @gagolews, @garrettgman, @gavinsimpson, @gergness, @gnustats, @gorcha, @goyalmunish, @gregmacfarlane, @guillett, @gvelasq2, @hannesmuehleisen, @has2k1, @helix123, @hmalmedal, @hoehleatsu, @hoesler, @holstius, @hrbrmstr, @ianmcook, @ijlyttle, @ilarischeinin, @imanuelcostigan, @Ironholds, @ismayc, @isomorphisms, @itsdalmo, @JakeRuss, @janschulz, @jasonelaw, @javierluraschi, @jayhesselberth, @jcheng5, @jdnewmil, @jefferis, @jennybc, @jenzopr, @jeremystan, @jeroen, @jgabry, @jhuovari, @jiho, @jimhester, @jirkalewandowski, @jjallaire, @jmarshallnz, @jmi5, @joethorley, @JoFrhwld, @jonboiser, @jonmcalder, @joranE, @joshkatz, @jrnold, @juba, @junkka, @justmarkham, @kalibera, @karawoo, @karthik, @Katiedaisey, @kbenoit, @Kevin-M-Smith, @kevinushey, @kmillar, @kohske, @krlmlr, @kwenzig, @kwstat, @KZARCA, @l-d-s, @LaDilettante, @larmarange, @leondutoit, @lepennec, @lindbrook, @lionel-, @lmullen, @lorenzwalthert, @lselzer, @luckyrandom, @LucyMcGowan, @lwjohnst86, @MarcusWalz, @markdly, @markriseley, @matthieugomez, @maurolepore, @mdlincoln, @mgacc0, @mgirlich, @michaelquinn32, @mikelove, 
@mkcor, @mkuehn10, @mkuhn, @mmparker, @msonnabaum, @ncarchedi, @NoahMarconi, @noamross, @npjc, @nutterb, @paternogbc, @paul-buerkner, @PedramNavid, @PeteHaitch, @pierucci, @pimentel, @pitakakariki, @pkq, @r2evans, @rbdixon, @richierocks, @RiRam, @rmsharp, @robertzk, @rohan-shah, @romainfrancois, @RoyalTS, @rsaporta, @rtaph, @rudazhan, @ruderphilipp, @s-fleck, @seaaan, @setempler, @sfirke, @shabbybanks, @sjackman, @sjPlot, @smbache, @statisfactions, @steromano, @t-kalinowski, @tareefk, @tdhock, @terrytangyuan, @thomasp85, @tjmahr, @tklebel, @tmshn, @tonytonov, @tuttinator, @tverbeke, @uribo, @vspinu, @wch, @webbedfeet, @wibeasley, @wligtenberg, @x0rshift, @xiaodaigh, @Yeedle, @yutannihilation, @zeehio, @zhaoy, and @zhilongjia

Slide 5

Slide 5 text

My goal is to make a pit of success

Slide 6

Slide 6 text

Solve complex problems by combining simple pieces that have a consistent structure

Slide 7

Slide 7 text


Slide 8

Slide 8 text

df$a[df$a == -99] <- NA
df$b[df$b == -99] <- NA
df$e[df$e == -99] <- NA
df$f[df$f == -99] <- NA
df$g[df$g == -98] <- NA
df$h[df$h == -99] <- NA
df$i[df$i == -99] <- NA
df$i[df$j == -99] <- NA
df$k[df$k == -99] <- NA
df$l[df$l == -99] <- NA
df$m[df$m == -99] <- NA
df$n[df$n == -99] <- NA

What’s the point of this code? What’s wrong?

Slide 9

Slide 9 text

df$a[df$a == -99] <- NA
df$b[df$b == -99] <- NA
# c & d are character variables
df$e[df$e == -99] <- NA
df$f[df$f == -99] <- NA
df$g[df$g == -98] <- NA
df$h[df$h == -99] <- NA
df$i[df$i == -99] <- NA
df$i[df$j == -99] <- NA
df$k[df$k == -99] <- NA
df$l[df$l == -99] <- NA
df$m[df$m == -99] <- NA
df$n[df$n == -99] <- NA

Duplicated code hides intent & errors

Slide 10

Slide 10 text

fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}

df$a <- fix_missing(df$a)
df$b <- fix_missing(df$b)
df$e <- fix_missing(df$e)
df$f <- fix_missing(df$f)
df$g <- fix_missing(df$g)
df$h <- fix_missing(df$h)
df$i <- fix_missing(df$i)
df$j <- fix_missing(df$j)
df$k <- fix_missing(df$k)

Create a function whenever you’ve pasted >3 times

Slide 11

Slide 11 text

fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}

df <- purrr::modify_if(df, is.numeric, fix_missing)

Learn FP tools to remove even more duplication
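
As a rough illustration of what the modify_if() call does, the sketch below applies fix_missing() to only the numeric columns of a small data frame. It uses base R's vapply() and lapply() as a stand-in for purrr so that it runs without extra packages; the data frame `toy` and its values are made up for the example.

```r
# Replace the sentinel value -99 with NA in a single vector.
fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}

# Hypothetical toy data: two numeric columns and one character column.
toy <- data.frame(
  a = c(1, -99, 3),
  b = c(-99, 2, 5),
  c = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Base R stand-in for purrr::modify_if(toy, is.numeric, fix_missing):
# find the numeric columns, then rewrite only those.
is_num <- vapply(toy, is.numeric, logical(1))
toy[is_num] <- lapply(toy[is_num], fix_missing)
```

The character column is never passed to fix_missing(), which mirrors the "c & d are character variables" caveat from the earlier slide.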

Slide 12

Slide 12 text

Simple pieces

Slide 13

Slide 13 text

Generally, want functions like legos

Slide 14

Slide 14 text

Not like playmobil

Slide 15

Slide 15 text

What is a simple function?
Does one thing well
Needs minimal context to be understood

Slide 16

Slide 16 text

Computes a value / Changes the world
One thing well
Minimal context
Type stable
Obey scoping rules
No hidden arguments
Evocative

Slide 17

Slide 17 text

Computes a value / Changes the world
One thing well
Minimal context
Type stable
Obey scoping rules
No hidden arguments
Evocative

Slide 18

Slide 18 text

Computes a value / Changes the world

Slide 19

Slide 19 text

print()
mean()
mutate()
write_csv()
+ geom_line()
<-
runif()

Which is which?

Slide 20

Slide 20 text

# Computes a value
mean()
mutate()
+ geom_line()

# Changes the world
print()
write_csv()
<-

Which is which?

Slide 21

Slide 21 text

runif(5)
#> [1] 0.5530 0.0138 0.8774 0.9225 0.0606
runif(5)
#> [1] 0.8210 0.0459 0.6008 0.4323 0.3644

Some functions must do both

Slide 22

Slide 22 text

.Random.seed[2]
#> [1] 624
runif(5)
#> [1] 0.0808 0.8343 0.6008 0.1572 0.0074
.Random.seed[2]
#> [1] 5

Some functions must do both
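
A small sketch of the same point: runif() returns a value and also updates the generator state held in .Random.seed. The seed 2017 and the variable names are arbitrary choices for the example.

```r
# Fix the generator state so the example is repeatable.
set.seed(2017)
state_before <- .Random.seed

# Drawing numbers computes a value...
first <- runif(5)

# ...and also changes the world: the stored state has moved on.
state_after <- .Random.seed
second <- runif(5)

# Resetting the state replays exactly the same draws.
set.seed(2017)
replay <- runif(5)
```

Because of the hidden state change, two identical-looking calls give different results unless the seed is reset in between.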

Slide 23

Slide 23 text

mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   37.285      1.878   19.86  < 2e-16 ***
#> wt            -5.344      0.559   -9.56  1.3e-10 ***
#> ---
#>
#> Residual standard error: 3.05 on 30 degrees of freedom
#> Multiple R-squared: 0.753, Adjusted R-squared: 0.745
#> F-statistic: 91.4 on 1 and 30 DF, p-value: 1.29e-10

Base R generally does this well
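
One way to read "does this well", sketched under the assumption that the slide is contrasting computing with printing: summary() only computes an object, and the printing happens separately, so the same numbers can be reused in code. The variable names below are chosen for the example.

```r
mod <- lm(mpg ~ wt, data = mtcars)

# Computes a value: assigning the result prints nothing.
mod_summary <- summary(mod)

# The computed object can be queried like any other value.
coefs <- coef(mod_summary)       # matrix of estimates, std. errors, etc.
slope <- coefs["wt", "Estimate"]
r_squared <- mod_summary$r.squared

# Changes the world: printing is a separate, explicit step.
print(mod_summary)
```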

Slide 24

Slide 24 text

So the exceptions are extra frustrating

[Figure: four diagnostic plots for the linear model: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Labelled points in each panel: Fiat 128, Toyota Corolla, Chrysler Imperial.]

Slide 25

Slide 25 text

Type stability

Slide 26

Slide 26 text

“Mama always said type-unstable functions are like a box of chocolates. You never know what you’re gonna get.”
 — Hadley Gump

Slide 27

Slide 27 text

Type-stable functions

[Diagram labels: f f f, g g g]

Regardless of the input, a type-stable function gives the same type of output
It’s harder to predict the result of a type-unstable function

Slide 28

Slide 28 text

iris

# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3.0          1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5.0         3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.6         3.4          1.4         0.3 setosa
 8          5.0         3.4          1.5         0.2 setosa
 9          4.4         2.9          1.4         0.2 setosa
10          4.9         3.1          1.5         0.1 setosa
# ... with 140 more rows

Slide 29

Slide 29 text

find_vars <- function(df, predicate) {
  vars <- sapply(df, predicate)
  df[, vars]
}

find_vars(iris, is.numeric)
find_vars(iris, is.factor)
# For experts only:
find_vars(iris[, 0], is.numeric)

What will this function return?
iris has four numeric variables and one factor

Slide 30

Slide 30 text

class(find_vars(iris, is.numeric))
#> [1] "data.frame"
class(find_vars(iris, is.factor))
#> [1] "factor"
find_vars(iris[, 0], is.numeric)
#> Error in .subset(x, j):
#> invalid subscript type 'list'

Slide 31

Slide 31 text

find_vars <- function(df, predicate) {
  vars <- sapply(df, predicate)
  df[, vars]
}

sapply() & [.data.frame are type-unstable
sapply() returns a vector, matrix, or list
[.data.frame returns a vector or data frame
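
The instability can be seen directly with a few small calls. This is an illustrative sketch; the anonymous functions and variable names are made up for the example.

```r
# One value per element: sapply() simplifies to a vector.
as_vector <- sapply(1:3, function(i) i * 2)

# Equal-length results longer than one: sapply() simplifies to a matrix.
as_matrix <- sapply(1:3, function(i) c(i, i^2))

# Results of different lengths: no simplification, so a list comes back.
as_list <- sapply(1:3, function(i) seq_len(i))

# Zero-length input: an empty list, which is what breaks the iris[, 0] case.
as_empty <- sapply(integer(0), function(i) i * 2)

# `[` on a data frame drops to a bare vector when one column is selected...
one_col <- iris[, "Species"]

# ...but returns a data frame when two or more columns are selected.
two_cols <- iris[, c("Sepal.Length", "Species")]
```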

Slide 32

Slide 32 text

find_vars <- function(df, predicate) {
  vars <- purrr::map_lgl(df, predicate)
  df[, vars, drop = FALSE]
}

Two changes make it much more predictable
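
To check the claim without installing purrr, the sketch below swaps in base R's vapply(), which like map_lgl() always returns a logical vector or fails. This is a stand-in for the slide's version, not the same code; the result names are hypothetical.

```r
# Type-stable variant: vapply() always yields a logical vector,
# and drop = FALSE always yields a data frame.
find_vars <- function(df, predicate) {
  vars <- vapply(df, predicate, logical(1))
  df[, vars, drop = FALSE]
}

numeric_vars <- find_vars(iris, is.numeric)      # four columns
factor_vars <- find_vars(iris, is.factor)        # one column, still a data frame
no_vars <- find_vars(iris[, 0], is.numeric)      # zero columns, no error
```

All three calls now return a data frame, so downstream code does not need to special-case the one-column or zero-column inputs.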

Slide 33

Slide 33 text

Combining simple pieces

Slide 34

Slide 34 text

by_dest <- group_by(flights, dest)
dest_delay <- summarise(by_dest,
  delay = mean(dep_delay, na.rm = TRUE),
  n = n()
)
big_dest <- filter(dest_delay, n > 100)
arrange(big_dest, desc(delay))

Base R has two ways to combine functions

Slide 35

Slide 35 text

foo <- group_by(flights, dest)
foo <- summarise(foo,
  delay = mean(dep_delay, na.rm = TRUE),
  n = n()
)
foo <- filter(foo, n > 100)
arrange(foo, desc(delay))

But naming is hard work

Slide 36

Slide 36 text

foo1 <- group_by(flights, dest)
foo2 <- summarise(foo1,
  delay = mean(dep_delay, na.rm = TRUE),
  n = n()
)
foo3 <- filter(foo2, n > 100)
arrange(foo2, desc(delay))

But naming is hard work

Slide 37

Slide 37 text

arrange(
  filter(
    summarise(
      group_by(flights, dest),
      delay = mean(dep_delay, na.rm = TRUE),
      n = n()
    ),
    n > 100
  ),
  desc(delay)
)

Alternatively, you could nest function calls

Slide 38

Slide 38 text

magrittr provides a third option: %>%

Slide 39

Slide 39 text

The pipe

x %>% f()
# Is the same as
f(x)

x %>% f() %>% g(y)
# Is the same as
g(f(x), y)
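
The two equivalences can be checked on a small vector. This sketch assumes the magrittr package is installed; sqrt(), sum(), and rep() simply play the roles of f and g, and the variable names are made up.

```r
library(magrittr)

x <- c(1, 4, 9)

# x %>% f() %>% g() is the same as g(f(x))
piped <- x %>% sqrt() %>% sum()
nested <- sum(sqrt(x))

# x %>% f() %>% g(y) is the same as g(f(x), y):
# the piped value becomes the first argument, y stays where it is.
piped_with_arg <- x %>% sum() %>% rep(times = 2)
nested_with_arg <- rep(sum(x), times = 2)
```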

Slide 40

Slide 40 text

flights %>%
  group_by(dest) %>%
  summarise(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
  ) %>%
  filter(n > 100) %>%
  arrange(desc(delay))

This is easy to read & doesn’t require naming

Slide 41

Slide 41 text

library(tidyverse)
library(magick)

dir(pattern = ".png") %>%
  map(image_read) %>%
  image_join() %>%
  image_animate(fps = 1, loop = 25) %>%
  image_write("my_animation.gif")

Makes it easy to read unfamiliar code
What does this code do?

Slide 42

Slide 42 text

Slide 43

Slide 43 text

Columns: left-to-right, Omits intermediate values, Non-linear
y <- f(x); g(y)       ✅ ✅
g(f(x))               ✅ ✅
x %>% f() %>% g()     ✅ ✅

Slide 44

Slide 44 text

flights %>%
  group_by(date) %>%
  summarise(n = n()) %>%
  ggplot(aes(date, n)) +
  geom_line()

What happens if your pieces aren’t simple functions?

Slide 45

Slide 45 text

ggsave(
  flights %>%
    group_by(date) %>%
    summarise(n = n()) %>%
    ggplot(aes(date, n)) +
    geom_line(),
  "my-plot.pdf"
)

Which makes it quite inconsistent

Slide 46

Slide 46 text

# library(ggplot1)
flights %>%
  group_by(date) %>%
  summarise(n = n()) %>%
  ggplot(aes(date, n)) %>%
  ggpoint() %>%
  ggsave("my-plot.pdf")

Interestingly, ggplot did not have this problem

Slide 47

Slide 47 text

flights %>%
  group_by(dest) %>%
  summarise(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
  ) %>%
  filter(n > 100) %>%
  arrange(desc(delay)) -> dest_delays

Another interesting connection is ->

Slide 48

Slide 48 text

dest_delays <- flights %>%
  group_by(dest) %>%
  summarise(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
  ) %>%
  filter(n > 100) %>%
  arrange(desc(delay))

But leading with assignment improves readability

Slide 49

Slide 49 text

Consistent structure

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

Simple and have a consistent structure

Slide 52

Slide 52 text

CC-BY-NC

Slide 53

Slide 53 text

Tidy data is a consistent way of storing data
1. Each dataset goes in a data frame.
2. Each variable goes in a column.

Slide 54

Slide 54 text

Tidy datasets are all alike; 
 every messy dataset is 
 messy in its own way — Hadley Tolstoy

Slide 55

Slide 55 text

# A tibble: 5,769 × 22
   iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514  f014 f1524
 1 AD    1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2 AD    1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 3 AD    1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 4 AD    1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5 AD    1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6 AD    1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7 AD    1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA     0     1
 8 AD    1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA     0     1
 9 AD    1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA    NA    NA
10 AD    1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA     0     0
11 AD    2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA    NA    NA
12 AD    2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA    NA    NA
13 AD    2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA     0     1
14 AD    2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA     0     1
15 AD    2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA     0     0
16 AD    2005     0     0     0     0     1     1     0     0     0     0     0     0     0     1
17 AD    2006     0     0     0     1     1     2     0     1     1     0     0     0     0     0
# ... with 5,752 more rows, and 6 more variables: f2534, f3544, f4554,
#   f5564, f65, fu

Messy data has a varied shape
What are the variables in this dataset?
(Hint: f = female, u = unknown, 1524 = 15-24)

Slide 56

Slide 56 text

# A tibble: 35,750 × 5
   country  year sex   age       n
 1 AD       1996 f     014       0
 2 AD       1996 f     1524      1
 3 AD       1996 f     2534      1
 4 AD       1996 f     3544      0
 5 AD       1996 f     4554      0
 6 AD       1996 f     5564      1
 7 AD       1996 f     65        0
 8 AD       1996 m     014       0
 9 AD       1996 m     1524      0
10 AD       1996 m     2534      0
# ... with 35,740 more rows

Tidy data has a uniform shape
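
The move from the messy shape to the tidy one can be sketched in base R on a tiny excerpt. The excerpt below keeps only two years and four of the count columns from the messy table, and the reshaping approach (splitting each column name into sex and age) is an assumption about how the tidy table was derived, not the author's code.

```r
# Two rows and four count columns taken from the messy table (AD, 1996-1997).
messy <- data.frame(
  iso2 = c("AD", "AD"),
  year = c(1996, 1997),
  m014 = c(0, 0),
  m1524 = c(0, 0),
  f014 = c(0, 0),
  f1524 = c(1, 1),
  stringsAsFactors = FALSE
)

# Every column except the identifiers encodes sex (first letter) and age (rest).
count_cols <- setdiff(names(messy), c("iso2", "year"))

# Build one block of rows per count column, then stack the blocks.
tidy <- do.call(rbind, lapply(count_cols, function(col) {
  data.frame(
    country = messy$iso2,
    year = messy$year,
    sex = substr(col, 1, 1),
    age = substr(col, 2, nchar(col)),
    n = messy[[col]],
    stringsAsFactors = FALSE
  )
}))
```

Each variable (country, year, sex, age, n) now has its own column, so grouping or filtering by sex or age needs no knowledge of the original column names.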

Slide 57

Slide 57 text

tidytext by Julia Silge & David Robinson

Slide 58

Slide 58 text

The family of Dashwood had long been settled in Sussex. Their estate was large, and their residence was at Norland Park, in the centre of their property, where, for many generations, they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance. — Sense & Sensibility, Jane Austen

Slide 59

Slide 59 text

# A tibble: 724,880 × 4
   book                linenumber chapter word
 1 Sense & Sensibility         10       1 chapter
 2 Sense & Sensibility         10       1 1
 3 Sense & Sensibility         13       1 the
 4 Sense & Sensibility         13       1 family
 5 Sense & Sensibility         13       1 of
 6 Sense & Sensibility         13       1 dashwood
 7 Sense & Sensibility         13       1 had
 8 Sense & Sensibility         13       1 long
 9 Sense & Sensibility         13       1 been
10 Sense & Sensibility         13       1 settled
# ... with 724,870 more rows

tidytext provides an answer
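
The one-word-per-row idea can be sketched with base R alone. This is not tidytext's tokenizer, only a simplified stand-in that lowercases the text and splits on anything that is not a letter; the variable names are hypothetical.

```r
# First sentence of the passage quoted on the earlier slide.
text <- "The family of Dashwood had long been settled in Sussex."

# Lowercase, split on runs of non-letters, and drop any empty pieces.
words <- strsplit(tolower(text), "[^a-z]+")[[1]]
words <- words[words != ""]

# One row per word, with the book repeated alongside each token.
tokens <- data.frame(
  book = "Sense & Sensibility",
  word = words,
  stringsAsFactors = FALSE
)
```

Once text is in this shape, counting or joining words uses the same data frame tools as any other tidy dataset.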

Slide 60

Slide 60 text

Sentiment of Jane Austen books

[Figure: one panel per book (Emma, Northanger Abbey, Persuasion, Sense & Sensibility, Pride & Prejudice, Mansfield Park); axis label: sentiment]

Slide 61

Slide 61 text

sf by Edzer Pebesma

Slide 62

Slide 62 text

[Figure: map spanning 34°N to 36.5°N and 84°W to 76°W, shaded by AREA (legend from 0.05 to 0.20)]

Slide 63

Slide 63 text

nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"))
nc %>%
  as_tibble() %>%
  select(NAME, FIPS, AREA, geometry)
#> # A tibble: 100 × 4
#>    NAME        FIPS   AREA geometry
#>
#>  1 Ashe        37009 0.114
#>  2 Alleghany   37005 0.061
#>  3 Surry       37171 0.143
#>  4 Currituck   37053 0.070
#>  5 Northampton 37131 0.153
#>  6 Hertford    37091 0.097
#>  7 Camden      37029 0.062
#>  8 Gates       37073 0.091
#>  9 Warren      37185 0.118
#> 10 Stokes      37169 0.124
#> # ... with 90 more rows

Store complex geometries in a list-column

Slide 64

Slide 64 text

What if you have complex data?
1. Each dataset goes in a tibble.
2. Each variable goes in a column.

Slide 65

Slide 65 text

df <- tibble(xyz = "a")
df$x
#> Warning: Unknown column 'x'
#> NULL
df$xyz
#> [1] "a"

Tibbles are data frames that are lazy & surly
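
For contrast, here is what a base data frame does with the same column name. The sketch assumes the tibble behaviour shown above is a reaction to this partial matching; the variable names are made up.

```r
# A base data frame with the same single column.
base_df <- data.frame(xyz = "a", stringsAsFactors = FALSE)

# `$` on a data frame does partial matching: "x" quietly resolves to "xyz".
partial <- base_df$x

# `[[` matches names exactly, so the misspelled name finds nothing.
exact <- base_df[["x"]]
```

A typo in a column name can therefore go unnoticed with `$` on a data frame, which is the kind of surprise the tibble warning is guarding against.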

Slide 66

Slide 66 text

data.frame(x = list(1:2, 3:5))
#> Error: arguments imply differing number
#> of rows: 2, 3
tibble(x = list(1:2, 3:5))
#> # A tibble: 2 x 1
#>   x
#>
#> 1
#> 2

But also have better support for list-cols
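
A small sketch of working with a list-column once it exists. To stay free of extra packages it builds the column on a base data frame by assigning it after construction, which is a workaround rather than the tibble() call shown above; the column names `id`, `n`, and `total` are hypothetical.

```r
# Start with an ordinary two-row data frame.
df <- data.frame(id = c("first", "second"), stringsAsFactors = FALSE)

# Assigning a list of length nrow(df) creates a list-column:
# each row holds a whole vector, and the vectors may differ in length.
df$x <- list(1:2, 3:5)

# Per-row summaries are computed by mapping over the list-column.
df$n <- lengths(df$x)
df$total <- vapply(df$x, sum, integer(1))
```

The vectors stay attached to their row, so filtering or reordering the data frame keeps each `x` together with its `id`.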

Slide 67

Slide 67 text

List-columns keep related things together
Anything can go in a list & a list can go in a data frame

Slide 68

Slide 68 text


Slide 69

Slide 69 text

Solve complex problems by combining simple pieces that have a consistent structure

Slide 70

Slide 70 text

Solve complex problems by combining simple pieces that have a consistent structure
Functions that do one thing well & can be understood with minimal context

Slide 71

Slide 71 text

Solve complex problems by combining simple pieces that have a consistent structure
With assignment, composition, or the pipe

Slide 72

Slide 72 text

Solve complex problems by combining simple pieces that have a consistent structure
Tidy tibbles have variables in columns and cases in rows.
List-cols can store richer data structures

Slide 73

Slide 73 text

[Diagram: workflow stages and packages]
Stages: Tidy, Import, Visualise, Transform, Model, Program
Packages: tibble, tidyr, purrr, magrittr, dplyr, forcats, hms, ggplot2, broom, modelr, readr, readxl, haven, xml2, lubridate, stringr

Slide 74

Slide 74 text

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit