
Tidyverse


Given at the Portland R meetup.


Hadley Wickham

October 10, 2016

Transcript

  1. The tidyverse. Hadley Wickham (@hadleywickham), Chief Scientist, RStudio. October 2016
  2. Import, Tidy, Transform, Visualise, Model, Communicate, Program

    Tidy: consistent way of storing data. Transform: create new variables & new summaries. Visualise: surprises, but doesn't scale. Model: scales, but doesn't (fundamentally) surprise.
  3. No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system. (Hal Abelson)
  4. Tidy Import Visualise Transform Model Communicate Program

  5. Tidy Import Visualise Transform Model Communicate Program

  6. The tidy tools manifesto

  7. tidyverse (http://r4ds.had.co.nz)

    Import: readr, readxl, haven, httr, jsonlite, DBI, rvest, xml2
    Tidy: tibble, tidyr
    Transform: dplyr, forcats, hms, lubridate, stringr
    Visualise: ggplot2
    Model: broom, modelr
    Program: purrr, magrittr
  8. 1. Share data structures. 2. Compose simple pieces. 3. Embrace FP. 4. Write for humans.
  9. 1 Share data structures

  10. Tidy data: 1. Put each dataset in a data frame. 2. Put each variable in a column.
  11. Messy data has a varied shape

    # A tibble: 5,769 × 22
        iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514  f014 f1524
       <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
     1    AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
     2    AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
     3    AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
     4    AD  1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
     5    AD  1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
     6    AD  1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
     7    AD  1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA     0     1
     8    AD  1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA     0     1
     9    AD  1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA    NA    NA
    10    AD  1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA     0     0
    11    AD  2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA    NA    NA
    12    AD  2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA    NA    NA
    13    AD  2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA     0     1
    14    AD  2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA     0     1
    15    AD  2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA     0     0
    16    AD  2005     0     0     0     0     1     1     0     0     0     0     0     0     0     1
    17    AD  2006     0     0     0     1     1     2     0     1     1     0     0     0     0     0
    # ... with 5,752 more rows, and 6 more variables: f2534 <int>, f3544 <int>, f4554 <int>,
    #   f5564 <int>, f65 <int>, fu <int>

    What are the variables in this dataset? (Hint: f = female, u = unknown, 1524 = 15-24)
  12. Tidy data has a uniform shape

    # A tibble: 35,750 × 5
       country  year   sex   age     n
         <chr> <int> <chr> <chr> <int>
     1      AD  1996     f   014     0
     2      AD  1996     f  1524     1
     3      AD  1996     f  2534     1
     4      AD  1996     f  3544     0
     5      AD  1996     f  4554     0
     6      AD  1996     f  5564     1
     7      AD  1996     f    65     0
     8      AD  1996     m   014     0
     9      AD  1996     m  1524     0
    10      AD  1996     m  2534     0
    # ... with 35,740 more rows
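The slides show the messy and tidy tables but not the code that converts one into the other. A minimal sketch of that reshaping, assuming a recent tidyr (`pivot_longer()` is the current tidyr function and is not necessarily what the talk used); the `messy` table is a hypothetical miniature with made-up values, not the real dataset:

```r
library(tibble)
library(tidyr)

# Hypothetical miniature of the wide table: one column per sex/age combination
messy <- tribble(
  ~iso2, ~year, ~m014, ~m1524, ~f014, ~f1524,
  "AD",  1996L,    0L,     0L,    0L,     1L,
  "AD",  1997L,    0L,     0L,    0L,     1L
)

# Move the sex/age column names into two variables, `sex` and `age`,
# and the cell values into a single count column `n`
tidy <- pivot_longer(
  messy,
  cols = -c(iso2, year),
  names_to = c("sex", "age"),
  names_pattern = "(.)(.+)",   # first character is sex, the rest is the age band
  values_to = "n"
)

tidy
```

Each variable (country code, year, sex, age, count) now sits in its own column, which is the uniform shape the slide refers to.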
  13. Happy families are all alike; every unhappy family is unhappy in its own way (Leo Tolstoy)
  14. Tidy datasets are all alike; every messy dataset is messy in its own way (Hadley Wickham)
  15. Tidy data is ~80% of data structures. Also: strings, dates, factors, matrices, vectors, xml, HTTP requests, HTTP response. http://simplystatistics.org/2016/02/17/non-tidy-data/
  16. What if you have a mix? Cross-validation: training data (data frame), test data (data frame), model (lm), predictions (vector), RMSE (scalar)
  17. Use a tibble with list-columns!

    # A tibble: 100 x 5
                train           test   .id      mod      rmse
               <list>         <list> <chr>   <list>     <dbl>
     1 <S3: resample> <S3: resample>   001 <S3: lm> 0.5661605
     2 <S3: resample> <S3: resample>   002 <S3: lm> 0.2399357
     3 <S3: resample> <S3: resample>   003 <S3: lm> 3.5482986
     4 <S3: resample> <S3: resample>   004 <S3: lm> 0.2396810
     5 <S3: resample> <S3: resample>   005 <S3: lm> 0.1591336
     6 <S3: resample> <S3: resample>   006 <S3: lm> 0.1934869
     7 <S3: resample> <S3: resample>   007 <S3: lm> 0.2697834
     8 <S3: resample> <S3: resample>   008 <S3: lm> 0.4910886
     9 <S3: resample> <S3: resample>   009 <S3: lm> 1.7002645
    10 <S3: resample> <S3: resample>   010 <S3: lm> 0.2047787
    # ... with 90 more rows
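The cross-validation table above mixes data frames, models, and scalars in one tibble. A small sketch of the same idea, using per-group models on `mtcars` rather than the resampling shown on the slide; the column names `rows`, `mod`, and `r2` are illustrative choices, not from the talk:

```r
library(tibble)
library(purrr)

# One data frame per cylinder count
by_cyl <- split(mtcars, mtcars$cyl)

models <- tibble(
  cyl  = names(by_cyl),
  rows = unname(by_cyl),                         # list-column of data frames
  mod  = map(rows, ~ lm(mpg ~ wt, data = .x)),   # list-column of fitted lm objects
  r2   = map_dbl(mod, ~ summary(.x)$r.squared)   # ordinary numeric column
)

models
```

Keeping the data, the model, and the summary on one row means they cannot drift out of sync when the table is filtered or sorted.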
  18. Your turn!

    df <- data.frame(xyz = "a")
    # What does this return?
    df$x
  19. Your turn!

    df <- data.frame(xyz = "a")
    # What does this return?
    df$x
    #> [1] a
    #> Levels: a

    Two surprises: partial name matching & stringsAsFactors
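The two surprises can be checked in base R. Since R 4.0.0 the default is `stringsAsFactors = FALSE`, so this sketch sets it explicitly to reproduce the behaviour on the slide; the variable names are illustrative:

```r
# stringsAsFactors = TRUE reproduces the pre-4.0 default used on the slide
df <- data.frame(xyz = "a", stringsAsFactors = TRUE)

partial <- df$x        # `$` partially matches "x" to the column "xyz"
exact   <- df[["x"]]   # `[[` matches exactly by default, so this is NULL

is.factor(partial)     # the string was converted to a factor
is.null(exact)

# With stringsAsFactors = FALSE the column stays a character vector
df_chr <- data.frame(xyz = "a", stringsAsFactors = FALSE)
is.character(df_chr$xyz)
```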
  20. Two important tensions for understanding base R: interactive exploration vs programming; conservative vs utopian
  21. Tibbles are data frames that are lazy & surly

    df <- tibble(xyz = "a")
    df$xyz
    #> [1] "a"
    is.data.frame(df[, "xyz"])
    #> [1] TRUE
    df$x
    #> Warning: Unknown column 'x'
    #> NULL
  22. And work better with list-columns

    data.frame(x = list(1:2, 3:5))
    #> Error: arguments imply differing number
    #> of rows: 2, 3
  23. And work better with list-columns

    data.frame(x = list(1:2, 3:5))
    #> Error: arguments imply differing number
    #> of rows: 2, 3
    data.frame(x = I(list(1:2, 3:5)))
    #>         x
    #> 1    1, 2
    #> 2 3, 4, 5
  24. And work better with list-columns

    data.frame(x = list(1:2, 3:5))
    #> Error: arguments imply differing number
    #> of rows: 2, 3
    data.frame(x = I(list(1:2, 3:5)))
    #>         x
    #> 1    1, 2
    #> 2 3, 4, 5
    tibble(x = list(1:2, 3:5))
    #> # A tibble: 2 x 1
    #>           x
    #>      <list>
    #> 1 <int [2]>
    #> 2 <int [3]>
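The tibble behaviours from the last few slides can be compared side by side with a base data frame. A short sketch, assuming the tibble package is installed; the object names are illustrative, and the exact warning text varies between tibble versions, so it is suppressed here:

```r
library(tibble)

base_df <- data.frame(xyz = "a", n = 1)
tbl     <- tibble(xyz = "a", n = 1)

# Selecting one column with `[`: the data frame drops to a vector,
# the tibble stays a tibble
is.data.frame(base_df[, "xyz"])
is_tibble(tbl[, "xyz"])

# No partial matching: an unknown column gives NULL (with a warning)
missing_col <- suppressWarnings(tbl$x)

# List-columns work without the I() wrapper
nested <- tibble(x = list(1:2, 3:5))
lengths(nested$x)
```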
  25. 2 Compose simple pieces

  26. Goal: Solve complex problems by combining uniform pieces.

  27. https://www.flickr.com/photos/brunurb/13129057003

  28. http://brickartist.com/gallery/pc-magazine-computer/. CC-BY-NC

  29. %>% magrittr::

  30. foo_foo <- little_bunny()

    bop_on(
      scoop_up(
        hop_through(foo_foo, forest),
        field_mouse
      ),
      head
    )

    # vs

    foo_foo %>%
      hop_through(forest) %>%
      scoop_up(field_mouse) %>%
      bop_on(head)
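The bunny functions above are not real, so here is a runnable sketch of the same contrast using base functions and magrittr's pipe; the vector `x` is an arbitrary example:

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested calls read inside-out
nested <- round(mean(sqrt(x)), 1)

# The pipe passes the left-hand side as the first argument of the next call,
# so the steps read in the order they happen
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```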
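  31. Consistency across packages is important

    library(nycflights13)
    library(dplyr)
    library(ggplot2)

    flights %>%
      # flights stores year, month and day separately, so build the date first
      mutate(date = as.Date(ISOdate(year, month, day))) %>%
      group_by(date) %>%
      summarise(n = n()) %>%
      ggplot(aes(date, n)) +
      geom_line()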
  32. And ggplot2 is not even internally consistent

    ggplot(mtcars, aes(mpg, wt)) +
      geom_point() +
      geom_line() +
      ggsave("mtcars.pdf")

    x
  33. And ggplot2 is not even internally consistent

    ggsave(
      "mtcars.pdf",
      ggplot(mtcars, aes(mpg, wt)) +
        geom_point() +
        geom_line()
    )
  34. ggplot1 had a tidier API than ggplot2!

    # devtools::install_github("hadley/ggplot1")
    library(ggplot1)

    ggsave(
      ggpoint(
        ggplot(
          mtcars,
          list(x = mpg, y = wt)
        )
      ),
      "mtcars.pdf",
      width = 8,
      height = 6
    )
  35. So you can use the pipe with ggplot1

    library(ggplot1)

    mtcars %>%
      ggplot(list(x = mpg, y = wt)) %>%
      ggpoint() %>%
      ggsave("mtcars.pdf", width = 8, height = 6)

    ggplot2 never would have existed if I’d discovered the pipe 10 years earlier!
  36. [Scatterplot of mtcars: mpg on the x-axis, wt on the y-axis]
  37. One small example from Bob Rudis https://rud.is/b/2016/07/26

    library(rvest)
    library(purrr)
    library(readr)
    library(dplyr)
    library(lubridate)

    read_html("https://www.massshootingtracker.org/data") %>%
      html_nodes("a[href^='https://docs.goo']") %>%
      html_attr("href") %>%
      map_df(read_csv) %>%
      mutate(date = mdy(date)) -> shootings
  38. 3 Embrace FP. Why are for loops “bad”? Answered with cupcakes
  39. Vanilla cupcakes (The hummingbird bakery cookbook)

    1 cup flour, a scant ¾ cup sugar, 1 ½ t baking powder, 3 T unsalted butter, ½ cup whole milk, 1 egg, ¼ t pure vanilla extract

    Preheat oven to 350°F. Put the flour, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.
  40. Chocolate cupcakes (The hummingbird bakery cookbook)

    ¾ cup + 2T flour, 2 ½ T cocoa powder, a scant ¾ cup sugar, 1 ½ t baking powder, 3 T unsalted butter, ½ cup whole milk, 1 egg, ¼ t pure vanilla extract

    Preheat oven to 350°F. Put the flour, cocoa, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.
  41. Chocolate cupcakes (The hummingbird bakery cookbook)

    ¾ cup + 2T flour, 2 ½ T cocoa powder, a scant ¾ cup sugar, 1 ½ t baking powder, 3 T unsalted butter, ½ cup whole milk, 1 egg, ¼ t pure vanilla extract

    Preheat oven to 350°F. Put the flour, cocoa, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.
  42. Vanilla cupcakes (The hummingbird bakery cookbook)

    1 cup flour, a scant ¾ cup sugar, 1 ½ t baking powder, 3 T unsalted butter, ½ cup whole milk, 1 egg, ¼ t pure vanilla extract

    Preheat oven to 350°F. Put the flour, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.
  43. Vanilla cupcakes. 1. Convert units (The hummingbird bakery cookbook)

    120g flour, 140g sugar, 1.5 t baking powder, 40g unsalted butter, 120ml milk, 1 egg, 0.25 t pure vanilla extract

    Preheat oven to 170°C. Put the flour, sugar, baking powder, salt, and butter in a freestanding electric mixer with a paddle attachment and beat on slow speed until you get a sandy consistency and everything is combined. Whisk the milk, egg, and vanilla together in a pitcher, then slowly pour about half into the flour mixture, beat to combine, and turn the mixer up to high speed to get rid of any lumps. Turn the mixer down to a slower speed and slowly pour in the remaining milk mixture. Continue mixing for a couple of more minutes until the batter is smooth but do not overmix. Spoon the batter into paper cases until 2/3 full and bake in the preheated oven for 20-25 minutes, or until the cake bounces back when touched.
  44. Vanilla cupcakes. 2. Rely on domain knowledge (The hummingbird bakery cookbook)

    120g flour, 140g sugar, 1.5 t baking powder, 40g butter, 120ml milk, 1 egg, 0.25 t vanilla

    Beat flour, sugar, baking powder, salt, and butter until sandy. Whisk milk, egg, and vanilla. Mix half into flour mixture until smooth (use high speed). Beat in remaining half. Mix until smooth. Bake 20-25 min at 170°C.
  45. Vanilla cupcakes. 3. Use variables (The hummingbird bakery cookbook)

    Dry: 120g flour, 140g sugar, 1.5 t baking powder. 40g butter. Wet: 120ml milk, 1 egg, 0.25 t vanilla.

    Beat dry ingredients + butter until sandy. Whisk together wet ingredients. Mix half into dry until smooth (use high speed). Beat in remaining half. Mix until smooth. Bake 20-25 min at 170°C.
  46. Cupcakes. 4. Extract out common code

    Vanilla: 120g flour, 140g sugar, 1.5t baking powder, 40g butter, 120ml milk, 1 egg, 0.25 t vanilla
    Chocolate: 100g flour, 20g cocoa, 140g sugar, 1.5t baking powder, 40g butter, 120ml milk, 1 egg, 0.25 t vanilla

    Beat dry ingredients + butter until sandy. Whisk together wet ingredients. Mix half into dry until smooth (use high speed). Beat in remaining half. Mix until smooth. Bake 20-25 min at 170°C.
  47. What do these for loops do?

    out1 <- vector("double", ncol(mtcars))
    for (i in seq_along(mtcars)) {
      out1[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
    }

    out2 <- vector("double", ncol(mtcars))
    for (i in seq_along(mtcars)) {
      out2[[i]] <- median(mtcars[[i]], na.rm = TRUE)
    }
  48. For loops emphasise the objects

    out1 <- vector("double", ncol(mtcars))
    for (i in seq_along(mtcars)) {
      out1[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
    }

    out2 <- vector("double", ncol(mtcars))
    for (i in seq_along(mtcars)) {
      out2[[i]] <- median(mtcars[[i]], na.rm = TRUE)
    }
  49. Not the actions

    out1 <- vector("double", ncol(mtcars))
    for (i in seq_along(mtcars)) {
      out1[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
    }

    out2 <- vector("double", ncol(mtcars))
    for (i in seq_along(mtcars)) {
      out2[[i]] <- median(mtcars[[i]], na.rm = TRUE)
    }
  50. Functional programming emphasises the actions

    library(purrr)

    means <- map_dbl(mtcars, mean)
    medians <- map_dbl(mtcars, median)
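The step from the two loops to `map_dbl()` is the cupcake step "extract out common code": the loop scaffolding is shared and only the summary function changes. A sketch of that intermediate step, where `col_summary()` is a hypothetical helper name introduced for illustration:

```r
library(purrr)

# The loop from the slides: allocate the output, then fill it column by column
out1 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out1[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}

# Extract the common code: the summary function becomes an argument
col_summary <- function(df, f, ...) {
  out <- vector("double", ncol(df))
  for (i in seq_along(df)) {
    out[[i]] <- f(df[[i]], ...)
  }
  out
}

means   <- col_summary(mtcars, mean, na.rm = TRUE)
medians <- col_summary(mtcars, median, na.rm = TRUE)

# purrr's map_dbl() packages the same pattern and keeps the column names
means_purrr <- map_dbl(mtcars, mean, na.rm = TRUE)
```

With the scaffolding hidden, the only thing left to read at the call site is the action (`mean`, `median`).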
  51. Teaser: simulation

    sim <- tribble(
      ~f,      ~params,
      "runif", list(min = -1, max = 1),
      "rnorm", list(sd = 5),
      "rpois", list(lambda = 10)
    )

    sim %>% mutate(sim = invoke_map(f, params, n = 10))
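`invoke_map()` was deprecated in purrr 1.0.0, so the teaser may emit a deprecation warning on current installations. One possible equivalent, sketched here with `map2()` and base `do.call()`; this is a substitute technique, not the code from the talk:

```r
library(tibble)
library(dplyr)
library(purrr)

sim <- tribble(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)

# For each row, call the function named in `f` with its own parameters
# plus the shared argument n = 10
sims <- sim %>%
  mutate(sim = map2(f, params, ~ do.call(.x, c(.y, list(n = 10)))))

lengths(sims$sim)
```

The output is random, so only the shape (three vectors of ten draws each) is stable between runs.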
  52. Teaser: saving parameterised reports

    reports <- tibble(
      class = unique(mpg$class),
      filename = paste0("fuel-economy-", class, ".html"),
      params = map(class, ~ list(my_class = .))
    )

    reports %>%
      select(output_file = filename, params) %>%
      pwalk(rmarkdown::render, input = "fuel-economy.Rmd")
  53. 4 Write for humans

  54. Programs must be written for people to read, and only incidentally for machines to execute. (Hal Abelson)
  55. tibble lubridate forcats filter mutate summarise arrange select magrittr

  56. tibble lubridate forcats filter mutate summarise arrange select magrittr Embrace

    existing language Have fun! Pack animals
  57. Conclusion

  58. 1. Share data structures. 2. Compose simple pieces. 3. Embrace FP. 4. Write for humans.
  59. My goal is to make a pit of success http://blog.codinghorror.com/falling-into-the-pit-of-success/

  60. Gotta install them all

    install.packages("tidyverse")
    library(tidyverse)
    #> Loading tidyverse: ggplot2
    #> Loading tidyverse: tibble
    #> Loading tidyverse: tidyr
    #> Loading tidyverse: readr
    #> Loading tidyverse: purrr
    #> Loading tidyverse: dplyr
    #> Conflicts with tidy packages ----------------------------------------------
    #> filter(): dplyr, stats
    #> lag():    dplyr, stats
  61. tidyverse (http://r4ds.had.co.nz)

    Import: readr, readxl, haven, httr, jsonlite, rvest, xml2
    Tidy: tibble, tidyr
    Transform: dplyr, forcats, hms, lubridate, stringr, vctrs
    Visualise: ggplot2
    Model: broom, modelr, ???
    Program: purrr, magrittr
  62. Gotta catch’em all. Tidyverse theme song, by Sean Kross

    I want to R the very best / like no one ever R'ed / To tidy them is my real test / To model them is my cause / I'll import from across the web / Searching GitHub and CRAN / Each Tidyverse package gets me closer / to results I'll understand

    Tidyverse! Gotta wrangle / It's R and me / I know it's my destiny / Tidyverse! Hadley - you're my best friend / With tidy data from end to end / Tidyverse! Gotta wrangle / A package so new / Pipes and data first will pull us through / You teach me and I'll teach you / Tidyverse!
  63. This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/