Expressing yourself with R

Hadley Wickham

September 24, 2017

Transcript

  1. Expressing yourself with R

    Hadley Wickham, @hadleywickham
    Chief Scientist, RStudio
    July 2017
  3. Tidy, Import, Visualise, Transform, Model, Program

    tibble, tidyr, purrr, magrittr, dplyr, forcats, hms, ggplot2, broom, modelr, readr, readxl, haven, xml2, lubridate, stringr
    tidyverse.org, r4ds.had.co.nz
  4. @aaronwolen, @aghaynes, @ajdamico, @ajschumacher, @alberthkcheng, @alyst, @andrew, @andrewjlm, @apjanke, @arneschillert,

    @artemklevtsov, @arunsrinivasan, @asnr, @astamm, @austenhead, @baptiste, @bbolker, @bearloga, @benmarwick, @bhive01, @BioStatMatt, @bpbond, @bquast, @BrianDiggs, @briatte, @burchill, @casallas, @cb4ds, @cboettig, @cderv, @christophergandrud, @cmartin, @colinbrislawn, @coolbutuseless, @cosinequanon, @craigcitro, @csgillespie, @ctbrown, @daattali, @dandermotj, @danliIDEA, @DanRuderman, @davharris, @davidmorrison, @dchiu911, @dchudz, @dewittpe, @dgromer, @dgrtwo, @dhimmel, @dickoa, @diogocp, @djmurphy420, @dlebauer, @dmedri, @dmenne, @dougmitarotonda, @dpastoor, @dpocock, @dtelad11, @earino, @echasnovski, @ecortens, @eddelbuettel, @edgararuiz, @edwindj, @egnha, @ehrlinger, @eibanez, @eipi10, @ekstroem, @emojiencoding, @etiennebr, @evanmiller, @fpinter, @FvD, @gaborcsardi, @gagolews, @garrettgman, @gavinsimpson, @gergness, @gnustats, @gorcha, @goyalmunish, @gregmacfarlane, @guillett, @gvelasq2, @hannesmuehleisen, @has2k1, @helix123, @hmalmedal, @hoehleatsu, @hoesler, @holstius, @hrbrmstr, @ianmcook, @ijlyttle, @ilarischeinin, @imanuelcostigan, @Ironholds, @ismayc, @isomorphisms, @itsdalmo, @JakeRuss, @janschulz, @jasonelaw, @javierluraschi, @jayhesselberth, @jcheng5, @jdnewmil, @jefferis, @jennybc, @jenzopr, @jeremystan, @jeroen, @jgabry, @jhuovari, @jiho, @jimhester, @jirkalewandowski, @jjallaire, @jmarshallnz, @jmi5, @joethorley, @JoFrhwld, @jonboiser, @jonmcalder, @joranE, @joshkatz, @jrnold, @juba, @junkka, @justmarkham, @kalibera, @karawoo, @karthik, @Katiedaisey, @kbenoit, @Kevin-M-Smith, @kevinushey, @kmillar, @kohske, @krlmlr, @kwenzig, @kwstat, @KZARCA, @l-d-s, @LaDilettante, @larmarange, @leondutoit, @lepennec, @lindbrook, @lionel-, @lmullen, @lorenzwalthert, @lselzer, @luckyrandom, @LucyMcGowan, @lwjohnst86, @MarcusWalz, @markdly, @markriseley, @matthieugomez, @maurolepore, @mdlincoln, @mgacc0, @mgirlich, @michaelquinn32, @mikelove, @mkcor, @mkuehn10, @mkuhn, @mmparker, @msonnabaum, @ncarchedi, @NoahMarconi, @noamross, @npjc, @nutterb, @paternogbc, 
@paul-buerkner, @PedramNavid, @PeteHaitch, @pierucci, @pimentel, @pitakakariki, @pkq, @r2evans, @rbdixon, @richierocks, @RiRam, @rmsharp, @robertzk, @rohan-shah, @romainfrancois, @RoyalTS, @rsaporta, @rtaph, @rudazhan, @ruderphilipp, @s-fleck, @seaaan, @setempler, @sfirke, @shabbybanks, @sjackman, @sjPlot, @smbache, @statisfactions, @steromano, @t-kalinowski, @tareefk, @tdhock, @terrytangyuan, @thomasp85, @tjmahr, @tklebel, @tmshn, @tonytonov, @tuttinator, @tverbeke, @uribo, @vspinu, @wch, @webbedfeet, @wibeasley, @wligtenberg, @x0rshift, @xiaodaigh, @Yeedle, @yutannihilation, @zeehio, @zhaoy, and @zhilongjia
  5. My goal is to make a pit of success http://blog.codinghorror.com/falling-into-the-pit-of-success/

  6. Solve complex problems by combining simple pieces that have a consistent structure
  7. Pieces

  8. What’s the point of this code? What’s wrong?

    df$a[df$a == -99] <- NA
    df$b[df$b == -99] <- NA
    df$e[df$e == -99] <- NA
    df$f[df$f == -99] <- NA
    df$g[df$g == -98] <- NA
    df$h[df$h == -99] <- NA
    df$i[df$i == -99] <- NA
    df$i[df$j == -99] <- NA
    df$k[df$k == -99] <- NA
    df$l[df$l == -99] <- NA
    df$m[df$m == -99] <- NA
    df$n[df$n == -99] <- NA
  9. Duplicated code hides intent & errors

    df$a[df$a == -99] <- NA
    df$b[df$b == -99] <- NA
    # c & d are character variables
    df$e[df$e == -99] <- NA
    df$f[df$f == -99] <- NA
    df$g[df$g == -98] <- NA
    df$h[df$h == -99] <- NA
    df$i[df$i == -99] <- NA
    df$i[df$j == -99] <- NA
    df$k[df$k == -99] <- NA
    df$l[df$l == -99] <- NA
    df$m[df$m == -99] <- NA
    df$n[df$n == -99] <- NA
  10. Create a function whenever you’ve pasted >3 times

    fix_missing <- function(x) {
      x[x == -99] <- NA
      x
    }

    df$a <- fix_missing(df$a)
    df$b <- fix_missing(df$b)
    df$e <- fix_missing(df$e)
    df$f <- fix_missing(df$f)
    df$g <- fix_missing(df$g)
    df$h <- fix_missing(df$h)
    df$i <- fix_missing(df$i)
    df$j <- fix_missing(df$j)
    df$k <- fix_missing(df$k)
  11. Learn FP tools to remove even more duplication

    fix_missing <- function(x) {
      x[x == -99] <- NA
      x
    }

    df <- purrr::modify_if(df, is.numeric, fix_missing)
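To make the `modify_if()` step concrete, here is a minimal sketch on a made-up data frame. The column names and values are assumptions for illustration only; `fix_missing()` is the function from the slide.

```r
library(purrr)

fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}

# Toy data (assumed, not from the talk): two numeric columns that use
# -99 as a missing-value code, plus a character column left untouched.
df <- data.frame(
  a = c(1, -99, 3),
  b = c(-99, 5, 6),
  c = c("x", "y", "-99"),
  stringsAsFactors = FALSE
)

# Apply fix_missing() only to the columns where is.numeric() is TRUE.
# modify_if() keeps the input's type, so the result is still a data frame.
fixed <- modify_if(df, is.numeric, fix_missing)
fixed
```

Because the predicate selects the columns, adding a new numeric column later needs no extra line of code.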
  12. Simple pieces

  13. Generally, want functions like legos

  14. Not like playmobil https://unsplash.com/photos/0VNVxhEnkII

  15. What is a simple function?

    Does one thing well
    Needs minimal context to be understood
  16. Computes a value / Changes the world; One thing well; Minimal context; Type stable; Obey scoping rules; No hidden arguments; Evocative name
  18. Computes a value / Changes the world

  19. Which is which?

    print()
    mean()
    mutate()
    write_csv()
    + geom_line()
    <-
    runif()
  20. Which is which?

    # Computes a value
    mean()
    mutate()
    + geom_line()

    # Changes the world
    print()
    write_csv()
    <-
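One way to read this split is to keep the computation and the side effect in separate functions. The sketch below is only an illustration: `mean_delay()` and `save_report()` are hypothetical names, not functions from the talk or from any package.

```r
# Hypothetical helper: computes a value and touches nothing else.
mean_delay <- function(delays) {
  mean(delays, na.rm = TRUE)
}

# Hypothetical helper: changes the world by writing a file.
# It returns its input invisibly so it can still be chained.
save_report <- function(value, path) {
  writeLines(format(value), path)
  invisible(value)
}

delays <- c(5, NA, 15, 40)
avg <- mean_delay(delays)           # pure computation, easy to test
out <- tempfile(fileext = ".txt")
save_report(avg, out)               # side effect, isolated in one place
```

Keeping the two roles apart means the computing function can be checked without touching the file system.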
  21. Some functions must do both

    runif(5)
    #> [1] 0.5530 0.0138 0.8774 0.9225 0.0606
    runif(5)
    #> [1] 0.8210 0.0459 0.6008 0.4323 0.3644
  22. Some functions must do both

    .Random.seed[2]
    #> [1] 624
    runif(5)
    #> [1] 0.0808 0.8343 0.6008 0.1572 0.0074
    .Random.seed[2]
    #> [1] 5
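The point that `runif()` both returns a value and updates hidden state can be checked directly. This is a small sketch in base R; the seed value 42 is an arbitrary choice for the example.

```r
# set.seed() creates .Random.seed in the global environment if needed.
set.seed(42)
state_before <- .Random.seed

x <- runif(5)                  # computes a value...
state_after <- .Random.seed    # ...and also advances the generator state

# The state changed, which is the "changes the world" half.
state_changed <- !identical(state_before, state_after)

# Resetting the seed restores the state, so the same values come back.
set.seed(42)
y <- runif(5)
same_draws <- identical(x, y)
```

Both `state_changed` and `same_draws` should be `TRUE`, which is why reproducible scripts call `set.seed()` before drawing random numbers.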
  23. Base R generally does this well

    mod <- lm(mpg ~ wt, data = mtcars)
    summary(mod)
    #> Coefficients:
    #>             Estimate Std. Error t value Pr(>|t|)
    #> (Intercept)   37.285      1.878   19.86  < 2e-16 ***
    #> wt            -5.344      0.559   -9.56  1.3e-10 ***
    #> ---
    #>
    #> Residual standard error: 3.05 on 30 degrees of freedom
    #> Multiple R-squared: 0.753, Adjusted R-squared: 0.745
    #> F-statistic: 91.4 on 1 and 30 DF, p-value: 1.29e-10
  24. So the exceptions are extra frustrating

    [Figure: four diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage), each labelling Fiat 128, Toyota Corolla, and Chrysler Imperial.]
  25. Type stability

  26. “Mama always said type-unstable functions are like a box of chocolates. You never know what you’re gonna get.”
 — Hadley Gump
  27. Type-stable functions

    Regardless of the input, a type-stable function gives the same type of output.
    It’s harder to predict the result of a type-unstable function.
  28. iris

    # A tibble: 150 x 5
       Sepal.Length Sepal.Width Petal.Length Petal.Width Species
              <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
     1          5.1         3.5          1.4         0.2  setosa
     2          4.9         3.0          1.4         0.2  setosa
     3          4.7         3.2          1.3         0.2  setosa
     4          4.6         3.1          1.5         0.2  setosa
     5          5.0         3.6          1.4         0.2  setosa
     6          5.4         3.9          1.7         0.4  setosa
     7          4.6         3.4          1.4         0.3  setosa
     8          5.0         3.4          1.5         0.2  setosa
     9          4.4         2.9          1.4         0.2  setosa
    10          4.9         3.1          1.5         0.1  setosa
    # ... with 140 more rows
  29. What will this function return? iris has four numeric variables and one factor

    find_vars <- function(df, predicate) {
      vars <- sapply(df, predicate)
      df[, vars]
    }

    find_vars(iris, is.numeric)
    find_vars(iris, is.factor)
    # For experts only:
    find_vars(iris[, 0], is.numeric)
  30. class(find_vars(iris, is.numeric))
      #> [1] "data.frame"
      class(find_vars(iris, is.factor))
      #> [1] "factor"
      find_vars(iris[, 0], is.numeric)
      #> Error in .subset(x, j):
      #> invalid subscript type 'list'
  31. sapply() & [.data.frame are type-unstable

    find_vars <- function(df, predicate) {
      vars <- sapply(df, predicate)   # Returns vector, matrix, or list
      df[, vars]                      # Returns vector or data frame
    }
  32. Two changes make it much more predictable

    find_vars <- function(df, predicate) {
      vars <- purrr::map_lgl(df, predicate)
      df[, vars, drop = FALSE]
    }
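The three calls from the earlier slide can be re-run against a type-stable version to confirm that each now returns a data frame. This sketch swaps in base R's `vapply()` as a stand-in for `purrr::map_lgl()` so it runs without extra packages; that substitution is an assumption of the example, not what the slide uses.

```r
# Type-stable variant: vapply() must return one logical per column,
# and drop = FALSE stops `[` from collapsing a single column to a vector.
find_vars <- function(df, predicate) {
  vars <- vapply(df, predicate, logical(1))
  df[, vars, drop = FALSE]
}

numeric_cols <- find_vars(iris, is.numeric)       # four columns
factor_cols  <- find_vars(iris, is.factor)        # one column
no_cols      <- find_vars(iris[, 0], is.numeric)  # zero columns, no error

class(numeric_cols)
class(factor_cols)
class(no_cols)
```

All three results are data frames, so code that consumes the output does not need special cases for one or zero matching columns.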
  33. Combining simple pieces

  34. Base R has two ways to combine functions

    by_dest <- group_by(flights, dest)
    dest_delay <- summarise(by_dest,
      delay = mean(dep_delay, na.rm = TRUE),
      n = n()
    )
    big_dest <- filter(dest_delay, n > 100)
    arrange(big_dest, desc(delay))
  35. But naming is hard work

    foo <- group_by(flights, dest)
    foo <- summarise(foo,
      delay = mean(dep_delay, na.rm = TRUE),
      n = n()
    )
    foo <- filter(foo, n > 100)
    arrange(foo, desc(delay))
  36. But naming is hard work

    foo1 <- group_by(flights, dest)
    foo2 <- summarise(foo1,
      delay = mean(dep_delay, na.rm = TRUE),
      n = n()
    )
    foo3 <- filter(foo2, n > 100)
    arrange(foo2, desc(delay))
  37. Alternatively, you could nest function calls

    arrange(
      filter(
        summarise(
          group_by(flights, dest),
          delay = mean(dep_delay, na.rm = TRUE),
          n = n()
        ),
        n > 100
      ),
      desc(delay)
    )
  38. magrittr provides a third option %>%

  39. The pipe

    x %>% f()
    # Is the same as
    f(x)

    x %>% f() %>% g(y)
    # Is the same as
    g(f(x), y)
  40. This is easy to read & doesn’t require naming

    flights %>%
      group_by(dest) %>%
      summarise(
        delay = mean(dep_delay, na.rm = TRUE),
        n = n()
      ) %>%
      filter(n > 100) %>%
      arrange(desc(delay))
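The pipe equivalences can be verified with two toy functions. In this sketch `f()`, `g()`, and the values of `x` and `y` are placeholders chosen for the example; only `%>%` comes from magrittr.

```r
library(magrittr)

# Placeholder functions, assumed purely for illustration.
f <- function(x) x + 1
g <- function(x, y) x * y

x <- 2
y <- 10

# x %>% f() is the same as f(x)
piped_once   <- x %>% f()
nested_once  <- f(x)

# x %>% f() %>% g(y) is the same as g(f(x), y):
# the left-hand side becomes the first argument of the next call.
piped_twice  <- x %>% f() %>% g(y)
nested_twice <- g(f(x), y)
```

The piped and nested forms give identical results; the pipe only changes the order in which the steps are written.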
  41. Makes it easy to read unfamiliar code. What does this code do?

    library(tidyverse)
    library(magick)

    dir(pattern = ".png") %>%
      map(image_read) %>%
      image_join() %>%
      image_animate(fps = 1, loop = 25) %>%
      image_write("my_animation.gif")

    https://twitter.com/ricardokriebel/status/849626401611411458
  42. https://twitter.com/ricardokriebel/status/849626401611411458

  43. Criteria: Read left-to-right, Omits intermediate values, Non-linear

    y <- f(x); g(y)      ✅ ✅
    g(f(x))              ✅ ✅
    x %>% f() %>% g()    ✅ ✅
  44. What happens if your pieces aren’t simple functions?

    flights %>%
      group_by(date) %>%
      summarise(n = n()) %>%
      ggplot(aes(date, n)) +
      geom_line()
  45. Which makes it quite inconsistent

    ggsave(
      flights %>%
        group_by(date) %>%
        summarise(n = n()) %>%
        ggplot(aes(date, n)) +
        geom_line(),
      "my-plot.pdf"
    )
  46. Interestingly, ggplot did not have this problem

    # https://github.com/hadley/ggplot1
    library(ggplot1)

    flights %>%
      group_by(date) %>%
      summarise(n = n()) %>%
      ggplot(aes(date, n)) %>%
      ggpoint() %>%
      ggsave("my-plot.pdf")
  47. Another interesting connection is ->

    flights %>%
      group_by(dest) %>%
      summarise(
        delay = mean(dep_delay, na.rm = TRUE),
        n = n()
      ) %>%
      filter(n > 100) %>%
      arrange(desc(delay)) -> dest_delays
  48. But leading with assignment improves readability

    dest_delays <- flights %>%
      group_by(dest) %>%
      summarise(
        delay = mean(dep_delay, na.rm = TRUE),
        n = n()
      ) %>%
      filter(n > 100) %>%
      arrange(desc(delay))
  49. Consistent structure

  51. Simple and have a consistent structure

  52. http://brickartist.com/gallery/pc-magazine-computer/. CC-BY-NC

  53. Tidy data is a consistent way of storing data

    1. Each dataset goes in a data frame.
    2. Each variable goes in a column.
  54. Tidy datasets are all alike; every messy dataset is
 messy in its own way — Hadley Tolstoy
  55. Messy data has a varied shape. What are the variables in this dataset? (Hint: f = female, u = unknown, 1524 = 15-24)

    # A tibble: 5,769 × 22
        iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514  f014 f1524
       <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
     1    AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
     2    AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
     3    AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
     4    AD  1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
     5    AD  1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
     6    AD  1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
     7    AD  1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA     0     1
     8    AD  1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA     0     1
     9    AD  1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA    NA    NA
    10    AD  1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA     0     0
    11    AD  2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA    NA    NA
    12    AD  2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA    NA    NA
    13    AD  2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA     0     1
    14    AD  2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA     0     1
    15    AD  2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA     0     0
    16    AD  2005     0     0     0     0     1     1     0     0     0     0     0     0     0     1
    17    AD  2006     0     0     0     1     1     2     0     1     1     0     0     0     0     0
    # ... with 5,752 more rows, and 6 more variables: f2534 <int>, f3544 <int>, f4554 <int>,
    #   f5564 <int>, f65 <int>, fu <int>
  56. Tidy data has a uniform shape

    # A tibble: 35,750 × 5
       country  year   sex   age     n
         <chr> <int> <chr> <chr> <int>
     1      AD  1996     f   014     0
     2      AD  1996     f  1524     1
     3      AD  1996     f  2534     1
     4      AD  1996     f  3544     0
     5      AD  1996     f  4554     0
     6      AD  1996     f  5564     1
     7      AD  1996     f    65     0
     8      AD  1996     m   014     0
     9      AD  1996     m  1524     0
    10      AD  1996     m  2534     0
    # ... with 35,740 more rows
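The slides do not show how the wide table becomes the tidy one. As a rough sketch, one possible route uses `tidyr::pivot_longer()`, which was released after this talk, so it is an assumption and not the method the speaker used. The input below is a four-column subset of the 1996 and 1997 rows shown earlier.

```r
library(tidyr)
library(tibble)

# Small subset of the messy table: sex and age group are packed
# into the column names (m014 = male, ages 0-14).
messy <- tibble(
  iso2  = c("AD", "AD"),
  year  = c(1996L, 1997L),
  m014  = c(0L, 0L),
  m1524 = c(0L, 0L),
  f014  = c(0L, 0L),
  f1524 = c(1L, 1L)
)

# Split each column name into sex (first letter) and age (the rest),
# and stack the counts into a single column n.
tidy <- pivot_longer(
  messy,
  cols = -c(iso2, year),
  names_to = c("sex", "age"),
  names_pattern = "^([mf])(.+)$",
  values_to = "n"
)

# Rename to match the tidy table's column name.
names(tidy)[names(tidy) == "iso2"] <- "country"
tidy
```

Each row of the result is one country, year, sex, and age group, so every variable has its own column.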
  57. tidytext by Julia Silge & David Robinson
 http://tidytextmining.com

  58. The family of Dashwood had long been settled in Sussex.

    Their estate was large, and their residence was at Norland Park, in the centre of their property, where, for many generations, they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance. — Sense & Sensibility, Jane Austen
  59. tidytext provides an answer

    # A tibble: 724,880 × 4
                      book linenumber chapter     word
                    <fctr>      <int>   <int>    <chr>
     1 Sense & Sensibility         10       1  chapter
     2 Sense & Sensibility         10       1        1
     3 Sense & Sensibility         13       1      the
     4 Sense & Sensibility         13       1   family
     5 Sense & Sensibility         13       1       of
     6 Sense & Sensibility         13       1 dashwood
     7 Sense & Sensibility         13       1      had
     8 Sense & Sensibility         13       1     long
     9 Sense & Sensibility         13       1     been
    10 Sense & Sensibility         13       1  settled
    # ... with 724,870 more rows
  60. Sentiment of Jane Austen books

    [Figure: one panel per book (Emma, Northanger Abbey, Persuasion, Sense & Sensibility, Pride & Prejudice, Mansfield Park); axis label: sentiment.]
  61. sf by Edzer Pebesma
 http://r-spatial.github.io/sf/

  62. [Figure: map spanning 84°W to 76°W and 34°N to 36.5°N, with a legend for AREA from 0.05 to 0.20.]
  63. Store complex geometries in a list-column

    nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"))
    nc %>%
      as_tibble() %>%
      select(NAME, FIPS, AREA, geometry)
    #> # A tibble: 100 × 4
    #>           NAME   FIPS  AREA          geometry
    #>         <fctr> <fctr> <dbl>  <simple_feature>
    #>  1        Ashe  37009 0.114 <MULTIPOLYGON...>
    #>  2   Alleghany  37005 0.061 <MULTIPOLYGON...>
    #>  3       Surry  37171 0.143 <MULTIPOLYGON...>
    #>  4   Currituck  37053 0.070 <MULTIPOLYGON...>
    #>  5 Northampton  37131 0.153 <MULTIPOLYGON...>
    #>  6    Hertford  37091 0.097 <MULTIPOLYGON...>
    #>  7      Camden  37029 0.062 <MULTIPOLYGON...>
    #>  8       Gates  37073 0.091 <MULTIPOLYGON...>
    #>  9      Warren  37185 0.118 <MULTIPOLYGON...>
    #> 10      Stokes  37169 0.124 <MULTIPOLYGON...>
    #> # ... with 90 more rows
  64. What if you have complex data?

    1. Each dataset goes in a tibble.
    2. Each variable goes in a column.
  65. Tibbles are data frames that are lazy & surly

    df <- tibble(xyz = "a")
    df$x
    #> Warning: Unknown column 'x'
    #> NULL
    df$xyz
    #> [1] "a"
  66. But also have better support for list-cols

    data.frame(x = list(1:2, 3:5))
    #> Error: arguments imply differing number
    #> of rows: 2, 3

    tibble(x = list(1:2, 3:5))
    #> # A tibble: 2 x 1
    #>           x
    #>      <list>
    #> 1 <int [2]>
    #> 2 <int [3]>
  67. List-columns keep related things together

    Anything can go in a list & a list can go in a data frame
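A short sketch of working with a list-column may help here. The tibble reuses the `list(1:2, 3:5)` example from the earlier slide; the derived columns `n` and `total` are names made up for this illustration.

```r
library(tibble)

# Each row holds a whole integer vector in the list-column x.
df <- tibble(x = list(1:2, 3:5))

# Per-row summaries sit next to the data they describe.
df$n     <- lengths(df$x)                  # number of elements per row
df$total <- vapply(df$x, sum, integer(1))  # sum of each element

df
```

Because the vectors and their summaries live in the same row, filtering or reordering the tibble keeps them together.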
  68. Conclusion

  69. Solve complex problems by combining simple pieces that have a consistent structure
  70. Solve complex problems by combining simple pieces that have a consistent structure

    Functions that do one thing well & can be understood with minimal context
  71. Solve complex problems by combining simple pieces that have a consistent structure

    With assignment, composition, or the pipe
  72. Solve complex problems by combining simple pieces that have a consistent structure

    Tidy tibbles have variables in columns and cases in rows. List-cols can store richer data structures
  73. Tidy, Import, Visualise, Transform, Model, Program

    tibble, tidyr, purrr, magrittr, dplyr, forcats, hms, ggplot2, broom, modelr, readr, readxl, haven, xml2, lubridate, stringr
    tidyverse.org, r4ds.had.co.nz
  74. This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/