
Expressing yourself with R


Hadley Wickham

September 24, 2017

Transcript

  1. Hadley Wickham
    @hadleywickham
    Chief Scientist, RStudio
    Expressing yourself with R
    July 2017


  3. [Diagram] Workflow stages: Import, Tidy, Transform, Visualise, Model, Program
    Packages: tibble, tidyr, purrr, magrittr, dplyr, forcats, hms, ggplot2, broom, modelr, readr, readxl, haven, xml2, lubridate, stringr
    tidyverse.org, r4ds.had.co.nz

  4. @aaronwolen, @aghaynes, @ajdamico, @ajschumacher, @alberthkcheng, @alyst, @andrew, @andrewjlm,
    @apjanke, @arneschillert, @artemklevtsov, @arunsrinivasan, @asnr, @astamm, @austenhead, @baptiste,
    @bbolker, @bearloga, @benmarwick, @bhive01, @BioStatMatt, @bpbond, @bquast, @BrianDiggs, @briatte,
    @burchill, @casallas, @cb4ds, @cboettig, @cderv, @christophergandrud, @cmartin, @colinbrislawn,
    @coolbutuseless, @cosinequanon, @craigcitro, @csgillespie, @ctbrown, @daattali, @dandermotj,
    @danliIDEA, @DanRuderman, @davharris, @davidmorrison, @dchiu911, @dchudz, @dewittpe, @dgromer,
    @dgrtwo, @dhimmel, @dickoa, @diogocp, @djmurphy420, @dlebauer, @dmedri, @dmenne, @dougmitarotonda,
    @dpastoor, @dpocock, @dtelad11, @earino, @echasnovski, @ecortens, @eddelbuettel, @edgararuiz,
    @edwindj, @egnha, @ehrlinger, @eibanez, @eipi10, @ekstroem, @emojiencoding, @etiennebr, @evanmiller,
    @fpinter, @FvD, @gaborcsardi, @gagolews, @garrettgman, @gavinsimpson, @gergness, @gnustats, @gorcha,
    @goyalmunish, @gregmacfarlane, @guillett, @gvelasq2, @hannesmuehleisen, @has2k1, @helix123,
    @hmalmedal, @hoehleatsu, @hoesler, @holstius, @hrbrmstr, @ianmcook, @ijlyttle, @ilarischeinin,
    @imanuelcostigan, @Ironholds, @ismayc, @isomorphisms, @itsdalmo, @JakeRuss, @janschulz, @jasonelaw,
    @javierluraschi, @jayhesselberth, @jcheng5, @jdnewmil, @jefferis, @jennybc, @jenzopr, @jeremystan,
    @jeroen, @jgabry, @jhuovari, @jiho, @jimhester, @jirkalewandowski, @jjallaire, @jmarshallnz, @jmi5,
    @joethorley, @JoFrhwld, @jonboiser, @jonmcalder, @joranE, @joshkatz, @jrnold, @juba, @junkka,
    @justmarkham, @kalibera, @karawoo, @karthik, @Katiedaisey, @kbenoit, @Kevin-M-Smith, @kevinushey,
    @kmillar, @kohske, @krlmlr, @kwenzig, @kwstat, @KZARCA, @l-d-s, @LaDilettante, @larmarange,
    @leondutoit, @lepennec, @lindbrook, @lionel-, @lmullen, @lorenzwalthert, @lselzer, @luckyrandom,
    @LucyMcGowan, @lwjohnst86, @MarcusWalz, @markdly, @markriseley, @matthieugomez, @maurolepore,
    @mdlincoln, @mgacc0, @mgirlich, @michaelquinn32, @mikelove, @mkcor, @mkuehn10, @mkuhn, @mmparker,
    @msonnabaum, @ncarchedi, @NoahMarconi, @noamross, @npjc, @nutterb, @paternogbc, @paul-buerkner,
    @PedramNavid, @PeteHaitch, @pierucci, @pimentel, @pitakakariki, @pkq, @r2evans, @rbdixon,
    @richierocks, @RiRam, @rmsharp, @robertzk, @rohan-shah, @romainfrancois, @RoyalTS, @rsaporta,
    @rtaph, @rudazhan, @ruderphilipp, @s-fleck, @seaaan, @setempler, @sfirke, @shabbybanks, @sjackman,
    @sjPlot, @smbache, @statisfactions, @steromano, @t-kalinowski, @tareefk, @tdhock, @terrytangyuan,
    @thomasp85, @tjmahr, @tklebel, @tmshn, @tonytonov, @tuttinator, @tverbeke, @uribo, @vspinu, @wch,
    @webbedfeet, @wibeasley, @wligtenberg, @x0rshift, @xiaodaigh, @Yeedle, @yutannihilation, @zeehio,
    @zhaoy, and @zhilongjia


  5. My goal is to make
    a pit of success
    http://blog.codinghorror.com/falling-into-the-pit-of-success/


  6. Solve complex problems by
    combining simple pieces that
    have a consistent structure


  7. Pieces


  8. df$a[df$a == -99] <- NA
    df$b[df$b == -99] <- NA
    df$e[df$e == -99] <- NA
    df$f[df$f == -99] <- NA
    df$g[df$g == -98] <- NA
    df$h[df$h == -99] <- NA
    df$i[df$i == -99] <- NA
    df$i[df$j == -99] <- NA
    df$k[df$k == -99] <- NA
    df$l[df$l == -99] <- NA
    df$m[df$m == -99] <- NA
    df$n[df$n == -99] <- NA
    What’s the point of this code? What’s wrong?


  9. df$a[df$a == -99] <- NA
    df$b[df$b == -99] <- NA
    # c & d are character variables
    df$e[df$e == -99] <- NA
    df$f[df$f == -99] <- NA
    df$g[df$g == -98] <- NA
    df$h[df$h == -99] <- NA
    df$i[df$i == -99] <- NA
    df$i[df$j == -99] <- NA
    df$k[df$k == -99] <- NA
    df$l[df$l == -99] <- NA
    df$m[df$m == -99] <- NA
    df$n[df$n == -99] <- NA
    Duplicated code hides intent & errors


  10. fix_missing <- function(x) {
    x[x == -99] <- NA
    x
    }
    df$a <- fix_missing(df$a)
    df$b <- fix_missing(df$b)
    df$e <- fix_missing(df$e)
    df$f <- fix_missing(df$f)
    df$g <- fix_missing(df$g)
    df$h <- fix_missing(df$h)
    df$i <- fix_missing(df$i)
    df$j <- fix_missing(df$j)
    df$k <- fix_missing(df$k)
    Create a function whenever you’ve pasted >3 times


  11. fix_missing <- function(x) {
    x[x == -99] <- NA
    x
    }
    df <- purrr::modify_if(df, is.numeric, fix_missing)
    Learn FP tools to remove even more duplication
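
A minimal sketch of this idea on a toy data frame. The data below are made up for illustration; only `fix_missing()` and `purrr::modify_if()` come from the slide. The character column is left alone because it fails the `is.numeric` predicate.

```r
library(purrr)

# Replace the sentinel value -99 with NA
fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}

# Hypothetical toy data: two numeric columns and one character column
df <- data.frame(
  a = c(1, -99, 3),
  b = c(-99, 5, 6),
  c = c("x", "-99", "z"),
  stringsAsFactors = FALSE
)

# Apply fix_missing() only to the columns where is.numeric() is TRUE
clean <- modify_if(df, is.numeric, fix_missing)
clean
```

Because the rule lives in one place, a typo such as `-98` or a mismatched column name can no longer hide in a dozen near-identical lines.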


  12. Simple pieces


  13. Generally, want functions like legos


  14. https://unsplash.com/photos/0VNVxhEnkII
    Not like playmobil


  15. What is a simple function?
    Does one thing well
    Needs minimal context
    to be understood


  16. One thing well | Minimal context
    Computes a value / Changes the world
    Type stable
    Obey scoping rules
    No hidden arguments
    Evocative name

  17. One thing well | Minimal context
    Computes a value / Changes the world
    Type stable
    Obey scoping rules
    No hidden arguments
    Evocative name

  18. Computes a value /
    Changes the world


  19. print()
    mean()
    mutate()
    write_csv()
    + geom_line()
    <-
    runif()
    Which is which?


  20. # Computes a value
    mean()
    mutate()
    + geom_line()
    # Changes the world
    print()
    write_csv()
    <-
    Which is which?


  21. runif(5)
    #> [1] 0.5530 0.0138 0.8774 0.9225 0.0606
    runif(5)
    #> [1] 0.8210 0.0459 0.6008 0.4323 0.3644
    Some functions must do both


  22. .Random.seed[2]
    #> [1] 624
    runif(5)
    #> [1] 0.0808 0.8343 0.6008 0.1572 0.0074
    .Random.seed[2]
    #> [1] 5
    Some functions must do both
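
The two roles of `runif()` can be seen side by side in a short base R sketch. The `set.seed(1)` call is an addition for reproducibility and is not on the slide; it also guarantees that `.Random.seed` exists before it is read.

```r
set.seed(1)                # assumption: fixed seed so the sketch is repeatable
before <- .Random.seed     # snapshot of the generator state
x <- runif(5)              # computes a value ...
after <- .Random.seed      # ... and advances the global state as a side effect

state_changed <- !identical(before, after)
state_changed

# Resetting the state reproduces the same "random" values
set.seed(1)
y <- runif(5)
identical(x, y)
```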


  23. mod <- lm(mpg ~ wt, data = mtcars)
    summary(mod)
    #> Coefficients:
    #> Estimate Std. Error t value Pr(>|t|)
    #> (Intercept) 37.285 1.878 19.86 < 2e-16 ***
    #> wt -5.344 0.559 -9.56 1.3e-10 ***
    #> ---
    #>
    #> Residual standard error: 3.05 on 30 degrees of freedom
    #> Multiple R-squared: 0.753, Adjusted R-squared: 0.745
    #> F-statistic: 91.4 on 1 and 30 DF, p-value: 1.29e-10
    Base R generally does this well


  24. So the exceptions are extra frustrating
    [Figure: four diagnostic plots for the model: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Each panel labels the points Fiat 128, Toyota Corolla, and Chrysler Imperial.]

  25. Type stability


  26. “Mama always said type-unstable functions are like a
    box of chocolates. You never know what you’re gonna get.”

    — Hadley Gump


  27. Type-stable functions
    Regardless of the input, a type-stable function gives the same type of output.
    It’s harder to predict the result of a type-unstable function.

  28. # A tibble: 150 x 5
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species

    1 5.1 3.5 1.4 0.2 setosa
    2 4.9 3.0 1.4 0.2 setosa
    3 4.7 3.2 1.3 0.2 setosa
    4 4.6 3.1 1.5 0.2 setosa
    5 5.0 3.6 1.4 0.2 setosa
    6 5.4 3.9 1.7 0.4 setosa
    7 4.6 3.4 1.4 0.3 setosa
    8 5.0 3.4 1.5 0.2 setosa
    9 4.4 2.9 1.4 0.2 setosa
    10 4.9 3.1 1.5 0.1 setosa
    # ... with 140 more rows
    iris


  29. find_vars <- function(df, predicate) {
    vars <- sapply(df, predicate)
    df[, vars]
    }
    find_vars(iris, is.numeric)
    find_vars(iris, is.factor)
    # For experts only:
    find_vars(iris[, 0], is.numeric)
    What will this function return?
    iris has four numeric
    variables and one factor


  30. class(find_vars(iris, is.numeric))
    #> [1] "data.frame"
    class(find_vars(iris, is.factor))
    #> [1] "factor"
    find_vars(iris[, 0], is.numeric)
    #> Error in .subset(x, j):
    #> invalid subscript type 'list'


  31. find_vars <- function(df, predicate) {
    vars <- sapply(df, predicate)
    df[, vars]
    }
    sapply() & [.data.frame are type-unstable
    Returns vector or data frame
    Returns vector, matrix, or list
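
Both instabilities can be reproduced with base R alone. The sketch below uses made-up inputs (apart from the built-in `iris`) and checks the shape of each result instead of printing it.

```r
# sapply() simplifies when it can, so the output type depends on the input
v <- sapply(1:3, function(i) i * 2)     # length-1 results -> atomic vector
m <- sapply(1:3, function(i) c(i, i))   # length-2 results -> matrix
l <- sapply(list(), is.numeric)         # empty input      -> empty list

is.atomic(v) && !is.matrix(v)
is.matrix(m)
is.list(l)

# `[` on a data frame drops to a vector when a single column is selected
two_cols <- iris[, 1:2]
one_col  <- iris[, 1]
kept     <- iris[, 1, drop = FALSE]

is.data.frame(two_cols)
is.data.frame(one_col)    # FALSE: a plain numeric vector
is.data.frame(kept)       # TRUE: drop = FALSE keeps the data frame
```

The empty-list case is what produces the `invalid subscript type 'list'` error when a zero-column data frame is passed to `find_vars()`.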


  32. find_vars <- function(df, predicate) {
    vars <- purrr::map_lgl(df, predicate)
    df[, vars, drop = FALSE]
    }
    Two changes make it much more predictable
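
As a quick check, the revised function can be run on the same three inputs that gave three different outcomes before. This is a sketch; the `results` list and its names are illustrative only.

```r
library(purrr)

find_vars <- function(df, predicate) {
  vars <- map_lgl(df, predicate)   # always a logical vector, or an error
  df[, vars, drop = FALSE]         # always a data frame
}

results <- list(
  numeric = find_vars(iris, is.numeric),
  factor  = find_vars(iris, is.factor),
  empty   = find_vars(iris[, 0], is.numeric)
)

# Every result is a data frame; only the number of columns differs
classes <- map_chr(results, function(x) class(x)[1])
widths  <- map_int(results, ncol)
classes
widths
```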


  33. Combining simple pieces

  34. by_dest <- group_by(flights, dest)
    dest_delay <- summarise(by_dest,
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
    )
    big_dest <- filter(dest_delay, n > 100)
    arrange(big_dest, desc(delay))
    Base R has two ways to combine functions


  35. foo <- group_by(flights, dest)
    foo <- summarise(foo,
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
    )
    foo <- filter(foo, n > 100)
    arrange(foo, desc(delay))
    But naming is hard work


  36. foo1 <- group_by(flights, dest)
    foo2 <- summarise(foo1,
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
    )
    foo3 <- filter(foo2, n > 100)
    arrange(foo2, desc(delay))
    But naming is hard work


  37. arrange(
    filter(
    summarise(
    group_by(flights, dest),
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
    ),
    n > 100
    ),
    desc(delay)
    )
    Alternatively, you could nest function calls


  38. magrittr provides a third option
    %>%


  39. x %>% f()
    # Is the same as
    f(x)
    x %>% f() %>% g(y)
    # Is the same as
    g(f(x), y)
    The pipe
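
The two equivalences can be checked directly. In this sketch `sqrt()` and `round()` stand in for the slide's `f()` and `g()`, and the vector `x` is arbitrary.

```r
library(magrittr)

x <- c(2, 3, 5)

# x %>% f() is the same as f(x)
same_one <- identical(x %>% sqrt(), sqrt(x))

# x %>% f() %>% g(y) is the same as g(f(x), y)
same_two <- identical(x %>% sqrt() %>% round(1), round(sqrt(x), 1))

same_one
same_two
```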


  40. flights %>%
    group_by(dest) %>%
    summarise(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
    ) %>%
    filter(n > 100) %>%
    arrange(desc(delay))
    This is easy to read & doesn’t require naming
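
The three styles shown so far (intermediate objects, nested calls, and the pipe) should produce the same result. A small sketch, assuming the built-in `mtcars` data as a stand-in for `flights` so it runs without extra data packages; the grouping column and the `n > 10` cutoff are arbitrary choices for illustration.

```r
library(dplyr)

# 1. Intermediate objects
by_cyl   <- group_by(mtcars, cyl)
cyl_mpg  <- summarise(by_cyl, mpg = mean(mpg), n = n())
big_cyl  <- filter(cyl_mpg, n > 10)
stepwise <- arrange(big_cyl, desc(mpg))

# 2. Nested calls
nested <- arrange(
  filter(
    summarise(group_by(mtcars, cyl), mpg = mean(mpg), n = n()),
    n > 10
  ),
  desc(mpg)
)

# 3. The pipe
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg), n = n()) %>%
  filter(n > 10) %>%
  arrange(desc(mpg))

isTRUE(all.equal(stepwise, nested))
isTRUE(all.equal(stepwise, piped))
```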


  41. library(tidyverse)
    library(magick)
    dir(pattern = ".png") %>%
    map(image_read) %>%
    image_join() %>%
    image_animate(fps = 1, loop = 25) %>%
    image_write("my_animation.gif")
    Makes it easy to read unfamiliar code
    https://twitter.com/ricardokriebel/status/849626401611411458
    What does this
    code do?


  42. https://twitter.com/ricardokriebel/status/849626401611411458


  43. Columns: Read left-to-right | Omits intermediate values | Non-linear
    y <- f(x); g(y)      ✅ ✅
    g(f(x))              ✅ ✅
    x %>% f() %>% g()    ✅ ✅

  44. flights %>%
    group_by(date) %>%
    summarise(n = n()) %>%
    ggplot(aes(date, n)) +
    geom_line()
    What happens if your pieces aren’t simple functions?


  45. ggsave(
    flights %>%
    group_by(date) %>%
    summarise(n = n()) %>%
    ggplot(aes(date, n)) +
    geom_line(),
    "my-plot.pdf"
    )
    Which makes it quite inconsistent


  46. # https://github.com/hadley/ggplot1
    library(ggplot1)
    flights %>%
    group_by(date) %>%
    summarise(n = n()) %>%
    ggplot(aes(date, n)) %>%
    ggpoint() %>%
    ggsave("my-plot.pdf")
    Interestingly, ggplot did not have this problem


  47. flights %>%
    group_by(dest) %>%
    summarise(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
    ) %>%
    filter(n > 100) %>%
    arrange(desc(delay)) ->
    dest_delays
    Another interesting connection is ->


  48. dest_delays <- flights %>%
    group_by(dest) %>%
    summarise(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
    ) %>%
    filter(n > 100) %>%
    arrange(desc(delay))
    But leading with assignment improves readability


  49. Consistent
    structure



  51. Simple and have a consistent structure


  52. http://brickartist.com/gallery/pc-magazine-computer/. CC-BY-NC


  53. Tidy data is a consistent way of storing data
    1. Each dataset goes in a data frame.
    2. Each variable goes in a column.

  54. Tidy datasets are all alike; every messy dataset is messy in its own way
    — Hadley Tolstoy


  55. # A tibble: 5,769 × 22
    iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564 m65 mu f04 f514 f014 f1524

    1 AD 1989 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    2 AD 1990 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    3 AD 1991 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    4 AD 1992 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    5 AD 1993 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    6 AD 1994 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    7 AD 1996 NA NA 0 0 0 4 1 0 0 NA NA NA 0 1
    8 AD 1997 NA NA 0 0 1 2 2 1 6 NA NA NA 0 1
    9 AD 1998 NA NA 0 0 0 1 0 0 0 NA NA NA NA NA
    10 AD 1999 NA NA 0 0 0 1 1 0 0 NA NA NA 0 0
    11 AD 2000 NA NA 0 0 1 0 0 0 0 NA NA NA NA NA
    12 AD 2001 NA NA 0 NA NA 2 1 NA NA NA NA NA NA NA
    13 AD 2002 NA NA 0 0 0 1 0 0 0 NA NA NA 0 1
    14 AD 2003 NA NA 0 0 0 1 2 0 0 NA NA NA 0 1
    15 AD 2004 NA NA 0 0 0 1 1 0 0 NA NA NA 0 0
    16 AD 2005 0 0 0 0 1 1 0 0 0 0 0 0 0 1
    17 AD 2006 0 0 0 1 1 2 0 1 1 0 0 0 0 0
    # ... with 5,752 more rows, and 6 more variables: f2534, f3544, f4554,
    # f5564, f65, fu
    Messy data has a varied shape
    What are the variables in this dataset?
    (Hint: f = female, u = unknown, 1524 = 15-24)


  56. # A tibble: 35,750 × 5
    country year sex age n

    1 AD 1996 f 014 0
    2 AD 1996 f 1524 1
    3 AD 1996 f 2534 1
    4 AD 1996 f 3544 0
    5 AD 1996 f 4554 0
    6 AD 1996 f 5564 1
    7 AD 1996 f 65 0
    8 AD 1996 m 014 0
    9 AD 1996 m 1524 0
    10 AD 1996 m 2534 0
    # ... with 35,740 more rows
    Tidy data has a uniform shape
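
The slides do not show how the messy table was reshaped. One way to sketch it, assuming `tidyr::pivot_longer()` (added to tidyr after this talk) and a hand-typed four-column excerpt of the messy data, is to split each column name into a sex part and an age part:

```r
library(tibble)
library(tidyr)

# Hypothetical miniature of the messy table: one column per sex/age combination
messy <- tibble(
  iso2  = c("AD", "AD"),
  year  = c(1996L, 1997L),
  m014  = c(0L, 0L),
  m1524 = c(0L, 0L),
  f014  = c(0L, 0L),
  f1524 = c(1L, 1L)
)

# Column names such as "f1524" encode two variables: sex ("f") and age ("1524")
tidy <- pivot_longer(
  messy,
  cols = -c(iso2, year),
  names_to = c("sex", "age"),
  names_pattern = "^([mf])(.*)$",
  values_to = "n"
)
tidy
```

Each row of `tidy` is now one country, year, sex, and age group, so the same grouping and plotting code works regardless of how many age bands the source had.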


  57. tidytext
    by Julia Silge & David Robinson

    http://tidytextmining.com


  58. The family of Dashwood had long been
    settled in Sussex. Their estate was large,
    and their residence was at Norland Park, in
    the centre of their property, where, for
    many generations, they had lived in so
    respectable a manner as to engage the
    general good opinion of their surrounding
    acquaintance.
    — Sense & Sensibility, Jane Austen


  59. # A tibble: 724,880 × 4
    book linenumber chapter word

    1 Sense & Sensibility 10 1 chapter
    2 Sense & Sensibility 10 1 1
    3 Sense & Sensibility 13 1 the
    4 Sense & Sensibility 13 1 family
    5 Sense & Sensibility 13 1 of
    6 Sense & Sensibility 13 1 dashwood
    7 Sense & Sensibility 13 1 had
    8 Sense & Sensibility 13 1 long
    9 Sense & Sensibility 13 1 been
    10 Sense & Sensibility 13 1 settled
    # ... with 724,870 more rows
    tidytext provides an answer
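
A one-word-per-row table like this can be sketched with `tidytext::unnest_tokens()`. The two-line input below is typed in from the quotation; the `austen` object and its `line` column are illustrative, not the data behind the slide.

```r
library(tibble)
library(tidytext)

# Hypothetical input: one row per line of text
austen <- tibble(
  line = 1:2,
  text = c(
    "The family of Dashwood had long been",
    "settled in Sussex. Their estate was large,"
  )
)

# Split the `text` column into one lower-cased word per row, in a column named `word`
words <- unnest_tokens(austen, word, text)
words
```

Once the text is in this shape, ordinary dplyr verbs (counting, joining, grouping) apply to words just as they do to any other variable.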


  60. Sentiment of Jane Austen books
    [Figure: one panel per book (Sense & Sensibility, Pride & Prejudice, Mansfield Park, Emma, Northanger Abbey, Persuasion); y-axis: sentiment, from −50 to 50]

  61. sf
    by Edzer Pebesma

    http://r-spatial.github.io/sf/


  62. [Figure: map spanning 34°N to 36.5°N and 84°W to 76°W, shaded by AREA (legend from 0.05 to 0.20)]

  63. nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"))
    nc %>%
    as_tibble() %>%
    select(NAME, FIPS, AREA, geometry)
    #> # A tibble: 100 × 4
    #> NAME FIPS AREA geometry
    #>
    #> 1 Ashe 37009 0.114
    #> 2 Alleghany 37005 0.061
    #> 3 Surry 37171 0.143
    #> 4 Currituck 37053 0.070
    #> 5 Northampton 37131 0.153
    #> 6 Hertford 37091 0.097
    #> 7 Camden 37029 0.062
    #> 8 Gates 37073 0.091
    #> 9 Warren 37185 0.118
    #> 10 Stokes 37169 0.124
    #> # ... with 90 more rows
    Store complex geometries in a list-column


  64. What if you have complex data?
    1. Each dataset goes in a tibble.
    2. Each variable goes in a column.

  65. df <- tibble(xyz = "a")
    df$x
    #> Warning: Unknown column 'x'
    #> NULL
    df$xyz
    #> [1] "a"
    Tibbles are data frames that are lazy & surly
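
For contrast, a base data frame silently completes a partial column name. A short sketch comparing the two (the object names `base_df` and `tbl_df` are illustrative):

```r
library(tibble)

base_df <- data.frame(xyz = "a")
tbl_df  <- tibble(xyz = "a")

# data.frame: `$` partially matches "x" to the column "xyz"
partial <- as.character(base_df$x)
partial

# tibble: no partial matching; the result is NULL (with a warning, silenced here)
strict <- suppressWarnings(tbl_df$x)
is.null(strict)
```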


  66. data.frame(x = list(1:2, 3:5))
    #> Error: arguments imply differing number
    #> of rows: 2, 3
    tibble(x = list(1:2, 3:5))
    #> # A tibble: 2 x 1
    #> x
    #>
    #> 1
    #> 2
    But also have better support for list-cols
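
The same comparison as a runnable sketch, with the `data.frame()` error captured so the script can continue past it:

```r
library(tibble)

# data.frame() treats the list as two separate columns of unequal length
err <- tryCatch(
  data.frame(x = list(1:2, 3:5)),
  error = function(e) conditionMessage(e)
)
err

# tibble() keeps the list as a single column with one element per row
df <- tibble(x = list(1:2, 3:5))
nrow(df)
lengths(df$x)
```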


  67. List-columns keep related things together
    Anything can go in a list & a list can go in a data frame


  68. Conclusion


  69. Solve complex problems by
    combining simple pieces that
    have a consistent structure


  70. Solve complex problems by
    combining simple pieces that
    have a consistent structure
    Functions that do one thing well & can be
    understood with minimal context


  71. Solve complex problems by
    combining simple pieces that
    have a consistent structure
    With assignment, composition,
    or the pipe


  72. Solve complex problems by
    combining simple pieces that
    have a consistent structure
    Tidy tibbles have variables in columns and cases in rows.

    List-cols can store richer data structures


  73. [Diagram] Workflow stages: Import, Tidy, Transform, Visualise, Model, Program
    Packages: tibble, tidyr, purrr, magrittr, dplyr, forcats, hms, ggplot2, broom, modelr, readr, readxl, haven, xml2, lubridate, stringr
    tidyverse.org, r4ds.had.co.nz

  74. This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License.
    To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/