
Tidyverse


Given at the Portland R meetup.

Hadley Wickham

October 10, 2016



Transcript

  1. Hadley Wickham 

    @hadleywickham

    Chief Scientist, RStudio
    The tidyverse
    October 2016


  2. Import
    Tidy: Consistent way of storing data
    Transform: Create new variables & new summaries
    Visualise: Surprises, but doesn't scale
    Model: Scales, but doesn't (fundamentally) surprise
    Communicate
    Program


  3. No matter how complex and
    polished the individual operations
    are, it is often the quality of the
    glue that most directly determines
    the power of the system.
    — Hal Abelson


  4. Import
    Tidy
    Transform
    Visualise
    Model
    Communicate
    Program



  6. The tidy tools
    manifesto


  7. tidyverse
    Import: readr, readxl, haven, httr, jsonlite, DBI, rvest, xml2
    Tidy: tibble, tidyr
    Transform: dplyr, forcats, hms, lubridate, stringr
    Visualise: ggplot2
    Model: broom, modelr
    Program: purrr, magrittr
    http://r4ds.had.co.nz


  8. 1. Share data structures.
    2. Compose simple pieces.
    3. Embrace FP.
    4. Write for humans.


  9. 1
    Share data
    structures


  10. Tidy data
    1. Put each dataset in a data frame.
    2. Put each variable in a column.


  11. # A tibble: 5,769 × 22
    iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564 m65 mu f04 f514 f014 f1524

    1 AD 1989 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    2 AD 1990 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    3 AD 1991 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    4 AD 1992 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    5 AD 1993 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    6 AD 1994 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    7 AD 1996 NA NA 0 0 0 4 1 0 0 NA NA NA 0 1
    8 AD 1997 NA NA 0 0 1 2 2 1 6 NA NA NA 0 1
    9 AD 1998 NA NA 0 0 0 1 0 0 0 NA NA NA NA NA
    10 AD 1999 NA NA 0 0 0 1 1 0 0 NA NA NA 0 0
    11 AD 2000 NA NA 0 0 1 0 0 0 0 NA NA NA NA NA
    12 AD 2001 NA NA 0 NA NA 2 1 NA NA NA NA NA NA NA
    13 AD 2002 NA NA 0 0 0 1 0 0 0 NA NA NA 0 1
    14 AD 2003 NA NA 0 0 0 1 2 0 0 NA NA NA 0 1
    15 AD 2004 NA NA 0 0 0 1 1 0 0 NA NA NA 0 0
    16 AD 2005 0 0 0 0 1 1 0 0 0 0 0 0 0 1
    17 AD 2006 0 0 0 1 1 2 0 1 1 0 0 0 0 0
    # ... with 5,752 more rows, and 6 more variables: f2534, f3544, f4554,
    # f5564, f65, fu
    Messy data has a varied shape
    What are the variables in this dataset?
    (Hint: f = female, u = unknown, 1524 = 15-24)


  12. # A tibble: 35,750 × 5
    country year sex age n

    1 AD 1996 f 014 0
    2 AD 1996 f 1524 1
    3 AD 1996 f 2534 1
    4 AD 1996 f 3544 0
    5 AD 1996 f 4554 0
    6 AD 1996 f 5564 1
    7 AD 1996 f 65 0
    8 AD 1996 m 014 0
    9 AD 1996 m 1524 0
    10 AD 1996 m 2534 0
    # ... with 35,740 more rows
    Tidy data has a uniform shape
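
One way to get from the wide table on slide 11 to this uniform shape is a single reshaping step. The sketch below is not from the talk: it assumes tidyr 1.0 or later (for `pivot_longer()`), and `messy` is a hypothetical two-row stand-in for the TB table that keeps only four of the count columns.

```r
library(tidyr)

# Hypothetical stand-in for the wide TB table: one column per sex/age group.
messy <- data.frame(
  iso2  = c("AD", "AD"),
  year  = c(1996L, 1997L),
  m014  = c(0L, 0L),
  m1524 = c(0L, 0L),
  f014  = c(0L, 0L),
  f1524 = c(1L, 1L),
  stringsAsFactors = FALSE
)

# Split each column name into its two variables (sex, age) and
# stack the counts into a single column n.
tidy <- pivot_longer(
  messy,
  cols = -c(iso2, year),
  names_to = c("sex", "age"),
  names_pattern = "^([mf])(\\d+)$",
  values_to = "n"
)
```

Every variable (country code, year, sex, age, count) now has its own column, so the same downstream code works whatever age groups the source happened to use.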


  13. Happy families are all alike;
    every unhappy family is
    unhappy in its own way
    — Leo Tolstoy


  14. Tidy datasets are all alike;
    every messy dataset is
    messy in its own way
    — Hadley Wickham


  15. Tidy data is ~80% of data structures
    strings
    dates
    matrices
    vectors
    xml
    HTTP requests
    HTTP response
    http://simplystatistics.org/2016/02/17/non-tidy-data/
    factors


  16. What if you have a mix?
    Cross-validation
    Training data: data frame
    Test data: data frame
    Model: lm
    Predictions: vector
    RMSE: scalar


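  17. Use a tibble with list-columns!
    # A tibble: 100 x 5
                train           test   .id      mod      rmse
               <list>         <list> <chr>   <list>     <dbl>
    1  <S3: resample> <S3: resample>   001 <S3: lm> 0.5661605
    2  <S3: resample> <S3: resample>   002 <S3: lm> 0.2399357
    3  <S3: resample> <S3: resample>   003 <S3: lm> 3.5482986
    4  <S3: resample> <S3: resample>   004 <S3: lm> 0.2396810
    5  <S3: resample> <S3: resample>   005 <S3: lm> 0.1591336
    6  <S3: resample> <S3: resample>   006 <S3: lm> 0.1934869
    7  <S3: resample> <S3: resample>   007 <S3: lm> 0.2697834
    8  <S3: resample> <S3: resample>   008 <S3: lm> 0.4910886
    9  <S3: resample> <S3: resample>   009 <S3: lm> 1.7002645
    10 <S3: resample> <S3: resample>   010 <S3: lm> 0.2047787
    # ... with 90 more rows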
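
A table like this can be assembled by hand. The following is only a sketch under stated assumptions, not the code behind the slide: it uses plain data frames for the splits, `mtcars` as example data, five splits instead of 100, and a hypothetical helper `rmse()`.

```r
library(tibble)
library(purrr)

set.seed(2016)  # arbitrary seed so the sketch is reproducible

# Hypothetical helper: root-mean-squared error of a model on new data.
rmse <- function(mod, data) {
  sqrt(mean((data$mpg - predict(mod, newdata = data))^2))
}

n_splits <- 5
# Row indices of the training set for each split (24 of the 32 rows).
idx <- map(seq_len(n_splits), function(i) sample(nrow(mtcars), size = 24))

# Data frames go into list-columns, one element per split.
cv <- tibble(
  .id   = sprintf("%03d", seq_len(n_splits)),
  train = map(idx, function(rows) mtcars[rows, ]),
  test  = map(idx, function(rows) mtcars[-rows, ])
)

# Fitted models are another list-column; the RMSE is an ordinary numeric column.
cv$mod  <- map(cv$train, function(d) lm(mpg ~ wt, data = d))
cv$rmse <- map2_dbl(cv$mod, cv$test, rmse)
```

Keeping the data, the model, and the score in one row means they cannot get out of sync when the table is filtered or sorted.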


  18. Your turn!
    df <- data.frame(xyz = "a")
    # What does this return?
    df$x


  19. Your turn!
    df <- data.frame(xyz = "a")
    # What does this return?
    df$x
    #> [1] a
    #> Levels: a
    Two surprises: partial name matching & stringsAsFactors


  20. Two important tensions for understanding base R
    Interactive exploration vs. Programming
    Conservative vs. Utopian


  21. df <- tibble(xyz = "a")
    df$xyz
    #> [1] "a"
    is.data.frame(df[, "xyz"])
    #> [1] TRUE
    df$x
    #> Warning: Unknown column 'x'
    #> NULL
    Tibbles are data frames that are lazy & surly
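
The two behaviours can be compared side by side. This is a minimal sketch, not from the talk: `stringsAsFactors = TRUE` is spelled out because R 4.0.0 changed the `data.frame()` default to `FALSE`, and the warning from the tibble is suppressed only to keep the example quiet.

```r
library(tibble)

# stringsAsFactors = TRUE reproduces the pre-4.0.0 default used in the talk.
df <- data.frame(xyz = "a", stringsAsFactors = TRUE)
base_x   <- df$x          # partial matching silently finds `xyz`
base_col <- df[, "xyz"]   # single-column subsetting drops to a vector

tb <- tibble(xyz = "a")
tb_x   <- suppressWarnings(tb$x)  # no partial matching: warns and returns NULL
tb_col <- tb[, "xyz"]             # stays a tibble
```

"Lazy" here means the tibble does less for you (no type conversion, no name guessing); "surly" means it complains earlier when a column does not exist.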


  22. data.frame(x = list(1:2, 3:5))
    #> Error: arguments imply differing number
    #> of rows: 2, 3
    And work better with list-columns


  23. data.frame(x = list(1:2, 3:5))
    #> Error: arguments imply differing number
    #> of rows: 2, 3
    data.frame(x = I(list(1:2, 3:5)))
    #> x
    #> 1 1, 2
    #> 2 3, 4, 5
    And work better with list-columns


  24. data.frame(x = list(1:2, 3:5))
    #> Error: arguments imply differing number
    #> of rows: 2, 3
    data.frame(x = I(list(1:2, 3:5)))
    #> x
    #> 1 1, 2
    #> 2 3, 4, 5
    tibble(x = list(1:2, 3:5))
    #> # A tibble: 2 x 1
    #>           x
    #>      <list>
    #> 1 <int [2]>
    #> 2 <int [3]>
    And work better with list-columns


  25. 2
    Compose
    simple pieces


  26. Goal: Solve complex
    problems by combining
    uniform pieces.


  27. https://www.flickr.com/photos/brunurb/13129057003


  28. http://brickartist.com/gallery/pc-magazine-computer/. CC-BY-NC


  29. magrittr::`%>%`


  30. foo_foo <- little_bunny()
    bop_on(
      scoop_up(
        hop_through(foo_foo, forest),
        field_mouse
      ),
      head
    )
    # vs
    foo_foo %>%
      hop_through(forest) %>%
      scoop_up(field_mouse) %>%
      bop_on(head)
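
The same equivalence can be checked with functions that actually exist. A small sketch, assuming only magrittr and base R; the vector `x` is made up for illustration.

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested calls read inside-out.
nested <- round(sqrt(sum(x)), 1)

# The pipe passes the left-hand side as the first argument of the next call,
# so the steps read top to bottom in the order they run.
piped <- x %>%
  sum() %>%
  sqrt() %>%
  round(1)
```

Both expressions compute the same value; only the reading order differs.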


  31. library(nycflights13)
    library(dplyr)
    library(ggplot2)
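    flights %>%
      # flights has no date column, so build one from year, month, day
      mutate(date = as.Date(ISOdate(year, month, day))) %>%
      group_by(date) %>%
      summarise(n = n()) %>%
      ggplot(aes(date, n)) +
      geom_line()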
    Consistency across packages is important


  32. ggplot(mtcars, aes(mpg, wt)) +
    geom_point() +
    geom_line() +
    ggsave("mtcars.pdf")
    And ggplot2 is not even internally consistent
    x


  33. ggsave(
      "mtcars.pdf",
      ggplot(mtcars, aes(mpg, wt)) +
        geom_point() +
        geom_line()
    )
    And ggplot2 is not even internally consistent
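
In practice the mismatch is usually handled by naming the plot first. A sketch of that pattern, assuming ggplot2 is installed; the temporary file path stands in for "mtcars.pdf" so the example leaves nothing behind.

```r
library(ggplot2)

# Layers compose with `+` ...
p <- ggplot(mtcars, aes(mpg, wt)) +
  geom_point() +
  geom_line()

# ... but saving is an ordinary function call that takes the plot as an argument.
out <- tempfile(fileext = ".pdf")  # temporary path, used here instead of "mtcars.pdf"
ggsave(out, plot = p, width = 8, height = 6)
```

The two composition styles (`+` for layers, a function call for saving) are the inconsistency the slide points at.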


  34. # devtools::install_github("hadley/ggplot1")
    library(ggplot1)
    ggsave(
    ggpoint(
    ggplot(
    mtcars,
    list(x = mpg, y = wt)
    )
    ),
    "mtcars.pdf", width = 8, height = 6
    )
    ggplot1 had a tidier API than ggplot2!


  35. library(ggplot1)
    mtcars %>%
    ggplot(list(x = mpg, y = wt)) %>%
    ggpoint() %>%
    ggsave("mtcars.pdf", width = 8, height = 6)
    So you can use the pipe with ggplot1
    ggplot2 never would have
    existed if I’d discovered the
    pipe 10 years earlier!


  36. [Figure: scatterplot with mpg on the x-axis and wt on the y-axis]


  37. library(rvest)
    library(purrr)
    library(readr)
    library(dplyr)
    library(lubridate)
    read_html("https://www.massshootingtracker.org/data") %>%
    html_nodes("a[href^='https://docs.goo']") %>%
    html_attr("href") %>%
    map_df(read_csv) %>%
    mutate(date = mdy(date)) ->
    shootings
    One small example from Bob Rudis https://rud.is/b/2016/07/26


  38. 3
    Embrace FP
    Why are for loops “bad”?
    Answered with cupcakes


  39. 1 cup flour
    a scant ¾ cup sugar
    1 ½ t baking powder
    3 T unsalted butter
    ½ cup whole milk
    1 egg
    ¼ t pure vanilla extract
    Preheat oven to 350°F.
    Put the flour, sugar, baking powder, salt, and butter in a
    freestanding electric mixer with a paddle attachment and beat
    on slow speed until you get a sandy consistency and everything
    is combined.
    Whisk the milk, egg, and vanilla together in a pitcher, then
    slowly pour about half into the flour mixture, beat to combine,
    and turn the mixer up to high speed to get rid of any lumps.
    Turn the mixer down to a slower speed and slowly pour in the
    remaining milk mixture. Continue mixing for a couple of more
    minutes until the batter is smooth but do not overmix.
    Spoon the batter into paper cases until 2/3 full and bake in the
    preheated oven for 20-25 minutes, or until the cake bounces
    back when touched.
    Vanilla cupcakes The hummingbird
    bakery cookbook


  40. ¾ cup + 2T flour
    2 ½ T cocoa powder
    a scant ¾ cup sugar
    1 ½ t baking powder
    3 T unsalted butter
    ½ cup whole milk
    1 egg
    ¼ t pure vanilla extract
    Preheat oven to 350°F.
    Put the flour, cocoa, sugar, baking powder, salt, and butter in a
    freestanding electric mixer with a paddle attachment and beat
    on slow speed until you get a sandy consistency and everything
    is combined.
    Whisk the milk, egg, and vanilla together in a pitcher, then
    slowly pour about half into the flour mixture, beat to combine,
    and turn the mixer up to high speed to get rid of any lumps.
    Turn the mixer down to a slower speed and slowly pour in the
    remaining milk mixture. Continue mixing for a couple of more
    minutes until the batter is smooth but do not overmix.
    Spoon the batter into paper cases until 2/3 full and bake in the
    preheated oven for 20-25 minutes, or until the cake bounces
    back when touched.
    Chocolate cupcakes The hummingbird
    bakery cookbook



  43. 120g flour
    140g sugar
    1.5 t baking powder
    40g unsalted butter
    120ml milk
    1 egg
    0.25 t pure vanilla extract
    Preheat oven to 170°C.
    Put the flour, sugar, baking powder, salt, and butter in a
    freestanding electric mixer with a paddle attachment and beat
    on slow speed until you get a sandy consistency and everything
    is combined.
    Whisk the milk, egg, and vanilla together in a pitcher, then
    slowly pour about half into the flour mixture, beat to combine,
    and turn the mixer up to high speed to get rid of any lumps.
    Turn the mixer down to a slower speed and slowly pour in the
    remaining milk mixture. Continue mixing for a couple of more
    minutes until the batter is smooth but do not overmix.
    Spoon the batter into paper cases until 2/3 full and bake in the
    preheated oven for 20-25 minutes, or until the cake bounces
    back when touched.
    Vanilla cupcakes
    1. Convert units
    The hummingbird
    bakery cookbook


  44. 120g flour
    140g sugar
    1.5 t baking powder
    40g butter
    120ml milk
    1 egg
    0.25 t vanilla
    Beat flour, sugar, baking powder, salt, and butter until sandy.
    Whisk milk, egg, and vanilla. Mix half into flour mixture until
    smooth (use high speed). Beat in remaining half. Mix until
    smooth.
    Bake 20-25 min at 170°C.
    Vanilla cupcakes
    2. Rely on domain knowledge
    The hummingbird
    bakery cookbook


  45. Beat dry ingredients + butter until sandy.
    Whisk together wet ingredients. Mix half into dry until smooth
    (use high speed). Beat in remaining half. Mix until smooth.
    Bake 20-25 min at 170°C.
    Vanilla cupcakes
    3. Use variables
    120g flour
    140g sugar
    1.5 t baking powder
    40g butter
    120ml milk
    1 egg
    0.25 t vanilla
    The hummingbird
    bakery cookbook


  46. Cupcakes
    4. Extract out common code
    Vanilla: 120g flour, 140g sugar, 1.5 t baking powder, 40g butter, 120ml milk, 1 egg, 0.25 t vanilla
    Chocolate: 100g flour, 20g cocoa, 140g sugar, 1.5 t baking powder, 40g butter, 120ml milk, 1 egg, 0.25 t vanilla
    Beat dry ingredients + butter until sandy.
    Whisk together wet ingredients. Mix half into dry until smooth
    (use high speed). Beat in remaining half. Mix until smooth.
    Bake 20-25 min at 170°C.


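  47. out1 <- vector("double", ncol(mtcars))
    for(i in seq_along(mtcars)) {
      out1[[i]] <- mean(mtcars[[i]])
    }
    out2 <- vector("double", ncol(mtcars))
    for(i in seq_along(mtcars)) {
      out2[[i]] <- median(mtcars[[i]])
    }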
    What do these for loops do?


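  48. out1 <- vector("double", ncol(mtcars))
    for(i in seq_along(mtcars)) {
      out1[[i]] <- mean(mtcars[[i]])
    }
    out2 <- vector("double", ncol(mtcars))
    for(i in seq_along(mtcars)) {
      out2[[i]] <- median(mtcars[[i]])
    }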
    For loops emphasise the objects


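  49. out1 <- vector("double", ncol(mtcars))
    for(i in seq_along(mtcars)) {
      out1[[i]] <- mean(mtcars[[i]])
    }
    out2 <- vector("double", ncol(mtcars))
    for(i in seq_along(mtcars)) {
      out2[[i]] <- median(mtcars[[i]])
    }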
    Not the actions


  50. library(purrr)
    means <- map_dbl(mtcars, mean)
    medians <- map_dbl(mtcars, median)
    Functional programming emphasises the actions
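
The contrast can be run directly. A minimal sketch, assuming purrr is installed; `col_summary()` is a hypothetical helper added here to mirror the "extract out common code" step of the cupcake analogy.

```r
library(purrr)

# For-loop version: allocation and indexing take up most of the code.
out1 <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  out1[[i]] <- mean(mtcars[[i]])
}

# Functional version: the action (mean) is the only thing spelled out.
means <- map_dbl(mtcars, mean)

# Hypothetical helper: the shared pattern with the action as a parameter.
col_summary <- function(df, f) map_dbl(df, f)
medians <- col_summary(mtcars, median)
```

The loop and `map_dbl()` give the same numbers; the functional form also keeps the column names on the result.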


  51. sim <- tribble(
      ~f, ~params,
      "runif", list(min = -1, max = 1),
      "rnorm", list(sd = 5),
      "rpois", list(lambda = 10)
    )
    sim %>%
      mutate(sim = invoke_map(f, params, n = 10))
    Teaser: simulation
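
`invoke_map()` was deprecated in purrr 1.0.0, so the same idea is sketched here with `map2()` and base R's `do.call()`. This is one possible rewrite, not the code from the talk; the seed is arbitrary.

```r
library(tibble)
library(purrr)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# One row per simulation: the function name and its parameters.
sim <- tribble(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)

# Call each function by name with its own parameters plus a shared n = 10.
sim$sim <- map2(sim$f, sim$params, function(f, params) {
  do.call(f, c(list(n = 10), params))
})
```

The results land in a list-column, so each simulated vector stays next to the function and parameters that produced it.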


  52. reports <- tibble(
      class = unique(mpg$class),
      filename = paste0("fuel-economy-", class, ".html"),
      params = map(class, ~ list(my_class = .))
    )
    reports %>%
      select(output_file = filename, params) %>%
      pwalk(rmarkdown::render, input = "fuel-economy.Rmd")
    Teaser: saving parameterised reports
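
How `pwalk()` feeds each row to the renderer can be seen without rendering anything. In this sketch `fake_render()` is a hypothetical stand-in for `rmarkdown::render()` that only records its arguments, and the two class names are hard-coded instead of read from `mpg`.

```r
library(tibble)
library(purrr)

reports <- tibble(
  class = c("compact", "suv"),
  filename = paste0("fuel-economy-", class, ".html"),
  params = map(class, function(x) list(my_class = x))
)

# Hypothetical stand-in for rmarkdown::render(): it only logs what it was given.
calls <- new.env()
assign("log", character(), envir = calls)
fake_render <- function(input, output_file, params) {
  entry <- paste(input, output_file, params$my_class)
  assign("log", c(get("log", envir = calls), entry), envir = calls)
  invisible(output_file)
}

# pwalk() matches list names to argument names and calls the function once per row;
# extra arguments such as input are passed unchanged to every call.
pwalk(
  list(output_file = reports$filename, params = reports$params),
  fake_render,
  input = "fuel-economy.Rmd"
)
```

Because the matching is by name, renaming `filename` to `output_file` is all that is needed to line the table up with the function's arguments.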


  53. 4
    Write for
    humans


  54. Programs must be written for
    people to read, and only
    incidentally for machines to
    execute.
    — Hal Abelson


  55. tibble
    lubridate
    forcats filter
    mutate
    summarise
    arrange
    select
    magrittr


  56. tibble
    lubridate
    forcats filter
    mutate
    summarise
    arrange
    select
    magrittr
    Embrace existing
    language
    Have fun!
    Pack
    animals


  57. Conclusion


  58. 1. Share data structures.
    2. Compose simple pieces.
    3. Embrace FP.
    4. Write for humans.


  59. My goal is to make
    a pit of success
    http://blog.codinghorror.com/falling-into-the-pit-of-success/


  60. install.packages("tidyverse")
    library(tidyverse)
    #> Loading tidyverse: ggplot2
    #> Loading tidyverse: tibble
    #> Loading tidyverse: tidyr
    #> Loading tidyverse: readr
    #> Loading tidyverse: purrr
    #> Loading tidyverse: dplyr
    #> Conflicts with tidy packages ----------------------------------------------
    #> filter(): dplyr, stats
    #> lag(): dplyr, stats
    Gotta install them all
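
The conflicts listed in the startup message mean that, once dplyr is attached, a bare `filter()` refers to dplyr's version. A short sketch of calling each one explicitly with `::`; the inputs are made up for illustration.

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that match a condition.
four_cyl <- dplyr::filter(mtcars, cyl == 4)

# stats::filter() applies a linear filter to a series
# (here a centred 3-point moving average).
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
```

Qualifying the call with the package name removes any doubt about which `filter()` runs, whatever order the packages were loaded in.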


  61. tidyverse
    Import: readr, readxl, haven, httr, jsonlite, rvest, xml2
    Tidy: tibble, tidyr
    Transform: dplyr, forcats, hms, lubridate, stringr, vctrs
    Visualise: ggplot2
    Model: broom, modelr, ???
    Program: purrr, magrittr
    http://r4ds.had.co.nz


  62. I want to R the very best /
    like no one ever R'ed /
    To tidy them is my real test /
    To model them is my cause
    I'll import from across the web/
    Searching GitHub and CRAN/
    Each Tidyverse package gets me closer/
    to results I'll understand
    Tidyverse! 

    Gotta wrangle/
    It's R and me/
    I know it's my destiny/
    Tidyverse! 

    Hadley - you're my best friend/
    With tidy data from end to end/
    Tidyverse! Gotta wrangle/
    A package so new/
    Pipes and data first will pull us through/
    You teach me and I'll teach you/
    Tidyverse! Gotta catch’em all
    Tidyverse theme song, by Sean Kross


  63. This work is licensed under the Creative Commons
    Attribution-Noncommercial 3.0 United States License.
    To view a copy of this license, visit
    http://creativecommons.org/licenses/by-nc/3.0/us/
