Welcome to the tidyverse

Keynote presentation at ASA CSP conference.

The tidyverse is a language for solving data science challenges with R code. This talk discusses how you can use the tidyverse to work with data with little programming knowledge, my model of data science, and why coding is so important.

Hadley Wickham

February 15, 2019

Transcript

  1. Hadley Wickham 

    @hadleywickham

    Chief Scientist, RStudio
    Welcome to the tidyverse
    February 2019

  2. The tidyverse is a
    language for solving
    data science challenges
    with R code.

  3. Your turn: What data do we
    need to recreate this plot?

  4. # A tibble: 193 x 6
       country             four_regions year income life_exp      pop
     1 Afghanistan         asia         2015   1750     57.9 33700000
     2 Albania             europe       2015  11000     77.6  2920000
     3 Algeria             africa       2015  13700     77.3 39900000
     4 Andorra             europe       2015  46600     82.5    78000
     5 Angola              africa       2015   6230     64   27900000
     6 Antigua and Barbuda americas     2015  20100     77.2    99900
     7 Argentina           americas     2015  19100     76.5 43400000
     8 Armenia             europe       2015   8180     75.4  2920000
     9 Australia           asia         2015  43800     82.6 23800000
    10 Austria             europe       2015  44100     81.4  8680000
    # … with 183 more rows
    We need five variables

  5. # A tibble: 193 x 6
       country             four_regions year income life_exp      pop
     1 Afghanistan         asia         2015   1750     57.9 33700000
     2 Albania             europe       2015  11000     77.6  2920000
     3 Algeria             africa       2015  13700     77.3 39900000
     4 Andorra             europe       2015  46600     82.5    78000
     5 Angola              africa       2015   6230     64   27900000
     6 Antigua and Barbuda americas     2015  20100     77.2    99900
     7 Argentina           americas     2015  19100     76.5 43400000
     8 Armenia             europe       2015   8180     75.4  2920000
     9 Australia           asia         2015  43800     82.6 23800000
    10 Austria             europe       2015  44100     81.4  8680000
    # … with 183 more rows
    This is “tidy” data
    Each column is a variable
    Each row is an observation

  6. # A tibble: 193 x 220
    country `1800` `1801` `1802` `1803` `1804` `1805` `1806` `1807` `1808` `1809` `1810` `1811` `1812` `1813`

    1 Afghan… 603 603 603 603 603 603 603 603 603 603 604 604 604 604
    2 Albania 667 667 667 667 667 668 668 668 668 668 668 668 668 668
    3 Algeria 715 716 717 718 719 720 721 722 723 724 725 726 727 728
    4 Andorra 1200 1200 1200 1200 1210 1210 1210 1210 1220 1220 1220 1220 1220 1230
    5 Angola 618 620 623 626 628 631 634 637 640 642 645 648 651 654
    6 Antigu… 757 757 757 757 757 757 757 758 758 758 758 758 758 758
    7 Argent… 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510
    8 Armenia 514 514 514 514 514 514 514 514 514 514 514 515 515 515
    9 Austra… 814 816 818 820 822 824 825 827 829 831 833 835 837 839
    10 Austria 1850 1850 1860 1870 1880 1880 1890 1900 1910 1920 1920 1930 1940 1950
    # … with 183 more rows, and 205 more variables: `1814`, `1815`, `1816`, `1817`,
    # `1818`, `1819`, `1820`, `1821`, `1822`, `1823`, `1824`,
    # `1825`, `1826`, `1827`, `1828`, `1829`, `1830`, `1831`,
    # `1832`, `1833`, `1834`, `1835`, `1836`, `1837`, `1838`,
    # `1839`, `1840`, `1841`, `1842`, `1843`, `1844`, `1845`,
    # `1846`, `1847`, `1848`, `1849`, `1850`, `1851`, `1852`, ...
    The original data looked like this
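
Going from a wide, one-column-per-year layout to one row per country and year can be sketched with tidyr. This is an illustration, not the code used in the talk: `pivot_longer()` was added to tidyr after this talk was given, the three-country table is a hand-copied excerpt of the slide, and the column name `value` is an assumption because the slide does not say which variable the cells hold.

```r
library(tidyr)
library(tibble)

# Hand-copied excerpt of the wide table: one row per country, one column per year.
wide <- tibble(
  country = c("Albania", "Algeria", "Angola"),
  `1800` = c(667, 715, 618),
  `1801` = c(667, 716, 620),
  `1802` = c(667, 717, 623)
)

# Stack the year columns so that each row is one country-year observation.
# `value` is a placeholder name; the slide does not name the measured variable.
long <- pivot_longer(
  wide,
  cols = -country,
  names_to = "year",
  values_to = "value",
  names_transform = list(year = as.integer)
)

long
```

The result has one row per country and year (nine rows here), with `year` stored as an integer rather than as a column name.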

  7. gapminder %>%
      filter(year == 2015) ->
      gapminder15
    Start with a single year
    %>% is called the pipe; pronounced “then”
    -> is called the reverse assignment operator; pronounced “creates”

  8. gapminder %>%
      filter(year == 2015) ->
      gapminder15
    Phonics are important!
    Take the gapminder data, then
    filter rows where year equals 2015, creating
    gapminder15 variable
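
The “then” and “creates” reading can be tried on any data frame. The sketch below is an illustration rather than the talk's code: it swaps in `airquality`, a dataset that ships with R, for the gapminder data, and the names `may_piped` and `may_nested` are made up for the example. It compares the piped, right-assigned form with a nested call that uses ordinary `<-` assignment.

```r
library(dplyr)

# Pipe form, read aloud: take airquality, then filter rows where Month
# equals 5, creating may_piped.
airquality %>%
  filter(Month == 5) ->
  may_piped

# The same computation as a nested call with the usual left assignment.
may_nested <- filter(airquality, Month == 5)

# Both spellings should produce the same data frame.
identical(may_piped, may_nested)
```

The two forms differ only in reading order: the pipe lets the code follow the order of the spoken sentence.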

  9. [Plot: empty axes, income on the x-axis and life_exp on the y-axis]
    gapminder15 %>%
      ggplot(aes(income, life_exp))


  10. [Plot: scatterplot of life_exp against income]
    gapminder15 %>%
      ggplot(aes(income, life_exp)) +
      geom_point()


  11. [Plot: scatterplot of life_exp against income]


  12. [Plot: scatterplot of life_exp against income, log10 x-axis]
    gapminder15 %>%
      ggplot(aes(income, life_exp)) +
      geom_point() +
      scale_x_log10()


  13. [Plot: scatterplot of life_exp against income, log10 x-axis; colour legend four_regions: africa, americas, asia, europe]
    gapminder15 %>%
      ggplot(aes(income, life_exp)) +
      geom_point(aes(colour = four_regions)) +
      scale_x_log10()


  14. [Plot: scatterplot of life_exp against income, log10 x-axis; size legend pop; colour legend four_regions: africa, americas, asia, europe]
    gapminder15 %>%
      ggplot(aes(income, life_exp)) +
      geom_point(aes(colour = four_regions, size = pop)) +
      scale_x_log10()


  15. [Plot: scatterplot of life_exp against income, log10 x-axis; size legend pop; colour legend four_regions: africa, americas, asia, europe]
    Your turn: What’s missing?

  16. gapminder15 %>%
      ggplot(aes(income, life_exp)) +
      geom_point(aes(fill = four_regions, size = pop), shape = 21) +
      scale_x_log10(breaks = 2^(-1:7) * 1000) +
      scale_size(range = c(1, 20), guide = FALSE) +
      scale_fill_manual(
        guide = FALSE,
        values = c(
          africa = "#60D2E6",
          americas = "#9AE847",
          asia = "#EC6475",
          europe = "#FBE84D"
        )
      ) +
      labs(
        x = "Income (GDP / capita)",
        y = "Life expectancy (years)"
      )
    It’s a lot more work to make an expository graphic
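
The `breaks = 2^(-1:7) * 1000` expression is compact enough to be worth unpacking. A small base R sketch follows; the variable name `breaks` is only for illustration.

```r
# -1:7 is the integer sequence -1, 0, 1, ..., 7 (unary minus binds tighter
# than the colon), so 2^(-1:7) is 0.5, 1, 2, ..., 128.
# Multiplying by 1000 gives axis breaks that double at every step.
breaks <- 2^(-1:7) * 1000
breaks
```

This yields nine breaks running from 500 to 128000, each twice the previous one, which suits a log-scaled income axis.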


  17. [Plot: bubble chart of Life expectancy (years) against Income (GDP / capita); x-axis breaks from 500 to 128000]
    Your turn: Can you spot the subtle problem?

  18. [Plot: bubble chart of Life expectancy (years) against Income (GDP / capita); x-axis breaks from 500 to 128000]

  19. data %>%
      arrange(desc(pop)) %>%
      ggplot(aes(income, life_exp)) +
      ...
    A little motivation
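
The slide does not say why the rows are sorted. One plausible reading, offered here as an assumption, is that `geom_point()` draws points in row order, so sorting by descending `pop` draws the largest bubbles first and leaves the smaller ones on top. The sketch below only checks the resulting row order, using three rows copied from the 2015 table earlier in the talk; `sample15` and `draw_order` are hypothetical names.

```r
library(dplyr)

# Three rows copied from the 2015 table shown earlier in the talk.
sample15 <- data.frame(
  country = c("Albania", "Algeria", "Andorra"),
  income = c(11000, 13700, 46600),
  life_exp = c(77.6, 77.3, 82.5),
  pop = c(2920000, 39900000, 78000),
  stringsAsFactors = FALSE
)

# Largest population first: under the assumption above, this is the order
# in which the bubbles would be drawn.
draw_order <- sample15 %>%
  arrange(desc(pop)) %>%
  pull(country)

draw_order
```

Algeria, with the largest population of the three, comes first and Andorra last, so the smallest bubble would end up on top.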

  20. gap_plot <- function(data) {
      data %>%
        arrange(desc(pop)) %>%
        ggplot(aes(income, life_exp)) +
        geom_point(aes(fill = four_regions, size = pop), shape = 21) +
        scale_x_log10(breaks = 2^(-1:7) * 1000) +
        scale_size(range = c(1, 20), guide = FALSE) +
        scale_fill_manual(values = c(
          africa = "#60D2E6",
          americas = "#9AE847",
          asia = "#EC6475",
          europe = "#FBE84D"
        )) +
        labs(
          x = "Income (GDP / capita)",
          y = "Life expectancy",
          fill = "Region"
        )
    }
    You can also turn your code into a function

  21. [Plot: Life expectancy against Income (GDP / capita); Region legend: asia]
    gapminder %>%
      filter(country == "New Zealand") %>%
      gap_plot()

  22. [Plot: Life expectancy against Income (GDP / capita); Region legend: africa, americas, asia, europe]
    gapminder %>%
      filter(year == 1900) %>%
      gap_plot()

  23. [Plot: Life expectancy against Income (GDP / capita); Region legend: africa, americas, asia, europe]
    gapminder %>%
      filter(year == 1905) %>%
      gap_plot()

  24. by_year <- gapminder %>%
      filter(year %% 5 == 0) %>%
      group_split(year)
    plots <- map(by_year, ~ gap_plot(.x))
    How can we do this with every plot?
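
The split-then-map pattern on this slide can be tried without the gapminder data. The sketch below is an illustration under stated assumptions: it uses `airquality`, a dataset that ships with R, split by `Month` instead of `year`, and `plot_one()` is a hypothetical stand-in for `gap_plot()`.

```r
library(dplyr)
library(purrr)
library(ggplot2)

# Hypothetical stand-in for gap_plot(): one scatterplot per piece of data.
plot_one <- function(data) {
  ggplot(data, aes(Temp, Ozone)) +
    geom_point(na.rm = TRUE)
}

# Split the data frame into a list with one element per month...
by_month <- airquality %>%
  group_split(Month)

# ...then call the plotting function once for each element.
plots <- map(by_month, ~ plot_one(.x))

length(plots)
```

`plots` is an ordinary list with one plot object per group, so any single plot can be shown with, for example, `plots[[1]]`.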

  25. Datasets Plots

  26. The tidyverse is a
    language for solving
    data science challenges
    with R code.

  27. [Diagram: Import, Tidy, Transform, Visualise, Model, Program]
    tibble, tidyr, purrr, magrittr, dplyr, forcats, hms, ggplot2,
    recipes, parsnip, readr, readxl, haven, xml2, lubridate, stringr
    tidyverse.org
    r4ds.had.co.nz

  28. The tidyverse is a
    language for solving
    data science challenges
    with R code.

  29. The disadvantages of code are obvious

  30. Why code?
    1. Code is text
    2. Code is read-able
    3. Code is reproducible

  31. ⌘C Copy
    ⌘V Paste

  32. Why code?
    1. Code is text
    2. Code is read-able
    3. Code is reproducible

  33. What have you done?

  34. Why code?
    1. Code is text
    2. Code is read-able
    3. Code is reproducible

  35. .Rmd Prose and code
    .md Prose and results
    .html Human shareable

  36. .Rmd Prose and code
    .md Prose and results
    .html .doc .tex .pdf .ppt ...

  37. Other skills

  38. SQL
    git(hub)
    marketing

  39. We wanted users to be able to begin in an interactive
    environment, where they did not consciously think of
    themselves as programming. Then as their needs became
    clearer and their sophistication increased, they should be
    able to slide gradually into programming, when the
    language and system aspects would become more
    important.
    — John Chambers, “Stages in the Evolution of S”

  40. https://unsplash.com/photos/8HyrGTYPQ68 — Eric Muhr
    Pit of success

  41. https://simplystatistics.org/2018/07/12/use-r-keynote-2018/

  42. # A tibble: 153 x 6
       Ozone Solar.R Wind Temp Month Day
     1    41     190  7.4   67     5   1
     2    36     118  8     72     5   2
     3    12     149 12.6   74     5   3
     4    18     313 11.5   62     5   4
     5    NA      NA 14.3   56     5   5
     6    28      NA 14.9   66     5   6
     7    23     299  8.6   65     5   7
     8    19      99 13.8   59     5   8
     9     8      19 20.1   61     5   9
    10    NA     194  8.6   69     5  10
    # … with 143 more rows

  43. library(dplyr)
    airquality %>%
      group_by(Month) %>%
      summarize(o3 = mean(Ozone, na.rm = TRUE))

  44. aggregate(
      airquality["Ozone"],
      airquality["Month"],
      mean, na.rm = TRUE
    )

  45. aggregate(
      airquality["Ozone"],
      airquality["Month"],
      mean, na.rm = TRUE
    )
    Why doesn’t airquality$Ozone work?

  46. aggregate(
      airquality["Ozone"],
      airquality["Month"],
      mean, na.rm = TRUE
    )
    Passing a function to another function

  47. aggregate(
      airquality["Ozone"],
      airquality["Month"],
      mean, na.rm = TRUE
    )
    Argument to mean(), not to aggregate()
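
The dplyr and base R versions from these slides can be run side by side on `airquality`, which ships with R. The sketch below is an illustration, not part of the talk; `by_dplyr`, `by_base`, and `dollar_attempt` are hypothetical names. It also tries the `$` form the slides ask about: `aggregate()` requires its `by` argument to be a list, and `airquality$Month` is a bare vector, whereas `airquality["Month"]` is a one-column data frame.

```r
library(dplyr)

# dplyr version from the talk.
by_dplyr <- airquality %>%
  group_by(Month) %>%
  summarize(o3 = mean(Ozone, na.rm = TRUE))

# Base R version from the talk. `mean` is passed as a function, and
# `na.rm = TRUE` is forwarded to mean() through aggregate()'s `...`.
by_base <- aggregate(
  airquality["Ozone"],
  airquality["Month"],
  mean, na.rm = TRUE
)

# Both approaches should give the same monthly means.
all.equal(by_dplyr$o3, by_base$Ozone)

# The `$` form fails because `by` is a vector, not a list; the error
# message is captured here instead of stopping the script.
dollar_attempt <- tryCatch(
  aggregate(airquality$Ozone, airquality$Month, mean, na.rm = TRUE),
  error = function(e) conditionMessage(e)
)
dollar_attempt
```

The base R call needs three ideas at once (single-bracket subsetting, functions as arguments, and argument forwarding), which is the contrast these slides draw with the dplyr version.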

  48. [Diagram: Import, Tidy, Transform, Visualise, Model, Program]
    tibble, tidyr, purrr, magrittr, dplyr, forcats, hms, ggplot2,
    recipes, parsnip, readr, readxl, haven, xml2, lubridate, stringr
    tidyverse.org
    r4ds.had.co.nz

  49. This work is licensed as Creative Commons
    Attribution-ShareAlike 4.0 International.
    To view a copy of this license, visit
    https://creativecommons.org/licenses/by-sa/4.0/