
Welcome to the tidyverse


Keynote presentation at ASA CSP conference.

The tidyverse is a language for solving data science challenges with R code. This talk covers three things: how the tidyverse lets you work with data with little programming knowledge, my model of data science, and why coding is so important.

Hadley Wickham

February 15, 2019


Transcript

  1. Hadley Wickham 

    @hadleywickham

    Chief Scientist, RStudio
    Welcome to the tidyverse
    February 2019

  2. The tidyverse is a language for solving data science challenges with R code.

  3. Your turn: What data do we need to recreate this plot?

  4. We need five variables

    # A tibble: 193 x 6
       country             four_regions  year income life_exp      pop
     1 Afghanistan         asia          2015   1750     57.9 33700000
     2 Albania             europe        2015  11000     77.6  2920000
     3 Algeria             africa        2015  13700     77.3 39900000
     4 Andorra             europe        2015  46600     82.5    78000
     5 Angola              africa        2015   6230     64   27900000
     6 Antigua and Barbuda americas      2015  20100     77.2    99900
     7 Argentina           americas      2015  19100     76.5 43400000
     8 Armenia             europe        2015   8180     75.4  2920000
     9 Australia           asia          2015  43800     82.6 23800000
    10 Austria             europe        2015  44100     81.4  8680000
    # … with 183 more rows

  5. This is “tidy” data: each column is a variable and each row is an observation.

    # A tibble: 193 x 6
       country             four_regions  year income life_exp      pop
     1 Afghanistan         asia          2015   1750     57.9 33700000
     2 Albania             europe        2015  11000     77.6  2920000
     3 Algeria             africa        2015  13700     77.3 39900000
     4 Andorra             europe        2015  46600     82.5    78000
     5 Angola              africa        2015   6230     64   27900000
     6 Antigua and Barbuda americas      2015  20100     77.2    99900
     7 Argentina           americas      2015  19100     76.5 43400000
     8 Armenia             europe        2015   8180     75.4  2920000
     9 Australia           asia          2015  43800     82.6 23800000
    10 Austria             europe        2015  44100     81.4  8680000
    # … with 183 more rows

  6. # A tibble: 193 x 220
    country `1800` `1801` `1802` `1803` `1804` `1805` `1806` `1807` `1808` `1809` `1810` `1811` `1812` `1813`

    1 Afghan… 603 603 603 603 603 603 603 603 603 603 604 604 604 604
    2 Albania 667 667 667 667 667 668 668 668 668 668 668 668 668 668
    3 Algeria 715 716 717 718 719 720 721 722 723 724 725 726 727 728
    4 Andorra 1200 1200 1200 1200 1210 1210 1210 1210 1220 1220 1220 1220 1220 1230
    5 Angola 618 620 623 626 628 631 634 637 640 642 645 648 651 654
    6 Antigu… 757 757 757 757 757 757 757 758 758 758 758 758 758 758
    7 Argent… 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510
    8 Armenia 514 514 514 514 514 514 514 514 514 514 514 515 515 515
    9 Austra… 814 816 818 820 822 824 825 827 829 831 833 835 837 839
    10 Austria 1850 1850 1860 1870 1880 1880 1890 1900 1910 1920 1920 1930 1940 1950
    # … with 183 more rows, and 205 more variables: `1814`, `1815`, `1816`, `1817`,
    # `1818`, `1819`, `1820`, `1821`, `1822`, `1823`, `1824`,
    # `1825`, `1826`, `1827`, `1828`, `1829`, `1830`, `1831`,
    # `1832`, `1833`, `1834`, `1835`, `1836`, `1837`, `1838`,
    # `1839`, `1840`, `1841`, `1842`, `1843`, `1844`, `1845`,
    # `1846`, `1847`, `1848`, `1849`, `1850`, `1851`, `1852`, ...
    The original data looked like this
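One way to get from a wide layout like this to the tidy one is to pivot the year columns into rows. The sketch below is an assumption about how that reshaping could be done, not the code used in the talk. It uses `tidyr::pivot_longer()` on a tiny tibble, `income_wide`, which is a hypothetical stand-in holding four values copied from the printout above.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data: one row per country, one column per year.
# The four numbers are taken from the wide printout on the slide.
income_wide <- tibble(
  country = c("Albania", "Algeria"),
  `1800` = c(667, 715),
  `1801` = c(667, 716)
)

# Turn every column except `country` into (year, income) pairs.
income_tidy <- pivot_longer(
  income_wide,
  cols = -country,
  names_to = "year",
  values_to = "income"
)

# Column names arrive as text, so convert the year to an integer.
income_tidy$year <- as.integer(income_tidy$year)

income_tidy
```

After the pivot each row is one country in one year, which is the variable-per-column shape the plotting code expects.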

  7. Start with a single year

    gapminder %>%
      filter(year == 2015) ->
      gapminder15

    %>% is called the pipe; pronounced “then”
    -> is called the reverse assignment operator; pronounced “creates”

  8. Phonics are important!

    gapminder %>%
      filter(year == 2015) ->
      gapminder15

    Take the gapminder data, then filter rows where year equals 2015, creating the gapminder15 variable
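To make the left-to-right reading concrete, here is a minimal sketch on made-up data. `gapminder_toy`, `toy15`, and `toy15_alt` are hypothetical names, not objects from the talk. It shows that the pipe-then-assign form builds the same object as the more common leftward assignment.

```r
library(dplyr)
library(tibble)

# Made-up rows standing in for the gapminder data.
gapminder_toy <- tibble(
  country = c("Albania", "Albania", "Algeria"),
  year = c(2014, 2015, 2015)
)

# Read aloud: take gapminder_toy, then filter rows where year
# equals 2015, creating toy15.
gapminder_toy %>%
  filter(year == 2015) ->
  toy15

# The same result written with the usual leftward assignment.
toy15_alt <- filter(gapminder_toy, year == 2015)

all.equal(toy15, toy15_alt)
```

The two forms differ only in reading order; which one to use is a matter of taste.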

  9. [Plot: empty axes, life_exp (y) against income (x)]

    gapminder15 %>%
      ggplot(aes(income, life_exp))


  10. [Plot: scatterplot of life_exp (y) against income (x)]

    gapminder15 %>%
      ggplot(aes(income, life_exp)) +
      geom_point()


  11. [Plot: scatterplot of life_exp (y) against income (x)]


  12. [Plot: scatterplot of life_exp (y) against income (x, log scale)]

    gapminder15 %>%
      ggplot(aes(income, life_exp)) +
      geom_point() +
      scale_x_log10()


  13. [Plot: scatterplot of life_exp (y) against income (x, log scale); points coloured by four_regions: africa, americas, asia, europe]

    gapminder15 %>%
      ggplot(aes(income, life_exp)) +
      geom_point(aes(colour = four_regions)) +
      scale_x_log10()


  14. [Plot: scatterplot of life_exp (y) against income (x, log scale); points coloured by four_regions and sized by pop]

    gapminder15 %>%
      ggplot(aes(income, life_exp)) +
      geom_point(aes(colour = four_regions, size = pop)) +
      scale_x_log10()
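The plot on these slides is built up one `+` at a time. The sketch below repeats that build-up on four rows copied from the 2015 table shown earlier in the deck, so it can be run without the full dataset. The names `toy15`, `p_axes`, `p_points`, and `p_log` are illustrative assumptions, not objects from the talk.

```r
library(ggplot2)

# Four rows copied from the 2015 table shown earlier in the deck.
toy15 <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria", "Andorra"),
  four_regions = c("asia", "europe", "africa", "europe"),
  income = c(1750, 11000, 13700, 46600),
  life_exp = c(57.9, 77.6, 77.3, 82.5),
  pop = c(33700000, 2920000, 39900000, 78000)
)

# Step 1: data and axes only, nothing drawn yet.
p_axes <- ggplot(toy15, aes(income, life_exp))

# Step 2: add points, mapping region to colour and population to size.
p_points <- p_axes + geom_point(aes(colour = four_regions, size = pop))

# Step 3: put income on a log scale.
p_log <- p_points + scale_x_log10()

p_log
```

Keeping each intermediate plot in its own variable makes it easy to print any stage and see what the latest layer changed.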


  15. [Plot: scatterplot of life_exp (y) against income (x, log scale); points coloured by four_regions and sized by pop]

    Your turn: What’s missing?

  16. It’s a lot more work to make an expository graphic

    gapminder15 %>%
      ggplot(aes(income, life_exp)) +
      geom_point(aes(fill = four_regions, size = pop), shape = 21) +
      scale_x_log10(breaks = 2^(-1:7) * 1000) +
      scale_size(range = c(1, 20), guide = "none") +
      scale_fill_manual(
        guide = "none",
        values = c(
          africa = "#60D2E6",
          americas = "#9AE847",
          asia = "#EC6475",
          europe = "#FBE84D"
        )
      ) +
      labs(
        x = "Income (GDP / capita)",
        y = "Life expectancy (years)"
      )


  17. [Plot: bubble chart of Life expectancy (years) against Income (GDP / capita), x-axis breaks from 500 to 128000]

    Your turn: Can you spot the subtle problem?

  18. [Plot: bubble chart of Life expectancy (years) against Income (GDP / capita), x-axis breaks from 500 to 128000]

  19. A little motivation

    data %>%
      arrange(desc(pop)) %>%
      ggplot(aes(income, life_exp)) +
      ...
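The `arrange(desc(pop))` step matters because ggplot2 draws points in row order, so earlier rows end up underneath later ones. Sorting by descending population presumably addresses the subtle problem from the earlier slide: large bubbles hiding small ones. Here is a minimal sketch of the ordering on three rows taken from the 2015 table; `toy15` and `draw_order` are hypothetical names.

```r
library(dplyr)
library(tibble)

# Three countries from the 2015 table, deliberately out of order.
toy15 <- tibble(
  country = c("Andorra", "Algeria", "Albania"),
  pop = c(78000, 39900000, 2920000)
)

# Largest population first: it is drawn first and so sits at the
# bottom, and the smallest country is drawn last, on top.
draw_order <- toy15 %>%
  arrange(desc(pop))

draw_order$country
```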

  20. You can also turn your code into a function

    gap_plot <- function(data) {
      data %>%
        arrange(desc(pop)) %>%
        ggplot(aes(income, life_exp)) +
        geom_point(aes(fill = four_regions, size = pop), shape = 21) +
        scale_x_log10(breaks = 2^(-1:7) * 1000) +
        scale_size(range = c(1, 20), guide = "none") +
        scale_fill_manual(values = c(
          africa = "#60D2E6",
          americas = "#9AE847",
          asia = "#EC6475",
          europe = "#FBE84D"
        )) +
        labs(
          x = "Income (GDP / capita)",
          y = "Life expectancy",
          fill = "Region"
        )
    }

  21. [Plot: Life expectancy against Income (GDP / capita); legend Region: asia]

    gapminder %>%
      filter(country == "New Zealand") %>%
      gap_plot()

  22. [Plot: Life expectancy against Income (GDP / capita); legend Region: africa, americas, asia, europe]

    gapminder %>%
      filter(year == 1900) %>%
      gap_plot()

  23. [Plot: Life expectancy against Income (GDP / capita); legend Region: africa, americas, asia, europe]

    gapminder %>%
      filter(year == 1905) %>%
      gap_plot()

  24. How can we do this with every plot?

    by_year <- gapminder %>%
      filter(year %% 5 == 0) %>%
      group_split(year)
    plots <- map(by_year, ~ gap_plot(.x))
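The split-then-map pattern can be tried without the real data or the plotting function. The sketch below uses a made-up tibble, `toy`, and counts rows in each piece instead of calling `gap_plot()`. Both the data and the row-counting step are illustrative assumptions.

```r
library(dplyr)
library(purrr)
library(tibble)

# Made-up data: two years divisible by 5 and one that is not.
toy <- tibble(
  year = c(1900, 1900, 1903, 1905, 1905),
  income = c(1000, 2000, 1500, 1100, 2100)
)

# Keep every fifth year, then split into one tibble per year.
by_year <- toy %>%
  filter(year %% 5 == 0) %>%
  group_split(year)

# Apply the same function to every piece. nrow() is a stand-in
# for gap_plot(), which would return one plot per year.
rows_per_year <- map(by_year, ~ nrow(.x))

length(by_year)
```

Swapping `nrow(.x)` for a plotting function gives a list of plots, one per year, which is the shape of `plots` on the slide.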

  25. [Slide contains no text]

  26. Datasets Plots


  27. Demo


  28. The tidyverse is a language for solving data science challenges with R code.

  29. Workflow: Import, Tidy, Transform, Visualise, Model, Program
    Packages: tibble, tidyr, purrr, magrittr, dplyr, forcats, hms, ggplot2, recipes, parsnip, readr, readxl, haven, xml2, lubridate, stringr
    tidyverse.org
    r4ds.had.co.nz

  30. The tidyverse is a language for solving data science challenges with R code.

  31. The disadvantages of code are obvious


  32. Why code?
    1. Code is text
    2. Code is read-able
    3. Code is reproducible

  33. ⌘C Copy
    ⌘V Paste

  34. [Slide contains no text]

  35. Why code?
    1. Code is text
    2. Code is read-able
    3. Code is reproducible

  36. What have you done?


  37. Why code?
    1. Code is text
    2. Code is read-able
    3. Code is reproducible

  38. .Rmd Prose and code
    .md Prose and results
    .html Human shareable


  39. .Rmd Prose and code
    .md Prose and results
    .html .doc .tex .pdf .ppt ...
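The first arrow in this chain, from prose and code to prose and results, can be sketched with `knitr::knit()`. The document content and file names below are made up for illustration. The later step from `.md` to `.html` and other formats is handled by pandoc, which `rmarkdown::render()` drives together with knitr in a single call.

```r
library(knitr)

# Build a code fence from a string so this example does not
# contain a literal fence of its own.
fence <- strrep("`", 3)

# Write a tiny, made-up R Markdown document to a temporary file.
rmd_path <- tempfile(fileext = ".Rmd")
md_path <- sub("\\.Rmd$", ".md", rmd_path)
writeLines(
  c("Some prose.", "", paste0(fence, "{r}"), "1 + 1", fence),
  rmd_path
)

# .Rmd (prose and code) becomes .md (prose and results).
knit(rmd_path, output = md_path, quiet = TRUE)

readLines(md_path)
```

The resulting `.md` file holds the original prose, the code, and the computed result, ready to be converted into a shareable format.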

  40. [Slide contains no text]

  41. Other skills


  42. SQL
    git(hub)
    marketing


  43. Conclusions


  44. We wanted users to be able to begin in an interactive
    environment, where they did not consciously think of
    themselves as programming. Then as their needs became
    clearer and their sophistication increased, they should be
    able to slide gradually into programming, when the
    language and system aspects would become more
    important.
    — John Chambers, “Stages in the Evolution of S”


  45. https://unsplash.com/photos/8HyrGTYPQ68 — Eric Muhr
    Pit of success


  46. https://simplystatistics.org/2018/07/12/use-r-keynote-2018/


  47. # A tibble: 153 x 6
       Ozone Solar.R  Wind  Temp Month   Day
     1    41     190   7.4    67     5     1
     2    36     118   8      72     5     2
     3    12     149  12.6    74     5     3
     4    18     313  11.5    62     5     4
     5    NA      NA  14.3    56     5     5
     6    28      NA  14.9    66     5     6
     7    23     299   8.6    65     5     7
     8    19      99  13.8    59     5     8
     9     8      19  20.1    61     5     9
    10    NA     194   8.6    69     5    10
    # … with 143 more rows

  48. library(dplyr)
    airquality %>%
      group_by(Month) %>%
      summarize(o3 = mean(Ozone, na.rm = TRUE))

  49. aggregate(
      airquality["Ozone"],
      airquality["Month"],
      mean, na.rm = TRUE
    )

  50. aggregate(
      airquality["Ozone"],
      airquality["Month"],
      mean, na.rm = TRUE
    )
    Why doesn’t airquality$Ozone work?

  51. aggregate(
      airquality["Ozone"],
      airquality["Month"],
      mean, na.rm = TRUE
    )
    mean: Passing a function to another function

  52. aggregate(
      airquality["Ozone"],
      airquality["Month"],
      mean, na.rm = TRUE
    )
    na.rm = TRUE: Argument to mean, not to aggregate()
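The questions raised on the last few slides can be checked directly in base R, since `airquality` ships with the datasets package. The sketch below is one possible reading of the slide's question, not the speaker's own answer: `aggregate()` requires `by` to be a list, which a one-column data frame is and a bare vector from `$` is not. `by_month` and `res` are hypothetical names.

```r
# Single-bracket subsetting returns one-column data frames. They are
# lists and they keep their column names, so the result is labelled.
by_month <- aggregate(
  airquality["Ozone"],
  airquality["Month"],
  mean,           # a function passed to another function
  na.rm = TRUE    # forwarded to mean(), not consumed by aggregate()
)
by_month

# With `$` both inputs are bare vectors. `by` is then not a list,
# so the call fails. try() captures the error instead of stopping.
res <- try(
  aggregate(airquality$Ozone, airquality$Month, mean, na.rm = TRUE),
  silent = TRUE
)
inherits(res, "try-error")
```

Compared with the dplyr version on the earlier slide, this call needs the reader to know about list inputs, function arguments, and argument forwarding, all before computing one monthly mean.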

  53. Workflow: Import, Tidy, Transform, Visualise, Model, Program
    Packages: tibble, tidyr, purrr, magrittr, dplyr, forcats, hms, ggplot2, recipes, parsnip, readr, readxl, haven, xml2, lubridate, stringr
    tidyverse.org
    r4ds.had.co.nz

  54. This work is licensed as Creative Commons Attribution-ShareAlike 4.0 International.
    To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/