Slide 1

Slide 1 text

Welcome to the tidyverse
Hadley Wickham (@hadleywickham)
Chief Scientist, RStudio
February 2019

Slide 2

Slide 2 text

The tidyverse is a language for solving data science challenges with R code.

Slide 3

Slide 3 text

Your turn: What data do we need to recreate this plot?

Slide 4

Slide 4 text

We need five variables

# A tibble: 193 x 6
   country             four_regions  year income life_exp      pop
 1 Afghanistan         asia          2015   1750     57.9 33700000
 2 Albania             europe        2015  11000     77.6  2920000
 3 Algeria             africa        2015  13700     77.3 39900000
 4 Andorra             europe        2015  46600     82.5    78000
 5 Angola              africa        2015   6230     64   27900000
 6 Antigua and Barbuda americas      2015  20100     77.2    99900
 7 Argentina           americas      2015  19100     76.5 43400000
 8 Armenia             europe        2015   8180     75.4  2920000
 9 Australia           asia          2015  43800     82.6 23800000
10 Austria             europe        2015  44100     81.4  8680000
# … with 183 more rows

Slide 5

Slide 5 text

This is “tidy” data: in the tibble from the previous slide, each column is a variable and each row is an observation.

Slide 6

Slide 6 text

The original data looked like this

# A tibble: 193 x 220
   country `1800` `1801` `1802` `1803` `1804` `1805` `1806` `1807` `1808` `1809` `1810` `1811` `1812` `1813`
 1 Afghan…    603    603    603    603    603    603    603    603    603    603    604    604    604    604
 2 Albania    667    667    667    667    667    668    668    668    668    668    668    668    668    668
 3 Algeria    715    716    717    718    719    720    721    722    723    724    725    726    727    728
 4 Andorra   1200   1200   1200   1200   1210   1210   1210   1210   1220   1220   1220   1220   1220   1230
 5 Angola     618    620    623    626    628    631    634    637    640    642    645    648    651    654
 6 Antigu…    757    757    757    757    757    757    757    758    758    758    758    758    758    758
 7 Argent…   1510   1510   1510   1510   1510   1510   1510   1510   1510   1510   1510   1510   1510   1510
 8 Armenia    514    514    514    514    514    514    514    514    514    514    514    515    515    515
 9 Austra…    814    816    818    820    822    824    825    827    829    831    833    835    837    839
10 Austria   1850   1850   1860   1870   1880   1880   1890   1900   1910   1920   1920   1930   1940   1950
# … with 183 more rows, and 205 more variables: `1814`, `1815`, `1816`, `1817`,
#   `1818`, `1819`, `1820`, `1821`, `1822`, `1823`, `1824`,
#   `1825`, `1826`, `1827`, `1828`, `1829`, `1830`, `1831`,
#   `1832`, `1833`, `1834`, `1835`, `1836`, `1837`, `1838`,
#   `1839`, `1840`, `1841`, `1842`, `1843`, `1844`, `1845`,
#   `1846`, `1847`, `1848`, `1849`, `1850`, `1851`, `1852`, ...
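
Reshaping a wide table like this into the tidy layout could look roughly like the sketch below. It assumes a tidyr version that provides pivot_longer() (the talk predates it; gather() was the equivalent at the time), and income_wide is a hypothetical three-year excerpt built from values on the slide, not the real dataset.

```r
library(tibble)
library(tidyr)

# Hypothetical excerpt of the wide table: one column per year
income_wide <- tribble(
  ~country,  ~`1800`, ~`1801`, ~`1802`,
  "Albania",     667,     667,     667,
  "Algeria",     715,     716,     717
)

# Gather the year columns into one (year, income) pair per row
income_long <- pivot_longer(
  income_wide,
  cols = -country,
  names_to = "year",
  values_to = "income"
)

# Column names arrive as text, so convert year to a number
income_long$year <- as.integer(income_long$year)

income_long
```

Each country-year combination becomes its own row, which is the shape that filter() and ggplot() work with in the following slides.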

Slide 7

Slide 7 text

Start with a single year

gapminder %>%
  filter(year == 2015) ->
  gapminder15

%>% is called the pipe; pronounced “then”
-> is called the reverse assignment operator; pronounced “creates”

Slide 8

Slide 8 text

Phonics are important!

gapminder %>%
  filter(year == 2015) ->
  gapminder15

Read aloud: “Take the gapminder data, then filter rows where year equals 2015, creating the gapminder15 variable.”
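
To make the reading concrete, the sketch below compares the piped form with ordinary nesting and left assignment. The three-row gapminder table here is a hypothetical stand-in assembled from values shown on the slides.

```r
library(dplyr)

# Hypothetical miniature of the gapminder table
gapminder <- tibble(
  country = c("Albania", "Albania", "Algeria"),
  year    = c(1800, 2015, 2015),
  income  = c(667, 11000, 13700)
)

# Piped form: take gapminder, then filter, creating gapminder15
gapminder %>% filter(year == 2015) -> gapminder15

# The same computation written inside-out with the usual assignment arrow
gapminder15_alt <- filter(gapminder, year == 2015)

# Both spellings should produce the same table
all.equal(gapminder15, gapminder15_alt)
```

The pipe changes only the order in which the code is read, not what it computes.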

Slide 9

Slide 9 text

gapminder15 %>%
  ggplot(aes(income, life_exp))

[Plot: empty axes, income on the x-axis and life_exp on the y-axis]

Slide 10

Slide 10 text

gapminder15 %>%
  ggplot(aes(income, life_exp)) +
  geom_point()

[Plot: scatterplot of life_exp against income]

Slide 11

Slide 11 text

[Plot: scatterplot of life_exp against income]

Slide 12

Slide 12 text

gapminder15 %>%
  ggplot(aes(income, life_exp)) +
  geom_point() +
  scale_x_log10()

[Plot: scatterplot of life_exp against income, with income on a log10 x-axis]

Slide 13

Slide 13 text

gapminder15 %>%
  ggplot(aes(income, life_exp)) +
  geom_point(aes(colour = four_regions)) +
  scale_x_log10()

[Plot: scatterplot of life_exp against income (log10 x-axis), points coloured by four_regions: africa, americas, asia, europe]

Slide 14

Slide 14 text

gapminder15 %>%
  ggplot(aes(income, life_exp)) +
  geom_point(aes(colour = four_regions, size = pop)) +
  scale_x_log10()

[Plot: scatterplot of life_exp against income (log10 x-axis), points coloured by four_regions and sized by pop]

Slide 15

Slide 15 text

Your turn: What’s missing?

[Plot: scatterplot of life_exp against income (log10 x-axis), points coloured by four_regions and sized by pop]

Slide 16

Slide 16 text

It’s a lot more work to make an expository graphic

gapminder15 %>%
  ggplot(aes(income, life_exp)) +
  geom_point(aes(fill = four_regions, size = pop), shape = 21) +
  scale_x_log10(breaks = 2^(-1:7) * 1000) +
  scale_size(range = c(1, 20), guide = FALSE) +
  scale_fill_manual(
    guide = FALSE,
    values = c(
      africa = "#60D2E6",
      americas = "#9AE847",
      asia = "#EC6475",
      europe = "#FBE84D"
    )
  ) +
  labs(
    x = "Income (GDP / capita)",
    y = "Life expectancy (years)"
  )

Slide 17

Slide 17 text

Your turn: Can you spot the subtle problem?

[Plot: bubble chart of Life expectancy (years) against Income (GDP / capita), log x-axis with breaks from 500 to 128000]

Slide 18

Slide 18 text

[Plot: bubble chart of Life expectancy (years) against Income (GDP / capita), log x-axis with breaks from 500 to 128000]

Slide 19

Slide 19 text

A little motivation

data %>%
  arrange(desc(pop)) %>%
  ggplot(aes(income, life_exp)) +
  ...
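
One plausible reading of this snippet, stated as an assumption because the slide does not spell it out: ggplot2 draws points in row order, so sorting by descending population draws the largest bubbles first and leaves the smaller ones visible on top. The sketch below only demonstrates the sort order, on a hypothetical three-country table using populations from the earlier slide.

```r
library(dplyr)

# Hypothetical excerpt: populations taken from the 2015 tibble
countries <- tibble(
  country = c("Andorra", "Algeria", "Albania"),
  pop     = c(78000, 39900000, 2920000)
)

# Largest population first, so it would be drawn first (underneath)
draw_order <- countries %>%
  arrange(desc(pop))

draw_order$country
```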

Slide 20

Slide 20 text

You can also turn your code into a function

gap_plot <- function(data) {
  data %>%
    arrange(desc(pop)) %>%
    ggplot(aes(income, life_exp)) +
    geom_point(aes(fill = four_regions, size = pop), shape = 21) +
    scale_x_log10(breaks = 2^(-1:7) * 1000) +
    scale_size(range = c(1, 20), guide = FALSE) +
    scale_fill_manual(values = c(
      africa = "#60D2E6",
      americas = "#9AE847",
      asia = "#EC6475",
      europe = "#FBE84D"
    )) +
    labs(
      x = "Income (GDP / capita)",
      y = "Life expectancy",
      fill = "Region"
    )
}

Slide 21

Slide 21 text

gapminder %>%
  filter(country == "New Zealand") %>%
  gap_plot()

[Plot: Life expectancy against Income (GDP / capita) for New Zealand across all years; Region legend: asia]

Slide 22

Slide 22 text

gapminder %>%
  filter(year == 1900) %>%
  gap_plot()

[Plot: Life expectancy against Income (GDP / capita) for 1900; Region legend: africa, americas, asia, europe]

Slide 23

Slide 23 text

gapminder %>%
  filter(year == 1905) %>%
  gap_plot()

[Plot: Life expectancy against Income (GDP / capita) for 1905; Region legend: africa, americas, asia, europe]

Slide 24

Slide 24 text

How can we do this with every plot?

by_year <- gapminder %>%
  filter(year %% 5 == 0) %>%
  group_split(year)

plots <- map(by_year, ~ gap_plot(.x))
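
The same split-then-map pattern can be tried on a small table. In the sketch below, the six-row gapminder tibble is a hypothetical excerpt built from values on the slides, and nrow() stands in for gap_plot() so the example runs without ggplot2 or the full dataset.

```r
library(dplyr)
library(purrr)

# Hypothetical excerpt of the long-format data
gapminder <- tibble(
  country = c("Albania", "Albania", "Albania", "Algeria", "Algeria", "Algeria"),
  year    = c(1800, 1801, 1805, 1800, 1801, 1805),
  income  = c(667, 667, 668, 715, 716, 720)
)

# Keep every fifth year, then split into one tibble per year
by_year <- gapminder %>%
  filter(year %% 5 == 0) %>%
  group_split(year)

# Apply a function to each piece; nrow() is a stand-in for gap_plot()
rows_per_year <- map_int(by_year, nrow)

length(by_year)
rows_per_year
```

Swapping nrow() back for gap_plot() and map_int() for map() gives the list of plots from the slide.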

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

Datasets Plots

Slide 27

Slide 27 text

Demo

Slide 28

Slide 28 text

The tidyverse is a language for solving data science challenges with R code.

Slide 29

Slide 29 text

Import: readr, readxl, haven, xml2
Tidy: tibble, tidyr
Transform: dplyr, forcats, hms, lubridate, stringr
Visualise: ggplot2
Model: recipes, parsnip
Program: purrr, magrittr

tidyverse.org
r4ds.had.co.nz

Slide 30

Slide 30 text

The tidyverse is a language for solving data science challenges with R code.

Slide 31

Slide 31 text

The disadvantages of code are obvious

Slide 32

Slide 32 text

Why code?

1. Code is text
2. Code is read-able
3. Code is reproducible

Slide 33

Slide 33 text

⌘C Copy
⌘V Paste

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

Why code?

1. Code is text
2. Code is read-able
3. Code is reproducible

Slide 36

Slide 36 text

What have you done?

Slide 37

Slide 37 text

Why code?

1. Code is text
2. Code is read-able
3. Code is reproducible

Slide 38

Slide 38 text

.Rmd: Prose and code
.md: Prose and results
.html: Human shareable

Slide 39

Slide 39 text

.Rmd: Prose and code
.md: Prose and results
Output formats: .html, .doc, .tex, .pdf, .ppt, ...

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

Other skills

Slide 42

Slide 42 text

SQL git(hub) marketing

Slide 43

Slide 43 text

Conclusions

Slide 44

Slide 44 text

We wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important. — John Chambers, “Stages in the Evolution of S”

Slide 45

Slide 45 text

Pit of success

Photo by Eric Muhr: https://unsplash.com/photos/8HyrGTYPQ68

Slide 46

Slide 46 text

https://simplystatistics.org/2018/07/12/use-r-keynote-2018/

Slide 47

Slide 47 text

# A tibble: 153 x 6
   Ozone Solar.R  Wind  Temp Month   Day
 1    41     190   7.4    67     5     1
 2    36     118   8      72     5     2
 3    12     149  12.6    74     5     3
 4    18     313  11.5    62     5     4
 5    NA      NA  14.3    56     5     5
 6    28      NA  14.9    66     5     6
 7    23     299   8.6    65     5     7
 8    19      99  13.8    59     5     8
 9     8      19  20.1    61     5     9
10    NA     194   8.6    69     5    10
# … with 143 more rows

Slide 48

Slide 48 text

library(dplyr)

airquality %>%
  group_by(Month) %>%
  summarize(o3 = mean(Ozone, na.rm = TRUE))

Slide 49

Slide 49 text

aggregate(
  airquality["Ozone"],
  airquality["Month"],
  mean,
  na.rm = TRUE
)

Slide 50

Slide 50 text

Why doesn’t airquality$Ozone work?

aggregate(
  airquality["Ozone"],
  airquality["Month"],
  mean,
  na.rm = TRUE
)

Slide 51

Slide 51 text

Passing a function to another function

aggregate(
  airquality["Ozone"],
  airquality["Month"],
  mean,
  na.rm = TRUE
)

Slide 52

Slide 52 text

Argument to mean, not to aggregate()

aggregate(
  airquality["Ozone"],
  airquality["Month"],
  mean,
  na.rm = TRUE
)
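
As a rough check that the base R and dplyr versions compute the same thing, the sketch below runs both on the built-in airquality data and compares the monthly means. The comments are one reading of the slide captions, not the speaker's own explanation.

```r
library(dplyr)

# Base R version from the slides
base_result <- aggregate(
  airquality["Ozone"],   # single brackets keep a one-column data frame
  airquality["Month"],   # grouping variables are supplied as a list-like object
  mean,                  # the function itself is passed, not called
  na.rm = TRUE           # forwarded through ... to mean()
)

# dplyr version from the earlier slide
dplyr_result <- airquality %>%
  group_by(Month) %>%
  summarize(o3 = mean(Ozone, na.rm = TRUE))

# Compare the per-month means from the two approaches
all.equal(base_result$Ozone, dplyr_result$o3)
```

The dplyr version names the grouping and the summary explicitly, while the base version relies on the reader knowing how aggregate() treats its arguments.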

Slide 53

Slide 53 text

Import: readr, readxl, haven, xml2
Tidy: tibble, tidyr
Transform: dplyr, forcats, hms, lubridate, stringr
Visualise: ggplot2
Model: recipes, parsnip
Program: purrr, magrittr

tidyverse.org
r4ds.had.co.nz

Slide 54

Slide 54 text

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/