
Data wrangling & manipulation in R - Day 1 slides

Ruan van Mazijk

July 01, 2019

Transcript

  1. data_wrangling() && ("manipulation" %in% R)

    postgraduate_workshop(
      dept = "Biological Sciences",
      presenter = c("Ruan van Mazijk", "MSc candidate")
    )
    > logos()
    > face()
  3. > introduce( )

    • BSc + Hons here at UCT
    • Ecology & evolution
    • (Mostly plant) comparative biology
    • Biogeography
    • Been working with R for 4½ years
    • Every major project I’ve done…
  4. > introduce( )

    Schoenus compar, Silvermine, Table Mountain NP (R. van Mazijk 2018)
    Tetraria ustulata, Marloth NR (R. van Mazijk 2018)
    Tetraria thermalis, Silvermine, Table Mountain NP (R. van Mazijk 2018)
  6. > workshop$goals

    • More reproducible science
    • Save time by:
      • Automating repetitive tasks
      • Eliminating human error
    • Boost your skills
    • Think about your data programmatically
  7. > workshop$outline[1:3]

    DAY 1 Tidy data principles & tidyr
    DAY 2 Manipulating data & an intro to dplyr
    DAY 3 Extending your data with mutate(), summarise() & friends
  9. data <- read.csv("my-data.csv")

    data1 <- f(data, arg1 = "something")
    data2 <- g(data1, another.thing = "blah")
    data3 <- h(data2, a.setting = TRUE)
    data4 <- data3[data3$a.column == "cough", ]
  11. data <- read.csv("my-data.csv")

    data <- h(
      g(
        f(data, arg1 = "something"),
        another.thing = "blah"
      ),
      a.setting = TRUE
    )
  16. data <- read.csv("my-data.csv")

    data <- h(
      g(
        f(data, arg1 = "something"),
        another.thing = "blah"
      ),
      a.setting = TRUE
    )
    data <- data[data$a.column == "cough", ]
  17. Solution: the pipe! %>%

    Read: “then”
    { } [ ] [[ ]] <- = ( ) , " " ' '
  21. h(g(f(x)))

    x %>% f() %>% g() %>% h()

    data ↓ f() ↓ g() ↓ h() ↓ Some subsetting ↓ new data
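
The equivalence of the nested and piped forms can be checked directly. A minimal sketch, assuming toy definitions for f(), g() and h() (the slides leave them abstract, so the function bodies here are hypothetical) and the %>% operator from magrittr:

```r
library(magrittr)  # provides %>%

# Hypothetical stand-ins for the abstract f(), g() and h() on the slides
f <- function(x, arg1 = 1) x + arg1
g <- function(x, another.thing = 2) x * another.thing
h <- function(x, a.setting = TRUE) if (a.setting) sqrt(x) else x

x <- c(1, 7, 17)

# Nested: read from the inside out
nested <- h(g(f(x, arg1 = 1), another.thing = 2), a.setting = TRUE)

# Piped: read top to bottom, saying "then" at each %>%
piped <- x %>%
  f(arg1 = 1) %>%
  g(another.thing = 2) %>%
  h(a.setting = TRUE)

identical(nested, piped)
```

The pipe passes its left-hand side as the first argument of the call on its right, so both versions evaluate the same calls in the same order.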
  24. data <- read.csv("my-data.csv")

    data <- data %>%
      f(arg1 = "something") %>%
      g(another.thing = "blah") %>%
      h(a.setting = TRUE)

    data ↓ f() ↓ g() ↓ h() ↓ Some subsetting ↓ new data
  25. data <- read.csv("my-data.csv")

    data <- data %>%
      f(arg1 = "something") %>%
      g(another.thing = "blah") %>%
      h(a.setting = TRUE)
    data <- data[data$a.column == "cough", ] ???????
  27. An example data-collection scenario in biology

    Kogelberg NR (R. van Mazijk 2019)
    Observation Pk (R. van Mazijk 2018)
    Near Pearly Beach, Agulhas Plains (R. van Mazijk 2018)
  32. Site 1: Sp 1, Sp 2, Sp 3

    Site 2: Sp 1, Sp 2, Sp 3
    Site 3: Sp 1, Sp 2, Sp 3
  33. One way to lay out your collected data…

    Site 1: Sp 1, Sp 2, Sp 3
    Site 2: Sp 1, Sp 2, Sp 3
    Site 3: Sp 1, Sp 2, Sp 3
  35. Site 1: Sp 1, Sp 2, Sp 3

    Site 2: Sp 1, Sp 2, Sp 3
    Site 3: Sp 1, Sp 2, Sp 3
    ???
  38. TIDY DATA

    1. Each variable must have its own column
    2. Each observation must have its own row
    3. Each value, therefore, must have its own cell

    CC BY-NC-ND 3.0 Grolemund & Wickham 2017. R for Data Science
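
The three rules are easiest to see in a small hand-built table. A minimal sketch in base R, assuming made-up abundance counts for three sites and three species (the column names and all numbers are hypothetical):

```r
# Hypothetical counts: one row per site-by-species observation,
# one column per variable, one value per cell
tidy <- data.frame(
  site      = rep(c("Site 1", "Site 2", "Site 3"), each = 3),
  species   = rep(c("Sp 1", "Sp 2", "Sp 3"), times = 3),
  abundance = c(4, 1, 0, 0, 3, 5, 2, 0, 6)
)

# Because each variable has its own column, a per-site summary is one call
site_totals <- aggregate(abundance ~ site, data = tidy, FUN = sum)
site_totals
```

Whether a site-by-species count is the right unit of observation depends on the study; here it is simply assumed.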
  39. tidyr

    An R-package all about getting to this:

    CC BY-NC-ND 3.0 Grolemund & Wickham 2017. R for Data Science
  41. # Verbs to tidy your data

    # Untidy observations?
    gather()    # if > 1 observation per row
    spread()    # if observations live in > 1 row

    # Untidy variables?
    separate()  # if > 1 variable per column
    unite()     # if variables live in > 1 column
  45. Note the following when choosing tidyr-verbs:

    • Be clear on what your observations are:
      • I.e. which unit of your study “counts” as an observation
      • E.g. Leaf traits: plant leaf vs plant individual
      • E.g. Reproductive success: egg size vs clutch size
      • This will depend on your study &/or data!
    • Variables are discrete, separate ideas!
      • But again, this will depend on your study &/or data!
  48. # Untidy observations?

    gather() # if > 1 observation per row
    data %>% gather(key, value, ...)

    CC BY SA RStudio https://www.rstudio.com/resources/cheatsheets/
  50. # Untidy observations?

    gather() # if > 1 observation per row
    data %>% gather(year, cases, `1999`, `2000`)

    CC BY SA RStudio https://www.rstudio.com/resources/cheatsheets/
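
A runnable version of this call, as a sketch: it assumes the slide's `data` is the `table4a` example dataset that ships with tidyr, which has one column per year. The column names need backticks because they are not syntactic R names; bare numbers would be read as column positions.

```r
library(tidyr)  # also re-exports %>%

# table4a has columns: country, `1999`, `2000` (cases per year)
long <- table4a %>%
  gather(key = "year", value = "cases", `1999`, `2000`)

long
```

In tidyr 1.0.0 and later, gather() still works but is superseded by pivot_longer().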
  51. # Untidy observations?

    spread() # if observations live in > 1 row
    data %>% spread(key, value)

    CC BY SA RStudio https://www.rstudio.com/resources/cheatsheets/
  52. # Untidy observations?

    spread() # if observations live in > 1 row
    data %>% spread(type, count)

    CC BY SA RStudio https://www.rstudio.com/resources/cheatsheets/
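
A runnable version of this call, as a sketch: it assumes the slide's `data` is tidyr's bundled `table2` dataset, where each country-year observation is split over two rows (one for "cases", one for "population").

```r
library(tidyr)  # also re-exports %>%

# table2 has columns: country, year, type, count
wide <- table2 %>%
  spread(key = type, value = count)

wide
```

Each value of `type` becomes its own column, so the twelve rows collapse to six. In tidyr 1.0.0 and later, spread() is superseded by pivot_wider().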
  53. # Untidy variables?

    separate() # if > 1 variable per column
    data %>% separate(col, into, sep)
  55. # Untidy variables?

    separate() # if > 1 variable per column
    data %>% separate(col, into)

    CC BY SA RStudio https://www.rstudio.com/resources/cheatsheets/
  56. # Untidy variables?

    separate() # if > 1 variable per column
    data %>% separate(rate, c("cases", "pop"))

    CC BY SA RStudio https://www.rstudio.com/resources/cheatsheets/
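
A runnable version of this call, as a sketch: it assumes the slide's `data` is tidyr's bundled `table3` dataset, whose `rate` column stores cases and population together as text such as "745/19987071". The `convert = TRUE` argument is an addition to the slide's call, used here so the new columns come out as numbers rather than text.

```r
library(tidyr)  # also re-exports %>%

# table3 has columns: country, year, rate ("cases/population" as text)
split_up <- table3 %>%
  separate(rate, into = c("cases", "pop"), convert = TRUE)

split_up
```

With no `sep` given, separate() splits at any run of non-alphanumeric characters, which here is the "/".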
  57. # Untidy variables?

    unite() # if variables live in > 1 column
    data %>% unite(col, ..., sep)
  58. # Untidy variables?

    unite() # if variables live in > 1 column
    data %>% unite(col, ...)

    CC BY SA RStudio https://www.rstudio.com/resources/cheatsheets/
  59. # Untidy variables?

    unite() # if variables live in > 1 column
    data %>% unite(year, century, year)

    CC BY SA RStudio https://www.rstudio.com/resources/cheatsheets/
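
A runnable version of this call, as a sketch: it assumes the slide's `data` is tidyr's bundled `table5` dataset, where the year is split into a `century` column ("19", "20") and a two-digit `year` column. The `sep = ""` argument is an addition to the slide's call; unite() defaults to `sep = "_"`, which would give "19_99" instead of "1999".

```r
library(tidyr)  # also re-exports %>%

# table5 has columns: country, century, year, rate
united <- table5 %>%
  unite(col = "year", century, year, sep = "")

united
```

The new `year` column is still text; wrap it in as.integer() if a numeric year is needed.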