Data manipulation in the tidyverse with tidyr and dplyr

harp

October 15, 2019

Transcript

  1. Tidying data - Missing data

    Missing data in R often has the value NA. NA is neither TRUE nor FALSE, so a comparison involving NA returns NA rather than a usable answer.

        x <- 3
        x == NA
        ## NA
        x <- NA
        x == NA
        ## NA

    Instead, use is.na(x) to test for missing values.
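The behaviour of NA in comparisons can be checked on a small vector. This is a minimal sketch in base R; the vector `x` and the result names are made up for illustration.

```r
# Hypothetical example vector with one missing value
x <- c(3, NA, 7)

# Comparing against NA yields NA for every element, so it cannot find missing values
compared <- x == NA

# is.na() returns TRUE exactly where a value is missing
missing <- is.na(x)

# Many summary functions propagate NA unless told to drop it
mean_with_na <- mean(x)
mean_without_na <- mean(x, na.rm = TRUE)

print(compared)
print(missing)
print(mean_without_na)
```

The `na.rm = TRUE` argument is the usual way to ask base R summary functions to ignore missing values.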
  2. Tidying data - Missing data

    For data frames, the tidyr package includes the drop_na() function for dropping rows with missing values.

    With no columns specified: a row is removed if any column contains NA.
    With column(s) specified: a row is removed only if one of those columns contains NA.
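Both uses of drop_na() can be sketched on a small data frame. The data frame `df` and its columns are hypothetical and only serve as an illustration.

```r
library(tidyr)

# Hypothetical example data with missing values in two different columns
df <- data.frame(
  surname = c("Smith", "Jones", "Patel"),
  age     = c(25, NA, 31),
  height  = c(180, 172, NA)
)

# No columns given: drop every row that has an NA in any column
all_complete <- drop_na(df)

# Column given: drop only the rows where age is NA
age_complete <- drop_na(df, age)

print(all_complete)
print(age_complete)
```

In this sketch only the Smith row survives the first call, while the second call keeps Patel as well because the missing height is not considered.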
  3. Tidying data - what is tidy data?

    There are three principles of tidy data:

    1. Each variable must have its own column
    2. Each observation must have its own row
    3. Each value must have its own cell

    What is a variable? What is an observation? In general, if it is to be mapped to the same aesthetic in a plot, it is a variable.
  4. Tidying data - how do we do it?

    Many data sets are "untidy" and we need to get several variables into one column...
  5. Tidying data - how do we do it?

    There are two functions in tidyr for this: pivot_longer and pivot_wider

        City         May    June   July
        Cape Town    15.2   13.5   12.8
        Copenhagen   11.8   15.6   17.2

        tidy_data <- pivot_longer(
          df,
          cols = c(May, June, July),
          names_to = "Month",
          values_to = "Average_temperature"
        )
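A self-contained version of the pivot_longer() call might look like the following. The data frame is built by hand from the temperature table on the slide; the name `df` follows the slide.

```r
library(tidyr)

# Wide data: one column per month, values taken from the slide's table
df <- data.frame(
  City = c("Cape Town", "Copenhagen"),
  May  = c(15.2, 11.8),
  June = c(13.5, 15.6),
  July = c(12.8, 17.2)
)

# Gather the three month columns into a Month / Average_temperature pair
tidy_data <- pivot_longer(
  df,
  cols = c(May, June, July),
  names_to = "Month",
  values_to = "Average_temperature"
)

print(tidy_data)
```

The `cols` argument accepts tidyselect syntax, so `cols = May:July` or `cols = -City` should select the same columns here.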
  6. Tidying data - how do we do it?

        City         Month   Average_temperature
        Cape Town    May     15.2
        Cape Town    June    13.5
        Cape Town    July    12.8
        Copenhagen   May     11.8
        Copenhagen   June    15.6
        Copenhagen   July    17.2

        wide_data <- pivot_wider(
          tidy_data,
          names_from = Month,
          values_from = Average_temperature
        )
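The reverse step can be sketched in the same way. Here the long data frame is typed in directly from the slide's table so that the example runs on its own.

```r
library(tidyr)

# Long (tidy) data: one row per city and month, values taken from the slide
tidy_data <- data.frame(
  City  = rep(c("Cape Town", "Copenhagen"), each = 3),
  Month = rep(c("May", "June", "July"), times = 2),
  Average_temperature = c(15.2, 13.5, 12.8, 11.8, 15.6, 17.2)
)

# Spread the Month values back out into one column per month
wide_data <- pivot_wider(
  tidy_data,
  names_from = Month,
  values_from = Average_temperature
)

print(wide_data)
```

By default the new columns appear in the order in which the month names first occur in the data, which is why May, June, July come out unsorted alphabetically.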
  7. Manipulating data

    The main package in the tidyverse for data manipulation is dplyr. There are five main verbs that we will cover:

    1. filter
    2. arrange
    3. select
    4. mutate
    5. summarise

    Each of these verbs has a common syntax, where the data frame is passed as the first argument and what to do with it is given in the other arguments.
  8. Manipulating data - filter

    filter is used to create a subset of observations based on the values of one or more columns. Test expressions are separated by commas, which act as a logical AND operator.

        filter(df, age == 25, height >= 180, grepl("Smith", surname))
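To see that the commas behave like a logical AND, the filter call can be compared with an explicit `&` version. The data frame below is hypothetical; only the column names come from the slide.

```r
library(dplyr)

# Hypothetical example data using the slide's column names
df <- data.frame(
  surname = c("Smith", "Smithson", "Jones", "Smith"),
  age     = c(25, 25, 25, 40),
  height  = c(185, 175, 190, 182)
)

# Conditions separated by commas must all hold for a row to be kept
with_commas <- filter(df, age == 25, height >= 180, grepl("Smith", surname))

# The same conditions joined explicitly with &
with_and <- filter(df, age == 25 & height >= 180 & grepl("Smith", surname))

print(with_commas)
print(with_and)
```

Only the first row passes all three tests, and both forms return it. Note that grepl("Smith", surname) also matches "Smithson", which is excluded here only because of the height condition.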
  9. Manipulating data - arrange

    arrange is used to sort an entire data frame based on the values in a particular column. If more than one column is requested, the data frame is sorted by the columns in the order in which they are given, with later columns breaking ties in earlier ones.

        arrange(df, age, desc(height), surname)
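The tie-breaking order can be seen on a small example. The data frame is hypothetical and chosen so that each sort column is needed.

```r
library(dplyr)

# Hypothetical example data with ties in age and in height
df <- data.frame(
  surname = c("Smith", "Jones", "Patel", "Adams"),
  age     = c(30, 25, 25, 25),
  height  = c(170, 180, 185, 180)
)

# Sort by age ascending, then height descending, then surname ascending
sorted <- arrange(df, age, desc(height), surname)

print(sorted)
```

Among the three 25-year-olds, Patel comes first as the tallest, and Adams precedes Jones because their heights tie and surname decides.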
  10. Manipulating data - select

    select is used to extract columns of data from a data frame. It works in the same way as SQL SELECT.

        select(df, surname, height)
        [SQL: SELECT surname, height FROM df]
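A runnable sketch of select(), using a hypothetical data frame with the slide's column names. The negative selection at the end is an extra illustration that is not on the slide.

```r
library(dplyr)

# Hypothetical example data
df <- data.frame(
  surname = c("Smith", "Jones"),
  age     = c(25, 31),
  height  = c(180, 172)
)

# Keep only the named columns, in the order given
selected <- select(df, surname, height)

# A minus sign drops a column and keeps the rest
without_age <- select(df, -age)

print(selected)
print(without_age)
```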
  11. Manipulating data - select

    You can also select columns using helper functions:

    • starts_with("abc"): matches names that begin with "abc".
    • ends_with("xyz"): matches names that end with "xyz".
    • contains("ijk"): matches names that contain "ijk".
    • matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. Regular expressions are a big subject so we won't be going into them here.
    • num_range("x", 1:3): matches x1, x2 and x3.
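The helpers listed above can be tried on a data frame whose column names are invented so that each helper picks out something different. All column names here are hypothetical.

```r
library(dplyr)

# Hypothetical data frame; only the column names matter for this sketch
df <- data.frame(
  abc_id    = 1:2,
  score_xyz = c(0.5, 0.7),
  hijk      = c("a", "b"),
  address   = c("x", "y"),
  x1 = 1:2, x2 = 3:4, x3 = 5:6
)

begins   <- names(select(df, starts_with("abc")))
ends     <- names(select(df, ends_with("xyz")))
has_ijk  <- names(select(df, contains("ijk")))
# "(.)\\1" is any character followed by itself, e.g. the "dd" in "address"
repeated <- names(select(df, matches("(.)\\1")))
numbered <- names(select(df, num_range("x", 1:3)))

print(list(begins, ends, has_ijk, repeated, numbered))
```

The backslash is doubled because it has to be escaped inside an R string before it reaches the regular expression engine.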
  12. Manipulating data - mutate

    mutate is used to apply a function to a column, either overwriting that column or appending the result as a new column at the end of the data frame. Another form of mutate is transmute, which only returns the mutated columns.

        mutate(df, height = height / 100)          # overwrites height
        mutate(df, height_metres = height / 100)   # appends height_metres
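The difference between overwriting, appending and transmute can be sketched side by side. The data frame is hypothetical, with heights assumed to be in centimetres.

```r
library(dplyr)

# Hypothetical example data, heights assumed to be in centimetres
df <- data.frame(
  surname = c("Smith", "Jones"),
  height  = c(180, 172)
)

# Reusing an existing name overwrites that column
overwritten <- mutate(df, height = height / 100)

# A new name appends a column at the end and keeps the original
appended <- mutate(df, height_metres = height / 100)

# transmute keeps only the columns it creates
only_new <- transmute(df, height_metres = height / 100)

print(overwritten)
print(appended)
print(only_new)
```

Recent dplyr releases, as far as I am aware, mark transmute() as superseded in favour of mutate() with `.keep = "none"`, although transmute() still works.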
  13. Manipulating data - summarise

    summarise is normally combined with group_by to obtain a single summary value for groups of observations. Here the pipe operator %>% becomes very useful, as we group the data and then send the result to the summarise function.

        group_by(df, gender) %>%
          summarise(mean_height = mean(height), std_dev_height = sd(height))
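A self-contained sketch of the grouped summary, using a small hypothetical data frame with the slide's column names.

```r
library(dplyr)

# Hypothetical example data: two observations per group
df <- data.frame(
  gender = c("F", "F", "M", "M"),
  height = c(160, 170, 175, 185)
)

# Group by gender, then reduce each group to one row of summary values
height_summary <- df %>%
  group_by(gender) %>%
  summarise(
    mean_height    = mean(height),
    std_dev_height = sd(height)
  )

print(height_summary)
```

The result has one row per group. Writing `df %>% group_by(gender)` is equivalent to `group_by(df, gender)`, because the pipe passes its left-hand side as the first argument.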