Data manipulation in the tidyverse with tidyr and dplyr

Data manipulation

Tidying data - Missing data
Missing data in R is usually represented by the value NA. NA is neither TRUE nor FALSE, so comparing anything with NA using == returns NA rather than a logical answer.
    x <- 3
    x == NA   ## NA
    x <- NA
    x == NA   ## NA
Instead, use is.na(x) to test for missing values.
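
A minimal base R sketch of the same point on a vector. The vector x here is made up for illustration.

```r
# Hypothetical vector with one missing value
x <- c(3, NA, 7)

# Comparing with NA never yields TRUE or FALSE, only NA
cmp <- x == NA

# is.na() is the reliable test for missingness
missing <- is.na(x)

# Many summary functions return NA unless missing values are removed
mean_all  <- mean(x)
mean_seen <- mean(x, na.rm = TRUE)

print(cmp)
print(missing)
print(mean_seen)
```

The na.rm = TRUE argument is the usual way to ignore missing values in functions such as mean(), sum() and sd().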

Tidying data - Missing data
For data frames, the tidyr package includes the drop_na() function for dropping rows with missing values.
    No columns specified: NA in any column -> remove row
    Column(s) specified: NA in those column(s) -> remove row
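
A short sketch of both behaviours, assuming tidyr is installed. The data frame df and its columns are invented for illustration.

```r
library(tidyr)

# Hypothetical data frame with missing values in two different columns
df <- data.frame(
  surname = c("Smith", "Jones", NA),
  age     = c(25, NA, 40),
  height  = c(180, 175, 168)
)

# No columns given: drop every row that has an NA anywhere
complete_rows <- drop_na(df)

# Column given: drop only rows where age is NA
known_age <- drop_na(df, age)

print(complete_rows)
print(known_age)
```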

Tidying data - Missing data Demo

Tidying data - what is tidy data?
There are three principles of tidy data:
1. Each variable must have its own column
2. Each observation must have its own row
3. Each value must have its own cell
What is a variable? What is an observation? In general, if it is to be mapped to the same aesthetic in a plot, it is a variable.

Tidying data - how do we do it?
Many data sets are "untidy", with the values of one variable spread across several columns, and we need to gather those columns into one...

Tidying data - how do we do it?
tidyr provides two functions for reshaping data: pivot_longer and pivot_wider

    City         May    June   July
    Cape Town    15.2   13.5   12.8
    Copenhagen   11.8   15.6   17.2

    tidy_data <- pivot_longer(
      df,
      cols = c(May, June, July),
      names_to = "Month",
      values_to = "Average_temperature"
    )
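
A self-contained version of the slide's example, assuming tidyr is installed. The data frame df is built by hand from the table on the slide.

```r
library(tidyr)

# The wide table from the slide: one column per month
df <- data.frame(
  City = c("Cape Town", "Copenhagen"),
  May  = c(15.2, 11.8),
  June = c(13.5, 15.6),
  July = c(12.8, 17.2)
)

# Gather the three month columns into a Month / Average_temperature pair
tidy_data <- pivot_longer(
  df,
  cols      = c(May, June, July),
  names_to  = "Month",
  values_to = "Average_temperature"
)

print(tidy_data)
```

Each city now contributes one row per month, so the two-row table becomes six rows.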

Tidying data - how do we do it?

    City         Month   Average_temperature
    Cape Town    May     15.2
    Cape Town    June    13.5
    Cape Town    July    12.8
    Copenhagen   May     11.8
    Copenhagen   June    15.6
    Copenhagen   July    17.2

    wide_data <- pivot_wider(
      tidy_data,
      names_from = Month,
      values_from = Average_temperature
    )
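
The reverse direction can be sketched the same way, assuming tidyr is installed. Here tidy_data is rebuilt by hand from the long table on the slide so the example runs on its own.

```r
library(tidyr)

# The long table from the slide: one row per city and month
tidy_data <- data.frame(
  City  = rep(c("Cape Town", "Copenhagen"), each = 3),
  Month = rep(c("May", "June", "July"), times = 2),
  Average_temperature = c(15.2, 13.5, 12.8, 11.8, 15.6, 17.2)
)

# Spread the Month values back out into one column per month
wide_data <- pivot_wider(
  tidy_data,
  names_from  = Month,
  values_from = Average_temperature
)

print(wide_data)
```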

Tidying data - how do we do it? Demo

Manipulating data
The main package in the tidyverse for data manipulation is dplyr. There are five main verbs that we will cover:
1. filter
2. arrange
3. select
4. mutate
5. summarise
These verbs share a common syntax, where the data frame is passed as the first argument and what to do with it is given in the remaining arguments.

Manipulating data - filter
filter is used to create a subset of observations based on the values of one or more columns. Test expressions are separated by commas that act as a logical AND operator.
    filter(df, age == 25, height >= 180, grepl("Smith", surname))
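
A runnable sketch of the call above, assuming dplyr is installed. The data frame df and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical data frame; heights are in centimetres
df <- data.frame(
  surname = c("Smith", "Smithson", "Jones", "Smith"),
  age     = c(25, 25, 25, 30),
  height  = c(182, 175, 190, 185)
)

# Comma-separated conditions must all hold for a row to be kept
tall_smiths <- filter(df, age == 25, height >= 180, grepl("Smith", surname))

# The same subset written with an explicit & between conditions
same_subset <- filter(df, age == 25 & height >= 180 & grepl("Smith", surname))

print(tall_smiths)
```

Rows where a condition evaluates to NA are dropped as well, which ties in with the missing-data slides.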

Manipulating data - ﬁlter Demo

Manipulating data - arrange
arrange is used to sort an entire data frame based on the values in a particular column. If more than one column is given, the data frame is sorted by the first column, and each later column is used to break ties in the earlier ones. Wrap a column in desc() to sort it in descending order.
    arrange(df, age, desc(height), surname)
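
A small sketch of the tie-breaking order, assuming dplyr is installed. The data frame df is invented for illustration.

```r
library(dplyr)

# Hypothetical data frame with ties in age and height
df <- data.frame(
  surname = c("Smith", "Jones", "Brown", "Adams"),
  age     = c(30, 25, 25, 25),
  height  = c(170, 180, 185, 180)
)

# Youngest first; within an age, tallest first; remaining ties by surname
sorted <- arrange(df, age, desc(height), surname)

print(sorted)
```

Adams and Jones share both age and height, so only the third column decides their order.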

Manipulating data - arrange Demo

Manipulating data - select
select is used to extract columns of data from a data frame. It works much like SELECT in SQL.
    select(df, surname, height)
    [SQL: SELECT surname, height FROM df]
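
A brief sketch of select on a made-up data frame, assuming dplyr is installed. Dropping with a minus sign and renaming inside select go slightly beyond what the slide shows.

```r
library(dplyr)

# Hypothetical data frame
df <- data.frame(
  surname = c("Smith", "Jones"),
  age     = c(25, 31),
  height  = c(180, 165)
)

# Keep only the named columns, in the order given
kept <- select(df, surname, height)

# Drop a column by prefixing it with a minus sign
dropped <- select(df, -age)

# Rename a column while selecting it
renamed <- select(df, family_name = surname, height)

print(kept)
```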

Manipulating data - select
You can also select columns using helper functions:
• starts_with("abc"): matches names that begin with "abc".
• ends_with("xyz"): matches names that end with "xyz".
• contains("ijk"): matches names that contain "ijk".
• matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. Regular expressions are a big subject, so we won't be going into them here.
• num_range("x", 1:3): matches x1, x2 and x3.
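
The helpers above can be tried on a made-up data frame, assuming dplyr is installed. The column names below are chosen only so that each helper matches something.

```r
library(dplyr)

# Hypothetical data frame whose column names exercise each helper
df <- data.frame(
  abc_id     = 1:2,
  size_xyz   = c(1.5, 2.5),
  my_ijk_col = c("a", "b"),
  address    = c("Main St", "High St"),
  x1 = c(0, 1), x2 = c(2, 3), x3 = c(4, 5)
)

by_prefix  <- names(select(df, starts_with("abc")))
by_suffix  <- names(select(df, ends_with("xyz")))
by_substr  <- names(select(df, contains("ijk")))
# The backslash is doubled because it sits inside an R string
by_regex   <- names(select(df, matches("(.)\\1")))
by_numbers <- names(select(df, num_range("x", 1:3)))

print(by_regex)
```

Only address contains a doubled character ("dd" and "ss"), so it is the only column the regular expression picks up.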

Manipulating data - select Demo

Manipulating data - mutate
mutate is used to compute new values from existing columns. Assigning to an existing column name overwrites that column; assigning to a new name appends the result as a new column at the end of the data frame. Another form of mutate is transmute, which returns only the mutated columns.
    mutate(df, height = height / 100)
    mutate(df, height_metres = height / 100)
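
A sketch of the three variants side by side, assuming dplyr is installed and that height is recorded in centimetres. The data frame df is invented for illustration.

```r
library(dplyr)

# Hypothetical data frame; height assumed to be in centimetres
df <- data.frame(
  surname = c("Smith", "Jones"),
  height  = c(180, 165)
)

# Reusing the column name overwrites the column in the result
overwritten <- mutate(df, height = height / 100)

# A new name appends a column and leaves the original untouched
appended <- mutate(df, height_metres = height / 100)

# transmute keeps only the columns it creates
only_new <- transmute(df, height_metres = height / 100)

print(appended)
```

None of these calls modify df itself; the result has to be assigned to keep it.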

Manipulating data - mutate Demo

Manipulating data - summarise
summarise is normally combined with group_by to obtain a single summary value for groups of observations. Here the pipe operator %>% becomes very useful, as we group the data and then send the result to the summarise function.
    group_by(df, gender) %>%
      summarise(mean_height = mean(height),
                std_dev_height = sd(height))
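
A runnable sketch of the grouped summary, assuming dplyr is installed. The data frame df and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical data frame with two observations per group
df <- data.frame(
  gender = c("F", "F", "M", "M"),
  height = c(160, 170, 175, 185)
)

# One output row per gender, with one column per summary statistic
summary_df <- df %>%
  group_by(gender) %>%
  summarise(
    mean_height    = mean(height),
    std_dev_height = sd(height)
  )

print(summary_df)
```

Without the group_by step, summarise would collapse the whole data frame into a single row.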

Manipulating data - summarise Demo


harp
