has the value NA. NA has no truthiness associated with it, and any test expression involving NA will always return NA. x <- 3 x == NA ## NA x <- NA x == NA ## NA Instead, use is.na(x)
package includes the drop_na() function for dropping rows with missing values. Any column with NA -> remove row Specified column(s) with NA -> remove row
principles of tidy data 1. Each variable must have its own column 2. Each observation must have its own row 3. Each value must have its own cell What is a variable? What is an observation? In general if it is to mapped to the same aesthetic in a plot, it is a variable.
two functions in tidyr: pivot_longer and pivot_wider City May June July Cape Town 15.2 13.5 12.8 Copenhagen 11.8 15.6 17.2 tidy_data <- pivot_longer( df, cols = c(May, June, July) names_to = "Month", values_to = "Average_temperature" )
Average_temperature Cape Town May 15.2 Cape Town June 13.5 Cape Town July 12.8 Copenhagen May 11.8 Copenhagen June 15.6 Copenhagen July 17.2 wide_data <- pivot_wider( tidy_data, names_from = Month values_from = Average_temperature )
manipulation is dplyr There are five main verbs that we will cover 1. filter 2. arrange 3. select 4. mutate 5. summarise Each of these verbs has a common syntax why the data frame is passed as the first argument, and what to do with it as other arguments.
subset of observations based on the values of one or more columns. Test expressions are separated by commas that act as a logical AND operator. filter(df, age == 25, height >= 180, grepl("Smith", surname))
entire data frame based on the values in a particular column. If more than one column is requested the data frame is sorted in the order in which the columns are given arrange(df, age, desc(height), surname)
helper functions • starts_with(“abc”): matches names that begin with “abc”. • ends_with(“xyz”): matches names that end with “xyz”. • contains(“ijk”): matches names that contain “ijk”. • matches(“(.)\1”): selects variables that match a regular expression. This one matches any variables that contain repeated characters. Regular expressions is a big subject so we won’t be going into that here. • num_range(“x”, 1:3): matches x1, x2 and x3.
function to a column and optionally append a new column of the result to the end of the data frame Another form of mutate is transmute, which only returns the mutated columns mutate(df, height = height / 100) mutate(df, height_metres = height / 100)
to obtain a single summary value for groups of observations. Here the pipe operator %>% becomes very useful as we group the data and then send the result to the summarise function group_by(df, gender) %>% summarise(mean_height = mean(height), std_dev_height = sd(height))