This presentation covers major computer programming principles and good practices that help to make R functions intuitive to other R users and our future selves.
command function
getwd() # query function

# rather than conditioning the command or query on the input
wd(dir) # if dir variable exists
wd() # if dir variable doesn’t exist
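The same separation can be sketched for a user-defined object. The example below is only an illustration and is not taken from the slides: `make_counter`, `increment`, and `value` are hypothetical names. `increment()` is a command (it changes state and returns nothing visible), while `value()` is a query (it reports state and changes nothing).

```r
# Hypothetical counter used only to illustrate command-query separation.
make_counter <- function() {
  count <- 0

  # Command: changes state and returns NULL invisibly.
  increment <- function() {
    count <<- count + 1
    invisible(NULL)
  }

  # Query: reports state without changing it.
  value <- function() count

  list(increment = increment, value = value)
}

counter <- make_counter()
counter$increment()
counter$increment()
counter$value()
```

Because the query has no side effects, it can be called any number of times without altering what the next command does.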
na.rm = TRUE)
[1] 1
sum(c(1, NA), na.rm = FALSE)
[1] NA

# Instead, write two functions:
# one for the TRUE case; and
# one for the FALSE case
sum_without_na <- function(...) sum(..., na.rm = TRUE)
sum_with_na <- function(...) sum(..., na.rm = FALSE)
na.action=na.fail)

Question: How can randomForest fit a model on the data if the data arg is NULL?

## Is this better?
randomForest(formula, data, data.env = parent.frame(), ..., subset, na.

Answer (from the function documentation):
`data`: an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from.

NULL case: the model variables come from the calling environment, parent.frame()
Non-NULL case: the model variables come from `data`, an argument in the function's own environment()
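The two lookup cases can be reproduced without randomForest. The sketch below uses `stats::lm` as a stand-in, on the assumption that it follows the same optional `data` convention for formula interfaces; the object names `fit_with_data` and `fit_without_data` are made up for the illustration.

```r
# Stand-in example with stats::lm, which also treats `data` as optional.

# Case 1: `data` is supplied, so variables are looked up in the data frame.
fit_with_data <- lm(mpg ~ wt, data = mtcars)

# Case 2: `data` is omitted, so variables are looked up in the environment
# where the formula was created (here, the environment running this code).
mpg <- mtcars$mpg
wt <- mtcars$wt
fit_without_data <- lm(mpg ~ wt)

# Both calls estimate the same coefficients.
all.equal(coef(fit_with_data), coef(fit_without_data))
```

The second call works only because `mpg` and `wt` happen to exist in the surrounding environment, which is the hidden dependency the slide is questioning.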
thing is fitting a random forest model on data

randomForest has a few prerequisites for its data variable:
1. The data variable is supplied;
2. The data is a data frame; and
3. The data frame has no NA values.

randomForest's current implementation responds to violations of these prerequisites by:
1. Searching the parent environment for data;
2. Prompting an error; and
3. Prompting an error or treating NA values (depending on `na.action`).

If randomForest were truly doing one thing, then any assumption violation should prompt an ERROR.
does ONE THING, it fits a random forest on the data;
## onlyRandomForest doesn’t handle errors from external sources.
onlyRandomForest <- function(formula, data, ...){
  # Both arguments must be supplied explicitly
  stopifnot(isFALSE(missing(formula)), isFALSE(missing(data)))
  # is.data.frame() also accepts data frame subclasses such as tibbles
  stopifnot(is.data.frame(data))
  # No missing values are allowed anywhere in the data
  stopifnot(isFALSE(any(is.na(data))))
  randomForest::randomForest(formula, data, ...)
}

## Passing data with NAs to onlyRandomForest prompts an error
data <- mtcars
data[1:2, "hp"] <- NA
onlyRandomForest("mpg ~ .", data)
Error in onlyRandomForest("mpg ~ .", data) :
  isFALSE(any(is.na(data))) is not TRUE
data, ...){
  assertive::assert_is_data.frame(data)
  assertive::assert_all_are_not_na(data)
  randomForest::randomForest(formula, data, ...)
}

## Passing data with NAs to onlyRandomForest prompts an informative error
data <- mtcars
data[1:2, "hp"] <- NA
onlyRandomForest("mpg ~ .", data)
Error in onlyRandomForest("mpg ~ .", data) :
  is_not_na : The values of data are sometimes NA.
There were 2 failures:
  Position Value   Cause
1       97    NA missing
2       98    NA missing
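If adding a dependency on assertive is not an option, readable messages can also be sketched in base R. The example below is an assumption-laden alternative, not the approach shown on the slide: `check_model_data` is a hypothetical helper, and it relies on named arguments to `stopifnot()`, which serve as error messages in R 4.0.0 and later.

```r
# Hypothetical helper: validates `data` before it reaches a model-fitting call.
# Named arguments to stopifnot() become the error messages (R >= 4.0.0).
check_model_data <- function(data) {
  stopifnot(
    "`data` must be a data frame" = is.data.frame(data),
    "`data` must not contain NA values" = !any(is.na(data))
  )
  invisible(data)
}

data <- mtcars
data[1:2, "hp"] <- NA

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  check_model_data(data),
  error = function(e) conditionMessage(e)
)
msg
```

Unlike the assertive output, this message does not list the positions of the offending values; it only states which prerequisite was violated.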