The Design of Everyday R Functions

Harmonic Analytics

The DESIGN of EVERYDAY THINGS

Gulf of Execution

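01 creating a strong separation between commands and queries
➔ command: a function that performs an action (changes state)
➔ query: a function that returns information without changing state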

# {base} example: conforming to the CQS principle
setwd(dir) # command function
getwd()    # query function

# rather than conditioning the command or query on the input
wd(dir) # if dir variable exists
wd()    # if dir variable doesn’t exist

# {base} example: not conforming to the CQS principle
options(warn = -1) # suppresses warnings globally
options("warn")
$warn
[1] -1
options()
setOption(warn = 0)
Error in setOption(warn = 0) : could not find function "setOption"
getOption("warn")
[1] -1
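The same separation can be sketched for a package's own settings. In the example below, `set_verbosity()` is a command (it changes state and returns nothing useful) and `get_verbosity()` is a query (it returns a value and changes nothing). The function names and the `.settings` environment are assumptions made for illustration; they are not part of base R.

```r
# Hypothetical private state; in a package this would live in the namespace.
.settings <- new.env(parent = emptyenv())
assign("verbosity", 1L, envir = .settings)

# Command: changes state, returns NULL invisibly.
set_verbosity <- function(level) {
  stopifnot(is.numeric(level), length(level) == 1L)
  assign("verbosity", as.integer(level), envir = .settings)
  invisible(NULL)
}

# Query: returns the current value, has no side effects.
get_verbosity <- function() {
  get("verbosity", envir = .settings)
}

set_verbosity(0)
get_verbosity()
```

Because the setter never returns the setting and the getter never modifies it, a reader can tell from the function name alone whether a call is safe to repeat.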

02 data analysis revolves around data. Pass the data to the function first

# tidyverse examples:
stringr::str_replace_all(string, pattern, replacement)
ggplot2::ggplot(data = NULL, mapping = aes(), ...)
dplyr::select(.data, ...)

# non-tidyverse counter examples:
base::gsub(pattern, replacement, x)
stats::lm(formula, data)
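One practical payoff of data-first signatures is that calls chain with the pipe. The sketch below wraps `base::gsub()` in a data-first function; the name `replace_all()` is hypothetical, and the native pipe `|>` assumes R 4.1 or later.

```r
# Hypothetical data-first wrapper around base::gsub().
replace_all <- function(x, pattern, replacement) {
  gsub(pattern, replacement, x)
}

# Data-first functions chain naturally with the native pipe (R >= 4.1).
cleaned <- c("a-b", "c-d") |>
  replace_all("-", "_") |>
  toupper()

cleaned
```

Calling `gsub()` directly in the same chain would need an anonymous function or named arguments, because its data argument `x` comes third.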

03 functions should have as few arguments as possible

UpSetR::upset

upset(data, nsets = 5, nintersects = 40, sets = NULL,
      keep.order = F, set.metadata = NULL, intersections = NULL,
      matrix.color = "gray23", main.bar.color = "gray23",
      mainbar.y.label = "Intersection Size", mainbar.y.max = NULL,
      sets.bar.color = "gray23", sets.x.label = "Set Size",
      point.size = 2.2, line.size = 0.7, mb.ratio = c(0.7, 0.3),
      expression = NULL, att.pos = NULL, att.color = main.bar.color,
      order.by = c("freq", "degree"), decreasing = c(T, F),
      show.numbers = "yes", number.angles = 0, group.by = "degree",
      cutoff = NULL, queries = NULL, query.legend = "none",
      shade.color = "gray88", shade.alpha = 0.25, matrix.dot.alpha = 0.5,
      empty.intersections = NULL, color.pal = 1, boxplot.summary = NULL,
      attribute.plots = NULL, scale.intersections = "identity",
      scale.sets = "identity", text.scale = 1, set_size.angles = 0)

# {ggplot2} conceptual model:
upset(movies) +
  sets_intersections(nsets=5, nintersects=40, sets=NULL) +
  scale_x_discrete(label="Set Size", color="gray23") +
  ...

# {graphics} conceptual model:
par(nsets=5, nintersects=40, sets=NULL) # Command Function
upset(movies)                           # Query Function
title(xlab="Set Size", col.lab="gray23")
...

randomForest::randomForest

## Default S3 method:
randomForest.default(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
    mtry=if (!is.null(y) && !is.factor(y))
         max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
    replace=TRUE, classwt=NULL, cutoff, strata,
    sampsize = if(replace) nrow(x) else ceiling(.632*nrow(x)),
    nodesize = if(!is.null(y) && !is.factor(y)) 5 else 1,
    maxnodes = NULL,
    importance=FALSE, localImp=FALSE, nPerm=1,
    proximity, oob.prox=proximity,
    norm.votes=TRUE, do.trace=FALSE,
    keep.forest=!is.null(y)&&is.null(xtest), corr.bias=FALSE,
    keep.inbag=FALSE)

## S3 method for class 'formula'
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
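Another way to shorten such signatures is to gather tuning parameters into a single control object, in the spirit of `stats::glm.control()`. The sketch below is only an illustration: `forest_control()` and `fit_forest()` are hypothetical names, and `fit_forest()` does not fit a model; it only shows how the arguments would be passed.

```r
# Hypothetical constructor that gathers tuning parameters into one object.
forest_control <- function(ntree = 500, nodesize = 1) {
  stopifnot(ntree >= 1, nodesize >= 1)
  structure(
    list(ntree = ntree, nodesize = nodesize),
    class = "forest_control"
  )
}

# Hypothetical fitting function with three arguments.
# It only records its inputs; no model is actually fitted.
fit_forest <- function(data, formula, control = forest_control()) {
  stopifnot(is.data.frame(data), inherits(control, "forest_control"))
  list(formula = formula, n = nrow(data), control = control)
}

fit <- fit_forest(mtcars, mpg ~ ., forest_control(ntree = 100))
fit$control$ntree
```

The main signature stays short, and the defaults live in one documented place instead of being spread across the argument list.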

04 passing a Boolean to a function means it does more than one thing
➔ one thing when the argument is TRUE
➔ another thing when it is FALSE

# sum() behaves differently for na.rm=TRUE and na.rm=FALSE
sum(c(1, NA), na.rm = TRUE)
[1] 1
sum(c(1, NA), na.rm = FALSE)
[1] NA

# Instead, write two functions:
# one for the TRUE case; and
# one for the FALSE case
sum_without_na <- function(...) sum(..., na.rm = TRUE)
sum_with_na <- function(...) sum(..., na.rm = FALSE)

passing NULL as a pseudo-Boolean is worse than passing a Boolean into a function
➔ NULL case
➔ non-NULL case

## S3 method for class 'formula'
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

Question: How can randomForest fit a model on the data, if the data arg is NULL?

Answer (from the function documentation): `data` an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from

NULL Case: Model `data` in parent.env(environment())
Non-NULL Case: Model `data` in environment()

## Is this better?
randomForest(formula, data, data.env = parent.frame(), ..., subset, na.
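The two cases can be observed with `stats::model.frame()`, whose default method follows the same `data = NULL` convention. This is a small sketch rather than a description of randomForest internals; the stray vectors `mpg` and `hp` are made up for illustration.

```r
# Non-NULL case: variables are looked up in the supplied data frame.
mf_explicit <- model.frame(mpg ~ hp, data = mtcars)

# NULL case: the same formula silently picks up whatever `mpg` and `hp`
# happen to exist in the environment where the formula was created.
mpg <- c(1, 2, 3)
hp  <- c(10, 20, 30)
mf_implicit <- model.frame(mpg ~ hp)

nrow(mf_explicit)  # rows come from mtcars
nrow(mf_implicit)  # rows come from the stray vectors above
```

The call looks the same in both cases, yet the model is built from different data, which is why a required `data` argument is easier to reason about.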

Gulf of Execution: Conclusion

## Original Function
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

## 01 Command-Query Separation
base::options(na.action = "na.fail")
randomForest(formula, data=NULL, ..., subset, na.action=getOption("na.action"))

## 02 Data Come First
randomForest(data=NULL, formula, ..., subset, na.action=getOption("na.action"))

## 03 Three Arguments Max (0 is best)
randomForest(data=NULL, formula, ...)

## 04 No Boolean Arguments Ever (Nor NULL as a pseudo-Boolean)
randomForest(data, formula, ...)

Gulf of Evaluation

05 progress reporting: use base::message(), avoid base::print() and base::cat()
{progress}
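A brief sketch of why `message()` is the better channel: it writes to stderr and signals a condition, so progress text stays out of captured output and callers can silence it. The function `process_steps()` is a hypothetical stand-in for any long-running task.

```r
# Hypothetical long-running function that reports progress with message().
process_steps <- function(n_steps) {
  for (i in seq_len(n_steps)) {
    message("Step ", i, " of ", n_steps)
  }
  invisible(n_steps)
}

# Progress goes to stderr, so it does not pollute captured stdout.
out <- capture.output(process_steps(2))
length(out)

# Callers can silence the progress without touching the function.
suppressMessages(process_steps(2))
```

Had the function used `cat()` or `print()`, the progress lines would end up in `out` and could not be switched off with `suppressMessages()`.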

06 a function should do one thing

## Original Function
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

The ONE thing randomForest does is fit a random forest model on data.

randomForest has a few prerequisites for its data argument:
1. The data argument is supplied;
2. The data is a data frame; and
3. The data frame has no NA values.

The current implementation of randomForest responds to violations by:
1. Searching the parent environment for data;
2. Prompting an error; and
3. Prompting an error / treating NA values (depends on `na.action`).

If randomForest were truly doing one thing, then any assumption violation would prompt an ERROR.

## Original Function
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

## onlyRandomForest does ONE THING, it fits a random forest on the data;
## onlyRandomForest doesn’t handle errors from external sources.
onlyRandomForest <- function(formula, data, ...){
  # both arguments must be supplied explicitly
  stopifnot(isFALSE(missing(formula)), isFALSE(missing(data)))
  # is.data.frame() also accepts subclasses such as tibbles
  stopifnot(is.data.frame(data))
  # missing values are not allowed
  stopifnot(isFALSE(any(is.na(data))))
  randomForest::randomForest(formula, data, ...)
}

## Passing data with NAs to onlyRandomForest prompts an error
data <- mtcars
data[1:2, "hp"] <- NA
onlyRandomForest("mpg ~ .", data)
Error in onlyRandomForest("mpg ~ .", data) : isFALSE(any(is.na(data))) is not TRUE

07 An error message should start with a general statement of the problem, then give a concise description of what went wrong

## onlyRandomForest prompts informative errors with {assertive}
onlyRandomForest <- function(formula, data, ...){
  assertive::assert_is_data.frame(data)
  assertive::assert_all_are_not_na(data)
  randomForest::randomForest(formula, data, ...)
}

## Passing data with NAs to onlyRandomForest prompts an informative error
data <- mtcars
data[1:2, "hp"] <- NA
onlyRandomForest("mpg ~ .", data)
Error in onlyRandomForest("mpg ~ .", data) : is_not_na : The values of data are sometimes NA.
There were 2 failures:
  Position Value   Cause
1       97    NA missing
2       98    NA missing

I hate {assertive}; what's the alternative?
{assertr}
{checkmate}
{tester}
{assertthat}
{testit}
helper-functions
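The helper-functions option can be sketched in base R alone. The hypothetical `assert_no_na()` below follows rule 07: each message opens with the general problem and then names what went wrong. The function name and the message wording are assumptions for illustration.

```r
# Hypothetical helper: general statement first, then the specifics.
assert_no_na <- function(data) {
  if (!is.data.frame(data)) {
    stop("`data` must be a data frame; got an object of class ",
         class(data)[1], ".", call. = FALSE)
  }
  n_missing <- colSums(is.na(data))
  bad <- n_missing[n_missing > 0]
  if (length(bad) > 0) {
    stop("`data` must not contain NA values; found ",
         paste0(names(bad), " (", bad, ")", collapse = ", "), ".",
         call. = FALSE)
  }
  invisible(data)
}

data <- mtcars
data[1:2, "hp"] <- NA

# Capture the error message instead of stopping the script.
result <- tryCatch(assert_no_na(data), error = conditionMessage)
result
```

Here `result` names the offending column and its NA count, which gives the caller more to act on than the bare `stopifnot()` failure.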

We’re done. Questions?

01 Command-Query Separation
02 Data Come First
03 Three Arguments Max
04 No Boolean Arguments Ever
05 Progress Reporting
06 Error Handling
07 Error Reporting

Harel Lustiger
