The Design of Everyday R Functions

This presentation covers major computer programming principles and good practices that help to make R functions intuitive to other R users and our future selves.

Harel Lustiger

September 12, 2019

Transcript

  1. # {base} example: conforming to the CQS principle

    setwd(dir)  # command function
    getwd()     # query function

    # rather than conditioning the command or query on the input
    wd(dir)     # if dir variable exists
    wd()        # if dir variable doesn’t exist
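To make the command/query split on slide 1 concrete, here is a minimal sketch of a pair of functions written in the same spirit as `setwd()` and `getwd()`. The names `.settings`, `set_threshold()` and `get_threshold()` are hypothetical and are not part of the talk or of any package.

```r
# Hypothetical settings store, used only for illustration.
.settings <- new.env()

# Command: changes state and returns nothing informative.
set_threshold <- function(value) {
  stopifnot(is.numeric(value), length(value) == 1)
  assign("threshold", value, envir = .settings)
  invisible(NULL)
}

# Query: reports state and changes nothing.
get_threshold <- function() {
  get("threshold", envir = .settings)
}

set_threshold(0.5)
get_threshold()
```

Because each function has a single role, a reader can tell from the name alone whether a call has side effects.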
  2. # {base} example: not conforming to the CQS principle

    options(warn = -1)  # suppresses warnings globally
    options("warn")
    $warn
    [1] -1
    options()
    setOption(warn = 0)
    Error in setOption(warn = 0) : could not find function "setOption"
    getOption("warn")
    [1] -1
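The mixed role of `options()` on slide 2 can be seen directly: when it sets a value it also returns the previous one, invisibly. The sketch below shows that behavior and then one possible way to separate the two roles. The wrappers `set_option()` and `get_option()` are hypothetical names introduced here; base R provides only `options()` and `getOption()`. The `digits` option is used instead of `warn` so the example does not silence warnings.

```r
# options() acts as a command and a query in a single call:
old <- options(digits = 4)  # sets `digits` AND returns the previous value
old$digits                  # the value before the change
getOption("digits")         # the value after the change

# Hypothetical CQS-style wrappers around the same mechanism.
set_option <- function(...) {
  options(...)
  invisible(NULL)           # command: no informative return value
}
get_option <- function(name) {
  getOption(name)           # query: no side effects
}

set_option(digits = old$digits)  # restore the original setting
get_option("digits")
```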
  3. # tidyverse examples:

    stringr::str_replace_all(string, pattern, replacement)
    ggplot2::ggplot(data = NULL, mapping = aes(), ...)
    dplyr::select(.data, ...)

    # non-tidyverse counter examples:
    base::gsub(pattern, replacement, x)
    stats::lm(formula, data)
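One common motivation for putting the data argument first, as in the tidyverse signatures on slide 3, is that such functions chain with a pipe. The sketch below assumes R 4.1 or later for the native pipe `|>`. The wrapper `replace_all()` is a hypothetical name used only to illustrate the idea.

```r
# Hypothetical data-first wrapper around base::gsub().
replace_all <- function(x, pattern, replacement) {
  gsub(pattern, replacement, x)
}

words <- c("a-b", "c-d")

# Data-first functions chain naturally with the native pipe.
result <- words |>
  replace_all("-", "_") |>
  toupper()

# With base::gsub(), the other arguments must be named so that the
# piped value lands in `x`, which is the third parameter.
same <- words |>
  gsub(pattern = "-", replacement = "_") |>
  toupper()

identical(result, same)
```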
  4. UpSetR::upset

    upset(data, nsets = 5, nintersects = 40, sets = NULL,
          keep.order = F, set.metadata = NULL, intersections = NULL,
          matrix.color = "gray23", main.bar.color = "gray23",
          mainbar.y.label = "Intersection Size", mainbar.y.max = NULL,
          sets.bar.color = "gray23", sets.x.label = "Set Size",
          point.size = 2.2, line.size = 0.7, mb.ratio = c(0.7, 0.3),
          expression = NULL, att.pos = NULL, att.color = main.bar.color,
          order.by = c("freq", "degree"), decreasing = c(T, F),
          show.numbers = "yes", number.angles = 0, group.by = "degree",
          cutoff = NULL, queries = NULL, query.legend = "none",
          shade.color = "gray88", shade.alpha = 0.25,
          matrix.dot.alpha = 0.5, empty.intersections = NULL,
          color.pal = 1, boxplot.summary = NULL, attribute.plots = NULL,
          scale.intersections = "identity", scale.sets = "identity",
          text.scale = 1, set_size.angles = 0)
  5. # {ggplot2} conceptual model:

    upset(movies) +
      sets_intersections(nsets=5, nintersects=40, sets=NULL) +
      scale_x_discrete(label="Set Size", color="gray23") +
      ...

    # {graphics} conceptual model:
    par(nsets=5, nintersects=40, sets=NULL)  # Command Function
    upset(movies)                            # Query Function
    title(xlab="Set Size", col.lab="gray23")
    ...
  6. UpSetR::upset, randomForest::randomForest

    ## Default S3 method:
    randomForest.default(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
        mtry=if (!is.null(y) && !is.factor(y))
               max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
        replace=TRUE, classwt=NULL, cutoff, strata,
        sampsize = if(replace) nrow(x) else ceiling(.632*nrow(x)),
        nodesize = if(!is.null(y) && !is.factor(y)) 5 else 1,
        maxnodes = NULL, importance=FALSE, localImp=FALSE, nPerm=1,
        proximity, oob.prox=proximity, norm.votes=TRUE, do.trace=FALSE,
        keep.forest=!is.null(y)&&is.null(xtest), corr.bias=FALSE,
        keep.inbag=FALSE)

    ## S3 method for class 'formula'
    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
  7. 04: passing a Boolean to a function means it does more than one thing (TRUE ➔ one behavior, FALSE ➔ another)
  8. # sum() behaves differently for na.rm=TRUE and na.rm=FALSE

    sum(c(1, NA), na.rm = TRUE)
    [1] 1
    sum(c(1, NA), na.rm = FALSE)
    [1] NA

    # Instead, write two functions:
    # one for the TRUE case; and
    # one for the FALSE case
    sum_without_na <- function(...) sum(..., na.rm = TRUE)
    sum_with_na <- function(...) sum(..., na.rm = FALSE)
  9. Passing NULL as a pseudo-Boolean is worse than passing a Boolean into a function (NULL ➔ one behavior, non-NULL ➔ another)
  10. ## S3 method for class 'formula'

    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

    Question: How can randomForest fit a model on the data, if the data arg is NULL?

    Answer (from the function documentation): `data` an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from

    NULL Case: Model `data` in parent.env(environment())
    Non-NULL Case: Model `data` in environment()

    ## Is this better?
    randomForest(formula, data, data.env = parent.frame(), ..., subset, na.
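The two lookup behaviors described on slide 10 can be observed without installing randomForest by using `stats::model.frame()`, which also accepts an optional `data` argument. This is a stand-in chosen for the sketch; that randomForest's formula method behaves analogously is assumed from the documentation quoted on the slide, not verified here.

```r
# Variables living in the calling environment, not in a data frame.
mpg <- mtcars$mpg
wt  <- mtcars$wt

# Case 1: `data` omitted, so the variables are looked up in the
# environment of the formula (here, the calling environment).
mf_env <- model.frame(mpg ~ wt)

# Case 2: `data` supplied, so the variables are looked up in the data frame.
mf_data <- model.frame(mpg ~ wt, data = mtcars)

# The same function follows two lookup paths, selected only by
# whether `data` was supplied.
all.equal(mf_env$mpg, mf_data$mpg)
```

A caller who forgets `data` gets no error, only a silent switch to whatever objects named `mpg` and `wt` happen to exist in the calling environment.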
  11. Gulf of Execution: Conclusion

    ## Original Function
    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

    ## 01 Command-Query Separation
    base::options(na.action = "na.fail")
    randomForest(formula, data=NULL, ..., subset, na.action=getOption("na.action"))

    ## 02 Data Come First
    randomForest(data=NULL, formula, ..., subset, na.action=getOption("na.action"))

    ## 03 Three Arguments Max (0 is best)
    randomForest(data=NULL, formula, ...)

    ## 04 No Boolean Arguments Ever (Nor NULL as a pseudo-Boolean)
    randomForest(data, formula, ...)
  12. 05

  13. ## Original Function

    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

    The ONE thing randomForest does is fit a random forest model on data.

    randomForest has a few prerequisites for its data argument:
    1. The data argument is supplied;
    2. The data is a data frame; and
    3. The data frame has no NA values.

    The current implementation of randomForest responds to violations of these prerequisites by:
    1. Searching the parent environment for data;
    2. Prompting an error; and
    3. Prompting an error or treating NA values (depending on `na.action`).

    If randomForest were truly doing one thing, then any assumption violation should prompt an ERROR.
  14. ## Original Function

    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

    ## onlyRandomForest does ONE THING, it fits a random forest on the data;
    ## onlyRandomForest doesn’t handle errors from external sources.
    onlyRandomForest <- function(formula, data, ...){
      stopifnot(isFALSE(missing(formula)), isFALSE(missing(data)))
      # is.data.frame() also accepts subclasses such as tibbles
      stopifnot(is.data.frame(data))
      stopifnot(isFALSE(any(is.na(data))))
      randomForest::randomForest(formula, data, ...)
    }

    ## Passing data with NAs to onlyRandomForest prompts an error
    data <- mtcars
    data[1:2, "hp"] <- NA
    onlyRandomForest("mpg ~ .", data)
    Error in onlyRandomForest("mpg ~ .", data) : isFALSE(any(is.na(data))) is not TRUE
  15. 07 An error message should start with a general statement of the problem, then give a concise description of what went wrong
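A minimal sketch of the guideline on slide 15, using only base R: the message opens with a general statement of the problem and follows with a concise description of what went wrong. The function name `assert_no_na()` and the wording of the message are hypothetical choices made for this illustration.

```r
# Hypothetical assertion helper, used only for illustration.
assert_no_na <- function(data) {
  has_na <- vapply(data, anyNA, logical(1))
  if (any(has_na)) {
    stop(
      # General statement of the problem first...
      "Invalid input: `data` must not contain missing values.\n",
      # ...then a concise description of what went wrong.
      "Found ", sum(is.na(data)), " NA value(s) in column(s): ",
      paste(names(data)[has_na], collapse = ", "), ".",
      call. = FALSE
    )
  }
  invisible(data)
}

data <- mtcars
data[1:2, "hp"] <- NA

# Capture the message instead of stopping, so it can be inspected.
msg <- tryCatch(assert_no_na(data), error = conditionMessage)
cat(msg, "\n")
```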
  16. ## onlyRandomForest prompts informative errors with {assertive}

    onlyRandomForest <- function(formula, data, ...){
      assertive::assert_is_data.frame(data)
      assertive::assert_all_are_not_na(data)
      randomForest::randomForest(formula, data, ...)
    }

    ## Passing data with NAs to onlyRandomForest prompts an informative error
    data <- mtcars
    data[1:2, "hp"] <- NA
    onlyRandomForest("mpg ~ .", data)
    Error in onlyRandomForest("mpg ~ .", data) :
      is_not_na : The values of data are sometimes NA.
    There were 2 failures:
      Position Value   Cause
    1       97    NA missing
    2       98    NA missing
  17. We’re done. Questions?

    01 Command-Query Separation
    02 Data Come First
    03 Three Arguments Max
    04 No Boolean Arguments Ever
    05 Progress Reporting
    06 Error Handling
    07 Error Reporting

    Harel Lustiger