R User Friendly Functions

Harmonic Analytics

“I want to have a solid mental model of the
problem before implementing anything”


upset(data, nsets = 5, nintersects = 40, sets = NULL,
      keep.order = F, set.metadata = NULL, intersections = NULL,
      matrix.color = "gray23", main.bar.color = "gray23",
      mainbar.y.label = "Intersection Size", mainbar.y.max = NULL,
      sets.bar.color = "gray23", sets.x.label = "Set Size",
      point.size = 2.2, line.size = 0.7, mb.ratio = c(0.7, 0.3),
      expression = NULL, att.pos = NULL, att.color = main.bar.color,
      order.by = c("freq", "degree"), decreasing = c(T, F),
      show.numbers = "yes", number.angles = 0, group.by = "degree",
      cutoff = NULL, queries = NULL, query.legend = "none",
      shade.color = "gray88", shade.alpha = 0.25, matrix.dot.alpha = 0.5,
      empty.intersections = NULL, color.pal = 1, boxplot.summary = NULL,
      attribute.plots = NULL, scale.intersections = "identity",
      scale.sets = "identity", text.scale = 1, set_size.angles = 0)

Lex, A., Gehlenborg, N., Strobelt, H., Vuillemot, R., & Pfister, H. (2014). UpSet: Visualization of Intersecting Sets. IEEE Transactions on Visualization and Computer Graphics, 20(12), 1983–1992. https://doi.org/10.1109/TVCG.2014.2346248

Figure labels: Combination Matrix, Set Menu, Set View

# {ggplot2} mental model:
ggplot(movies) +
  geom_upset(empty.intersections = FALSE) +
  upset_view(label = "Intersection Size", fill = "gray23") +
  upset_matrix(colour = "gray88", fill = "gray23") +
  upset_menu(label = "Set Size")

# {graphics} mental model:
upset_object <- upset(movies)
par(nsets = 5, nintersects = 40, sets = NULL)
?par
par()
plot(upset_object)
title(xlab = "Set Size", col.lab = "gray23")
...
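The {graphics} mental model above separates building an object from drawing and annotating it. A minimal sketch of that separation in base R might look like the following. The `set_sizes` class and the `new_set_sizes()` constructor are hypothetical names used only for illustration; they are not part of UpSetR.

```r
# Hypothetical S3 class that stores precomputed set sizes (not an UpSetR API).
new_set_sizes <- function(sets) {
  structure(list(sizes = lengths(sets)), class = "set_sizes")
}

# The plot method only draws; labels are added afterwards with title().
plot.set_sizes <- function(x, ...) {
  barplot(x$sizes, ...)
  invisible(x)
}

obj <- new_set_sizes(list(A = 1:3, B = 2:6, C = 4:5))

pdf(NULL)                               # null device, so no file is written
plot(obj, col = "gray23")               # step 1: draw the object
title(xlab = "Set", ylab = "Set Size")  # step 2: annotate separately
invisible(dev.off())
```

Each call has a small, predictable job, which is the property the long `upset()` signature lacks.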

## Default S3 method:
randomForest.default(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
    mtry=if (!is.null(y) && !is.factor(y))
        max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
    replace=TRUE, classwt=NULL, cutoff, strata,
    sampsize = if(replace) nrow(x) else ceiling(.632*nrow(x)),
    nodesize = if(!is.null(y) && !is.factor(y)) 5 else 1,
    maxnodes = NULL, importance=FALSE, localImp=FALSE, nPerm=1,
    proximity, oob.prox=proximity, norm.votes=TRUE, do.trace=FALSE,
    keep.forest=!is.null(y)&&is.null(xtest), corr.bias=FALSE,
    keep.inbag=FALSE)

## S3 method for class 'formula'
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

➔ TRUE FALSE ➔

# sum() behaves differently for na.rm=TRUE and na.rm=FALSE
sum(c(1, NA), na.rm = TRUE)
[1] 1
sum(c(1, NA), na.rm = FALSE)
[1] NA

# Instead, write two functions:
# one for the TRUE case; and
# one for the FALSE case
sum_without_na <- function(...) sum(..., na.rm = TRUE)
sum_with_na <- function(...) sum(..., na.rm = FALSE)
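The same split can be applied to other base functions that take a Boolean flag. As a sketch, here is the pattern applied to `sort()` and its `decreasing` argument; the wrapper names `sort_ascending` and `sort_descending` are assumptions made up for this example.

```r
# Hypothetical wrappers: one function per value of the Boolean flag.
sort_ascending  <- function(x) sort(x, decreasing = FALSE)
sort_descending <- function(x) sort(x, decreasing = TRUE)

scores <- c(3, 1, 2)
sort_ascending(scores)   # 1 2 3
sort_descending(scores)  # 3 2 1
```

At the call site the function name now states the behaviour, so the reader does not have to look up what `TRUE` means for that argument.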

➔ NULL ➔ non-NULL

## S3 method for class 'formula'
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

Question: How can randomForest fit a model on the data if the data argument is NULL?

Answer (from the function documentation):
`data`: an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from.

NULL case: model the `data` found in the calling environment, parent.frame()
Non-NULL case: model the `data` found in the function's own environment, environment()

## Is this better?
randomForest(formula, data, data.env = parent.frame(), ..., subset, na.
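The implicit lookup can be seen without installing randomForest. The sketch below uses `stats::lm()` as a stand-in, on the assumption that it follows the same formula/data convention; `fit_in_caller()` is a hypothetical helper written only for this illustration.

```r
# Stand-in: stats::lm() accepts a formula with an optional `data` argument,
# like the formula method of randomForest().
fit_in_caller <- function() {
  x <- c(2, 4, 6, 8)
  y <- c(1, 2, 3, 4)
  # No `data` argument: x and y are found through the formula's environment,
  # which here is the frame of fit_in_caller().
  lm(y ~ x)
}

# Explicit version: everything the model needs is passed in `data`.
explicit <- lm(y ~ x, data = data.frame(x = c(2, 4, 6, 8), y = c(1, 2, 3, 4)))
implicit <- fit_in_caller()

all.equal(coef(implicit), coef(explicit))
```

Both calls fit the same model, but only the explicit one shows the reader where the variables come from.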


## Original Function
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

randomForest's ONE thing is fitting a random forest model on data.

It’s easy to make the case that randomForest is doing three things:
1. It searches for data in different environments;
2. It treats NA values in data (in accordance with na.action); and
3. It fits a random forest model on data.

If randomForest were truly doing one thing, the absence of clean data would have prompted an ERROR.

## Original Function
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

## onlyRandomForest does ONE THING: it fits a random forest on data;
## onlyRandomForest doesn’t handle errors from external sources.
onlyRandomForest <- function(formula, data, ...){
  # Both arguments are mandatory
  stopifnot(!missing(formula), !missing(data))
  # is.data.frame() also accepts subclasses such as tibbles
  stopifnot(is.data.frame(data))
  # Missing values must be handled before calling this function
  stopifnot(!anyNA(data))
  randomForest::randomForest(formula, data, ...)
}

tidyr::drop_na's ONE thing is dropping rows with missing data.

Instead of defining how a function should handle missing values internally, tidyr::drop_na can be used to handle missing values externally.

The original function
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

becomes
data <- tidyr::drop_na(data)
randomForest(formula, data, ..., subset)
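The "clean externally, then fit" flow can be sketched end to end in base R. Two assumptions apply here: `complete.cases()` is used as a base-R analogue of `tidyr::drop_na()`, and `only_lm()` is a hypothetical strict fitter built on `stats::lm()` as a stand-in for a random forest.

```r
# Hypothetical strict fitter: it fits a model and does nothing else.
only_lm <- function(formula, data) {
  stopifnot(is.data.frame(data), !anyNA(data))
  lm(formula, data = data)
}

raw <- data.frame(x = c(1, 2, NA, 4, 5), y = c(2, 4, 6, NA, 10))

# Step 1 (external): drop incomplete rows before modelling.
clean <- raw[complete.cases(raw), , drop = FALSE]

# Step 2: fit on clean data only.
fit <- only_lm(y ~ x, clean)

# Passing the raw data is an error rather than a silent fix.
rejected <- inherits(try(only_lm(y ~ x, raw), silent = TRUE), "try-error")
```

Because the missing-value policy lives outside the fitter, it can be swapped (drop, impute, fail) without touching the modelling function.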

## Original Function
randomForest.default(x, y=NULL, xtest=NULL, ytest=NULL, +20 more args)

## 01 Three Arguments Max (0 is best)
randomForest(formula, data=NULL, ..., na.action=na.fail)

## 02 No Boolean Arguments Ever (Nor NULL as a pseudo-Boolean)
randomForest(formula, data, ..., na.action=na.fail)

## 03 Do One Thing (Either handle errors, or do something else)
randomForest(formula, data, ...)


Harel Lustiger
