R User Friendly Functions

This presentation was given at the New Zealand Statistical Conference (NZSA), November 2019.

Harel Lustiger

November 28, 2019

Transcript

  1. “I want to have a solid mental model of the problem before implementing anything”
  2. upset(data, nsets = 5, nintersects = 40, sets = NULL, keep.order = F,
        set.metadata = NULL, intersections = NULL, matrix.color = "gray23",
        main.bar.color = "gray23", mainbar.y.label = "Intersection Size",
        mainbar.y.max = NULL, sets.bar.color = "gray23",
        sets.x.label = "Set Size", point.size = 2.2, line.size = 0.7,
        mb.ratio = c(0.7, 0.3), expression = NULL, att.pos = NULL,
        att.color = main.bar.color, order.by = c("freq", "degree"),
        decreasing = c(T, F), show.numbers = "yes", number.angles = 0,
        group.by = "degree", cutoff = NULL, queries = NULL,
        query.legend = "none", shade.color = "gray88", shade.alpha = 0.25,
        matrix.dot.alpha = 0.5, empty.intersections = NULL, color.pal = 1,
        boxplot.summary = NULL, attribute.plots = NULL,
        scale.intersections = "identity", scale.sets = "identity",
        text.scale = 1, set_size.angles = 0)
  3. Lex, A., Gehlenborg, N., Strobelt, H., Vuillemot, R., & Pfister, H. (2014). UpSet: Visualization of Intersecting Sets. IEEE Trans Vis Comput Graph, 20(12), 1983–1992. https://doi.org/10.1109/TVCG.2014.2346248
    [Figure labels: Combination Matrix, Set Menu, Set View]
  4. # {ggplot2} mental model:
    ggplot(movies) +
      geom_upset(empty.intersections = FALSE) +
      upset_view(label = "Intersection Size", fill = "gray23") +
      upset_matrix(colour = "gray88", fill = "gray23") +
      upset_menu(label = "Set Size")
  5. # {graphics} mental model:
    upset_object <- upset(movies)
    par(nsets = 5, nintersects = 40, sets = NULL)
    ?par
    par()
    plot(upset_object)
    title(xlab = "Set Size", col.lab = "gray23")
    ...
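
The {graphics} workflow the slide appeals to can be tried with functions that already exist in base R. The following is a minimal sketch, assuming the built-in `cars` data set as a stand-in for the slide's hypothetical `upset_object`: graphical parameters are set with par(), the data are drawn with plot(), and the label is added afterwards with title().

```r
# Base-R illustration of the {graphics} workflow: par() -> plot() -> title().
# `cars` is a built-in data set used as a stand-in; it is not an upset object.
grDevices::pdf(NULL)                        # null device: nothing is written to disk
old_par <- par(mar = c(5, 4, 2, 1))         # par() returns the previous settings
plot(cars, ann = FALSE)                     # draw the points without axis labels
title(xlab = "Speed", col.lab = "gray23")   # add the label in a separate step
par(old_par)                                # restore the previous settings
invisible(grDevices::dev.off())             # close the device
```

Each step takes only a few arguments because styling lives in par() and title() rather than in the plotting function itself.
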
  6. ## Default S3 method:
    randomForest.default(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
        mtry=if (!is.null(y) && !is.factor(y))
            max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
        replace=TRUE, classwt=NULL, cutoff, strata,
        sampsize = if(replace) nrow(x) else ceiling(.632*nrow(x)),
        nodesize = if(!is.null(y) && !is.factor(y)) 5 else 1,
        maxnodes = NULL, importance=FALSE, localImp=FALSE, nPerm=1,
        proximity, oob.prox=proximity, norm.votes=TRUE, do.trace=FALSE,
        keep.forest=!is.null(y)&&is.null(xtest), corr.bias=FALSE,
        keep.inbag=FALSE)

    ## S3 method for class 'formula'
    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
  7. # sum() behaves differently for na.rm=TRUE and na.rm=FALSE
    sum(c(1, NA), na.rm = TRUE)
    [1] 1
    sum(c(1, NA), na.rm = FALSE)
    [1] NA

    # Instead, write two functions:
    # one for the TRUE case; and
    # one for the FALSE case
    sum_without_na <- function(...) sum(..., na.rm = TRUE)
    sum_with_na <- function(...) sum(..., na.rm = FALSE)
  8. ## S3 method for class 'formula'
    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

    Question: How can randomForest fit a model on the data if the data argument is NULL?

    Answer (from the function documentation): `data` is "an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from".

    NULL case: the model `data` is in parent.frame(), the environment randomForest is called from.
    Non-NULL case: the model `data` is in environment(), the function's own environment.

    ## Is this better?
    randomForest(formula, data, data.env = parent.frame(), ..., subset, na.
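
The two lookup cases can be observed directly in base R. The sketch below uses stats::model.frame() as a stand-in for the lookup step inside a formula interface; it is an assumption that this mirrors what randomForest does internally. In base R's modelling functions, variables not found in `data` are taken from the formula's environment, which is usually the environment the call was made from.

```r
# How a formula interface finds variables when `data` is not supplied.
# stats::model.frame() is used only as a stand-in for the lookup step.
x <- c(1, 2, 3)
y <- c(2, 4, 6)

# Case 1: no `data`, so x and y are taken from the formula's environment,
# which here is the environment the formula was written in.
mf_env <- model.frame(y ~ x)

# Case 2: `data` supplied, so the columns of `df` take precedence.
df <- data.frame(x = c(10, 20, 30), y = c(1, 1, 1))
mf_data <- model.frame(y ~ x, data = df)

mf_env$x   # values from the surrounding environment
mf_data$x  # values from `df`
```

The same call silently returns different data depending on what happens to be defined around it, which is the hidden behaviour the slide questions.
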
  9. ## Original Function
    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

    randomForest's ONE thing is fitting a random forest model on data. Yet it is easy to make the case that randomForest is doing three things:
    1. It searches for data in different environments;
    2. It treats NA values in data (in accordance with na.action); and
    3. It fits a random forest model on data.
    If randomForest were truly doing one thing, the absence of clean data would have prompted an ERROR.
  10. ## Original Function
    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

    ## onlyRandomForest does ONE THING: it fits a random forest on data.
    ## onlyRandomForest doesn’t handle errors from external sources.
    onlyRandomForest <- function(formula, data, ...) {
      # Both arguments are required; there is no fallback lookup for `data`
      stopifnot(isFALSE(missing(formula)), isFALSE(missing(data)))
      # inherits() also accepts subclasses of data.frame, such as tibbles
      stopifnot(inherits(data, "data.frame"))
      # Missing values must be dealt with by the caller
      stopifnot(isFALSE(any(is.na(data))))
      randomForest::randomForest(formula, data, ...)
    }
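
To see the fail-fast behaviour without installing {randomForest}, the same checks can be wrapped around stats::lm(). This is only a sketch: `only_lm`, `clean` and `dirty` are hypothetical names introduced for illustration, and lm() stands in for the random forest fit.

```r
# `only_lm` is a hypothetical stand-in that mirrors the slide's onlyRandomForest
# but wraps stats::lm(), so the sketch runs with base R alone.
only_lm <- function(formula, data, ...) {
  stopifnot(!missing(formula), !missing(data))  # both arguments are required
  stopifnot(is.data.frame(data))                # data must be a data frame
  stopifnot(!any(is.na(data)))                  # data must already be clean
  stats::lm(formula, data = data, ...)
}

clean <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))
dirty <- data.frame(x = c(1, 2, NA, 4), y = c(2.1, 3.9, 6.2, 7.8))

fit <- only_lm(y ~ x, clean)  # clean data: the model is fitted

# Dirty data: the function refuses to guess and signals an error instead.
result <- tryCatch(only_lm(y ~ x, dirty), error = function(e) "error")
```

The caller learns immediately that the input is not clean, rather than receiving a model fitted on silently altered data.
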
  11. tidyr::drop_na's ONE thing is dropping rows with missing data.

    Instead of defining how a function should handle missing values internally, tidyr::drop_na can be used to handle missing values externally.

    The original function
    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

    Becomes
    data <- tidyr::drop_na(data)
    randomForest(formula, data, ..., subset)
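
The same two-step pattern can be sketched with base R alone. Here stats::na.omit() is swapped in for tidyr::drop_na() and stats::lm() for randomForest(), purely so the example runs without extra packages; `raw` and `cleaned` are hypothetical names.

```r
# Missing values are handled by the caller, before the model function is called.
# stats::na.omit() is used as a base-R stand-in for tidyr::drop_na().
raw <- data.frame(x = c(1, 2, NA, 4, 5), y = c(2.0, 4.1, 6.2, NA, 9.9))

cleaned <- stats::na.omit(raw)            # step 1: drop rows with any missing value
fit <- stats::lm(y ~ x, data = cleaned)   # step 2: fit on complete rows only

nrow(raw)      # 5 rows before cleaning
nrow(cleaned)  # 3 rows after cleaning
```

Because the cleaning step is explicit, the number of rows dropped is visible to the caller instead of being decided inside the modelling function.
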
  12. ## Original Function
    randomForest.default(x, y=NULL, xtest=NULL, ytest=NULL, +20 more args)

    ## 01 Three Arguments Max (0 is best)
    randomForest(formula, data=NULL, ..., na.action=na.fail)

    ## 02 No Boolean Arguments Ever (Nor NULL as a pseudo-Boolean)
    randomForest(formula, data, ..., na.action=na.fail)

    ## 03 Do One Thing (Either handle errors, or do something else)
    randomForest(formula, data, ...)