
R User Friendly Functions


This presentation was given at the New Zealand Statistical Association (NZSA) conference, November 2019.


Harel Lustiger

November 28, 2019


Transcript

  1. “I want to have a solid mental model of the problem before implementing anything”
  2. upset(data, nsets = 5, nintersects = 40, sets = NULL,
          keep.order = F, set.metadata = NULL, intersections = NULL,
          matrix.color = "gray23", main.bar.color = "gray23",
          mainbar.y.label = "Intersection Size", mainbar.y.max = NULL,
          sets.bar.color = "gray23", sets.x.label = "Set Size",
          point.size = 2.2, line.size = 0.7, mb.ratio = c(0.7, 0.3),
          expression = NULL, att.pos = NULL, att.color = main.bar.color,
          order.by = c("freq", "degree"), decreasing = c(T, F),
          show.numbers = "yes", number.angles = 0, group.by = "degree",
          cutoff = NULL, queries = NULL, query.legend = "none",
          shade.color = "gray88", shade.alpha = 0.25, matrix.dot.alpha = 0.5,
          empty.intersections = NULL, color.pal = 1, boxplot.summary = NULL,
          attribute.plots = NULL, scale.intersections = "identity",
          scale.sets = "identity", text.scale = 1, set_size.angles = 0)
  3. Lex, A., Gehlenborg, N., Strobelt, H., Vuillemot, R., & Pfister, H. (2014). UpSet: Visualization of Intersecting Sets. IEEE Transactions on Visualization and Computer Graphics, 20(12), 1983–1992. https://doi.org/10.1109/TVCG.2014.2346248

    Plot components labelled on the slide: Combination Matrix, Set Menu, Set View
  4. # {ggplot2} mental model:
    ggplot(movies) +
      geom_upset(empty.intersections = FALSE) +
      upset_view(label = "Intersection Size", fill = "gray23") +
      upset_matrix(colour = "gray88", fill = "gray23") +
      upset_menu(label = "Set Size")
  5. # {graphics} mental model:
    upset_object <- upset(movies)
    par(nsets = 5, nintersects = 40, sets = NULL)
    ?par
    par()
    plot(upset_object)
    title(xlab = "Set Size", col.lab = "gray23")
    ...
  6. ## Default S3 method:
    randomForest.default(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
        mtry=if (!is.null(y) && !is.factor(y))
            max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
        replace=TRUE, classwt=NULL, cutoff, strata,
        sampsize = if(replace) nrow(x) else ceiling(.632*nrow(x)),
        nodesize = if(!is.null(y) && !is.factor(y)) 5 else 1,
        maxnodes = NULL, importance=FALSE, localImp=FALSE, nPerm=1,
        proximity, oob.prox=proximity, norm.votes=TRUE, do.trace=FALSE,
        keep.forest=!is.null(y)&&is.null(xtest), corr.bias=FALSE,
        keep.inbag=FALSE)

    ## S3 method for class 'formula'
    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
  7. # sum() behaves differently for na.rm=TRUE and na.rm=FALSE
    sum(c(1, NA), na.rm = TRUE)
    [1] 1
    sum(c(1, NA), na.rm = FALSE)
    [1] NA

    # Instead, write two functions:
    # one for the TRUE case; and
    # one for the FALSE case
    sum_without_na <- function(...) sum(..., na.rm = TRUE)
    sum_with_na <- function(...) sum(..., na.rm = FALSE)
  8. ## S3 method for class 'formula'
    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

    Question: How can randomForest fit a model on the data if the data arg is NULL?

    Answer (from the function documentation): `data` an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from.

    NULL Case: Model `data` in parent.env(environment())
    Non-NULL Case: Model `data` in environment()

    ## Is this better?
    randomForest(formula, data, data.env = parent.frame(), ..., subset, na.
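The lookup described in the answer above can be observed with any formula interface. The following is a minimal sketch, assuming `stats::lm` as a stand-in for `randomForest` so that it runs with base R only. When `data` is omitted, the variables named in the formula are found in the environment where the formula was created; when `data` is supplied, they are taken from the data frame. The variable names `x`, `y`, and `observations` are illustrative.

```r
# Assumption: stats::lm stands in for randomForest; both accept a formula
# and an optional `data` argument.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# No `data` argument: x and y are found in the environment where the
# formula was created (here, the calling environment).
fit_from_env <- lm(y ~ x)

# Explicit `data` argument: x and y are taken from the data frame.
observations <- data.frame(x = x, y = y)
fit_from_data <- lm(y ~ x, data = observations)

# Both calls see the same values, so the coefficients agree.
same_fit <- isTRUE(all.equal(coef(fit_from_env), coef(fit_from_data)))
print(same_fit)
```

The two calls give the same model, but only the second makes the source of the data visible at the call site, which is the point the slide raises.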
  9. ## Original Function
    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

    randomForest's ONE thing is fitting a random forest model on data.

    It’s easy to make the case that randomForest is doing three things:
      1. It searches for data in different environments;
      2. It treats NA values in data (in accordance with na.action); and
      3. It fits a random forest model on data.

    If randomForest were truly doing one thing, the absence of clean data would have prompted an ERROR.
  10. ## Original Function
    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

    ## onlyRandomForest does ONE THING: it fits a random forest on data;
    ## onlyRandomForest doesn’t handle errors from external sources.
    onlyRandomForest <- function(formula, data, ...){
      # Both arguments must be supplied by the caller
      stopifnot(!missing(formula), !missing(data))
      # is.data.frame() also accepts subclasses such as tibbles, whereas
      # class(data) %in% "data.frame" fails when class(data) has several entries
      stopifnot(is.data.frame(data))
      # Missing values are the caller's responsibility
      stopifnot(!any(is.na(data)))
      randomForest::randomForest(formula, data, ...)
    }
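To illustrate how such a function behaves at the call site, here is a minimal sketch of the same guard-clause pattern. It assumes `stats::lm` in place of `randomForest::randomForest` so that it runs without extra packages; `only_fit`, `clean`, and `dirty` are hypothetical names introduced for this example.

```r
# Hypothetical helper: the same guard clauses as onlyRandomForest, with
# stats::lm standing in for randomForest::randomForest.
only_fit <- function(formula, data, ...) {
  stopifnot(!missing(formula), !missing(data))
  stopifnot(is.data.frame(data))
  stopifnot(!any(is.na(data)))
  stats::lm(formula, data = data, ...)
}

clean <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))
dirty <- data.frame(x = c(1, NA, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))

# Clean data: the model is fitted.
fit <- only_fit(y ~ x, clean)

# Data with a missing value: the guard raises an error, which is captured
# here so the script can continue.
result <- tryCatch(only_fit(y ~ x, dirty), error = function(e) e)
print(inherits(result, "error"))
```

The function never decides what to do about missing values; it only refuses to proceed, leaving that decision to the caller.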
  11. tidyr::drop_na's ONE thing is dropping rows with missing data.

    Instead of defining how a function should handle missing values internally, tidyr::drop_na can be used to handle missing values externally.

    The original function
    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

    becomes
    data <- tidyr::drop_na(data)
    randomForest(formula, data, ..., subset)
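Handling missing values before the fitting call can be sketched as follows. This is an illustration under stated assumptions: `stats::na.omit` stands in for `tidyr::drop_na` and `stats::lm` stands in for `randomForest`, so the example needs no extra packages. The `airquality` data set ships with R's datasets package and contains missing values.

```r
# Assumption: stats::na.omit stands in for tidyr::drop_na; both drop rows
# that contain missing values.
clean_air <- na.omit(airquality)

# The fitting call needs no na.action argument because the data is
# already free of missing values.
fit <- lm(Ozone ~ Wind + Temp, data = clean_air)

# Number of rows removed by the external cleaning step.
rows_dropped <- nrow(airquality) - nrow(clean_air)
print(rows_dropped)
```

Because cleaning is a separate step, the number of dropped rows is available for inspection before any model is fitted.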
  12. ## Original Function
    randomForest.default(x, y=NULL, xtest=NULL, ytest=NULL, +20 more args)

    ## 01 Three Arguments Max (0 is best)
    randomForest(formula, data=NULL, ..., na.action=na.fail)

    ## 02 No Boolean Arguments Ever (Nor NULL as a pseudo-Boolean)
    randomForest(formula, data, ..., na.action=na.fail)

    ## 03 Do One Thing (Either handle errors, or do something else)
    randomForest(formula, data, ...)