
UBC STAT545 Split Apply Combine Intro

Lecture slides from UBC STAT545 2015.
Not a stand-alone document.
Introduction to split-apply-combine aka data aggregation.
http://stat545-ubc.github.io/index.html

Jennifer (Jenny) Bryan

October 13, 2015

Transcript

  1. STAT 545A
    Split-Apply-Combine
    aka
    Data Aggregation


  2. Dr. Jennifer (Jenny) Bryan
    Department of Statistics and Michael Smith Laboratories
    University of British Columbia
    [email protected]
    https://github.com/jennybc
    http://www.stat.ubc.ca/~jenny/
    @JennyBryan ← personal, professional Twitter
    https://github.com/STAT545-UBC
    http://stat545-ubc.github.io
    @STAT545 ← Twitter as lead instructor of this course


  3. These slides were an introduction to a
    long segment with hands-on coding.
    They aren’t meant to stand alone.


  4. What is data aggregation?
    Take some data
    Split it up into pieces
    Apply a computation to each piece
    Combine the results back together again


  6. The split-apply-combine strategy for data analysis.
    Hadley Wickham.
    Journal of Statistical Software, vol. 40, no. 1, pp. 1–29, 2011.
    http://www.jstatsoft.org/v40/i01/paper
    Abstract: Many data analysis problems involve the application of a
    split-apply-combine strategy, where you break up a big problem into
    manageable pieces, operate on each piece independently and then put
    all the pieces back together. This insight gives rise to a new R
    package that allows you to smoothly apply this strategy, without
    having to worry about the type of structure in which your data is
    stored. The paper includes two case studies showing how these
    insights make it easier to work with batting records for veteran
    baseball players and a large 3d array of spatio-temporal ozone
    measurements.
    Keywords: R, apply, split, data analysis.


  7. It turns out that these things matter:
    • how you specify the pieces to split the data into
    • how nicely the results are re-combined


  8. Base R has long had the capability to do data
    aggregation: the “apply” functions.
    But these functions are not well-harmonized re: how to
    specify the pieces, nor do they return the results in a
    highly usable or predictable form.


  9. Consequence of these shortcomings:
    many long-time useRs knew they should be using the
    “apply” functions
    but they did not actually do it
    because it’s kind of painful and annoying


  10. I will give you the Big Picture which includes these
    base R approaches
    But I highly recommend the dplyr and plyr
    packages for most of your data aggregation work
    • Better interface, better return values
    dplyr doesn’t completely replace plyr because you don’t
    always have a data.frame … but keep your eye on purrr
    http://cran.rstudio.com/web/packages/plyr/
    http://cran.rstudio.com/web/packages/dplyr/
    http://cran.rstudio.com/web/packages/purrr/








  11. How do you want to split your data into pieces?
    • rows or columns of a matrix or data.frame
    • groups of observations induced by levels of ≥ 1 factor(s)
    • elements of a list
    This determines how you will attack data aggregation.


  12. How to do for various pieces of a dataset
    ... using only base R functions
    • rows, columns, etc. of a matrix or array → apply()
    • components of a list (remember: data.frames are lists and
      variables are components!) → sapply(), lapply(), vapply()
    • groups of observations induced by levels of ≥ 1 factor(s) →
      aggregate(), tapply(), by(), split() + [sl]apply()
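
    A minimal sketch of these base R idioms, using the built-in iris data
    (the dataset and summaries are my own choices, just for illustration):

    ## rows / columns of a matrix: apply()
    m <- as.matrix(iris[ , 1:4])
    apply(m, 1, mean)    # MARGIN = 1: one mean per row
    apply(m, 2, mean)    # MARGIN = 2: one mean per column

    ## components of a list: sapply() / lapply()
    ## (a data.frame is a list, so its variables are the components)
    sapply(iris[ , 1:4], mean)    # simplified to a named vector
    lapply(iris[ , 1:4], mean)    # same computation, returned as a list

    ## groups of observations induced by a factor
    tapply(iris$Sepal.Length, iris$Species, mean)
    aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
    sapply(split(iris$Sepal.Length, iris$Species), mean)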


  13. http://plyr.had.co.nz
    plyr is a more general
    predecessor of dplyr
    older, slower, but still useful


  14. How to do for various pieces of a dataset
    ... using plyr
    Table 2 of the plyr paper lists its 12 key functions
    (arrays include matrices and vectors as special cases):

                        Output:  Array   Data frame   List    Discarded
    Input: Array                 aaply   adply        alply   a_ply
    Input: Data frame            daply   ddply        dlply   d_ply
    Input: List                  laply   ldply        llply   l_ply

    Each function is named according to the type of input it accepts and
    the type of output it produces: a = array, d = data frame, l = list,
    and _ means the output is discarded. The input type determines how the
    big data structure is broken apart into small pieces; the output type
    determines how the pieces are joined back together again. The effects
    of the input and output types are orthogonal, so instead of having to
    learn all 12 functions individually, it is sufficient to learn the
    three types of input and the four types of output (d*ply = functions
    with common input, *dply = functions with common output).

    The functions have either two or three main arguments, depending on
    the type of input:
    a*ply(.data, .margins, .fun, ..., .progress = "none")
    d*ply(.data, .variables, .fun, ..., .progress = "none")
    l*ply(.data, .fun, ..., .progress = "none")

    The first argument is the .data, which will be split up, processed and
    recombined. The second argument, .margins or .variables, describes how
    to split the input into pieces. The next argument, .fun, is the
    processing function, applied to each piece in turn; additional
    arguments are passed on to it. If you omit .fun, the individual pieces
    are not modified, but the entire data structure is converted from one
    type to another. All arguments start with “.”, which prevents name
    clashes with the arguments of the processing function.


  15. a*ply(.data, .margins, .fun)
    .data: something rectangular (an array; matrices and vectors are
    special cases)
    .margins: 1 ⇒ pieces are rows, 2 ⇒ pieces are columns
    .fun: function to apply to each piece
    The first letter of the function name says what you want back,
    i.e. a, d, l, or nothing.
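
    A minimal a*ply() sketch on a toy matrix (my own example, for
    illustration):

    library(plyr)
    m <- matrix(1:6, nrow = 2)
    aaply(m, 1, sum)    # .margins = 1: one sum per row, returned as an array
    adply(m, 2, sum)    # .margins = 2: one sum per column, returned as a data frame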


  16. d*ply(.data, .variables, .fun)
    .data: a data.frame
    .variables: split the data.frame by levels of these factor(s)
    .fun: function to apply to each piece
    The first letter of the function name says what you want back,
    i.e. a, d, l, or nothing.
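
    A minimal d*ply() sketch on the built-in mtcars data (my own choice of
    data and summaries, for illustration):

    library(plyr)
    ## split mtcars by cyl, compute a summary per piece, return a data.frame
    ddply(mtcars, ~ cyl, summarize, n = length(mpg), mean_mpg = mean(mpg))
    ## same split, but keep one fitted model per piece in a list
    dlply(mtcars, ~ cyl, function(df) lm(mpg ~ wt, data = df))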


  17. l*ply(.data, .fun)
    .data: a list
    .fun: function to apply to each element
    The first letter of the function name says what you want back,
    i.e. a, d, l, or nothing.
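
    A minimal l*ply() sketch on a toy list (my own example, for
    illustration):

    library(plyr)
    x <- list(a = 1:3, b = 4:7, c = 8:10)
    llply(x, range)      # list in, list out
    laply(x, length)     # list in, array (here a vector) out
    ldply(x, function(v) data.frame(min = min(v), max = max(v)))  # list in, data frame out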


  18. The same table of 12 plyr functions, with one of them singled out as
    the most useful one: ddply()


  19. ddply(.data, .variables, .fun = NULL)
    .data: take this data.frame ...
    .variables: ... divide it into pieces, i.e. smaller data.frames,
    based on this factor, and ...
    .fun: ... apply this function to each piece ...
    ... then glue the results back together and return them as a
    data.frame.
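
    A concrete sketch of the whole pattern with a custom .fun, using the
    built-in iris data (my own example, for illustration):

    library(plyr)
    ## one row of coefficients per species, glued back into a data.frame
    ddply(iris, ~ Species, function(df) {
      fit <- lm(Petal.Length ~ Sepal.Length, data = df)
      data.frame(intercept = coef(fit)[1], slope = coef(fit)[2])
    })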


  20. ddply() really is the most useful one!
    The existence of dplyr is, in part, due to just how damn useful
    ddply proved to be.











  21. How to do for various pieces of a dataset
    ... using dplyr

    df %>%
      group_by(suit) %>%
      summarize(…)

    df %>%
      group_by(suit) %>%
      do({…})
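
    A concrete version of the same pattern, using the built-in mtcars data
    (my own choice of data and summaries, for illustration):

    library(dplyr)
    mtcars %>%
      group_by(cyl) %>%
      summarize(n = n(), mean_mpg = mean(mpg))

    mtcars %>%
      group_by(cyl) %>%
      do(head(., 2))    # inside do(), each group arrives as a data.frame named `.`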


  22. good? 2013, plyr solution:
    ddply(gapminder,
          ~ country,
          le_lin_fit)

    better? 2014, dplyr solution:
    gapminder %>%
      group_by(country) %>%
      do(le_lin_fit(.))

    best? 2015, dplyr + broom solution:
    gapminder %>%
      group_by(country) %>%
      do(tidy(lm(lifeExp ~ year, data = .)))
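
    le_lin_fit() is defined elsewhere in the course material; a plausible
    minimal version (an assumption on my part, not shown in these slides)
    fits lifeExp against year for one country's data and returns the
    coefficients as a one-row data.frame:

    ## hypothetical sketch of le_lin_fit(), assuming gapminder-style columns
    le_lin_fit <- function(dat) {
      the_fit <- lm(lifeExp ~ year, data = dat)
      data.frame(intercept = coef(the_fit)[1], slope = coef(the_fit)[2])
    }
    ## the dplyr + broom variant above additionally needs library(broom) for tidy()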

    View Slide

  23. gapminder %>%
    group_by(country) %>%
    do(le_lin_fit(.))
    gapminder %>%
    group_by(country) %>%
    do(tidy(lm(lifeExp ~ year, data = .)))
    If you compute an unnamed data.frame inside do(), you get a
    row-bound data.frame back, which I love.
    When possible, do that!
    broom can be a huge help.
    When not possible …
    consider purrr?
