Jennifer (Jenny) Bryan
October 13, 2015

UBC STAT545 Split Apply Combine Intro

Lecture slides from UBC STAT545 2015.
Not a stand-alone document.
Introduction to split-apply-combine aka data aggregation.
http://stat545-ubc.github.io/index.html

Transcript

1. STAT 545A
Split-Apply-Combine
aka
Data Aggregation

2. Dr. Jennifer (Jenny) Bryan
Department of Statistics and Michael Smith Laboratories
University of British Columbia
[email protected]
https://github.com/jennybc
http://www.stat.ubc.ca/~jenny/
https://github.com/STAT545-UBC
http://stat545-ubc.github.io

3. These slides were an introduction to a
long segment with hands-on coding.
They aren’t meant to stand alone.

4. What is data aggregation?
Take some data
Split it up into pieces
Apply a computation to each piece
Combine the results back together again
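
The four steps above can be sketched in base R. This is a minimal illustration, assuming the built-in iris data as a stand-in for "some data"; the object names (pieces, piece_means, result) are made up for this example.

```r
# Take some data: iris ships with base R

# Split it up into pieces: one data.frame per species
pieces <- split(iris, iris$Species)

# Apply a computation to each piece: mean sepal length within the piece
piece_means <- lapply(pieces, function(piece) {
  data.frame(Species = piece$Species[1],
             mean_sepal_length = mean(piece$Sepal.Length))
})

# Combine the results back together: row-bind into one data.frame
result <- do.call(rbind, piece_means)
result
```

The rest of the deck is largely about tools that do the split and combine steps for you, so you only have to write the function in the middle.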

6. Hadley Wickham (Rice University). The split-apply-combine strategy for data analysis.
Journal of Statistical Software, vol. 40, no. 1, pp. 1–29, April 2011.
http://www.jstatsoft.org/v40/i01/paper
Abstract
Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. This insight gives rise to a new R package that allows you to smoothly apply this strategy, without having to worry about the type of structure in which your data is stored.
The paper includes two case studies showing how these insights make it easier to work with batting records for veteran baseball players and a large 3d array of spatio-temporal ozone measurements.
Keywords: R, apply, split, data analysis.

7. It turns out that these things matter:
• how you specify the pieces to split the data into
• how nicely the results are re-combined

8. Base R has long had the capability to do data
aggregation
the “apply” functions
but these functions are not well-harmonized re: how
to specify the pieces
nor do they return the results in a highly usable or
predictable form

9. Consequence of these shortcomings:
many long-time useRs knew they should be using the
“apply” functions
but they did not actually do it
because it’s kind of painful and annoying

10. I will give you the Big Picture which includes these
base R approaches
But I highly recommend the dplyr and plyr
packages for most of your data aggregation work
• Better interface, better return values
dplyr doesn’t completely replace plyr because
you don’t always have a data.frame … but keep your
eye on purrr
http://cran.rstudio.com/web/packages/plyr/
http://cran.rstudio.com/web/packages/dplyr/
http://cran.rstudio.com/web/packages/purrr/

11. How do you want to split your data into pieces?
• rows or columns of a matrix or data.frame
• groups of observations induced by levels of ≥ 1 factor(s)
• elements of a list
This determines how you will attack data aggregation.

12. How to do for various pieces of a dataset
... using only base R functions
chunks are ... | relevant functions
rows, columns, etc. of matrix or array | apply()
components of a list (remember data.frames are lists and variables are components!) | sapply(), lapply(), vapply()
groups of observations induced by levels of ≥ 1 factor(s) | aggregate(), tapply(), by(), split() + [sl]apply()
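
A short sketch of one call per row of the table above, using only base R and built-in data. The matrix m and all result names are invented for illustration; the point is how differently each function asks for the pieces and returns the answer.

```r
# Rows or columns of a matrix: apply()
m <- matrix(1:6, nrow = 2)          # 2 rows, 3 columns
row_sums <- apply(m, 1, sum)        # margin 1 = rows
col_sums <- apply(m, 2, sum)        # margin 2 = columns

# Components of a list (the variables of a data.frame are list components)
col_classes <- sapply(iris, class)
col_means <- vapply(iris[1:4], mean, numeric(1))  # vapply checks each result is one number

# Groups of observations induced by the levels of a factor
tapply_means <- tapply(iris$Sepal.Length, iris$Species, mean)                  # named array
aggregate_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)  # data.frame
split_means <- sapply(split(iris$Sepal.Length, iris$Species), mean)            # named vector
```

The last three calls compute the same group means but each takes its grouping information in a different form and hands back a different kind of object.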

13. plyr is a more general
predecessor of dplyr
older, slower, but still useful

14. How to do for various pieces of a dataset
... using plyr
Excerpt from the plyr paper, Section 3 (Usage):
Table 2 lists the basic set of plyr functions. Each function is named according to the type of input it accepts and the type of output it produces: a = array, d = data frame, l = list, and _ means the output is discarded. The input type determines how the big data structure is broken apart into small pieces, described in Section 3.1; and the output type determines how the pieces are joined back together again, described in Section 3.2.
The effects of the input and outputs types are orthogonal, so instead of having to learn all 12 functions individually, it is sufficient to learn the three types of input and the four types of output. For this reason, we use the notation d*ply for functions with common input, a complete row of Table 2, and *dply for functions with common output, a column of Table 2.
The functions have either two or three main arguments, depending on the type of input:
a*ply(.data, .margins, .fun, ..., .progress = "none")
d*ply(.data, .variables, .fun, ..., .progress = "none")
l*ply(.data, .fun, ..., .progress = "none")
The first argument is the .data which will be split up, processed and recombined. The second argument, .variables or .margins, describes how to split up the input into pieces. The third argument, .fun, is the processing function, and is applied to each piece in turn. All further arguments are passed on to the processing function. If you omit .fun the individual pieces will not be modified, but the entire data structure will be converted from one type to another.
The .progress argument controls display of a progress bar, and is described at the end of Section 4.
Note that all arguments start with “.”. This prevents name clashes with the arguments of the processing function, and helps to visually delineate arguments that control the repetition …
Input \ Output | Array | Data frame | List | Discarded
Array | aaply | adply | alply | a_ply
Data frame | daply | ddply | dlply | d_ply
List | laply | ldply | llply | l_ply
Table 2: The 12 key functions of plyr. Arrays include matrices and vectors as special cases.

15. a*ply(.data, .margins, .fun)
.data: something rectangular
.margins: 1 ⇒ pieces are rows, 2 ⇒ pieces are columns
.fun: function to apply to each piece
*: what you want back, i.e. a, d, l, nothing

16. d*ply(.data, .variables, .fun)
.data: data.frame
.variables: split by levels of these factor(s)
.fun: function to apply to each piece
*: what you want back, i.e. a, d, l, nothing

17. l*ply(.data, .fun)
.data: list
.fun: function to apply to each element
*: what you want back, i.e. a, d, l, nothing
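
A small sketch of how the second letter picks the output type for list input, assuming the plyr package is installed. The list pieces and the result names are invented for this example.

```r
library(plyr)  # assumes plyr is installed

pieces <- list(a = 1:3, b = 4:6)

# Same input, same computation, three different output types
as_array <- laply(pieces, mean)                                    # l -> a
as_frame <- ldply(pieces, function(x) data.frame(mean = mean(x)))  # l -> d
as_list  <- llply(pieces, mean)                                    # l -> l
```

Only the function name changes between the three calls; the way the pieces are specified stays the same.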

18. Of the 12 plyr functions in Table 2, ddply is the most useful one!

19. ddply(.data, .variables, .fun = NULL)
.data: Take this data.frame ...
.variables: divide it into pieces, i.e. smaller data.frames, based on this factor and ...
.fun: apply this function to each piece ...
... glue the results back together and return as a data.frame
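
A minimal ddply sketch of the description above, assuming plyr is installed and using the built-in iris data. The helper sepal_summary is hypothetical, written only for this example.

```r
library(plyr)  # assumes plyr is installed

# Hypothetical helper: takes one piece (a smaller data.frame),
# returns a one-row data.frame
sepal_summary <- function(piece) {
  data.frame(n = nrow(piece),
             mean_sepal_length = mean(piece$Sepal.Length))
}

# Split iris by Species, apply the helper, glue the rows back together
by_species <- ddply(iris, ~ Species, sepal_summary)
by_species

# Same split and same function, but keep the results as a list instead
by_species_list <- dlply(iris, ~ Species, sepal_summary)
```

ddply adds the splitting variable (Species) as a column of the result, so the helper does not need to carry it along.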

20. ddply: the most useful one!
The existence of dplyr is, in part, due to just how damn useful ddply proved to be.

21. df %>%
group_by(suit) %>%
summarize(…)

df %>%
group_by(suit) %>%

How to do for various pieces of a dataset
... using dplyr

df %>%
group_by(suit) %>%
do({…})
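
A runnable sketch of the group_by() pattern, assuming dplyr is installed. The built-in mtcars data and its cyl column stand in for the slide's df and suit; the summary and mutate expressions are examples chosen here, not taken from the slides.

```r
library(dplyr)  # assumes dplyr is installed

# One row back per group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n(), mean_mpg = mean(mpg))

# One row back per original row, computed within each group
centered <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_centered = mpg - mean(mpg))
```

In both cases group_by() does the split, and the verb that follows decides the shape of what comes back.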

22. good?, 2013, plyr solution
ddply(gapminder,
      ~ country,
      le_lin_fit)
better?, 2014, dplyr solution
gapminder %>%
  group_by(country) %>%
  do(le_lin_fit(.))
best? 2015, dplyr + broom solution
gapminder %>%
  group_by(country) %>%
  do(tidy(lm(lifeExp ~ year, data = .)))
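
The slides call le_lin_fit() without showing its body. The sketch below is one plausible version, an assumption for illustration only: it fits life expectancy against year and returns the two coefficients as a one-row data.frame. The toy data is invented (not real gapminder values) so the example runs in base R without extra packages.

```r
# Hypothetical helper: body and the offset argument are assumptions
le_lin_fit <- function(dat, offset = 1952) {
  fit <- lm(lifeExp ~ I(year - offset), data = dat)
  setNames(data.frame(t(coef(fit))), c("intercept", "slope"))
}

# Toy data, invented for illustration; exactly linear by construction
years <- seq(1952, 2007, by = 5)
toy <- data.frame(
  country = rep(c("A", "B"), each = length(years)),
  year    = rep(years, times = 2),
  lifeExp = c(50 + 0.5 * (years - 1952), 60 + 0.2 * (years - 1952))
)

# Base R split-apply-combine with the helper
fits <- do.call(rbind, lapply(split(toy, toy$country), le_lin_fit))
fits
```

Because the helper returns a data.frame, the same function could be handed unchanged to ddply(toy, ~ country, le_lin_fit).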

23. gapminder %>%
  group_by(country) %>%
  do(le_lin_fit(.))
gapminder %>%
  group_by(country) %>%
  do(tidy(lm(lifeExp ~ year, data = .)))
If you compute an unnamed data.frame inside do(), you get a row-bound data.frame back, which I love.
When possible, do that!
broom can be a huge help.
When not possible …
consider purrr?