
UBC STAT545 Split Apply Combine Intro

Lecture slides from UBC STAT545 2015.
Not a stand-alone document.
Introduction to split-apply-combine aka data aggregation.
http://stat545-ubc.github.io/index.html

Jennifer (Jenny) Bryan

October 13, 2015

Transcript

  1. STAT 545A
    Split-Apply-Combine
    aka
    Data Aggregation


  2. Dr. Jennifer (Jenny) Bryan
    Department of Statistics and Michael Smith Laboratories
    University of British Columbia
    [email protected]
    https://github.com/jennybc
    http://www.stat.ubc.ca/~jenny/
    @JennyBryan ← personal, professional Twitter
    https://github.com/STAT545-UBC
    http://stat545-ubc.github.io
    @STAT545 ← Twitter as lead instructor of this course


  3. These slides were an introduction to a
    long segment with hands-on coding.
    They aren’t meant to stand alone.


  4. What is data aggregation?
    Take some data
    Split it up into pieces
    Apply a computation to each piece
    Combine the results back together again


  6. The split-apply-combine strategy for data analysis.
    Hadley Wickham.
    Journal of Statistical Software, vol. 40, no. 1, pp. 1–29, 2011.
    http://www.jstatsoft.org/v40/i01/paper
    Abstract: Many data analysis problems involve the application of a
    split-apply-combine strategy, where you break up a big problem into
    manageable pieces, operate on each piece independently and then put
    all the pieces back together. This insight gives rise to a new R
    package that allows you to smoothly apply this strategy, without
    having to worry about the type of structure in which your data is
    stored. The paper includes two case studies showing how these
    insights make it easier to work with batting records for veteran
    baseball players and a large 3d array of spatio-temporal ozone
    measurements.
    Keywords: R, apply, split, data analysis.


  7. It turns out that these things matter:
    • how you specify the pieces to split the data into
    • how nicely the results are re-combined


  8. Base R has long had the capability to do data
    aggregation: the “apply” functions.
    But these functions are not well-harmonized re: how to
    specify the pieces, nor do they return the results in a
    highly usable or predictable form.


  9. Consequence of these shortcomings:
    many long-time useRs knew they should be using the
    “apply” functions
    but they did not actually do it
    because it’s kind of painful and annoying


  10. I will give you the Big Picture which includes these
    base R approaches
    But I highly recommend the dplyr and plyr
    packages for most of your data aggregation work
    • Better interface, better return values
    dplyr doesn’t completely replace plyr because you don’t
    always have a data.frame … but keep your eye on purrr
    http://cran.rstudio.com/web/packages/plyr/
    http://cran.rstudio.com/web/packages/dplyr/
    http://cran.rstudio.com/web/packages/purrr/








  11. How do you want to split your data into pieces?
    • rows or columns of a matrix or data.frame
    • groups of observations induced by levels of ≥ 1 factor(s)
    • elements of a list
    This determines how you will attack data aggregation.


  12. How to do for various pieces of a dataset
    ... using only base R functions
    • rows, columns, etc. of a matrix or array → apply()
    • components of a list (remember: data.frames are lists and
      variables are components!) → sapply(), lapply(), vapply()
    • groups of observations induced by levels of ≥ 1 factor(s) →
      aggregate(), tapply(), by(), split() + [sl]apply()
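
    A minimal sketch of these base R idioms, using the built-in iris data
    (the dataset and summaries are my own choices, just for illustration):

    ## rows / columns of a matrix: apply()
    m <- as.matrix(iris[ , 1:4])
    apply(m, 1, mean)    # MARGIN = 1: one mean per row
    apply(m, 2, mean)    # MARGIN = 2: one mean per column

    ## components of a list: sapply() / lapply()
    ## (a data.frame is a list, so its variables are the components)
    sapply(iris[ , 1:4], mean)    # simplified to a named vector
    lapply(iris[ , 1:4], mean)    # same computation, returned as a list

    ## groups of observations induced by a factor
    tapply(iris$Sepal.Length, iris$Species, mean)
    aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
    sapply(split(iris$Sepal.Length, iris$Species), mean)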


  13. http://plyr.had.co.nz
    plyr is a more general
    predecessor of dplyr
    older, slower, but still useful


  14. How to do for various pieces of a dataset
    ... using plyr
    Table 2 of the plyr paper lists its 12 key functions
    (arrays include matrices and vectors as special cases):

                        Output:  Array   Data frame   List    Discarded
    Input: Array                 aaply   adply        alply   a_ply
    Input: Data frame            daply   ddply        dlply   d_ply
    Input: List                  laply   ldply        llply   l_ply

    Each function is named according to the type of input it accepts and
    the type of output it produces: a = array, d = data frame, l = list,
    and _ means the output is discarded. The input type determines how the
    big data structure is broken apart into small pieces; the output type
    determines how the pieces are joined back together again. The effects
    of the input and output types are orthogonal, so instead of having to
    learn all 12 functions individually, it is sufficient to learn the
    three types of input and the four types of output (d*ply = functions
    with common input, *dply = functions with common output).

    The functions have either two or three main arguments, depending on
    the type of input:
    a*ply(.data, .margins, .fun, ..., .progress = "none")
    d*ply(.data, .variables, .fun, ..., .progress = "none")
    l*ply(.data, .fun, ..., .progress = "none")

    The first argument is the .data, which will be split up, processed and
    recombined. The second argument, .margins or .variables, describes how
    to split the input into pieces. The next argument, .fun, is the
    processing function, applied to each piece in turn; additional
    arguments are passed on to it. If you omit .fun, the individual pieces
    are not modified, but the entire data structure is converted from one
    type to another. All arguments start with “.”, which prevents name
    clashes with the arguments of the processing function.


  15. a*ply(.data, .margins, .fun)
    .data: something rectangular (an array; matrices and vectors are
    special cases)
    .margins: 1 ⇒ pieces are rows, 2 ⇒ pieces are columns
    .fun: function to apply to each piece
    The first letter of the function name says what you want back,
    i.e. a, d, l, or nothing.
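
    A minimal a*ply() sketch on a toy matrix (my own example, for
    illustration):

    library(plyr)
    m <- matrix(1:6, nrow = 2)
    aaply(m, 1, sum)    # .margins = 1: one sum per row, returned as an array
    adply(m, 2, sum)    # .margins = 2: one sum per column, returned as a data frame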


  16. d*ply(.data, .variables, .fun)
    .data: a data.frame
    .variables: split the data.frame by levels of these factor(s)
    .fun: function to apply to each piece
    The first letter of the function name says what you want back,
    i.e. a, d, l, or nothing.
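
    A minimal d*ply() sketch on the built-in mtcars data (my own choice of
    data and summaries, for illustration):

    library(plyr)
    ## split mtcars by cyl, compute a summary per piece, return a data.frame
    ddply(mtcars, ~ cyl, summarize, n = length(mpg), mean_mpg = mean(mpg))
    ## same split, but keep one fitted model per piece in a list
    dlply(mtcars, ~ cyl, function(df) lm(mpg ~ wt, data = df))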


  17. l*ply(.data, .fun)
    .data: a list
    .fun: function to apply to each element
    The first letter of the function name says what you want back,
    i.e. a, d, l, or nothing.
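
    A minimal l*ply() sketch on a toy list (my own example, for
    illustration):

    library(plyr)
    x <- list(a = 1:3, b = 4:7, c = 8:10)
    llply(x, range)      # list in, list out
    laply(x, length)     # list in, array (here a vector) out
    ldply(x, function(v) data.frame(min = min(v), max = max(v)))  # list in, data frame out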


  18. The same table of 12 plyr functions, with one of them singled out as
    the most useful one: ddply()


  19. ddply(.data, .variables, .fun = NULL)
    .data: take this data.frame ...
    .variables: ... divide it into pieces, i.e. smaller data.frames,
    based on this factor, and ...
    .fun: ... apply this function to each piece ...
    ... then glue the results back together and return them as a
    data.frame.
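
    A concrete sketch of the whole pattern with a custom .fun, using the
    built-in iris data (my own example, for illustration):

    library(plyr)
    ## one row of coefficients per species, glued back into a data.frame
    ddply(iris, ~ Species, function(df) {
      fit <- lm(Petal.Length ~ Sepal.Length, data = df)
      data.frame(intercept = coef(fit)[1], slope = coef(fit)[2])
    })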


  20. ddply() really is the most useful one!
    The existence of dplyr is, in part, due to just how damn useful
    ddply proved to be.











  21. How to do for various pieces of a dataset
    ... using dplyr

    df %>%
      group_by(suit) %>%
      summarize(…)

    df %>%
      group_by(suit) %>%
      do({…})
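
    A concrete version of the same pattern, using the built-in mtcars data
    (my own choice of data and summaries, for illustration):

    library(dplyr)
    mtcars %>%
      group_by(cyl) %>%
      summarize(n = n(), mean_mpg = mean(mpg))

    mtcars %>%
      group_by(cyl) %>%
      do(head(., 2))    # inside do(), each group arrives as a data.frame named `.`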


  22. good? 2013, plyr solution:
    ddply(gapminder,
          ~ country,
          le_lin_fit)

    better? 2014, dplyr solution:
    gapminder %>%
      group_by(country) %>%
      do(le_lin_fit(.))

    best? 2015, dplyr + broom solution:
    gapminder %>%
      group_by(country) %>%
      do(tidy(lm(lifeExp ~ year, data = .)))
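
    le_lin_fit() is defined elsewhere in the course material; a plausible
    minimal version (an assumption on my part, not shown in these slides)
    fits lifeExp against year for one country's data and returns the
    coefficients as a one-row data.frame:

    ## hypothetical sketch of le_lin_fit(), assuming gapminder-style columns
    le_lin_fit <- function(dat) {
      the_fit <- lm(lifeExp ~ year, data = dat)
      data.frame(intercept = coef(the_fit)[1], slope = coef(the_fit)[2])
    }
    ## the dplyr + broom variant above additionally needs library(broom) for tidy()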

    View Slide

  23. gapminder %>%
    group_by(country) %>%
    do(le_lin_fit(.))
    gapminder %>%
    group_by(country) %>%
    do(tidy(lm(lifeExp ~ year, data = .)))
    If you compute an unnamed data.frame inside do(), you get a
    row-bound data.frame back, which I love.
    When possible, do that!
    broom can be a huge help.
    When not possible …
    consider purrr?
