Etienne
November 15, 2012

# Hadley Ecosystem: Reshape, Plyr, GGplot

This talk gives a flyover of a few of the packages Hadley Wickham and his collaborators have created. Many of us now use these packages in every project we tackle, and they have become essential tools in the R enthusiast's toolbox. A brief tutorial on the key features of each package and how to use them is followed by a hands-on exercise. Tips and tricks for super users are also provided.

-reshape: make your data play nice (Too many columns, no problem)

-plyr: split/apply/combine (extract the slope of a linear model for each of your thousand replicates)

## Transcript

Ecosystem:
reshape
plyr
ggplot
Etienne Low-Decarie
[Slide background: page from the Journal of Statistical Software showing Figure 1 (the three ways to split up a 2d matrix) and Figure 2 (the seven ways to split up a 3d array)]

2. You
[Figure: first and last names of the attendees]

3. You
[Figure: count by Location, count axis from 0 to 100; legend: attendee (FALSE, TRUE). Locations: Aberdeen; Austin, TX; Calgary, AB; Campinas; Côte Saint-Luc, QC; Edinburgh; Lasalle, QC; Laval, QC; Mississauga, ON; Montreal, QC; Montréal, QC; Montréal-Ouest, QC; New York, NY; Ottawa, ON; Outremont, QC; Palo Alto, CA; Sainte-Julie, QC; Stowe, VT; Toronto, ON; Verdun, QC; Washington, DC]

4. You
[Figure: count by RSVPed.Yes; legend: attendee (FALSE, TRUE)]

5. You
  R level?
  Have plotted with base R?
  Have you:
  used reshape?
  used plyr?
  used ggplot?
You

6. Outline
  reshape
  Make your data play nice
  10 minutes hands on
  plyr
  Split-Apply-Combine on steroids
  to summarize or transform your data
  15 minutes hands on
  ggplot
  beautiful plots one layer at a time
  15 minutes hands on
  Power user goodies on demand

7. on demand during hands on:
superuser stuff
 ggplot themes
 plyr
  multicore
  progress bar
 reshape, plyr and ggplot all together
  great exploratory plots
 upcoming dplyr
 more of the Hadley ecosystem

8. Code and HTML available at:
  https://github.com/MontrealRUserGroup

9. Required packages
  the obvious:
  plyr
  reshape(2)
  ggplot2
  for a little more data to play with:
  vegan
  vegetarian
  for pretty graphic tables
  gridExtra
  help(package="package name")
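A minimal sketch for checking that the packages listed above are available before the hands-on parts. The names `needed`, `available` and `not.installed` are made up for this example, and the install line is left commented out so nothing is installed without your say-so.

```r
# Packages named on the slide (reshape2 is the successor of reshape)
needed <- c("plyr", "reshape2", "ggplot2", "vegan", "vegetarian", "gridExtra")

# requireNamespace() returns TRUE/FALSE without attaching the package
available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not.installed <- needed[!available]

print(not.installed)
# install.packages(not.installed)   # uncomment to install what is missing
```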

10. reshape
reshape

11. reshape
  Wide
  Each level of a factor gets a column
  Multiple measurements per row
  Excel, SPSS…
  Pros
  Plays nice with humans
  No data repetition
  “Eyeballable”
  Cons
  Does not play nice with R
| ID variable | Level 1 | Level 2 |
| --- | --- | --- |
| ID 1 | Measured value | Measured value |
| ID 2 | Measured value | Measured value |

12.   Long
  Levels are expressed in a column
  One measured value per row
  e.g. really long: XML, JSON (tag:content pairs)
  Pros
  Plays nice with computers (API, databases, plyr, ggplot2…)
  Cons
  Does not play nice with humans
  Lots of copy pasting and forget eyeballing it!
| ID variable | Factor | Measured value |
| --- | --- | --- |
| ID 1 | Level 1 | Measured value |
| ID 1 | Level 2 | Measured value |
| ID 2 | Level 1 | Measured value |
| ID 2 | Level 2 | Measured value |
reshape
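To make the two layouts concrete, here is a small base R sketch that builds the same made-up measurements once in wide form and once in long form. The column names and the numbers are invented for illustration only.

```r
# Wide: one row per ID, one column per level of the factor
wide <- data.frame(id = c("ID 1", "ID 2"),
                   level.1 = c(10, 20),
                   level.2 = c(11, 21))

# Long: one row per measured value, the level is stored in a column
long <- data.frame(id = rep(wide$id, times = 2),
                   factor = rep(c("level.1", "level.2"), each = nrow(wide)),
                   value = c(wide$level.1, wide$level.2))

print(wide)
print(long)
```

Both data frames hold the same four values; only the long one repeats the ID for every measurement.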

13. Look at data
  What format is…?
  data(simesants)
  data(iris)
  data(sipoo)
  Look at more data
  data()
reshape
long/wide?

14. Wide
| ID variable | Level 1 | Level 2 |
| --- | --- | --- |
| ID 1 | Measured value | Measured value |
| ID 2 | Measured value | Measured value |
Long
| ID variable | Factor | Measured value |
| --- | --- | --- |
| ID 1 | Level 1 | Measured value |
| ID 1 | Level 2 | Measured value |
| ID 2 | Level 1 | Measured value |
| ID 2 | Level 2 | Measured value |
reshape

15. Make your data play nice
  Switching from long to wide
  library(reshape)
  melt()
  cast()
reshape

16. Melt: go long
molten.data <- melt(data,  # call head assumed; it was lost from the slide text
                    id.vars=c("id.var.1", "id.var.2"),
                    measure.vars=c("measure.vars", "measure.vars"),
                    variable_name = "variable")

reshape
Super user hint: produce beautiful
tables with require(gridExtra) and
grid.table()

17. Melt: go long

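iris$id <- row.names(iris)  # right-hand side assumed; it was lost from the slide text
molten.iris <- melt(iris,   # call head assumed; it was lost from the slide text
                    id.vars=c("Species", "id"),
                    #measure.vars=c("measure.vars", "measure.vars"),
                    variable_name = "measure")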

reshape
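As a runnable sketch of the melt step, the block below uses reshape2 rather than reshape, which is an assumption about what you have installed; in reshape2 the argument is spelled `variable.name` where reshape uses `variable_name`. The copy `iris.id` is a name introduced here so the built-in data set is left untouched.

```r
library(reshape2)

iris.id <- iris
iris.id$id <- row.names(iris.id)     # one identifier per flower

# Every column that is not an id variable becomes a measured variable
molten.iris <- melt(iris.id,
                    id.vars = c("Species", "id"),
                    variable.name = "measure")

head(molten.iris)
dim(molten.iris)
```

The four measurement columns of each flower end up stacked in the single `value` column, with `measure` recording which column each value came from.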

18. Cast: go wide
cast.data <- cast(molten.data,  # call head assumed; it was lost from the slide text
                  formula = id_var_1 + id_var_2 ~ measure_var_1 + measure_var_2)

... means all other variables

Super user hint: skip plyr and summarize
your data with incomplete formula and
cast(fun.aggregate=…)
reshape

19. Cast: go wide

cast.iris <- cast(molten.iris,  # call head assumed; it was lost from the slide text
                  formula = Species + id ~ ...)

Super user hint: skip plyr and summarize
your data with incomplete formula and
cast(fun.aggregate=…)
reshape
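The cast step and the super user hint can be sketched as follows. This again assumes reshape2, where the data frame version of `cast()` is called `dcast()`; `iris.id` and `species.means` are names introduced for the example.

```r
library(reshape2)

iris.id <- iris
iris.id$id <- row.names(iris.id)
molten.iris <- melt(iris.id, id.vars = c("Species", "id"),
                    variable.name = "measure")

# Complete formula: every id variable on the left, so no aggregation is needed
cast.iris <- dcast(molten.iris, Species + id ~ measure, value.var = "value")

# Incomplete formula: id is left out, so several values land in each cell
# and fun.aggregate collapses them (here to the mean per species)
species.means <- dcast(molten.iris, Species ~ measure,
                       value.var = "value", fun.aggregate = mean)
print(species.means)
```

With the complete formula the result has the same shape as the original wide data; with the incomplete one you get a summary table without touching plyr.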

20. Try melt and cast
  with baseball produce ->
  with iris: produce:
reshape
Discuss how you format/store your data

21. plyr
plyr

22. plyr
Plyr

23. Split-Apply-Combine
  Equivalent
  SQL GROUP BY
  Pivot Tables (Excel, SPSS, …)
  Split
  Define a subset of your data
  Apply
  Do anything to this subset
  calculation, modeling, simulations, plotting
  Combine
  Repeat this for all subsets
  collect the results
[Figure excerpt from the Journal of Statistical Software, p. 7]
Figure 1: The three ways to split up a 2d matrix, labelled above by the dimensions that they slice. Original matrix shown at top left, with dimensions labelled. A single piece under each splitting scheme is colored blue. (Panels: 1; 2; 1,2)
Figure 2: The seven ways to split up a 3d array, labelled above by the dimensions that they slice up. Original array shown at top left, with dimensions labelled. Blue indicates a single piece of the output. (Panels: 1; 2; 3; 1,2; 1,3; 2,3; 1,2,3)
m*ply() takes a matrix, list-array, or data frame, splits it up by rows and calls the processing function supplying each piece as its parameters. Figure 3 shows how you might use this to draw random numbers from normal distributions with varying parameters.
Input: Data frame (d*ply)
When operating on a data frame, you usually want to split it up into groups based on combinations of variables in the data set. For d*ply you specify which variables (or functions of variables) to use. These variables are specified in a special way to highlight that they are
Split
plyr
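Before turning to plyr, the three steps can be spelled out in base R alone. This is only a sketch of the idea, using the built-in iris data; the names `pieces`, `results` and `combined` are made up for the example.

```r
# Split: one data frame per species
pieces <- split(iris, iris$Species)

# Apply: the same calculation on every piece
results <- lapply(pieces, function(subset.data) {
  data.frame(Species = as.character(subset.data$Species[1]),
             mean.sepal.length = mean(subset.data$Sepal.Length))
})

# Combine: stack the per-piece results into one data frame
combined <- do.call(rbind, results)
print(combined)
```

plyr wraps these three steps in a single call and takes care of the bookkeeping.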

24. Functions
  functions
  _ _ ply
  d = data.frame
  a = array
  l = list
  special
  r = replicate
ddply: the first letter is the input format, the second letter is the output format
plyr
Super user hint:
check out help(package=plyr) for
things like each, join, colwise..
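The naming scheme can be tried out with a few one-liners. This sketch assumes plyr is installed and counts rows per species in iris; the result names are invented for the example.

```r
library(plyr)

# dd: data frame in, data frame out
counts.df <- ddply(iris, "Species", function(x) data.frame(n = nrow(x)))

# dl: data frame in, list out (one element per species)
counts.list <- dlply(iris, "Species", nrow)

# ld: list in, data frame out
counts.back <- ldply(counts.list, function(n) data.frame(n = n))

# r: replicate an expression, here 3 times, and collect into a data frame
draws <- rdply(3, mean(rnorm(10)))

print(counts.df)
print(draws)
```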

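my.function <- function(subset.data){  # signature assumed; lost from the slide text
  results <- …                         # computation lost from the slide text
  return(data.frame(results))}

my.function can produce as many rows as subset.data (transform)
or fewer rows than subset.data (summarize)

returned.results <- ddply(.data=data,  # call head assumed; lost from the slide text
                          .variables=c("variable1", "variable2"),
                          function(subset.data) my.function(subset.data))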
How it works
Super user hint:
•  look under the hood, as plyr is written in R
•  think you can do better? plyr is on GitHub
Warning: idiosyncrasies present
plyr

26. Example 1
  Calculate the mean of each measure for each species using the molten data set
Super user hint: note __ply's helper function rbind.fill(), very useful for merging many data.frames
molten.means <- ddply(.data=molten.iris,  # call head assumed; lost from the slide text
                      .variables=c("Species", "measure"),
                      function(subset.data) data.frame(mean=mean(subset.data$value)))
plyr
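For an end-to-end version of this example, the sketch below rebuilds the molten data with reshape2 (an assumption, the slides use reshape) and uses plyr's `summarise` helper in place of the anonymous function. `iris.id` is a name introduced here.

```r
library(reshape2)
library(plyr)

iris.id <- iris
iris.id$id <- row.names(iris.id)
molten.iris <- melt(iris.id, id.vars = c("Species", "id"),
                    variable.name = "measure")

# summarise builds one row per subset from the expressions given after it
molten.means <- ddply(molten.iris, c("Species", "measure"),
                      summarise, mean = mean(value))
print(molten.means)
```

Three species times four measures gives twelve rows, one mean each.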

27. Example 3
  Slope of width on length
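Super user hint: on big jobs, plyr can tell you where it's at (.progress="text")
plyr
length.on.width.slope <- function(subset.data){  # signature assumed; lost from the slide text
  with(subset.data,{
    # both right-hand sides are assumed from the function name; they were lost from the slide text
    slope.sepal <- lm(Sepal.Length~Sepal.Width)$coefficients[2]
    slope.petal <- lm(Petal.Length~Petal.Width)$coefficients[2]
    return(data.frame(slope.sepal=slope.sepal,
                      slope.petal=slope.petal))
  })
}
iris.slopes <- ddply(.data=iris,  # call head assumed; lost from the slide text
                     .variables="Species",
                     function(x)length.on.width.slope(x))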
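A self-contained variation on the same idea: fit one linear model per species and keep both coefficients. The helper `sepal.fit` and the choice of `Sepal.Width ~ Sepal.Length` are assumptions made for this sketch, not the function from the slide.

```r
library(plyr)

# Hypothetical helper: fit Sepal.Width ~ Sepal.Length on one subset
# and return the two coefficients as a one-row data frame
sepal.fit <- function(subset.data) {
  fit <- lm(Sepal.Width ~ Sepal.Length, data = subset.data)
  data.frame(intercept = unname(coef(fit)[1]),
             slope = unname(coef(fit)[2]))
}

iris.fits <- ddply(iris, "Species", sepal.fit)
print(iris.fits)
```

Returning a one-row data frame per subset is what lets ddply stack the results with the species label attached.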

28. try mean calculation on original iris
  create different outputs
  dlply
  daply
  d_ply
  when would you use this?
  take in different inputs
  ldply
  rdply
  change functions
  sd, length
  range=max()-min()
  to calculate many statistics
  to do more complex stuff
  calculate slope and intercept of Sepal.Width~Sepal.Length
  to plot
  apply to other data
  melt and cast data
  simesants, rats, iris, sipoo, weeds, your own data
plyr
how you would/have used plyr
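One possible starting point for the exercise, offered as a sketch rather than the intended answer: several statistics at once on the original iris data, then the same split with a different output type. The result names are invented here.

```r
library(plyr)

# Several statistics at once on the original (wide) iris data
sepal.stats <- ddply(iris, "Species", summarise,
                     mean = mean(Sepal.Length),
                     sd = sd(Sepal.Length),
                     n = length(Sepal.Length),
                     range = max(Sepal.Length) - min(Sepal.Length))
print(sepal.stats)

# Same split, different output: daply returns an array named by species
sepal.means <- daply(iris, "Species", function(x) mean(x$Sepal.Length))
print(sepal.means)
```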

29. ggplot
ggplot

30. ggplot
Figure 1. Graphics objects produced by (from left to right): geometric objects, scales and coordinate system, plot annotations.
1. a graphic is made of (independent) elements, or layers (as opposed to a single encapsulating name)
  data
  aesthetics
  transformation
  geoms (geometric objects)
  axis (coordinate system)
  scales
Grammar of graphics (gg)
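The elements listed above map onto separate pieces of a ggplot2 call. The sketch below assumes ggplot2 is installed; the particular geom, palette and axis limits are arbitrary choices made for illustration.

```r
library(ggplot2)

layered.plot <- ggplot(data = iris,                         # data
                       aes(x = Sepal.Length,                # aesthetics
                           y = Sepal.Width,
                           colour = Species)) +
  geom_point() +                                            # geom
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) + # statistical transformation
  scale_colour_brewer(palette = "Dark2") +                  # scale
  coord_cartesian(xlim = c(4, 8))                           # coordinate system

print(layered.plot)
```

Each piece can be swapped independently, which is the point of the grammar.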

31. ggplot
2. editing an element produces a new graph
  just change the coordinate system!
Grammar of graphics (gg)
Figure 16. Bar chart (left) and equivalent Coxcomb plot (right) of clarity distribution. The Coxcomb plot is a bar chart in polar coordinates. Note that the categories abut in the Coxcomb, but are separated in the bar chart: this is an example of a graphical convention that differs in different coordinate systems.
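A sketch of the coordinate-system swap, assuming the clarity distribution in the figure comes from the `diamonds` data set that ships with ggplot2. The object names are made up here.

```r
library(ggplot2)

# Bar chart of the clarity distribution in the diamonds data set
bar.chart <- ggplot(diamonds, aes(x = clarity)) +
  geom_bar(width = 1)

# Same plot object, only the coordinate system changes
coxcomb <- bar.chart + coord_polar()

print(bar.chart)
print(coxcomb)
```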

32. ggplot
1.  create a simple plot object
  plot.object <- …
2.  … options available on:
  http://docs.ggplot2.org
  repeat step 2 until satisfied
3.  print your object to screen (or to a graphical device)
  print(plot.object)
How it works
Super user request:
send me your best ggplot (pdf)
and you can show it off and discuss it
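The workflow can be sketched as below. What exactly gets added in step 2 is an assumption (a point layer and a title), and the temporary PDF file stands in for whatever graphical device you prefer.

```r
library(ggplot2)

# 1. create a simple plot object
plot.object <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))

# 2. add to it, repeating until satisfied
plot.object <- plot.object + geom_point()
plot.object <- plot.object + labs(title = "Sepal dimensions")

# 3. print to a graphical device (here a PDF in a temporary file)
out.file <- tempfile(fileext = ".pdf")
pdf(out.file)
print(plot.object)
dev.off()
```

Nothing is drawn until `print()` is called, so the object can be edited as many times as needed first.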

33. ggplot
Example 1
  Most basic plot
basic.plot <- qplot(data=iris,  # call head assumed; lost from the slide text
                    x=Sepal.Length,
                    y=Sepal.Width)

print(basic.plot)

34. ggplot
Example 1
  Most basic plot (categorical)
categorical.plot <- qplot(data=iris,  # call head assumed; lost from the slide text
                          x=Species,
                          y=Sepal.Width)

print(categorical.plot)

35. ggplot
Example 1
  Edited most basic plot
basic.plot <- qplot(data=iris,  # call head assumed; lost from the slide text
                    x=Sepal.Length,
                    xlab="Sepal Length (cm)",
                    y=Sepal.Width,
                    ylab="Sepal Width (cm)",
                    main="Sepal dimensions")

print(basic.plot)

36. ggplot
Example 1
basic.plot <- qplot(data=iris,  # call head assumed; lost from the slide text
                    x=Sepal.Length,
                    xlab="Sepal Length (cm)",
                    y=Sepal.Width,
                    ylab="Sepal Width (cm)",
                    main="Sepal dimensions",
                    colour=Species,
                    shape=Species,
                    alpha=I(0.5))

print(basic.plot)
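`qplot()` is deprecated in recent ggplot2 releases, so here is a sketch of the same plot written with `ggplot()`. It is an equivalent under that assumption, not the code from the slide.

```r
library(ggplot2)

basic.plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                               colour = Species, shape = Species)) +
  geom_point(alpha = 0.5) +            # fixed transparency, like alpha=I(0.5)
  labs(x = "Sepal Length (cm)",
       y = "Sepal Width (cm)",
       title = "Sepal dimensions")

print(basic.plot)
```

Mapped aesthetics (colour, shape) go inside `aes()`, while fixed values such as the transparency are set directly on the geom.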

37. ggplot
Example 1
  Add a geom (eg. linear smooth)
plot.with.linear.smooth <- basic.plot + geom_smooth(method="lm")  # right-hand side assumed; lost from the slide text
print(plot.with.linear.smooth)
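A self-contained sketch of adding a linear smooth as a second layer. The base plot is rebuilt with `ggplot()` so the block runs on its own; mapping colour to Species is a choice made here, which gives one fitted line per species.

```r
library(ggplot2)

basic.plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                               colour = Species)) +
  geom_point()

# Adding a geom returns a new plot object; basic.plot itself is unchanged
plot.with.linear.smooth <- basic.plot +
  geom_smooth(method = "lm", formula = y ~ x)

print(plot.with.linear.smooth)
```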

38. ggplot
Example 2
CO2.plot <- qplot(data=CO2,  # call head assumed; lost from the slide text
                  x=conc,
                  y=uptake,
                  colour=Treatment)

print(CO2.plot)

39. ggplot
Example 2
  Facets
CO2.plot <- CO2.plot + facet_grid(.~Type)  # right-hand side assumed; lost from the slide text
print(CO2.plot)

40. ggplot
Example 2
print(CO2.plot+geom_line())

41. ggplot
Example 2
  Specify groups
CO2.plot <- CO2.plot + geom_line(aes(group=Plant))  # right-hand side assumed; lost from the slide text
print(CO2.plot)
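Facets and groups can be combined in one runnable sketch on the built-in CO2 data. Faceting by `Type` and grouping by `Plant` are assumptions, picked because they are the remaining categorical columns of that data set.

```r
library(ggplot2)

CO2.plot <- ggplot(CO2, aes(x = conc, y = uptake, colour = Treatment)) +
  geom_point() +
  facet_grid(. ~ Type) +            # one panel per Type (assumed facet variable)
  geom_line(aes(group = Plant))     # one line per plant (assumed grouping)

print(CO2.plot)
```

Without the `group` aesthetic, `geom_line()` connects all points that share a colour within a panel, which produces a zigzag across plants.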

42. ggplot
Example 2
  Line with specified statistic
CO2.plot.mean <- CO2.plot +  # left operand assumed; lost from the slide text
  geom_line(stat="summary", fun.y="mean",
            size=I(3), alpha=I(0.3))
print(CO2.plot.mean)
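In current ggplot2 the `fun.y` argument has been superseded by `fun`, and line thickness is set with `linewidth`. The sketch below assumes ggplot2 3.4 or later and draws the mean uptake per concentration for each treatment.

```r
library(ggplot2)

CO2.plot.mean <- ggplot(CO2, aes(x = conc, y = uptake, colour = Treatment)) +
  geom_point() +
  # mean uptake at each concentration, drawn as a thick translucent line
  stat_summary(fun = mean, geom = "line", linewidth = 3, alpha = 0.3)

print(CO2.plot.mean)
```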

43. docs.ggplot2.org
ggplot
Time to show off!
prettiest plot you ever
  base
  use data(simesants) ->
  to produce :

44. You
  What was most interesting/useful?
  What do you still need to
  use reshape, plyr, ggplot?
  to have fun using R?

45. Acknowledgements
  Reshape, plyr and ggplot2 are all brought to you on GitHub by:
Wickham, H. (2011). "The split-apply-combine strategy for data analysis." Journal of Statistical Software.
Wickham, H. (2010). "A layered grammar of graphics." Journal of Computational and Graphical Statistics 19(1): 3-28.

46. Superuser stuff
  ggplot themes
  plyr
  multicore
  progress bar
  reshape, plyr and ggplot all together
  great exploratory plots
  upcoming dplyr
  more of the Hadley ecosystem
Super user approved
plyr

47. ggplot
ggplot themes
  theme_set(theme())
  or plot+theme()
  themes
  theme_bw()
  theme_grey()
  edit themes
  mytheme <- theme(plot.title = element_text(colour = "red"))
  p + mytheme
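A sketch of the three ways of using themes named above, with a small plot `p` built here so the block runs on its own.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  labs(title = "Sepal dimensions")

# A partial theme: only the plot title is changed
mytheme <- theme(plot.title = element_text(colour = "red"))

print(p + theme_bw())             # a complete built-in theme for one plot
print(p + theme_bw() + mytheme)   # built-in theme, then the edit on top

# theme_set() changes the default for later plots and returns the old theme
old.theme <- theme_set(theme_bw())
print(p + mytheme)
theme_set(old.theme)              # restore the previous default
```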

48. multicore plyr
#install.packages("doMC")  # parallel ships with R and needs no installation
library(parallel)
library(doMC)

registerDoMC(2) # 2 cores

iris.slopes <- ddply(.data=iris,  # call head assumed; lost from the slide text
                     .variables="Species",
                     length.on.width.slope,
                     .parallel=T)
Super user approved
plyr
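A guarded sketch of the same idea that falls back to sequential execution when doMC is not installed. The helper `sepal.slope` is hypothetical and stands in for the function from the earlier slide.

```r
library(plyr)

# Hypothetical per-subset function: slope of Sepal.Length on Sepal.Width
sepal.slope <- function(subset.data) {
  fit <- lm(Sepal.Length ~ Sepal.Width, data = subset.data)
  data.frame(slope.sepal = unname(coef(fit)[2]))
}

# Use the doMC backend only if it is installed (it is not available on Windows)
use.parallel <- requireNamespace("doMC", quietly = TRUE)
if (use.parallel) doMC::registerDoMC(2)   # 2 cores

iris.slopes <- ddply(iris, "Species", sepal.slope, .parallel = use.parallel)
print(iris.slopes)
```

With only three subsets the parallel overhead outweighs any gain; the pattern pays off on the thousand-replicate jobs mentioned in the abstract.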

49. progress plyr
  "text" progress bar
  |=================================================| 100%
  "tk" on unix, linux and mac
  "win" on windows

iris.slopes <- ddply(.data=iris,  # call head assumed; lost from the slide text
                     .variables="Species",
                     length.on.width.slope,
                     .progress= "text")
Super user approved
plyr
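To see the text progress bar move, the per-subset function has to take a noticeable amount of time. The sketch below fakes that with a short pause; `slow.mean` is a made-up helper.

```r
library(plyr)

# A deliberately slow, hypothetical per-subset function so the bar is visible
slow.mean <- function(subset.data) {
  Sys.sleep(0.2)
  data.frame(mean.sepal.length = mean(subset.data$Sepal.Length))
}

# .progress = "text" draws the bar in the console as each subset finishes
iris.means <- ddply(iris, "Species", slow.mean, .progress = "text")
print(iris.means)
```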

50. reshape plyr plot
Super user approved
Warning: d_ply is not
parallel compatible
[Figure: three pages of rectangle plots with one facet per id (1 to 50, 51 to 100, 101 to 150), Width on the x axis and Length on the y axis, both from 0 to 10; fill legend "part" (Sepal, Petal); each page is titled "virginica" in the extracted text]
plyr

51. reshape plyr plot
Super user approved
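Warning: unlist(strsplit(...))[1] is not vectorized
Prepare data using reshape

molten.iris$row.names <- row.names(molten.iris)  # right-hand side assumed; lost from the slide text
molten.iris <- ddply(.data=molten.iris,          # call head assumed; lost from the slide text
                     .variables="row.names",
                     part=unlist(strsplit(x=as.character(measure), split="\\."))[1],
                     dimension=unlist(strsplit(x=as.character(measure), split="\\."))[2],
                     transform)

cast.iris <- cast(molten.iris,  # call head assumed; lost from the slide text
                  formula=Species + id + part ~ dimension)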
plyr
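An alternative sketch that avoids the row-by-row ddply call: `strsplit()` accepts a whole character vector and returns a list, so the two halves of each name can be picked out with `sapply()`. This uses reshape2 (`dcast`) rather than reshape, which is an assumption, and `iris.id` and `pieces` are names introduced here.

```r
library(reshape2)

iris.id <- iris
iris.id$id <- row.names(iris.id)
molten.iris <- melt(iris.id, id.vars = c("Species", "id"),
                    variable.name = "measure")

# strsplit() returns one character vector per input string, e.g.
# "Sepal.Length" becomes c("Sepal", "Length")
pieces <- strsplit(as.character(molten.iris$measure), split = ".", fixed = TRUE)

# Pick the first and second element of every piece in one pass
molten.iris$part <- sapply(pieces, `[`, 1)
molten.iris$dimension <- sapply(pieces, `[`, 2)

cast.iris <- dcast(molten.iris, Species + id + part ~ dimension,
                   value.var = "value")
head(cast.iris)
```

The result has one row per flower and part, with `Length` and `Width` as columns, which is the shape the rectangle plot on the next slide expects.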

52. plot plyr
Super user approved
Warning: ggplot is slow
pdf("iris sepal explore plot.pdf")

d_ply(.data=cast.iris,
      .variables="Species",
      function(data){
        print(qplot(data=data,
                    ymin=I(0),
                    ymax=Length,
                    xmin=I(0),
                    xmax=Width,
                    geom="rect",
                    xlim=c(-1, 10),
                    ylim=c(-1, 10),
                    facets=~id,
                    main=unique(data$Species),
                    alpha=I(0.3),
                    fill=part))})

graphics.off()
plyr

53. Super user approved
plyr
universal plyr:
coming soon
dplyr
data.table[,,]

54.   devtools: create packages, install development versions…
  stringr: easier manipulations of strings