Etienne
November 15, 2012

# Hadley Ecosystem: Reshape, Plyr, GGplot

We will give you a flyover of a few of the packages Hadley Wickham and his collaborators have created. Many of us now use these packages in every project we tackle, and they have become essential tools in the R enthusiast's toolbox. Each package gets a brief tutorial on its key features and how to use them, followed by a hands-on exercise. Tips and tricks for super users will also be provided.

-reshape: make your data play nice (Too many columns, no problem)

-plyr: split/apply/combine (extract the slope of a linear model for each of your thousand replicates)

-ggplot2: the grammar of graphics (start with a basic plot and intuitively add layers of complexity)


## Transcript

5. ### You

- R level?
- Have you plotted with base R?
- Have you used reshape? plyr? ggplot?

6. ### Outline

- reshape: make your data play nice (10 minutes hands-on)
- plyr: Split-Apply-Combine on steroids, to summarize or transform your data (15 minutes hands-on)
- ggplot: beautiful plots, one layer at a time (15 minutes hands-on)
- Power user goodies on demand

7. ### On demand during hands-on: superuser stuff

- ggplot themes
- plyr: multicore, progress bar
- reshape, plyr and ggplot all together: great exploratory plots
- upcoming dplyr
- more of the Hadley ecosystem

9. ### Required packages

- The obvious: plyr, reshape(2), ggplot2
- For a little more data to play with: vegan, vegetarian
- For pretty graphic tables: gridExtra
- For help: help(package = "package name")

11. ### Wide

- Each level of a factor gets a column
- Multiple measurements per row
- Typical of Excel, SPSS…
- Pros: plays nice with humans; no data repetition; "eyeballable"
- Cons: does not play nice with R

| ID variable | Level 1 | Level 2 |
|---|---|---|
| ID 1 | Measured value | Measured value |
| ID 2 | Measured value | Measured value |

12. ### Long

- Levels are expressed in a column
- One measured value per row
- E.g. really long: XML, JSON (tag:content pairs)
- Pros: plays nice with computers (API, databases, plyr, ggplot2…)
- Cons: does not play nice with humans; lots of copy-pasting, and forget eyeballing it!

| ID variable | Factor | Measured value |
|---|---|---|
| ID 1 | Level 1 | Measured value |
| ID 1 | Level 2 | Measured value |
| ID 2 | Level 1 | Measured value |
| ID 2 | Level 2 | Measured value |

13. ### Look at data

What format is…?

- data(simesants), then head(simesants) or str(simesants)
- data(iris)
- data(sipoo)
- your data???

Look at more data: data()

Why is your data long/wide?

14. ### Long and wide

Long:

| ID variable | Factor | Measured value |
|---|---|---|
| ID 1 | Level 1 | Measured value |
| ID 1 | Level 2 | Measured value |
| ID 2 | Level 1 | Measured value |
| ID 2 | Level 2 | Measured value |

Wide:

| ID variable | Level 1 | Level 2 |
|---|---|---|
| ID 1 | Measured value | Measured value |
| ID 2 | Measured value | Measured value |

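
The two layouts can be tried on a toy table. This is a minimal sketch, assuming the reshape package is installed; the data frame `wide` and its column names are made up for illustration and are not from the slides.

```r
library(reshape)

# Hypothetical toy data in wide format: one column per factor level.
wide <- data.frame(id = c("ID 1", "ID 2"),
                   level.1 = c(1.2, 3.4),
                   level.2 = c(5.6, 7.8))

# Go long: one measured value per row, levels stored in the "level" column.
long <- melt(wide, id.vars = "id", variable_name = "level")
long

# Go wide again: one column per level.
wide.again <- cast(long, id ~ level)
wide.again
```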
15. ### Make your data play nice

Switching from long to wide:

- library(reshape)
- melt()
- cast()

16. ### Melt: go long

General form:

    molten.data <- melt(data,
      id.vars = c("id.var.1", "id.var.2"),
      measure.vars = c("measure.var.1", "measure.var.2"),
      variable_name = "variable")

    head(iris)

Super user hint: produce beautiful tables with require(gridExtra) and grid.table().

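
The grid.table() hint can be tried on the first rows of iris. A small sketch, assuming gridExtra 2.0 or later is installed; it builds the table with tableGrob() and draws it, which is what grid.table() does in one step.

```r
library(grid)
library(gridExtra)

# Build a graphical table from the first rows of iris.
tbl <- tableGrob(head(iris))

# Draw it on a fresh page; grid.table(head(iris)) is the one-step shortcut.
grid.newpage()
grid.draw(tbl)
```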
17. ### Melt: go long

With iris:

    iris$id <- row.names(iris)

    molten.iris <- melt(iris,
      id.vars = c("Species", "id"),
      # measure.vars = c("measure.vars", "measure.vars"),
      variable_name = "measure")

    head(molten.iris)

18. ### Cast: go wide

General form:

    cast.data <- cast(molten.data,
      formula = id_var_1 + id_var_2 ~ measure_var_1 + measure_var_2)

In the formula, ... means all other variables.

Super user hint: skip plyr and summarize your data with an incomplete formula and cast(fun.aggregate = …).

19. ### Cast: go wide

With iris:

    cast.iris <- cast(molten.iris,
      formula = Species + id ~ ...)

    head(cast.iris)

Super user hint: skip plyr and summarize your data with an incomplete formula and cast(fun.aggregate = …).

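
The super user hint can be sketched as follows, assuming the reshape package is installed. The names `iris.long` and `species.means` are illustrative. Leaving the row id out of the formula means many values land in each cell, and fun.aggregate collapses them.

```r
library(reshape)

# Melt iris with Species as the only id variable.
iris.long <- melt(iris, id.vars = "Species", variable_name = "measure")

# "Incomplete" formula: no per-row id, so each Species x measure cell
# holds 50 values, which fun.aggregate reduces to their mean.
species.means <- cast(iris.long, Species ~ measure, fun.aggregate = mean)
species.means
```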
20. ### Your turn

- Try melt and cast
  - with baseball, produce: [target table shown on slide]
  - with iris, produce: [target table shown on slide]
- Discuss with your neighbor how you format/store your data

23. ### Split-Apply-Combine

- Equivalent to: SQL GROUP BY; Pivot Tables (Excel, SPSS, …)
- Split: define a subset of your data
- Apply: do anything to this subset (calculation, modeling, simulations, plotting)
- Combine: repeat this for all subsets and collect the results

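
The three steps can be spelled out in base R before handing them to plyr. This is only a sketch of the idea, not how plyr is implemented; the column name `mean.sepal.length` is made up for illustration.

```r
# Split: one data.frame per species.
pieces <- split(iris, iris$Species)

# Apply: compute something on each subset.
results <- lapply(pieces, function(subset.data) {
  data.frame(Species = subset.data$Species[1],
             mean.sepal.length = mean(subset.data$Sepal.Length))
})

# Combine: stack the per-subset results into one data.frame.
combined <- do.call(rbind, results)
combined
```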
24. ### Functions

Functions are named _ _ ply: the first letter is the input format, the second the output format (e.g. ddply).

- d = data.frame
- a = array
- l = list
- special: _ = discard (output), r = replicate (input)

Super user hint: check out help(package = plyr) for things like each, join, colwise…

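
A few input/output combinations side by side, as a sketch assuming plyr is installed. The object names (`dd`, `dl`, `ld`) and the toy list are illustrative only.

```r
library(plyr)

# d -> d: data.frame in, data.frame out.
dd <- ddply(iris, "Species", function(x) data.frame(n = nrow(x)))

# d -> l: data.frame in, list out (one element per species).
dl <- dlply(iris, "Species", function(x) nrow(x))

# l -> d: list in, data.frame out (list names end up in the .id column).
ld <- ldply(list(a = 1:3, b = 4:6), function(x) data.frame(total = sum(x)))
```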
25. ### How it works

Write a function that takes one subset of the data and returns a data.frame:

    my.function <- function(subset.data) {
      results <- do.something(subset.data)
      return(data.frame(results))
    }

my.function can produce as many rows as subset.data (transform) or fewer rows than subset.data (summarize).

Pass the function itself to ddply, which calls it on every subset:

    returned.results <- ddply(.data = data,
      .variables = c("variable1", "variable2"),
      .fun = my.function)

Super user hints:

- look under the hood, as plyr is written in R
- think you can do better? plyr is on GitHub

Warning: idiosyncrasies present.

26. ### Example 1

Calculate the mean of each measure for each species using the molten data set:

    molten.means <- ddply(.data = molten.iris,
      .variables = c("Species", "measure"),
      function(subset.data) data.frame(mean = mean(subset.data$value)))

Super user hint: note __ply's helper function rbind.fill(), very useful for merging many data.frames.

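
A quick look at what rbind.fill() does, assuming plyr is installed. The two small data frames are hypothetical; the point is that columns missing from one input are filled with NA instead of raising an error.

```r
library(plyr)

# Two data.frames that share only the "id" column.
a <- data.frame(id = 1:2, x = c(0.1, 0.2))
b <- data.frame(id = 3, y = "only in b")

# rbind() would fail on mismatched columns; rbind.fill() pads with NA.
merged <- rbind.fill(a, b)
merged
```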
27. ### Example 3

Slope of width on length:

    length.on.width.slope <- function(subset.data) {
      with(subset.data, {
        slope.sepal <- lm(Sepal.Width ~ Sepal.Length)$coefficients[2]
        slope.petal <- lm(Petal.Width ~ Petal.Length)$coefficients[2]
        return(data.frame(slope.sepal = slope.sepal, slope.petal = slope.petal))
      })
    }

    iris.slopes <- ddply(.data = iris,
      .variables = "Species",
      function(x) length.on.width.slope(x))

Super user hint: on big jobs, plyr can tell you where it's at (.progress = "text"); we can talk about that.

28. ### Your turn

- Try the mean calculation on the original iris
- Create different outputs: dlply, daply, d_ply (when would you use this?)
- Take in different inputs: ldply, rdply
- Change functions: sd, length, range = max() - min()
- Write your own function
  - to calculate many statistics
  - to do more complex stuff: calculate slope and intercept of Sepal.Width~Sepal.Length
  - to plot
- Apply to other data
  - melt and cast data
  - simesants, rats, iris, sipoo, weeds, your own data

Show your neighbor how you would use, or have used, plyr.
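
One possible take on the "many statistics" and "slope and intercept" items, as a sketch assuming plyr is installed. The output column names are made up for illustration, and other answers are just as valid.

```r
library(plyr)

sepal.stats <- ddply(iris, "Species", function(subset.data) {
  # Fit the linear model on this species only.
  fit <- lm(Sepal.Width ~ Sepal.Length, data = subset.data)
  data.frame(n = nrow(subset.data),
             sd.width = sd(subset.data$Sepal.Width),
             range.width = max(subset.data$Sepal.Width) -
               min(subset.data$Sepal.Width),
             intercept = coef(fit)[[1]],
             slope = coef(fit)[[2]])
})
sepal.stats
```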

30. ### Grammar of graphics (gg)

1. A graphic is made of (independent) elements, or layers (as opposed to a single encapsulating name):

- data
- aesthetics
- transformation
- geoms (geometric objects)
- axis (coordinate system)
- scales

Figure 1 (H. Wickham): graphics objects produced by (from left to right) geometric objects, scales and coordinate system, plot annotations.

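
The independent elements can be seen by building a plot explicitly, one piece per line. This is a sketch assuming ggplot2 is installed; the particular scale and palette are arbitrary choices, not from the slides.

```r
library(ggplot2)

p <- ggplot(data = iris,                              # data
            aes(x = Sepal.Length, y = Sepal.Width,    # aesthetics
                colour = Species)) +
  geom_point() +                                      # geom
  scale_colour_brewer(palette = "Set1") +             # scale
  coord_cartesian()                                   # coordinate system

print(p)
```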
31. ### Grammar of graphics (gg)

2. Editing an element produces a new graph: just change the coordinate system!

Figure 16 (Wickham, "A layered grammar of graphics"): bar chart (left) and equivalent Coxcomb plot (right) of clarity distribution. The Coxcomb plot is a bar chart in polar coordinates. Note that the categories abut in the Coxcomb, but are separated in the bar chart: this is an example of a graphical convention that differs in different coordinate systems.

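
A rough version of that bar chart and Coxcomb pair, assuming ggplot2 is installed and assuming the clarity data is the diamonds data set that ships with it. It is not meant to reproduce the figure exactly; here the bars are full width in both plots.

```r
library(ggplot2)

# Bar chart of the clarity distribution.
bar <- ggplot(diamonds, aes(x = clarity, fill = clarity)) +
  geom_bar(width = 1)

# Same plot object, one element changed: the coordinate system.
coxcomb <- bar + coord_polar()

print(bar)
print(coxcomb)
```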
32. ### ggplot: how it works

1. Create a simple plot object: plot.object <- qplot()
2. Add graphical layers/complexity: plot.object <- plot.object + layer()
   - options available on http://docs.ggplot2.org
   - repeat step 2 until satisfied
3. Print your object to screen (or to a graphical device): print(plot.object)

Super user request: send me your best ggplot (pdf) at [email protected] and you can show it off and discuss it.

33. ### ggplot Example 1

Most basic plot:

    basic.plot <- qplot(data = iris,
      x = Sepal.Length,
      y = Sepal.Width)

    print(basic.plot)

34. ### ggplot Example 1

Most basic plot (categorical):

    categorical.plot <- qplot(data = iris,
      x = Species,
      y = Sepal.Width)

    print(categorical.plot)

35. ### ggplot Example 1

Edited most basic plot:

    basic.plot <- qplot(data = iris,
      x = Sepal.Length,
      xlab = "Sepal Length (cm)",
      y = Sepal.Width,
      ylab = "Sepal Width (cm)",
      main = "Sepal dimensions")

    print(basic.plot)

36. ### ggplot Example 1

Add aesthetics:

    basic.plot <- qplot(data = iris,
      x = Sepal.Length,
      xlab = "Sepal Length (cm)",
      y = Sepal.Width,
      ylab = "Sepal Width (cm)",
      main = "Sepal dimensions",
      colour = Species,
      shape = Species,
      alpha = I(0.5))

    print(basic.plot)

37. ### ggplot Example 1

Add a geom (e.g. linear smooth):

    plot.with.linear.smooth <- basic.plot + geom_smooth(method = "lm", se = FALSE)
    print(plot.with.linear.smooth)

42. ### ggplot Example 2

Line with specified statistic:

    # fun.y is spelled fun in ggplot2 3.3.0 and later
    CO2.plot.mean <- CO2.plot +
      geom_line(stat = "summary", fun.y = "mean",
        size = I(3), alpha = I(0.3))

    print(CO2.plot.mean)

43. ### Your turn

Reference: docs.ggplot2.org

Time to show off! Show your neighbor the prettiest plot you ever made!

- base: use data(simesants) -> [target plot shown on slide]
- advanced: use http://chrisladroue.com/files/stratford.csv to produce: [target plot shown on slide]

See http://chrisladroue.com/2011/10/an-exercise-in-plyr-and-ggplot2-using-triathlon-results/

44. ### You

- What was most interesting/useful?
- What do you still need
  - to use reshape, plyr, ggplot?
  - to have fun using R?

45. ### Acknowledgements

Reshape, plyr and ggplot2 are all brought to you on GitHub by Hadley Wickham (had.co.nz).

- Wickham, H. (2011). "The split-apply-combine strategy for data analysis." Journal of Statis.
- Wickham, H. (2010). "A layered grammar of graphics." Journal of Computational and Graphical Statistics 19(1): 3-28.

46. ### Superuser stuff

- ggplot themes
- plyr: multicore, progress bar
- reshape, plyr and ggplot all together: great exploratory plots
- upcoming dplyr
- more of the Hadley ecosystem

47. ### ggplot themes

- Apply a theme: theme_set(theme()) or plot + theme()
- Themes: theme_bw(), theme_grey()

Edit themes:

    mytheme <- theme_grey() + theme(plot.title = element_text(colour = "red"))
    p + mytheme

48. ### multicore plyr

Run ddply on several cores:

    # parallel ships with R; only doMC needs installing
    # install.packages("doMC")
    library(parallel)
    library(doMC)

    registerDoMC(2) # 2 cores

    iris.slopes <- ddply(.data = iris,
      .variables = "Species",
      length.on.width.slope,
      .parallel = TRUE)

49. ### progress plyr

- "text" progress bar: |=================================================| 100%
- "tk" on unix, linux and mac
- "win" on windows

Example:

    iris.slopes <- ddply(.data = iris,
      .variables = "Species",
      length.on.width.slope,
      .progress = "text")

50. ### reshape, plyr, plot

Warning: d_ply is not parallel compatible.

51. ### reshape, plyr, plot: prepare data using reshape

Warning: strsplit returns a list (one element per string) rather than a vector, so the split is applied row by row here.

    molten.iris$row.names <- row.names(molten.iris)

    molten.iris <- ddply(.data = molten.iris,
      .variables = "row.names",
      .fun = transform,
      part = unlist(strsplit(x = as.character(measure), split = "\\."))[1],
      dimension = unlist(strsplit(x = as.character(measure), split = "\\."))[2])

    cast.iris <- cast(data = molten.iris,
      formula = Species + id + part ~ dimension)

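
strsplit accepts a whole character vector and returns a list with one element per string, so the row-by-row call is one option among several. A base R sketch of an alternative, assuming every measure name contains exactly one dot; the vector `measure` here is a stand-in for the column of the molten data.

```r
# Stand-in for as.character(molten.iris$measure).
measure <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# One list element per string, each holding the two pieces.
pieces <- strsplit(measure, split = ".", fixed = TRUE)

# Stack the pieces into a two-column character matrix.
split.matrix <- do.call(rbind, pieces)

part <- split.matrix[, 1]
dimension <- split.matrix[, 2]
```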
52. ### reshape, plyr, plot: plot with plyr

Warning: ggplot is slow.

    pdf("iris sepal explore plot.pdf")

    d_ply(.data = cast.iris,
      .variables = "Species",
      function(data) {
        print(qplot(data = data,
          ymin = I(0), ymax = Length,
          xmin = I(0), xmax = Width,
          geom = "rect",
          xlim = c(-1, 10), ylim = c(-1, 10),
          facets = ~id,
          main = unique(data$Species),
          alpha = I(0.3),
          fill = part))
      })

    graphics.off()

54. ### More from the Hadley Ecosystem

- devtools: create packages, install development versions…
- stringr: easier manipulations of strings
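
As a small taste of stringr, here is a sketch assuming the package is installed: the same kind of name splitting used earlier on the iris measure names, done in one call. The vector `measure` is illustrative.

```r
library(stringr)

measure <- c("Sepal.Length", "Petal.Width")

# Split each string on a literal dot into exactly two pieces;
# the result is a character matrix with one row per input string.
parts <- str_split_fixed(measure, fixed("."), n = 2)
parts
```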