Slide 1

Pipelines for data analysis in R
Hadley Wickham (@hadleywickham)
Chief Scientist, RStudio
September 2015

Slide 2

Data analysis is the process by which data becomes understanding, knowledge and insight

Slide 3

Data analysis is the process by which data becomes understanding, knowledge and insight

Slide 4

[Workflow diagram: Import → Tidy → Transform → Visualise → Model]
• Tidy: consistent way of storing data
• Transform: create new variables & new summaries
• Visualise: surprises, but doesn't scale
• Model: scales, but doesn't (fundamentally) surprise

Slide 5

[Workflow diagram: Import (readr, readxl, haven, DBI, httr) → Tidy (tidyr) → Transform (dplyr) → Visualise (ggplot2, ggvis) → Model (broom)]

Slide 6

Pipelines

Slide 7

[Diagram: Think it (cognitive) → Describe it (precisely) → Do it (computational)]

Slide 8

Cognition time ≫ Computation time
http://www.flickr.com/photos/mutsmuts/4695658106

Slide 9

magrittr's %>%
Inspirations: Unix pipes, F#, Haskell, Clojure, method chaining

Slide 10

foo_foo <- little_bunny()

bop_on(
  scoop_up(
    hop_through(foo_foo, forest),
    field_mouse
  ),
  head
)

# vs

foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)

Slide 11

x %>% f(y)            # f(x, y)
x %>% f(z, .)         # f(z, x)
x %>% f(y) %>% g(z)   # g(f(x, y), z)

# Turns function composition (hard to read)
# into a sequence (easy to read)

Slide 12

# Any function can use it. It only needs a simple
# property: the type of the first argument
# needs to be the same as the type of the result.

# tidyr:   pipelines for messy -> tidy data
# dplyr:   pipelines for data manipulation
# ggvis:   pipelines for visualisations
# rvest:   pipelines for html
# purrr:   pipelines for lists
# xml2:    pipelines for xml
# stringr: pipelines for strings
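A minimal sketch of that property in action with stringr (the functions are real stringr exports; the input string is invented for illustration):

library(magrittr)
library(stringr)

# Each function takes a character vector as its first argument
# and returns a character vector, so the steps compose:
"  Pipelines FOR data ANALYSIS  " %>%
  str_trim() %>%
  str_to_lower() %>%
  str_replace_all(" +", " ")
#> [1] "pipelines for data analysis"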

Slide 13

Tidy

Slide 14

[Workflow diagram: Import (readr, readxl, haven, DBI, httr) → Tidy (tidyr) → Transform (dplyr) → Visualise (ggplot2, ggvis) → Model (broom)]

Slide 15

Storage        Meaning
Table / File   Data set
Rows           Observations
Columns        Variables

Tidy data = data that makes data analysis easy
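A tiny invented example of the contrast (gather() is the tidyr verb shown on the next slides):

library(dplyr)
library(tidyr)

# Messy: the year variable is spread across column names
messy <- data.frame(
  country = c("A", "B"),
  y1999   = c(100, 200),
  y2000   = c(110, 220)
)

# Tidy: one row per observation, one column per variable
messy %>% gather(year, cases, y1999, y2000)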

Slide 16

Source: local data frame [5,769 x 22]

    iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514
   (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
1     AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
2     AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
3     AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
4     AD  1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
5     AD  1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
6     AD  1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
7     AD  1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA
8     AD  1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA
9     AD  1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
10    AD  1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
11    AD  2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA
12    AD  2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA
13    AD  2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
14    AD  2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA
15    AD  2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
16    AD  2005     0     0     0     0     1     1     0     0     0     0     0     0
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
Variables not shown: f014 (int), f1524 (int), f2534 (int), f3544 (int),
  f4554 (int), f5564 (int), f65 (int), fu (int)

What are the variables in this dataset?
(Hint: f = female, u = unknown, 1524 = 15-24)

Slide 17

# To convert this messy data into tidy data
# we need two verbs. First we need to gather
# together all the columns that aren't variables

tb2 <- tb %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)
tb2

Slide 18

# Then separate the demographic variable into
# sex and age

tb3 <- tb2 %>%
  separate(demo, c("sex", "age"), 1)
tb3

# Many tidyr verbs come in pairs:
#   spread vs. gather
#   extract/separate vs. unite
#   nest vs. unnest
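A quick sketch of two of those pairs acting as inverses (toy data, invented for illustration):

library(dplyr)
library(tidyr)

df <- data.frame(sex = c("m", "f"), age = c("014", "1524"),
                 n = c(10, 12), stringsAsFactors = FALSE)

df %>% unite(demo, sex, age, sep = "")  # inverse of separate()
df %>% spread(sex, n)                   # inverse of gather()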

Slide 19

Google for “tidyr” & “tidy data”

Slide 20

Transform

Slide 21

[Workflow diagram: Import (readr, readxl, haven, DBI, httr) → Tidy (tidyr) → Transform (dplyr) → Visualise (ggplot2, ggvis) → Model (broom)]

Slide 22

[Diagram: Think it (cognitive) → Describe it (precisely) → Do it (computational)]

Slide 23

One table verbs
• select: subset variables by name
• filter: subset observations by value
• mutate: add new variables
• summarise: reduce to a single obs
• arrange: re-order the observations
+ group_by()
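A minimal sketch chaining all five verbs (mtcars stands in for the demo data; the kpl conversion is just an example variable):

library(dplyr)

mtcars %>%
  select(mpg, cyl, wt) %>%              # subset variables by name
  filter(wt < 4) %>%                    # subset observations by value
  mutate(kpl = mpg * 0.425) %>%         # add a new variable
  group_by(cyl) %>%                     # + group_by
  summarise(mean_kpl = mean(kpl)) %>%   # reduce to one row per group
  arrange(desc(mean_kpl))               # re-order the result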

Slide 24

Demo

Slide 25

Mutating joins: inner_join(), left_join(), right_join(), full_join()
Filtering joins: semi_join(), anti_join()
Set operations: intersect(), setdiff(), union()
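A toy illustration of the difference (both tables invented): mutating joins add columns from the other table, filtering joins only keep or drop rows:

library(dplyr)

band  <- data.frame(name = c("Mick", "John", "Paul"),
                    band = c("Stones", "Beatles", "Beatles"),
                    stringsAsFactors = FALSE)
plays <- data.frame(name  = c("John", "Paul", "Keith"),
                    plays = c("guitar", "bass", "guitar"),
                    stringsAsFactors = FALSE)

left_join(band, plays, by = "name")  # mutating: adds the plays column
semi_join(band, plays, by = "name")  # filtering: band rows with a match
anti_join(band, plays, by = "name")  # filtering: band rows without a match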

Slide 26

dplyr sources
• Local data frame (C++)
• Local data table
• Local data cube (experimental)
• RDBMS: Postgres, MySQL, SQLite, Oracle, MS SQL, JDBC, Impala
• MonetDB, BigQuery
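A minimal sketch of the same verbs running against a database source (SQLite in memory; assumes the DBI and RSQLite packages plus dplyr's SQL backend are installed):

library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "mtcars")

tbl(con, "mtcars") %>%            # the verbs are translated to SQL
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()                       # pull the small result back into R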

Slide 27

Google for “dplyr”

Slide 28

Visualise

Slide 29

[Workflow diagram: Import (readr, readxl, haven, DBI, httr) → Tidy (tidyr) → Transform (dplyr) → Visualise (ggplot2, ggvis) → Model (broom)]

Slide 30

What is ggvis?
• A grammar of graphics (like ggplot2)
• Reactive (interactive & dynamic) (like shiny)
• A pipeline (a la dplyr)
• Of the web (drawn with vega)
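A minimal ggvis pipeline mirroring the package's basic usage (run in an interactive session to see the plot):

library(ggvis)

mtcars %>%
  ggvis(~wt, ~mpg) %>%   # a grammar of graphics, as a pipeline
  layer_points() %>%
  layer_smooths()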

Slide 31

Demo
4-ggvis.R
4-ggvis.Rmd

Slide 32

Google for “ggvis”

Slide 33

Model with broom, by David Robinson

Slide 34

[Workflow diagram: Import (readr, readxl, haven, DBI, httr) → Tidy (tidyr) → Transform (dplyr) → Visualise (ggplot2, ggvis) → Model (broom)]

Slide 35

[Plot: log(sales) vs. date, 1990-2015]
46 TX cities, ~25 years of data
What makes it hard to see the long-term trend?

Slide 36

# Models are useful as a tool for removing
# known patterns

tx <- tx %>%
  group_by(city) %>%
  mutate(
    resid = lm(
      log(sales) ~ factor(month),
      na.action = na.exclude
    ) %>%
      resid()
  )

Slide 37

[Plot: resid vs. date, 1990-2015]

Slide 38

# Models are also useful in their own right

models <- tx %>%
  group_by(city) %>%
  do(mod = lm(
    log(sales) ~ factor(month),
    data = .,
    na.action = na.exclude
  ))

Slide 39

Model summaries
• Model level: one row per model
• Coefficient level: one row per coefficient (per model)
• Observation level: one row per observation (per model)
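A minimal broom sketch of the three levels (a single lm on mtcars stands in for the per-city models):

library(broom)

mod <- lm(mpg ~ wt, data = mtcars)

glance(mod)   # model level: one row per model (R^2, AIC, ...)
tidy(mod)     # coefficient level: one row per coefficient
augment(mod)  # observation level: one row per observation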

Slide 40

Demo
5-broom.R

Slide 41

Google for “broom r”

Slide 42

Big data and R

Slide 43

• Big: can't fit in memory on one computer (>5 TB)
• Medium: fits in memory on a server (10 GB-5 TB)
• Small: fits in memory on a laptop (<10 GB). R is great at this!

Slide 44

R
• R provides an excellent environment for rapid interactive exploration of small data.
• There is no technical reason why it can't also work well with medium-sized data (but the work mostly hasn't been done).
• What about big data?

Slide 45

A big data problem usually:
1. Can be reduced to a small data problem with subsetting/sampling/summarising (90%)
2. Can be reduced to a very large number of small data problems (9%)
3. Is irreducibly big (1%)
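For case 1 the reduction is often a single pipeline; a hedged sketch, where big_df, region, and value are placeholders for an out-of-memory table and its columns:

library(dplyr)

# Summarise remotely so only the small result reaches memory
small <- big_df %>%
  group_by(region) %>%
  summarise(n = n(), avg = mean(value)) %>%
  collect()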

Slide 46

The right small data
• Rapid iteration is essential.
• dplyr supports this activity by avoiding the cognitive costs of switching between languages.

Slide 47

Lots of small problems
• Embarrassingly parallel (e.g. Hadoop)
• R wrappers like foreach, rhipe, rhadoop
• The challenge is matching the architecture of the computation to the data storage.
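A minimal sketch of the embarrassingly parallel pattern with foreach (doParallel as the backend; splitting mtcars by cyl stands in for the many small problems):

library(foreach)
library(doParallel)
registerDoParallel(cores = 2)

# One independent model fit per group, run in parallel
fits <- foreach(d = split(mtcars, mtcars$cyl)) %dopar%
  lm(mpg ~ wt, data = d)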

Slide 48

Irreducibly big
• Computation must be performed by a specialised system.
• Typically C/C++, Fortran, or Scala.
• R needs to be able to talk to those systems.

Slide 49

Future work

Slide 50

End game
Provide a fluent interface where you spend your mental energy on the specific data problem, not on the general process of data analysis. The best tools become invisible with time!
Still a lot of work to do, especially on the connection between modelling and visualisation.

Slide 51

[Workflow diagram: Import (readr, readxl, haven, DBI, httr) → Tidy (tidyr) → Transform (dplyr) → Visualise (ggplot2, ggvis) → Model (broom)]