July 04, 2012

# The future of data analysis

## Transcript

1. ### Hadley Wickham

Assistant Professor / Dobelman Family Junior Chair, Department of Statistics / Rice University. The future of data analysis. July 2012. @hadleywickham. Monday, July 9, 12
2. ### Hadley Wickham

Assistant Professor / Dobelman Family Junior Chair, Department of Statistics / Rice University. The near future of data analysis. July 2012. @hadleywickham.
3. ### Hadley Wickham

Data Scientist in Residence, Metamarkets. The near future of data analysis. July 2012. @hadleywickham.

6. ### Data analysis is the process by which data becomes understanding, knowledge and insight
7. ### Data analysis is the process by which data becomes understanding, knowledge and insight

13. ### Just text

        # Load data and create smaller subsets
        tb <- read.csv("tb.csv")
        tb2008 <- subset(tb, year == 2008)

        # Choropleth map -------------------------------------------------------------
        borders <- read.csv("world-borders.csv")
        choro <- merge(tb2008, borders, by = "iso2")
        choro <- choro[order(choro$order), ]
        qplot(long, lat, data = choro, fill = cut_number(rate, 5), geom = "polygon",
          group = group) +
          scale_fill_brewer("Rate", pal = "Blues")

        # Bubble maps ----------------------------------------------------------------
        centres <- read.csv("world-centres.csv")
        bubble <- merge(centres, tb2008, by = "iso2")
        world_coord <- coord_map(xlim = c(-180, 180), ylim = c(-50, 70))

        # This is basically what a choropleth is showing us
        qplot(long, lat, data = bubble, size = area, colour = rate) +
          scale_area(to = c(2, 25), legend = FALSE) +
          world_coord

        # More traditional options
        qplot(long, lat, data = bubble, size = rate) + world_coord
        qplot(long, lat, data = bubble, size = log10(pop), colour = rate) +
          world_coord

        # Even better if we add world boundaries
        ggplot(bubble, aes(long, lat)) +
          geom_polygon(data = borders, aes(group = group)) +
          geom_point(aes(colour = rate)) +
          coord_map()
        ggsave("world-4.png", width = 8, height = 6, dpi = 128)

        # Works better if we tweak aesthetics
        ggplot(bubble, aes(long, lat)) +
          geom_polygon(data = borders, aes(group = group), colour = "grey70", fill = NA) +

17. ### R + js, python (data scientists) + sql, regex, xpath (data languages)
18. ### R + js, python (data scientists) + C/C++, fortran, scala (tool builders) + sql, regex, xpath (data languages)
19. ### The future is already here – it’s just not evenly distributed.

William Gibson
20. ### More of this in the future

• Human readable text formats
• Programming data analysis with open source software
• Git and GitHub (for code and data)
• Virtual machines (+ EC2)
• Open web APIs (for paid services)

27. ### Word sucks. LaTeX sucks. HTML sucks (to write in). Markdown rules.

32. ### Still to come

• View source
• Capture and recreate dependencies
• Download data
• Build virtual machine
• GitHub integration: forking & pull requests

34. ### Still to come

• Synchronisation between presenter and audience
• “Run this code” and environment synching

36. ### ggplot2: + concise, + data tools, - static

d3: + web, + flexible, - verbose
37. ### ggplot2: + concise, + data tools, - static

d3: + web, + flexible, - verbose

???
38. ### r2d3

• R DSL builds JSON (trivial serialisation of ggplot2 call)
• Rendered in browser with js + d3
• Websockets allow callbacks from the browser to the computation engine (and vice versa)
• Declare interaction with functional reactive programming
• (If front-end API is right, should be possible to support multiple language backends)

Supported by Metamarkets
39. ### Hybrid computation

• In browser (js)
• Local compute (R, python, ...)
• Distributed compute

The future is heterogeneous

41. ### Reproducible research / Deployment = data analysis + software development

Better tools to capture all dependencies. Easy way of instantiating a VM, either locally or on the cloud. Better debugging tools.
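
One small piece of "capturing all dependencies" can be sketched in base R: recording the R version and the versions of the packages an analysis loads, in a human-readable text file. The `capture_deps` helper and the choice of packages are hypothetical, and this covers only R-level dependencies, not system libraries or data.

```r
# Hypothetical helper: record the R version and the installed versions of
# the named packages as a data frame.
capture_deps <- function(pkgs) {
  versions <- vapply(
    pkgs,
    function(p) as.character(utils::packageVersion(p)),
    character(1)
  )
  data.frame(
    package = c("R", pkgs),
    version = c(as.character(getRversion()), unname(versions)),
    stringsAsFactors = FALSE
  )
}

# "stats" and "utils" ship with R, so this runs on any installation.
deps <- capture_deps(c("stats", "utils"))

# Write a plain CSV that could be committed next to the analysis code.
# A temporary file is used here; a real project would pick a fixed path.
deps_file <- tempfile(fileext = ".csv")
write.csv(deps, deps_file, row.names = FALSE)
print(deps)
```

Recreating the environment from such a file (installing exactly those versions, or building a virtual machine around them) is the harder half of the problem and is not attempted here.
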
42. ### Data analysis

Hard to track progress and replay different branches of the analysis and ensure you've explored the space fully. Hard to swap out data/pieces of the analysis. How can we do better?

44. ### Conclusion

• Important to program data analysis (for frequent users)
• We need better tools for sharing (incl. teaching), reproducible research, visualisation and introspection.
• Most data analysis challenges are not purely statistical.