Slide 1

Druid + R: aggregate all your data

Slide 2

agenda
- An Overview of Druid
- RDruid
- Lab
- Conclusions

Slide 3

motivation
- visualize big data
- existing data engines did not meet our needs

Slide 4

motivation
- relational databases: scans were too slow!
- NoSQL: computationally intractable; pre-computations took too long!
- nothing existed that could solve our problems (or it was cost-prohibitive)

Slide 5

enter Druid
- real-time, distributed, column-oriented analytical data store
- scales horizontally
- open source

Slide 6

how is Druid different?
- highly optimized: fast scans & aggregations
- real-time data ingestion: explore events within milliseconds
- no pre-computation: arbitrarily slice & dice data
- highly available

Slide 7

using Druid
- we will explore Druid architecture in future meetups
- let's learn to use Druid!

Slide 8

RDruid: slicing & dicing on steroids

Slide 9

what are we addressing?
slicing and dicing data in R is fun… until you run out of memory

Slide 10

solution
- fire up a 64 GB EC2 machine and hope it works
- or let Druid do the work for you

Slide 11

how we use it
- ad-hoc reporting
- analyzing client data
- internal metrics
- prototyping

Slide 12

metrics

Slide 13

let’s try it
code: bit.ly/YtJ1Xj

Slide 14

setup
launch your favorite R environment, then install and load the Druid R package:

  install.packages("devtools")
  install.packages("ggplot2")
  library(devtools)
  install_github("RDruid", "metamx")
  library(RDruid)
  library(ggplot2)
  # later examples use ymd() and interval() from lubridate;
  # if RDruid does not attach it for you, add: library(lubridate)

(druid-meetup.R)

Slide 15

concepts
- Druid always computes aggregates
- events are based in time: Druid understands time bucketing
- dimensions: along which to slice & dice
- metrics: to aggregate

Slide 16

concepts
think aggregates and GROUP BY in SQL:

  SELECT
    hour(timestamp),   -- time
    page, language,    -- dimensions
    sum(count)         -- metrics
  GROUP BY hour(timestamp), page, language
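
For comparison, roughly the same query expressed with RDruid; a sketch that borrows the data source and field names from the examples on the following slides, not code from the deck itself:

  druid.query.groupBy(
      url          = druid,
      dataSource   = "wikipedia_editstream",
      intervals    = interval(ymd("2013-01-01"), ymd("2013-04-01")),
      granularity  = "hour",                    # time: hour(timestamp)
      dimensions   = list("page", "language"),  # dimensions
      aggregations = sum(metric("count"))       # metrics: sum(count)
  )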

Slide 17

data sources
connect to our cluster:

  druid <- druid.url("druid-meetup.mmx.io")

Wikipedia:

  druid.query.dimensions(url = druid, dataSource = "wikipedia_editstream")
  druid.query.metrics(url = druid, dataSource = "wikipedia_editstream")

Twitter:

  dataSource = "twitterstream"

(x0-sources.R)

Slide 18

timeseries
Wikipedia page edits since January, by hour:

  edits <- druid.query.timeseries(
      url          = druid,
      dataSource   = "wikipedia_editstream",
      intervals    = interval(ymd("2013-01-01"), ymd("2013-04-01")),
      aggregations = sum(metric("count")),
      granularity  = "hour"
  )
  qplot(data = edits, x = timestamp, y = count, geom = "line")

(x1-timeseries.R)

Slide 19

filters
what if I'm only interested in articles in English and French?

  enfr <- druid.query.timeseries(
      [...]
      granularity = "hour",
      filter = dimension("namespace") == "article" &
               ( dimension("language") == "en" |
                 dimension("language") == "fr" )
  )

(x2-filters.R)
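
The [...] above elides the connection and interval arguments shown on the timeseries slide; assembled from x1-timeseries.R and x2-filters.R, a complete call might look like this (a sketch, not code from the deck):

  enfr <- druid.query.timeseries(
      url          = druid,
      dataSource   = "wikipedia_editstream",
      intervals    = interval(ymd("2013-01-01"), ymd("2013-04-01")),
      aggregations = sum(metric("count")),
      granularity  = "hour",
      filter       = dimension("namespace") == "article" &
                     ( dimension("language") == "en" |
                       dimension("language") == "fr" )
  )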

Slide 20

group by
let's break it out by language:

  enfr <- druid.query.groupBy(
      [...]
      filter = dimension("namespace") == "article" &
               ( dimension("language") == "en" |
                 dimension("language") == "fr" ),
      dimensions = list("language")
  )
  qplot(data = enfr, x = timestamp, y = count, geom = "line", color = language)

(x3-groupby.R)

Slide 21

granularity
arbitrary time slices:

  granularity = granularity(
      "PT6H",
      timeZone = "America/Los_Angeles"
  )

try out a few more: P1D · P1W · P1M

(x4-timeslices.R)
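
As a usage sketch, here is the edits query from x1-timeseries.R re-bucketed by day; only the P1D period string comes from this slide, the rest is reused:

  daily <- druid.query.timeseries(
      url          = druid,
      dataSource   = "wikipedia_editstream",
      intervals    = interval(ymd("2013-01-01"), ymd("2013-04-01")),
      aggregations = sum(metric("count")),
      granularity  = granularity("P1D", timeZone = "America/Los_Angeles")  # daily buckets, LA time
  )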

Slide 22

aggregations
sum, min, max:

  aggregations = list(
      count = sum(metric("count")),
      total = sum(metric("added"))
  )

     timestamp       total  count
  1  2013-01-01  127232693 346895
  2  2013-01-02  130657602 403504
  3  2013-01-03  134643672 387462

(x5-aggs.R)
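
The slide lists min and max alongside sum, but only sum appears in the code; assuming they follow the same metric() pattern, they would look like this (an unverified sketch):

  aggregations = list(
      total    = sum(metric("added")),   # from the deck
      biggest  = max(metric("added")),   # assumed: max aggregator, same shape as sum
      smallest = min(metric("added"))    # assumed: min aggregator
  )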

Slide 23

math
you can do math too: + - * / and constants

  aggregations = list(
      count   = sum(metric("count")),
      added   = sum(metric("added")),
      deleted = sum(metric("deleted"))
  ),
  postAggregations = list(
      average = field("added") / field("count"),
      pct     = field("deleted") / field("added") * -100
  )

(x6-postaggs.R)

Slide 24

more advanced
all pages edited by users matching the regex '^Bob.*':

  druid.query.groupBy(
      [...]
      intervals   = interval(ymd("2013-03-01"), ymd("2013-04-01")),
      granularity = "all",   # a single time bucket
      filter      = dimension("user") %~% "^Bob.*",
      dimensions  = list("user", "page")
  )

(x7-advanced.R)

Slide 25

academy awards stats

  awards <- druid.query.groupBy(
      url          = druid,
      dataSource   = "twitterstream",
      intervals    = interval(ymd("2013-02-24"), ymd("2013-02-28")),
      aggregations = list(tweets = sum(metric("count"))),
      granularity  = granularity("PT1H"),
      filter       = dimension("first_hashtag") %~% "academyawards" |
                     dimension("first_hashtag") %~% "oscars",
      dimensions   = list("first_hashtag")
  )
  awards <- subset(awards, tweets > 10)
  qplot(data = awards, x = timestamp, y = tweets,
        color = first_hashtag, geom = "line")

(x8-awards.R)

Slide 26

academy awards stats (resulting chart; see x8-awards.R)

Slide 27

roll your own
run your own Druid cluster:
github.com/metamx/druid/wiki/Druid-Personal-Demo-Cluster

Slide 28

contribute
fork us on GitHub:
- Druid: github.com/metamx/druid
- RDruid: github.com/metamx/RDruid

Slide 29

thank you