
Druid + R

Metamarkets

April 03, 2013
Transcript

  1. motivation
     - relational databases: scans were too slow!
     - NoSQL: computationally intractable, pre-computations took too long!
     - nothing existed that could solve our problems (or was cost prohibitive)
  2. how is Druid different
     - highly optimized: fast scans & aggregations
     - real-time data ingestion: explore events within milliseconds
     - no pre-computation: arbitrarily slice & dice data
     - highly available
  3. what are we addressing?
     slicing and dicing data in R is fun…
     …until you run out of memory
  4. solution
     fire up a 64G EC2 machine and hope it works,
     or let Druid do the work for you
  5. setup
     launch your favorite R environment, then install and load the RDruid package:

       install.packages("devtools")
       install.packages("ggplot2")
       library(devtools)
       install_github("RDruid", "metamx")
       library(RDruid)
       library(ggplot2)

     (druid-meetup.R)
  6. concepts
     - Druid always computes aggregates
     - events are based in time; Druid understands time bucketing
     - dimensions: along which to slice & dice
     - metrics: to aggregate
  7. concepts
     think aggregates and GROUP BY in SQL:

       SELECT hour(timestamp),   -- time
              page, language,    -- dimensions
              sum(count)         -- metrics
       GROUP BY hour(timestamp), page, language
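The SQL above maps onto a single RDruid call. A sketch only, assuming the `druid` connection and the "wikipedia_editstream" data source introduced on the following slides; the interval here is just for illustration.

```r
# GROUP BY hour(timestamp), page, language with sum(count),
# expressed as an RDruid groupBy query (illustrative interval)
sql_like <- druid.query.groupBy(
  url          = druid,
  dataSource   = "wikipedia_editstream",
  intervals    = interval(ymd("2013-01-01"), ymd("2013-02-01")),
  granularity  = "hour",                              # hour(timestamp)
  dimensions   = list("page", "language"),            # dimensions
  aggregations = list(count = sum(metric("count")))   # sum(count)
)
```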
  8. data sources
     connect to our cluster:

       druid <- druid.url("druid-meetup.mmx.io")

     Wikipedia:

       druid.query.dimensions(url = druid, dataSource = "wikipedia_editstream")
       druid.query.metrics(url = druid, dataSource = "wikipedia_editstream")

     Twitter:

       dataSource = "twitterstream"

     (x0-sources.R)
  9. timeseries
     Wikipedia page edits since January, by hour:

       edits <- druid.query.timeseries(
         url          = druid,
         dataSource   = "wikipedia_editstream",
         intervals    = interval(ymd("2013-01-01"), ymd("2013-04-01")),
         aggregations = sum(metric("count")),
         granularity  = "hour"
       )
       qplot(data = edits, x = timestamp, y = count, geom = "line")

     (x1-timeseries.R)
  10. filters
      what if I'm only interested in articles in English and French?

        enfr <- druid.query.timeseries(
          [...]
          granularity = "hour",
          filter = dimension("namespace") == "article" &
                   ( dimension("language") == "en" | dimension("language") == "fr" )
        )

      (x2-filters.R)
  11. group by
      let's break it out by language:

        enfr <- druid.query.groupBy(
          [...]
          filter = dimension("namespace") == "article" &
                   ( dimension("language") == "en" | dimension("language") == "fr" ),
          dimensions = list("language")
        )
        qplot(data = enfr, x = timestamp, y = count, geom = "line", color = language)

      (x3-groupby.R)
  12. granularity
      arbitrary time slices:

        granularity = granularity(
          "PT6H",
          timeZone = "America/Los_Angeles"
        )

      try out a few more: P1D · P1W · P1M

      (x4-timeslices.R)
  13. aggregations
      sum, min, max:

        aggregations = list(
          count = sum(metric("count")),
          total = sum(metric("added"))
        )

          timestamp      total  count
        1 2013-01-01 127232693 346895
        2 2013-01-02 130657602 403504
        3 2013-01-03 134643672 387462

      (x5-aggs.R)
  14. math
      you can do math too: + - * / and constants

        aggregations = list(
          count   = sum(metric("count")),
          added   = sum(metric("added")),
          deleted = sum(metric("deleted"))
        ),
        postAggregations = list(
          average = field("added") / field("count"),
          pct     = field("deleted") / field("added") * -100
        )

      (x6-postaggs.R)
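Putting the post-aggregations into a full query: a sketch that combines the earlier timeseries call with the aggregations above to chart average characters added per edit, per hour. It assumes the same `druid` connection and "wikipedia_editstream" data source as the previous slides.

```r
# Average characters added per edit, by hour:
# postAggregations divide one aggregate by another at query time
avg_size <- druid.query.timeseries(
  url          = druid,
  dataSource   = "wikipedia_editstream",
  intervals    = interval(ymd("2013-01-01"), ymd("2013-04-01")),
  granularity  = "hour",
  aggregations = list(
    count = sum(metric("count")),
    added = sum(metric("added"))
  ),
  postAggregations = list(
    average = field("added") / field("count")
  )
)
qplot(data = avg_size, x = timestamp, y = average, geom = "line")
```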
  15. more advanced
      all pages edited by users matching regex '^Bob.*':

        druid.query.groupBy(
          [...]
          intervals   = interval(ymd("2013-03-01"), ymd("2013-04-01")),
          granularity = "all",   # single time bucket
          filter      = dimension("user") %~% "^Bob.*",
          dimensions  = list("user", "page")
        )

      (x7-advanced.R)
  16. academy awards stats

        awards <- druid.query.groupBy(
          url          = druid,
          dataSource   = "twitterstream",
          intervals    = interval(ymd("2013-02-24"), ymd("2013-02-28")),
          aggregations = list(tweets = sum(metric("count"))),
          granularity  = granularity("PT1H"),
          filter       = dimension("first_hashtag") %~% "academyawards" |
                         dimension("first_hashtag") %~% "oscars",
          dimensions   = list("first_hashtag")
        )
        awards <- subset(awards, tweets > 10)
        qplot(data = awards, x = timestamp, y = tweets,
              color = first_hashtag, geom = "line")

      (x8-awards.R)