
Druid + R

Metamarkets

April 03, 2013
Transcript

  1. motivation
     - relational databases: scans were too slow!
     - NoSQL: computationally intractable, pre-computations took too long!
     - nothing existed that could solve our problems (or was cost prohibitive)
  2. how is Druid different
     - highly optimized: fast scans & aggregations
     - real-time data ingestion: explore events within milliseconds
     - no pre-computation: arbitrarily slice & dice data
     - highly available
  3. what are we addressing?
     slicing and dicing data in R is fun…
     …until you run out of memory
  4. solution
     fire up a 64G EC2 machine and hope it works,
     or let Druid do the work for you
  5. setup
     launch your favorite R environment, then install and load the RDruid package:

       install.packages("devtools")
       install.packages("ggplot2")
       library(devtools)
       install_github("RDruid", "metamx")
       library(RDruid)
       library(ggplot2)

     (druid-meetup.R)
  6. concepts
     - Druid always computes aggregates
     - events are based in time; Druid understands time bucketing
     - dimensions: along which to slice & dice
     - metrics: to aggregate
  7. concepts
     think aggregates and GROUP BY in SQL:

       SELECT hour(timestamp),   -- time
              page, language,    -- dimensions
              sum(count)         -- metrics
       GROUP BY hour(timestamp), page, language
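The SQL above maps onto a single RDruid call. A sketch only, assuming the `druid` connection and the "wikipedia_editstream" data source introduced on the following slides; the interval here is just for illustration.

```r
# GROUP BY hour(timestamp), page, language with sum(count),
# expressed as an RDruid groupBy query (illustrative interval)
sql_like <- druid.query.groupBy(
  url          = druid,
  dataSource   = "wikipedia_editstream",
  intervals    = interval(ymd("2013-01-01"), ymd("2013-02-01")),
  granularity  = "hour",                              # hour(timestamp)
  dimensions   = list("page", "language"),            # dimensions
  aggregations = list(count = sum(metric("count")))   # sum(count)
)
```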
  8. data sources
     connect to our cluster:

       druid <- druid.url("druid-meetup.mmx.io")

     Wikipedia:

       druid.query.dimensions(url = druid, dataSource = "wikipedia_editstream")
       druid.query.metrics(url = druid, dataSource = "wikipedia_editstream")

     Twitter:

       dataSource = "twitterstream"

     (x0-sources.R)
  9. timeseries
     Wikipedia page edits since January, by hour:

       edits <- druid.query.timeseries(
         url          = druid,
         dataSource   = "wikipedia_editstream",
         intervals    = interval(ymd("2013-01-01"), ymd("2013-04-01")),
         aggregations = sum(metric("count")),
         granularity  = "hour"
       )
       qplot(data = edits, x = timestamp, y = count, geom = "line")

     (x1-timeseries.R)
  10. filters
      what if I'm only interested in articles in English and French?

        enfr <- druid.query.timeseries(
          [...]
          granularity = "hour",
          filter = dimension("namespace") == "article" &
                   ( dimension("language") == "en" | dimension("language") == "fr" )
        )

      (x2-filters.R)
  11. group by
      let's break it out by language:

        enfr <- druid.query.groupBy(
          [...]
          filter = dimension("namespace") == "article" &
                   ( dimension("language") == "en" | dimension("language") == "fr" ),
          dimensions = list("language")
        )
        qplot(data = enfr, x = timestamp, y = count, geom = "line", color = language)

      (x3-groupby.R)
  12. granularity
      arbitrary time slices:

        granularity = granularity(
          "PT6H",
          timeZone = "America/Los_Angeles"
        )

      try out a few more: P1D · P1W · P1M

      (x4-timeslices.R)
  13. aggregations
      sum, min, max:

        aggregations = list(
          count = sum(metric("count")),
          total = sum(metric("added"))
        )

          timestamp      total  count
        1 2013-01-01 127232693 346895
        2 2013-01-02 130657602 403504
        3 2013-01-03 134643672 387462

      (x5-aggs.R)
  14. math
      you can do math too: + - * / and constants

        aggregations = list(
          count   = sum(metric("count")),
          added   = sum(metric("added")),
          deleted = sum(metric("deleted"))
        ),
        postAggregations = list(
          average = field("added") / field("count"),
          pct     = field("deleted") / field("added") * -100
        )

      (x6-postaggs.R)
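Putting the post-aggregations into a full query: a sketch that combines the earlier timeseries call with the aggregations above to chart average characters added per edit, per hour. It assumes the same `druid` connection and "wikipedia_editstream" data source as the previous slides.

```r
# Average characters added per edit, by hour:
# postAggregations divide one aggregate by another at query time
avg_size <- druid.query.timeseries(
  url          = druid,
  dataSource   = "wikipedia_editstream",
  intervals    = interval(ymd("2013-01-01"), ymd("2013-04-01")),
  granularity  = "hour",
  aggregations = list(
    count = sum(metric("count")),
    added = sum(metric("added"))
  ),
  postAggregations = list(
    average = field("added") / field("count")
  )
)
qplot(data = avg_size, x = timestamp, y = average, geom = "line")
```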
  15. more advanced
      all pages edited by users matching regex '^Bob.*':

        druid.query.groupBy(
          [...]
          intervals   = interval(ymd("2013-03-01"), ymd("2013-04-01")),
          granularity = "all",   # single time bucket
          filter      = dimension("user") %~% "^Bob.*",
          dimensions  = list("user", "page")
        )

      (x7-advanced.R)
  16. academy awards stats

        awards <- druid.query.groupBy(
          url          = druid,
          dataSource   = "twitterstream",
          intervals    = interval(ymd("2013-02-24"), ymd("2013-02-28")),
          aggregations = list(tweets = sum(metric("count"))),
          granularity  = granularity("PT1H"),
          filter       = dimension("first_hashtag") %~% "academyawards" |
                         dimension("first_hashtag") %~% "oscars",
          dimensions   = list("first_hashtag")
        )
        awards <- subset(awards, tweets > 10)
        qplot(data = awards, x = timestamp, y = tweets,
              color = first_hashtag, geom = "line")

      (x8-awards.R)