Druid + R
Metamarkets
April 03, 2013
Transcript
Druid + R
  aggregate all your data
agenda
  An Overview of Druid, RDruid, Lab, Conclusions
motivation
  visualize big data
  existing data engines did not meet our needs
motivation
  relational databases: scans were too slow!
  NoSQL: computationally intractable, pre-computations took too long!
  nothing existed that could solve our problems (or it was cost prohibitive)
enter Druid
  real-time, distributed, column-oriented analytical data store
  scales horizontally
  open-source
how is Druid different
  highly optimized, fast scans & aggregations
  real-time data ingestion, explore events within milliseconds
  no pre-computation, arbitrarily slice & dice data
  highly available
using Druid
  we will explore Druid architecture in future meetups
  let's learn to use Druid!
RDruid
  slicing & dicing on steroids
what are we addressing?
  slicing and dicing data in R is fun…
  …until you run out of memory
solution
  fire up a 64G EC2 machine and hope it works
  or let Druid do the work for you
how we use it
  ad-hoc reporting
  analyze client data
  internal metrics
  prototyping
metrics
let’s try it
  code: bit.ly/YtJ1Xj
setup
  launch your favorite R environment
  install and load the Druid R package
    install.packages("devtools")
    install.packages("ggplot2")
    library(devtools)
    # install RDruid from GitHub ("user/repo" form)
    install_github("metamx/RDruid")
    library(RDruid)
    library(ggplot2)
  druid-meetup.R
concepts
  Druid always computes aggregates
  events are based in time
  Druid understands time bucketing
  dimensions along which to slice & dice
  metrics to aggregate
concepts
  think aggregates and group by in SQL
    SELECT hour(timestamp),   -- time
           page, language,    -- dimensions
           sum(count)         -- metrics
    GROUP BY hour(timestamp), page, language
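The SQL analogy can be tried locally, without a cluster. The following is a minimal sketch in base R, assuming a small made-up event table (the rows below are hypothetical, not Wikipedia data): the hour bucket plays the role of time, `page` and `language` are the dimensions, and `count` is the metric.

```r
# Hypothetical events; the values are invented for illustration only.
events <- data.frame(
  timestamp = as.POSIXct(c("2013-01-01 00:05:00", "2013-01-01 00:40:00",
                           "2013-01-01 01:10:00", "2013-01-01 01:20:00"),
                         tz = "UTC"),
  page      = c("Druid", "Druid", "Druid", "R"),
  language  = c("en", "en", "fr", "en"),
  count     = c(1, 2, 3, 4)
)

# Time bucketing: label each event with the start of its hour.
events$hour <- format(events$timestamp, "%Y-%m-%d %H:00", tz = "UTC")

# GROUP BY hour, page, language with sum(count) as the metric.
hourly <- aggregate(count ~ hour + page + language, data = events, FUN = sum)
print(hourly)
```

This works while the events fit in memory; the point of the talk is that Druid performs the same bucketing and aggregation on the cluster and returns only the aggregated rows.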
data sources
  connect to our cluster
    druid <- druid.url("druid-meetup.mmx.io")
  Wikipedia
    druid.query.dimensions(url = druid, dataSource = "wikipedia_editstream")
    druid.query.metrics(url = druid, dataSource = "wikipedia_editstream")
  Twitter
    dataSource = "twitterstream"
  x0-sources.R
timeseries
  Wikipedia page edits since January, by hour
    edits <- druid.query.timeseries(
      url          = druid,
      dataSource   = "wikipedia_editstream",
      intervals    = interval(ymd("2013-01-01"), ymd("2013-04-01")),
      aggregations = sum(metric("count")),
      granularity  = "hour"
    )
    qplot(data = edits, x = timestamp, y = count, geom = "line")
  x1-timeseries.R
filters
  what if I'm only interested in articles in English and French?
    enfr <- druid.query.timeseries(
      [...]
      granularity = "hour",
      filter      = dimension("namespace") == "article" &
                    (dimension("language") == "en" | dimension("language") == "fr")
    )
  x2-filters.R
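The parentheses around the two language tests matter, because R parses `&` before `|`. The sketch below shows the difference on a plain data frame with hypothetical rows; it uses ordinary logical vectors rather than RDruid filter objects, on the assumption that RDruid's overloaded operators are subject to the same R parsing rules.

```r
# Hypothetical edit records, for illustration only.
edits <- data.frame(
  namespace = c("article", "article", "talk", "article"),
  language  = c("en", "fr", "fr", "de")
)

# Intended filter: articles that are in English or French.
with_parens <- with(
  edits,
  namespace == "article" & (language == "en" | language == "fr")
)

# Without parentheses R reads this as (article AND en) OR fr,
# which also lets the French "talk" row through.
without_parens <- with(
  edits,
  namespace == "article" & language == "en" | language == "fr"
)

print(edits[with_parens, ])
print(edits[without_parens, ])
```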
group by
  let's break it out by language
    enfr <- druid.query.groupBy(
      [...]
      filter     = dimension("namespace") == "article" &
                   (dimension("language") == "en" | dimension("language") == "fr"),
      dimensions = list("language")
    )
    qplot(data = enfr, x = timestamp, y = count, geom = "line", color = language)
  x3-groupby.R
granularity
  arbitrary time slices
    granularity = granularity(
      "PT6H",
      timeZone = "America/Los_Angeles"
    )
  try out a few more: P1D · P1W · P1M
  x4-timeslices.R
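"PT6H" is an ISO 8601 period of six hours, and the time zone decides where the bucket boundaries fall. The sketch below illustrates the idea in base R, assuming buckets aligned to local midnight; `bucket_start` is a hypothetical helper written for this example, not part of RDruid, and it is not Druid's implementation.

```r
# Hypothetical helper: label each timestamp with the start of its
# N-hour bucket, counted from local midnight in the given time zone.
bucket_start <- function(x, hours = 6, tz = "America/Los_Angeles") {
  local <- as.POSIXlt(x, tz = tz)
  start_hour <- as.integer((local$hour %/% hours) * hours)
  sprintf("%s %02d:00", format(local, "%Y-%m-%d"), start_hour)
}

# Three UTC timestamps that land in different Pacific-time buckets.
ts <- as.POSIXct(c("2013-03-01 07:30:00",
                   "2013-03-01 08:00:00",
                   "2013-03-01 15:59:00"), tz = "UTC")

print(bucket_start(ts))
```

The first timestamp is still the previous evening in Los Angeles, so it falls in a bucket on the previous local day; with UTC buckets it would not.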
aggregations
  sum, min, max
    aggregations = list(
      count = sum(metric("count")),
      total = sum(metric("added"))
    )

       timestamp     total  count
    1 2013-01-01 127232693 346895
    2 2013-01-02 130657602 403504
    3 2013-01-03 134643672 387462
  x5-aggs.R
math
  you can do math too
  + - * / constants
    aggregations = list(
      count   = sum(metric("count")),
      added   = sum(metric("added")),
      deleted = sum(metric("deleted"))
    ),
    postAggregations = list(
      average = field("added") / field("count"),
      pct     = field("deleted") / field("added") * -100
    )
  x6-postaggs.R
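A post-aggregation is arithmetic applied to columns that have already been aggregated. The same arithmetic can be reproduced locally on a result data frame; the sketch below reuses the three rows shown on the aggregations slide, where `total` is the sum of `added`, and computes the `average` from them. The `pct` column is left out because the slide output has no `deleted` values.

```r
# The three result rows from the aggregations slide.
daily <- data.frame(
  timestamp = as.Date(c("2013-01-01", "2013-01-02", "2013-01-03")),
  total     = c(127232693, 130657602, 134643672),  # sum of "added"
  count     = c(346895, 403504, 387462)            # sum of "count"
)

# Same arithmetic as: average = field("added") / field("count")
daily$average <- daily$total / daily$count

print(daily)
```

Doing this in the query instead keeps the computation next to the aggregation, so the result arrives with the derived column already in place.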
more advanced
  all pages edited by users matching regex '^Bob.*'
    druid.query.groupBy(
      [...]
      intervals   = interval(ymd("2013-03-01"), ymd("2013-04-01")),
      granularity = "all",  # single time bucket
      filter      = dimension("user") %~% "^Bob.*",
      dimensions  = list("user", "page")
    )
  x7-advanced.R
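Before sending a regex filter to the cluster, it can help to check what the pattern matches. The sketch below tries '^Bob.*' against a few hypothetical user names with base R's `grepl`; it assumes R's regex engine and the one Druid uses agree on a pattern this simple.

```r
# Hypothetical user names, for illustration only.
users <- c("Bob", "Bobby42", "bob", "SpongeBob", "Alice")

# "^" anchors the match to the start of the name; matching is case-sensitive.
matches <- grepl("^Bob.*", users)

print(users[matches])
```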
academy awards stats
    awards <- druid.query.groupBy(
      url          = druid,
      dataSource   = "twitterstream",
      intervals    = interval(ymd("2013-02-24"), ymd("2013-02-28")),
      aggregations = list(tweets = sum(metric("count"))),
      granularity  = granularity("PT1H"),
      filter       = dimension("first_hashtag") %~% "academyawards" |
                     dimension("first_hashtag") %~% "oscars",
      dimensions   = list("first_hashtag")
    )
    awards <- subset(awards, tweets > 10)
    qplot(data = awards, x = timestamp, y = tweets,
          color = first_hashtag, geom = "line")
  x8-awards.R
academy awards stats
  x8-awards.R
roll your own
  run your own Druid cluster
  github.com/metamx/druid/wiki/Druid-Personal-Demo-Cluster
contribute
  fork us on github
  Druid: github.com/metamx/druid
  RDruid: github.com/metamx/RDruid
thank you