Indexing Timeseries with Samza and Druid

INDEXING TIME SERIES STREAMS WITH SAMZA AND DRUID
GIAN MERLINO · DRUID DEVELOPER · SAMZA USER

THE PROBLEM

THE PROBLEM
‣ Business intelligence for ad-tech
‣ Arbitrary and interactive exploration
‣ Multi-tenancy: thousands of concurrent users
‣ Recency: explore current data, alert on major changes
‣ Efficiency: each event is individually very low-value

THE PROBLEM
‣ Questions lead to more questions
‣ Interested not just in what happened, but why
‣ Dig into the dataset using filters, aggregates, and comparisons
‣ Not all interesting queries can be determined upfront

EXPLORING TIME SERIES

DRUID
‣ Druid project started in 2011, went open source in 2012
‣ Designed for low latency ingestion and ad-hoc aggregations
‣ Designed for keeping around a lot of history (years are ok)
‣ Growing community
  • ~90 contributors
  • Used in production at numerous large and small organizations

TIME SERIES DATA
‣ Unifying feature: some notion of “event timestamp”
‣ Questions are often time-oriented
‣ Monitoring: Plot CPU usage over the past 3 days, in 5-min buckets
‣ Web analytics: How many unique users today?
‣ BI: Which accounts had large revenue deltas this week over last week?
‣ Performance: What was the 99%ile latency over the past hour?
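
As a rough illustration of the monitoring question above (CPU usage in 5-min buckets), time-oriented queries usually reduce to grouping events by a truncated timestamp and aggregating within each group. The sketch below is plain Python, not Druid's query API; the function name `bucket_averages` and the sample data are made up for illustration.

```python
from collections import defaultdict

BUCKET_SECONDS = 5 * 60  # 5-minute buckets, as in the monitoring example


def bucket_averages(events, bucket_seconds=BUCKET_SECONDS):
    """Average (timestamp_seconds, value) pairs per fixed-width time bucket."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in events:
        # Truncate the event timestamp down to the start of its bucket.
        bucket_start = ts - (ts % bucket_seconds)
        sums[bucket_start] += value
        counts[bucket_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Hypothetical CPU samples: (seconds since some epoch, percent busy).
samples = [(0, 10.0), (120, 30.0), (300, 50.0), (599, 70.0), (600, 90.0)]
result = bucket_averages(samples)
```

Each key in `result` is a bucket start time and each value is the average of the samples that fell into that bucket.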

DRUID IN PRODUCTION (2014)
‣ Realtime ingestion
‣ >500K events / second average
‣ >1M events / second peak
‣ 10 – 100K events / second / core

DRUID IN PRODUCTION (2014)
[Figure: query latency percentiles (90%ile, 95%ile, 99%ile) per datasource (a to h); x-axis: time, Feb 03 to Feb 24; y-axis: query time (seconds)]
‣ Query latency (500 ms average)
‣ 90% < 1 s
‣ 95% < 5 s
‣ 99% < 10 s

ONE WEIRD TRICK FOR FAST QUERIES
‣ Doctors hate it!
‣ Time-partitioned immutable shards
‣ Global index of time interval to shards
‣ Each shard contains indexes for fast boolean filtering
‣ Each shard is column-oriented and compressed
‣ Compute partial results locally and merge hierarchically
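
The first, second, and last bullets can be sketched in a few lines: a query consults an interval index to pick only the shards that overlap its time range, computes a partial result inside each shard, and merges the partials. This is a toy model under stated assumptions, not Druid's implementation; `Shard`, `shards_for`, `partial_count`, and `query` are hypothetical names, and the per-shard filter indexes and columnar compression from the other bullets are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    start: int   # inclusive start of the shard's time interval
    end: int     # exclusive end of the shard's time interval
    rows: tuple  # immutable rows of (timestamp, publisher, count)


def shards_for(index, q_start, q_end):
    """Global index lookup: keep only shards whose interval overlaps the query."""
    return [s for s in index if s.start < q_end and q_start < s.end]


def partial_count(shard, q_start, q_end, publisher):
    """Partial result computed locally, inside a single shard."""
    return sum(c for ts, pub, c in shard.rows
               if q_start <= ts < q_end and pub == publisher)


def query(index, q_start, q_end, publisher):
    """Merge the partial results from every relevant shard."""
    return sum(partial_count(s, q_start, q_end, publisher)
               for s in shards_for(index, q_start, q_end))


index = [
    Shard(0, 100, ((10, "foo", 1), (20, "baz", 2), (90, "foo", 3))),
    Shard(100, 200, ((110, "foo", 4), (150, "foo", 5))),
    Shard(200, 300, ((210, "foo", 6),)),
]
touched = shards_for(index, 50, 160)
total = query(index, 50, 160, "foo")
```

Here the third shard is never read, because its interval does not overlap the query range.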

DRUID INGESTION
‣ Must have denormalized, flat data
‣ Druid cannot do stateful processing at ingestion time
  ‣ …like stream-stream joins
  ‣ …or user session reconstruction
  ‣ …or a bunch of other useful things!
‣ Many Druid users need an ETL pipeline

STREAMING DATA PIPELINES

OUR GOALS
‣ Input data: impressions, clicks, ID-to-name mappings
‣ Output: enhanced impressions
‣ Steps
  ‣ Join impressions with clicks -> “is_clicked”
  ‣ Look up IDs to names -> “publisher_name”, …
  ‣ Dissect user agent -> “browser”, “os”, …
  ‣ Lots of other additions

PIPELINE
[Diagram: Impressions, Clicks -> ? -> Druid]

PIPELINE
Impressions
  Partition 0
    {key: 186bd591-9442-48f0, publisher: foo, …}
    {key: 9b5e2cd2-a8ac-4232, publisher: qux, …}
    …
  Partition 1
    {key: 1079026c-7151-4871, publisher: baz, …}
    …
Clicks
  Partition 0
    …
  Partition 1
    {key: 186bd591-9442-48f0}
    …

PIPELINE
[Diagram: Impressions, Clicks -> Druid]

PIPELINE
[Diagram: Impressions, Clicks -> (Shuffle) -> Shuffled -> Druid]

PIPELINE
Shuffled
  Partition 0
    {type: impression, key: 186bd591-9442-48f0, publisher: foo, …}
    {type: impression, key: 1079026c-7151-4871, publisher: baz, …}
    {type: click, key: 186bd591-9442-48f0}
    …
  Partition 1
    {type: impression, key: 9b5e2cd2-a8ac-4232, publisher: qux, …}
    …
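
The point of the shuffle stage is that an impression and its click end up in the same partition, so a single task sees both. A minimal sketch of that repartitioning, assuming events are routed by a hash of `key`: the helper names are hypothetical, and `zlib.crc32` is only a deterministic stand-in for whatever partitioner the real pipeline uses, so the partition numbers it produces need not match the slide.

```python
import zlib


def partition_for(key, num_partitions):
    """Deterministically map a key to a partition (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(impressions, clicks, num_partitions=2):
    """Merge both input streams into one stream partitioned by event key."""
    partitions = [[] for _ in range(num_partitions)]
    for event_type, stream in (("impression", impressions), ("click", clicks)):
        for event in stream:
            tagged = {"type": event_type, **event}  # remember the source stream
            partitions[partition_for(event["key"], num_partitions)].append(tagged)
    return partitions


impressions = [
    {"key": "186bd591-9442-48f0", "publisher": "foo"},
    {"key": "9b5e2cd2-a8ac-4232", "publisher": "qux"},
    {"key": "1079026c-7151-4871", "publisher": "baz"},
]
clicks = [{"key": "186bd591-9442-48f0"}]
shuffled = shuffle(impressions, clicks)
```

Whatever partition the key `186bd591-9442-48f0` hashes to, both its impression and its click land there, which is what the join stage relies on.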

PIPELINE
[Diagram: Impressions, Clicks -> (Shuffle) -> Shuffled -> (Join) -> Joined -> Druid]

PIPELINE
Joined
  Partition 0
    {key: 186bd591-9442-48f0, is_clicked: true, publisher: foo, …}
    {key: 1079026c-7151-4871, is_clicked: false, publisher: baz, …}
    …
  Partition 1
    {key: 9b5e2cd2-a8ac-4232, is_clicked: false, publisher: qux, …}
    …
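
One way to picture the join stage is a per-partition windowed join: hold each impression for a fixed window, mark it if a click with the same key arrives in time, and emit it with `is_clicked` once the window closes. The sketch below is an assumption about how such a join could look, not the talk's actual Samza task; the `ts` field, the `window_seconds` parameter, and the in-memory dict (standing in for task-local state) are all hypothetical.

```python
def join_partition(events, window_seconds):
    """Join impressions with clicks inside one partition, ordered by 'ts'."""
    pending = {}  # key -> [impression, is_clicked]; stand-in for local task state
    joined = []

    def emit_expired(now):
        # Emit every impression whose join window has closed by time `now`.
        expired = [k for k, (imp, _) in pending.items()
                   if now - imp["ts"] >= window_seconds]
        for k in expired:
            imp, clicked = pending.pop(k)
            joined.append({**imp, "is_clicked": clicked})

    for event in events:
        emit_expired(event["ts"])
        if event["type"] == "impression":
            impression = {k: v for k, v in event.items() if k != "type"}
            pending[event["key"]] = [impression, False]
        elif event["key"] in pending:
            pending[event["key"]][1] = True
        # A click with no pending impression (too late or too early) is dropped.

    emit_expired(float("inf"))  # end of input: flush everything still pending
    return joined


partition_0 = [
    {"type": "impression", "key": "186bd591-9442-48f0", "publisher": "foo", "ts": 0},
    {"type": "impression", "key": "1079026c-7151-4871", "publisher": "baz", "ts": 5},
    {"type": "click", "key": "186bd591-9442-48f0", "ts": 20},
]
joined = join_partition(partition_0, window_seconds=60)
late = join_partition(partition_0, window_seconds=10)
```

With a 60-second window the first impression is marked as clicked; with a 10-second window the same click arrives after the impression has already been emitted, so it is missed. That trade-off between window length and accuracy is one reason a pipeline like this may need reprocessing.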

PIPELINE
[Diagram: Impressions, Clicks -> (Shuffle) -> Shuffled -> (Join) -> Joined -> (Enhance & Output) -> Druid]

ALTERNATIVE PIPELINE
[Diagram: Impressions, Clicks -> (Shuffle) -> Shuffled -> (Join) -> Joined -> (Enhance) -> Enhanced -> Druid]

OPERATIONS

NICE THINGS ABOUT SAMZA
‣ Multi-tenancy: one main thread per container
‣ Robustness: isolated containers limit slowness and failure
‣ Visibility
  ‣ Multistage jobs, lots of metrics per stage
  ‣ Can inspect the message queue in Kafka
‣ State is simple
  ‣ Logging and restoring handled for you
‣ Single-threaded programming is nice

THINGS TO WATCH OUT FOR
‣ Multitenancy issues on Kafka
‣ Samza state size (affects restore times; a few GB seems ok)
‣ Serialization time can add up
‣ Default task.commit.ms is 60s

MONITORING
‣ Kafka partition availability
‣ Kafka disk usage
‣ Samza consumer offsets
‣ Druid drop rate
‣ Druid query latency
‣ System metrics: CPU, network, disk
‣ Event counts at various stages

STREAM METRICS

REPROCESSING

WHY REPROCESS DATA?
‣ Bugs in processing code
‣ Imprecise streaming operations
  ‣ …like using short join windows
‣ Software limitations
  ‣ …Kafka and Samza can generate duplicate messages
  ‣ …Druid streaming ingestion is best-effort
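
To see where duplicate messages can come from, consider a task that checkpoints its input offset only periodically. If it fails between checkpoints, it restarts from the last committed offset and re-emits whatever it had already processed since then. The simulation below is a simplified model, not Samza code: it commits every `commit_every` messages rather than on a timer such as `task.commit.ms`, and the function and parameter names are made up.

```python
def run_with_crash(messages, commit_every, crash_after):
    """Process messages, crash once before offset `crash_after`, then resume."""
    output = []
    checkpoint = 0  # last committed input offset

    # First attempt: runs until the simulated crash.
    for offset in range(len(messages)):
        if offset == crash_after:
            break  # crash before this message is processed
        output.append(messages[offset])
        if (offset + 1) % commit_every == 0:
            checkpoint = offset + 1  # commit progress

    # Restart: resume from the last checkpoint, not from the crash point.
    for offset in range(checkpoint, len(messages)):
        output.append(messages[offset])
    return output


output = run_with_crash(list("abcde"), commit_every=2, crash_after=3)
```

In this run the message `c` was processed after the last commit and before the crash, so it appears twice downstream. A longer commit interval widens that window.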

LAMBDA ARCHITECTURES
‣ Hybrid batch/streaming data pipeline
‣ Batch technologies
  • Hadoop MapReduce
  • Spark
‣ Streaming technologies
  • Samza
  • Storm
  • Spark Streaming

LAMBDA ARCHITECTURES
‣ Advantages?
  • Works as advertised
  • Works with a huge variety of open software
  • Druid supports batch-replace-by-time-range through Hadoop

LAMBDA ARCHITECTURES
‣ Disadvantages?
  ‣ Need code to run on two very different systems
  ‣ Maintaining two codebases is perilous
    ‣ …productivity loss
    ‣ …code drift
    ‣ …difficulty training new developers

LAMBDA ARCHITECTURES

KAPPA ARCHITECTURE
‣ Pure streaming
‣ Reprocess data by replaying the input stream
‣ Doesn’t require operating two systems
‣ Doesn’t overcome software limitations
‣ I don’t have much experience with this

DO TRY THIS AT HOME

2013 CORNERSTONES
‣ Druid - druid.io - @druidio
‣ Samza - samza.apache.org - @samzastream
‣ Kafka - kafka.apache.org - @apachekafka

GLUE Tranquility Camus / Secor Druid Hadoop indexer

GLUE Camus / Secor Druid Hadoop indexer druid-kafka-eight

TAKE AWAYS
‣ Consider Kafka for making your streams available
‣ Consider Samza for streaming data integration
‣ Consider Druid for interactive exploration of streams
‣ Metrics, metrics, metrics
‣ Have a reprocessing strategy if you’re interested in historical data

THANK YOU
