Indexing Timeseries with Samza and Druid

Druid
May 09, 2015



Transcript

  1. THE PROBLEM
    ‣ Business intelligence for ad-tech
    ‣ Arbitrary and interactive exploration
    ‣ Multi-tenancy: thousands of concurrent users
    ‣ Recency: explore current data, alert on major changes
    ‣ Efficiency: each event is individually very low-value
  2. THE PROBLEM
    ‣ Questions lead to more questions
    ‣ Interested not just in what happened, but why
    ‣ Dig into the dataset using filters, aggregates, and comparisons
    ‣ All interesting queries cannot be determined upfront
  3. DRUID
    ‣ Druid project started in 2011, went open source in 2012
    ‣ Designed for low latency ingestion and ad-hoc aggregations
    ‣ Designed for keeping around a lot of history (years are ok)
    ‣ Growing Community
      • ~90 contributors
      • Used in production at numerous large and small organizations
  4. TIME SERIES DATA
    ‣ Unifying feature: some notion of “event timestamp”
    ‣ Questions are often time-oriented
    ‣ Monitoring: Plot CPU usage over the past 3 days, in 5-min buckets
    ‣ Web analytics: How many unique users today?
    ‣ BI: Which accounts had large revenue deltas this week over last week?
    ‣ Performance: What was the 99%ile latency over the past hour?
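To make the monitoring question concrete, here is a minimal Python sketch of bucketing events by timestamp into 5-minute buckets and averaging a value per bucket. It is an illustration only, not how Druid executes such a query; the function names and the CPU samples are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 5 * 60  # 5-minute buckets, as in the monitoring example


def bucket_start(ts: datetime) -> datetime:
    """Truncate an event timestamp to the start of its 5-minute bucket."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % BUCKET_SECONDS, tz=timezone.utc)


def average_by_bucket(events):
    """events: iterable of (timestamp, value). Returns {bucket_start: mean value}."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket -> [running total, event count]
    for ts, value in events:
        acc = sums[bucket_start(ts)]
        acc[0] += value
        acc[1] += 1
    return {bucket: total / count for bucket, (total, count) in sorted(sums.items())}


# Hypothetical CPU samples: (event timestamp, percent busy).
samples = [
    (datetime(2015, 5, 9, 12, 1, tzinfo=timezone.utc), 40.0),
    (datetime(2015, 5, 9, 12, 4, tzinfo=timezone.utc), 60.0),
    (datetime(2015, 5, 9, 12, 7, tzinfo=timezone.utc), 90.0),
]
cpu_by_bucket = average_by_bucket(samples)
```

The first two samples fall into the 12:00 bucket and the third into the 12:05 bucket. The other questions on the slide (unique users, week-over-week deltas, latency percentiles) have the same shape: group by a time bucket, then aggregate.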
  5. DRUID IN PRODUCTION, 2014: REALTIME INGESTION
    ‣ >500K events / second average
    ‣ >1M events / second peak
    ‣ 10 – 100K events / second / core
  6. DRUID IN PRODUCTION, 2014: QUERY LATENCY (500 ms average)
    ‣ 90% < 1 s
    ‣ 95% < 5 s
    ‣ 99% < 10 s
    [Figure: query latency percentiles (90%ile, 95%ile, 99%ile). x-axis: time (Feb 03 to Feb 24); y-axis: query time (seconds); one series per datasource, a through h.]
  7. ONE WEIRD TRICK FOR FAST QUERIES
    ‣ Doctors hate it!
    ‣ Time-partitioned immutable shards
    ‣ Global index of time interval to shards
    ‣ Each shard contains indexes for fast boolean filtering
    ‣ Each shard is column-oriented and compressed
    ‣ Compute partial results locally and merge hierarchically
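The ideas on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Druid's segment format or API: `Shard`, `partial_sum`, and `query_sum` are hypothetical names, the "index" is a plain dictionary of row-id sets rather than a compressed bitmap, the time filter works at shard granularity only, and the merge is a single flat step instead of a hierarchy.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """A chunk of rows covering the half-open time interval [start, end)."""
    start: int
    end: int
    rows: list  # list of dicts, treated as immutable once the shard is built
    index: dict = field(default_factory=dict)  # (column, value) -> set of row ids

    def __post_init__(self):
        # Build an inverted index once, when the shard is created.
        for row_id, row in enumerate(self.rows):
            for column, value in row.items():
                self.index.setdefault((column, value), set()).add(row_id)

    def partial_sum(self, filters: dict, metric: str) -> float:
        """AND together the per-value row sets, then sum the metric locally."""
        matching = set(range(len(self.rows)))
        for column, value in filters.items():
            matching &= self.index.get((column, value), set())
        return sum(self.rows[i][metric] for i in matching)


def query_sum(shards, start, end, filters, metric):
    """Pick the shards overlapping [start, end), then merge their partial sums."""
    relevant = [s for s in shards if s.start < end and start < s.end]
    return sum(s.partial_sum(filters, metric) for s in relevant)


# Hypothetical hourly shards (times are seconds since an arbitrary origin).
shards = [
    Shard(0, 3600, [
        {"publisher": "foo", "country": "US", "impressions": 10},
        {"publisher": "qux", "country": "US", "impressions": 5},
    ]),
    Shard(3600, 7200, [
        {"publisher": "foo", "country": "US", "impressions": 7},
        {"publisher": "foo", "country": "DE", "impressions": 3},
    ]),
    Shard(7200, 10800, [
        {"publisher": "foo", "country": "US", "impressions": 100},
    ]),
]
total = query_sum(shards, 0, 7200, {"publisher": "foo", "country": "US"}, "impressions")
```

Because the query interval covers only the first two hours, the third shard is never touched, and each selected shard computes its own partial result before the merge.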
  8. DRUID INGESTION
    ‣ Must have denormalized, flat data
    ‣ Druid cannot do stateful processing at ingestion time
    ‣ …like stream-stream joins
    ‣ …or user session reconstruction
    ‣ …or a bunch of other useful things!
    ‣ Many Druid users need an ETL pipeline
  9. OUR GOALS
    ‣ Input data: impressions, clicks, ID-to-name mappings
    ‣ Output: enhanced impressions
    ‣ Steps
    ‣ Join impressions with clicks -> “is_clicked”
    ‣ Look up IDs to names -> “publisher_name”, …
    ‣ Dissect user agent -> “browser”, “os”, …
    ‣ Lots of other additions
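As a rough illustration of the lookup and user-agent steps, the Python sketch below enhances a single impression. Everything in it is an assumption made for the example: the `publisher_id` and `user_agent` field names, the contents of the mapping table, and the substring-based classifier, which is far cruder than real user-agent parsing.

```python
PUBLISHER_NAMES = {"p1": "foo", "p2": "qux"}  # hypothetical ID-to-name mapping


def dissect_user_agent(user_agent: str) -> dict:
    """Toy substring-based classifier; real user-agent parsing is far messier."""
    ua = user_agent.lower()
    if "firefox" in ua:
        browser = "Firefox"
    elif "chrome" in ua:
        browser = "Chrome"
    else:
        browser = "unknown"
    if "windows" in ua:
        os_name = "Windows"
    elif "mac os x" in ua:
        os_name = "OS X"
    elif "linux" in ua:
        os_name = "Linux"
    else:
        os_name = "unknown"
    return {"browser": browser, "os": os_name}


def enhance(impression: dict) -> dict:
    """Return a copy of the impression with the derived dimensions added."""
    enhanced = dict(impression)
    enhanced["publisher_name"] = PUBLISHER_NAMES.get(impression["publisher_id"], "unknown")
    enhanced.update(dissect_user_agent(impression.get("user_agent", "")))
    return enhanced


example = enhance({
    "key": "186bd591-9442-48f0",
    "publisher_id": "p1",
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64; rv:37.0) Gecko/20100101 Firefox/37.0",
})
```

The output is still a flat record, which is what Druid ingestion needs: every lookup and derivation has been resolved before the event is indexed.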
  10. PIPELINE
    Impressions
      Partition 0
        {key: 186bd591-9442-48f0, publisher: foo, …}
        {key: 9b5e2cd2-a8ac-4232, publisher: qux, …}
        …
      Partition 1
        {key: 1079026c-7151-4871, publisher: baz, …}
        …
    Clicks
      Partition 0
        …
      Partition 1
        {key: 186bd591-9442-48f0}
        …
  11. PIPELINE
    Shuffled
      Partition 0
        {type: impression, key: 186bd591-9442-48f0, publisher: foo, …}
        {type: impression, key: 1079026c-7151-4871, publisher: baz, …}
        {type: click, key: 186bd591-9442-48f0}
        …
      Partition 1
        {type: impression, key: 9b5e2cd2-a8ac-4232, publisher: qux, …}
        …
  12. PIPELINE
    Joined
      Partition 0
        {key: 186bd591-9442-48f0, is_clicked: true, publisher: foo, …}
        {key: 1079026c-7151-4871, is_clicked: false, publisher: baz, …}
        …
      Partition 1
        {key: 9b5e2cd2-a8ac-4232, is_clicked: false, publisher: qux, …}
        …
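The three pipeline slides can be approximated with the Python sketch below: repartition both streams by key so an impression and its click meet in the same partition, then join within each partition. This is a simplified batch version over finished lists, not Samza code. A streaming job would keep the join state in a local store and only wait for clicks within a bounded window. The hash function and function names are assumptions, and the partition numbers it produces need not match the slides.

```python
import zlib

NUM_PARTITIONS = 2


def partition_for(key: str) -> int:
    """Stable hash, so an impression and its click land in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def shuffle(impressions, clicks):
    """Tag each message with its type and route it by key."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for imp in impressions:
        partitions[partition_for(imp["key"])].append({"type": "impression", **imp})
    for click in clicks:
        partitions[partition_for(click["key"])].append({"type": "click", **click})
    return partitions


def join_partition(messages):
    """Emit one record per impression, flagged with whether a click was seen."""
    clicked = {m["key"] for m in messages if m["type"] == "click"}
    joined = []
    for m in messages:
        if m["type"] == "impression":
            out = {k: v for k, v in m.items() if k != "type"}
            out["is_clicked"] = m["key"] in clicked
            joined.append(out)
    return joined


impressions = [
    {"key": "186bd591-9442-48f0", "publisher": "foo"},
    {"key": "9b5e2cd2-a8ac-4232", "publisher": "qux"},
    {"key": "1079026c-7151-4871", "publisher": "baz"},
]
clicks = [{"key": "186bd591-9442-48f0"}]

joined = [row for part in shuffle(impressions, clicks) for row in join_partition(part)]
is_clicked_by_key = {row["key"]: row["is_clicked"] for row in joined}
```

The join never looks outside its own partition, which is why the shuffle step has to come first.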
  13. NICE THINGS ABOUT SAMZA
    ‣ Multi-tenancy: one main thread per container
    ‣ Robustness: isolated containers limit slowness and failure
    ‣ Visibility
    ‣ Multistage jobs, lots of metrics per stage
    ‣ Can inspect the message queue in Kafka
    ‣ State is simple
    ‣ Logging and restoring handled for you
    ‣ Single-threaded programming is nice
  14. THINGS TO WATCH OUT FOR
    ‣ Multi-tenancy issues on Kafka
    ‣ Samza state size (affects restore times; a few GB seems ok)
    ‣ Serialization time can add up
    ‣ Default task.commit.ms is 60s
  15. MONITORING
    ‣ Kafka partition availability
    ‣ Kafka disk usage
    ‣ Samza consumer offsets
    ‣ Druid drop rate
    ‣ Druid query latency
    ‣ System metrics: CPU, network, disk
    ‣ Event counts at various stages
  16. WHY REPROCESS DATA?
    ‣ Bugs in processing code
    ‣ Imprecise streaming operations
    ‣ …like using short join windows
    ‣ Software limitations
    ‣ …Kafka and Samza can generate duplicate messages
    ‣ …Druid streaming ingestion is best-effort
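To see why duplicate messages matter, consider the Python sketch below. It assumes each event carries a unique `key` (as in the pipeline slides) and shows how a redelivered message inflates a streaming count, while a reprocessing pass that can see the whole data set can drop repeats. The `dedupe` helper is hypothetical, not part of Kafka, Samza, or Druid.

```python
def dedupe(messages, id_field="key"):
    """Keep the first message seen for each id, preserving input order."""
    seen = set()
    unique = []
    for message in messages:
        if message[id_field] not in seen:
            seen.add(message[id_field])
            unique.append(message)
    return unique


# Hypothetical stream in which the first impression was delivered twice.
messages = [
    {"key": "186bd591-9442-48f0", "publisher": "foo"},
    {"key": "1079026c-7151-4871", "publisher": "baz"},
    {"key": "186bd591-9442-48f0", "publisher": "foo"},  # redelivered duplicate
]

streamed_count = len(messages)             # what a naive streaming count reports
reprocessed_count = len(dedupe(messages))  # what a reprocessing pass reports
```

Keeping a set of every id seen is only practical when the data is bounded, which is one reason this kind of cleanup fits a reprocessing pass better than an unbounded stream.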
  17. LAMBDA ARCHITECTURES
    ‣ Hybrid batch/streaming data pipeline
    ‣ Batch technologies
      • Hadoop MapReduce
      • Spark
    ‣ Streaming technologies
      • Samza
      • Storm
      • Spark Streaming
  18. LAMBDA ARCHITECTURES
    ‣ Advantages?
      • Works as advertised
      • Works with a huge variety of open software
      • Druid supports batch-replace-by-time-range through Hadoop
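The batch-replace-by-time-range idea can be modelled in Python as swapping out every served result inside an interval for the batch job's output. This sketch only illustrates the concept with a dictionary of hourly counts. It is not Druid's ingestion API, and `replace_by_time_range` and the sample numbers are hypothetical.

```python
def replace_by_time_range(served, batch_output, start, end):
    """Return a new mapping where every bucket in [start, end) comes from batch_output."""
    kept = {t: v for t, v in served.items() if not (start <= t < end)}
    kept.update({t: v for t, v in batch_output.items() if start <= t < end})
    return dict(sorted(kept.items()))


# Hypothetical hourly event counts, keyed by bucket start (seconds).
served = {0: 98, 3600: 101, 7200: 55}   # best-effort streaming results
batch_output = {0: 100, 3600: 100}      # batch job re-ran the first two hours

corrected = replace_by_time_range(served, batch_output, 0, 7200)
```

The first two hours now reflect the batch results, and the most recent hour still comes from the streaming path until a later batch run covers it.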
  19. LAMBDA ARCHITECTURES
    ‣ Disadvantages?
    ‣ Need code to run on two very different systems
    ‣ Maintaining two codebases is perilous
    ‣ …productivity loss
    ‣ …code drift
    ‣ …difficulty training new developers
  20. KAPPA ARCHITECTURE
    ‣ Pure streaming
    ‣ Reprocess data by replaying the input stream
    ‣ Doesn’t require operating two systems
    ‣ Doesn’t overcome software limitations
    ‣ I don’t have much experience with this
  21. 2013 CORNERSTONES
    ‣ Druid - druid.io - @druidio
    ‣ Samza - samza.apache.org - @samzastream
    ‣ Kafka - kafka.apache.org - @apachekafka
  22. TAKEAWAYS
    ‣ Consider Kafka for making your streams available
    ‣ Consider Samza for streaming data integration
    ‣ Consider Druid for interactive exploration of streams
    ‣ Metrics, metrics, metrics
    ‣ Have a reprocessing strategy if you’re interested in historical data