interactive exploration ‣ Multi-tenancy: thousands of concurrent users ‣ Recency: explore current data, alert on major changes ‣ Efficiency: each event is individually very low-value
not just in what happened, but why ‣ Dig into the dataset using filters, aggregates, and comparisons ‣ Not all interesting queries can be determined upfront
in 2012 ‣ Designed for low-latency ingestion and ad-hoc aggregations ‣ Designed for keeping around a lot of history (years are OK) ‣ Growing community • ~90 contributors • Used in production at numerous large and small organizations
timestamp” ‣ Questions are often time-oriented ‣ Monitoring: Plot CPU usage over the past 3 days, in 5-min buckets ‣ Web analytics: How many unique users today? ‣ BI: Which accounts had large revenue deltas this week over last week? ‣ Performance: What was the 99%ile latency over the past hour?
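The time-oriented questions above mostly reduce to two operations: grouping events into fixed time buckets, and summarizing the values in a window. The following is a minimal sketch in plain Python, not Druid's query API. The function names, the `(timestamp, value)` event shape, and the nearest-rank percentile definition are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
import math

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
FIVE_MINUTES = timedelta(minutes=5)


def floor_to_bucket(ts, bucket=FIVE_MINUTES):
    """Round a timezone-aware timestamp down to the start of its bucket."""
    return EPOCH + ((ts - EPOCH) // bucket) * bucket


def average_by_bucket(events, start, end, bucket=FIVE_MINUTES):
    """Average the values of (timestamp, value) events in [start, end), per bucket."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [running sum, count]
    for ts, value in events:
        if start <= ts < end:
            acc = sums[floor_to_bucket(ts, bucket)]
            acc[0] += value
            acc[1] += 1
    return {k: s / n for k, (s, n) in sorted(sums.items())}


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile (hypothetical choice of definition), p in 1..100."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

For the monitoring question, `start` and `end` would span the past 3 days; for the latency question, `percentile_nearest_rank(latencies, 99)` would be applied to the values from the past hour.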
DRUID IN PRODUCTION ‣ Query latency: 500 ms average ‣ 90% < 1 s ‣ 95% < 5 s ‣ 99% < 10 s ‣ [Figure: Query latency percentiles (90%ile, 95%ile, 99%ile) per datasource (a through h); x-axis: time, Feb 03 to Feb 24; y-axis: query time (seconds), 0 to 20]
‣ Time-partitioned immutable shards ‣ Global index of time interval to shards ‣ Each shard contains indexes for fast boolean filtering ‣ Each shard is column-oriented and compressed ‣ Compute partial results locally and merge hierarchically
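The ideas on this slide can be illustrated with a toy model: immutable shards keyed by time interval, an index lookup that selects only the shards overlapping the query interval, a partial aggregate computed per shard, and a tree-style merge of the partials. This is a sketch under stated assumptions, not Druid's implementation. `Shard`, `shards_for`, `partial_sum_count`, and `merge_hierarchically` are hypothetical names; the linear scan stands in for a real interval index, and the equality check stands in for the per-shard filter indexes.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)  # frozen: shards are immutable once built
class Shard:
    start: int   # interval start, inclusive
    end: int     # interval end, exclusive
    rows: tuple  # tuple of (timestamp, dimension_value, metric)


def shards_for(index, q_start, q_end):
    """Select the shards whose interval overlaps [q_start, q_end)."""
    return [s for s in index if s.start < q_end and q_start < s.end]


def partial_sum_count(shard, q_start, q_end, dim_value):
    """Compute a local partial result (sum, count) for one shard."""
    total, count = 0.0, 0
    for ts, dim, metric in shard.rows:
        if q_start <= ts < q_end and dim == dim_value:
            total += metric
            count += 1
    return (total, count)


def merge(a, b):
    """Combine two partial results; associative, so merge order is free."""
    return (a[0] + b[0], a[1] + b[1])


def merge_hierarchically(partials, fan_in=2):
    """Merge partials level by level, fan_in at a time, like a tree of nodes."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(partials)
    while len(level) > 1:
        level = [reduce(merge, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0] if level else (0.0, 0)
```

Because each partial is a `(sum, count)` pair rather than a finished average, the merge step stays exact no matter how the shards are grouped.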
cannot do stateful processing at ingestion time ‣ …like stream-stream joins ‣ …or user session reconstruction ‣ …or a bunch of other useful things! ‣ Many Druid users need an ETL pipeline
container ‣ Robustness: isolated containers limit slowness and failure ‣ Visibility ‣ Multistage jobs, lots of metrics per stage ‣ Can inspect the message queue in Kafka ‣ State is simple ‣ Logging and restoring handled for you ‣ Single-threaded programming is nice
streaming operations ‣ …like using short join windows ‣ Software limitations ‣ …Kafka and Samza can generate duplicate messages ‣ …Druid streaming ingestion is best-effort
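To make the short-join-window limitation concrete, here is a minimal sketch of a windowed stream-stream join in plain Python. It does not use the Samza API. The `WindowedJoin` class, its method names, and the event labels in the usage note are hypothetical, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import OrderedDict


class WindowedJoin:
    """Join two streams by key, buffering unmatched left events for window_ms."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        # key -> (timestamp_ms, left_event), oldest insertion first
        self.pending = OrderedDict()

    def _expire(self, now_ms):
        """Drop buffered left events that are older than the join window."""
        while self.pending:
            _, (ts, _) = next(iter(self.pending.items()))
            if now_ms - ts <= self.window_ms:
                break
            self.pending.popitem(last=False)

    def on_left(self, key, ts_ms, event):
        """Buffer a left-side event until its partner arrives or it expires."""
        self._expire(ts_ms)
        self.pending.pop(key, None)  # re-insert so ordering follows arrival
        self.pending[key] = (ts_ms, event)

    def on_right(self, key, ts_ms, event):
        """Return (left, right) if a partner is still buffered, else None."""
        self._expire(ts_ms)
        match = self.pending.pop(key, None)
        return None if match is None else (match[1], event)
```

With a 1000 ms window, a right-side event arriving 500 ms after its partner joins, while one arriving 1500 ms later finds nothing, because the buffered state has already been dropped. That lost match is the cost of keeping the window, and therefore the state, small.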
‣ Consider Samza for streaming data integration ‣ Consider Druid for interactive exploration of streams ‣ Metrics, metrics, metrics ‣ Have a reprocessing strategy if you’re interested in historical data