is complex!
  • Data manipulation/ETL, machine learning, building applications, etc.
  • Building data systems for business intelligence applications
‣ Dozens of solutions, projects, and methodologies
‣ How to choose the right tools for the job?
adopted, its limitations also become better known
‣ General computing frameworks can handle many different distributed computing problems
‣ They are also sub-optimal for many use cases
‣ Analytic queries are inefficient
‣ Specialized technologies are adopted to address these inefficiencies
• Aggregate measures over time, broken down by dimensions
• Revenue over time broken down by product type
• Top selling products by volume in San Francisco
• Number of unique visitors broken down by age
• Not dumping the entire dataset
• Not examining individual events
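The query shape listed above (a measure aggregated over time and broken down by a dimension) can be sketched in a few lines of plain Python. The event records and field names here are made up for illustration; they are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical sales events: (hour bucket, product type, revenue).
events = [
    ("2011-01-01T00", "shoes", 10.0),
    ("2011-01-01T00", "hats", 5.0),
    ("2011-01-01T00", "shoes", 2.5),
    ("2011-01-01T01", "hats", 7.0),
]

def revenue_by_hour_and_type(rows):
    """Sum revenue per (hour, product type); no raw events are returned."""
    totals = defaultdict(float)
    for hour, product_type, revenue in rows:
        totals[(hour, product_type)] += revenue
    return dict(totals)

result = revenue_by_hour_and_type(events)
```

The result holds one aggregated number per (time bucket, dimension value) pair, which is the kind of answer these queries ask for, as opposed to a dump of individual events.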
a subset of queries
  • Exponential scaling costs
‣ Range scans
  • Primary key: dimensions/attributes
  • Value: measures/metrics (things to aggregate)
  • Still too slow!
KEY/VALUE STORES
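A minimal sketch of the range-scan layout described above, assuming an in-memory sorted list stands in for the key/value store. The key is a tuple of dimensions and the value is a tuple of measures; the sample values and the helper name `range_scan_added` are illustrative only.

```python
import bisect

# Sorted key/value layout: key = (hour, page, city), value = (added, deleted).
rows = sorted([
    (("2011-01-01T00", "Justin Bieber", "SF"), (25, 127)),
    (("2011-01-01T01", "Justin Bieber", "SF"), (32, 45)),
    (("2011-01-01T01", "Ke$ha", "Calgary"), (60, 186)),
    (("2011-01-01T02", "Ke$ha", "Calgary"), (12, 53)),
])
keys = [key for key, _ in rows]

def range_scan_added(start_hour, end_hour):
    """Sum `added` over keys with start_hour <= hour < end_hour."""
    # A one-element tuple sorts before every longer key sharing that prefix.
    lo = bisect.bisect_left(keys, (start_hour,))
    hi = bisect.bisect_left(keys, (end_hour,))
    return sum(value[0] for _, value in rows[lo:hi])

total = range_scan_added("2011-01-01T01", "2011-01-01T02")
```

Finding the start of the range is cheap, but the scan still reads every key/value pair inside the range, including dimensions and measures the query does not need.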
‣ Different compression algorithms for different columns
‣ Encoding for string columns
‣ Compression for measure columns
‣ Different indexes for different columns
COLUMN STORES
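As one possible illustration of "encoding for string columns", the sketch below applies dictionary encoding, which replaces each distinct string with a small integer id. This is a generic technique shown under that assumption, not necessarily the exact scheme any particular column store uses.

```python
def dictionary_encode(column):
    """Map each distinct string to an integer id, in first-seen order."""
    dictionary = {}
    encoded = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        encoded.append(dictionary[value])
    return dictionary, encoded

# A string column with few distinct values, stored on its own.
page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
dictionary, encoded = dictionary_encode(page)
```

Because the column is stored separately from the others, the repeated strings collapse into a short list of ids plus a two-entry dictionary, and the id list itself is easy to compress further.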
2011-01-01T00:01:35Z  Justin Bieber  en  SF       USA  10  65
2011-01-01T00:01:63Z  Justin Bieber  en  SF       USA  15  62
2011-01-01T01:02:51Z  Justin Bieber  en  SF       USA  32  45
2011-01-01T01:01:11Z  Ke$ha          en  Calgary  CA   17  87
2011-01-01T01:02:24Z  Ke$ha          en  Calgary  CA   43  99
2011-01-01T02:03:12Z  Ke$ha          en  Calgary  CA   12  53
...
2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA           25     127
2011-01-01T01:00:00Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            60     186
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53
...

timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
2011-01-01T00:01:63Z  Justin Bieber  en        SF       USA           15     62
2011-01-01T01:02:51Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:01:11Z  Ke$ha          en        Calgary  CA            17     87
2011-01-01T01:02:24Z  Ke$ha          en        Calgary  CA            43     99
2011-01-01T02:03:12Z  Ke$ha          en        Calgary  CA            12     53
...
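The roll-up from the raw rows to the hourly rows above can be sketched as follows. This is a minimal illustration in plain Python, not Druid's implementation; the function name `roll_up` is hypothetical, and timestamps are truncated to the hour as strings rather than parsed, so the sample values are used exactly as listed.

```python
from collections import defaultdict

# Raw events: (timestamp, page, language, city, country, added, deleted).
raw = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "en", "SF", "USA", 10, 65),
    ("2011-01-01T00:01:63Z", "Justin Bieber", "en", "SF", "USA", 15, 62),
    ("2011-01-01T01:02:51Z", "Justin Bieber", "en", "SF", "USA", 32, 45),
    ("2011-01-01T01:01:11Z", "Ke$ha", "en", "Calgary", "CA", 17, 87),
    ("2011-01-01T01:02:24Z", "Ke$ha", "en", "Calgary", "CA", 43, 99),
    ("2011-01-01T02:03:12Z", "Ke$ha", "en", "Calgary", "CA", 12, 53),
]

def roll_up(rows):
    """Truncate timestamps to the hour and sum measures per dimension tuple."""
    totals = defaultdict(lambda: [0, 0])
    for ts, page, language, city, country, added, deleted in rows:
        hour = ts[:13] + ":00:00Z"  # keep "YYYY-MM-DDTHH", zero the rest
        key = (hour, page, language, city, country)
        totals[key][0] += added
        totals[key][1] += deleted
    return {key: tuple(sums) for key, sums in totals.items()}

rolled = roll_up(raw)
```

Six raw rows collapse into four rolled-up rows. The individual events are no longer recoverable, which is acceptable for queries that only need aggregates.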
                                                           deleted
2011-01-01T00:00:00Z  Justin Bieber  en  SF       USA  25  127
2011-01-01T01:00:00Z  Justin Bieber  en  SF       USA  32  45
2011-01-01T01:00:00Z  Ke$ha          en  Calgary  CA   60  186
2011-01-01T02:00:00Z  Ke$ha          en  Calgary  CA   12  53

‣ Shard data by time
‣ Immutable blocks of data called “segments”

Segment 2011-01-01T02/2011-01-01T03
Segment 2011-01-01T01/2011-01-01T02
Segment 2011-01-01T00/2011-01-01T01
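Sharding by time can be sketched as grouping rows by the interval their timestamp falls in. The sketch below assumes hourly segments, matching the segment names above; the functions `segment_id` and `build_segments` are hypothetical helpers, and tuples stand in for immutable blocks.

```python
from datetime import datetime, timedelta

# Rolled-up rows, trimmed to (timestamp, page, added, deleted) for brevity.
rolled = [
    ("2011-01-01T00:00:00Z", "Justin Bieber", 25, 127),
    ("2011-01-01T01:00:00Z", "Justin Bieber", 32, 45),
    ("2011-01-01T01:00:00Z", "Ke$ha", 60, 186),
    ("2011-01-01T02:00:00Z", "Ke$ha", 12, 53),
]

def segment_id(timestamp):
    """Return the 'start/end' hourly interval that contains the timestamp."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    start = parsed.replace(minute=0, second=0)
    end = start + timedelta(hours=1)
    fmt = "%Y-%m-%dT%H"
    return f"{start.strftime(fmt)}/{end.strftime(fmt)}"

def build_segments(rows):
    """Group rows by interval; each segment is frozen as a tuple."""
    grouped = {}
    for row in rows:
        grouped.setdefault(segment_id(row[0]), []).append(row)
    return {sid: tuple(block) for sid, block in grouped.items()}

segments = build_segments(rolled)
```

A query over a time range then only needs to touch the segments whose intervals overlap that range.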
                                                                          revenue
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male  USA  1800  25  15.70
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male  USA  2912  42  29.18

‣ Scan/load only what you need
‣ Compression!
‣ Indexes!
-> [111000]
‣ Ke$ha -> [3, 4, 5] -> [000111]
‣ Justin Bieber OR Ke$ha -> [111111]

timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA           15     62
2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            17     87
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            43     99
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53
...
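The bitmap index above can be sketched with plain Python lists of bits, one bitmap per distinct value in the page column. This is an uncompressed illustration only; the helper names `build_bitmaps` and `bitmap_or` are hypothetical.

```python
def build_bitmaps(column):
    """Build one bitmap per distinct value; bit i is set if row i matches."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[row] = 1
    return bitmaps

def bitmap_or(left, right):
    """Combine two bitmaps of equal length with a bitwise OR."""
    return [a | b for a, b in zip(left, right)]

# The page column from the table above, rows 0 through 5.
page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
bitmaps = build_bitmaps(page)
either = bitmap_or(bitmaps["Justin Bieber"], bitmaps["Ke$ha"])
```

A filter such as "Justin Bieber OR Ke$ha" becomes a bitwise operation on two bitmaps, and only the rows whose bit is set need to be read from the measure columns.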
events
  • 3M+ events/s
  • 90% of queries < 1 second
‣ Growing Community
  • 150+ contributors
  • Many client libraries and UIs: R, Python, Perl, Node.js, Grafana, etc.
  • Used in production at numerous large and small organizations
cannot do stateful processing at ingestion time
‣ …like stream-stream joins
‣ …or user session reconstruction
‣ …or a bunch of other useful things!
‣ Many Druid users need an ETL pipeline
streaming operations
‣ …like using short join windows
‣ Limitations of current software
‣ …Kafka and Samza can generate duplicate messages
‣ …Druid streaming ingestion is best-effort
‣ Consider Samza for streaming data processing
‣ Consider Druid for interactive exploration of streams
‣ Have a reprocessing strategy if you’re interested in historical data