OVERVIEW ‣ Problem ‣ Business intelligence ‣ Analytics possibilities ‣ Choosing the right tools for the job ‣ Architecture ‣ Combining technologies ‣ Next steps ‣ Try it out for yourself
2015 THE PROBLEM ‣ Working with large volumes of data is complex! • Data manipulation/ETL, machine learning, building applications, etc. • Building data systems for business intelligence applications ‣ Dozens of solutions, projects, and methodologies ‣ How do you choose the right tools for the job?
GENERAL SOLUTION LIMITATIONS ‣ When one technology becomes widely adopted, its limitations also become better known ‣ General computing frameworks can handle many different distributed computing problems ‣ They are also sub-optimal for many use cases ‣ Analytic (OLAP) queries in particular are inefficient ‣ Specialized technologies are adopted to address these inefficiencies
MAKE QUERIES FASTER ‣ Optimizing business intelligence (OLAP) queries • Aggregate measures over time, broken down by dimensions • Revenue over time broken down by product type • Top selling products by volume in San Francisco • Number of unique visitors broken down by age • Not dumping the entire dataset • Not examining individual events
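The kind of aggregation these queries perform can be sketched in a few lines of Python (hypothetical in-memory rows and names, not Druid's query API):

```python
from collections import defaultdict

# Hypothetical event rows: (hour bucket, product type, revenue).
events = [
    ("2011-01-01T00", "shoes", 10.0),
    ("2011-01-01T00", "hats", 5.0),
    ("2011-01-01T01", "shoes", 7.5),
]

# "Revenue over time broken down by product type": aggregate a measure
# (revenue) grouped by a time bucket and a dimension. Note we never need
# the individual events in the result, only the aggregates.
totals = defaultdict(float)
for hour, product_type, revenue in events:
    totals[(hour, product_type)] += revenue

print(totals[("2011-01-01T00", "shoes")])  # 10.0
```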
RDBMS ‣ Traditional data warehouse • Row store • Star schema • Aggregate tables • Query cache ‣ Quickly becoming outdated • Scanning raw data is slow and expensive
KEY/VALUE STORES ‣ Pre-computation • Pre-compute every possible query • Pre-compute a subset of queries • Exponential scaling costs ‣ Range scans • Primary key: dimensions/attributes • Value: measures/metrics (things to aggregate) • Still too slow!
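The exponential cost of "pre-compute every possible query" can be made concrete: with n dimensions there are 2^n possible groupings to materialize. A minimal sketch:

```python
from itertools import combinations

dimensions = ["page", "language", "city", "country"]

# Pre-computing every possible query means one aggregate table per
# subset of dimensions: 2^n tables for n dimensions.
subsets = [c for r in range(len(dimensions) + 1)
           for c in combinations(dimensions, r)]

print(len(subsets))  # 16 == 2 ** 4
```

Adding one more dimension doubles the number of aggregate tables, which is why this approach stops scaling quickly.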
COLUMN STORES ‣ Load/scan exactly what you need for a query ‣ Different compression algorithms for different columns ‣ Encoding for string columns ‣ Compression for measure columns ‣ Different indexes for different columns
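Dictionary encoding of a string column, one of the techniques above, can be sketched as follows (illustrative only, not Druid's actual segment format):

```python
# Dictionary-encode a string column: store each distinct value once and
# replace the column with small integer ids, which compress far better
# than repeated strings.
column = ["Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]

dictionary = {}
encoded = []
for value in column:
    if value not in dictionary:
        dictionary[value] = len(dictionary)
    encoded.append(dictionary[value])

print(encoded)     # [0, 0, 1, 1]
print(dictionary)  # {'Justin Bieber': 0, 'Ke$ha': 1}
```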
DATA!

timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
2011-01-01T00:01:53Z  Justin Bieber  en        SF       USA           15     62
2011-01-01T01:02:51Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:01:11Z  Ke$ha          en        Calgary  CA            17     87
2011-01-01T01:02:24Z  Ke$ha          en        Calgary  CA            43     99
2011-01-01T02:03:12Z  Ke$ha          en        Calgary  CA            12     53
...
PRE-AGGREGATION/ROLL-UP

Raw events:

timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
2011-01-01T00:01:53Z  Justin Bieber  en        SF       USA           15     62
2011-01-01T01:02:51Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:01:11Z  Ke$ha          en        Calgary  CA            17     87
2011-01-01T01:02:24Z  Ke$ha          en        Calgary  CA            43     99
2011-01-01T02:03:12Z  Ke$ha          en        Calgary  CA            12     53
...

Rolled up to hourly granularity:

timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA           25     127
2011-01-01T01:00:00Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            60     186
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53
...
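The roll-up above can be reproduced with a small sketch: truncate each timestamp to the hour and sum the measures (simplified to a subset of the slide's columns):

```python
from collections import defaultdict

# Raw events: (timestamp, page, added, deleted).
raw = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", 10, 65),
    ("2011-01-01T00:01:53Z", "Justin Bieber", 15, 62),
    ("2011-01-01T01:02:51Z", "Justin Bieber", 32, 45),
    ("2011-01-01T01:01:11Z", "Ke$ha", 17, 87),
    ("2011-01-01T01:02:24Z", "Ke$ha", 43, 99),
    ("2011-01-01T02:03:12Z", "Ke$ha", 12, 53),
]

# Roll up to hourly granularity: truncate the timestamp to its hour,
# then sum the measures per (hour, dimensions) combination.
rolled = defaultdict(lambda: [0, 0])
for ts, page, added, deleted in raw:
    hour = ts[:13] + ":00:00Z"  # "2011-01-01T00..." -> "2011-01-01T00:00:00Z"
    rolled[(hour, page)][0] += added
    rolled[(hour, page)][1] += deleted

print(rolled[("2011-01-01T00:00:00Z", "Justin Bieber")])  # [25, 127]
```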
PARTITION DATA

timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA           25     127
2011-01-01T01:00:00Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            60     186
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53

‣ Shard data by time
‣ Immutable blocks of data called “segments”

Segment 2011-01-01T00/2011-01-01T01
Segment 2011-01-01T01/2011-01-01T02
Segment 2011-01-01T02/2011-01-01T03
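Time-based sharding can be sketched as grouping rows by their hour, with each group becoming one immutable segment covering that interval (an illustrative sketch, not Druid's segment format):

```python
from collections import defaultdict

rows = [
    ("2011-01-01T00:00:00Z", "Justin Bieber", 25, 127),
    ("2011-01-01T01:00:00Z", "Justin Bieber", 32, 45),
    ("2011-01-01T01:00:00Z", "Ke$ha", 60, 186),
    ("2011-01-01T02:00:00Z", "Ke$ha", 12, 53),
]

# Shard by hour: each bucket corresponds to one segment interval,
# e.g. 2011-01-01T00/2011-01-01T01, and is immutable once written.
segments = defaultdict(list)
for row in rows:
    hour = row[0][:13]
    segments[hour].append(row)

print(sorted(segments))  # ['2011-01-01T00', '2011-01-01T01', '2011-01-01T02']
print(len(segments["2011-01-01T01"]))  # 2
```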
IMMUTABLE SEGMENTS ‣ Fundamental storage unit in Druid ‣ No contention between reads and writes ‣ One thread scans one segment ‣ Multiple threads can access the same underlying data
2013 COLUMN ORIENTATION

timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18

‣ Scan/load only what you need ‣ Compression! ‣ Indexes!
BITMAP INDICES

‣ Justin Bieber -> rows [0, 1, 2] -> [111000]
‣ Ke$ha -> rows [3, 4, 5] -> [000111]
‣ Justin Bieber OR Ke$ha -> [111111]

timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
2011-01-01T00:03:53Z  Justin Bieber  en        SF       USA           15     62
2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            17     87
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            43     99
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53
...
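The bitmap operations on this slide can be reproduced with plain Python integers as bitsets; Druid itself uses compressed bitmap indexes, so this is only an illustration of the idea:

```python
# Build one bitmap per dimension value over the slide's six rows:
# bit i is set when row i contains that value.
pages = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3

bitmaps = {}
for row, page in enumerate(pages):
    bitmaps.setdefault(page, 0)
    bitmaps[page] |= 1 << row  # set bit `row`

def rows_of(bits, n):
    """Decode a bitmap back into the list of matching row numbers."""
    return [i for i in range(n) if bits >> i & 1]

# A filter like "Justin Bieber OR Ke$ha" is a single bitwise OR,
# with no need to scan the column values.
print(rows_of(bitmaps["Justin Bieber"], 6))                      # [0, 1, 2]
print(rows_of(bitmaps["Justin Bieber"] | bitmaps["Ke$ha"], 6))   # [0, 1, 2, 3, 4, 5]
```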
DRUID ‣ Production ready ‣ Scale • 100+ trillion events • 3M+ events/s • 90% of queries < 1 second ‣ Growing community • 150+ contributors • Many client libraries and UIs: R, Python, Perl, Node.js, Grafana, etc. • Used in production at numerous large and small organizations
DRUID INGESTION ‣ Must have denormalized, flat data ‣ Druid cannot do stateful processing at ingestion time ‣ …like stream-stream joins ‣ …or user session reconstruction ‣ …or a bunch of other useful things! ‣ Many Druid users need an ETL pipeline
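A sketch of the kind of flattening such an ETL pipeline performs: joining a lookup table into each event so Druid receives denormalized rows. The field names here are hypothetical:

```python
# Upstream lookup table (e.g. a user-profile store; hypothetical shape).
users = {"u1": {"age": 25, "country": "USA"}}

# Raw events reference users by id, i.e. they are not flat yet.
events = [{"user_id": "u1", "page": "Justin Bieber", "added": 10}]

# Denormalize before ingestion: merge the user attributes into each
# event so every row is self-contained and flat.
flat = [{**e, **users[e["user_id"]]} for e in events]

print(flat[0]["country"])  # USA
```

Stateful work like this (joins, sessionization) is what the slide says must happen upstream of Druid.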
WHY REPROCESS DATA? ‣ Bugs in processing code ‣ Imprecise streaming operations ‣ …like using short join windows ‣ Limitations of current software ‣ …Kafka, Samza can generate duplicate messages ‣ …Druid streaming ingestion is best-effort
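One common mitigation for duplicate messages during reprocessing is deduplicating on a unique event id; a minimal sketch, assuming each message carries such an id (hypothetical message shape):

```python
# Streaming systems can redeliver a message more than once; if each
# message carries a unique id, a reprocessing job can drop repeats.
messages = [
    {"id": "e1", "added": 10},
    {"id": "e2", "added": 15},
    {"id": "e1", "added": 10},  # duplicate redelivery
]

seen = set()
unique = []
for m in messages:
    if m["id"] not in seen:
        seen.add(m["id"])
        unique.append(m)

print(len(unique))  # 2
```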
LAMBDA ARCHITECTURES ‣ Advantages? • Works as advertised • Works with a huge variety of open software • Druid supports batch-replace-by-time-range through Hadoop
LAMBDA ARCHITECTURES ‣ Disadvantages? ‣ Need code to run on two very different systems ‣ Maintaining two codebases is perilous ‣ …productivity loss ‣ …code drift ‣ …difficulty training new developers
KAPPA ARCHITECTURE ‣ Pure streaming ‣ Reprocess data by replaying the input stream ‣ Doesn’t require operating two systems ‣ Doesn’t overcome software limitations ‣ http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
TAKEAWAYS ‣ Consider Kafka for making your streams available ‣ Consider Samza for streaming data processing ‣ Consider Druid for interactive exploration of streams ‣ Have a reprocessing strategy if you’re interested in historical data