OVERVIEW
‣ The problem
‣ Business intelligence
‣ Analytics possibilities
‣ Choosing the right tools for the job
‣ Architecture: combining technologies
‣ Next steps: try it out for yourself
THE PROBLEM
‣ Working with large volumes of data is complex!
• Data manipulations/ETL, machine learning, IoT, etc.
• Building data systems for business intelligence applications
‣ Dozens of solutions, projects, and methodologies
‣ How to choose the right tools for the job?
GENERAL SOLUTION LIMITATIONS
‣ When one technology becomes widely adopted, its limitations also become better known
‣ General computing frameworks can handle many different distributed computing problems
‣ They are also sub-optimal for many use cases
‣ Analytic queries are inefficient
‣ Specialized technologies are adopted to address these inefficiencies
MAKE QUERIES FASTER
‣ Optimizing business intelligence (OLAP) queries
• Aggregate measures over time, broken down by dimensions
• Revenue over time, broken down by product type
• Top-selling products by volume in San Francisco
• Number of unique visitors, broken down by age
• Not dumping the entire dataset
• Not examining individual events
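To make the query shape concrete, here is a minimal sketch in Python/pandas of one example above ("revenue over time, broken down by product type"). The DataFrame and its column names are invented for illustration; this is not Druid's API.

    # An OLAP-style aggregation: a measure (revenue) rolled up
    # over a time bucket and a dimension (product_type).
    import pandas as pd

    events = pd.DataFrame({
        "timestamp": pd.to_datetime(["2015-06-01", "2015-06-01", "2015-06-02"]),
        "product_type": ["book", "toy", "book"],
        "revenue": [20.0, 5.0, 12.5],
    })

    result = events.groupby(
        [pd.Grouper(key="timestamp", freq="D"), "product_type"]
    )["revenue"].sum()
    print(result)

Note that the query never inspects individual events in its output; it only returns aggregates per (day, product_type) group.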
RDBMS
‣ Traditional data warehouse
• Row store
• Star schema
• Aggregate tables
• Query cache
‣ Fast becoming outdated
• Scanning raw data is slow and expensive
KEY/VALUE STORES
‣ Pre-computation
• Pre-compute every possible query
• Pre-compute a subset of queries
• Exponential scaling costs (see the sketch below)
‣ Range scans
• Primary key: dimensions/attributes
• Value: measures/metrics (things to aggregate)
• Still too slow!
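A rough illustration of why "pre-compute every possible query" scales exponentially: with n dimensions there are 2^n subsets of dimensions you might group by, each needing its own pre-aggregated table. The dimension names below are made up.

    # Counting the aggregate tables needed to cover every
    # possible group-by combination of n dimensions.
    from itertools import combinations

    dimensions = ["product", "city", "age", "referrer", "device"]

    subsets = sum(
        1 for r in range(len(dimensions) + 1)
        for _ in combinations(dimensions, r)
    )
    print(subsets)  # 2^5 = 32 tables for just 5 dimensions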
COLUMN STORES
‣ Load/scan exactly what you need for a query
‣ Different compression algorithms for different columns
• Encoding for string columns (sketched below)
• Compression for measure columns
‣ Different indexes for different columns
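As a minimal sketch of per-column encoding, here is dictionary encoding of a string column in plain Python: each distinct string is mapped to a small integer ID, so the column is stored as compact IDs plus one lookup table.

    def dictionary_encode(values):
        # Assign each distinct string a small integer ID.
        ids = {}
        encoded = []
        for v in values:
            if v not in ids:
                ids[v] = len(ids)
            encoded.append(ids[v])
        return encoded, ids

    column = ["alice", "bob", "alice", "carol", "alice"]
    encoded, lookup = dictionary_encode(column)
    print(encoded)  # [0, 1, 0, 2, 0]
    print(lookup)   # {'alice': 0, 'bob': 1, 'carol': 2}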
DRUID
‣ Open source column store
‣ Unique optimizations for event data
‣ Data partitioning/sharding is done first on time
‣ Data is partitioned into defined time buckets (hour/day/etc.)
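A minimal sketch of time-first partitioning, assuming hourly buckets: every event is assigned to a time chunk before any other sharding happens. The function below is illustrative, not Druid's internals.

    from datetime import datetime

    def segment_bucket(event_time: datetime) -> str:
        # Truncate to the hour; all events in the same hour land
        # in the same time bucket.
        return event_time.strftime("%Y-%m-%dT%H:00:00")

    print(segment_bucket(datetime(2015, 6, 1, 14, 37)))
    # 2015-06-01T14:00:00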
IMMUTABLE SEGMENTS
‣ No contention between reads and writes
‣ Simple parallelization: one thread scans one segment
‣ Multiple threads can access the same underlying data
‣ Supports lots of concurrent reads
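A minimal sketch of this scan model in Python: because segments are immutable, worker threads can scan them in parallel without any locking, and partial results are merged at the end. The lists stand in for segments.

    from concurrent.futures import ThreadPoolExecutor

    segments = [
        [1, 2, 3],       # each list stands in for one immutable segment
        [4, 5],
        [6, 7, 8, 9],
    ]

    def scan(segment):
        # One thread scans one segment; no writes, so no contention.
        return sum(segment)

    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(scan, segments))

    print(sum(partials))  # merge per-segment partial results: 45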
CREATING COLUMNS
‣ Incrementally build columns
‣ Columns are created as data is streamed in
[Diagram: Streaming Events → In-memory Row Buffer → (Convert) → Column Format]
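A minimal sketch of the row-to-column conversion in the diagram: buffered rows are transposed into one array per column. The row shape is invented for illustration.

    rows = [
        {"user": "alice", "page": "/foo", "count": 2},
        {"user": "bob",   "page": "/bar", "count": 1},
    ]

    def to_columns(row_buffer):
        # Transpose the buffered rows: one list per column.
        columns = {}
        for row in row_buffer:
            for name, value in row.items():
                columns.setdefault(name, []).append(value)
        return columns

    print(to_columns(rows))
    # {'user': ['alice', 'bob'], 'page': ['/foo', '/bar'], 'count': [2, 1]}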
INDEXES
‣ Druid can translate any Boolean predicate into rows to be scanned
‣ Creates inverted indexes for dimension values

WHERE user = 'alice' OR user = 'bob'
  alice → [0 1 0 0 1 0]
  bob   → [1 0 0 0 0 1]
  OR    = [1 1 0 0 1 1]
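A minimal sketch of answering that OR predicate with bitmap (inverted) indexes: one bit per row, and the union of the two indexes is a single bitwise OR.

    alice = 0b010010   # rows where user = 'alice' (leftmost bit = row 0)
    bob   = 0b100001   # rows where user = 'bob'

    matches = alice | bob          # rows satisfying the WHERE clause
    print(format(matches, "06b"))  # 110011: scan only these rows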
DRUID - KEY FEATURES
‣ Column store optimized for event data and BI queries
‣ Supports lots of concurrent reads
‣ Streaming data ingestion
‣ Supports extremely fast filters
‣ Ideal for powering user-facing analytic applications
DRUID
‣ Production ready
‣ Scale
• 60+ trillion events
• Over 3M events/s
• 90% of queries < 1 second
‣ Growing community
• 120+ contributors
• Many client libraries and UIs: R, Python, Perl, Node.js, Grafana, etc.
• Used in production at numerous large and small organizations
LOAD FROM HADOOP
‣ Hadoop M/R loader: a DeterminePartitionsJob followed by an IndexGeneratorJob
‣ Creates a set of segments, each 500MB–1GB
‣ One reducer creates one segment
‣ Can replace or append to existing data
[Diagram: events flow through the M/R loader into columnar segments, e.g. timestamps [144e10, 144e10, 144e10, 145e10], users [alice, alice, bob, carol], pages [/foo, /bar, /bar, /baz], counts [2, 1, 1, 1]]
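A minimal sketch of the two-phase batch load, using a simplified in-process stand-in for the M/R jobs; real DeterminePartitionsJob also balances partition sizes toward the 500MB–1GB target, which is elided here.

    from collections import defaultdict

    events = [
        {"ts": 144e10, "user": "alice", "page": "/foo", "count": 2},
        {"ts": 144e10, "user": "bob",   "page": "/bar", "count": 1},
        {"ts": 145e10, "user": "carol", "page": "/baz", "count": 1},
    ]

    # Phase 1 (DeterminePartitionsJob): decide segment boundaries,
    # here simply by timestamp.
    partitions = defaultdict(list)
    for e in events:
        partitions[e["ts"]].append(e)

    # Phase 2 (IndexGeneratorJob): one reducer builds one segment,
    # writing each partition out in column format.
    for ts, part in sorted(partitions.items()):
        segment = {k: [e[k] for e in part] for k in part[0]}
        print(ts, segment)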
KAFKA PULL (CURRENT)
‣ Kafka Firehose uses the Kafka high-level consumer
• (+) commits offsets when data is persisted to disk (sketched below)
• (+) easy to scale ingestion up and down
• (-) not HA
• (-) can generate duplicates during rebalances
[Diagram: Events → Kafka → High Level Consumer → Kafka Firehose → Task #N]
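A minimal sketch of "commit offsets only after persisting", using the kafka-python client; the topic and group names are made up, and persist_to_disk is a hypothetical stand-in for writing a spill file.

    from kafka import KafkaConsumer

    def persist_to_disk(rows):
        # Hypothetical stand-in for persisting a batch to disk.
        print(f"persisted {len(rows)} rows")

    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        group_id="druid-ingest",
        enable_auto_commit=False,   # we commit manually, below
    )

    batch = []
    for message in consumer:
        batch.append(message.value)
        if len(batch) >= 1000:
            persist_to_disk(batch)
            consumer.commit()       # only now acknowledge the offsets
            batch = []

If the process dies between persisting and committing, the group rebalances and another consumer re-reads the uncommitted range, which is exactly the duplicate-on-rebalance weakness the slide notes.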
KAFKA PULL (NEXT-GEN)
‣ New Kafka ingestion uses the Kafka simple or new consumer
• (+) stores offsets along with Druid segments
• (+) easy to scale ingestion up and down
• (+) HA: control over who consumes what
• (+) no duplicates during rebalances
[Diagram: Events → Kafka → Simple/New Consumer → New Kafka Ingestion → Task #N]
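A minimal sketch of the next-gen approach with kafka-python: the task explicitly controls which partition it reads and resumes from an offset stored with its own output, so there is no group rebalance to duplicate data. Topic name and stored offset are illustrative.

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092")

    partition = TopicPartition("events", 0)
    consumer.assign([partition])   # explicit assignment, no group rebalance

    stored_offset = 42             # hypothetical: read back from the segment
    consumer.seek(partition, stored_offset)

    records = consumer.poll(timeout_ms=1000)
    # ...persist the records and the new offset atomically together...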
WHY REPROCESS DATA?
‣ Bugs in processing code
‣ Need to restate existing data
‣ Limitations of current software
• …Kafka 0.8.x, Samza 0.9.x can generate duplicate messages
• …Druid 0.7.x streaming ingestion is best-effort
HYBRID BATCH/REALTIME
‣ Advantages?
• Works as advertised
• Works with a huge variety of open software
• Druid supports realtime ingest of streams
• Druid supports batch replace with Hadoop (see the sketch below)
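A minimal sketch of the hybrid idea: serve approximate realtime data for an interval until an exact batch-built view replaces it. The dictionaries are illustrative structures, not Druid's internals.

    realtime = {"2015-06-01": {"revenue": 98.0}}   # approximate, streamed
    batch    = {"2015-06-01": {"revenue": 100.0}}  # exact, from Hadoop

    def lookup(interval):
        # Batch results take precedence once they exist.
        return batch.get(interval, realtime.get(interval))

    print(lookup("2015-06-01"))  # {'revenue': 100.0}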
KAPPA ARCHITECTURE
‣ Pure streaming
‣ Reprocess data by replaying the input stream from the start
‣ Doesn't require operating two systems
‣ I don't have much experience with this
‣ http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
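A minimal sketch of a Kappa-style reprocess with kafka-python: point a fresh consumer at the beginning of the retained stream and replay everything through the (fixed) processing code. Topic name is illustrative.

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
    partition = TopicPartition("events", 0)
    consumer.assign([partition])
    consumer.seek_to_beginning(partition)  # replay the input from offset 0

    for message in consumer:
        pass  # re-run the corrected processing code over every event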
TAKEAWAYS
‣ Consider Kafka for making your streams available
‣ Consider Samza for streaming data integration
‣ Consider Druid for interactive exploration of streams
‣ Have a reprocessing strategy if you're interested in historical data