OVERVIEW ‣ Problem ‣ Business intelligence ‣ Analytics possibilities ‣ Choosing the right tools for the job ‣ Architecture ‣ Combining technologies ‣ Next steps ‣ Try it out for yourself
2015 THE PROBLEM ‣ Working with large volumes of data is complex! • Data manipulation/ETL, machine learning, building applications, etc. • Building data systems for business intelligence applications ‣ Dozens of solutions, projects, and methodologies ‣ How do you choose the right tools for the job?
GENERAL SOLUTION LIMITATIONS ‣ When one technology becomes widely adopted, its limitations also become better known ‣ General computing frameworks can handle many different distributed computing problems ‣ They are also sub-optimal for many use cases ‣ Analytic (OLAP) queries in particular are inefficient ‣ Specialized technologies are adopted to address these inefficiencies
MAKE QUERIES FASTER ‣ Optimizing business intelligence (OLAP) queries • Aggregate measures over time, broken down by dimensions • Revenue over time broken down by product type • Top selling products by volume in San Francisco • Number of unique visitors broken down by age • Not dumping the entire dataset • Not examining individual events
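The kind of aggregation these queries perform can be sketched in a few lines of Python (hypothetical in-memory rows and names, not Druid's query API):

```python
from collections import defaultdict

# Hypothetical event rows: (hour bucket, product type, revenue).
events = [
    ("2011-01-01T00", "shoes", 10.0),
    ("2011-01-01T00", "hats", 5.0),
    ("2011-01-01T01", "shoes", 7.5),
]

# "Revenue over time broken down by product type": aggregate a measure
# (revenue) grouped by a time bucket and a dimension. Note we never need
# the individual events in the result, only the aggregates.
totals = defaultdict(float)
for hour, product_type, revenue in events:
    totals[(hour, product_type)] += revenue

print(totals[("2011-01-01T00", "shoes")])  # 10.0
```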
RDBMS ‣ Traditional data warehouse • Row store • Star schema • Aggregate tables • Query cache ‣ Quickly becoming outdated • Scanning raw data is slow and expensive
KEY/VALUE STORES ‣ Pre-computation • Pre-compute every possible query • Pre-compute a subset of queries • Exponential scaling costs ‣ Range scans • Primary key: dimensions/attributes • Value: measures/metrics (things to aggregate) • Still too slow!
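The exponential cost of "pre-compute every possible query" can be made concrete: with n dimensions there are 2^n possible groupings to materialize. A minimal sketch:

```python
from itertools import combinations

dimensions = ["page", "language", "city", "country"]

# Pre-computing every possible query means one aggregate table per
# subset of dimensions: 2^n tables for n dimensions.
subsets = [c for r in range(len(dimensions) + 1)
           for c in combinations(dimensions, r)]

print(len(subsets))  # 16 == 2 ** 4
```

Adding one more dimension doubles the number of aggregate tables, which is why this approach stops scaling quickly.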
COLUMN STORES ‣ Load/scan exactly what you need for a query ‣ Different compression algorithms for different columns ‣ Encoding for string columns ‣ Compression for measure columns ‣ Different indexes for different columns
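Dictionary encoding of a string column, one of the techniques above, can be sketched as follows (illustrative only, not Druid's actual segment format):

```python
# Dictionary-encode a string column: store each distinct value once and
# replace the column with small integer ids, which compress far better
# than repeated strings.
column = ["Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]

dictionary = {}
encoded = []
for value in column:
    if value not in dictionary:
        dictionary[value] = len(dictionary)
    encoded.append(dictionary[value])

print(encoded)     # [0, 0, 1, 1]
print(dictionary)  # {'Justin Bieber': 0, 'Ke$ha': 1}
```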
DATA!

timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
2011-01-01T00:01:53Z  Justin Bieber  en        SF       USA           15     62
2011-01-01T01:02:51Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:01:11Z  Ke$ha          en        Calgary  CA            17     87
2011-01-01T01:02:24Z  Ke$ha          en        Calgary  CA            43     99
2011-01-01T02:03:12Z  Ke$ha          en        Calgary  CA            12     53
...
PRE-AGGREGATION/ROLL-UP

Raw events:

timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
2011-01-01T00:01:53Z  Justin Bieber  en        SF       USA           15     62
2011-01-01T01:02:51Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:01:11Z  Ke$ha          en        Calgary  CA            17     87
2011-01-01T01:02:24Z  Ke$ha          en        Calgary  CA            43     99
2011-01-01T02:03:12Z  Ke$ha          en        Calgary  CA            12     53
...

Rolled up to hourly granularity:

timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA           25     127
2011-01-01T01:00:00Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            60     186
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53
...
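The roll-up above can be reproduced with a small sketch: truncate each timestamp to the hour and sum the measures (simplified to a subset of the slide's columns):

```python
from collections import defaultdict

# Raw events: (timestamp, page, added, deleted).
raw = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", 10, 65),
    ("2011-01-01T00:01:53Z", "Justin Bieber", 15, 62),
    ("2011-01-01T01:02:51Z", "Justin Bieber", 32, 45),
    ("2011-01-01T01:01:11Z", "Ke$ha", 17, 87),
    ("2011-01-01T01:02:24Z", "Ke$ha", 43, 99),
    ("2011-01-01T02:03:12Z", "Ke$ha", 12, 53),
]

# Roll up to hourly granularity: truncate the timestamp to its hour,
# then sum the measures per (hour, dimensions) combination.
rolled = defaultdict(lambda: [0, 0])
for ts, page, added, deleted in raw:
    hour = ts[:13] + ":00:00Z"  # "2011-01-01T00..." -> "2011-01-01T00:00:00Z"
    rolled[(hour, page)][0] += added
    rolled[(hour, page)][1] += deleted

print(rolled[("2011-01-01T00:00:00Z", "Justin Bieber")])  # [25, 127]
```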
PARTITION DATA

timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA           25     127
2011-01-01T01:00:00Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            60     186
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53

‣ Shard data by time
‣ Immutable blocks of data called “segments”

Segment 2011-01-01T00/2011-01-01T01
Segment 2011-01-01T01/2011-01-01T02
Segment 2011-01-01T02/2011-01-01T03
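Time-based sharding can be sketched as grouping rows by their hour, with each group becoming one immutable segment covering that interval (an illustrative sketch, not Druid's segment format):

```python
from collections import defaultdict

rows = [
    ("2011-01-01T00:00:00Z", "Justin Bieber", 25, 127),
    ("2011-01-01T01:00:00Z", "Justin Bieber", 32, 45),
    ("2011-01-01T01:00:00Z", "Ke$ha", 60, 186),
    ("2011-01-01T02:00:00Z", "Ke$ha", 12, 53),
]

# Shard by hour: each bucket corresponds to one segment interval,
# e.g. 2011-01-01T00/2011-01-01T01, and is immutable once written.
segments = defaultdict(list)
for row in rows:
    hour = row[0][:13]
    segments[hour].append(row)

print(sorted(segments))  # ['2011-01-01T00', '2011-01-01T01', '2011-01-01T02']
print(len(segments["2011-01-01T01"]))  # 2
```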
IMMUTABLE SEGMENTS ‣ Fundamental storage unit in Druid ‣ No contention between reads and writes ‣ One thread scans one segment ‣ Multiple threads can access the same underlying data
2013 COLUMN ORIENTATION

timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18

‣ Scan/load only what you need ‣ Compression! ‣ Indexes!
BITMAP INDICES

‣ Justin Bieber -> rows [0, 1, 2] -> [111000]
‣ Ke$ha -> rows [3, 4, 5] -> [000111]
‣ Justin Bieber OR Ke$ha -> [111111]

timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
2011-01-01T00:03:53Z  Justin Bieber  en        SF       USA           15     62
2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            17     87
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            43     99
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53
...
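The bitmap operations on this slide can be reproduced with plain Python integers as bitsets; Druid itself uses compressed bitmap indexes, so this is only an illustration of the idea:

```python
# Build one bitmap per dimension value over the slide's six rows:
# bit i is set when row i contains that value.
pages = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3

bitmaps = {}
for row, page in enumerate(pages):
    bitmaps.setdefault(page, 0)
    bitmaps[page] |= 1 << row  # set bit `row`

def rows_of(bits, n):
    """Decode a bitmap back into the list of matching row numbers."""
    return [i for i in range(n) if bits >> i & 1]

# A filter like "Justin Bieber OR Ke$ha" is a single bitwise OR,
# with no need to scan the column values.
print(rows_of(bitmaps["Justin Bieber"], 6))                      # [0, 1, 2]
print(rows_of(bitmaps["Justin Bieber"] | bitmaps["Ke$ha"], 6))   # [0, 1, 2, 3, 4, 5]
```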
DRUID ‣ Production ready ‣ Scale • 100+ trillion events • 3M+ events/s • 90% of queries < 1 second ‣ Growing community • 150+ contributors • Many client libraries and UIs: R, Python, Perl, Node.js, Grafana, etc. • Used in production at numerous large and small organizations
DRUID INGESTION ‣ Must have denormalized, flat data ‣ Druid cannot do stateful processing at ingestion time ‣ …like stream-stream joins ‣ …or user session reconstruction ‣ …or a bunch of other useful things! ‣ Many Druid users need an ETL pipeline
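A sketch of the kind of flattening such an ETL pipeline performs: joining a lookup table into each event so Druid receives denormalized rows. The field names here are hypothetical:

```python
# Upstream lookup table (e.g. a user-profile store; hypothetical shape).
users = {"u1": {"age": 25, "country": "USA"}}

# Raw events reference users by id, i.e. they are not flat yet.
events = [{"user_id": "u1", "page": "Justin Bieber", "added": 10}]

# Denormalize before ingestion: merge the user attributes into each
# event so every row is self-contained and flat.
flat = [{**e, **users[e["user_id"]]} for e in events]

print(flat[0]["country"])  # USA
```

Stateful work like this (joins, sessionization) is what the slide says must happen upstream of Druid.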
WHY REPROCESS DATA? ‣ Bugs in processing code ‣ Imprecise streaming operations ‣ …like using short join windows ‣ Limitations of current software ‣ …Kafka, Samza can generate duplicate messages ‣ …Druid streaming ingestion is best-effort
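One common mitigation for duplicate messages during reprocessing is deduplicating on a unique event id; a minimal sketch, assuming each message carries such an id (hypothetical message shape):

```python
# Streaming systems can redeliver a message more than once; if each
# message carries a unique id, a reprocessing job can drop repeats.
messages = [
    {"id": "e1", "added": 10},
    {"id": "e2", "added": 15},
    {"id": "e1", "added": 10},  # duplicate redelivery
]

seen = set()
unique = []
for m in messages:
    if m["id"] not in seen:
        seen.add(m["id"])
        unique.append(m)

print(len(unique))  # 2
```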
LAMBDA ARCHITECTURES ‣ Advantages? • Works as advertised • Works with a huge variety of open software • Druid supports batch-replace-by-time-range through Hadoop
LAMBDA ARCHITECTURES ‣ Disadvantages? ‣ Need code to run on two very different systems ‣ Maintaining two codebases is perilous ‣ …productivity loss ‣ …code drift ‣ …difficulty training new developers
KAPPA ARCHITECTURE ‣ Pure streaming ‣ Reprocess data by replaying the input stream ‣ Doesn’t require operating two systems ‣ Doesn’t overcome software limitations ‣ http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
TAKEAWAYS ‣ Consider Kafka for making your streams available ‣ Consider Samza for streaming data processing ‣ Consider Druid for interactive exploration of streams ‣ Have a reprocessing strategy if you’re interested in historical data