REAL-TIME ANALYTICS WITH OPEN SOURCE
TECHNOLOGIES
KAFKA · HADOOP · STORM · DRUID
FANGJIN YANG · GIAN MERLINO
SOFTWARE ENGINEERS @ METAMARKETS
OVERVIEW
PROBLEM · DEALING WITH EVENT DATA
MOTIVATION · EVOLUTION OF A “REAL-TIME” STACK
ARCHITECTURE · THE “RAD”-STACK
NEXT STEPS · TRY IT OUT FOR YOURSELF
THE PROBLEM
Fangjin Yang 2013

Event Stream
WE ARE METAMARKETS...
...AND WE ANALYZE DATA
THE PROBLEM
‣ Arbitrary and interactive exploration
‣ Recency matters! Alert on major changes
‣ Availability
A SOLUTION
‣ Load all your data into Hadoop. Query it. Done!
‣ Good job guys, let’s go home
PROBLEMS OF THE NAIVE SOLUTION
‣ MapReduce can handle almost every distributed computing problem
‣ MapReduce over your raw data is flexible but slow
‣ Hadoop is not optimized for query latency
‣ To optimize queries, we need a query layer
FINDING A SOLUTION
[Diagram: Event Streams → Hadoop → Insight]
MAKE QUERIES FASTER
‣ What types of queries to optimize for?
• Revenue over time broken down by demographic
• Top publishers by clicks over the last month
• Number of unique visitors broken down by any dimension
• Not dumping the entire dataset
• Not examining individual events
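The three query shapes above can be sketched in plain Python over toy rows; the field names and values here are hypothetical illustrations, not Druid's actual query API:

```python
from collections import Counter, defaultdict

# Toy event rows mirroring the slide's examples (hypothetical schema).
events = [
    {"hour": "00", "demo": "18-24", "publisher": "a.com", "user": "u1", "revenue": 1.0, "clicks": 3},
    {"hour": "00", "demo": "25-34", "publisher": "b.com", "user": "u2", "revenue": 2.0, "clicks": 1},
    {"hour": "01", "demo": "18-24", "publisher": "a.com", "user": "u1", "revenue": 0.5, "clicks": 2},
]

# 1. Revenue over time, broken down by demographic (grouped aggregation).
revenue = defaultdict(float)
for e in events:
    revenue[(e["hour"], e["demo"])] += e["revenue"]

# 2. Top publishers by clicks (top-N aggregation).
clicks = Counter()
for e in events:
    clicks[e["publisher"]] += e["clicks"]
top_publishers = clicks.most_common(2)

# 3. Number of unique visitors (count-distinct).
uniques = len({e["user"] for e in events})

print(revenue[("00", "18-24")])  # 1.0
print(top_publishers[0])         # ('a.com', 5)
print(uniques)                   # 2
```

Note that each answer is a small aggregate, never a dump of raw events, which is exactly what a pre-aggregating query layer can optimize for.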
FINDING A SOLUTION
[Diagram: Event Streams → Hadoop → RDBMS → Insight]
DRUID
‣ Druid project started in mid-2011
‣ Open sourced in Oct. 2012
‣ Growing community
• ~30 contributors (not all publicly listed) from many different companies
‣ Designed for low-latency ingestion and aggregation
• Optimized for the types of queries we were trying to make
ARCHITECTURE (EARLY DAYS)
DATA
timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA           15     62
2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            17     87
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            43     99
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53
...
COLUMN COMPRESSION · DICTIONARIES
‣ Create ids
• Justin Bieber -> 0, Ke$ha -> 1
‣ Store
• page -> [0 0 0 1 1 1]
• language -> [0 0 0 0 0 0]
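A minimal sketch of the dictionary-encoding step, assuming ids are assigned in order of first appearance (the helper name is made up for illustration):

```python
def dict_encode(values):
    """Map each distinct value to an integer id, in order of first appearance,
    and rewrite the column as a list of those ids."""
    ids = {}
    column = []
    for v in values:
        if v not in ids:
            ids[v] = len(ids)
        column.append(ids[v])
    return ids, column

pages = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
ids, column = dict_encode(pages)
print(ids)     # {'Justin Bieber': 0, 'Ke$ha': 1}
print(column)  # [0, 0, 0, 1, 1, 1]
```

Storing small integers instead of repeated strings is what makes the column cheap to compress and scan.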
BITMAP INDICES
‣ Justin Bieber -> [0, 1, 2] -> [111000]
‣ Ke$ha -> [3, 4, 5] -> [000111]
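The bitmap construction above can be sketched as follows; lists of 0/1 stand in for the compressed bitmaps a real column store would use:

```python
def bitmap_index(column):
    """One bitmap per distinct value; bit i is set when row i holds that value."""
    n = len(column)
    bitmaps = {}
    for row, value in enumerate(column):
        bits = bitmaps.setdefault(value, [0] * n)
        bits[row] = 1
    return bitmaps

pages = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
bitmaps = bitmap_index(pages)
print(bitmaps["Justin Bieber"])  # [1, 1, 1, 0, 0, 0]

# Filters combine with cheap bitwise ops, e.g. OR of two values:
either = [a | b for a, b in zip(bitmaps["Justin Bieber"], bitmaps["Ke$ha"])]
print(either)  # [1, 1, 1, 1, 1, 1]
```

AND/OR over bitmaps resolves arbitrary dimension filters without touching the raw rows.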
MORE PROBLEMS
‣ We’ve solved the query problem
• Druid gave us arbitrary data exploration & fast queries
‣ What about data freshness?
• Batch loading is slow!
• We need “real-time”
• Alerts, operational monitoring, etc.
A FASTER DATA PIPELINE
THE STORY SO FAR
[Diagram: Event Streams → Hadoop → Druid → Insight]
INGESTION DELAYS
‣ Users grow accustomed to fast queries
‣ But they become frustrated when working with stale data
‣ We want to cover operational needs as well as historical analysis
‣ Two obstacles
‣ Loading raw data into Hadoop
‣ Materializing views into a query engine
FAST DELIVERY WITH KAFKA
‣ High throughput event delivery
‣ Straightforward, reliable design
‣ Buffers incoming data to give consumers time to process it
FAST DELIVERY WITH KAFKA
[Diagram: Producers → Kafka Brokers → Consumers]
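The decoupling the diagram shows can be illustrated with a toy append-only log in the spirit of a Kafka topic (this is not Kafka's actual API, just a sketch of the idea): producers append at their own rate, and each consumer tracks its own offset and reads at its own pace.

```python
class Log:
    """Toy append-only log: the broker buffers messages so producers and
    consumers never have to run in lock-step."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset, plus the new offset."""
        batch = self.messages[offset:offset + max_messages]
        return batch, offset + len(batch)

log = Log()
for event in ["click", "impression", "click"]:
    log.append(event)          # producers run ahead...

batch, offset = log.read(0, 2) # ...while a slow consumer catches up in batches
print(batch, offset)           # ['click', 'impression'] 2
batch, offset = log.read(offset)
print(batch, offset)           # ['click'] 3
```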
FAST LOADING WITH DRUID
‣ We have an indexing system
‣ We have a serving system that runs queries on data
‣ We can serve queries while building indexes!
…SO WE’RE DONE?
‣ For simple use cases, yes!
‣ Now we can load events into Druid in real time
‣ But there are limitations
‣ Deduplication
‣ Joining multiple event streams
‣ Any nontrivial pre-processing
FAST PROCESSING WITH STORM
‣ Storm is a stream processor: one event at a time
‣ We can already process our data using Hadoop MapReduce
‣ Let’s translate that to streams
‣ “Load” operations can stream data from Kafka
‣ “Map” operations are already stream-friendly
‣ “Reduce” operations can be windowed with in-memory state
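A minimal sketch of the windowed "reduce" idea, assuming integer timestamps and per-key counters held in memory until the window rolls over (the function name and stream shape are made up for illustration, not Storm's API):

```python
from collections import defaultdict

def windowed_reduce(stream, window_seconds=60):
    """Accumulate per-key state in memory and emit a partial aggregate
    whenever the time window rolls over."""
    state = defaultdict(int)
    current_window = None
    for timestamp, key, value in stream:
        window = timestamp - timestamp % window_seconds
        if current_window is not None and window != current_window:
            yield current_window, dict(state)   # flush the finished window
            state.clear()
        current_window = window
        state[key] += value
    if current_window is not None:
        yield current_window, dict(state)       # flush the last open window

stream = [(5, "clicks", 1), (20, "clicks", 2), (65, "clicks", 1)]
results = list(windowed_reduce(stream))
print(results)  # [(0, {'clicks': 3}), (60, {'clicks': 1})]
```

The in-memory window is what bounds both latency and accuracy: anything arriving after its window has flushed is missed, which is the limitation the next slides address.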
FAST PROCESSING WITH STORM
[Diagram: Kafka Brokers → Storm Workers → Druid Realtime nodes]
THE STORY SO FAR
[Diagram: Event Streams → Kafka → Storm → Druid → Insight, alongside the original Event Streams → Hadoop → Druid batch path]
WHAT WE BOUGHT
‣ Druid queries reflect new events within seconds
‣ Systems are fully decoupled
‣ No query downtime or delivery bus downtime
‣ Brief processing delays during maintenance
‣ Because we need to restart Storm topologies
‣ But query performance is not affected; only data freshness
WHAT WE GAVE UP
‣ Stream processing isn’t perfect
‣ Difficult to handle corrections of existing data
‣ Windows may be too small for fully accurate operations
‣ Hadoop was actually good at these things
THE RETURN OF HADOOP
‣ Batch processing runs for all data older than a few hours
‣ Stream processing fills the gap
‣ Query broker merges results from both systems
‣ “Fixed up,” immutable, historical data: served by Hadoop
‣ Realtime data: served by Storm & realtime Druid
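The broker's merge rule can be sketched like this, assuming hourly aggregates keyed by hour and a cutoff below which batch results are authoritative (data shapes and names are hypothetical):

```python
def merge_results(batch, realtime, batch_cutoff):
    """Query-broker merge: trust batch (Hadoop) results up to the cutoff hour,
    and realtime (Storm/Druid) results after it."""
    merged = {h: v for h, v in batch.items() if h <= batch_cutoff}
    merged.update({h: v for h, v in realtime.items() if h > batch_cutoff})
    return merged

batch = {10: 100, 11: 105}   # immutable, "fixed up" history
realtime = {11: 98, 12: 40}  # approximate results for still-open hours
print(merge_results(batch, realtime, batch_cutoff=11))
# {10: 100, 11: 105, 12: 40}
```

Where the two systems overlap (hour 11 here), the batch value wins, so corrections made by Hadoop eventually replace the stream processor's approximation.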
THE STACK
[Diagram: Event Streams → Kafka → Storm (real-time, only on-time data) and Hadoop (some hours later, all data) → Druid → Insight]
TRANQUILITY
‣ Used in production at Metamarkets
‣ One job: Push data into Druid in real-time
‣ Manages partitioning, redundancy, and schema changes
‣ Can be used with any JVM language
‣ Includes Storm and Finagle bindings
‣ Open-sourced this week
‣ https://github.com/metamx/tranquility
GET RADICAL
‣ Queries answered quickly, on fresh data
‣ Kafka provides fast, reliable event transport
‣ Storm and Hadoop clean and prepare data for Druid
‣ Druid handles queries and manages the serving layer
‣ “Real-time Analytics Data Stack”
‣ …a.k.a. RAD Stack
‣ …we needed a name