Slide 1

Slide 1 text

REAL-TIME ANALYTICS WITH OPEN SOURCE TECHNOLOGIES
KAFKA · HADOOP · STORM · DRUID
FANGJIN YANG · GIAN MERLINO · SOFTWARE ENGINEERS @ METAMARKETS

Slide 2

Slide 2 text

OVERVIEW
PROBLEM · DEALING WITH EVENT DATA
MOTIVATION · EVOLUTION OF A “REAL-TIME” STACK
ARCHITECTURE · THE “RAD”-STACK
NEXT STEPS · TRY IT OUT FOR YOURSELF

Slide 3

Slide 3 text

THE PROBLEM

Slide 4

Slide 4 text


Slide 5

Slide 5 text


Slide 6

Slide 6 text


Slide 7

Slide 7 text

Event Stream

Slide 8

Slide 8 text

Event Stream

Slide 9

Slide 9 text

WE ARE METAMARKETS... ...AND WE ANALYZE DATA

Slide 10

Slide 10 text

THE PROBLEM
‣ Arbitrary and interactive exploration
‣ Recency matters! Alert on major changes
‣ Availability

Slide 11

Slide 11 text

A SOLUTION
‣ Load all your data into Hadoop. Query it. Done!
‣ Good job guys, let’s go home

Slide 12

Slide 12 text

PROBLEMS OF THE NAIVE SOLUTION
‣ MapReduce can handle almost every distributed computing problem
‣ MapReduce over your raw data is flexible but slow
‣ Hadoop is not optimized for query latency
‣ To optimize queries, we need a query layer

Slide 13

Slide 13 text

FINDING A SOLUTION
Event Streams → Hadoop → Insight

Slide 14

Slide 14 text

FINDING A SOLUTION
Event Streams → Hadoop → Query Engine → Insight

Slide 15

Slide 15 text

A FASTER QUERY LAYER

Slide 16

Slide 16 text

MAKE QUERIES FASTER
‣ What types of queries to optimize for?
  • Revenue over time broken down by demographic
  • Top publishers by clicks over the last month
  • Number of unique visitors broken down by any dimension
  • Not dumping the entire dataset
  • Not examining individual events

Slide 17

Slide 17 text

FINDING A SOLUTION
Event Streams → Hadoop → RDBMS → Insight

Slide 18

Slide 18 text

FINDING A SOLUTION
Event Streams → Hadoop → NoSQL K/V Stores → Insight

Slide 19

Slide 19 text

FINDING A SOLUTION
Event Streams → Hadoop → “In memory” → Insight

Slide 20

Slide 20 text

FINDING A SOLUTION
Event Streams → Hadoop → Redshift/Vertica → Insight

Slide 21

Slide 21 text

DRUID AS A QUERY ENGINE

Slide 22

Slide 22 text

DRUID
‣ Druid project started in mid 2011
‣ Open sourced in Oct. 2012
‣ Growing Community
  • ~30 contributors (not all publicly listed) from many different companies
‣ Designed for low latency ingestion and aggregation
  • Optimized for the types of queries we were trying to make
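To make “the types of queries we were trying to make” concrete, here is a hypothetical Druid topN query (“top publishers by clicks over the last month”, one of the shapes from the MAKE QUERIES FASTER slide) posted to a broker over HTTP. The datasource, dimension and metric names, host, and interval are illustrative assumptions, not taken from the deck.

// Minimal sketch: issuing a Druid topN query against a broker's HTTP endpoint.
// Assumes a broker at localhost:8082 and a datasource named "impressions" with a
// "publisher" dimension and a "clicks" metric -- all hypothetical names.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TopPublishers {
  public static void main(String[] args) throws Exception {
    String query = """
        {
          "queryType": "topN",
          "dataSource": "impressions",
          "intervals": ["2013-10-01/2013-11-01"],
          "granularity": "all",
          "dimension": "publisher",
          "metric": "clicks",
          "threshold": 10,
          "aggregations": [
            { "type": "longSum", "name": "clicks", "fieldName": "clicks" }
          ]
        }
        """;

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8082/druid/v2/"))   // Druid broker query endpoint
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(query))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());  // JSON array of per-publisher click totals
  }
}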

Slide 23

Slide 23 text

ARCHITECTURE (EARLY DAYS)

Slide 24

Slide 24 text

DATA

timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA           15     62
2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA           32     45
2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            17     87
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            43     99
2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53
...

Slide 25

Slide 25 text

COLUMN COMPRESSION · DICTIONARIES
‣ Create ids
  • Justin Bieber -> 0, Ke$ha -> 1
‣ Store
  • page -> [0 0 0 1 1 1]
  • language -> [0 0 0 0 0 0]
(same example table as the DATA slide)
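A minimal sketch (not from the deck) of the dictionary-encoding idea above: each distinct string in a column gets an integer id on first appearance, and the column is stored as an array of those ids.

// Toy illustration of dictionary encoding for the "page" column.
// Not Druid's actual implementation, just the idea from the slide.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryColumn {
  private final Map<String, Integer> dictionary = new HashMap<>(); // value -> id
  private final List<Integer> ids = new ArrayList<>();             // encoded column

  public void append(String value) {
    // "Justin Bieber" -> 0, "Ke$ha" -> 1, assigned in order of first appearance
    int id = dictionary.computeIfAbsent(value, v -> dictionary.size());
    ids.add(id);
  }

  public static void main(String[] args) {
    DictionaryColumn page = new DictionaryColumn();
    for (String v : new String[] {
        "Justin Bieber", "Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha", "Ke$ha"}) {
      page.append(v);
    }
    System.out.println(page.ids);  // [0, 0, 0, 1, 1, 1], as on the slide
  }
}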

Slide 26

Slide 26 text

BITMAP INDICES
‣ Justin Bieber -> [0, 1, 2] -> [111000]
‣ Ke$ha -> [3, 4, 5] -> [000111]
(same example table as the DATA slide)

Slide 27

Slide 27 text

FAST AND FLEXIBLE QUERIES

row  page
0    Justin Bieber
1    Justin Bieber
2    Ke$ha
3    Ke$ha

JUSTIN BIEBER           [1, 1, 0, 0]
KE$HA                   [0, 0, 1, 1]
JUSTIN BIEBER OR KE$HA  [1, 1, 1, 1]
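The bitmap indices from the last two slides can be sketched in a few lines: one bitmap per dictionary value with a bit set for each matching row, so a filter like “Justin Bieber OR Ke$ha” is a bitwise OR. Druid uses compressed bitmap formats; java.util.BitSet is used here only to show the idea.

// Toy bitmap index over the "page" column from the slides.
// Real Druid uses compressed bitmaps; BitSet keeps the sketch simple.
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class BitmapIndexDemo {
  public static void main(String[] args) {
    String[] page = {"Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"};

    // Build one bitmap per distinct value: bit i is set if row i has that value.
    Map<String, BitSet> index = new HashMap<>();
    for (int row = 0; row < page.length; row++) {
      index.computeIfAbsent(page[row], v -> new BitSet(page.length)).set(row);
    }
    System.out.println(index.get("Justin Bieber")); // {0, 1}
    System.out.println(index.get("Ke$ha"));         // {2, 3}

    // "page = 'Justin Bieber' OR page = 'Ke$ha'" becomes a bitwise OR.
    BitSet either = (BitSet) index.get("Justin Bieber").clone();
    either.or(index.get("Ke$ha"));
    System.out.println(either);                     // {0, 1, 2, 3} -> all four rows match
  }
}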

Slide 28

Slide 28 text

MORE PROBLEMS
‣ We’ve solved the query problem
  • Druid gave us arbitrary data exploration & fast queries
‣ What about data freshness?
  • Batch loading is slow!
  • We need “real-time”
  • Alerts, operational monitoring, etc.

Slide 29

Slide 29 text

A FASTER DATA PIPELINE

Slide 30

Slide 30 text

THE STORY SO FAR
Event Streams → Hadoop → Druid → Insight

Slide 31

Slide 31 text

INGESTION DELAYS
‣ Users grow accustomed to fast queries
‣ But they become frustrated when working with stale data
‣ We want to cover operational needs as well as historical analysis
‣ Two obstacles
  • Loading raw data into Hadoop
  • Materializing views into a query engine

Slide 32

Slide 32 text

FAST DELIVERY WITH KAFKA
‣ High throughput event delivery
‣ Straightforward, reliable design
‣ Buffers incoming data to give consumers time to process it
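For concreteness, here is a minimal event producer using the Kafka Java client (the modern client API rather than the 0.7/0.8-era producer contemporary with this deck); the broker address, topic name, and event payload are placeholders.

// Minimal sketch of pushing JSON events onto a Kafka topic.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");            // placeholder broker address
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      String event = "{\"timestamp\":\"2013-11-01T00:00:00Z\",\"page\":\"Ke$ha\",\"added\":17}";
      // Brokers buffer the event until downstream consumers (Storm, etc.) pull it.
      producer.send(new ProducerRecord<>("events", event));
    }
  }
}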

Slide 33

Slide 33 text

FAST DELIVERY WITH KAFKA
Producers → Kafka Brokers → Consumers

Slide 34

Slide 34 text

FAST LOADING WITH DRUID
‣ We have an indexing system
‣ We have a serving system that runs queries on data
‣ We can serve queries while building indexes!
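The “serve queries while building indexes” idea can be illustrated with a toy in-memory rollup that one thread appends to while others read aggregates; this is only a conceptual stand-in for what a Druid realtime worker does (incremental index plus periodic hand-off), not its actual code.

// Toy model of realtime ingestion: an in-memory rollup that is queryable while it grows.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class RealtimeRollup {
  // key = dimension value (e.g. page), value = running sum of a metric (e.g. added)
  private final Map<String, LongAdder> added = new ConcurrentHashMap<>();

  public void ingest(String page, long delta) {              // called by the ingestion thread
    added.computeIfAbsent(page, k -> new LongAdder()).add(delta);
  }

  public long query(String page) {                           // callable at any time by query threads
    LongAdder sum = added.get(page);
    return sum == null ? 0 : sum.sum();
  }

  public static void main(String[] args) throws InterruptedException {
    RealtimeRollup rollup = new RealtimeRollup();
    Thread ingester = new Thread(() -> {
      for (int i = 0; i < 1000; i++) rollup.ingest("Ke$ha", 1);
    });
    ingester.start();
    System.out.println("partial: " + rollup.query("Ke$ha")); // may print any value 0..1000
    ingester.join();
    System.out.println("final:   " + rollup.query("Ke$ha")); // 1000
  }
}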

Slide 35

Slide 35 text

FAST LOADING WITH DRUID
Kafka Brokers → Druid Realtime Workers → Druid Query Broker (immediate)
Druid Realtime Workers → Druid Historical Cluster → Druid Query Broker (periodic hand-off)

Slide 36

Slide 36 text

…SO WE’RE DONE?
‣ For simple use cases, yes!
‣ Now we can load events into Druid in real time
‣ But there are limitations
  • Deduplication
  • Joining multiple event streams
  • Any nontrivial pre-processing

Slide 37

Slide 37 text

FAST PROCESSING WITH STORM
‣ Storm is a stream processor: it handles one event at a time
‣ We can already process our data using Hadoop MapReduce
‣ Let’s translate that to streams (see the sketch below)
  • “Load” operations can stream data from Kafka
  • “Map” operations are already stream-friendly
  • “Reduce” operations can be windowed with in-memory state
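A sketch of that Hadoop-to-streams translation as a Storm topology, written against the 2013-era backtype.storm API; the stub spout, field names, and counting logic are placeholder assumptions (a real topology would read from Kafka via storm-kafka and hand results to Druid).

// Sketch of a Kafka -> map -> windowed-reduce topology using the 0.9.x-era Storm API.
import java.util.HashMap;
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class StreamingEtl {

  // Stand-in for the "load" step: a real topology would use storm-kafka's KafkaSpout.
  public static class StubEventSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }
    public void nextTuple() {
      collector.emit(new Values("page=Ke$ha"));   // fake raw event
      Utils.sleep(100);
    }
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("raw"));
    }
  }

  // "Map" step: parse the raw event and emit the fields we care about.
  public static class ParseBolt extends BaseBasicBolt {
    public void execute(Tuple input, BasicOutputCollector collector) {
      String raw = input.getString(0);            // e.g. "page=Ke$ha"
      String page = raw.substring(raw.indexOf('=') + 1);
      collector.emit(new Values(page, 1L));
    }
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("page", "count"));
    }
  }

  // "Reduce" step: windowed in-memory counts per page. A real topology would flush
  // these on a timer and push them into Druid (e.g. via Tranquility).
  public static class CountBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<>();
    public void execute(Tuple input, BasicOutputCollector collector) {
      counts.merge(input.getStringByField("page"), input.getLongByField("count"), Long::sum);
    }
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
  }

  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("events", new StubEventSpout());
    builder.setBolt("parse", new ParseBolt(), 4).shuffleGrouping("events");
    builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("parse", new Fields("page"));
    new LocalCluster().submitTopology("streaming-etl", new Config(), builder.createTopology());
  }
}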

Slide 38

Slide 38 text

FAST PROCESSING WITH STORM
Kafka Brokers → Storm Workers → Druid Realtime

Slide 39

Slide 39 text

THE STORY SO FAR
Event Streams → Hadoop → Druid → Insight
Event Streams → Kafka → Storm → Druid → Insight

Slide 40

Slide 40 text

WHAT WE BOUGHT
‣ Druid queries reflect new events within seconds
‣ Systems are fully decoupled
  • No query downtime or delivery bus downtime
‣ Brief processing delays during maintenance
  • Because we need to restart Storm topologies
  • But query performance is not affected; only data freshness

Slide 41

Slide 41 text

WHAT WE GAVE UP
‣ Stream processing isn’t perfect
‣ Difficult to handle corrections of existing data
‣ Windows may be too small for fully accurate operations
‣ Hadoop was actually good at these things

Slide 42

Slide 42 text

THE RETURN OF HADOOP
‣ Batch processing runs for all data older than a few hours
‣ Stream processing fills the gap
‣ Query broker merges results from both systems
  • “Fixed up,” immutable, historical data – by Hadoop
  • Realtime data – by Storm & realtime Druid
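The merge rule implied by this slide can be sketched as: batch (Hadoop-built) results win for any interval they cover, and realtime results only fill the newest, not-yet-batch-processed interval. Keys, shapes, and numbers below are invented for illustration.

// Toy illustration of the batch/realtime merge rule described on this slide:
// batch results are authoritative wherever they exist; realtime fills the rest.
import java.util.Map;
import java.util.TreeMap;

public class LambdaMerge {
  public static void main(String[] args) {
    // Hourly click counts, keyed by hour. Shapes and numbers are made up.
    Map<String, Long> batch = new TreeMap<>(Map.of(
        "2013-11-01T00", 120L, "2013-11-01T01", 95L));          // "fixed up" by Hadoop
    Map<String, Long> realtime = new TreeMap<>(Map.of(
        "2013-11-01T01", 90L,                                   // overlaps batch -> ignored
        "2013-11-01T02", 40L));                                 // newest hour, batch not ready yet

    Map<String, Long> merged = new TreeMap<>(realtime);
    merged.putAll(batch);                                       // batch overwrites any overlap
    System.out.println(merged);  // {2013-11-01T00=120, 2013-11-01T01=95, 2013-11-01T02=40}
  }
}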

Slide 43

Slide 43 text

THE STACK
Event Streams → Kafka → Storm → Druid → Insight (real-time, only on-time data)
Event Streams → Kafka → Hadoop → Druid → Insight (some hours later, all data)

Slide 44

Slide 44 text

DO TRY THIS AT HOME

Slide 45

Slide 45 text

CORNERSTONES
‣ Druid - druid.io - @druidio
‣ Storm - storm.incubator.apache.org - @stormprocessor
‣ Hadoop - hadoop.apache.org
‣ Kafka - kafka.apache.org - @apachekafka

Slide 46

Slide 46 text

GLUE
Event Streams → Kafka → {Storm, Hadoop} → Druid → Insight
‣ Kafka → Storm: storm-kafka
‣ Kafka → Hadoop: Camus
‣ Storm → Druid: Tranquility
‣ Hadoop → Druid: Druid (batch indexing)

Slide 47

Slide 47 text

TRANQUILITY
‣ Used in production at Metamarkets
‣ One job: push data into Druid in real-time
‣ Manages partitioning, redundancy, and schema changes
‣ Can be used with any JVM language
‣ Includes Storm and Finagle bindings
‣ Open-sourced this week
‣ https://github.com/metamx/tranquility

Slide 48

Slide 48 text

GET RADICAL
‣ Queries answered quickly, on fresh data
‣ Kafka provides fast, reliable event transport
‣ Storm and Hadoop clean and prepare data for Druid
‣ Druid handles queries and manages the serving layer
‣ “Real-time Analytics Data Stack”
‣ …a.k.a. RAD Stack
‣ …we needed a name

Slide 49

Slide 49 text

THANK YOU