
Real-time Analytics with Open Source Technologies

Druid
February 14, 2014

The maturation and development of open source technologies has made it easier than ever for companies to derive insights from vast quantities of data. In this session, we will cover how to build a real-time analytics stack using Kafka, Storm, and Druid.

Analytics pipelines running purely on Hadoop can suffer from hours of data lag. Initial attempts to solve this problem often lead to inflexible solutions, where the queries must be known ahead of time, or fragile solutions where the integrity of the data cannot be assured. Combining Hadoop with Kafka, Storm, and Druid can guarantee system availability, maintain data integrity, and support fast and flexible queries.

In the described system, Kafka provides a fast message bus and is the delivery point for machine-generated event streams. Storm and Hadoop work together to load data into Druid. Storm handles near-real-time data and Hadoop handles historical data and data corrections. Druid provides flexible, highly available, low-latency queries.

This talk is based on our real-world experiences building out such a stack for online advertising analytics at Metamarkets.

Transcript

  1. REAL-TIME ANALYTICS WITH OPEN SOURCE TECHNOLOGIES
     KAFKA · HADOOP · STORM · DRUID
     FANGJIN YANG · GIAN MERLINO · SOFTWARE ENGINEERS @ METAMARKETS

  2. OVERVIEW
     ‣ Problem: dealing with event data
     ‣ Motivation: evolution of a “real-time” stack
     ‣ Architecture: the “RAD” stack
     ‣ Next steps: try it out for yourself
  3. THE PROBLEM

  4. [Image slide — footer: Fangjin Yang 2013]

  5. [Image slide]

  6. [Image slide]

  7. [Image slide — Event Stream]

  8. [Image slide — Event Stream]

  9. WE ARE METAMARKETS... AND WE ANALYZE DATA
  10. THE PROBLEM
      ‣ Arbitrary and interactive exploration
      ‣ Recency matters! Alert on major changes
      ‣ Availability

  11. A SOLUTION
      ‣ Load all your data into Hadoop. Query it. Done!
      ‣ Good job guys, let’s go home

  12. PROBLEMS OF THE NAIVE SOLUTION
      ‣ MapReduce can handle almost every distributed computing problem
      ‣ MapReduce over your raw data is flexible but slow
      ‣ Hadoop is not optimized for query latency
      ‣ To optimize queries, we need a query layer

  13. FINDING A SOLUTION: Event Streams → Hadoop → Insight

  14. FINDING A SOLUTION: Event Streams → Hadoop → Query Engine → Insight
  15. A FASTER QUERY LAYER

  16. MAKE QUERIES FASTER
      ‣ What types of queries to optimize for?
        • Revenue over time broken down by demographic
        • Top publishers by clicks over the last month
        • Number of unique visitors broken down by any dimension
        • Not dumping the entire dataset
        • Not examining individual events
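The query shapes above are all aggregations. A toy sketch of the first one, time-bucketed revenue grouped by demographic, in plain Python with invented field names (this illustrates the query shape, not Druid's API):

```python
from collections import defaultdict

# Hypothetical events (field names invented for illustration).
events = [
    {"hour": "2014-02-14T00", "demographic": "18-24", "revenue": 10.0},
    {"hour": "2014-02-14T00", "demographic": "25-34", "revenue": 7.5},
    {"hour": "2014-02-14T01", "demographic": "18-24", "revenue": 4.0},
]

# "Revenue over time broken down by demographic" is a time-bucketed
# group-by: one running total per (hour, demographic) pair.
totals = defaultdict(float)
for e in events:
    totals[(e["hour"], e["demographic"])] += e["revenue"]
```

Scanning every raw event like this is exactly what is too slow at scale, which motivates the column-store and indexing slides that follow.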
  17. FINDING A SOLUTION: Event Streams → Hadoop → RDBMS → Insight

  18. FINDING A SOLUTION: Event Streams → Hadoop → NoSQL K/V Stores → Insight

  19. FINDING A SOLUTION: Event Streams → Hadoop → “In memory” → Insight

  20. FINDING A SOLUTION: Event Streams → Hadoop → Redshift/Vertica → Insight
  21. DRUID AS A QUERY ENGINE

  22. DRUID
      ‣ Druid project started in mid 2011
      ‣ Open sourced in Oct. 2012
      ‣ Growing community
        • ~30 contributors (not all publicly listed) from many different companies
      ‣ Designed for low latency ingestion and aggregation
        • Optimized for the types of queries we were trying to make

  23. ARCHITECTURE (EARLY DAYS)

  24. DATA
      timestamp             page           language  city     country  ...  added  deleted
      2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
      2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA           15     62
      2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA           32     45
      2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            17     87
      2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            43     99
      2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53
      ...
  25. COLUMN COMPRESSION · DICTIONARIES
      ‣ Create ids
        • Justin Bieber -> 0, Ke$ha -> 1
      ‣ Store
        • page -> [0 0 0 1 1 1]
        • language -> [0 0 0 0 0 0]
      (same data table as slide 24)
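The dictionary-encoding step on this slide can be sketched in a few lines of Python: assign each distinct string an integer id in order of first appearance, and store the column as ids only.

```python
def dict_encode(column):
    """Dictionary-encode a string column: each distinct value gets an
    integer id, and the column is stored as a list of ids."""
    ids = {}
    encoded = []
    for value in column:
        if value not in ids:
            ids[value] = len(ids)   # next unused id
        encoded.append(ids[value])
    return ids, encoded

# The "page" column from the slide's table.
pages = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
ids, encoded = dict_encode(pages)
# ids     -> {"Justin Bieber": 0, "Ke$ha": 1}
# encoded -> [0, 0, 0, 1, 1, 1]
```

The win is that repeated strings are stored once, and the integer column compresses far better than raw strings.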
  26. BITMAP INDICES
      ‣ Justin Bieber -> rows [0, 1, 2] -> [111000]
      ‣ Ke$ha -> rows [3, 4, 5] -> [000111]
      (same data table as slide 24)

  27. FAST AND FLEXIBLE QUERIES
      row  page
      0    Justin Bieber
      1    Justin Bieber
      2    Ke$ha
      3    Ke$ha
      JUSTIN BIEBER: [1, 1, 0, 0]
      KE$HA: [0, 0, 1, 1]
      JUSTIN BIEBER OR KE$HA: [1, 1, 1, 1]
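Slides 26 and 27 together can be sketched with Python integers as bitmaps (a toy stand-in for the compressed bitmaps Druid actually uses): build one bitmap per distinct value, then answer an OR filter with a single bitwise operation.

```python
def build_bitmaps(column):
    """One bitmap per distinct value; bit i is set when row i holds it."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps.setdefault(value, 0)
        bitmaps[value] |= 1 << row
    return bitmaps

# The four-row example from slide 27.
pages = ["Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]
bitmaps = build_bitmaps(pages)

# "Justin Bieber OR Ke$ha" is one bitwise OR over the two bitmaps;
# no row scan is needed to find the matching rows.
either = bitmaps["Justin Bieber"] | bitmaps["Ke$ha"]
matching_rows = [row for row in range(len(pages)) if either >> row & 1]
# matching_rows -> [0, 1, 2, 3]
```

This is why the slide calls the queries "fast and flexible": arbitrary boolean filters reduce to cheap bitwise AND/OR over precomputed bitmaps.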
  28. MORE PROBLEMS
      ‣ We’ve solved the query problem
        • Druid gave us arbitrary data exploration & fast queries
      ‣ What about data freshness?
        • Batch loading is slow!
        • We need “real-time”
        • Alerts, operational monitoring, etc.
  29. A FASTER DATA PIPELINE

  30. THE STORY SO FAR: Event Streams → Hadoop → Druid → Insight

  31. INGESTION DELAYS
      ‣ Users grow accustomed to fast queries
      ‣ But they become frustrated when working with stale data
      ‣ We want to cover operational needs as well as historical analysis
      ‣ Two obstacles
        • Loading raw data into Hadoop
        • Materializing views into a query engine
  32. FAST DELIVERY WITH KAFKA
      ‣ High throughput event delivery
      ‣ Straightforward, reliable design
      ‣ Buffers incoming data to give consumers time to process it
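The buffering point can be sketched with an in-memory queue standing in for a Kafka topic. This is a toy model of the decoupling (producers append at full speed, the consumer drains at its own pace), not the Kafka API:

```python
import queue
import threading

# Toy stand-in for a Kafka topic: a bounded FIFO buffer between a
# producer and a consumer running at different speeds.
topic = queue.Queue(maxsize=1000)

def producer():
    for i in range(100):
        topic.put({"event_id": i})      # "broker" buffers the event

consumed = []
def consumer():
    for _ in range(100):
        consumed.append(topic.get())    # drained whenever we're ready
        topic.task_done()

threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# consumed now holds all 100 events, in order.
```

Real Kafka adds persistence, partitioning, and replication on top of this buffering idea, which is what makes the delivery "reliable" in the slide's sense.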
  33. FAST DELIVERY WITH KAFKA: Producers → Kafka Brokers → Consumers

  34. FAST LOADING WITH DRUID
      ‣ We have an indexing system
      ‣ We have a serving system that runs queries on data
      ‣ We can serve queries while building indexes!

  35. FAST LOADING WITH DRUID: Kafka Brokers → Druid Realtime Workers (immediate) → Druid Historical Cluster (periodic), queried through the Druid Query Broker
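The "serve queries while building indexes" idea from slides 34–35 can be sketched as follows. This is a toy model of a realtime worker, not Druid's actual code: events land in an in-memory index that is queryable immediately, and are periodically handed off to the historical tier.

```python
class RealtimeWorker:
    """Toy realtime node: queryable in-memory index plus sealed segments."""

    def __init__(self):
        self.in_memory = []    # index still being built
        self.historical = []   # immutable, handed-off segments

    def ingest(self, event):
        # New events are queryable as soon as they are appended.
        self.in_memory.append(event)

    def hand_off(self):
        # Periodic: seal the in-memory index into a historical segment.
        self.historical.append(list(self.in_memory))
        self.in_memory = []

    def query(self, predicate):
        # Queries see sealed segments AND the index under construction.
        rows = [e for seg in self.historical for e in seg] + self.in_memory
        return [e for e in rows if predicate(e)]

w = RealtimeWorker()
w.ingest({"page": "Ke$ha", "added": 17})
fresh = w.query(lambda e: e["page"] == "Ke$ha")   # visible before hand-off
w.hand_off()
w.ingest({"page": "Ke$ha", "added": 43})
both = w.query(lambda e: e["page"] == "Ke$ha")    # spans both tiers
```

The key property is that ingestion never blocks querying: hand-off only moves data between tiers the query path already covers.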
  36. …SO WE’RE DONE?
      ‣ For simple use cases, yes!
      ‣ Now we can load events into Druid in real time
      ‣ But there are limitations
        • Deduplication
        • Joining multiple event streams
        • Any nontrivial pre-processing

  37. FAST PROCESSING WITH STORM
      ‣ Storm is a stream processor: one event at a time
      ‣ We can already process our data using Hadoop MapReduce
      ‣ Let’s translate that to streams
        • “Load” operations can stream data from Kafka
        • “Map” operations are already stream-friendly
        • “Reduce” operations can be windowed with in-memory state
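The last bullet, a "reduce" windowed with in-memory state, can be sketched as a generator that keeps per-window counters and emits them when the window closes. This illustrates the approach, not actual Storm topology code:

```python
from collections import defaultdict

def windowed_counts(stream, window_size):
    """Streaming reduce: accumulate per-key counts in memory and emit
    each window's totals once an event from a later window arrives."""
    state = defaultdict(int)
    current = None
    for timestamp, key in stream:
        window = timestamp // window_size
        if current is not None and window != current:
            yield current, dict(state)   # window closed: emit and reset
            state.clear()
        current = window
        state[key] += 1
    if current is not None:
        yield current, dict(state)       # flush the final open window

# 60-second windows over a toy event stream of (timestamp, event_type).
stream = [(0, "click"), (1, "click"), (2, "view"), (61, "click")]
out = list(windowed_counts(stream, 60))
# out -> [(0, {"click": 2, "view": 1}), (1, {"click": 1})]
```

The in-memory window is exactly the trade-off slide 41 returns to: late-arriving events that belong to an already-emitted window are lost, which is why Hadoop is brought back for corrections.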
  38. FAST PROCESSING WITH STORM: Kafka Brokers → Storm Workers → Druid Realtime

  39. THE STORY SO FAR: Event Streams → Kafka → Storm → Druid → Insight, with Hadoop → Druid alongside

  40. WHAT WE BOUGHT
      ‣ Druid queries reflect new events within seconds
      ‣ Systems are fully decoupled
        • No query downtime or delivery bus downtime
      ‣ Brief processing delays during maintenance
        • Because we need to restart Storm topologies
        • But query performance is not affected; only data freshness
  41. WHAT WE GAVE UP
      ‣ Stream processing isn’t perfect
      ‣ Difficult to handle corrections of existing data
      ‣ Windows may be too small for fully accurate operations
      ‣ Hadoop was actually good at these things

  42. THE RETURN OF HADOOP
      ‣ Batch processing runs for all data older than a few hours
      ‣ Stream processing fills the gap
      ‣ Query broker merges results from both systems
        • “Fixed up,” immutable, historical data – by Hadoop
        • Realtime data – by Storm & realtime Druid
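The broker merge described on slide 42 can be sketched as: batch results are authoritative up to some horizon, and realtime results only fill the interval batch has not covered yet. The result shape ({hour: count}) and function name are invented for illustration:

```python
def merge_results(batch, realtime, batch_horizon):
    """Combine batch and realtime query results. Batch rows ("fixed up",
    immutable) win outright; realtime rows are used only for intervals
    at or after the horizon batch processing has not reached yet."""
    merged = dict(batch)
    for hour, count in realtime.items():
        if hour >= batch_horizon:
            merged[hour] = count
    return merged

batch    = {10: 100, 11: 102}        # complete, corrected by Hadoop
realtime = {11: 95, 12: 40, 13: 7}   # approximate but fresh
merged = merge_results(batch, realtime, batch_horizon=12)
# merged -> {10: 100, 11: 102, 12: 40, 13: 7}
```

Note that hour 11 exists in both inputs: the batch value (102) wins over the realtime estimate (95), which is precisely how batch reprocessing corrects stream-processing inaccuracies.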
  43. THE STACK: Event Streams → Kafka → {Storm, Hadoop} → Druid → Insight
      ‣ Storm path: real-time, only on-time data
      ‣ Hadoop path: some hours later, all data
  44. DO TRY THIS AT HOME

  45. CORNERSTONES
      ‣ Druid - druid.io - @druidio
      ‣ Storm - storm.incubator.apache.org - @stormprocessor
      ‣ Hadoop - hadoop.apache.org
      ‣ Kafka - kafka.apache.org - @apachekafka

  46. GLUE
      ‣ storm-kafka: Kafka → Storm
      ‣ Camus: Kafka → Hadoop
      ‣ Tranquility: Storm → Druid

  47. TRANQUILITY
      ‣ Used in production at Metamarkets
      ‣ One job: push data into Druid in real-time
      ‣ Manages partitioning, redundancy, and schema changes
      ‣ Can be used with any JVM language
      ‣ Includes Storm and Finagle bindings
      ‣ Open-sourced this week
      ‣ https://github.com/metamx/tranquility

  48. GET RADICAL
      ‣ Queries answered quickly, on fresh data
      ‣ Kafka provides fast, reliable event transport
      ‣ Storm and Hadoop clean and prepare data for Druid
      ‣ Druid handles queries and manages the serving layer
      ‣ “Real-time Analytics Data Stack”
        • …a.k.a. RAD Stack
        • …we needed a name
  49. THANK YOU