Data Day Texas 2016

Druid
January 23, 2016

Open Source Lambda Architecture with Kafka, Samza, Hadoop, and Druid

Transcript

  1. OPEN SOURCE LAMBDA ARCHITECTURE
    KAFKA · HADOOP · SAMZA · DRUID
    FANGJIN YANG · CO-FOUNDER @ IMPLY

  2. OVERVIEW
    PROBLEM: BUSINESS INTELLIGENCE ANALYTICS
    POSSIBILITIES: CHOOSING THE RIGHT TOOLS FOR THE JOB
    ARCHITECTURE: COMBINING TECHNOLOGIES
    NEXT STEPS: TRY IT OUT FOR YOURSELF

  3. 2015
    THE PROBLEM
    ‣ Working with large volumes of data is complex!
    • Data manipulation/ETL, machine learning, building applications, etc.
    • Building data systems for business intelligence applications
    ‣ Dozens of solutions, projects, and methodologies
    ‣ How to choose the right tools for the job?

  4. DEMO
    IN CASE THE INTERNET DIDN’T WORK
    PRETEND YOU SAW SOMETHING COOL

  5. A GENERAL SOLUTION?
    ‣ Load all your data into Hadoop/Spark. Query it. Done!
    ‣ Good job guys, let’s go home

  6. A GENERAL SOLUTION?
    [Diagram: Event Data, Hadoop/Spark, Business Intelligence Applications]

  7. GENERAL SOLUTION LIMITATIONS
    ‣ When one technology becomes widely adopted, its limitations also become better known
    ‣ General computing frameworks can handle many different distributed computing problems
    ‣ They are also sub-optimal for many use cases
    ‣ Analytic queries are inefficient
    ‣ Specialized technologies are adopted to address these inefficiencies

  8. POSSIBLE SOLUTIONS

  9. MAKE QUERIES FASTER
    ‣ Optimizing business intelligence (OLAP) queries
    • Aggregate measures over time, broken down by dimensions
    • Revenue over time broken down by product type
    • Top selling products by volume in San Francisco
    • Number of unique visitors broken down by age
    • Not dumping the entire dataset
    • Not examining individual events
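
The query shape on this slide (aggregate a measure over time, broken down by dimensions) can be sketched in a few lines of Python. The events and field names below are hypothetical and are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical sales events; field names are illustrative only.
events = [
    {"hour": "2016-01-23T10", "product_type": "book", "revenue": 12.0},
    {"hour": "2016-01-23T10", "product_type": "game", "revenue": 30.0},
    {"hour": "2016-01-23T10", "product_type": "book", "revenue": 8.0},
    {"hour": "2016-01-23T11", "product_type": "game", "revenue": 25.0},
]

def revenue_by(events, *dimensions):
    """Sum revenue per time bucket, broken down by the given dimensions."""
    totals = defaultdict(float)
    for e in events:
        key = (e["hour"],) + tuple(e[d] for d in dimensions)
        totals[key] += e["revenue"]
    return dict(totals)

# "Revenue over time broken down by product type"
result = revenue_by(events, "product_type")
```

The result holds one aggregated number per (time bucket, dimension value) pair; no individual event is returned, which is the property the slide emphasizes.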

  10. OPTIMIZING QUERIES
    [Diagram: Event Data, ETL, Sharded RDBMS?, Business Intelligence Applications]

  11. RDBMS
    ‣ Traditional data warehouse
    • Row store
    • Star schema
    • Aggregate tables
    • Query cache
    ‣ Fast becoming outdated
    • Scanning raw data is slow and expensive

  12. OPTIMIZING QUERIES
    [Diagram: Event Data, ETL, Key/Value Stores?, Business Intelligence Applications]

  13. KEY/VALUE STORES
    ‣ Pre-computation
    • Pre-compute every possible query
    • Pre-compute a subset of queries
    • Exponential scaling costs
    ‣ Range scans
    • Primary key: dimensions/attributes
    • Value: measures/metrics (things to aggregate)
    • Still too slow!
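
A rough illustration of why pre-computation scales exponentially: if every combination of dimensions is a possible grouping, the number of groupings doubles with each added dimension. This sketch only counts groupings; it assumes nothing about how a particular key/value store lays out its keys.

```python
from itertools import combinations

def groupings(dimensions):
    """Every subset of dimensions that a query could group by."""
    return [c for r in range(len(dimensions) + 1)
            for c in combinations(dimensions, r)]

dims = ["page", "language", "city", "country"]
all_groupings = groupings(dims)
# 4 dimensions give 2**4 = 16 groupings; each extra dimension doubles that.
```

Each grouping also has to be materialized for every combination of dimension values, so the real storage cost grows faster still.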

  14. OPTIMIZING QUERIES
    [Diagram: Event Data, ETL, Column Stores, Business Intelligence Applications]

  15. COLUMN STORES
    ‣ Load/scan exactly what you need for a query
    ‣ Different compression algorithms for different columns
    ‣ Encoding for string columns
    ‣ Compression for measure columns
    ‣ Different indexes for different columns
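
One common encoding for string columns is dictionary encoding: store each distinct string once and keep small integer IDs per row. The sketch below is a generic illustration and is not presented as Druid's actual on-disk format.

```python
def dictionary_encode(values):
    """Replace each string with a small integer ID; return (dictionary, ids)."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]

page_column = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
dictionary, ids = dictionary_encode(page_column)

# Decoding is a plain lookup, so no information is lost.
decoded = [dictionary[i] for i in ids]
```

The integer IDs are cheap to compare and compress well, which is why string and measure columns can sensibly use different schemes.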

  16. DATA!
    timestamp             page           language  city     country  ...  added  deleted
    2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
    2011-01-01T00:01:63Z  Justin Bieber  en        SF       USA           15     62
    2011-01-01T01:02:51Z  Justin Bieber  en        SF       USA           32     45
    2011-01-01T01:01:11Z  Ke$ha          en        Calgary  CA            17     87
    2011-01-01T01:02:24Z  Ke$ha          en        Calgary  CA            43     99
    2011-01-01T02:03:12Z  Ke$ha          en        Calgary  CA            12     53
    ...

  17. PRE-AGGREGATION/ROLL-UP
    Raw events:
    timestamp             page           language  city     country  ...  added  deleted
    2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
    2011-01-01T00:01:63Z  Justin Bieber  en        SF       USA           15     62
    2011-01-01T01:02:51Z  Justin Bieber  en        SF       USA           32     45
    2011-01-01T01:01:11Z  Ke$ha          en        Calgary  CA            17     87
    2011-01-01T01:02:24Z  Ke$ha          en        Calgary  CA            43     99
    2011-01-01T02:03:12Z  Ke$ha          en        Calgary  CA            12     53
    ...
    Rolled up by hour:
    timestamp             page           language  city     country  ...  added  deleted
    2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA           25     127
    2011-01-01T01:00:00Z  Justin Bieber  en        SF       USA           32     45
    2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            60     186
    2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53
    ...
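
The roll-up on this slide can be reproduced with a short script: truncate each timestamp to the hour, then sum the measures for rows that share the same hour and dimension values. This is a minimal sketch of the idea, not Druid's ingestion code; the function name is made up for illustration.

```python
from collections import defaultdict

# Rows from the slide: (timestamp, page, language, city, country, added, deleted).
# Timestamps are copied as printed; only the hour prefix is used.
raw = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "en", "SF", "USA", 10, 65),
    ("2011-01-01T00:01:63Z", "Justin Bieber", "en", "SF", "USA", 15, 62),
    ("2011-01-01T01:02:51Z", "Justin Bieber", "en", "SF", "USA", 32, 45),
    ("2011-01-01T01:01:11Z", "Ke$ha", "en", "Calgary", "CA", 17, 87),
    ("2011-01-01T01:02:24Z", "Ke$ha", "en", "Calgary", "CA", 43, 99),
    ("2011-01-01T02:03:12Z", "Ke$ha", "en", "Calgary", "CA", 12, 53),
]

def roll_up(rows):
    """Sum 'added' and 'deleted' per (hour, dimensions) group."""
    totals = defaultdict(lambda: [0, 0])
    for ts, *dims, added, deleted in rows:
        hour = ts[:13] + ":00:00Z"  # truncate the timestamp to the hour
        key = (hour, *dims)
        totals[key][0] += added
        totals[key][1] += deleted
    return [key + tuple(sums) for key, sums in sorted(totals.items())]

rolled = roll_up(raw)
```

Six raw rows collapse into four, matching the rolled-up table; the trade-off is that individual events can no longer be recovered.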

  18. PARTITION DATA
    timestamp             page           language  city     country  ...  added  deleted
    2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA           25     127
    2011-01-01T01:00:00Z  Justin Bieber  en        SF       USA           32     45
    2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            60     186
    2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53
    ‣ Shard data by time
    ‣ Immutable blocks of data called “segments”
    Segment 2011-01-01T00/2011-01-01T01
    Segment 2011-01-01T01/2011-01-01T02
    Segment 2011-01-01T02/2011-01-01T03
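
A small sketch of time-based sharding, assuming hourly segments as in the slide: each row is assigned to the interval that contains its timestamp, and each finished group is frozen into a tuple to mimic immutability. The helper name `segment_interval` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def segment_interval(timestamp):
    """Return the hourly 'start/end' interval that contains the timestamp."""
    start = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(minute=0, second=0)
    end = start + timedelta(hours=1)
    return f"{start:%Y-%m-%dT%H}/{end:%Y-%m-%dT%H}"

rows = [
    ("2011-01-01T00:00:00Z", "Justin Bieber", 25, 127),
    ("2011-01-01T01:00:00Z", "Justin Bieber", 32, 45),
    ("2011-01-01T01:00:00Z", "Ke$ha", 60, 186),
    ("2011-01-01T02:00:00Z", "Ke$ha", 12, 53),
]

grouped = defaultdict(list)
for row in rows:
    grouped[segment_interval(row[0])].append(row)

# Freeze each group: once built, a segment is never modified in place.
segments = {interval: tuple(group) for interval, group in grouped.items()}
```

A query over a time range then only needs to touch the segments whose intervals overlap that range.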

  19. IMMUTABLE SEGMENTS
    ‣ Fundamental storage unit in Druid
    ‣ No contention between reads and writes
    ‣ One thread scans one segment
    ‣ Multiple threads can access same underlying data

  20. 2013
    COLUMN ORIENTATION
    timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
    2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
    2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18
    ‣ Scan/load only what you need
    ‣ Compression!
    ‣ Indexes!

  21. BITMAP INDICES
    ‣ Justin Bieber -> [0, 1, 2] -> [111000]
    ‣ Ke$ha -> [3, 4, 5] -> [000111]
    ‣ Justin Bieber OR Ke$ha -> [111111]
    timestamp             page           language  city     country  ...  added  deleted
    2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
    2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA           15     62
    2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA           32     45
    2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA            17     87
    2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            43     99
    2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA            12     53
    ...
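
The bitmap index on this slide can be sketched with plain Python integers used as bit sets. This is an uncompressed toy version meant only to show the idea; it does not represent how Druid stores or compresses its bitmaps.

```python
def build_bitmaps(column):
    """Map each distinct value to an integer bitmap; bit i is set if row i holds the value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

def rows_of(bitmap, n_rows):
    """List the row numbers whose bit is set."""
    return [row for row in range(n_rows) if bitmap >> row & 1]

def as_string(bitmap, n_rows):
    """Render with row 0 on the left, as the slide does."""
    return "".join("1" if bitmap >> row & 1 else "0" for row in range(n_rows))

page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
bitmaps = build_bitmaps(page)

# A filter such as page = 'Justin Bieber' OR page = 'Ke$ha' becomes a bitwise OR.
either = bitmaps["Justin Bieber"] | bitmaps["Ke$ha"]
```

Boolean filters reduce to bitwise operations on the bitmaps, so the matching rows are known before any measure column is scanned.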

  22. DRUID - ARCHITECTURE
    ‣ Different node types (processes) to solve different problems
    ‣ Processes dedicated to:
    • Historical data
    • Ingestion
    • Coordination
    • Result merging
    [Diagram: Druid Realtime Workers, Druid Historical Nodes, Druid Broker Nodes, User queries]
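
The "result merging" role can be pictured as scatter/gather: each data process answers from the segments it holds, and a broker combines the partial results. The sketch below is a conceptual toy with made-up function names and data; it is not Druid's query API.

```python
from collections import Counter

def query_node(segment_rows, page):
    """A data process answers from its own rows: sum 'added' per hour for one page."""
    partial = Counter()
    for hour, row_page, added in segment_rows:
        if row_page == page:
            partial[hour] += added
    return partial

def broker(nodes, page):
    """Fan the query out to every process and merge the partial results."""
    merged = Counter()
    for rows in nodes:
        merged.update(query_node(rows, page))
    return dict(merged)

# Hypothetical contents of two historical processes and one real-time worker.
historical_a = [("2011-01-01T01", "Ke$ha", 17)]
historical_b = [("2011-01-01T01", "Ke$ha", 43), ("2011-01-01T01", "Justin Bieber", 32)]
realtime = [("2011-01-01T02", "Ke$ha", 12)]

answer = broker([historical_a, historical_b, realtime], "Ke$ha")
```

Because sums can be merged from partial sums, the broker never needs the underlying rows.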

  23. DRUID
    ‣ Production ready
    ‣ Scale
    • 100+ trillion events
    • 3M+ events/s
    • 90% of queries < 1 second
    ‣ Growing Community
    • 150+ contributors
    • Many client libraries and UIs: R, Python, Perl, Node.js, Grafana, etc.
    • Used in production at numerous large and small organizations

  24. OPTIMIZING QUERIES
    [Diagram: Event Data, ETL, Druid, Business Intelligence Applications]

  25. STREAMING DATA INTO DRUID
    ‣ Write-optimized data structure: hash map in heap
    ‣ Convert write-optimized -> read-optimized
    ‣ Read-optimized data structure: Druid segments
    ‣ Query data immediately
    [Diagram: Events, Memory, Convert, Segment, Queries]
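
A toy version of the two data structures on this slide: an in-memory hash map that absorbs writes (and can already be queried), and a conversion step that produces a sorted, columnar, immutable form. The class and method names are hypothetical, and the "segment" here is only a stand-in for a real Druid segment.

```python
class RealtimeBuffer:
    """Write-optimized: a hash map keyed by (hour, page) that rolls up as events arrive."""

    def __init__(self):
        self.rows = {}

    def add(self, hour, page, added):
        key = (hour, page)
        self.rows[key] = self.rows.get(key, 0) + added

    def query(self, page):
        """Data is queryable immediately, before any conversion."""
        return sum(v for (_, p), v in self.rows.items() if p == page)

    def to_segment(self):
        """Convert to a read-optimized form: sorted, columnar, immutable tuples."""
        keys = sorted(self.rows)
        return {
            "hour": tuple(k[0] for k in keys),
            "page": tuple(k[1] for k in keys),
            "added": tuple(self.rows[k] for k in keys),
        }

buffer = RealtimeBuffer()
buffer.add("2011-01-01T01", "Ke$ha", 17)
buffer.add("2011-01-01T01", "Ke$ha", 43)
buffer.add("2011-01-01T02", "Ke$ha", 12)
buffer.add("2011-01-01T01", "Justin Bieber", 32)

total_before_conversion = buffer.query("Ke$ha")
segment = buffer.to_segment()
```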

  26. DRUID INGESTION
    ‣ Must have denormalized, flat data
    ‣ Druid cannot do stateful processing at ingestion time
    ‣ …like stream-stream joins
    ‣ …or user session reconstruction
    ‣ …or a bunch of other useful things!
    ‣ Many Druid users need an ETL pipeline

  27. DRUID REAL-TIME INGESTION
    [Diagram: Data Source, Druid Realtime Workers, Immediate, Periodic, Druid Historical Nodes, Druid Broker Nodes, User queries]

  28. DRUID REAL-TIME INGESTION
    [Diagram: Data Source, Stream Processor, Druid Realtime Workers, Immediate, Periodic, Druid Historical Nodes, Druid Broker Nodes, User queries]

  29. DRUID REAL-TIME INGESTION
    [Diagram: Druid Realtime Workers, Immediate, Periodic, Druid Historical Nodes, Druid Broker Nodes, User queries]

  30. STREAMING DATA PIPELINES

  31. AN EXAMPLE: ONLINE ADS
    ‣ Input data: impressions, clicks
    ‣ Output: enhanced impressions
    ‣ Steps
    • Join impressions with clicks -> “clicks”
    • Look up IDs to names -> “advertiser”, “publisher”, …
    • Geocode -> “country”, …
    • Lots of other additions
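
The lookup and geocode steps amount to adding derived fields to each impression. The sketch below uses hypothetical lookup tables, input field names (`advertiser_id`, `publisher_id`, `ip`), and a prefix match standing in for a real geocoder; none of these come from the talk.

```python
# Hypothetical lookup tables; a real pipeline would load these from a database or service.
ADVERTISERS = {17: "google.com"}
PUBLISHERS = {42: "bieberfever.com"}
COUNTRY_BY_IP_PREFIX = {"203.0.113.": "USA"}  # stand-in for a geocoder

def enhance(impression):
    """Return a copy of the impression with names and country added."""
    out = dict(impression)
    out["advertiser"] = ADVERTISERS.get(impression["advertiser_id"], "unknown")
    out["publisher"] = PUBLISHERS.get(impression["publisher_id"], "unknown")
    prefix = impression["ip"].rsplit(".", 1)[0] + "."
    out["country"] = COUNTRY_BY_IP_PREFIX.get(prefix, "unknown")
    return out

enhanced = enhance({"advertiser_id": 17, "publisher_id": 42, "ip": "203.0.113.9"})
```

The output is a flat record with readable dimension values, which is the shape an analytics store can ingest directly.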

  32. PIPELINE
    [Diagram: Impressions, Clicks -> ? -> Druid]

  33. PIPELINE
    Impressions
    Partition 0
    {key: 186bd591-9442-48f0, publisher: foo, …}
    {key: 9b5e2cd2-a8ac-4232, publisher: qux, …}

    Partition 1
    {key: 1079026c-7151-4871, publisher: baz, …}

    Clicks
    Partition 0

    Partition 1
    {key: 186bd591-9442-48f0}

  34. PIPELINE
    [Diagram: Impressions, Clicks; Druid]

  35. PIPELINE
    [Diagram: Impressions, Clicks -> Shuffle -> Shuffled; Druid]

  36. PIPELINE
    Shuffled
    Partition 0
    {type: impression, key: 186bd591-9442-48f0, publisher: foo, …}
    {type: impression, key: 1079026c-7151-4871, publisher: baz, …}
    {type: click, key: 186bd591-9442-48f0}

    Partition 1
    {type: impression, key: 9b5e2cd2-a8ac-4232, publisher: qux, …}
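
The shuffle step repartitions both input streams by key so that an impression and its click land in the same partition. A minimal sketch, assuming a CRC32 hash as the partitioner (an illustrative choice, so the partition numbers it yields need not match the slide):

```python
import zlib

NUM_PARTITIONS = 2

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Stable hash partitioning: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def shuffle(impressions, clicks):
    """Tag each message with its type and route it by key."""
    shuffled = [[] for _ in range(NUM_PARTITIONS)]
    for imp in impressions:
        shuffled[partition_for(imp["key"])].append({"type": "impression", **imp})
    for click in clicks:
        shuffled[partition_for(click["key"])].append({"type": "click", **click})
    return shuffled

impressions = [
    {"key": "186bd591-9442-48f0", "publisher": "foo"},
    {"key": "9b5e2cd2-a8ac-4232", "publisher": "qux"},
    {"key": "1079026c-7151-4871", "publisher": "baz"},
]
clicks = [{"key": "186bd591-9442-48f0"}]

shuffled = shuffle(impressions, clicks)
```

After the shuffle, a join only has to look within a single partition, which lets each partition be processed independently.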

  37. PIPELINE
    [Diagram: Impressions, Clicks -> Shuffle -> Shuffled; Druid]

  38. PIPELINE
    [Diagram: Impressions, Clicks -> Shuffle -> Shuffled -> Join -> Joined; Druid]

  39. PIPELINE
    Joined
    Partition 0
    {key: 186bd591-9442-48f0, is_clicked: true, publisher: foo, …}
    {key: 1079026c-7151-4871, is_clicked: false, publisher: baz, …}

    Partition 1
    {key: 9b5e2cd2-a8ac-4232, is_clicked: false, publisher: qux, …}
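
Within one shuffled partition, the join can be sketched as follows. This version looks at the whole partition at once for clarity; a streaming job would instead keep per-key state and a join window. The function name is hypothetical.

```python
def join_partition(messages):
    """Join impressions with clicks that share a key within one partition."""
    clicked = {m["key"] for m in messages if m["type"] == "click"}
    return [
        {"key": m["key"], "is_clicked": m["key"] in clicked, "publisher": m["publisher"]}
        for m in messages
        if m["type"] == "impression"
    ]

# Partition 0 of the shuffled stream, as shown earlier in the deck.
partition_0 = [
    {"type": "impression", "key": "186bd591-9442-48f0", "publisher": "foo"},
    {"type": "impression", "key": "1079026c-7151-4871", "publisher": "baz"},
    {"type": "click", "key": "186bd591-9442-48f0"},
]

joined = join_partition(partition_0)
```

The result reproduces partition 0 of the joined stream: one record per impression, with `is_clicked` set from the presence of a matching click.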

  40. PIPELINE
    [Diagram: Impressions, Clicks -> Shuffle -> Shuffled -> Join -> Joined; Druid]

  41. PIPELINE
    [Diagram: Impressions, Clicks -> Shuffle -> Shuffled -> Join -> Joined -> Enhance & Output -> Druid]

  42. ALTERNATIVE PIPELINE
    [Diagram: Impressions, Clicks -> Shuffle -> Shuffled -> Join -> Joined -> Enhance -> Enhanced -> Druid]

  43. REPROCESSING

  44. WHY REPROCESS DATA?
    ‣ Bugs in processing code
    ‣ Imprecise streaming operations
    ‣ …like using short join windows
    ‣ Limitations of current software
    ‣ …Kafka, Samza can generate duplicate messages
    ‣ …Druid streaming ingestion is best-effort
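
To see why a short join window is imprecise, consider a click that arrives after the window has closed: the streaming job reports the impression as unclicked, while a later batch pass over all the data gets it right. The timings and the 10-minute window below are made-up values for illustration.

```python
def windowed_join(impressions, clicks, window_seconds):
    """Mark an impression clicked only if its click arrives within the window."""
    click_time = {c["key"]: c["time"] for c in clicks}
    return {
        imp["key"]: (
            imp["key"] in click_time
            and 0 <= click_time[imp["key"]] - imp["time"] <= window_seconds
        )
        for imp in impressions
    }

impressions = [{"key": "a", "time": 0}, {"key": "b", "time": 5}]
clicks = [{"key": "a", "time": 20}, {"key": "b", "time": 905}]

# Streaming with a 10-minute window misses the click on "b" (it is 900 s late).
streaming = windowed_join(impressions, clicks, window_seconds=600)
# A batch job can afford a much longer window over the stored data.
batch = windowed_join(impressions, clicks, window_seconds=86400)
```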

  45. LAMBDA ARCHITECTURES
    ‣ Hybrid batch/streaming data pipeline
    ‣ Batch technologies
    • Hadoop MapReduce
    • Spark
    ‣ Streaming technologies
    • Samza
    • Storm
    • Spark Streaming

  46. LAMBDA ARCHITECTURES
    ‣ Advantages?
    • Works as advertised
    • Works with a huge variety of open source software
    • Druid supports batch-replace-by-time-range through Hadoop

  47. DRUID REPLACE BY TIME
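
Replace-by-time-range can be pictured as swapping in a newer version of the data for one time interval while leaving every other interval untouched. The sketch below models this with a dictionary keyed by interval and a version number per entry; the structure and function name are illustrative assumptions, not Druid's metadata format.

```python
def replace_by_interval(segments, interval, new_rows, version):
    """Return a new segment map in which `interval` is served by the newer version."""
    current = segments.get(interval)
    if current is not None and current["version"] >= version:
        return segments  # existing data is as new or newer; nothing to replace
    updated = dict(segments)
    updated[interval] = {"version": version, "rows": tuple(new_rows)}
    return updated

interval = "2011-01-01T01/2011-01-01T02"
# Version 1 came from the streaming path and missed some events.
segments = {interval: {"version": 1, "rows": (("Ke$ha", 58),)}}
# The batch path recomputes the same hour from the full raw data.
after_batch = replace_by_interval(segments, interval, [("Ke$ha", 60)], version=2)
```

Because the swap is per interval, queries over other time ranges are unaffected while the corrected hour takes over.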

  48. LAMBDA ARCHITECTURES
    ‣ Disadvantages?
    ‣ Need code to run on two very different systems
    ‣ Maintaining two codebases is perilous
    ‣ …productivity loss
    ‣ …code drift
    ‣ …difficulty training new developers

  49. LAMBDA ARCHITECTURES
    [Diagram: Data, streaming]

  50. LAMBDA ARCHITECTURES
    [Diagram: Data, batch]

  51. LAMBDA ARCHITECTURES
    [Diagram: Data, streaming, batch]

  52. KAPPA ARCHITECTURE
    ‣ Pure streaming
    ‣ Reprocess data by replaying the input stream
    ‣ Doesn’t require operating two systems
    ‣ Doesn’t overcome software limitations
    ‣ http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
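
Reprocessing by replay can be sketched with a list standing in for a retained log: the fixed code is run over the same input from the beginning, writes to a new output, and readers switch over once it has caught up. The function names and the example bug are hypothetical.

```python
def process_v1(event):
    """Original job; it has a bug and drops the country."""
    return {"page": event["page"]}

def process_v2(event):
    """Fixed job."""
    return {"page": event["page"], "country": event["country"]}

def replay(log, process, from_offset=0):
    """Re-run a processing function over the retained log from a given offset."""
    return [process(event) for event in log[from_offset:]]

# A list standing in for a retained input stream.
log = [
    {"page": "Justin Bieber", "country": "USA"},
    {"page": "Ke$ha", "country": "CA"},
]

outputs = {"v1": replay(log, process_v1)}
outputs["v2"] = replay(log, process_v2)  # reprocess with the fixed code
serving = outputs["v2"]                  # switch readers; v1 can then be dropped
```

Only one processing system is involved, but the replay is still subject to whatever limits the streaming software itself has, as the slide notes.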

  53. DO TRY THIS AT HOME

  54. CORNERSTONES
    ‣ Druid - druid.io - @druidio
    ‣ Samza - samza.apache.org - @samzastream
    ‣ Kafka - kafka.apache.org - @apachekafka
    ‣ Hadoop - hadoop.apache.org

  55. GLUE
    Tranquility
    Camus / Secor
    Druid Hadoop indexer

  56. GLUE
    Camus / Secor
    Druid Hadoop indexer
    druid-kafka-eight

  57. TAKEAWAYS
    ‣ Consider Kafka for making your streams available
    ‣ Consider Samza for streaming data processing
    ‣ Consider Druid for interactive exploration of streams
    ‣ Have a reprocessing strategy if you’re interested in historical data
