Open Source Lambda Architecture with Kafka, Samza, Hadoop, and Druid
Druid · Data Day Texas 2016 · January 23, 2016

Transcript

  1. OPEN SOURCE LAMBDA ARCHITECTURE KAFKA · HADOOP · SAMZA ·

    DRUID FANGJIN YANG · CO-FOUNDER @ IMPLY
  2. OVERVIEW ‣ PROBLEM · BUSINESS INTELLIGENCE ANALYTICS POSSIBILITIES ‣ CHOOSING THE RIGHT TOOLS FOR THE JOB · ARCHITECTURE, COMBINING TECHNOLOGIES ‣ NEXT STEPS · TRY IT OUT FOR YOURSELF
  3. THE PROBLEM ‣ Working with large volumes of data

    is complex! • Data manipulations/ETL, machine learning, build applications, etc. • Building data systems for business intelligence applications ‣ Dozens of solutions, projects, and methodologies ‣ How to choose the right tools for the job?
  4. A GENERAL SOLUTION? ‣ Load all your data into

    Hadoop/Spark. Query it. Done! ‣ Good job guys, let’s go home
  5. GENERAL SOLUTION LIMITATIONS ‣ When one technology becomes widely

    adopted, its limitations also become more well known ‣ General computing frameworks can handle many different distributed computing problems ‣ They are also sub-optimal for many use cases ‣ Analytic queries are inefficient ‣ Specialized technologies are adopted to address these inefficiencies
  6. MAKE QUERIES FASTER ‣ Optimizing business intelligence (OLAP) queries

    • Aggregate measures over time, broken down by dimensions • Revenue over time broken down by product type • Top selling products by volume in San Francisco • Number of unique visitors broken down by age • Not dumping the entire dataset • Not examining individual events
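
The query shapes on this slide map naturally onto Druid's native query types. As a rough sketch (datasource name, column names, and broker address are made up for illustration), "revenue over time broken down by product type" looks like a topN query posted to the broker:

    # Sketch of a Druid native topN query: revenue per day, broken down by
    # product_type, over January 2015. Datasource, columns, and the broker
    # address are hypothetical.
    import json
    import requests

    query = {
        "queryType": "topN",
        "dataSource": "sales",
        "intervals": ["2015-01-01/2015-02-01"],
        "granularity": "day",
        "dimension": "product_type",
        "metric": "revenue",
        "threshold": 10,
        "aggregations": [
            {"type": "doubleSum", "name": "revenue", "fieldName": "revenue"}
        ],
    }

    resp = requests.post(
        "http://localhost:8082/druid/v2",          # Druid broker endpoint
        data=json.dumps(query),
        headers={"Content-Type": "application/json"},
    )
    print(resp.json())   # one entry per day, with the top product types by revenue
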
  7. RDBMS ‣ Traditional data warehouse • Row store • Star schema • Aggregate tables • Query cache ‣ Fast becoming outdated • Scanning raw data is slow and expensive
  8. KEY/VALUE STORES ‣ Pre-computation • Pre-compute every possible query • Pre-compute a subset of queries • Exponential scaling costs ‣ Range scans • Primary key: dimensions/attributes • Value: measures/metrics (things to aggregate) • Still too slow!
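
To make the key/value layout concrete, here is a small sketch of the "primary key = dimensions, value = metrics" idea using a plain Python dict in place of a real store, and why a filter on a non-leading dimension still forces a full scan (the "still too slow" point). All names are illustrative:

    # Key = (time bucket, dimension values...), value = metrics to aggregate.
    # A real key/value store keeps keys sorted so that prefix range scans
    # (e.g. everything for one hour) are cheap.
    store = {
        ("2011-01-01T00", "Justin Bieber", "SF"):      {"added": 25, "deleted": 127},
        ("2011-01-01T01", "Justin Bieber", "SF"):      {"added": 32, "deleted": 45},
        ("2011-01-01T01", "Ke$ha",         "Calgary"): {"added": 60, "deleted": 186},
        ("2011-01-01T02", "Ke$ha",         "Calgary"): {"added": 12, "deleted": 53},
    }

    # "Total added for Calgary" cannot use the key prefix (city is not the
    # leading key component), so every row has to be examined.
    total = sum(v["added"] for k, v in store.items() if k[2] == "Calgary")
    print(total)  # 72
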
  9. COLUMN STORES ‣ Load/scan exactly what you need for a query ‣ Different compression algorithms for different columns ‣ Encoding for string columns ‣ Compression for measure columns ‣ Different indexes for different columns
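
A toy illustration of the "encoding for string columns" bullet: dictionary-encode the column into small integer IDs, which compress well and give the indexes something compact to reference. This sketches the general idea only, not Druid's actual segment format:

    # Dictionary-encode a string column: store each distinct value once and
    # replace the column with integer IDs.
    pages = ["Justin Bieber", "Justin Bieber", "Justin Bieber",
             "Ke$ha", "Ke$ha", "Ke$ha"]

    dictionary = {}          # value -> id
    encoded = []             # the column, as small ints
    for value in pages:
        code = dictionary.setdefault(value, len(dictionary))
        encoded.append(code)

    print(dictionary)        # {'Justin Bieber': 0, 'Ke$ha': 1}
    print(encoded)           # [0, 0, 0, 1, 1, 1]  -- far cheaper to scan and compress
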
  10. DATA!

      timestamp             page           language  city     country  ...  added  deleted
      2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
      2011-01-01T00:01:63Z  Justin Bieber  en        SF       USA           15     62
      2011-01-01T01:02:51Z  Justin Bieber  en        SF       USA           32     45
      2011-01-01T01:01:11Z  Ke$ha          en        Calgary  CA           17     87
      2011-01-01T01:02:24Z  Ke$ha          en        Calgary  CA           43     99
      2011-01-01T02:03:12Z  Ke$ha          en        Calgary  CA           12     53
      ...
  11. PRE-AGGREGATION/ROLL-UP

      Raw input:

      timestamp             page           language  city     country  ...  added  deleted
      2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
      2011-01-01T00:01:63Z  Justin Bieber  en        SF       USA           15     62
      2011-01-01T01:02:51Z  Justin Bieber  en        SF       USA           32     45
      2011-01-01T01:01:11Z  Ke$ha          en        Calgary  CA           17     87
      2011-01-01T01:02:24Z  Ke$ha          en        Calgary  CA           43     99
      2011-01-01T02:03:12Z  Ke$ha          en        Calgary  CA           12     53
      ...

      Rolled up (hourly granularity):

      timestamp             page           language  city     country  ...  added  deleted
      2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA           25     127
      2011-01-01T01:00:00Z  Justin Bieber  en        SF       USA           32     45
      2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA           60     186
      2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA           12     53
      ...
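
The roll-up shown above is just a group-by on the truncated timestamp plus the dimensions, with the measures summed. A minimal sketch that reproduces the hourly table from the raw rows:

    from collections import defaultdict

    events = [
        ("2011-01-01T00:01:35Z", "Justin Bieber", "en", "SF",      "USA", 10, 65),
        ("2011-01-01T00:01:63Z", "Justin Bieber", "en", "SF",      "USA", 15, 62),
        ("2011-01-01T01:02:51Z", "Justin Bieber", "en", "SF",      "USA", 32, 45),
        ("2011-01-01T01:01:11Z", "Ke$ha",         "en", "Calgary", "CA",  17, 87),
        ("2011-01-01T01:02:24Z", "Ke$ha",         "en", "Calgary", "CA",  43, 99),
        ("2011-01-01T02:03:12Z", "Ke$ha",         "en", "Calgary", "CA",  12, 53),
    ]

    rolled_up = defaultdict(lambda: [0, 0])
    for ts, page, lang, city, country, added, deleted in events:
        hour = ts[:13] + ":00:00Z"                     # truncate timestamp to the hour
        key = (hour, page, lang, city, country)
        rolled_up[key][0] += added
        rolled_up[key][1] += deleted

    for key, (added, deleted) in sorted(rolled_up.items()):
        print(key, added, deleted)
    # ('2011-01-01T00:00:00Z', 'Justin Bieber', ...) 25 127  -- matches the rolled-up table
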
  12. PARTITION DATA

      timestamp             page           language  city     country  ...  added  deleted
      2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA           25     127
      2011-01-01T01:00:00Z  Justin Bieber  en        SF       USA           32     45
      2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA           60     186
      2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA           12     53

    ‣ Shard data by time ‣ Immutable blocks of data called “segments”
      Segment 2011-01-01T00/2011-01-01T01
      Segment 2011-01-01T01/2011-01-01T02
      Segment 2011-01-01T02/2011-01-01T03
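
Sharding the rolled-up rows into time-based segments is then a simple bucketing step; a sketch using the rows from this slide, with segment naming following the interval convention shown above:

    from collections import defaultdict

    rows = [
        ("2011-01-01T00:00:00Z", "Justin Bieber", "en", "SF",      "USA", 25, 127),
        ("2011-01-01T01:00:00Z", "Justin Bieber", "en", "SF",      "USA", 32, 45),
        ("2011-01-01T01:00:00Z", "Ke$ha",         "en", "Calgary", "CA",  60, 186),
        ("2011-01-01T02:00:00Z", "Ke$ha",         "en", "Calgary", "CA",  12, 53),
    ]

    segments = defaultdict(list)
    for row in rows:
        hour = int(row[0][11:13])
        interval = "2011-01-01T%02d/2011-01-01T%02d" % (hour, hour + 1)
        segments[interval].append(row)      # each interval becomes one immutable segment

    for interval, seg_rows in sorted(segments.items()):
        print("Segment", interval, "->", len(seg_rows), "rows")
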
  13. IMMUTABLE SEGMENTS ‣ Fundamental storage unit in Druid ‣

    No contention between reads and writes ‣ One thread scans one segment ‣ Multiple threads can access same underlying data
  14. COLUMN ORIENTATION

      timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
      2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
      2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18

    ‣ Scan/load only what you need ‣ Compression! ‣ Indexes!
  15. BITMAP INDICES ‣ Justin Bieber -> [0, 1, 2] -> [111000] ‣ Ke$ha -> [3, 4, 5] -> [000111] ‣ Justin Bieber OR Ke$ha -> [111111]

      timestamp             page           language  city     country  ...  added  deleted
      2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
      2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA           15     62
      2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA           32     45
      2011-01-01T01:00:00Z  Ke$ha          en        Calgary  CA           17     87
      2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA           43     99
      2011-01-01T02:00:00Z  Ke$ha          en        Calgary  CA           12     53
      ...
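
A sketch of the bitmap index on this slide, using plain Python integers as bitsets; Druid itself stores compressed bitmaps (e.g. Concise or Roaring), but the logic is the same:

    # One bitmap per dimension value: bit i is set if row i has that value.
    rows = ["Justin Bieber", "Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha", "Ke$ha"]

    bitmaps = {}
    for i, page in enumerate(rows):
        bitmaps[page] = bitmaps.get(page, 0) | (1 << i)

    def bits(bm, n=len(rows)):
        # Render row 0 first, matching the [111000] notation on the slide.
        return "".join("1" if bm & (1 << i) else "0" for i in range(n))

    print(bits(bitmaps["Justin Bieber"]))                     # 111000
    print(bits(bitmaps["Ke$ha"]))                             # 000111
    print(bits(bitmaps["Justin Bieber"] | bitmaps["Ke$ha"]))  # 111111  (OR of the two filters)
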
  16. DRUID - ARCHITECTURE ‣ Different node types (processes) to solve different problems ‣ Processes dedicated for: ‣ Historical data ‣ Ingestion ‣ Coordination ‣ Result merging [Architecture diagram: user queries -> Druid broker nodes -> Druid realtime workers and Druid historical nodes]
  17. DRUID ‣ Production ready ‣ Scale • 100+ trillion events • 3M+ events/s • 90% of queries < 1 second ‣ Growing Community • 150+ contributors • Many client libraries and UIs: R, Python, Perl, Node.js, Grafana, etc. • Used in production at numerous large and small organizations
  18. STREAMING DATA INTO DRUID ‣ Write-optimized data structure: hash map in heap ‣ Convert write-optimized -> read-optimized ‣ Read-optimized data structure: Druid segments ‣ Query data immediately [Diagram: events -> memory -> convert -> segment; queries hit both]
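
A rough conceptual model of the write-optimized to read-optimized handoff described here (not Druid's actual implementation): events accumulate in an in-heap map, are periodically persisted as immutable, sorted, segment-like structures, and queries consult both, so data is queryable immediately:

    class RealtimeIndex:
        """Toy model: a mutable in-memory buffer plus immutable 'segments'."""

        def __init__(self):
            self.buffer = {}        # write-optimized: (hour, page) -> added count
            self.segments = []      # read-optimized: sorted, immutable tuples of rows

        def ingest(self, hour, page, added):
            key = (hour, page)
            self.buffer[key] = self.buffer.get(key, 0) + added   # roll up on the fly

        def persist(self):
            # Convert the write-optimized buffer into an immutable, sorted segment.
            segment = sorted((h, p, a) for (h, p), a in self.buffer.items())
            self.segments.append(tuple(segment))
            self.buffer = {}

        def total_added(self, page):
            # Queries see both persisted segments and the in-memory buffer.
            from_segments = sum(a for seg in self.segments for (_, p, a) in seg if p == page)
            from_buffer = sum(a for (_, p), a in self.buffer.items() if p == page)
            return from_segments + from_buffer

    idx = RealtimeIndex()
    idx.ingest("2011-01-01T00", "Justin Bieber", 10)
    idx.ingest("2011-01-01T00", "Justin Bieber", 15)
    idx.persist()
    idx.ingest("2011-01-01T01", "Justin Bieber", 32)
    print(idx.total_added("Justin Bieber"))   # 57, including the not-yet-persisted event
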
  19. DRUID INGESTION ‣ Must have denormalized, flat data ‣ Druid

    cannot do stateful processing at ingestion time ‣ …like stream-stream joins ‣ …or user session reconstruction ‣ …or a bunch of other useful things! ‣ Many Druid users need an ETL pipeline
  20. DRUID REAL-TIME INGESTION [Diagram: data source -> Druid realtime workers (immediate) -> Druid historical nodes (periodic hand-off); Druid broker nodes serve user queries]
  21. DRUID REAL-TIME INGESTION [Diagram: data source -> stream processor -> Druid realtime workers (immediate) -> Druid historical nodes (periodic hand-off); Druid broker nodes serve user queries]
  22. AN EXAMPLE: ONLINE ADS ‣ Input data: impressions, clicks ‣ Output: enhanced impressions ‣ Steps ‣ Join impressions with clicks -> “clicks” ‣ Look up IDs to names -> “advertiser”, “publisher”, … ‣ Geocode -> “country”, … ‣ Lots of other additions
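
A sketch of the enrichment steps listed above, with tiny in-memory lookup tables standing in for whatever the real pipeline would use (a database, a changelog-backed store, a geo-IP library); the join with clicks is sketched separately after the pipeline slides below:

    # Hypothetical lookup tables, purely for illustration.
    ADVERTISER_NAMES = {17: "google.com"}
    PUBLISHER_NAMES = {42: "bieberfever.com"}
    IP_TO_COUNTRY = {"8.8.": "USA"}          # toy prefix table, not a real geocoder

    def enhance(impression):
        out = dict(impression)
        out["advertiser"] = ADVERTISER_NAMES.get(impression["advertiser_id"], "unknown")
        out["publisher"] = PUBLISHER_NAMES.get(impression["publisher_id"], "unknown")
        out["country"] = next((c for prefix, c in IP_TO_COUNTRY.items()
                               if impression["ip"].startswith(prefix)), "unknown")
        return out

    print(enhance({"key": "186bd591-9442-48f0", "advertiser_id": 17,
                   "publisher_id": 42, "ip": "8.8.8.8"}))
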
  23. PIPELINE

      Impressions
        Partition 0
          {key: 186bd591-9442-48f0, publisher: foo, …}
          {key: 9b5e2cd2-a8ac-4232, publisher: qux, …}
          …
        Partition 1
          {key: 1079026c-7151-4871, publisher: baz, …}
          …
      Clicks
        Partition 0
          …
        Partition 1
          {key: 186bd591-9442-48f0}
          …
  24. PIPELINE

      Shuffled
        Partition 0
          {type: impression, key: 186bd591-9442-48f0, publisher: foo, …}
          {type: impression, key: 1079026c-7151-4871, publisher: baz, …}
          {type: click, key: 186bd591-9442-48f0}
          …
        Partition 1
          {type: impression, key: 9b5e2cd2-a8ac-4232, publisher: qux, …}
          …
  25. PIPELINE

      Joined
        Partition 0
          {key: 186bd591-9442-48f0, is_clicked: true, publisher: foo, …}
          {key: 1079026c-7151-4871, is_clicked: false, publisher: baz, …}
          …
        Partition 1
          {key: 9b5e2cd2-a8ac-4232, is_clicked: false, publisher: qux, …}
          …
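
The shuffle-and-join in slides 23-25 boils down to: route impressions and clicks by key so records with the same key land in the same partition, then, within each partition, mark an impression as clicked if a click with the same key arrived. A single-process sketch (a real stream processor such as Samza would do this per partition and within a join window):

    from collections import defaultdict

    impressions = [
        {"key": "186bd591-9442-48f0", "publisher": "foo"},
        {"key": "9b5e2cd2-a8ac-4232", "publisher": "qux"},
        {"key": "1079026c-7151-4871", "publisher": "baz"},
    ]
    clicks = [{"key": "186bd591-9442-48f0"}]

    NUM_PARTITIONS = 2

    # Shuffle: records with the same key always land in the same partition.
    partitions = defaultdict(list)
    for record in ([dict(r, type="impression") for r in impressions] +
                   [dict(r, type="click") for r in clicks]):
        partitions[hash(record["key"]) % NUM_PARTITIONS].append(record)

    # Join within each partition: an impression is clicked if a click with
    # the same key was shuffled to the same place.
    joined = []
    for records in partitions.values():
        clicked_keys = {r["key"] for r in records if r["type"] == "click"}
        for r in records:
            if r["type"] == "impression":
                joined.append({"key": r["key"], "publisher": r["publisher"],
                               "is_clicked": r["key"] in clicked_keys})

    for row in joined:
        print(row)
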
  26. WHY REPROCESS DATA? ‣ Bugs in processing code ‣ Imprecise

    streaming operations ‣ …like using short join windows ‣ Limitations of current software ‣ …Kafka, Samza can generate duplicate messages ‣ …Druid streaming ingestion is best-effort
  27. LAMBDA ARCHITECTURES ‣ Hybrid batch/streaming data pipeline ‣ Batch technologies

    • Hadoop MapReduce • Spark ‣ Streaming technologies • Samza • Storm • Spark Streaming
  28. LAMBDA ARCHITECTURES ‣ Advantages? • Works as advertised • Works

    with a huge variety of open software • Druid supports batch-replace-by-time-range through Hadoop
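
On "batch-replace-by-time-range": a Druid Hadoop indexing task replaces whatever segments currently cover the intervals listed in its granularitySpec. A heavily trimmed sketch of such a task spec, with datasource, paths, and columns made up for illustration (see the Druid batch ingestion docs for the full format):

    # Trimmed sketch of a Druid "index_hadoop" task: re-indexes one day of
    # raw data from HDFS and replaces the existing segments for that interval.
    hadoop_index_task = {
        "type": "index_hadoop",
        "spec": {
            "dataSchema": {
                "dataSource": "impressions",
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                    # Segments covering this interval are replaced by the new run.
                    "intervals": ["2015-01-01/2015-01-02"],
                },
                # parser, dimensions, and metricsSpec omitted for brevity
            },
            "ioConfig": {
                "type": "hadoop",
                "inputSpec": {"type": "static",
                              "paths": "hdfs:///data/impressions/2015-01-01/*"},
            },
        },
    }
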
  29. LAMBDA ARCHITECTURES ‣ Disadvantages? ‣ Need code to run on

    two very different systems ‣ Maintaining two codebases is perilous ‣ …productivity loss ‣ …code drift ‣ …difficulty training new developers
  30. KAPPA ARCHITECTURE ‣ Pure streaming ‣ Reprocess data by replaying

    the input stream ‣ Doesn’t require operating two systems ‣ Doesn’t overcome software limitations ‣ http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
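
"Reprocessing by replaying the input stream" concretely means starting a fresh consumer at the beginning of the retained Kafka topic and rebuilding downstream state with the corrected logic. A sketch using the kafka-python client; topic, broker address, and group id are illustrative:

    from kafka import KafkaConsumer

    def reprocess(raw_event):
        # Placeholder for the (fixed) processing logic that rebuilds state.
        pass

    # A fresh consumer group with auto_offset_reset="earliest" starts from the
    # beginning of the topic, i.e. it replays the full retained stream.
    consumer = KafkaConsumer(
        "impressions",
        bootstrap_servers=["localhost:9092"],
        group_id="reprocess-2016-01-23",    # new group => no committed offsets
        auto_offset_reset="earliest",
        enable_auto_commit=False,
    )

    for message in consumer:
        reprocess(message.value)   # feed every historical event through the new logic
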
  31. CORNERSTONES ‣ Druid - druid.io - @druidio ‣ Samza

    - samza.apache.org - @samzastream ‣ Kafka - kafka.apache.org - @apachekafka ‣ Hadoop - hadoop.apache.org
  32. TAKEAWAYS ‣ Consider Kafka for making your streams available

    ‣ Consider Samza for streaming data processing ‣ Consider Druid for interactive exploration of streams ‣ Have a reprocessing strategy if you’re interested in historical data