Indexing Timeseries with Samza and Druid

Druid
May 09, 2015

Transcript

  1. INDEXING TIME SERIES STREAMS
    WITH SAMZA AND DRUID
    GIAN MERLINO · DRUID DEVELOPER · SAMZA USER

  2. [No text on this slide]

  3. THE PROBLEM

  4. THE PROBLEM

  5. THE PROBLEM

  6. THE PROBLEM
    ‣ Business intelligence for ad-tech
    ‣ Arbitrary and interactive exploration
    ‣ Multi-tenancy: thousands of concurrent users
    ‣ Recency: explore current data, alert on major changes
    ‣ Efficiency: each event is individually very low-value

  7. THE PROBLEM
    ‣ Questions lead to more questions
    ‣ Interested not just in what happened, but why
    ‣ Dig into the dataset using filters, aggregates, and comparisons
    ‣ All interesting queries cannot be determined upfront

  8. EXPLORING TIME SERIES

  9. DRUID
    ‣ Druid project started in 2011, went open source in 2012
    ‣ Designed for low latency ingestion and ad-hoc aggregations
    ‣ Designed for keeping around a lot of history (years are ok)
    ‣ Growing Community
    • ~90 contributors
    • Used in production at numerous large and small organizations

  10. TIME SERIES DATA
    ‣ Unifying feature: some notion of “event timestamp”
    ‣ Questions are often time-oriented
    ‣ Monitoring: Plot CPU usage over the past 3 days, in 5-min buckets
    ‣ Web analytics: How many unique users today?
    ‣ BI: Which accounts had large revenue deltas this week over last week?
    ‣ Performance: What was the 99%ile latency over the past hour?
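
The monitoring question above (CPU usage in 5-minute buckets) comes down to truncating each event timestamp to the start of its bucket and aggregating per bucket. A minimal sketch, assuming events arrive as `(timestamp, value)` pairs with timezone-aware UTC timestamps; the function names are illustrative and are not part of Druid:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
BUCKET = timedelta(minutes=5)

def floor_to_bucket(ts, bucket=BUCKET):
    """Truncate a timezone-aware timestamp down to the start of its bucket."""
    return EPOCH + ((ts - EPOCH) // bucket) * bucket

def average_by_bucket(events, bucket=BUCKET):
    """Average the values of (timestamp, value) events per time bucket."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [total, count]
    for ts, value in events:
        acc = sums[floor_to_bucket(ts, bucket)]
        acc[0] += value
        acc[1] += 1
    return {start: total / count for start, (total, count) in sorted(sums.items())}
```

The other questions on the slide (unique users, revenue deltas, latency percentiles) follow the same shape with a different aggregator per bucket.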

  11. 2014
    REALTIME INGESTION
    >500K EVENTS / SECOND AVERAGE
    >1M EVENTS / SECOND PEAK
    10 – 100K EVENTS / SECOND / CORE
    DRUID IN PRODUCTION

  12. 2014
    [Figure: query latency percentiles. Panels: 90%ile, 95%ile, 99%ile. X-axis: time (Feb 03 to Feb 24). Y-axis: query time (seconds). One series per datasource (a to h).]
    QUERY LATENCY (500MS AVERAGE)
    90% < 1S 95% < 5S 99% < 10S
    DRUID IN PRODUCTION

  13. ONE WEIRD TRICK FOR FAST QUERIES
    ‣ Doctors hate it!
    ‣ Time-partitioned immutable shards
    ‣ Global index of time interval to shards
    ‣ Each shard contains indexes for fast boolean filtering
    ‣ Each shard is column-oriented and compressed
    ‣ Compute partial results locally and merge hierarchically
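
The ideas on this slide can be sketched in a few lines. This is a toy model, not Druid's implementation: Python sets stand in for compressed bitmap indexes, a list scan stands in for the global interval index, and the `Shard` class, its columns (`publisher`, `clicks`), and `query` are hypothetical names. It also assumes query intervals line up with shard boundaries.

```python
class Shard:
    """An immutable shard covering [start, end), stored column by column."""

    def __init__(self, start, end, rows):
        self.start, self.end = start, end
        # Column-oriented layout: one tuple per column instead of one dict per row.
        self.publisher = tuple(r["publisher"] for r in rows)
        self.clicks = tuple(r["clicks"] for r in rows)
        # Inverted index: dimension value -> ids of the rows holding that value.
        index = {}
        for row_id, value in enumerate(self.publisher):
            index.setdefault(value, set()).add(row_id)
        self.publisher_index = {v: frozenset(ids) for v, ids in index.items()}

    def partial_sum(self, publishers):
        """Sum clicks over rows matching publisher = p1 OR p2 OR ..."""
        row_ids = set()
        for p in publishers:
            row_ids |= self.publisher_index.get(p, frozenset())
        return sum(self.clicks[i] for i in row_ids)


def query(shards, start, end, publishers):
    """Pick the shards overlapping [start, end), then merge their partial sums."""
    hits = [s for s in shards if s.start < end and start < s.end]
    return sum(s.partial_sum(publishers) for s in hits)
```

Because each shard is immutable and answers only for its own time range, partial results can be computed independently and merged in any order.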

  14. DRUID INGESTION
    ‣ Must have denormalized, flat data
    ‣ Druid cannot do stateful processing at ingestion time
    ‣ …like stream-stream joins
    ‣ …or user session reconstruction
    ‣ …or a bunch of other useful things!
    ‣ Many Druid users need an ETL pipeline

  15. STREAMING DATA PIPELINES

  16. OUR GOALS
    ‣ Input data: impressions, clicks, ID-to-name mappings
    ‣ Output: enhanced impressions
    ‣ Steps
    ‣ Join impressions with clicks -> “is_clicked”
    ‣ Look up IDs to names -> “publisher_name”, …
    ‣ Dissect user agent -> “browser”, “os”, …
    ‣ Lots of other additions
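
The lookup and user-agent steps above are stateless per-event transformations. A minimal sketch, assuming each impression carries a `publisher_id` and a `user_agent` field (both field names, the mapping table, and the substring-based classifier are hypothetical; a real pipeline would use a proper user-agent parser):

```python
PUBLISHER_NAMES = {17: "foo", 42: "qux"}  # hypothetical ID-to-name mapping

def dissect_user_agent(user_agent):
    """Toy classifier based on substring checks, for illustration only."""
    if "Firefox" in user_agent:
        browser = "Firefox"
    elif "Chrome" in user_agent:
        browser = "Chrome"
    else:
        browser = "other"
    if "Windows" in user_agent:
        os_name = "Windows"
    elif "Mac OS X" in user_agent:
        os_name = "Mac OS X"
    else:
        os_name = "other"
    return browser, os_name

def enhance(impression, names=PUBLISHER_NAMES):
    """Return a copy of the impression with the derived fields added."""
    out = dict(impression)
    out["publisher_name"] = names.get(impression["publisher_id"], "unknown")
    out["browser"], out["os"] = dissect_user_agent(impression.get("user_agent", ""))
    return out
```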

  17. PIPELINE
    Impressions
    Clicks
    Druid
    ?

  18. PIPELINE
    Impressions
    Partition 0
    {key: 186bd591-9442-48f0, publisher: foo, …}
    {key: 9b5e2cd2-a8ac-4232, publisher: qux, …}

    Partition 1
    {key: 1079026c-7151-4871, publisher: baz, …}

    Clicks
    Partition 0

    Partition 1
    {key: 186bd591-9442-48f0}

  19. PIPELINE
    Impressions
    Clicks
    Druid

  20. PIPELINE
    Impressions
    Clicks
    Shuffled
    Shuffle
    Druid

  21. PIPELINE
    Shuffled
    Partition 0
    {type: impression, key: 186bd591-9442-48f0, publisher: foo, …}
    {type: impression, key: 1079026c-7151-4871, publisher: baz, …}
    {type: click, key: 186bd591-9442-48f0}

    Partition 1
    {type: impression, key: 9b5e2cd2-a8ac-4232, publisher: qux, …}
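
The shuffle step can be sketched as repartitioning both input streams by a hash of the join key, so that an impression and its click always land in the same partition. This is a sketch under assumptions: CRC32 is chosen here only because it is stable across runs, it is not necessarily the partitioner used in the talk, and it will not reproduce the exact partition assignment shown on the slide.

```python
import zlib

def partition_for(key, num_partitions):
    """Stable hash partitioner: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def shuffle(impressions, clicks, num_partitions=2):
    """Merge both streams into one, tagged by type and partitioned by key."""
    shuffled = [[] for _ in range(num_partitions)]
    for event_type, stream in (("impression", impressions), ("click", clicks)):
        for event in stream:
            tagged = {"type": event_type, **event}
            shuffled[partition_for(event["key"], num_partitions)].append(tagged)
    return shuffled
```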

  22. PIPELINE
    Impressions
    Clicks
    Shuffled
    Shuffle
    Druid

  23. PIPELINE
    Impressions
    Clicks
    Shuffled
    Joined
    Shuffle
    Join
    Druid

  24. PIPELINE
    Joined
    Partition 0
    {key: 186bd591-9442-48f0, is_clicked: true, publisher: foo, …}
    {key: 1079026c-7151-4871, is_clicked: false, publisher: baz, …}

    Partition 1
    {key: 9b5e2cd2-a8ac-4232, is_clicked: false, publisher: qux, …}
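
Once the streams are co-partitioned, the join needs only local state per partition. A minimal sketch of a windowed join task: impressions are buffered by key, a matching click flips `is_clicked`, and closing the window emits the buffered records. The class and method names are illustrative and this is plain Python, not Samza's API. Clicks that arrive before their impression, or after the window has closed, are missed, which is one source of imprecision in a streaming join.

```python
class JoinTask:
    """Joins impressions with clicks for one partition of the shuffled stream."""

    def __init__(self):
        self.pending = {}  # key -> impression waiting for a possible click

    def process(self, event):
        """Handle one event from the shuffled stream."""
        if event["type"] == "impression":
            impression = {k: v for k, v in event.items() if k != "type"}
            impression["is_clicked"] = False
            self.pending[event["key"]] = impression
        elif event["type"] == "click" and event["key"] in self.pending:
            self.pending[event["key"]]["is_clicked"] = True

    def window(self):
        """Close the join window: emit all buffered impressions, clear state."""
        joined = list(self.pending.values())
        self.pending.clear()
        return joined
```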

  25. PIPELINE
    Impressions
    Clicks
    Shuffled
    Joined
    Shuffle
    Join
    Druid

  26. PIPELINE
    Impressions
    Clicks
    Shuffled
    Joined
    Shuffle
    Join
    Enhance & Output
    Druid

  27. ALTERNATIVE PIPELINE
    Impressions
    Clicks
    Shuffled
    Joined
    Shuffle
    Join
    Enhance
    Druid
    Enhanced

  28. OPERATIONS

  29. NICE THINGS ABOUT SAMZA
    ‣ Multi-tenancy: one main thread per container
    ‣ Robustness: isolated containers limit slowness and failure
    ‣ Visibility
    ‣ Multistage jobs, lots of metrics per stage
    ‣ Can inspect the message queue in Kafka
    ‣ State is simple
    ‣ Logging and restoring handled for you
    ‣ Single-threaded programming is nice

  30. THINGS TO WATCH OUT FOR
    ‣ Multi-tenancy issues on Kafka
    ‣ Samza state size (affects restore times; a few GB seems OK)
    ‣ Serialization time can add up
    ‣ Default task.commit.ms is 60s
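
One plausible reason the commit interval is worth watching: a task that restarts resumes from its last committed offset, so everything processed since that commit is processed again. The toy simulation below illustrates this under simplifying assumptions. It commits every N messages rather than on a timer (`task.commit.ms` is time-based), and `run_task` is a hypothetical helper, not Samza code.

```python
def run_task(messages, commit_every, crash_at=None, start_offset=0):
    """Process messages from start_offset; return (outputs, committed offset)."""
    outputs, committed = [], start_offset
    for offset in range(start_offset, len(messages)):
        if crash_at is not None and offset == crash_at:
            return outputs, committed  # crash before processing this offset
        outputs.append(messages[offset])
        if (offset + 1) % commit_every == 0:
            committed = offset + 1  # checkpoint: next offset to read
    return outputs, committed

messages = ["m0", "m1", "m2", "m3", "m4", "m5"]

# First run commits after m3, processes m4, then crashes before m5.
first, checkpoint = run_task(messages, commit_every=4, crash_at=5)

# The restart resumes from the checkpoint, so m4 is emitted a second time.
second, _ = run_task(messages, commit_every=4, start_offset=checkpoint)
duplicates = [m for m in second if m in first]
```

A shorter commit interval shrinks the replayed range at the cost of more frequent checkpoint writes.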

  31. MONITORING
    ‣ Kafka partition availability
    ‣ Kafka disk usage
    ‣ Samza consumer offsets
    ‣ Druid drop rate
    ‣ Druid query latency
    ‣ System metrics: CPU, network, disk
    ‣ Event counts at various stages
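
Two of the checks above reduce to simple arithmetic over numbers the systems already expose. A sketch with hypothetical helper names: consumer lag is the log end offset minus the committed offset per partition, and comparing event counts between consecutive stages shows where events are being lost.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Messages not yet consumed, per partition."""
    return {partition: end - committed_offsets.get(partition, 0)
            for partition, end in log_end_offsets.items()}

def stage_losses(stage_counts):
    """Events lost (positive) or gained (negative) between consecutive stages."""
    names = list(stage_counts)
    return {f"{a} -> {b}": stage_counts[a] - stage_counts[b]
            for a, b in zip(names, names[1:])}
```

A gain between stages (a negative loss) can point to duplicate messages.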

  32. STREAM METRICS

  33. STREAM METRICS

  34. REPROCESSING

  35. WHY REPROCESS DATA?
    ‣ Bugs in processing code
    ‣ Imprecise streaming operations
    ‣ …like using short join windows
    ‣ Software limitations
    ‣ …Kafka and Samza can generate duplicate messages
    ‣ …Druid streaming ingestion is best-effort

  36. LAMBDA ARCHITECTURES
    ‣ Hybrid batch/streaming data pipeline
    ‣ Batch technologies
    • Hadoop MapReduce
    • Spark
    ‣ Streaming technologies
    • Samza
    • Storm
    • Spark Streaming

  37. LAMBDA ARCHITECTURES
    ‣ Advantages?
    • Works as advertised
    • Works with a huge variety of open software
    • Druid supports batch-replace-by-time-range through Hadoop
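
Replace-by-time-range is what lets the batch layer correct what the streaming layer wrote. A toy model of the idea, assuming segments are keyed by `(start, end)` intervals; `replace_range` is a hypothetical function and does not reflect how Druid stores or versions segments internally:

```python
def replace_range(segments, start, end, rebuilt):
    """Swap every segment fully inside [start, end) for the batch-built ones."""
    kept = {interval: data for interval, data in segments.items()
            if not (start <= interval[0] and interval[1] <= end)}
    kept.update(rebuilt)
    return dict(sorted(kept.items()))
```

Segments outside the replaced range, such as the most recent streaming data, are left untouched.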

  38. LAMBDA ARCHITECTURES
    ‣ Disadvantages?
    ‣ Need code to run on two very different systems
    ‣ Maintaining two codebases is perilous
    ‣ …productivity loss
    ‣ …code drift
    ‣ …difficulty training new developers

  39. LAMBDA ARCHITECTURES

  40. KAPPA ARCHITECTURE
    ‣ Pure streaming
    ‣ Reprocess data by replaying the input stream
    ‣ Doesn’t require operating two systems
    ‣ Doesn’t overcome software limitations
    ‣ I don’t have much experience with this
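
Reprocessing by replay can be sketched as running a corrected version of the same streaming code over the retained input log from the beginning. The example assumes the log is still fully retained; `run_job`, `buggy`, and `fixed` are hypothetical names used only for illustration.

```python
def run_job(log, transform, from_offset=0):
    """Apply one version of the processing code to the retained input log."""
    return [transform(event) for event in log[from_offset:]]

def buggy(event):
    """Hypothetical first version: forgets to normalize the publisher name."""
    return {"publisher": event["publisher"]}

def fixed(event):
    """Corrected version of the same transformation."""
    return {"publisher": event["publisher"].lower()}

log = [{"publisher": "Foo"}, {"publisher": "qux"}]
original_output = run_job(log, buggy)
reprocessed_output = run_job(log, fixed)  # replay from offset 0 with new code
```

Only one codebase is involved, but the replay inherits the streaming system's own limitations.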

  41. DO TRY THIS AT HOME

  42. 2013
    CORNERSTONES
    ‣ Druid - druid.io - @druidio
    ‣ Samza - samza.apache.org - @samzastream
    ‣ Kafka - kafka.apache.org - @apachekafka

  43. GLUE
    Tranquility
    Camus / Secor
    Druid Hadoop indexer

  44. GLUE
    Camus / Secor
    Druid Hadoop indexer
    druid-kafka-eight

  45. TAKEAWAYS
    ‣ Consider Kafka for making your streams available
    ‣ Consider Samza for streaming data integration
    ‣ Consider Druid for interactive exploration of streams
    ‣ Metrics, metrics, metrics
    ‣ Have a reprocessing strategy if you’re interested in historical data

  46. THANK YOU
