Streaming Analytics at 300 billion events/day with Kafka, Samza, and Druid

Druid
March 29, 2016

Transcript

  1. STREAMING ANALYTICS @ SCALE
    WITH DRUID / KAFKA / SAMZA
    XAVIER LÉAUTÉ
    HEAD OF BACKEND @ METAMARKETS
    DRUID COMMITTER
    [email protected]


  2. THE OBLIGATORY “WE’RE HIRING TOO”


  3. 2015
    A DELUGE OF DATA
    ‣ Ingest auction data from realtime ad exchanges
    ‣ Process, join, and aggregate data in realtime
    ‣ Make data available for interactive exploration, slicing, & dicing

    ‣ 100 billion transactions / day – 2.5 PB / month inbound
    ‣ multiple events per transaction

    ‣ 300 billion events per day into Druid
    ‣ all aggregated and queryable in real-time


  4. DATA PIPELINE
    (diagram: ELB, HTTP, Kafka, S3)


  5. 2015
    RECEIVING DATA
    ‣ Single Kafka cluster for receiving data
    ‣ Failure is not an option
    ‣ Data schema can change at any time

    ‣ Keep it very simple: ingest, timestamp, batch, compress data upfront
    ‣ Distribute data to available partitions
    ‣ 7-day retention so we can sleep tight

    ‣ How do we scale it?
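
    A minimal sketch of the front-door producer described above, using the
    stock Java Kafka client from Scala: compress and batch up front, prefix
    each payload with its receive timestamp, and let the default partitioner
    spread keyless records across the available partitions. Broker address and
    topic name are made up; the 7-day retention itself is a topic-level
    setting (retention.ms), not producer code.

        import java.util.Properties
        import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

        object SimpleIngest {
          def main(args: Array[String]): Unit = {
            val props = new Properties()
            props.put("bootstrap.servers", "kafka:9092") // hypothetical brokers
            props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
            props.put("compression.type", "gzip")           // compress batches up front
            props.put("batch.size", (512 * 1024).toString)  // big batches, fewer requests
            props.put("linger.ms", "100")                   // wait up to 100 ms to fill a batch

            val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)

            // No key: keyless records get spread across available partitions.
            // Prefix the opaque payload with the receive time so downstream
            // stages can order and prune data even if the schema changes.
            def send(topic: String, payload: Array[Byte]): Unit = {
              val timestamped = s"${System.currentTimeMillis()}\t".getBytes ++ payload
              producer.send(new ProducerRecord[Array[Byte], Array[Byte]](topic, null, timestamped))
            }

            send("inbound-events", """{"auctionId":42}""".getBytes)
            producer.close()
          }
        }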


  6. SCALING KAFKA
    ‣ Add nodes, and increase partition count
    ‣ Problem: Kafka would rebalance all the data, saturating the network

    ‣ Solution 1: throttle replication (possible at network level, but hard)
    ‣ Solution 2: create groups of nodes, with pre-assigned partitions
    ‣ now “push-button” with Yahoo Kafka Manager
    ‣ Hint: Kafka should support this out of the box for easy scaling
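
    A hedged sketch of the pre-assigned groups idea (topic name, broker ids,
    and counts are invented): generate a reassignment plan, in the JSON format
    kafka-reassign-partitions.sh accepts, that keeps every partition inside
    one broker group, so adding a new group adds partitions without shuffling
    existing data.

        // Pin each block of partitions to a fixed group of brokers.
        object GroupedAssignment {
          def main(args: Array[String]): Unit = {
            val brokerGroups       = Seq(Seq(1, 2, 3), Seq(4, 5, 6)) // one group per expansion
            val partitionsPerGroup = 8
            val replicationFactor  = 2

            val assignments =
              for {
                (group, g) <- brokerGroups.zipWithIndex
                p          <- 0 until partitionsPerGroup
                partition   = g * partitionsPerGroup + p
                // replicas for a partition never leave its broker group
                replicas    = (0 until replicationFactor).map(r => group((p + r) % group.size))
              } yield s"""{"topic":"inbound-events","partition":$partition,"replicas":[${replicas.mkString(",")}]}"""

            // feed the output to kafka-reassign-partitions.sh --execute
            println(s"""{"version":1,"partitions":[${assignments.mkString(",")}]}""")
          }
        }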


  7. TYPICAL ROUND ROBIN ASSIGNMENT


  8. SCALABLE ASSIGNMENT


  9. DEALING WITH FAILURE
    ‣ Node / Disk failures happen all the time
    ‣ Kafka retention is based on time
    ‣ Problem: Kafka does not understand time

    ‣ What happens on node failure?
    ‣ Replace a node – replicate all data – data is now timestamped today
    ‣ Replicated data won’t get pruned for another week -> requires 2x disk capacity

    (otherwise we need to go clean up segments by hand, not fun!)

    ‣ Looking forward to Kafka 0.10.1 (KIP-33) to fix this
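
    For reference, retention itself is plain topic configuration. This sketch
    sets it with the AdminClient API from later Kafka releases (it did not
    exist at the time of this talk), along with CreateTime timestamps, which
    are what KIP-33's time-based indexes prune by; the broker address and
    topic name are assumptions.

        import java.util.{Arrays, Collections, Properties}
        import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, AlterConfigOp, ConfigEntry}
        import org.apache.kafka.common.config.ConfigResource

        object RetentionConfig {
          def main(args: Array[String]): Unit = {
            val props = new Properties()
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")
            val admin = AdminClient.create(props)

            val topic = new ConfigResource(ConfigResource.Type.TOPIC, "inbound-events")
            val ops: java.util.Collection[AlterConfigOp] = Arrays.asList(
              // 7 days, enforced against record timestamps once time indexes exist
              new AlterConfigOp(new ConfigEntry("retention.ms", (7L * 24 * 60 * 60 * 1000).toString), AlterConfigOp.OpType.SET),
              new AlterConfigOp(new ConfigEntry("message.timestamp.type", "CreateTime"), AlterConfigOp.OpType.SET)
            )
            admin.incrementalAlterConfigs(Collections.singletonMap(topic, ops)).all().get()
            admin.close()
          }
        }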


  10. WE HAVE ALL THIS DATA, WHAT NOW?


  11. TYPICAL DATA PIPELINE
    ‣ auction feed ~ auction data + bids
    ‣ impression feed ~ which auction ids got shown
    ‣ click feed ~ which auction ids resulted in a click
    ‣ Join feeds based on auction id
    ‣ Maybe some lookups
    ‣ Business logic to derive dozens of metrics and dimensions
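
    A compressed sketch of such a join as a Samza task (feed names, the
    Map-shaped messages, and the store name are illustrative; inputs are
    assumed to be partitioned by auction id so matching events reach the same
    task):

        import org.apache.samza.config.Config
        import org.apache.samza.storage.kv.KeyValueStore
        import org.apache.samza.system.{IncomingMessageEnvelope, OutgoingMessageEnvelope, SystemStream}
        import org.apache.samza.task._

        class AuctionJoinTask extends StreamTask with InitableTask {
          private var store: KeyValueStore[String, java.util.Map[String, Object]] = _
          private val output = new SystemStream("kafka", "joined-events")

          override def init(config: Config, context: TaskContext): Unit =
            // local key-value store, backed by a compacted Kafka changelog topic
            store = context.getStore("auction-state")
              .asInstanceOf[KeyValueStore[String, java.util.Map[String, Object]]]

          override def process(envelope: IncomingMessageEnvelope,
                               collector: MessageCollector,
                               coordinator: TaskCoordinator): Unit = {
            val event     = envelope.getMessage.asInstanceOf[java.util.Map[String, Object]]
            val auctionId = event.get("auctionId").toString

            envelope.getSystemStreamPartition.getStream match {
              case "auctions" =>
                store.put(auctionId, event)            // remember the auction + bids
              case "impressions" | "clicks" =>
                Option(store.get(auctionId)).foreach { auction =>
                  auction.putAll(event)                // join, then apply business logic
                  collector.send(new OutgoingMessageEnvelope(output, auctionId, auction))
                }
              case _ => () // ignore other feeds
            }
          }
        }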


  12. IDIOSYNCRATIC WORKLOAD
    ‣ Hundreds of heterogeneous feeds
    ‣ Join window of ~ 15-20 min
    ‣ Each client has slightly different workflow and complexity
    ‣ Workload changes all the time – because we like our clients
    ‣ Capacity planning is hard!


  13. WHAT WE LIKE ABOUT SAMZA
    ‣ great workload isolation – different pipelines in different JVMs
    ‣ heterogeneous workloads – network / disk / cpu isolation matters

    ‣ hard to gauge how many nodes we need
    ‣ we’re on AWS, let’s use many many small nodes!
    ‣ easy to provision, good isolation of resources
    ‣ one container per node, hundreds of little nodes chugging along


  14. HOLD ON, SAMZA NEEDS KAFKA
    ‣ many pipelines, big and small
    ‣ how to keep load on Kafka even?
    ‣ keep partition count a multiple of the broker count
    ‣ also keep the number of consumers per broker even

    -> # of Samza containers per topic must divide # of partitions
    ‣ but we also want to scale up and down easily

    ‣ make sure your partition count has lots of divisors!
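
    The divisor point is simple arithmetic; a quick sketch (the partition
    counts are just examples):

        // A partition count with many divisors gives many valid container counts.
        object PartitionMath {
          def divisors(n: Int): Seq[Int] = (1 to n).filter(n % _ == 0)

          def main(args: Array[String]): Unit =
            // 720 = 2^4 * 3^2 * 5 has 30 divisors; 1000 = 2^3 * 5^3 has only 16
            Seq(720, 1000).foreach { n =>
              println(s"$n: ${divisors(n).size} divisors -> ${divisors(n).mkString(", ")}")
            }
        }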


  15. STREAMING JOINS
    (diagram: process, shuffle, cogroup with state, process)


  16. MAINTAINING STATE
    ‣ Co-group state is stored locally and persisted to Kafka

    ‣ Two Kafka clusters:
    ‣ co-group cluster: disk intensive, uses log compaction
    ‣ messaging cluster: simple retention policy, network/cpu bound

    ‣ Separate clusters – both cheaper and easier to operate
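
    A sketch of how such a co-group changelog topic could be created on the
    state cluster with log compaction, again using the later AdminClient API;
    the cluster address, topic name, partition count, and replication factor
    are all invented.

        import java.util.{Collections, Properties}
        import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

        object ChangelogTopic {
          def main(args: Array[String]): Unit = {
            val props = new Properties()
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "state-kafka:9092")
            val admin = AdminClient.create(props)

            // compaction keeps only the latest value per key, so the store can
            // always be restored from a bounded amount of log
            val topic = new NewTopic("auction-state-changelog", 64, 2.toShort)
              .configs(Collections.singletonMap("cleanup.policy", "compact"))

            admin.createTopics(Collections.singletonList(topic)).all().get()
            admin.close()
          }
        }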


  17. CHALLENGES
    ‣ On failure Samza restores from Kafka (but it takes time)
    ‣ Intermediate processing topics use keyed messages
    • cannot use the scaling technique we use for inbound data
    • re-balancing partitions would require massive over-provisioning
    • currently solved by new cluster and moving pipelines (loses state)
    ‣ magnetic storage works well if consumers are not lagging
    • Disk seeks become a problem when consumers start falling behind
    • SSDs are the way to go for high-throughput topics w/ multiple consumers


  18. METRICS


  19. COLLECTING METRICS
    ‣ Monitor latencies at every stage
    ‣ Identify bottlenecks as they happen
    ‣ All our services incorporate the same metrics collector
    ‣ JVM (heap, gc) & System metrics (cpu, network, disk)
    ‣ Application level metrics
    • Time spent consuming / processing / producing at each stage
    • Request latencies, error rates
    • Amount of data processed
    • Consumer lag in messages + time
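
    A toy version of such a collector: sample one JVM metric and emit a
    timestamped JSON data point to a Kafka metrics topic (names are
    illustrative; a real collector samples gc, cpu, disk, network, and
    application timers on a schedule).

        import java.lang.management.ManagementFactory
        import java.util.Properties
        import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

        object MetricsEmitter {
          def main(args: Array[String]): Unit = {
            val props = new Properties()
            props.put("bootstrap.servers", "metrics-kafka:9092") // hypothetical
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
            val producer = new KafkaProducer[String, String](props)

            // one data point: current JVM heap usage for this service
            val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
            val point =
              s"""{"timestamp":${System.currentTimeMillis()},"service":"pipeline",""" +
              s""""metric":"jvm/heap/used","value":${heap.getUsed}}"""

            producer.send(new ProducerRecord("metrics", "pipeline", point))
            producer.close()
          }
        }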


  20. CONSUMING METRICS
    ‣ Put all the metrics into Druid to diagnose in realtime
    ‣ 15 billion metric data points per day
    ‣ Interactive exploration allows us to pinpoint problems quickly
    ‣ Granularity down to the individual query or server level
    ‣ Gives both the big picture and the detailed breakdown
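
    Diagnosing then means querying Druid directly. A hedged example of a
    timeseries query over the metrics datasource, asking for consumer lag per
    minute (datasource, dimension names, and the broker URL are assumptions):

        import java.net.URI
        import java.net.http.{HttpClient, HttpRequest, HttpResponse}

        object LagQuery {
          def main(args: Array[String]): Unit = {
            val query =
              """{
                |  "queryType": "timeseries",
                |  "dataSource": "metrics",
                |  "granularity": "minute",
                |  "intervals": ["2016-03-29T00:00/2016-03-29T01:00"],
                |  "filter": {"type": "selector", "dimension": "metric", "value": "consumer/lag"},
                |  "aggregations": [{"type": "longSum", "name": "lag", "fieldName": "value"}]
                |}""".stripMargin

            // POST the query to a Druid broker and print the JSON result
            val request = HttpRequest.newBuilder(URI.create("http://druid-broker:8082/druid/v2"))
              .header("Content-Type", "application/json")
              .POST(HttpRequest.BodyPublishers.ofString(query))
              .build()

            val response = HttpClient.newHttpClient.send(request, HttpResponse.BodyHandlers.ofString())
            println(response.body())
          }
        }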


  21. METRICS PIPELINE
    (diagram: HTTP front, Metrics Emitter, S3)


  22. (image-only slide)

  23. WHAT ABOUT BATCH?
    ‣ clients ask to re-process older data
    ‣ expanded join window for late events
    ‣ correct for the at-least-once semantics of Kafka / Samza
    ‣ stuff happens, fix realtime hiccups


  24. WRITE ONCE, RUN LAMBDA
    ‣ Scala DSL to write data join / transformation / aggregation once
    ‣ Can be expressed using different drivers
    ‣ as Storm topology, Samza job, or Cascading job

    ‣ MapReduce no good? Spark is what the cool kids use?
    ‣ No problem!
    ‣ Write new driver for Spark, replace the Cascading driver
    ‣ Hundreds of pipelines moved from Hadoop to Spark in one month!
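
    Not the Metamarkets DSL itself, but a toy illustration of the idea:
    describe the transformation once against a small trait, and let each
    driver interpret that description on a different engine.

        // the "write once" part: pipelines are programs against this algebra
        trait Pipeline[A] {
          def map[B](f: A => B): Pipeline[B]
          def filter(p: A => Boolean): Pipeline[A]
        }

        // one driver runs over plain collections (handy for tests); other
        // drivers would build a Samza job, Storm topology, or Spark DAG
        // from the same description
        final case class LocalPipeline[A](data: Seq[A]) extends Pipeline[A] {
          def map[B](f: A => B): Pipeline[B]       = LocalPipeline(data.map(f))
          def filter(p: A => Boolean): Pipeline[A] = LocalPipeline(data.filter(p))
        }

        object LambdaDemo {
          def main(args: Array[String]): Unit = {
            // business logic is written once...
            def logic(events: Pipeline[Int]): Pipeline[Int] =
              events.filter(_ % 2 == 0).map(_ * 10)

            // ...and run under whichever driver is in fashion
            println(logic(LocalPipeline(1 to 10)))
          }
        }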


  25. HOW CAN WE USE ALL THIS DATA?


  26. (image-only slide)

  27. PUT IT IN DRUID!
    ‣ Streams are indexed and queryable in realtime
    ‣ Batches replace realtime data at their own pace
    ‣ Druid makes this all happen seamlessly
    ‣ Data is immediately available for interactive queries
    ‣ Interactive slicing and dicing


  28. DRUID AT SCALE
    ‣ Production cluster runs several hundred nodes
    ‣ Several hundred terabytes of compressed + pre-aggregated data
    ‣ Typical event is complex: > 60 dimensions, > 20 metrics
    ‣ Realtime
    • > 3 million events per second on average
    • > 6 Gigabytes per second
    ‣ all aggregated on the fly
    ‣ Hundreds of concurrent requests – close to 1 million queries per day


  29. WHY DRUID MATTERS AT SCALE
    In one word
    DOWNTIME


  30. DRUID IS ALWAYS ON
    ‣ Replacing or upgrading nodes is seamless.
    ‣ Every component is stateless or fails over transparently
    ‣ Druid can always live upgrade from one version to the next
    ‣ Our current cluster has been running since 2011


  31. SCALING FOR PERFORMANCE
    ‣ Want things to be faster? Simply add nodes
    ‣ Rebalancing data to use additional capacity?
    ‣ Automatic, no downtime, no service degradation

    ‣ Druid data is memory mapped
    ‣ Want more in-memory? Just add RAM
    ‣ Want to save some $? Just add Disk


  32. SCALING FOR RELIABILITY
    ‣ Data replication is highly customizable
    ‣ Tiers of data can serve different latency needs
    ‣ Tiers can make replicas rack/datacenter-aware
    ‣ Queries can be prioritized across tiers


  33. DRUID DATA EVOLVES WITH YOU
    ‣ Data is chunked up in atomic units called segments
    ‣ Each segment represents a chunk of time (typically an hour or a day)
    ‣ New segments atomically replace older versions
    ‣ Batch data seamlessly replaces realtime data
    ‣ Schemas can evolve over time
    ‣ Druid handles mixed schemas transparently
    ‣ Supports schema-less ingestion


  34. WHY IS THIS IMPORTANT?
    ‣ Need to scale up and (sometimes) down dynamically
    ‣ Accommodate query load and data growth without service interruption
    ‣ Rebuilding a cluster from scratch would take several days
    ‣ Clients can add dimensions / metrics at will


  35. THANK YOU!
