Streaming Analytics at 300 billion events/day with Kafka, Samza, and Druid

STREAMING ANALYTICS @ SCALE WITH DRUID / KAFKA / SAMZA
XAVIER LÉAUTÉ, HEAD OF BACKEND @ METAMARKETS, DRUID COMMITTER

THE OBLIGATORY “WE’RE HIRING TOO”

2015 A DELUGE OF DATA
‣ Ingest auction data from realtime ad exchanges
‣ Process, join, and aggregate data in realtime
‣ Make data available for interactive exploration, slicing, & dicing
‣ 100 billion transactions / day – 2.5 PB / month inbound
‣ multiple events per transaction
‣ 300 billion events per day into Druid
‣ all aggregated and queryable in real-time

DATA PIPELINE (diagram; labeled components: HTTP, ELB, Kafka, S3)

2015 RECEIVING DATA
‣ Single Kafka cluster for receiving data
‣ Failure is not an option
‣ Data schema can change at any time
‣ Keep it very simple: ingest, timestamp, batch, compress data upfront
‣ Distribute data to available partitions
‣ 7 day retention so we can sleep tight
‣ How do we scale it?
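The "ingest, timestamp, batch, compress, distribute" steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the deck's actual receiver: the `Batcher` class, the `receivedAt` field, and the `send` callback are hypothetical stand-ins for a real Kafka producer. The payload is kept opaque, which is one way to tolerate a schema that can change at any time.

```python
import gzip
import json
import time
from itertools import cycle


class Batcher:
    """Timestamp raw payloads on receipt, batch them, gzip each batch, and
    hand batches to partitions in round-robin order (no message key needed)."""

    def __init__(self, partitions, batch_size, send):
        self._partitions = cycle(partitions)  # spread load over all partitions
        self._batch_size = batch_size
        self._send = send                     # hypothetical producer callback
        self._pending = []

    def receive(self, payload, now_ms=None):
        # The payload is never parsed, so schema changes cannot break ingestion.
        ts = int(time.time() * 1000) if now_ms is None else now_ms
        self._pending.append({"receivedAt": ts, "payload": payload})
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        blob = gzip.compress(json.dumps(self._pending).encode("utf-8"))
        self._send(next(self._partitions), blob)
        self._pending = []


sent = []
batcher = Batcher(partitions=[0, 1, 2], batch_size=2,
                  send=lambda partition, blob: sent.append((partition, blob)))
for i in range(5):
    batcher.receive('{"auction_id": %d}' % i, now_ms=1000 + i)
batcher.flush()  # push out the partial last batch

print([partition for partition, _ in sent])
```

Because inbound messages carry no key, any partition can take any batch, which is what makes the scaling approach on the next slides possible.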

SCALING KAFKA
‣ Add nodes, and increase partition count
‣ Problem: Kafka would rebalance all the data, saturating the network
‣ Solution 1: throttle replication (possible at network level, but hard)
‣ Solution 2: create groups of nodes, with pre-assigned partitions
‣ now “push-button” with Yahoo Kafka Manager
‣ Hint: Kafka should support this out of the box for easy scaling

TYPICAL ROUND ROBIN ASSIGNMENT

SCALABLE ASSIGNMENT
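The difference between the two assignment schemes can be illustrated with a toy model. This is a sketch under simplifying assumptions (one replica per partition, hypothetical broker names `b1`..`b6`, a full round-robin reassignment when the cluster grows); it does not call any Kafka API. It counts how many existing partitions would have to move when the cluster doubles in size.

```python
def round_robin(num_partitions, brokers):
    """Partition p lives on brokers[p % len(brokers)]."""
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}


def grouped(groups):
    """groups: list of (brokers, partition_count). Each group of brokers owns a
    fixed, pre-assigned range of partitions; a new group only gets new ones."""
    assignment, first = {}, 0
    for brokers, count in groups:
        for i in range(count):
            assignment[first + i] = brokers[i % len(brokers)]
        first += count
    return assignment


def moved(before, after):
    """Existing partitions whose broker changed, i.e. data that must be copied."""
    return sorted(p for p in before if before[p] != after[p])


# Round robin: growing from 3 to 6 brokers and from 12 to 24 partitions.
rr_before = round_robin(12, ["b1", "b2", "b3"])
rr_after = round_robin(24, ["b1", "b2", "b3", "b4", "b5", "b6"])

# Grouped: the new brokers form a second group that owns only the new partitions.
g_before = grouped([(["b1", "b2", "b3"], 12)])
g_after = grouped([(["b1", "b2", "b3"], 12), (["b4", "b5", "b6"], 12)])

print("round robin moves:", moved(rr_before, rr_after))
print("grouped moves:    ", moved(g_before, g_after))
```

In this model half of the existing partitions change brokers under round robin, while the grouped layout moves none, so no historical data crosses the network.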

DEALING WITH FAILURE
‣ Node / Disk failures happen all the time
‣ Kafka retention is based on time
‣ Problem: Kafka does not understand time
‣ What happens on node failure?
‣ Replace a node – replicate all data – data is now timestamped today
‣ Replicated data won’t get pruned for another week -> requires 2x disk capacity (otherwise we need to go clean up segments by hand, not fun!)
‣ Looking forward to Kafka 0.10.1 (KIP-33) to fix this
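The 2x disk requirement follows from simple bookkeeping, sketched below. The model assumes one unit of data per day, a 7 day retention, and retention decided by when a segment file was written on the broker rather than by event time; the function name and the daily granularity are illustrative choices, not Kafka internals.

```python
RETENTION_DAYS = 7


def disk_usage_after_replacement(days_to_simulate, daily_volume=1.0):
    """Model a replaced broker whose retention is driven by file timestamps.

    Day 0: the new broker re-replicates RETENTION_DAYS of history, and every
    copied segment is stamped with the copy time instead of the event time.
    Returns disk usage (in units of daily_volume) at the end of each day.
    """
    # Each entry: (day the segment file was written on this broker, size).
    segments = [(0, daily_volume)] * RETENTION_DAYS  # replicated backlog
    usage = []
    for day in range(1, days_to_simulate + 1):
        segments.append((day, daily_volume))  # fresh traffic for this day
        # Time-based retention: drop files written RETENTION_DAYS or more ago.
        segments = [(d, s) for d, s in segments if day - d < RETENTION_DAYS]
        usage.append(sum(s for _, s in segments))
    return usage


usage = disk_usage_after_replacement(10)
print(usage)
print("peak / steady state:", max(usage) / RETENTION_DAYS)
```

With daily granularity the replaced broker peaks at 13 days' worth of data against a steady state of 7, just under the 2x headroom the slide calls for.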

WE HAVE ALL THIS DATA WHAT NOW?

TYPICAL DATA PIPELINE
‣ auction feed ~ auction data + bids
‣ impression feed ~ which auction ids got shown
‣ click feed ~ which auction ids resulted in a click
‣ Join feeds based on auction id
‣ Maybe some lookups
‣ Business logic to derive dozens of metrics and dimensions
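A join of the three feeds on auction id might look like the sketch below. The field names (`auction_id`, `publisher`, `bids`) and the derived metrics are hypothetical examples; the real pipelines derive dozens of metrics and dimensions and run as streaming jobs rather than over in-memory lists.

```python
from collections import defaultdict


def join_feeds(auctions, impressions, clicks):
    """Co-group three feeds by auction id and derive a few example metrics."""
    grouped = defaultdict(lambda: {"auction": None, "impressions": 0, "clicks": 0})
    for a in auctions:
        grouped[a["auction_id"]]["auction"] = a
    for i in impressions:
        grouped[i["auction_id"]]["impressions"] += 1
    for c in clicks:
        grouped[c["auction_id"]]["clicks"] += 1

    rows = []
    for auction_id, g in grouped.items():
        if g["auction"] is None:
            continue  # impression or click without its auction: cannot enrich
        bids = g["auction"]["bids"]
        rows.append({
            "auction_id": auction_id,
            "publisher": g["auction"]["publisher"],  # dimension
            "bid_count": len(bids),                  # metric
            "max_bid": max(bids, default=0.0),       # metric
            "impressions": g["impressions"],         # metric
            "clicks": g["clicks"],                   # metric
        })
    return rows


auctions = [
    {"auction_id": "a1", "publisher": "pub-x", "bids": [0.5, 1.25]},
    {"auction_id": "a2", "publisher": "pub-y", "bids": []},
]
impressions = [{"auction_id": "a1"}]
clicks = [{"auction_id": "a1"}, {"auction_id": "a3"}]

rows = join_feeds(auctions, impressions, clicks)
for row in rows:
    print(row)
```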

IDIOSYNCRATIC WORKLOAD
‣ Hundreds of heterogeneous feeds
‣ Join window of ~ 15-20 min
‣ Each client has slightly different workflow and complexity
‣ Workload changes all the time – because we like our clients
‣ Capacity planning is hard!

WHAT WE LIKE ABOUT SAMZA
‣ great workload isolation – different pipelines in different JVMs
‣ heterogeneous workloads – network / disk / cpu isolation matters
‣ hard to gauge how many nodes we need
‣ we’re on AWS, let’s use many many small nodes!
‣ easy to provision, good isolation of resources
‣ one container per node, hundreds of little nodes chugging along

HOLD ON, SAMZA NEEDS KAFKA
‣ many pipelines, big and small
‣ how to keep load on Kafka even?
‣ partition count is a multiple of the broker count
‣ also keep number of consumers per broker even -> # of Samza containers per topic divides # of partitions
‣ but we also want to scale up and down easily
‣ make sure your partition count has lots of divisors!
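The "lots of divisors" advice can be made concrete with a few lines of arithmetic. The partition counts 256 and 360 are illustrative picks, not numbers from the deck; every divisor of the partition count is a container count that gives each container the same number of partitions.

```python
def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


# Container counts that split a topic's partitions evenly across containers.
for partitions in (256, 360):
    options = divisors(partitions)
    print(partitions, "partitions ->", len(options), "even container counts:", options)
```

A power of two such as 256 only allows scaling by doubling or halving, while 360 offers many intermediate steps (for example 8, 9, 10, 12, 15, 18, 20).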

STREAMING JOINS (diagram; labeled stages: shuffle, cogroup, state, process, process)

MAINTAINING STATE
‣ Co-group state is stored locally and persisted to Kafka
‣ Two Kafka clusters:
  • co-group cluster: disk intensive, uses log compaction
  • messaging cluster: simple retention policy, network/cpu bound
‣ Separate clusters – both cheaper and easier to operate
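The idea of local state backed by a compacted changelog can be sketched as follows. This is a toy model: a Python list stands in for a Kafka topic partition, and `ChangelogStore` and `compact` are hypothetical names, not Samza or Kafka APIs. It shows why log compaction keeps the changelog bounded while still allowing a full restore of the latest state.

```python
class ChangelogStore:
    """Local key-value state that mirrors every write to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stands in for a Kafka topic partition
        self.state = {}

    def put(self, key, value):
        self.state[key] = value
        self.changelog.append((key, value))

    def delete(self, key):
        self.state.pop(key, None)
        self.changelog.append((key, None))  # tombstone record

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state by replaying the changelog from the start."""
        store = cls(changelog)
        for key, value in changelog:
            if value is None:
                store.state.pop(key, None)
            else:
                store.state[key] = value
        return store


def compact(changelog):
    """Keep only the latest record per key, preserving log order."""
    latest = {}
    for offset, (key, _) in enumerate(changelog):
        latest[key] = offset
    return [rec for offset, rec in enumerate(changelog) if latest[rec[0]] == offset]


log = []
store = ChangelogStore(log)
store.put("a1", {"bids": 1})
store.put("a1", {"bids": 2})
store.put("a2", {"bids": 5})
store.delete("a2")

compacted = compact(log)
restored = ChangelogStore.restore(compacted)
print(len(log), "records ->", len(compacted), "after compaction")
print(restored.state)
```

Restore time grows with the size of the compacted log, which is consistent with the "but it takes time" caveat on the next slide.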

CHALLENGES
‣ On failure Samza restores from Kafka (but it takes time)
‣ Intermediate processing topics use keyed messages
  • cannot use the scaling technique we use for inbound data
  • re-balancing partitions would require massive over-provisioning
  • currently solved by new cluster and moving pipelines (loses state)
‣ magnetic storage works well if consumers are not lagging
  • Disk seeks become a problem when consumers start falling behind
  • SSDs are the way to go for high throughput topics w/ multiple consumers

METRICS

COLLECTING METRICS
‣ Monitor latencies at every stage
‣ Identify bottlenecks as they happen
‣ All our services incorporate the same metrics collector
‣ JVM (heap, gc) & System metrics (cpu, network, disk)
‣ Application level metrics
  • Time spent consuming / processing / producing at each stage
  • Request latencies, error rates
  • Amount of data processed
  • Consumer lag in messages + time
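Consumer lag "in messages + time" can be computed from a handful of per-partition numbers, as in this sketch. The parameter names are hypothetical, and how the offsets and timestamps are obtained (broker APIs, message headers, and so on) is left out on purpose.

```python
def partition_lag(log_end_offset, committed_offset, newest_ts_ms, last_consumed_ts_ms):
    """Return (lag in messages, lag in milliseconds) for one partition."""
    messages = max(log_end_offset - committed_offset, 0)
    # A fully caught-up consumer has no time lag, however old its last message is.
    millis = max(newest_ts_ms - last_consumed_ts_ms, 0) if messages else 0
    return messages, millis


def topic_lag(partitions):
    """Total message lag and worst-case time lag across all partitions."""
    lags = [partition_lag(**p) for p in partitions]
    return sum(m for m, _ in lags), max((t for _, t in lags), default=0)


partitions = [
    dict(log_end_offset=1_000, committed_offset=1_000,
         newest_ts_ms=50_000, last_consumed_ts_ms=50_000),
    dict(log_end_offset=2_500, committed_offset=2_100,
         newest_ts_ms=60_000, last_consumed_ts_ms=45_000),
]
total_messages, worst_millis = topic_lag(partitions)
print(total_messages, "messages behind;", worst_millis / 1000, "seconds behind")
```

Tracking both numbers matters because a small message lag on a slow topic can still mean stale data, and a large message lag on a fast topic may clear in seconds.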

CONSUMING METRICS
‣ Put all the metrics into Druid to diagnose in realtime
‣ 15 billion metric data points per day
‣ Interactive exploration allows us to pinpoint problems quickly
‣ Granularity down to the individual query or server level
‣ Gives both the big picture and the detailed breakdown

METRICS PIPELINE (diagram; labeled components: S3, Metrics Emitter, HTTP front)

WHAT ABOUT BATCH?
‣ clients ask to re-process older data
‣ expanded join window for late events
‣ correct for the at-least-once semantics of Kafka / Samza
‣ stuff happens, fix realtime hiccups

WRITE ONCE, RUN LAMBDA
‣ Scala DSL to write data join / transformation / aggregation once
‣ Can be expressed using different drivers
  • as Storm topology, Samza job, or Cascading job
‣ MapReduce no good? Spark is what the cool kids use?
‣ No problem!
‣ Write new driver for Spark, replace the Cascading driver
‣ Hundreds of pipelines moved from Hadoop to Spark in one month!
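The "write once, run with different drivers" idea can be illustrated with a small Python stand-in. The deck does not show the Scala DSL itself, so everything here (`PIPELINE`, `transform`, `BatchDriver`, `StreamingDriver`, the event fields) is a hypothetical sketch of the pattern: the pipeline is declared once as data, and each driver decides how to execute it.

```python
from collections import defaultdict

# The pipeline is declared once, as data: a list of (operation, function) steps.
PIPELINE = [
    ("filter", lambda e: e["type"] == "impression"),
    ("map", lambda e: (e["publisher"], e["price"])),
]


def transform(event, steps):
    """Run one event through the declared steps; None means filtered out."""
    for op, fn in steps:
        if op == "filter":
            if not fn(event):
                return None
        elif op == "map":
            event = fn(event)
        else:
            raise ValueError(f"unknown op {op!r}")
    return event


class BatchDriver:
    """Runs the pipeline over a complete, finite data set."""

    def run(self, steps, events):
        totals = defaultdict(float)
        for event in events:
            out = transform(event, steps)
            if out is not None:
                key, value = out
                totals[key] += value
        return dict(totals)


class StreamingDriver:
    """Runs the same pipeline one event at a time, keeping running totals."""

    def __init__(self, steps):
        self.steps = steps
        self.totals = defaultdict(float)

    def process(self, event):
        out = transform(event, self.steps)
        if out is not None:
            key, value = out
            self.totals[key] += value


events = [
    {"type": "impression", "publisher": "pub-x", "price": 0.5},
    {"type": "click", "publisher": "pub-x", "price": 0.0},
    {"type": "impression", "publisher": "pub-y", "price": 0.25},
    {"type": "impression", "publisher": "pub-x", "price": 0.25},
]

batch_result = BatchDriver().run(PIPELINE, events)
streaming = StreamingDriver(PIPELINE)
for event in events:
    streaming.process(event)

print(batch_result)
print(dict(streaming.totals))
```

Swapping the execution engine then means writing one new driver, not rewriting every pipeline.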

HOW CAN WE USE ALL THIS DATA?

PUT IT IN DRUID!
‣ Streams are indexed and queryable in realtime
‣ Batches replace realtime data at their own pace
‣ Druid makes this all happen seamlessly
‣ Data is immediately available for interactive queries
‣ Interactive slicing and dicing

DRUID AT SCALE
‣ Production cluster runs several hundred nodes
‣ Several hundred terabytes of compressed + pre-aggregated data
‣ Typical event is complex: > 60 dimensions, > 20 metrics
‣ Realtime
  • > 3 million events per second on average
  • > 6 Gigabytes per second
‣ All aggregated on the fly
‣ Hundreds of concurrent requests – close to 1 million queries per day
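The headline numbers in the deck are consistent with each other, as a quick back-of-the-envelope check shows. The calculation assumes decimal units, a 30 day month, and evenly spread traffic; since several figures are stated as lower bounds ("> 3 million", "> 6 Gigabytes"), the derived values are rough estimates only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# 300 billion events per day, expressed per second.
events_per_second = 300e9 / SECONDS_PER_DAY

# Implied average event size if those events amount to 6 GB per second.
bytes_per_event = 6e9 / events_per_second

# 2.5 PB per month inbound, expressed per day (30 day month assumed).
inbound_tb_per_day = 2.5e15 / 30 / 1e12

# Close to 1 million queries per day, expressed per second.
queries_per_second = 1e6 / SECONDS_PER_DAY

print(f"{events_per_second / 1e6:.2f} million events/s")
print(f"{bytes_per_event:.0f} bytes/event")
print(f"{inbound_tb_per_day:.1f} TB/day inbound")
print(f"{queries_per_second:.1f} queries/s on average")
```

So 300 billion events per day works out to roughly 3.5 million events per second, which matches the "> 3 million events per second on average" bullet.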

WHY DRUID MATTERS AT SCALE
In one word: DOWNTIME

DRUID IS ALWAYS ON
‣ Replacing or upgrading nodes is seamless.
‣ Every component is stateless or fails over transparently
‣ Druid can always live upgrade from one version to the next
‣ Our current cluster has been running since 2011

SCALING FOR PERFORMANCE
‣ Want things to be faster? Simply add nodes
‣ Rebalancing data to use additional capacity?
‣ Automatic, no downtime, no service degradation
‣ Druid data is memory mapped
‣ Want more in-memory? Just add RAM
‣ Want to save some $? Just add Disk

SCALING FOR RELIABILITY
‣ Data replication is highly customizable
‣ Tiers of data can serve different latency needs
‣ Tiers can make replicas rack/datacenter-aware
‣ Queries can be prioritized across tiers
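Tiered replication is configured in Druid through coordinator load rules. The sketch below builds such a rule list as plain Python data and prints it as JSON. The rule types `loadByPeriod` and `loadForever` and the `tieredReplicants` map are Druid rule fields; the tier name `hot`, the one month period, and the replica counts are assumptions chosen for illustration, not the deck's configuration.

```python
import json

# Rules are evaluated in order; the first rule that matches a segment applies.
rules = [
    {
        # Most recent month: two replicas on a (hypothetical) fast "hot" tier,
        # plus one replica on the default tier.
        "type": "loadByPeriod",
        "period": "P1M",
        "tieredReplicants": {"hot": 2, "_default_tier": 1},
    },
    {
        # Everything older: two replicas on the default tier only.
        "type": "loadForever",
        "tieredReplicants": {"_default_tier": 2},
    },
]

rules_json = json.dumps(rules, indent=2)
print(rules_json)
```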

DRUID DATA EVOLVES WITH YOU
‣ Data is chunked up in atomic units called segments
‣ Each segment represents a chunk of time (typically an hour or a day)
‣ New segments atomically replace older versions
‣ Batch data seamlessly replaces realtime data
‣ Schemas can evolve over time
‣ Druid handles mixed schemas transparently
‣ Supports schema-less ingestion
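How a newer segment replaces an older one for the same chunk of time can be sketched with a simplified model. It assumes segments cover identical intervals and that versions are timestamp strings that compare lexicographically; partial overlaps and sharded segments, which a real cluster has to handle, are ignored, and the `source` field is only there for illustration.

```python
def visible_segments(segments):
    """For each time chunk keep only the segment with the highest version."""
    best = {}
    for seg in segments:
        current = best.get(seg["interval"])
        if current is None or seg["version"] > current["version"]:
            best[seg["interval"]] = seg
    return [best[interval] for interval in sorted(best)]


segments = [
    # Realtime segments, written as events arrive.
    {"interval": "2016-03-01T00/2016-03-01T01", "version": "2016-03-01T00:00:05Z",
     "source": "realtime"},
    {"interval": "2016-03-01T01/2016-03-01T02", "version": "2016-03-01T01:00:03Z",
     "source": "realtime"},
    # A later batch run re-processes the first hour with a newer version.
    {"interval": "2016-03-01T00/2016-03-01T01", "version": "2016-03-02T04:00:00Z",
     "source": "batch"},
]

served = visible_segments(segments)
for seg in served:
    print(seg["interval"], "->", seg["source"])
```

Queries see the batch result for the first hour and the realtime result for the second, with no step at which the hour is missing.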

WHY IS THIS IMPORTANT?
‣ Need to scale up and (sometimes) down dynamically
‣ Accommodate query load and data growth without service interruption
‣ Rebuilding a cluster from scratch would take several days
‣ Clients can add dimensions / metrics at will

THANK YOU!
