Slide 1

STREAMING ANALYTICS AT SCALE WITH DRUID / KAFKA / SAMZA
XAVIER LÉAUTÉ
HEAD OF BACKEND @ METAMARKETS
DRUID COMMITTER & PMC MEMBER
[email protected]

Slide 2

A DELUGE OF DATA
‣ Ingest auction data from realtime ad exchanges
‣ Process, join, and aggregate data in realtime
‣ Make data available for interactive exploration, slicing, & dicing
‣ 100 billion transactions / day – 2.5 PB / month inbound
  • multiple events per transaction
‣ 300 billion events per day into Druid
  • all aggregated and queryable in real-time

Slide 3

DATA PIPELINE (diagram: HTTP, ELB, Kafka, S3)

Slide 4

RECEIVING DATA
‣ Single Kafka cluster for receiving data
‣ Failure is not an option
‣ Data schema can change at any time
‣ Keep it very simple: ingest, timestamp, batch, compress data upfront (see the producer sketch below)
‣ Distribute data to available partitions
‣ 7 day retention so we can sleep tight
‣ How do we scale it?
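A minimal sketch of what the inbound producer side can look like: unkeyed records, compressed and batched upfront. The broker address, topic name, and payload are illustrative assumptions, not the actual setup; the 7 day retention is a broker/topic setting (log.retention.hours), not a producer concern.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

    object InboundProducer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "inbound-kafka:9092") // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.ByteArraySerializer")
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.ByteArraySerializer")
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip") // compress upfront
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "262144")     // batch before sending
        props.put(ProducerConfig.LINGER_MS_CONFIG, "100")

        val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)
        // Timestamp and wrap the raw payload ourselves; no key, so the default
        // partitioner spreads records across the available partitions.
        val payload = s"""{"receivedAt":${System.currentTimeMillis},"data":"..."}""".getBytes("UTF-8")
        producer.send(new ProducerRecord[Array[Byte], Array[Byte]]("inbound-events", payload))
        producer.close()
      }
    }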

Slide 5

SCALING KAFKA
‣ Inbound messages are not keyed
‣ Add nodes, and increase partition count
‣ Problem: Kafka would rebalance all the data, saturating the network
‣ Solution 1: throttle replication / rebalancing
‣ Solution 2: create groups of nodes, with pre-assigned partitions (see the sketch below)
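The second solution can be sketched as a small assignment function: a newly added broker group only ever receives newly created partitions, so existing partitions (and their data) never move. Broker ids, counts, and the replication factor are assumptions for illustration; the resulting plan is the kind of thing you would feed to Kafka's partition reassignment tooling. (REPL-style sketch.)

    // Assign only the *new* partitions to the new broker group, round-robin within it.
    case class BrokerGroup(brokers: Vector[Int], replicationFactor: Int = 2)

    def assignNewPartitions(existingPartitions: Int,
                            newPartitions: Int,
                            group: BrokerGroup): Map[Int, Seq[Int]] =
      (existingPartitions until existingPartitions + newPartitions).map { partition =>
        val leader = partition % group.brokers.size
        val replicas = (0 until group.replicationFactor)
          .map(r => group.brokers((leader + r) % group.brokers.size))
        partition -> replicas
      }.toMap

    // e.g. grow a topic from 120 to 180 partitions: partitions 120-179 live only on brokers 10-14
    val plan = assignNewPartitions(existingPartitions = 120, newPartitions = 60,
                                   group = BrokerGroup(Vector(10, 11, 12, 13, 14)))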

Slide 6

TYPICAL ROUND ROBIN ASSIGNMENT

Slide 7

SCALABLE ASSIGNMENT

Slide 8

DEALING WITH FAILURE
‣ Node / Disk failures happen all the time
‣ Kafka retention is based on time
‣ Problem: Kafka does not understand time (see the sketch below)
‣ What happens on node failure?
‣ Replace a node – replicate all data – data is now timestamped today
‣ Replicated data won’t get pruned for another week -> requires 2x disk capacity
  (otherwise we need to go clean up segments by hand, not fun!)
‣ Looking forward to Kafka 0.10.1 (KIP-33) to fix this
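The issue in a nutshell, as a sketch: before KIP-33, time-based retention is driven by the segment file's modification time on disk rather than by event timestamps, so a freshly re-replicated segment looks brand new and survives another full retention period.

    import java.io.File
    import java.util.concurrent.TimeUnit

    // Roughly how pre-KIP-33 time-based retention decides what to delete:
    // only the file's mtime matters, and it resets when a replica is rebuilt.
    def eligibleForDeletion(segment: File, retentionMs: Long, now: Long): Boolean =
      now - segment.lastModified() > retentionMs

    val retentionMs = TimeUnit.DAYS.toMillis(7) // our 7 day retention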

Slide 9

WE HAVE ALL THIS DATA
WHAT NOW?

Slide 10

TYPICAL DATA PIPELINE
‣ auction feed ~ auction data + bids
‣ impression feed ~ which auction ids got shown
‣ click feed ~ which auction ids resulted in a click
‣ Join feeds based on auction id (see the Samza sketch below)
‣ Maybe some lookups
‣ Business logic to derive dozens of metrics and dimensions
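A minimal sketch of such a join using Samza's low-level StreamTask API and a local key-value store. Stream names, the store name, and the string payloads are assumptions for illustration; the real pipelines also apply lookups and per-client business logic.

    import org.apache.samza.config.Config
    import org.apache.samza.storage.kv.KeyValueStore
    import org.apache.samza.system.{IncomingMessageEnvelope, OutgoingMessageEnvelope, SystemStream}
    import org.apache.samza.task.{InitableTask, MessageCollector, StreamTask, TaskContext, TaskCoordinator}

    // Joins impressions and clicks to the auction they belong to, keyed by auction id.
    class AuctionJoinTask extends StreamTask with InitableTask {
      private var auctions: KeyValueStore[String, String] = _

      override def init(config: Config, context: TaskContext): Unit =
        auctions = context.getStore("auction-store").asInstanceOf[KeyValueStore[String, String]]

      override def process(envelope: IncomingMessageEnvelope,
                           collector: MessageCollector,
                           coordinator: TaskCoordinator): Unit = {
        val stream    = envelope.getSystemStreamPartition.getStream
        val auctionId = envelope.getKey.asInstanceOf[String]
        val payload   = envelope.getMessage.asInstanceOf[String]

        stream match {
          case "auctions" =>
            auctions.put(auctionId, payload) // remember the auction for the join window
          case "impressions" | "clicks" =>
            Option(auctions.get(auctionId)).foreach { auction =>
              val joined = s"""{"auction":$auction,"$stream":$payload}"""
              collector.send(new OutgoingMessageEnvelope(
                new SystemStream("kafka", "joined-events"), auctionId, joined))
            }
          case _ => () // ignore unknown feeds
        }
      }
    }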

Slide 11

IDIOSYNCRATIC WORKLOAD
‣ Hundreds of heterogeneous feeds
‣ Join window of ~ 15-20 min
‣ Each client has a slightly different workflow and complexity
‣ Workload changes all the time – because we like our clients
‣ Capacity planning is hard!

Slide 12

WHAT WE LIKE ABOUT SAMZA
‣ great workload isolation – different pipelines in different JVMs
‣ heterogeneous workloads – network / disk / cpu isolation matters
‣ hard to gauge how many nodes we need
‣ we’re on AWS, let’s use many many small nodes!
‣ easy to provision, good isolation of resources
‣ one container per node, hundreds of little nodes chugging along

Slide 13

HOLD ON, SAMZA NEEDS KAFKA
‣ Many pipelines, big and small
‣ Keeping load on Kafka even is critical
‣ Number of partitions must be a multiple of broker count
‣ Keep the number of consumers per broker even -> # of consumers per topic must divide # of partitions
‣ but we also want to scale up and down easily
‣ Partition count must have lots of divisors -> use highly composite numbers (see the sketch below)
  e.g. 120 is divisible by 1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 24, 30, 40, 60, 120
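A quick sanity check of candidate partition counts (sketch): highly composite numbers like 120 give many more valid consumer counts than, say, a power of two.

    // All the consumer counts that divide the partition count evenly.
    def divisors(n: Int): Seq[Int] = (1 to n).filter(n % _ == 0)

    divisors(120) // 1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 24, 30, 40, 60, 120
    divisors(128) // 1, 2, 4, 8, 16, 32, 64, 128 -- far fewer ways to scale consumers evenly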

Slide 14

STREAMING JOINS (diagram: shuffle, cogroup state, process stages)

Slide 15

MAINTAINING STATE
‣ Co-group state is stored locally and persisted to Kafka
‣ Two Kafka clusters:
  • co-group cluster: disk intensive, uses log compaction
  • pubsub cluster: simple retention policy, network/cpu bound
‣ Separate clusters – both cheaper and easier to operate

Slide 16

CHALLENGES
‣ On failure Samza restores state from Kafka (but it takes time)
‣ Intermediate processing topics use keyed messages
  • cannot use the scaling technique we use for inbound data
  • re-balancing partitions would require massive over-provisioning
  • currently solved by standing up a new cluster and moving pipelines (loses state)
‣ Magnetic storage works well if consumers are not lagging
  • Disk seeks become a problem when consumers start falling behind
  • SSDs are the way to go for high-throughput topics w/ multiple consumers
  • EBS works well in production

Slide 17

WHAT ABOUT BATCH?
‣ clients ask to re-process older data
‣ expanded join window for late events
‣ correct for the at-least-once semantics of Kafka / Samza
‣ Stuff happens, fix realtime hiccups
‣ All data is kept in S3 – data in Spark is transient (see the sketch below)
‣ Clusters are disposable, run on AWS spot instances
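A minimal sketch of the batch path, assuming tab-separated raw feeds laid out by day in S3 (bucket name, layout, and field positions are made up for illustration): the same auction-id join, replayed on a disposable Spark cluster.

    import org.apache.spark.{SparkConf, SparkContext}

    object ReplayJoin {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("replay-join"))

        // (auctionId, rawRecord) pairs; assumes tab-separated lines with the auction id first
        def keyed(path: String) =
          sc.textFile(path)
            .map(_.split("\t", 2))
            .collect { case Array(id, rest) => (id, rest) }

        val auctions    = keyed("s3n://example-bucket/auctions/2015-10-01/")
        val impressions = keyed("s3n://example-bucket/impressions/2015-10-01/")

        auctions.join(impressions) // same auction-id join as the streaming path
          .map { case (id, (auction, impression)) => s"$id\t$auction\t$impression" }
          .saveAsTextFile("s3n://example-bucket/joined/2015-10-01/")

        sc.stop()
      }
    }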

Slide 18

WRITE ONCE, RUN LAMBDA
‣ Scala DSL to write data join / transformation / aggregation once
‣ Can be expressed using different drivers
‣ as a Storm topology, Samza job, or Cascading job (see the sketch below)
‣ MapReduce no good? Spark is what the cool kids use?
‣ No problem!
‣ Write a new driver for Spark, replace the Cascading driver
‣ Hundreds of pipelines moved from Hadoop to Spark in one month!
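The DSL itself is internal, but the idea can be sketched roughly as follows (all names here are hypothetical, not the actual API): pipelines are written against a small abstract algebra, and each driver interprets that description as a Samza job, a Cascading flow, or a Spark job.

    // Hypothetical shape of the write-once abstraction; not the real DSL.
    trait Pipe[A] {
      def map[B](f: A => B): Pipe[B]
      def filter(p: A => Boolean): Pipe[A]
      def joinOn[B, K](other: Pipe[B])(key: A => K, otherKey: B => K): Pipe[(A, B)]
    }

    trait Driver {
      // Each backend (Storm, Samza, Cascading, Spark, ...) provides its own interpretation.
      def run[A](pipeline: Pipe[A]): Unit
    }

Under this shape, swapping Hadoop for Spark amounts to writing one new Driver implementation while the pipeline definitions stay untouched.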

Slide 19

METRICS

Slide 20

COLLECTING METRICS
‣ Monitor latencies at every stage
‣ Identify bottlenecks as they happen
‣ All our services incorporate the same metrics collector
‣ JVM (heap, gc) & system metrics (cpu, network, disk)
‣ Application-level metrics
  • Time spent consuming / processing / producing at each stage
  • Request latencies, error rates
  • Amount of data processed
  • Consumer lag in messages + time (see the sketch below)
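Lag can be reported both ways per partition, roughly as below (a sketch; offsets are assumed to come from an offset checker and event timestamps from the messages themselves).

    case class PartitionLag(messages: Long, millisBehind: Long)

    // Lag in messages: how far the committed offset trails the log end offset.
    // Lag in time: how old the last processed event is versus the newest one seen.
    def lag(logEndOffset: Long, committedOffset: Long,
            newestEventTs: Long, lastProcessedEventTs: Long): PartitionLag =
      PartitionLag(logEndOffset - committedOffset, newestEventTs - lastProcessedEventTs)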

Slide 21

CONSUMING METRICS
‣ Put all the metrics into Druid to diagnose in realtime (example query below)
‣ 15 billion metric data points per day
‣ Interactive exploration allows us to pinpoint problems quickly
‣ Granularity down to the individual query or server level
‣ Separate monitoring from alerting
‣ Make sure you have a separate system for critical alerts
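For instance, drilling into per-service throughput is a single Druid timeseries query posted to a broker. The datasource, dimension and metric names, and the broker URL below are assumptions for illustration.

    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    object QueryMetrics {
      def main(args: Array[String]): Unit = {
        val query =
          """{
            |  "queryType": "timeseries",
            |  "dataSource": "pipeline-metrics",
            |  "granularity": "minute",
            |  "intervals": ["2015-10-01T00:00:00Z/2015-10-01T01:00:00Z"],
            |  "filter": {"type": "selector", "dimension": "service", "value": "auction-join"},
            |  "aggregations": [{"type": "longSum", "name": "events", "fieldName": "count"}]
            |}""".stripMargin

        val conn = new URL("http://broker.example.com:8082/druid/v2/")
          .openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setRequestProperty("Content-Type", "application/json")
        conn.setDoOutput(true)
        conn.getOutputStream.write(query.getBytes("UTF-8"))
        println(Source.fromInputStream(conn.getInputStream).mkString)
      }
    }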

Slide 22

METRICS PIPELINE (diagram: Metrics Emitter, HTTP front, S3)

Slide 24

HOW CAN WE USE ALL THIS DATA?

Slide 26

DRUID
‣ Fast Distributed Column-Oriented Data Store
‣ Built For Interactive Analytics
‣ Exactly-once streaming ingestion
‣ Interactive slicing and dicing

Slide 27

DRUID ARCHITECTURE (diagram: Broker Nodes, Historical Nodes, Real-time Nodes; Streaming and Batch ingestion, Handover, Queries)

Slide 28

INGESTING DATA
‣ Streams are indexed and queryable in realtime
‣ Batches replace realtime data at their own pace
‣ Druid makes this all happen seamlessly
‣ Data is immediately available for queries
‣ Leverages existing Hadoop / Spark resources for batch ingestion

Slide 29

DRUID DATA EVOLVES WITH YOU
‣ Data is chunked up in atomic units called segments
‣ Each segment represents a chunk of time (typically an hour or a day)
‣ New segments atomically replace older versions
‣ Batch data seamlessly replaces realtime data
‣ Schemas can evolve over time
‣ Druid handles mixed schemas transparently
‣ Supports schema-less ingestion

Slide 30

DRUID AT SCALE
‣ Production cluster runs several hundred nodes
‣ Several hundred terabytes of compressed + pre-aggregated data
‣ Typical event is complex: > 60 dimensions, > 20 metrics
‣ Realtime
  • > 3 million events per second on average
  • > 6 Gigabytes per second
‣ All aggregated on the fly
‣ Hundreds of concurrent requests – close to 1 million queries per day

Slide 31

WHY DRUID MATTERS AT SCALE
In one word: DOWNTIME

Slide 32

DRUID IS ALWAYS ON
‣ Replacing or upgrading nodes is seamless
‣ Every component is stateless or fails over transparently
‣ Druid can always be live-upgraded from one version to the next
‣ Our current cluster has been online since 2011

Slide 33

SCALING FOR PERFORMANCE
‣ Want things to be faster? Simply add nodes
‣ Rebalancing data to use additional capacity?
‣ Automatic, no downtime, no service degradation
‣ Druid data is memory mapped and decompressed on the fly
‣ Want more in-memory? Just add RAM
‣ Want to save some $? Just add disk
‣ RAM is expensive, SSDs and CPU are cheap

Slide 34

SCALING FOR RELIABILITY / PREDICTABILITY
‣ Data distribution is highly customizable
‣ Tiers of data can serve different latency needs
‣ Tiers can make replicas rack/datacenter-aware
‣ Queries can be prioritized across tiers
‣ Replicas can be put on cheaper hardware
‣ Many levels of caching
‣ Make sure your ZooKeeper is rock solid (we use 5 nodes)

Slide 35

WHY IS THIS IMPORTANT?
‣ Need to scale up and (sometimes) down dynamically
‣ Multi-tenant system needs to serve different latency guarantees
‣ Accommodate query load and data growth without interruption
‣ Rebuilding a cluster from scratch would take several days
‣ Data needs to evolve over time

Slide 36

THANK YOU!