Bad/Bed time Stories with Kafka

Today's Menu
• Quick Kafka Overview
• Kafka Usage At AppsFlyer
• AppsFlyer First Cluster
• Designing The Next Cluster: Requirements And Changes
• Problems With The New Cluster
• Changes To The New Cluster
• Traffic Boost, Then More Issues
• More Solutions
• And More Failures
• Splitting The Cluster, More Changes And The Current Configuration
• Lessons Learned
• Testing The Cluster
• Collecting Metrics And Alerting

"A first sign of the beginning of understanding is the wish to die." Franz Kafka

Quick Kafka Overview

“An open source, distributed, partitioned and replicated commit-log based publish-subscribe messaging system”

Kafka Overview
• Topic: a category to which message producers publish messages
• Broker: a Kafka server process (usually one per node)
• Partitions: each topic is split into partitions. A partition is an ordered, immutable sequence of messages, and each message in it is assigned a unique sequential ID called the offset
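The partition and offset model above can be illustrated with a small in-memory sketch. This is not how Kafka is implemented; the class names `Partition` and `Topic` and the key-hash routing are assumptions made purely for illustration.

```python
import zlib


class Partition:
    """An append-only, ordered sequence of messages (illustrative only)."""

    def __init__(self):
        self._log = []

    def append(self, message):
        # The offset is the message's position in the log; it never changes.
        offset = len(self._log)
        self._log.append(message)
        return offset

    def read(self, offset):
        return self._log[offset]


class Topic:
    """A named category split into a fixed number of partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

    def publish(self, key, message):
        # Assumption: route by a stable hash of the key, so that messages
        # with the same key always land in the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        offset = self.partitions[index].append(message)
        return index, offset
```

Publishing two messages with the same key returns the same partition index and consecutive offsets, which is what gives consumers a per-partition ordering guarantee.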

Kafka Usage in AppsFlyer

AppsFlyer First Cluster
• Traffic: up to a few hundred million messages
• Size: 4 m1.xlarge brokers
• ~8 topics
• Replication factor: 1
• Retention: 8-12 hours
• Default number of partitions: 8
• Vanilla configuration

Main reasons for migration: lack of storage capacity, limited parallelism due to the low partition count, and the forecast for future needs.

Requirements for the Next Cluster
• More capacity, to support billions of messages
• Message replication to prevent data loss
• Tolerate the loss of brokers, up to an entire AZ
• Much higher parallelism to support more consumers
• Longer retention period: 48 hours on most topics

The New Cluster Changes
• 18 m1.xlarge brokers, 6 per AZ
• Replication factor of 3
• All partitions are distributed across the AZs
• Number of partitions per topic increased (between 12 and 120, depending on parallelism needs)
• 4 network and IO threads
• Default log retention: 48 hours
• Auto leader rebalance enabled
• Imbalance ratio set to default 15%

Glossary
* Leader: each partition has a leader broker that serves its writes and reads; the other replicas replicate from it
* Imbalance ratio: the highest percentage of leadership a broker can hold; above that, an automatic rebalance is initiated
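To make the layout above concrete, here is a minimal sketch of an AZ-aware replica assignment: one replica of every partition in each AZ, with the first replica (Kafka's preferred leader) rotating across AZs and brokers. The function names, AZ names and broker ids are hypothetical, and this is not the assignment algorithm Kafka itself uses.

```python
from collections import Counter


def assign_replicas(brokers_by_az, num_partitions):
    """Place one replica of each partition in every AZ.

    The replication factor therefore equals the number of AZs.
    """
    azs = sorted(brokers_by_az)
    assignment = {}
    for partition in range(num_partitions):
        # Pick the same "slot" in every AZ, advancing once per round of AZs.
        slot = partition // len(azs)
        replicas = [
            brokers_by_az[az][slot % len(brokers_by_az[az])] for az in azs
        ]
        # Rotate so the first replica (the preferred leader) alternates AZs.
        shift = partition % len(azs)
        assignment[partition] = replicas[shift:] + replicas[:shift]
    return assignment


def leaders_per_broker(assignment):
    """Count how many partitions each broker is the preferred leader of."""
    return Counter(replicas[0] for replicas in assignment.values())


# Layout from the slide: 18 brokers, 6 per AZ (names and ids are made up).
brokers_by_az = {
    "az-a": list(range(0, 6)),
    "az-b": list(range(6, 12)),
    "az-c": list(range(12, 18)),
}
assignment = assign_replicas(brokers_by_az, 36)
```

With 36 partitions every broker ends up as preferred leader of exactly two of them, and losing a whole AZ still leaves two replicas of every partition.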

And After a few Months

Problems
• Uneven distribution of leaders, which caused high load on specific brokers and eventually consumer lag and broker failures
• Constant rebalancing of broker leadership, which caused failures in the Python producers
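One simple way to spot the uneven leader distribution described above is to compare each broker's share of partition leaders with the fair share. The sketch below assumes you already have a mapping from partition to current leader (for example, parsed from the output of the topic describe tool); the function names and the `tolerance` parameter are hypothetical.

```python
from collections import Counter


def leader_share(leaders_by_partition, broker_ids):
    """Return the fraction of partitions led by each broker."""
    counts = Counter(leaders_by_partition.values())
    total = len(leaders_by_partition)
    return {broker: counts.get(broker, 0) / total for broker in broker_ids}


def overloaded_brokers(leaders_by_partition, broker_ids, tolerance=0.5):
    """List brokers whose share exceeds the fair share by more than `tolerance`.

    With 3 brokers the fair share is 1/3; a tolerance of 0.5 flags any
    broker leading more than half of the partitions.
    """
    fair = 1 / len(broker_ids)
    shares = leader_share(leaders_by_partition, broker_ids)
    return sorted(b for b, s in shares.items() if s > fair * (1 + tolerance))
```

For example, if broker 1 leads four of six partitions in a three-broker cluster, `overloaded_brokers` returns `[1]`.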

Solutions
• Increase the number of brokers to 24 to improve the distribution of broker leadership
• Rewrite the Python producers in Clojure
• Decrease the number of partitions where high parallelism is not needed

Traffic increasing and...

Problems
• High iowait on the brokers
• Replicas missing from the ISR because of overloaded leaders
• Network bandwidth close to thresholds
• Lag in consumers

Glossary
* ISR: In-Sync Replicas

More Solutions
• Split into 2 clusters: launches, which carries 80% of the messages, and a second cluster for all the rest
• Move the launches cluster to i2.2xlarge instances with local SSD
• Finer tuning of leaders
• Increase the number of IO and network threads
• Enable AWS enhanced networking

And a few more...
• Decrease the replication factor to 2 in the launches cluster to reduce load on the leaders, disk usage and inter-AZ traffic costs
• Move the 2nd cluster to i2.2xlarge as well
• Upgrade ZK due to performance issues
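The effect of lowering the replication factor can be estimated with simple arithmetic: every message is written once per replica, and the followers fetch one copy each from the leader. The sketch below is a rough model that ignores compression and consumer traffic; the function name and the 100 MB/s ingest figure are hypothetical.

```python
def replication_cost(ingest_mb_per_s, replication_factor):
    """Rough per-cluster cost of replication for a given ingest rate."""
    return {
        # Every replica persists its own copy of the data.
        "disk_write_mb_per_s": ingest_mb_per_s * replication_factor,
        # Followers pull one copy each from the partition leaders.
        "leader_to_follower_mb_per_s": ingest_mb_per_s * (replication_factor - 1),
    }


# Hypothetical ingest rate, used only to compare the two settings.
with_rf3 = replication_cost(100, 3)
with_rf2 = replication_cost(100, 2)
```

Under this model, going from a replication factor of 3 to 2 cuts disk writes by a third and halves the replication traffic served by the leaders. When the replicas sit in different AZs, that replication traffic is also inter-AZ traffic.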

Lessons Learned
• Keep the replication factor as low as possible to avoid extra load on the leaders
• Make sure the leader count is well balanced between brokers
• Balance the number of partitions against parallelism needs
• Split the cluster logically, considering traffic and business importance
• Time-based retention should be long enough to recover from failures
• In AWS, spread the cluster across AZs
• Make sure clients support dynamic cluster changes
• Create automation for partition reassignment
• Save the cluster-reassignment.json of each topic for future needs!
• Don't be too cheap on the ZooKeepers
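As a starting point for the reassignment automation mentioned above, the sketch below builds the JSON document that Kafka's partition reassignment tool reads (`version` plus a list of `topic`, `partition`, `replicas` entries). The function names and the example broker ids are hypothetical; how the target assignment is chosen is left out.

```python
import json


def build_reassignment(topic, assignment):
    """Build a reassignment plan from {partition: [broker ids]}."""
    return {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": list(replicas)}
            for partition, replicas in sorted(assignment.items())
        ],
    }


def render_reassignment(plan):
    """Serialize the plan; write the result to a file to keep it for later."""
    return json.dumps(plan, indent=2, sort_keys=True)


# Hypothetical target assignment for two partitions of one topic.
plan = build_reassignment("launches", {1: [2, 3], 0: [1, 2]})
text = render_reassignment(plan)
```

Keeping the rendered file per topic under version control gives you a record of the assignment in effect before each change, which is useful if a reassignment has to be rolled back.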

Testing the Cluster
• Load test using kafka-producer-perf-test.sh and kafka-consumer-perf-test.sh
• Broker failure while running
• Entire AZ failure while running
• Reassign partitions on the fly
• Kafka dashboard contains: leader election rate, ISR status, offline partitions count, log flush time, all-topics bytes in per broker, iowait, load average, disk capacity and more
• Set appropriate alerts

Collecting Metrics & Alerting
• Using the Airbnb plugin for Kafka, sending metrics to Graphite
• An internal application that collects the lag for each topic and sends the values to Graphite
• Alerts are set on: lag (for each topic), under-replicated partitions, broker topic metrics below threshold, leader re-election
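The internal lag collector is not shown in the deck. A minimal sketch of the idea, assuming the log-end and committed offsets per partition have already been fetched, could look like the following. It uses Graphite's plaintext protocol (`path value timestamp`, one metric per line, by default on port 2003); the function names and the metric path are hypothetical.

```python
import socket
import time


def topic_lag(end_offsets, committed_offsets):
    """Sum the per-partition lag (log-end offset minus committed offset).

    Assumption: a partition with no committed offset counts from offset 0.
    """
    return sum(
        max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )


def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"


def send_to_graphite(lines, host, port=2003):
    """Send already formatted metric lines to a Graphite (carbon) listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))
```

A collector would call `topic_lag` per topic and consumer group on a schedule, then pass the formatted lines to `send_to_graphite`; alert thresholds can then be defined on the resulting series.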

AppsFlyer
