
Bad/Bed time Stories with Kafka

September 10, 2015

  1. Today's Menu
     • Quick Kafka Overview
     • Kafka Usage At AppsFlyer
     • AppsFlyer First Cluster
     • Designing The Next Cluster: Requirements And Changes
     • Problems With The New Cluster
     • Changes To The New Cluster
     • Traffic Boost, Then More Issues
     • More Solutions
     • And More Failures
     • Splitting The Cluster, More Changes And The Current Configuration
     • Lessons Learned
     • Testing The Cluster
     • Collecting Metrics And Alerting
  2. Kafka Overview
     • Topic: the category to which messages are published by the message producers
     • Broker: a Kafka server process (usually one per node)
     • Partitions: topics are partitioned; each partition is an ordered, immutable sequence of messages, and each message in a partition is assigned a unique ID called its offset (see the sketch after this slide)
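
To make these terms concrete, here is a minimal sketch using the Kafka Java producer client. The broker addresses, topic name and payload are illustrative, not AppsFlyer's; the point is that a send targets a topic, lands in one of its partitions, and gets back an offset within that partition.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class ProducerSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Illustrative broker list; any broker can serve cluster metadata.
            props.put("bootstrap.servers", "broker1:9092,broker2:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // "installs" is a hypothetical topic; the key influences which partition is chosen.
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("installs", "device-42", "{\"event\":\"install\"}");
                RecordMetadata meta = producer.send(record).get();
                // Every message gets a unique, ever-increasing offset within its partition.
                System.out.printf("topic=%s partition=%d offset=%d%n",
                        meta.topic(), meta.partition(), meta.offset());
            }
        }
    }

On the consumer side, the offset is what a consumer commits to record its position in each partition, which is also what the lag metrics later in the deck are built on.
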
  3. AppsFlyer First Cluster
     • Traffic: up to a few hundred million messages
     • Size: 4 m1.xlarge brokers
     • ~8 topics
     • Replication factor 1
     • Retention 8-12H
     • Default number of partitions: 8
     • Vanilla configuration
     Main reasons for migration: lack of storage capacity, limited parallelism due to the low partition count, and the forecast for future needs.
  4. Requirements for the Next Cluster
     • More capacity, to support billions of messages
     • Message replication, to prevent data loss
     • Survive the loss of brokers, up to an entire AZ
     • Much higher parallelism, to support more consumers
     • Longer retention period: 48 hours on most topics
  5. The New Cluster Changes
     • 18 m1.xlarge brokers, 6 per AZ
     • Replication factor of 3
     • All partitions distributed across the AZs
     • Topics' number of partitions increased (12 to 120, depending on parallelism needs)
     • 4 network and IO threads
     • Default log retention of 48 hours
     • Auto leader rebalance enabled
     • Imbalance ratio set to the default of 15% (see the config sketch below)
     Glossary
     * Leader: each partition has a leader broker that serves reads and writes; the other replicas replicate from it
     * Imbalance ratio: the highest percentage of leadership a broker can hold; above it, an automatic rebalance is initiated
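
These broker-level settings map onto standard Kafka server configuration keys. A minimal sketch, expressed here as the java.util.Properties the broker config is parsed from and written out to a server.properties file purely for illustration; the values mirror the slide, while broker.id and the output path are made up rather than AppsFlyer's actual configuration.

    import java.io.FileWriter;
    import java.util.Properties;

    public class BrokerConfigSketch {
        public static void main(String[] args) throws Exception {
            Properties broker = new Properties();
            broker.setProperty("broker.id", "1");                               // unique per broker
            broker.setProperty("default.replication.factor", "3");              // replicate each partition to 3 brokers
            broker.setProperty("num.partitions", "12");                         // default partition count for new topics
            broker.setProperty("num.network.threads", "4");                     // network threads per broker
            broker.setProperty("num.io.threads", "4");                          // disk IO threads per broker
            broker.setProperty("log.retention.hours", "48");                    // keep data for 48 hours
            broker.setProperty("auto.leader.rebalance.enable", "true");         // move leadership back to preferred replicas
            broker.setProperty("leader.imbalance.per.broker.percentage", "15"); // rebalance above 15% imbalance
            try (FileWriter out = new FileWriter("server.properties")) {
                broker.store(out, "sketch of the broker settings from the slide");
            }
        }
    }

Auto leader rebalance moves leadership back to the preferred replica once the imbalance threshold is crossed, which is exactly the churn described on the next slide.
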
  6. Problems
     • Uneven distribution of leaders, which caused high load on specific brokers and eventually consumer lag and broker failures
     • Constant rebalancing of partition leaders, which caused failures in the Python producers
  7. Solutions
     • Increase the number of brokers to 24 to improve broker leadership distribution
     • Rewrite the Python producers in Clojure (see the producer-settings sketch below)
     • Decrease the number of partitions where high parallelism is not needed
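
The constant leader re-elections from the previous slide are survivable if producers retry and refresh metadata instead of failing outright. A hedged sketch of the relevant Java producer client settings; the values are illustrative, and the actual AppsFlyer fix was the Clojure rewrite mentioned above rather than this exact configuration.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class ResilientProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092"); // illustrative
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            // Retry sends that fail with transient errors such as "not leader for partition",
            // which is what a producer sees right after a leader re-election.
            props.put(ProducerConfig.RETRIES_CONFIG, "5");
            props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "500");
            // Refresh cluster metadata periodically so new leaders are picked up quickly.
            props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, "60000");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // ... produce as usual; failed sends are retried against the new leader.
            }
        }
    }

The same idea applies to consumers, and it resurfaces as the "make sure clients support dynamic cluster changes" lesson later in the deck.
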
  8. Problems
     • High iowait on the brokers
     • Missing ISRs due to overloaded leaders
     • Network bandwidth close to its thresholds
     • Lag in consumers
     Glossary
     * ISR: In-Sync Replicas
  9. More Solutions
     • Split into 2 clusters: launches, which carries 80% of the messages, and everything else
     • Move the launches cluster to i2.2xlarge instances with local SSDs
     • Finer tuning of leaders
     • Increase the number of IO and network threads
     • Enable AWS enhanced networking
  10. And a few more...
     • Decrease the replication factor to 2 in the launches cluster to reduce load on the leaders, the required disk capacity and cross-AZ traffic costs
     • Move the 2nd cluster to i2.2xlarge as well
     • Upgrade ZooKeeper due to performance issues
  11. Lessons Learned
     • Keep the replication factor as low as possible to avoid extra load on the leaders
     • Make sure the leader count is well balanced between brokers
     • Balance the partition number to support parallelism
     • Split clusters logically, considering traffic and business importance
     • Retention (time based) should be long enough to recover from failures
     • In AWS, spread the cluster across AZs
     • Make sure clients support dynamic cluster changes
     • Create automation for partition reassignment
     • Save the cluster-reassignment.json of each topic for future needs (example below)!
     • Don't be too cheap on the ZooKeepers
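
The reassignment file mentioned above is the JSON handed to Kafka's kafka-reassign-partitions tool. A sketch of what one looks like for a hypothetical launches topic with two partitions; the topic name, partition numbers and broker IDs are illustrative:

    {
      "version": 1,
      "partitions": [
        { "topic": "launches", "partition": 0, "replicas": [1, 7] },
        { "topic": "launches", "partition": 1, "replicas": [2, 8] }
      ]
    }

Keeping the generated file for each topic makes it possible to re-run, verify or roll back a reassignment later, which is why the slide suggests saving it.
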
  12. Testing the Cluster
     • Load test using kafka-producer-perf-test.sh & kafka-consumer-perf-test.sh
     • Broker failure while running
     • Entire AZ failure while running
     • Reassign partitions on the fly
     • Kafka dashboard contains: leader election rate, ISR status, offline partitions count, log flush time, all-topics bytes-in per broker, IOWait, load average, disk capacity and more
     • Set appropriate alerts
  13. Collecting Metrics & Alerting
     • Using the Airbnb plugin for Kafka, sending metrics to Graphite
     • An internal application collects the lag for each topic and sends the values to Graphite (see the sketch below)
     • Alerts are set on lag (per topic), under-replicated partitions, broker topic metrics below threshold, and leader re-election
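
Consumer lag per partition is simply the log end offset minus the group's last committed offset. A hedged sketch of that calculation using today's Java consumer API for brevity; the 2015-era tooling (offset checker scripts, ZooKeeper-stored offsets) looked different, and the group, topic and broker names here are made up.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.PartitionInfo;
    import org.apache.kafka.common.TopicPartition;

    public class LagSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");          // illustrative
            props.put("group.id", "attribution-consumers");          // hypothetical group to inspect
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                String topic = "launches";                            // hypothetical topic
                List<TopicPartition> partitions = new ArrayList<>();
                for (PartitionInfo info : consumer.partitionsFor(topic)) {
                    partitions.add(new TopicPartition(topic, info.partition()));
                }
                // Log end offsets = where each partition currently ends.
                Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
                long totalLag = 0;
                for (TopicPartition tp : partitions) {
                    OffsetAndMetadata committed = consumer.committed(tp);
                    long consumed = committed == null ? 0 : committed.offset();
                    totalLag += endOffsets.get(tp) - consumed;        // lag for this partition
                }
                // In the setup described above this value would be pushed to Graphite instead of printed.
                System.out.println("lag(" + topic + ") = " + totalLag);
            }
        }
    }

Alerting on this per-topic lag value is what the last bullet of the slide describes.
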