Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Bad/Bed time Stories with Kafka

22d9eb22713520cfb9df28f6b1ce7f83?s=47 AppsFlyer
September 10, 2015
1.5k

Bad/Bed time Stories with Kafka

22d9eb22713520cfb9df28f6b1ce7f83?s=128

AppsFlyer

September 10, 2015
Tweet

Transcript

  1. Bad/Bed time Stories

  2. Today's Menu • Quick Kafka Overview • Kafka Usage At

    AppsFlyer • AppsFlyer First Cluster • Designing The Next Cluster: Requirements And Changes • Problems With The New Cluster • Changes To The New Cluster • Traffic Boost, Then More Issues • More Solutions • And More Failures • Splitting The Cluster, More Changes And The Current Configuration • Lessons Learned • Testing The Cluster • Collecting Metrics And Alerting
  3. "A first sign of the beginning of understanding is the

    wish to die." Franz Kafka
  4. Quick Kafka

  5. “An open source, distributed, partitioned and replicated commit-log based publish-

    subscribe messaging system” Kafka Overview
  6. Kafka Overview • Topic: Category which messages are published by

    the message producers • Broker: Kafka server process (usually one per node) • Partitions: Topics are partitioned, each partition is represented by the ordered immutable sequence of messages. Each message in the partition is assigned a unique ID called offset
  7. Kafka Usage in AppsFlyer

  8. AppsFlyer First cluster • Traffic: Up to few hundreds millions

    • Size: 4 M1.xlarge brokers • ~8 Topics • Replication factor 1 • Retention 8-12H • Default number of partitions 8 • Vanilla configuration Main reason for migration: Lack of storage capacity, limited parallelism due to low partition count and forecast for future needs.
  9. Requirements for the Next Cluster • More capacity to support

    Billions of messages • Messages replication to prevent data loss • Support loss of brokers up to entire AZ • Much higher parallelism to support more consumers • Longer retention period – 48 hours on most topics
  10. The new Cluster changes • 18 m1.xlarge brokers, 6 per

    AZ • Replication factor of 3 • All partitions are distributed between AZ • Topics # of partitions increased (between 12 to 120 depends on parallelism needs) • 4 Network and IO threads • Default log retention 48 hours • Auto Leader rebalance enabled • Imbalanced ratio set to default 15% * Leader: For each partition there is a leader which serve for writes and reads and the other brokers are replicated from * Imbalance ratio: The highest percentage of leadership a broker can hold, above that auto rebalance is initiate Glossary
  11. And After a few Months

  12. Problems • Uneven distributions of leaders which cause high load

    on specific brokers and eventually lag in consumers and brokers failures • Constantly rebalanced of brokers leaders which caused failures in python producers
  13. Solutions • Increase number of brokers to 24 improve broker

    leadership distribution • Rewrite Python producers in Clojure • Decrease number of partitions where high parallelism is not needed
  14. Traffic increasing and...

  15. Problems • High Iowait in the brokers • Missing ISR

    due to leaders overloaded • Network bandwidth close to thresholds • Lag in consumers * ISR: In Active Replicas Glossary
  16. More Solutions • Split into 2 clusters: launches which contain

    80% of messages and all the rest • Move launches cluster to i2.2xlarge with local SSD • Finer tuning of leaders • Increase number of IO and Network Threads • Enable AWS enhanced networking
  17. And some few more... • Decrease Replication factor to 2

    in Launches cluster to reduce load on leaders, reduce disk capacity and AZ traffic costs • Move 2nd cluster to i2.2xlarge as well • Upgrade ZK due to performance issues
  18. Lessons learned • Minimize replication factor as possible to avoid

    extra load on the Leaders • Make sure that leaders count is well balanced between brokers • Balance partition number to support parallelism • Split cluster logically considering traffic and business importance • Retention (time based) should be long enough to recover from failures • In AWS, spread cluster between AZ • Support cluster dynamic changes by clients • Create automation for reassign • Save cluster-reassignment.json of each topic for future needs! • Don't be to cheap on the Zookeepers
  19. Testing the cluster • Load test using kafka-producer-perf-test.sh & kafka-consumer-perf-test.sh

    • Broker failure while running • Entire AZ failure while running • Reassign partitions on the fly • Kafka dashboard contains: Leader election rate, ISR status, offline partitions count, Log Flush time, All Topics Bytes in per broker, IOWait, LoadAvg, Disk Capacity and more • Set appropriate alerts
  20. Collecting metrics & Alerting • Using Airbnb plugin for Kafka,

    sending metrics to graphite • Internal application that collects Lag for each Topic and send values to graphite • Alerts are set on Lag (For each topic, Under replicated partitions, Broker topic metrics below threshold, Leader reelection
  21. None