
Testing, Monitoring and other sorts of magic - Apache Kafka

AppsFlyer
January 14, 2016



Transcript

  1. Monitoring - AF Tools Of The Trade

    • StatsD - aggregation and summary of application metrics
    • Graphite - graphing over StatsD metrics
    • Consul - service discovery and configuration management
    • Sensu - monitoring framework
    • PagerDuty - waking you in the middle of the night!
    • Slack - AF main communication channel, including some alerting
    • Santa - proprietary in-house deployment system, allowing for healing and auto-scaling
  2. Always Be Monitoring!

    • Started using Kafka without any(!!) related monitoring
    • Manual Graphite and Kafka web-view checks
    • Life itself forced us to write AF-Kafka-Monitor
      • Uses Kafka command-line tools
      • Monitors producing to every topic / partition
      • Monitors lag on every consumer group
    • AF-Kafka-Monitor now monitors
      • ~80 consumer groups
      • ~15 different topics
  3. AF-Kafka-Monitor

    • Calls Kafka command-line tools every 60 seconds (per topic / consumer group)
    • Publishes results as StatsD metrics
    • Calling the consumer lag command-line tool:

        ;; sh comes from clojure.java.shell
        (require '[clojure.java.shell :refer [sh]])

        (defn call-kafka-cmdline [consumer topic zkconnect-string]
          (let [res (sh "/home/docker/kafka_2.9.2-0.8.1.1/bin/kafka-run-class.sh"
                        "kafka.tools.ConsumerOffsetChecker"
                        "--zkconnect" zkconnect-string
                        "--group" (str consumer)
                        "--topic" (str topic))]
            ;; Return stdout only when the tool exits successfully, otherwise nil.
            (when (= (:exit res) 0)
              (:out res))))
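The text returned by a call like `call-kafka-cmdline` still has to be turned into numbers before it can be published. The Python sketch below shows one way that step might look. It assumes the tool prints a header row followed by whitespace-separated columns in the order `Group Topic Pid Offset logSize Lag Owner`; the metric prefix `kafka.lag`, the group and topic names, and the StatsD host and port defaults are hypothetical and not taken from the deck.

```python
import socket


def parse_consumer_lag(output):
    """Return {partition_id: lag} from ConsumerOffsetChecker-style text.

    Assumed column order: Group Topic Pid Offset logSize Lag Owner.
    """
    lags = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (its Pid column is not a number).
        if len(fields) < 6 or not fields[2].isdigit():
            continue
        lags[int(fields[2])] = int(fields[5])
    return lags


def statsd_gauge(name, value):
    """Format one gauge in the StatsD plain-text line protocol."""
    return f"{name}:{value}|g"


def lag_metrics(consumer, topic, output):
    """Build one gauge per partition plus a total for the consumer group."""
    lags = parse_consumer_lag(output)
    prefix = f"kafka.lag.{consumer}.{topic}"  # hypothetical naming scheme
    lines = [statsd_gauge(f"{prefix}.{pid}", lag)
             for pid, lag in sorted(lags.items())]
    lines.append(statsd_gauge(f"{prefix}.total", sum(lags.values())))
    return lines


def send_to_statsd(lines, host="127.0.0.1", port=8125):
    """Send each metric line as its own UDP datagram (assumed local agent)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))


# Made-up sample output for illustration only.
SAMPLE = """Group Topic Pid Offset logSize Lag Owner
my-group my-topic 0 1000 1500 500 none
my-group my-topic 1 2000 2250 250 none
"""
```

Calling `lag_metrics("my-group", "my-topic", SAMPLE)` yields three gauge lines, one per partition and one total, which `send_to_statsd` could then emit on each 60-second cycle.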
  4. AF-Kafka-Monitor 2

    • Similarly to consumer lag, we monitor topic offset progression:

        ;; Uses sh from clojure.java.shell.
        (defn call-kafka-cmdline-GetOffsetShell [topic kafka-brokers]
          (let [res (sh "/home/docker/kafka_2.9.2-0.8.1.1/bin/kafka-run-class.sh"
                        "kafka.tools.GetOffsetShell"
                        "--broker-list" (str kafka-brokers)
                        "--topic" (str topic)
                        "--time" (str -1))] ;; -1 requests the latest offset
            ;; Return stdout only when the tool exits successfully, otherwise nil.
            (when (= (:exit res) 0)
              (:out res))))
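Offset progression only becomes visible when two consecutive readings are compared. The Python sketch below illustrates that comparison. It assumes `GetOffsetShell` prints one `topic:partition:offset` line per partition; the function names and the rule that a non-positive delta means "stalled" are illustrative assumptions, not the deck's actual logic.

```python
def parse_latest_offsets(output):
    """Return {partition_id: latest_offset} from topic:partition:offset lines."""
    offsets = {}
    for line in output.splitlines():
        # rsplit keeps any ':' inside the topic name on the left-hand side.
        parts = line.strip().rsplit(":", 2)
        if len(parts) != 3 or not parts[1].isdigit():
            continue
        # If several offsets are listed, the first one is taken.
        first = parts[2].split(",")[0]
        if not first.isdigit():
            continue
        offsets[int(parts[1])] = int(first)
    return offsets


def offset_progress(previous, current):
    """Per-partition offset delta between two polls (0 for new partitions)."""
    return {pid: off - previous.get(pid, off) for pid, off in current.items()}


def stalled_partitions(previous, current):
    """Partitions whose latest offset did not advance between polls."""
    progress = offset_progress(previous, current)
    return sorted(pid for pid, delta in progress.items() if delta <= 0)
```

A partition that legitimately receives no traffic would also show up as stalled here, so a real monitor would probably alert on the whole topic or over a longer window.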
  5. Alerting Based on AF-Kafka-Monitor Metrics

    • Sensu alerts send a Slack warning / PagerDuty alert on configured thresholds
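Threshold-based alerting of this kind can be sketched as a small check function. The Python example below follows the usual Sensu check convention of exit status 0 for OK, 1 for WARNING and 2 for CRITICAL. The routing of warnings to Slack and criticals to PagerDuty is one reading of the slide, and the threshold values used in the usage note are hypothetical.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # Sensu/Nagios-style check statuses


def check_lag(lag, warn_threshold, crit_threshold):
    """Map a consumer-lag value to a check status."""
    if lag >= crit_threshold:
        return CRITICAL
    if lag >= warn_threshold:
        return WARNING
    return OK


def route_alert(status):
    """Pick a notification channel for a status (assumed routing)."""
    routes = {OK: None, WARNING: "slack", CRITICAL: "pagerduty"}
    return routes[status]


def describe(group, lag, status):
    """One-line check output of the sort a handler would forward."""
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    return f"{label}: consumer group {group} lag is {lag}"
```

For instance, with a warning threshold of 2,000,000 and a critical threshold of 2,500,000, a lag of 2,200,000 maps to `WARNING` and would be routed to Slack under these assumptions.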
  6. Alerting Based on AF-Kafka-Monitor Metrics

    • Sensu alerts send a Slack warning / PagerDuty alert on configured thresholds
  7. Why Auto-Scale?

    • Traffic varies with the time of day and the day of the week
    • Traffic growth, a constant “problem” at AppsFlyer ;)
    • Why not AWS auto-scale?
      • CPU (for example) is NOT the right metric.
      • Kafka consumer lag is EXACTLY the right metric - is the service keeping up?
  8. AppsFlyer Auto Scaling - af-loyals - Scaling Up

    • Service reads 2.5-4M launch events per minute and classifies them according to media source
    • Lag goes above the 2.5M threshold
    • Auto scale launches another 5 instances
    • Lag goes back down (steady state)
    • Regular daily occurrence
  9. AppsFlyer Auto Scaling - af-loyals - Scaling Down

    • Same service
    • Lag stays consistently below 2M
    • Auto scale stops 3 machines, then another 2 (hitting min)
    • Regular daily occurrence
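The lag-driven scaling behaviour described for af-loyals can be summarised as a small decision function. The Python sketch below uses the thresholds and step sizes mentioned in the slides (2.5M to scale up by 5, sustained lag below 2M to scale down by 3). The minimum and maximum instance counts, the length of the lag window, and all names are hypothetical; this is not the Santa implementation.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    scale_up_lag: int = 2_500_000    # from the slides
    scale_down_lag: int = 2_000_000  # from the slides
    scale_up_step: int = 5           # from the slides
    scale_down_step: int = 3         # from the slides
    min_instances: int = 5           # hypothetical
    max_instances: int = 30          # hypothetical


def desired_instances(current, recent_lags, policy):
    """Return the target instance count given recent lag samples.

    Scale up as soon as the latest sample crosses the upper threshold;
    scale down only when every sample in the window is below the lower one.
    """
    if not recent_lags:
        return current
    if recent_lags[-1] > policy.scale_up_lag:
        return min(current + policy.scale_up_step, policy.max_instances)
    if all(lag < policy.scale_down_lag for lag in recent_lags):
        return max(current - policy.scale_down_step, policy.min_instances)
    return current
```

With these assumed limits, a fleet of 10 instances under sustained low lag drops to 7 and then to 5, where the minimum caps the second step at 2 machines. The gap between the two thresholds keeps the service from flapping between scaling up and down.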