Testing, Monitoring and other sorts of magic - Apache Kafka

TESTING, MONITORING AND OTHER TYPES OF MAGIC...
Adi Belan, Plumbing Team Leader

APPSFLYER R&D METHODOLOGIES USING KAFKA

Monitoring

Monitoring - AF Tools Of The Trade
• Statsd - aggregation and summary of application metrics
• Graphite - graphing over statsd metrics
• Consul - service discovery and configuration management
• Sensu - monitoring framework
• Pagerduty - waking you in the middle of the night!
• Slack - AF main communication channel, including some alerting
• Santa - proprietary in-house deployment system, allowing for healing and auto-scaling

Always Be Monitoring!
• Started using Kafka without any(!!) related monitoring
• Manual Graphite and Kafka web-view checks
• Life itself - it forced us to write AF-Kafka-Monitor
  • Uses Kafka cmdline tools
  • Monitors producing to every topic / partition
  • Monitors lag on every consumer group
• AF-Kafka-Monitor now monitors
  • ~80 consumer groups
  • ~15 different topics

Kafka Web View

Kafka Web View 2
• Lag = logSize - Offset
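
The lag formula can be illustrated with a small sketch. The per-partition readings below are made-up numbers, and the map keys (`:log-size`, `:offset`) are hypothetical names chosen for this example, not fields of any Kafka API.

```clojure
;; Hypothetical per-partition readings; the numbers are invented for illustration.
(def partitions
  [{:partition 0 :log-size 1500 :offset 1200}
   {:partition 1 :log-size 1480 :offset 1480}
   {:partition 2 :log-size 1510 :offset 1000}])

(defn partition-lag
  "Lag = logSize - Offset: messages written to the partition but not yet consumed."
  [{:keys [log-size offset]}]
  (- log-size offset))

(defn total-lag
  "Sum of the per-partition lags for one consumer group on one topic."
  [parts]
  (reduce + (map partition-lag parts)))

(total-lag partitions) ;; 300 + 0 + 510 = 810
```

A lag of 0 on a partition means the consumer has caught up with the producer on that partition.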

AF-Kafka-Monitor
• Calls Kafka cmdline tools every 60 seconds (per topic / consumer group)
• Publishes results as statsd metrics
• Calling consumer lag cmdline:

    (require '[clojure.java.shell :refer [sh]])

    ;; Runs ConsumerOffsetChecker for one consumer group and topic.
    ;; Returns the tool's stdout, or nil when the exit code is non-zero.
    (defn call-kafka-cmdline [consumer topic zkconnect-string]
      (let [res (sh "/home/docker/kafka_2.9.2-0.8.1.1/bin/kafka-run-class.sh"
                    "kafka.tools.ConsumerOffsetChecker"
                    "--zkconnect" zkconnect-string
                    "--group" (str consumer)
                    "--topic" (str topic))]
        (when (= (:exit res) 0)
          (:out res))))
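
The step from cmdline output to statsd metrics is not shown on the slide. A minimal sketch of it follows, assuming the tool prints a header row followed by whitespace-separated rows with the columns Group, Topic, Pid, Offset, logSize, Lag and Owner. The function names, the sample text and the metric naming scheme are hypothetical; only the `name:value|g` gauge line format comes from the statsd protocol.

```clojure
(require '[clojure.string :as str])

(defn parse-lag-rows
  "Parses ConsumerOffsetChecker-style text into maps.
  Assumes a header row, then whitespace-separated rows with the columns
  Group Topic Pid Offset logSize Lag [Owner]. Short or blank rows are skipped."
  [out]
  (for [line (rest (str/split-lines out))
        :let [[group topic pid offset log-size lag] (str/split (str/trim line) #"\s+")]
        :when lag]
    {:group    group
     :topic    topic
     :pid      (Long/parseLong pid)
     :offset   (Long/parseLong offset)
     :log-size (Long/parseLong log-size)
     :lag      (Long/parseLong lag)}))

(defn lag->statsd
  "Formats one parsed row as a statsd gauge line (hypothetical metric naming)."
  [{:keys [group topic pid lag]}]
  (format "kafka.lag.%s.%s.%d:%d|g" group topic pid lag))

;; Invented sample text in the assumed column layout.
(def sample-out
  (str "Group Topic Pid Offset logSize Lag Owner\n"
       "my-group my-topic 0 1200 1500 300 none\n"
       "my-group my-topic 1 1480 1480 0 none\n"))

(map lag->statsd (parse-lag-rows sample-out))
;; ("kafka.lag.my-group.my-topic.0:300|g" "kafka.lag.my-group.my-topic.1:0|g")
```

Each resulting line could then be sent to the statsd daemon by whatever client the service already uses.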

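AF-Kafka-Monitor 2
• Similarly to consumer lag, we monitor topic offset progression

    (require '[clojure.java.shell :refer [sh]])

    ;; Runs GetOffsetShell for one topic; "--time -1" asks for the latest offsets.
    ;; Returns the tool's stdout, or nil when the exit code is non-zero.
    (defn call-kafka-cmdline-GetOffsetShell [topic kafka-brokers]
      (let [res (sh "/home/docker/kafka_2.9.2-0.8.1.1/bin/kafka-run-class.sh"
                    "kafka.tools.GetOffsetShell"
                    "--broker-list" (str kafka-brokers)
                    "--topic" (str topic)
                    "--time" (str -1))]
        (when (= (:exit res) 0)
          (:out res))))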
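
Offset progression is the difference between two consecutive polls. The sketch below assumes the tool prints one `topic:partition:offset` line per partition; the function names and the sample strings are hypothetical and only illustrate the calculation.

```clojure
(require '[clojure.string :as str])

(defn parse-offsets
  "Parses lines of the assumed form topic:partition:offset into {partition offset}.
  Topic names containing ':' are not handled in this sketch."
  [out]
  (into {}
        (for [line (str/split-lines out)
              :let [[_ part off] (str/split (str/trim line) #":")]
              :when off]
          [(Long/parseLong part) (Long/parseLong off)])))

(defn offset-progress
  "Messages produced per partition between two polls.
  A partition missing from the previous poll is reported as 0 progress."
  [prev curr]
  (into {}
        (for [[part off] curr]
          [part (- off (get prev part off))])))

;; Invented outputs of two polls taken 60 seconds apart.
(def poll-1 "my-topic:0:1000\nmy-topic:1:2000\n")
(def poll-2 "my-topic:0:1600\nmy-topic:1:2000\n")

(offset-progress (parse-offsets poll-1) (parse-offsets poll-2))
;; {0 600, 1 0}
```

In this example partition 1 made no progress between polls, which is the kind of reading that would point at the producing side.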

Grafana AF-Kafka-Monitor
• A jump in lag indicates a problem with the consuming service

Grafana AF-Kafka-Monitor
• A drop in offset progression indicates a problem with the producing service

Alerting Based on AF-Kafka-Monitor Metrics
• Sensu alerts send a Slack warning / PagerDuty alert on configured thresholds
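
The threshold logic can be sketched as a pure function. This is not Sensu configuration: the threshold values and the routing table are hypothetical, and the split of warnings to Slack and critical alerts to PagerDuty is an assumption based on the bullet above.

```clojure
(defn alert-level
  "Maps a lag reading to :ok, :warning or :critical.
  Both thresholds are hypothetical configuration values."
  [{:keys [warning critical]} lag]
  (cond
    (>= lag critical) :critical
    (>= lag warning)  :warning
    :else             :ok))

(def route
  "Assumed routing: warnings go to Slack, critical alerts to PagerDuty."
  {:ok nil :warning :slack :critical :pagerduty})

;; Invented thresholds, for illustration only.
(def thresholds {:warning 1000000 :critical 2500000})

(route (alert-level thresholds 1200000)) ;; :slack
(route (alert-level thresholds 3000000)) ;; :pagerduty
```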

Auto Scaling

Why Auto-Scale?
• Traffic varies with time of day and day of the week
• Traffic growth, a constant “problem” at AppsFlyer ;)
• Why not AWS auto-scale?
  • CPU (for example) is NOT the right metric.
  • Kafka Consumer Lag is EXACTLY the right metric - is the service keeping up?

Traffic Time of Day Variance
• Launches vary between 2.6-4M launches per minute (60% diff)

AppsFlyer Traffic Growth
• Moving average of total SDK launches per minute (last ~9 months)

AppsFlyer Auto Scaling - af-loyals - Scaling Up
• Service reads 2.5-4M launch events per minute, classifies according to media source
• Lag goes above 2.5M threshold
• Auto scale launches another 5 instances
• Lag goes back down (steady state)
• Regular daily occurrence

AppsFlyer Auto Scaling - af-loyals - Scaling Down
• Same service
• Lag stays consistently below 2M
• Auto scale stops 3 machines, then another 2 (hitting min)
• Regular daily occurrence
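
The behaviour on the two af-loyals slides can be sketched as a decision function with separate thresholds for scaling up and down. The 2.5M and 2M lag thresholds and the step sizes of 5 and 3 come from the slides; the instance bounds, the key names and the function itself are hypothetical and are not how Santa is implemented.

```clojure
(def policy
  ;; Lag thresholds and step sizes are from the slides;
  ;; :min-instances and :max-instances are hypothetical bounds.
  {:scale-up-lag   2500000
   :scale-down-lag 2000000
   :up-step        5
   :down-step      3
   :min-instances  5
   :max-instances  20})

(defn desired-instances
  "Returns the instance count to run, given the current count and recent lag samples.
  Scales up as soon as the latest sample exceeds the upper threshold; scales down
  only when every recent sample is below the lower threshold."
  [{:keys [scale-up-lag scale-down-lag up-step down-step min-instances max-instances]}
   current lag-samples]
  (cond
    (empty? lag-samples)                       current
    (> (last lag-samples) scale-up-lag)        (min max-instances (+ current up-step))
    (every? #(< % scale-down-lag) lag-samples) (max min-instances (- current down-step))
    :else                                      current))

(desired-instances policy 5 [1800000 2600000])  ;; 10: lag crossed 2.5M, add 5
(desired-instances policy 10 [1500000 1400000]) ;; 7: consistently below 2M, stop 3
(desired-instances policy 7 [1500000 1400000])  ;; 5: only 2 more can stop before the minimum
```

Keeping the scale-down threshold below the scale-up threshold leaves a band in which the instance count stays put, which avoids flapping when lag hovers near a single limit.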

Auto Scale Consul Configuration

Auto Scale Live Demo
• Demo Scenario: Producer and Consumer 1, 200 messages per second

Auto Scale Live Demo
• Demo Scenario: Producer and Consumer 1, 400 messages per second

Auto Scale Live Demo
• Demo Scenario: Producer, Consumer 1 and Consumer 2, 400 messages per second

WE ARE HIRING!! Email: [email protected] Twitter: @adibelan
