Monitoring - AF Tools Of The Trade
• Graphite - graphing over statsd metrics
• Consul - service discovery and configuration management
• Sensu - monitoring framework
• PagerDuty - waking you in the middle of the night!
• Slack - AF main communication channel, including some alerting
• Santa - proprietary in-house deployment system, allowing for healing and auto-scaling
monitoring
• Manual Graphite and Kafka web-view checks
• Life itself - it forced us to write AF-Kafka-Monitor
  • Using Kafka cmdline tools
  • Monitors producing to every topic / partition
  • Monitors lag on every consumer group
• AF-Kafka-Monitor now monitors
  • ~80 consumer groups
  • ~15 different topics
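As a rough illustration of what "lag on every consumer group" means, the sketch below computes lag per partition as the log end offset minus the group's committed offset. It is not AF-Kafka-Monitor's code: the function names and the dictionary inputs are hypothetical, and treating a partition with no committed offset as fully unread is an assumption made for this example. The offsets themselves would come from whatever source is available, for example the output of the Kafka command-line tools.

```python
from typing import Dict, Tuple

# (topic, partition) pair; a hypothetical alias used only in this sketch.
TopicPartition = Tuple[str, int]


def partition_lags(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return lag per partition: log end offset minus committed offset."""
    lags = {}
    for tp, end in end_offsets.items():
        # Assumption: a partition without a committed offset counts as
        # completely unread, so its lag equals the end offset.
        committed = committed_offsets.get(tp, 0)
        # Clamp at zero in case the two readings were taken at slightly
        # different moments and the committed offset is ahead.
        lags[tp] = max(end - committed, 0)
    return lags


def group_lag(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> int:
    """Return the total lag of one consumer group across all partitions."""
    return sum(partition_lags(end_offsets, committed_offsets).values())
```

Running `group_lag` once per consumer group and pushing the result to a metrics backend would give one lag series per group, which is the kind of number the slides that follow rely on.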
day of the week
• Traffic growth, a constant “problem” at AppsFlyer ;)
• Why not AWS auto-scale?
  • CPU (for example) is NOT the right metric.
  • Kafka consumer lag is EXACTLY the right metric - is the service keeping up?
reads 2.5-4M launch events per minute, classifies according to media source
• Lag goes above 2.5M threshold
• Auto scale launches another 5 instances
• Lag goes back down (steady state)
• Regular daily occurrence
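The scale-up rule in this example can be sketched as a small decision function. The 2.5M threshold and the step of 5 instances come from the slide; the function name, the `max_instances` cap, and its default value are hypothetical additions for illustration, and this is not Santa's actual logic. Scale-down is left out because the slides do not describe how it is handled.

```python
LAG_THRESHOLD = 2_500_000  # messages; threshold quoted on the slide
SCALE_STEP = 5             # instances added per scale-up, per the slide


def desired_instances(
    current: int,
    lag: int,
    threshold: int = LAG_THRESHOLD,
    step: int = SCALE_STEP,
    max_instances: int = 50,  # hypothetical safety cap, not from the source
) -> int:
    """Return the instance count to run given the current consumer lag."""
    if lag > threshold:
        # The service is falling behind: add a batch of instances,
        # but never exceed the (assumed) upper bound.
        return min(current + step, max_instances)
    # Lag is at or below the threshold: keep the current size.
    return current
```

Because the input is consumer lag rather than CPU, the function answers the question the previous slide poses directly: is the service keeping up with the topic?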