Monitoring JUST EAT on AWS

Or, why we didn't just use CloudWatch.

Peter Mounce

April 24, 2015



  1. Monitoring JUST EAT on AWS (Or: why we didn’t just use AWS CloudWatch)

    Peter Mounce @petemounce / @justeat_tech
  2. What did we want?

    • One source of truth
    • Alerts that fire in (hopefully) a few seconds
    • Data we can keep for a long time
    • Data we can get rid of when we want
  3. What did we end up with? Publishers: PerfTap + app-specific

    • PerfTap harvests OS-level perf-counters into StatsD
    • Apps publish their own metrics where they choose
  4. What did we end up with? Receiver: StatsD (by Etsy)

    Send metrics over UDP:

    timers.uk.paymentsapi.checkout.200.005.eu-west-1.a:343|ms
  5. What did we end up with?

    Check-runner / alerter: Seyren
  6. What did we end up with? Example alert (comprehensible)

    absolute(
      diffSeries(
        movingAverage(
          sumSeries(
            stats_counts.consumercommunicationservice.uk.*.event-*.reaction-savetoken.*.eu-west-1.*
          ), 50),
        movingAverage(
          sumSeries(
            stats.timers.api-consumer.asp-net-responses.*authorizetoken.put.200.*.*.*.count,
            stats.timers.api-consumer.asp-net-responses.loginuser.post.200.*.*.*.count,
            stats.timers.api-consumer.asp-net-responses.create.post.201.*.*.*.count
          ), 50)
      )
    )
  7. What did we end up with? Some other stuff too

    • PagerDuty
    • Grafana
    • HipChat
  8. What does it cost?

    Graphite + whisper   1x m3.2xlarge, 12x 1TB @ 500 PIOPs
    StatsD               1x m3.xlarge
    Carbon-relay         1x m3.xlarge
    Seyren               1x c3.xlarge
    Grafana              S3 website
    PagerDuty            somebody else’s problem ;-)

    Buys: 200k metrics / sec & alarm latency around 2min
  9. What did we gain? Rich set of data analysis functions

    Graphite has more analysis functions than CloudWatch does.

    • Graphite: ~100
    • CloudWatch: 5…?
  10. What did we gain? Capability for historical analysis

    CloudWatch retains data for 2 weeks … or until shortly after resources are terminated … so we would need to archive data ourselves.
  11. What did we gain? Our MTR-React is shorter

    CloudWatch:

    • 1 min granularity
    • ~2 min latency

    (CloudWatch::DynamoDB - 5 min granularity on CCU)
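
Slide 4 shows a timer sample in the StatsD line format (`<name>:<value>|ms`) sent over UDP. As a rough illustration of what an app-side publisher might do, here is a minimal Python sketch. The helper names `format_timer` and `send_metric` and the `127.0.0.1` host are assumptions for the example, not anything taken from the deck; 8125 is StatsD's default port.

```python
import socket


def format_timer(name: str, millis: int) -> bytes:
    """Encode one timer sample in the StatsD line format '<name>:<value>|ms'."""
    return f"{name}:{millis}|ms".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one datagram and forget about it.

    UDP gives no delivery guarantee, so a slow or missing StatsD
    receiver does not block the publishing application.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Metric name copied from slide 4; host and port are assumed values.
    send_metric(format_timer("timers.uk.paymentsapi.checkout.200.005.eu-west-1.a", 343))
```

The fire-and-forget design is the usual reason for choosing UDP here: losing an occasional sample is cheaper than adding latency to the request being measured.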
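
The example alert on slide 6 appears to compare two smoothed totals (token-save reactions on one side, successful authorize/login/create responses on the other) and report how far apart they are. The sketch below is a toy re-implementation of that shape in plain Python, not Graphite's own code: the function names mirror the Graphite functions, the sample numbers are invented, and the window is 2 rather than 50 to keep the example short.

```python
from typing import List, Optional, Sequence

Series = List[Optional[float]]


def sum_series(*series: Sequence[Optional[float]]) -> Series:
    """Point-wise sum; a point is None only when every input is None there."""
    out: Series = []
    for points in zip(*series):
        present = [p for p in points if p is not None]
        out.append(sum(present) if present else None)
    return out


def moving_average(series: Sequence[Optional[float]], window: int) -> Series:
    """Trailing mean over up to `window` points, including the current one."""
    out: Series = []
    for i in range(len(series)):
        chunk = [p for p in series[max(0, i - window + 1): i + 1] if p is not None]
        out.append(sum(chunk) / len(chunk) if chunk else None)
    return out


def diff_series(first: Sequence[Optional[float]], *rest: Sequence[Optional[float]]) -> Series:
    """Point-wise: the first series minus all the others."""
    out: Series = []
    for points in zip(first, *rest):
        if points[0] is None:
            out.append(None)
        else:
            out.append(points[0] - sum(p for p in points[1:] if p is not None))
    return out


def absolute(series: Sequence[Optional[float]]) -> Series:
    """Point-wise absolute value, passing gaps through unchanged."""
    return [abs(p) if p is not None else None for p in series]


# Invented sample data: the two sides agree until the last interval.
tokens_saved = [10, 10, 10, 4]
token_auths = [3, 3, 3, 3]
logins = [6, 6, 6, 6]
signups = [1, 1, 1, 1]

alert_value = absolute(
    diff_series(
        moving_average(sum_series(tokens_saved), 2),
        moving_average(sum_series(token_auths, logins, signups), 2),
    )
)
print(alert_value)  # [0.0, 0.0, 0.0, 3.0]
```

A check-runner such as Seyren would then compare the latest value of a series like this against warning and error thresholds; the thresholds themselves are not given in the deck.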