Slide 1

Slide 1 text

Monitoring JUST EAT on AWS
(Or: why we didn’t just use AWS CloudWatch)
Peter Mounce @petemounce / @justeat_tech

Slide 2

Slide 2 text

What did we want?
● One source of truth
● Alerts that fire in (hopefully) a few seconds
● Data we can keep for a long time
● Data we can get rid of when we want

Slide 3

Slide 3 text

What did we end up with?
Publishers: PerfTap + app-specific
● PerfTap harvests OS-level perf-counters into StatsD
● Apps publish their own metrics where they choose

Slide 4

Slide 4 text

What did we end up with?
Send metrics over UDP:
timers.uk.paymentsapi.checkout.200.005.eu-west-1.a:343|ms
Receiver: StatsD (by Etsy)
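The datagram on this slide follows the StatsD timer format, `<metric>:<value>|ms`. A minimal sketch of an app-side publisher, assuming a StatsD daemon on localhost and the default port 8125 (the host and the helper names `format_timer` and `send_metric` are illustrative, not taken from the talk):

```python
import socket


def format_timer(metric: str, millis: int) -> bytes:
    """Build a StatsD timer datagram of the form '<metric>:<value>|ms'."""
    return f"{metric}:{millis}|ms".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one datagram over UDP; there is no acknowledgement or retry."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# The metric name and value are the ones shown on the slide.
payload = format_timer("timers.uk.paymentsapi.checkout.200.005.eu-west-1.a", 343)

if __name__ == "__main__":
    # Assumed receiver address; replace with the real StatsD host.
    send_metric(payload)
```

Because UDP is fire-and-forget, a slow or absent receiver does not hold up the publishing app; the trade-off is that lost datagrams go unnoticed.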

Slide 5

Slide 5 text

What did we end up with?
Aggregator: Graphite
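Once metrics are in Graphite they can be read back through its HTTP render API. A rough sketch of building such a query and reading the newest value from a JSON response; the host name is hypothetical, the target is one of the series from the example alert in this deck, and the sample response is made up for illustration:

```python
import json
from urllib.parse import urlencode


def build_render_url(base_url: str, target: str, window: str = "-10min") -> str:
    """Build a Graphite render API URL that asks for JSON output."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url}/render?{query}"


def latest_value(render_json: str):
    """Return the newest non-null value of the first series, or None."""
    series = json.loads(render_json)
    if not series:
        return None
    # Each datapoint is a [value, timestamp] pair, oldest first.
    for value, _timestamp in reversed(series[0]["datapoints"]):
        if value is not None:
            return value
    return None


TARGET = "stats.timers.api-consumer.asp-net-responses.loginuser.post.200.*.*.*.count"
url = build_render_url("http://graphite.example.internal", TARGET)  # hypothetical host

# Made-up response in the shape the render API returns for format=json.
sample = '[{"target": "t", "datapoints": [[2.0, 1400000000], [3.0, 1400000060], [null, 1400000120]]}]'
newest = latest_value(sample)
```

The trailing null in the sample mimics the most recent bucket not being filled yet, which is why the helper walks backwards to the last non-null point.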

Slide 6

Slide 6 text

What did we end up with?
Check-runner / alerter: Seyren
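A check-runner of this kind periodically evaluates a Graphite target and compares the latest value against a warn and an error threshold. The sketch below shows that comparison in general terms; it is an assumption about how such a check behaves, not Seyren's actual code, and `check_state` is a hypothetical name:

```python
def check_state(value: float, warn: float, error: float) -> str:
    """Classify a value as OK, WARN, or ERROR against two thresholds.

    Assumption: when error >= warn, higher values are worse;
    when error < warn, lower values are worse.
    """
    if error >= warn:
        if value >= error:
            return "ERROR"
        if value >= warn:
            return "WARN"
        return "OK"
    if value <= error:
        return "ERROR"
    if value <= warn:
        return "WARN"
    return "OK"


# Higher-is-worse example with illustrative thresholds.
states = [check_state(v, warn=5, error=10) for v in (1, 7, 12)]
```

An alerter would then notify only on state transitions (for example OK to ERROR), so a value that stays above the threshold does not page repeatedly.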

Slide 7

Slide 7 text

What did we end up with?
Example alert:
absolute(diffSeries(movingAverage(sumSeries(stats_counts.consumercommunicationservice.uk.*.event-*.reaction-savetoken.*.eu-west-1.*),50),movingAverage(sumSeries(stats.timers.api-consumer.asp-net-responses.*authorizetoken.put.200.*.*.*.count,stats.timers.api-consumer.asp-net-responses.loginuser.post.200.*.*.*.count,stats.timers.api-consumer.asp-net-responses.create.post.201.*.*.*.count),50)))
Just kidding.

Slide 8

Slide 8 text

What did we end up with?
Example alert (comprehensible):
absolute(
  diffSeries(
    movingAverage(
      sumSeries(
        stats_counts.consumercommunicationservice.uk.*.event-*.reaction-savetoken.*.eu-west-1.*
      ),
      50
    ),
    movingAverage(
      sumSeries(
        stats.timers.api-consumer.asp-net-responses.*authorizetoken.put.200.*.*.*.count,
        stats.timers.api-consumer.asp-net-responses.loginuser.post.200.*.*.*.count,
        stats.timers.api-consumer.asp-net-responses.create.post.201.*.*.*.count
      ),
      50
    )
  )
)
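Read inside out, the expression sums two groups of series, smooths each with a 50-point moving average, subtracts one from the other, and takes the absolute value; presumably the alert fires when the two flows stop tracking each other. A simplified Python sketch of those four functions on plain lists, with made-up data and a window of 3 instead of 50 (Graphite's own null handling and windowing differ in detail):

```python
from typing import List, Optional, Sequence

Series = List[Optional[float]]


def sum_series(*series: Sequence[Optional[float]]) -> Series:
    """Point-wise sum; a point is None only if every input is None there."""
    out: Series = []
    for points in zip(*series):
        present = [p for p in points if p is not None]
        out.append(sum(present) if present else None)
    return out


def moving_average(series: Sequence[Optional[float]], window: int) -> Series:
    """Mean of the non-null points among the last `window` points at each index."""
    out: Series = []
    for i in range(len(series)):
        recent = [p for p in series[max(0, i - window + 1): i + 1] if p is not None]
        out.append(sum(recent) / len(recent) if recent else None)
    return out


def diff_series(first: Sequence[Optional[float]], second: Sequence[Optional[float]]) -> Series:
    """Point-wise first minus second; None if either side is missing."""
    return [None if a is None or b is None else a - b for a, b in zip(first, second)]


def absolute(series: Sequence[Optional[float]]) -> Series:
    """Point-wise absolute value, passing None through."""
    return [None if p is None else abs(p) for p in series]


# Made-up counts per interval: token saves stop after the third interval.
tokens_saved = [10, 12, 11, 0, 0, 0]
logins = [6, 7, 6, 6, 7, 6]
creates = [4, 5, 5, 4, 5, 5]

signal = absolute(
    diff_series(
        moving_average(sum_series(tokens_saved), 3),
        moving_average(sum_series(logins, creates), 3),
    )
)
```

With this data the signal is 0 while the two flows match and rises to 11 by the last interval, which is the kind of divergence a threshold check could alert on.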

Slide 9

Slide 9 text

What did we end up with?
Some other stuff too:
● PagerDuty
● Grafana
● HipChat

Slide 10

Slide 10 text

What does it look like?
Diagram credit

Slide 11

Slide 11 text

What does it cost?
● Graphite + whisper: 1x m3.2xlarge, 12x 1TB @ 500 PIOPS
● StatsD: 1x m3.xlarge
● Carbon-relay: 1x m3.xlarge
● Seyren: 1x c3.xlarge
● Grafana: S3 website
● PagerDuty: somebody else’s problem ;-)
Buys: 200k metrics / sec & alarm latency around 2min

Slide 12

Slide 12 text

What did we gain?
Rich set of data analysis functions
Graphite has more analysis functions than CloudWatch does.
● Graphite: ~100
● CloudWatch: 5…?

Slide 13

Slide 13 text

What did we gain?
Capability for historical analysis
CloudWatch:
● retains data for 2 weeks
● … or until shortly after resources are terminated
● … so we would need to archive data ourselves

Slide 14

Slide 14 text

What did we gain?
Our MTR-React is shorter
CloudWatch:
● 1 min granularity
● ~2 min latency
(CloudWatch::DynamoDB - 5 min granularity on CCU)

Slide 15

Slide 15 text

Happiness! (Mostly)

Slide 16

Slide 16 text

We’re recruiting!
http://tech.just-eat.com/jobs