
Monitoring JUST EAT on AWS


Or, why we didn't just use CloudWatch.

Peter Mounce

April 24, 2015


Transcript

  1. Monitoring JUST EAT on AWS
    (Or: why we didn’t just use AWS CloudWatch)
    Peter Mounce @petemounce / @justeat_tech


  2. What did we want?
    One source of truth
    Alerts that fire in (hopefully) a few seconds
    Data we can keep for a long time
    Data we can get rid of when we want


  3. What did we end up with?
    Publishers: PerfTap + app-specific
    PerfTap harvests OS-level perf-counters into StatsD
    Apps publish their own metrics where they choose

  4. What did we end up with?
    Receiver: StatsD (by Etsy)
    Send metrics over UDP:
    timers.uk.paymentsapi.checkout.200.005.eu-west-1.a:343|ms
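
The metric line on this slide follows the StatsD timer format, `<name>:<value>|ms`, sent as a single UDP datagram. A minimal sketch of a publisher is shown here. The helper names `format_timer` and `send_metric` are hypothetical, and the host and port are assumptions (8125 is the StatsD default port); this is not the code JUST EAT runs.

```python
import socket


def format_timer(name: str, millis: int) -> str:
    """Build a StatsD timer payload of the form '<name>:<value>|ms'."""
    return f"{name}:{millis}|ms"


def send_metric(payload: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget: send one payload as a single UDP datagram.

    UDP gives no delivery guarantee, so the caller never blocks on the
    metrics pipeline. Host and port here are assumptions.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Metric name and value copied from the slide.
    send_metric(
        format_timer("timers.uk.paymentsapi.checkout.200.005.eu-west-1.a", 343)
    )
```

Because the send is connectionless, a slow or absent receiver does not slow the application down; the trade-off is that datagrams can be dropped silently.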

  5. What did we end up with?
    Aggregator: Graphite

  6. What did we end up with?
    Check-runner / alerter: Seyren

  7. What did we end up with?
    Example alert
    absolute(diffSeries(movingAverage(sumSeries(stats_counts.consumercommunicationservice.uk.*.event-*.reaction-savetoken.*.eu-west-1.*),50),movingAverage(sumSeries(stats.timers.api-consumer.asp-net-responses.*authorizetoken.put.200.*.*.*.count,stats.timers.api-consumer.asp-net-responses.loginuser.post.200.*.*.*.count,stats.timers.api-consumer.asp-net-responses.create.post.201.*.*.*.count),50)))
    Just kidding.

  8. What did we end up with?
    Example alert (comprehensible)
    absolute(
      diffSeries(
        movingAverage(
          sumSeries(
            stats_counts.consumercommunicationservice.uk.*.event-*.reaction-savetoken.*.eu-west-1.*
          ),
          50
        ),
        movingAverage(
          sumSeries(
            stats.timers.api-consumer.asp-net-responses.*authorizetoken.put.200.*.*.*.count,
            stats.timers.api-consumer.asp-net-responses.loginuser.post.200.*.*.*.count,
            stats.timers.api-consumer.asp-net-responses.create.post.201.*.*.*.count
          ),
          50
        )
      )
    )
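
To make the nesting of this alert concrete, here is a rough sketch of what the expression computes, written as plain Python over lists of numbers. It appears to compare a smoothed count of save-token events against a smoothed count of successful authorize, login and create responses, and to report the size of the gap. The function names mirror the Graphite functions but are simplified re-implementations, not Graphite's code: in particular this `moving_average` averages over however many points exist so far at the start of a series, and `token_gap` is a hypothetical name.

```python
from typing import List, Sequence

Series = List[float]


def sum_series(series_list: Sequence[Sequence[float]]) -> Series:
    """Point-by-point sum of several equally long series."""
    return [sum(points) for points in zip(*series_list)]


def moving_average(series: Sequence[float], window: int) -> Series:
    """Trailing mean over up to `window` points (simplified warm-up)."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def diff_series(first: Sequence[float], second: Sequence[float]) -> Series:
    """Point-by-point difference: first minus second."""
    return [a - b for a, b in zip(first, second)]


def absolute(series: Sequence[float]) -> Series:
    """Point-by-point absolute value."""
    return [abs(v) for v in series]


def token_gap(saved: Sequence[Sequence[float]],
              issued: Sequence[Sequence[float]],
              window: int = 50) -> Series:
    """absolute(diffSeries(movingAverage(sumSeries(saved), window),
                           movingAverage(sumSeries(issued), window)))"""
    smoothed_saved = moving_average(sum_series(saved), window)
    smoothed_issued = moving_average(sum_series(issued), window)
    return absolute(diff_series(smoothed_saved, smoothed_issued))
```

When the two sides track each other the result stays near zero; a sustained non-zero value is the kind of signal a check-runner could compare against a threshold.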

  9. What did we end up with?
    Some other stuff too
    ● PagerDuty
    ● Grafana
    ● HipChat

  10. What does it look like?
    Diagram credit


  11. What does it cost?
    Component           Resources
    Graphite + whisper  1x m3.2xlarge, 12x 1TB @ 500 PIOPS
    StatsD              1x m3.xlarge
    Carbon-relay        1x m3.xlarge
    Seyren              1x c3.xlarge
    Grafana             S3 website
    PagerDuty           somebody else’s problem ;-)
    Buys: 200k metrics / sec & alarm latency around 2 min


  12. What did we gain?
    Rich set of data analysis functions
    Graphite has more analysis functions than CloudWatch does.
    Graphite: ~100
    CloudWatch: 5…?

  13. What did we gain?
    Capability for historical analysis
    CloudWatch retains data for 2 weeks
    … or until shortly after resources are terminated
    … so we would need to archive data ourselves

  14. What did we gain?
    Our MTR-React is shorter
    CloudWatch
    ● 1 min granularity
    ● ~2 min latency
    (CloudWatch::DynamoDB - 5 min granularity on CCU)

  15. Happiness! (Mostly)

  16. We’re recruiting!
    http://tech.just-eat.com/jobs