Modern Monitoring for .NET

In the world of .NET, monitoring your application has traditionally meant installing WMI performance counters and letting Operations sort it out. That means learning how to install performance counters, how to do so via scripts in the cloud, how to get the data out of them, how to alert on them, and how to visualise them in something other than PerfMon. Not to mention how to debug why they break down under load.

Anyone who has done this (and hasn't given up in frustration!) knows that it's no simple task.

Chris and Pete will introduce the open-source tools used at JUST EAT (and various others) in their high-volume, cloud-native, buzzword-compliant, microservice-based .NET platform. This talk will include a whirlwind tour of our tooling (statsd, graphite, grafana, logstash, elasticsearch, kibana) as well as some of the culture changes that using these tools unlocked.

Peter Mounce

July 03, 2015

Transcript

  1. Knowing what went bump
    in Production
    - Modern monitoring in .NET
    @petemounce @chrisannodell & @justeat_tech

  2. In the cloud no-one can
    hear you YSOD

  3. (Custom) performance counters

  4. Logs were everywhere and nowhere

  5. Devs had no insight into production

  6. Hackathon all the things!

  7. Hacked it together in a morning
    (Monitoring, that is)
    v1: .NET app -> UDP -> StatsD -> Graphite -> Pretty chart
    v2: … -> Automated alert via seyren
    v2: … + grafana for much prettier charts & dashboards

  8. Spent next 4 months productionising
    (… but you wouldn’t have to)

  9. ELK to the rescue
    https://registry.hub.docker.com/u/blacktop/elk/
    … and nxlog community edition for log-shipping

  10. Hacked ELK together in a week
    v1: nxlog CE -> logstash -> ElasticSearch -> Kibana
    v2: as above, but stable :-)
    Handling ~220 GB / day

  12. 3 years later
    … we know we’ve got issues before customers do (usually)

  13. Apps: “I’m healthy! I’m healthy!”
    public HealthCheckResult Execute() {
        var result = new HealthCheckResult(Name);
        try {
            var customThing = Run(result);
            EnrichResultWith(result, customThing);
        } catch (Exception exception) {
            result = ResultFromException(exception);
            _logger.Error(() => new { Log = "HealthCheck Error", Name, Error = exception.GetBaseException() }.ToJson());
        }
        return result;
    }
    … and an alert for when they say “HELP ME!”

  14. Apps: “I’m healthy! I’m healthy!”
    public class LoadFromDynamoDbHealthCheck : HealthCheckBase {
        private readonly IDataAccess _dataAccess;

        public LoadFromDynamoDbHealthCheck(IDataAccess dataAccess, Logger logger) : base(logger) {
            Name = "LoadFromDynamoDbHealthCheck";
            _dataAccess = dataAccess;
        }

        protected override bool Run(HealthCheckResult result) {
            _dataAccess.Load("-2");
            return true;
        }
    }
    … and an alert for when they say “HELP ME!”
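
The two health-check slides show only fragments, so here is a minimal self-contained sketch of the same pattern: a base class catches anything a check throws and turns it into a failed result. The slides do not show `HealthCheckResult` or `HealthCheckBase`, so the versions below are assumptions written for illustration, and `DelegateHealthCheck` and `HealthChecks.RunAll` are hypothetical helpers rather than part of the JUST EAT code.

```csharp
using System;
using System.Collections.Generic;

// Assumed shape of a result; the real HealthCheckResult is not shown in the slides.
public sealed class HealthCheckResult
{
    public HealthCheckResult(string name) { Name = name; }
    public string Name { get; }
    public bool Passed { get; set; }
    public string Error { get; set; }
}

// Assumed base class: runs the check and converts any exception into a failure.
public abstract class HealthCheckBase
{
    public string Name { get; protected set; }

    public HealthCheckResult Execute()
    {
        var result = new HealthCheckResult(Name);
        try
        {
            result.Passed = Run(result);
        }
        catch (Exception exception)
        {
            // A throwing dependency means "unhealthy", never a crashed endpoint.
            result.Passed = false;
            result.Error = exception.GetBaseException().Message;
        }
        return result;
    }

    protected abstract bool Run(HealthCheckResult result);
}

// Hypothetical check that probes a dependency through a delegate.
public sealed class DelegateHealthCheck : HealthCheckBase
{
    private readonly Action _probe;

    public DelegateHealthCheck(string name, Action probe)
    {
        Name = name;
        _probe = probe;
    }

    protected override bool Run(HealthCheckResult result)
    {
        _probe(); // throws if the dependency is unreachable
        return true;
    }
}

public static class HealthChecks
{
    // Executes every check and collects the results, e.g. for a status endpoint.
    public static List<HealthCheckResult> RunAll(IEnumerable<HealthCheckBase> checks)
    {
        var results = new List<HealthCheckResult>();
        foreach (var check in checks)
        {
            results.Add(check.Execute());
        }
        return results;
    }
}
```

An alert can then fire whenever any result has `Passed == false`, which is the "HELP ME!" case from the slide.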

  15. Apps: “I’m healthy! I’m healthy!”
    Not just for production - helps 1st checkout
    … and an alert for when they say “HELP ME!”

  16. Moving forwards

  17. Show us the code!

  18. Publish a metric
    Counter: uk.payments.attempts:1|c
    Timer: uk.payments.attempts:34|ms
    Gauge: uk.payments.cpu:47|g
    Then, roughly:
    var client = new UdpClient(_hostNameOrAddress, _port)
        { Client = { SendBufferSize = 0 } };
    client.Client.SendPacketsAsync(data);
    If you can write a string to a UDP socket...
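
To make the "write a string to a UDP socket" point concrete, the following is a rough sketch of a client that builds the three metric lines shown on this slide and sends them with `UdpClient.Send`. It is not the JUST EAT client: the class name `StatsDSketch` and its methods are hypothetical, and it uses a plain blocking send rather than the `SendPacketsAsync` call from the slide.

```csharp
using System;
using System.Globalization;
using System.Net.Sockets;
using System.Text;

// Hypothetical helper for illustration; names are not from the talk.
public sealed class StatsDSketch : IDisposable
{
    private readonly UdpClient _client;

    public StatsDSketch(string hostNameOrAddress, int port)
    {
        // Supplying host and port sets the default remote endpoint used by Send().
        _client = new UdpClient(hostNameOrAddress, port);
    }

    // Wire format is "<bucket>:<value>|<type>", e.g. "uk.payments.attempts:1|c".
    public static string Format(string bucket, long value, string type)
    {
        return string.Format(CultureInfo.InvariantCulture, "{0}:{1}|{2}", bucket, value, type);
    }

    public void Increment(string bucket)
    {
        Send(Format(bucket, 1, "c"));
    }

    public void Timing(string bucket, long milliseconds)
    {
        Send(Format(bucket, milliseconds, "ms"));
    }

    public void Gauge(string bucket, long value)
    {
        Send(Format(bucket, value, "g"));
    }

    private void Send(string line)
    {
        byte[] payload = Encoding.UTF8.GetBytes(line);
        try
        {
            _client.Send(payload, payload.Length);
        }
        catch (SocketException)
        {
            // Fire and forget: losing a metric should never break the application.
        }
    }

    public void Dispose()
    {
        _client.Close();
    }
}
```

Usage would look something like `using (var statsd = new StatsDSketch("localhost", 8125)) { statsd.Increment("uk.payments.attempts"); }`, assuming a StatsD daemon listening on its conventional port 8125. Because UDP is connectionless, the application does not block or fail if the daemon is down; the metric is simply lost.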

  19. Being on-call
    “they mostly come at night... mostly”

  20. How to be on call
    1. Get an alert
    2. Log on & look at alert -> charts -> dashboards -> logs
    3. Establish IMPACT of the problem
    4. Provide options to mitigate
        a. Turn off a feature?
        b. DO NOTHING -> risk of change outweighs reward?
    5. Take action (which might be “escalate higher for help”)
    6. Repeat
    7. Do root cause analysis AFTER the issue has been resolved

  21. Behold!
    https://www.flickr.com/photos/coop666/5377946715/

  22. Being On-Call
    https://github.com/scobal/seyren

  23. Being On-Call
    https://www.pagerduty.com

  25. So much data...
    http://reinebrand.com/sad-beholder/

  26. Alert Design
    https://registry.hub.docker.com/u/blacktop/elk/
    Make it easy to find the source

  27. Dashboard Design
    Highlight the most significant data

  28. Dashboard Design
    Make it easy to find the source

  29. ELK to the rescue
    https://www.elastic.co/blog/kibana-whats-cooking

  30. A wild CPU appears
    https://www.elastic.co/blog/kibana-whats-cooking

  31. Chris used Cloud!

  32. It’s super effective!
    https://www.elastic.co/blog/kibana-whats-cooking

  33. A job well done!

  34. Questions?

  35. Yes, we’re recruiting too.
    http://tech.just-eat.com/jobs