Slide 1

Slide 1 text

Knowing what went bump in Production - Modern monitoring in .NET @petemounce @chrisannodell & @justeat_tech

Slide 2

Slide 2 text

In the cloud no-one can hear you YSOD

Slide 3

Slide 3 text

(Custom) performance counters

Slide 4

Slide 4 text

Logs were everywhere and nowhere

Slide 5

Slide 5 text

Devs had no insight into production

Slide 6

Slide 6 text

Hackathon all the things!

Slide 7

Slide 7 text

Hacked it together in a morning (Monitoring, that is)
v1: .NET app -> UDP -> StatsD -> Graphite -> Pretty chart
v2: … -> Automated alert via seyren
v2: … + grafana for much prettier charts & dashboards

Slide 8

Slide 8 text

Spent next 4 months productionising (… but you wouldn’t have to)

Slide 9

Slide 9 text

ELK to the rescue
https://registry.hub.docker.com/u/blacktop/elk/
… and nxlog community edition for log-shipping

Slide 10

Slide 10 text

Hacked ELK together in a week
v1: nxlog CE -> logstash -> ElasticSearch -> Kibana
v2: as above, but stable :-)
Handling ~220 GB / day

Slide 12

Slide 12 text

3 years later … we know we’ve got issues before customers do (usually)

Slide 13

Slide 13 text

Apps: “I’m healthy! I’m healthy!”

    public HealthCheckResult Execute()
    {
        var result = new HealthCheckResult(Name);
        try
        {
            var customThing = Run(result);
            EnrichResultWith(result, customThing);
        }
        catch (Exception exception)
        {
            result = ResultFromException(exception);
            _logger.Error(() => new { Log = "HealthCheck Error", Name, Error = exception.GetBaseException() }.ToJson());
        }
        return result;
    }

… and an alert for when they say “HELP ME!”

Slide 14

Slide 14 text

Apps: “I’m healthy! I’m healthy!”

    public class LoadFromDynamoDbHealthCheck : HealthCheckBase
    {
        private readonly IDataAccess _dataAccess;

        public LoadFromDynamoDbHealthCheck(IDataAccess dataAccess, Logger logger)
            : base(logger)
        {
            Name = "LoadFromDynamoDbHealthCheck";
            _dataAccess = dataAccess;
        }

        // Exercises the data-access dependency with a single load.
        // If the load throws, the try/catch in Execute() turns it into a failed result.
        protected override bool Run(HealthCheckResult result)
        {
            _dataAccess.Load("-2");
            return true;
        }
    }

… and an alert for when they say “HELP ME!”
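The slides show Execute() and one concrete check, but not the types around them. As a rough sketch only, here is one way such a base class could be shaped. Everything in it is an assumption made for illustration: the Passed and Message properties, the Action<string> standing in for the Logger, and the AlwaysFailingCheck class are hypothetical, not the speakers' code.

```csharp
using System;

// Hypothetical result type; the real HealthCheckResult is not shown in the slides.
public sealed class HealthCheckResult
{
    public HealthCheckResult(string name) { Name = name; }
    public string Name { get; }
    public bool Passed { get; set; }
    public string Message { get; set; }
}

// Hypothetical base class that follows the shape of Execute() on the slide:
// run the check, and convert any exception into a failed result.
public abstract class HealthCheckBase
{
    private readonly Action<string> _log; // stand-in for the slide's Logger

    protected HealthCheckBase(Action<string> log) { _log = log; }

    public string Name { get; protected set; }

    public HealthCheckResult Execute()
    {
        var result = new HealthCheckResult(Name);
        try
        {
            result.Passed = Run(result);
        }
        catch (Exception exception)
        {
            // The exception does not escape; it is recorded on the result and logged.
            result = new HealthCheckResult(Name)
            {
                Passed = false,
                Message = exception.GetBaseException().Message
            };
            _log("HealthCheck Error: " + Name + ": " + result.Message);
        }
        return result;
    }

    protected abstract bool Run(HealthCheckResult result);
}

// Hypothetical check that always throws, to show the failure path.
public sealed class AlwaysFailingCheck : HealthCheckBase
{
    public AlwaysFailingCheck(Action<string> log) : base(log)
    {
        Name = "AlwaysFailingCheck";
    }

    protected override bool Run(HealthCheckResult result)
    {
        throw new InvalidOperationException("dependency unreachable");
    }
}
```

With this shape a concrete check only has to implement Run(); a thrown exception becomes a failed result that an endpoint or alert can report.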

Slide 15

Slide 15 text

Apps: “I’m healthy! I’m healthy!”
Not just for production - helps 1st checkout
… and an alert for when they say “HELP ME!”

Slide 16

Slide 16 text

Moving forwards

Slide 17

Slide 17 text

Show us the code!

Slide 18

Slide 18 text

Publish a metric
Counter: uk.payments.attempts:1|c
Timer: uk.payments.attempts:34|ms
Gauge: uk.payments.cpu:47|g

Then, roughly:

    var client = new UdpClient(_hostNameOrAddress, _port) { Client = { SendBufferSize = 0 } };
    client.Client.SendPacketsAsync(data);

If you can write a string to a UDP socket...
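To make the wire format above concrete, here is a minimal sketch of a client that builds the `bucket:value|type` strings and sends each one as a single UDP datagram. The class name StatsDSketch and its method names are hypothetical. It also uses the blocking UdpClient.Send rather than the SendPacketsAsync call on the slide, purely to keep the example short.

```csharp
using System;
using System.Globalization;
using System.Net.Sockets;
using System.Text;

// Hypothetical helper, not the speakers' client: formats StatsD lines and
// sends each one as a fire-and-forget UDP datagram.
public sealed class StatsDSketch : IDisposable
{
    private readonly UdpClient _client;

    public StatsDSketch(string hostNameOrAddress, int port)
    {
        // For UDP this only sets the default destination; there is no handshake.
        _client = new UdpClient(hostNameOrAddress, port);
    }

    // Wire format from the slide: <bucket>:<value>|<type>
    public static string Format(string bucket, long value, string type)
    {
        return string.Format(CultureInfo.InvariantCulture, "{0}:{1}|{2}", bucket, value, type);
    }

    public void Counter(string bucket, long delta = 1) => Send(Format(bucket, delta, "c"));
    public void Timer(string bucket, long milliseconds) => Send(Format(bucket, milliseconds, "ms"));
    public void Gauge(string bucket, long value) => Send(Format(bucket, value, "g"));

    private void Send(string line)
    {
        var data = Encoding.ASCII.GetBytes(line);
        try
        {
            _client.Send(data, data.Length);
        }
        catch (SocketException)
        {
            // Metrics are best-effort: a monitoring failure should not break the app.
        }
    }

    public void Dispose() => _client.Close();
}
```

Usage would look like `using (var stats = new StatsDSketch("127.0.0.1", 8125)) { stats.Counter("uk.payments.attempts"); }`. Port 8125 is the conventional StatsD default, and the host is an assumption. Swallowing SocketException is a deliberate choice here so that an unreachable metrics server never takes down a request.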

Slide 19

Slide 19 text

Being on-call
“they mostly come at night... mostly”

Slide 20

Slide 20 text

How to be on call
1. Get an alert
2. Log on & look at alert -> charts -> dashboards -> logs
3. Establish IMPACT of the problem
4. Provide options to mitigate
   a. Turn off a feature?
   b. DO NOTHING -> risk of change outweighs reward?
5. Take action (which might be “escalate higher for help”)
6. Repeat
7. Do root cause analysis AFTER the issue has been resolved

Slide 21

Slide 21 text

Behold!
https://www.flickr.com/photos/coop666/5377946715/

Slide 22

Slide 22 text

Being On-Call
https://github.com/scobal/seyren

Slide 23

Slide 23 text

Being On-Call
https://www.pagerduty.com

Slide 25

Slide 25 text

So much data...
http://reinebrand.com/sad-beholder/

Slide 26

Slide 26 text

Alert Design
Make it easy to find the source
https://registry.hub.docker.com/u/blacktop/elk/

Slide 27

Slide 27 text

Dashboard Design
Highlight the most significant data

Slide 28

Slide 28 text

Dashboard Design
Make it easy to find the source

Slide 29

Slide 29 text

ELK to the rescue
https://www.elastic.co/blog/kibana-whats-cooking

Slide 30

Slide 30 text

A wild CPU appears
https://www.elastic.co/blog/kibana-whats-cooking

Slide 31

Slide 31 text

Chris used Cloud!

Slide 32

Slide 32 text

It’s super effective!
https://www.elastic.co/blog/kibana-whats-cooking

Slide 33

Slide 33 text

A job well done!

Slide 34

Slide 34 text

Questions?

Slide 35

Slide 35 text

Yes, we’re recruiting too.
http://tech.just-eat.com/jobs