Slide 1
Slide 1 text
Microservices Monitoring at mercari. Monitoring Seminar in mercari, Nov 29, 2017
Slide 2
Slide 2 text
@spesnova, SRE at mercari
Slide 3
Slide 3 text
How to monitor Microservices?
Slide 4
Slide 4 text
but first,
Slide 5
Slide 5 text
Why Microservices?
Slide 6
Slide 6 text
We should never forget the purpose
Slide 7
Slide 7 text
mercari is facing a “scalability” problem
Slide 8
Slide 8 text
100+ engineers
Slide 9
Slide 9 text
Developers have to coordinate a lot of things
Slide 10
Slide 10 text
Code dependency, other dev teams, deploy schedule, QAs, SREs, …etc
Slide 11
Slide 11 text
coordination is important, but…
Slide 12
Slide 12 text
I can’t say this is “as fast as possible”
Slide 13
Slide 13 text
How do we go “as fast as possible”?
Slide 14
Slide 14 text
loosely coupled & bounded context
Slide 15
Slide 15 text
= Microservices
Slide 16
Slide 16 text
Key concepts
Slide 17
Slide 17 text
System and organization redesign, self-service, standardization, automation
Slide 18
Slide 18 text
Key technology
Slide 19
Slide 19 text
Kubernetes
Slide 20
Slide 20 text
“As fast as possible” in the monitoring area
Slide 21
Slide 21 text
monitoring area
Slide 22
Slide 22 text
1. Collecting 2. Alerting 3. Investigating
Slide 23
Slide 23 text
Make these things as fast as possible
Slide 24
Slide 24 text
with: Datadog, GCP Stackdriver, PagerDuty, Sentry, New Relic
Slide 25
Slide 25 text
1. Collecting
Slide 26
Slide 26 text
monolith vs microservice
Slide 27
Slide 27 text
In the monolith world, Dev asks Ops to configure metrics collection
Slide 28
Slide 28 text
This doesn’t scale in the microservices world
Slide 29
Slide 29 text
In the microservices world, Devs configure the agent to collect metrics themselves
Slide 30
Slide 30 text
Devs put monitoring configuration in the pod manifest instead of configuring the agent directly
Slide 31
Slide 31 text
Datadog discovers monitoring configurations in Kubernetes manifest (annotations)
  annotations:
    service-discovery.datadoghq.com/apache.check_names: '["apache","http_check"]'
    service-discovery.datadoghq.com/apache.init_configs: '[{},{}]'
    service-discovery.datadoghq.com/apache.instances: '[{"apache_status_url": "http://%%host%%/server-status?auto"},{"name": "My service", "url": "http://%%host%%", "timeout": 1}]'
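To make the annotation format above concrete, here is a minimal Python sketch of how an agent could turn those three annotations into runnable checks. It is an illustration only, not Datadog's actual implementation: `resolve_checks` and the pod IP `10.0.0.12` are hypothetical, and the `timeout` key is quoted so the value parses as strict JSON.

```python
import json

# Annotation prefix and %%host%% template variable as shown on the slide.
PREFIX = "service-discovery.datadoghq.com/"


def resolve_checks(annotations, container, host):
    """Hypothetical helper: pair check names, init configs and instances
    for one container, replacing %%host%% with the pod's address."""
    base = f"{PREFIX}{container}."
    names = json.loads(annotations[base + "check_names"])
    inits = json.loads(annotations[base + "init_configs"])
    # Substitute the template variable before parsing the JSON list.
    instances = json.loads(annotations[base + "instances"].replace("%%host%%", host))
    if not (len(names) == len(inits) == len(instances)):
        raise ValueError("check_names, init_configs and instances must have equal length")
    return [
        {"check": name, "init_config": init, "instance": instance}
        for name, init, instance in zip(names, inits, instances)
    ]


annotations = {
    PREFIX + "apache.check_names": '["apache","http_check"]',
    PREFIX + "apache.init_configs": "[{},{}]",
    PREFIX + "apache.instances": (
        '[{"apache_status_url": "http://%%host%%/server-status?auto"},'
        '{"name": "My service", "url": "http://%%host%%", "timeout": 1}]'
    ),
}

# 10.0.0.12 is a made-up pod IP used only for this example.
checks = resolve_checks(annotations, "apache", "10.0.0.12")
for check in checks:
    print(check["check"], check["instance"])
```

The three lists are positional: the first check name goes with the first init config and the first instance, which is why the sketch rejects lists of different lengths.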
Slide 32
Slide 32 text
Datadog discovers monitoring configurations in Kubernetes manifest (annotations)
Slide 33
Slide 33 text
Devs don’t need to coordinate with SRE
Slide 34
Slide 34 text
Furthermore, SRE runs the monitoring agent on every node
Slide 35
Slide 35 text
Basic metrics such as CPU and memory are collected by Datadog automatically
Slide 36
Slide 36 text
Devs don’t need to collect basic metrics themselves
Slide 37
Slide 37 text
2. Alerting
Slide 38
Slide 38 text
In the monolith world, Ops is on call
Slide 39
Slide 39 text
In the microservices world, Dev is also on call
Slide 40
Slide 40 text
Alert accuracy is important
Slide 41
Slide 41 text
Alert on work metrics
Slide 42
Slide 42 text
Work metrics & Resource metrics & Events (+logs) (image source: Datadog blog)
Slide 43
Slide 43 text
Alert on work metrics. Bad: alert on CPU usage. Good: alert on server latency.
Slide 44
Slide 44 text
Alert on work metrics
Slide 45
Slide 45 text
You can say high latency is a problem. But you can’t say high CPU is a problem.
Slide 46
Slide 46 text
If you know high CPU usage is a problem, keep it low by using auto-scaling.
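As a rough sketch of what alerting on a work metric can look like, the Python below pages only when a latency percentile crosses a threshold and ignores CPU entirely. The function names, the 500 ms threshold and the sample data are assumptions made for illustration, not values from the talk or from any monitoring product.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def should_page(latencies_ms, threshold_ms=500, pct=95):
    """Page only when the work metric (request latency) breaches the threshold.

    CPU usage is deliberately not an input: it is a resource metric,
    useful for investigation but not a reliable symptom on its own.
    """
    if not latencies_ms:
        return False
    return percentile(latencies_ms, pct) > threshold_ms


# Made-up latency samples (milliseconds) for one evaluation window.
healthy = [120, 130, 110, 140, 150, 125, 135, 145, 115, 160]
degraded = [120, 130, 900, 950, 980, 125, 135, 1200, 115, 160]

print(should_page(healthy))
print(should_page(degraded))
```

With these samples the healthy window does not page and the degraded one does, regardless of how busy the CPUs were in either case.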
Slide 47
Slide 47 text
PagerDuty service / team per microservice
Slide 48
Slide 48 text
Boilerplate for microservice
Slide 49
Slide 49 text
3. Investigating
Slide 50
Slide 50 text
In the monolith world, Ops looks at the dashboard and investigates
Slide 51
Slide 51 text
In the microservices world, Dev looks at the dashboard and investigates
Slide 52
Slide 52 text
At least 1 dashboard per microservice
Slide 53
Slide 53 text
Devs need to fix problems themselves, so SRE has to give them enough visibility
Slide 54
Slide 54 text
Dev can see almost everything: Logs (image source: Datadog blog)
Slide 55
Slide 55 text
Dev can see almost everything: Events
Slide 56
Slide 56 text
Dev can see almost everything: Errors (image source: Datadog blog)
Slide 57
Slide 57 text
Dev can see almost everything: Tracing and Profiling
Slide 58
Slide 58 text
Dev can see almost everything: Slow query
Slide 59
Slide 59 text
Multidimensional metrics (image source: Datadog blog)
Slide 60
Slide 60 text
Dev and SRE can see metrics in any dimension (image source: Datadog blog)
Slide 61
Slide 61 text
Dev can see metrics only in their context: “avg:docker.cpu.usage{kube_namespace:foo}”
Slide 62
Slide 62 text
SRE can see metrics across dev teams: “avg:docker.cpu.usage by {kube_namespace}”
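The two views above (one team's namespace versus all namespaces side by side) come from the same tagged data points. The Python sketch below shows the idea with plain dictionaries. The sample values and the helpers `avg_filtered` and `avg_by` are hypothetical, and the metric and tag names (`docker.cpu.usage`, `kube_namespace`) are used only as labels here; this is not how Datadog evaluates queries internally.

```python
from collections import defaultdict
from statistics import mean

# Made-up data points: one metric, several tag dimensions.
points = [
    {"metric": "docker.cpu.usage", "value": 40.0, "tags": {"kube_namespace": "foo", "host": "node-1"}},
    {"metric": "docker.cpu.usage", "value": 60.0, "tags": {"kube_namespace": "foo", "host": "node-2"}},
    {"metric": "docker.cpu.usage", "value": 10.0, "tags": {"kube_namespace": "bar", "host": "node-1"}},
    {"metric": "docker.cpu.usage", "value": 30.0, "tags": {"kube_namespace": "bar", "host": "node-2"}},
]


def avg_filtered(points, metric, **tag_filter):
    """Dev view: average over points whose tags match every filter."""
    selected = [
        p["value"]
        for p in points
        if p["metric"] == metric
        and all(p["tags"].get(key) == value for key, value in tag_filter.items())
    ]
    return mean(selected)  # raises StatisticsError if nothing matches


def avg_by(points, metric, tag):
    """SRE view: average per distinct value of one tag."""
    groups = defaultdict(list)
    for p in points:
        if p["metric"] == metric and tag in p["tags"]:
            groups[p["tags"][tag]].append(p["value"])
    return {key: mean(values) for key, values in groups.items()}


dev_view = avg_filtered(points, "docker.cpu.usage", kube_namespace="foo")
sre_view = avg_by(points, "docker.cpu.usage", "kube_namespace")
host_view = avg_by(points, "docker.cpu.usage", "host")
print(dev_view, sre_view, host_view)
```

Because the dimension is just a tag, slicing by `host` instead of `kube_namespace` needs no new metric, only a different argument.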
Slide 63
Slide 63 text
Include everything in one dashboard
Slide 64
Slide 64 text
Frontend (CDN, Synthetic, Browser); Backend (Trace, Profile, Error); Infrastructure (LB, Server, DB…); Events (Deploy, Auto-Scale, SaaS/IaaS); Logs (Frontend ~ Infrastructure); Business metrics (KGI, KPI)
Slide 65
Slide 65 text
Give devs not only visibility, but also a compass in the monitoring area
Slide 66
Slide 66 text
Monitoring Framework (image source: Datadog blog)
Slide 67
Slide 67 text
Future Plans
Slide 68
Slide 68 text
SLO + Error Budget; Failure Friday (On-Call training); Monitoring Guide (documentation); Processes monitoring (kubelet etc); Topology Map; End-to-End error / log tracking; Internal status page
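For the first item in the list above, the arithmetic behind an error budget is short enough to sketch in Python. The 99.9% SLO, the request counts and the `error_budget` helper are assumptions chosen for illustration; the talk does not state any targets.

```python
def error_budget(slo, total_requests, failed_requests):
    """Return how many failures the SLO allows and how much budget is left.

    slo is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    allowed = (1 - slo) * total_requests
    return {
        "allowed_failures": allowed,
        "remaining": allowed - failed_requests,
        "consumed_ratio": failed_requests / allowed if allowed else float("inf"),
    }


# Hypothetical month: 1,000,000 requests, 250 of them failed.
budget = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=250)
print(budget)
```

In this made-up month the service may fail roughly 1,000 requests and has used about a quarter of that budget.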
Slide 69
Slide 69 text
Recap
Slide 70
Slide 70 text
loosely coupled & bounded context
Slide 71
Slide 71 text
These principles are also important in the monitoring area
Slide 72
Slide 72 text
A monitoring framework instead of person-dependent skills
Slide 73
Slide 73 text
Make everyone able to monitor
Slide 74
Slide 74 text
Thanks