A talk about how mercari is adopting microservices and how it monitors them.
Monitoring Seminar in mercari https://mackerelio.connpass.com/event/71256/
Microservices Monitoring at mercari
Monitoring Seminar in mercari, Nov 29, 2017
@spesnova, SRE at mercari
How to monitor Microservices?
but first,
Why Microservices?
We should never forget the purpose
mercari is facing a “scalability” problem
100+ engineers
Developers have to coordinate a lot of things
Code dependency, other dev teams, deploy schedule, QAs, SREs… etc.
coordination is important, but…
I can’t say this is “fast as possible”
How to go as “fast as possible”?
loosely coupled & bounded context
= Microservices
Key concepts
System and organization redesign, self service, standardization, automation
Key technology
Kubernetes
“Fast as possible” in the monitoring area
monitoring area
1. Collecting
2. Alerting
3. Investigating
Make these things as fast as possible
with: Datadog, GCP Stackdriver, PagerDuty, Sentry, New Relic
1. Collecting
monolith vs microservice
In the monolith world, Dev asks Ops to configure metrics collection
This doesn’t scale in the microservices world
In the microservices world, Dev configures the agent to collect metrics themselves
Dev puts monitoring configurations in the pod manifest, instead of configuring the agent directly
Datadog discovers monitoring configurations in the Kubernetes manifest (annotations)
annotations:
  service-discovery.datadoghq.com/apache.check_names: '["apache","http_check"]'
  service-discovery.datadoghq.com/apache.init_configs: '[{},{}]'
  service-discovery.datadoghq.com/apache.instances: '[{"apache_status_url":"http://%%host%%/server-status?auto"},{"name": "My service", "url": "http://%%host%%", "timeout": 1}]'
Dev doesn’t need to coordinate with SRE
Furthermore, SRE runs a monitoring agent on every node
Basic metrics such as CPU and memory are collected by Datadog automatically
Dev doesn’t need to collect basic metrics themselves
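As a rough illustration of the annotation-based discovery described above, here is a minimal Python sketch that builds the three annotations for one container. The helper name `autodiscovery_annotations` and its tuple layout are hypothetical, made up for this example; only the annotation keys and the Apache example values come from the slide. Using `json.dumps` keeps every annotation value valid JSON.

```python
import json


def autodiscovery_annotations(container_name, checks):
    """Build service-discovery annotations for one container.

    `checks` is a list of (check_name, init_config, instance) tuples.
    This helper is hypothetical; it only mirrors the annotation layout
    shown on the slide.
    """
    prefix = f"service-discovery.datadoghq.com/{container_name}"
    names = [name for name, _, _ in checks]
    init_configs = [init for _, init, _ in checks]
    instances = [instance for _, _, instance in checks]
    return {
        f"{prefix}.check_names": json.dumps(names),
        f"{prefix}.init_configs": json.dumps(init_configs),
        f"{prefix}.instances": json.dumps(instances),
    }


annotations = autodiscovery_annotations(
    "apache",
    [
        ("apache", {}, {"apache_status_url": "http://%%host%%/server-status?auto"}),
        ("http_check", {}, {"name": "My service", "url": "http://%%host%%", "timeout": 1}),
    ],
)

# Print the annotations in the same shape as a pod manifest fragment.
for key, value in annotations.items():
    print(f"{key}: '{value}'")
```

The three lists must stay aligned by position: the n-th check name goes with the n-th init config and the n-th instance, which is why the sketch takes them as tuples.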
2. Alerting
In the monolith world, Ops is on call
In the microservices world, Dev is also on call
Alert accuracy is important
Alert on work metrics
Work metrics & Resource metrics & Events (+logs) (figure from the Datadog blog)
Alert on work metrics: NG - alert on CPU usage; OK - alert on server latency
You can say high latency is a problem. But you can’t say high CPU is a problem.
If you know high CPU usage is a problem, keep it low by using auto-scaling.
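To make the "alert on work metrics" idea concrete, here is a small Python sketch that pages on request latency rather than CPU usage. The function names, the 95th percentile, and the 500 ms threshold are all assumptions chosen for illustration; the talk does not state which percentile or threshold mercari uses.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def should_page(latencies_ms, threshold_ms=500, pct=95):
    """Page only when the latency percentile (a work metric) is too high.

    CPU usage is deliberately not an input: high CPU alone does not
    mean users are affected.
    """
    if not latencies_ms:
        return False
    return percentile(latencies_ms, pct) > threshold_ms


# Hypothetical samples: 10% of requests are slow, so p95 is slow.
degraded = [120] * 90 + [900] * 10
# Hypothetical samples: a single slow request does not move p95.
healthy = [120] * 99 + [900]

print(should_page(degraded))
print(should_page(healthy))
```

A resource metric such as CPU would instead feed an auto-scaling rule, as the slide above suggests, rather than a page.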
PagerDuty service / team per microservice
Boilerplate for microservice
3. Investigating
In the monolith world, Ops looks at the dashboard and investigates
In the microservices world, Dev looks at the dashboard and investigates
At least 1 dashboard per microservice
Dev needs to fix problems themselves, so SRE has to give them enough visibility
Dev can see almost everything: Logs
Dev can see almost everything: Events
Dev can see almost everything: Errors
Dev can see almost everything: Tracing and Profiling
Dev can see almost everything: Slow query
Multidimensional metrics
Dev and SRE can see metrics in any dimension
Dev can see metrics only in their context: avg:docker.cpu.usage{kube_namespace:foo}
SRE can see metrics across dev teams: avg:docker.cpu.usage by {kube_namespace}
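The difference between the two views above, filtering by one tag value versus grouping by a tag, can be sketched in plain Python. The sample values, the `pod` tag, and the helper names `avg_filtered` and `avg_by` are hypothetical; this only imitates what a tag-aware metrics backend does and is not Datadog's implementation.

```python
from collections import defaultdict

# Hypothetical CPU usage samples, each carrying its tags.
samples = [
    (0.40, {"kube_namespace": "foo", "pod": "foo-1"}),
    (0.60, {"kube_namespace": "foo", "pod": "foo-2"}),
    (0.20, {"kube_namespace": "bar", "pod": "bar-1"}),
]


def avg_filtered(samples, **tags):
    """Dev view: average over samples whose tags match every given value."""
    values = [
        value
        for value, sample_tags in samples
        if all(sample_tags.get(key) == want for key, want in tags.items())
    ]
    return sum(values) / len(values) if values else None


def avg_by(samples, tag):
    """SRE view: one average per distinct value of `tag`."""
    groups = defaultdict(list)
    for value, sample_tags in samples:
        groups[sample_tags.get(tag)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


print(avg_filtered(samples, kube_namespace="foo"))  # one team's namespace only
print(avg_by(samples, "kube_namespace"))            # every namespace side by side
```

Because the same tagged samples serve both views, a dev team and SRE can share one metric and simply query it along different dimensions.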
Include everything in one dashboard
Frontend (CDN, Synthetic, Browser)
Backend (Trace, Profile, Error)
Infrastructure (LB, Server, DB…)
Events (Deploy, Auto-Scale, SaaS/IaaS)
Logs (Frontend ~ Infrastructure)
Business metrics (KGI, KPI)
Give Dev not only visibility, but also a compass in the monitoring area
Monitoring Framework
Future Plans
SLO + Error Budget
Failure Friday (On-Call training)
Monitoring Guide (documentation)
Processes monitoring (kubelet etc.)
Topology Map
End-to-End error / log tracking
Internal status page
Recap
These principles are also important in the monitoring area
A monitoring framework instead of dependence on individual skills
Make it possible for everyone to do monitoring
Thanks