Microservices Monitoring at mercari

Monitoring Seminar in mercari, Nov 29, 2017

@spesnova SRE at mercari

How to monitor Microservices?

but first,

Why Microservices?

We should never forget the purpose

mercari is facing a “scalability” problem

100+ engineers

Developers have to coordinate a lot of things

Code dependency
Other dev teams
Deploy schedule
QAs
SREs
…etc

coordination is important, but…

I can’t say this is “fast as possible”

How to go as “fast as possible”?

loosely coupled & bounded context

= Microservices

Key concepts

System and Organization redesign
Self service
Standardization
Automation

Key technology

Kubernetes

“fast as possible” in the monitoring area

monitoring area

1. Collecting
2. Alerting
3. Investigating

Make these things as fast as possible

with: Datadog GCP StackDriver PagerDuty Sentry NewRelic

1. Collecting

monolith vs microservice

In the monolith world, Dev asks Ops to configure metrics collection

This doesn’t scale in the microservices world

In the microservices world, Devs configure the agent to collect metrics themselves

Devs put monitoring configuration in the pod manifest, instead of configuring the agent
directly

Datadog discovers monitoring configurations in Kubernetes manifest (annotations)

annotations:
  service-discovery.datadoghq.com/apache.check_names: '["apache","http_check"]'
  service-discovery.datadoghq.com/apache.init_configs: '[{},{}]'
  service-discovery.datadoghq.com/apache.instances: '[{"apache_status_url": "http://%%host%%/server-status?auto"},{"name": "My service", "url": "http://%%host%%", "timeout": 1}]'
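A minimal sketch of how a Dev might generate these annotations instead of hand-writing the JSON strings. The helper name datadog_annotations is hypothetical; only the annotation keys and values come from the slide above.

```python
import json


def datadog_annotations(container, check_names, init_configs, instances):
    """Build service-discovery annotations for one container.

    Each annotation value is a JSON-encoded string, as in the slide.
    (Hypothetical helper, not part of any Datadog library.)
    """
    prefix = f"service-discovery.datadoghq.com/{container}"
    return {
        f"{prefix}.check_names": json.dumps(check_names),
        f"{prefix}.init_configs": json.dumps(init_configs),
        f"{prefix}.instances": json.dumps(instances),
    }


annotations = datadog_annotations(
    "apache",
    ["apache", "http_check"],
    [{}, {}],
    [
        {"apache_status_url": "http://%%host%%/server-status?auto"},
        {"name": "My service", "url": "http://%%host%%", "timeout": 1},
    ],
)

# Print the key/value pairs that would go under metadata.annotations.
for key, value in annotations.items():
    print(f"{key}: '{value}'")
```

Generating the values with json.dumps avoids quoting mistakes in the embedded JSON, which are easy to make by hand.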


Devs don’t need to coordinate with SRE

Furthermore, SRE runs a monitoring agent on every node

Basic metrics such as CPU and memory are collected by Datadog
automatically

Devs don’t need to collect basic metrics themselves

2. Alerting

In the monolith world, Ops is On-Call

In the microservices world, Dev is also On-Call

Alert accuracy is important

Alert on work metrics

Work metrics & Resource metrics & Events (+logs) [figure from the Datadog blog]

Alert on work metrics
NG - Alert on CPU usage
OK - Alert on server latency

Alert on work metrics

You can say high latency is a problem. But you can’t
say high CPU usage is a problem.

If you know high CPU usage is a problem, keep it
low by using auto-scaling.
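A rough sketch of the idea, assuming a p95 latency check with a 500 ms threshold. The function names, threshold, and sample values are illustrative and not from the talk. The page decision looks only at the work metric (latency), never at CPU.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_page(latencies_ms, threshold_ms=500, pct=95):
    """Page only when the work metric (latency) breaches its threshold."""
    return percentile(latencies_ms, pct) > threshold_ms


# Illustrative samples: request latencies in milliseconds.
healthy = [120, 130, 110, 140, 150, 125, 135, 145, 115, 160]
degraded = [120, 130, 900, 950, 980, 125, 990, 945, 915, 960]

print(should_page(healthy))   # False: p95 is 160 ms
print(should_page(degraded))  # True: p95 is 990 ms
```

High CPU with healthy latency would not page anyone here; CPU stays a resource metric for investigation and auto-scaling.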

PagerDuty service / team per microservice

Boilerplate for microservice

3. Investigating

In the monolith world, Ops looks at the dashboard and investigates

In the microservices world, Dev looks at the dashboard and investigates

At least 1 dashboard per microservice

Devs need to fix problems themselves, so SRE has to give
them enough visibility

Dev can see almost everything: Logs [figure from the Datadog blog]

Dev can see almost everything: Events

Dev can see almost everything: Errors [figure from the Datadog blog]

Dev can see almost everything: Tracing and Profiling

Dev can see almost everything: Slow query

Multidimensional metrics [figure from the Datadog blog]

Dev and SRE can see metrics in any dimension [figure from the Datadog blog]

Dev can see metrics only in their context “avg:docker.cpu.usage{kube_namespace:foo}”

SRE can see metrics across dev teams “avg:docker.cpu.usage by {kube_namespace}”
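To illustrate the two views, here is a small sketch over made-up tagged data points. It only mimics filtering by a tag and grouping by a tag in plain Python; it is not the Datadog API, and the sample values and function names are assumptions.

```python
from collections import defaultdict

# Hypothetical CPU usage points (percent) with their tags.
points = [
    (40.0, {"kube_namespace": "foo", "pod": "foo-1"}),
    (60.0, {"kube_namespace": "foo", "pod": "foo-2"}),
    (20.0, {"kube_namespace": "bar", "pod": "bar-1"}),
]


def avg_filtered(points, **tags):
    """Dev view: average over points whose tags match every given tag."""
    values = [
        value for value, point_tags in points
        if all(point_tags.get(k) == v for k, v in tags.items())
    ]
    return sum(values) / len(values)


def avg_by(points, key):
    """SRE view: average per distinct value of one tag."""
    groups = defaultdict(list)
    for value, point_tags in points:
        groups[point_tags[key]].append(value)
    return {tag: sum(vals) / len(vals) for tag, vals in groups.items()}


print(avg_filtered(points, kube_namespace="foo"))  # 50.0
print(avg_by(points, "kube_namespace"))            # {'foo': 50.0, 'bar': 20.0}
```

The same tagged data serves both audiences; only the query changes.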

Include everything in one dashboard

Frontend (CDN, Synthetic, Browser)
Backend (Trace, Profile, Error)
Infrastructure (LB, Server, DB…)
Events (Deploy, Auto-Scale, SaaS/IaaS)
Logs (Frontend ~ Infrastructure)
Business metrics (KGI, KPI)

Give Devs not only visibility, but also a compass in
the monitoring area

Monitoring Framework [figure from the Datadog blog]

Future Plans

SLO + Error Budget
Failure Friday (On-Call training)
Monitoring Guide (documentation)
Processes monitoring (kubelet etc)
Topology Map
End-to-End error / log tracking
Internal status page
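For the SLO + Error Budget item, a minimal arithmetic sketch, assuming a request-based availability SLO. The 99.9% target and the request counts are made-up numbers, not mercari's.

```python
def error_budget(slo, total_requests, failed_requests):
    """Return (allowed failures, remaining budget) for a request-based SLO.

    slo is a fraction, e.g. 0.999 for 99.9% of requests succeeding.
    """
    allowed = (1 - slo) * total_requests
    remaining = allowed - failed_requests
    return allowed, remaining


# Illustrative numbers: 99.9% SLO, 1,000,000 requests, 400 failures.
allowed, remaining = error_budget(0.999, 1_000_000, 400)
print(f"allowed failures: {allowed:.0f}")
print(f"remaining budget: {remaining:.0f}")
```

A positive remaining budget means the service is still within its SLO for the period; a negative one means the budget is spent.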

loosely coupled & bounded context

These principles are also important in the monitoring area

A monitoring framework instead of dependence on individual skills

Enable everyone to monitor

Thanks
