Microservices Monitoring at mercari

Slide 1

Slide 1 text

Microservices Monitoring at mercari Monitoring Seminar in mercari, Nov 29, 2017

Slide 2

Slide 2 text

@spesnova SRE at mercari

Slide 3

Slide 3 text

How to monitor Microservices?

Slide 4

Slide 4 text

but first,

Slide 5

Slide 5 text

Why Microservices?

Slide 6

Slide 6 text

We should never forget the purpose

Slide 7

Slide 7 text

mercari is facing a “scalability” problem

Slide 8

Slide 8 text

100+ engineers

Slide 9

Slide 9 text

Developers have to coordinate a lot of things

Slide 10

Slide 10 text

Code dependency, other dev teams, deploy schedule, QAs, SREs, etc.

Slide 11

Slide 11 text

coordination is important, but…

Slide 12

Slide 12 text

I can’t say this is as “fast as possible”

Slide 13

Slide 13 text

How to go as “fast as possible”?

Slide 14

Slide 14 text

loosely coupled & bounded context

Slide 15

Slide 15 text

= Microservices

Slide 16

Slide 16 text

Key concepts

Slide 17

Slide 17 text

System and Organization redesign, Self service, Standardization, Automation

Slide 18

Slide 18 text

Key technology

Slide 19

Slide 19 text

Kubernetes

Slide 20

Slide 20 text

“fast as possible” in the monitoring area

Slide 21

Slide 21 text

monitoring area

Slide 22

Slide 22 text

1. Collecting 2. Alerting 3. Investigating

Slide 23

Slide 23 text

Make these things as fast as possible

Slide 24

Slide 24 text

with: Datadog, GCP StackDriver, PagerDuty, Sentry, NewRelic

Slide 25

Slide 25 text

1. Collecting

Slide 26

Slide 26 text

monolith vs microservice

Slide 27

Slide 27 text

In the monolith world, Dev asks Ops to configure metrics collection

Slide 28

Slide 28 text

This doesn’t scale in the microservices world

Slide 29

Slide 29 text

In the microservices world, Dev configures the agent to collect metrics themselves

Slide 30

Slide 30 text

Dev puts monitoring configurations in the pod manifest, instead of configuring the agent directly

Slide 31

Slide 31 text

Datadog discovers monitoring configurations in the Kubernetes manifest (annotations):

    annotations:
      service-discovery.datadoghq.com/apache.check_names: '["apache","http_check"]'
      service-discovery.datadoghq.com/apache.init_configs: '[{},{}]'
      service-discovery.datadoghq.com/apache.instances: '[{"apache_status_url": "http://%%host%%/server-status?auto"},{"name": "My service", "url": "http://%%host%%", "timeout": 1}]'
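Each annotation value is a JSON document stored as a string, so a quoting slip is easy to make when writing it by hand. As a minimal sketch, the Python below generates the three values with `json.dumps` so they are always valid JSON. The helper name `datadog_annotations` is a hypothetical name chosen for illustration; only the annotation key prefix, the container name `apache`, and the check settings come from the slide.

```python
import json

# Key prefix taken from the annotations shown on the slide.
PREFIX = "service-discovery.datadoghq.com"


def datadog_annotations(container, checks):
    """Build the pod annotations for one container (hypothetical helper).

    `checks` is a list of (check_name, init_config, instance) tuples.
    """
    names = [name for name, _, _ in checks]
    init_configs = [init for _, init, _ in checks]
    instances = [instance for _, _, instance in checks]
    return {
        f"{PREFIX}/{container}.check_names": json.dumps(names),
        f"{PREFIX}/{container}.init_configs": json.dumps(init_configs),
        f"{PREFIX}/{container}.instances": json.dumps(instances),
    }


annotations = datadog_annotations(
    "apache",
    [
        ("apache", {}, {"apache_status_url": "http://%%host%%/server-status?auto"}),
        ("http_check", {}, {"name": "My service", "url": "http://%%host%%", "timeout": 1}),
    ],
)

# Print the key/value pairs that would go under `annotations:` in the manifest.
for key, value in annotations.items():
    print(f"{key}: {value}")
```

The three lists are kept in the same order, so the first check name pairs with the first init config and the first instance.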

Slide 32

Slide 32 text

Datadog discovers monitoring configurations in the Kubernetes manifest (annotations)

Slide 33

Slide 33 text

Dev doesn’t need to coordinate with SRE

Slide 34

Slide 34 text

Furthermore, SRE runs a monitoring agent on every node

Slide 35

Slide 35 text

Basic metrics such as CPU and memory are collected by Datadog automatically

Slide 36

Slide 36 text

Dev doesn’t need to collect basic metrics themselves

Slide 37

Slide 37 text

2. Alerting

Slide 38

Slide 38 text

In the monolith world, Ops is On-Call

Slide 39

Slide 39 text

In the microservices world, Dev is also On-Call

Slide 40

Slide 40 text

Alert accuracy is important

Slide 41

Slide 41 text

Alert on work metrics

Slide 42

Slide 42 text

Work metrics & Resource metrics & Events (+logs) (image source: Datadog blog)

Slide 43

Slide 43 text

Alert on work metrics: NG - Alert on CPU usage; OK - Alert on server latency

Slide 44

Slide 44 text

Alert on work metrics

Slide 45

Slide 45 text

You can say high latency is a problem, but you can’t say high CPU usage is a problem.
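The distinction can be sketched as an alert rule that looks only at a work metric. The Python below pages when a latency percentile crosses a threshold and ignores CPU usage entirely. The 95th percentile, the 500 ms threshold, and the sample values are assumptions made for illustration, not numbers from the talk.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def should_page(latencies_ms, threshold_ms=500, pct=95):
    """Page only when the work metric (request latency) breaches the threshold."""
    return percentile(latencies_ms, pct) > threshold_ms


# Hypothetical request latencies in milliseconds.
healthy = [120, 130, 110, 140, 150, 125, 135, 145, 115, 128]
degraded = [120, 130, 900, 950, 1100, 980, 1020, 140, 990, 1005]

print(should_page(healthy))   # False: users are not affected
print(should_page(degraded))  # True: users see slow responses
```

CPU usage would still be collected as a resource metric for investigation, but it would not wake anyone up on its own.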

Slide 46

Slide 46 text

If you know high CPU usage is a problem, keep it low by using auto-scaling.
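As a rough illustration of how auto-scaling keeps CPU usage near a target, the sketch below applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The slides do not say which autoscaler mercari uses, so treating it as the HPA is an assumption, and the real controller also applies a tolerance and min/max replica bounds that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct):
    """desired = ceil(current_replicas * current_metric / target_metric)."""
    return math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)


# Hypothetical numbers: 4 pods, target average CPU utilization of 60%.
print(desired_replicas(4, 90, 60))  # 6: scale out when CPU runs hot
print(desired_replicas(4, 30, 60))  # 2: scale in when CPU is idle
```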

Slide 47

Slide 47 text

PagerDuty service / team per microservice

Slide 48

Slide 48 text

Boilerplate for microservice

Slide 49

Slide 49 text

3. Investigating

Slide 50

Slide 50 text

In the monolith world, Ops looks at the dashboard and investigates

Slide 51

Slide 51 text

In the microservices world, Dev looks at the dashboard and investigates

Slide 52

Slide 52 text

At least 1 dashboard per microservice

Slide 53

Slide 53 text

Dev needs to fix problems themselves, so SRE has to give them enough visibility

Slide 54

Slide 54 text

Dev can see almost everything: Logs (image source: Datadog blog)

Slide 55

Slide 55 text

Dev can see almost everything: Events

Slide 56

Slide 56 text

Dev can see almost everything: Errors (image source: Datadog blog, Datadog-Sentry integration)

Slide 57

Slide 57 text

Dev can see almost everything: Tracing and Profiling

Slide 58

Slide 58 text

Dev can see almost everything: Slow query

Slide 59

Slide 59 text

Multidimensional metrics (image source: Datadog blog, the power of tagged metrics)

Slide 60

Slide 60 text

Dev and SRE can see metrics in any dimension (image source: Datadog blog, the power of tagged metrics)

Slide 61

Slide 61 text

Dev can see metrics only in their context: “avg:docker.cpu.usage{kube_namespace:foo}”

Slide 62

Slide 62 text

SRE can see metrics across dev teams: “avg:docker.cpu.usage by {kube_namespace}”
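The two views, one scoped to a single team and one grouped across teams, can be sketched over tagged data points. The Python below is an in-memory illustration, not Datadog's implementation: the sample values, the `host` tag, and the function names `avg_filtered` and `avg_by` are all assumptions. Only the metric name and the `kube_namespace` tag follow the slides.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (metric name, value, tags).
POINTS = [
    ("docker.cpu.usage", 40.0, {"kube_namespace": "foo", "host": "node-1"}),
    ("docker.cpu.usage", 60.0, {"kube_namespace": "foo", "host": "node-2"}),
    ("docker.cpu.usage", 20.0, {"kube_namespace": "bar", "host": "node-1"}),
]


def avg_filtered(points, metric, tag, value):
    """Dev view: average of one metric restricted to a single tag value."""
    return mean(v for name, v, tags in points if name == metric and tags.get(tag) == value)


def avg_by(points, metric, tag):
    """SRE view: one average per distinct value of the tag."""
    groups = defaultdict(list)
    for name, v, tags in points:
        if name == metric and tag in tags:
            groups[tags[tag]].append(v)
    return {key: mean(vals) for key, vals in groups.items()}


print(avg_filtered(POINTS, "docker.cpu.usage", "kube_namespace", "foo"))
print(avg_by(POINTS, "docker.cpu.usage", "kube_namespace"))
```

Because every point carries its tags, the same data serves both audiences. Only the filter or grouping changes.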

Slide 63

Slide 63 text

Include everything in one dashboard

Slide 64

Slide 64 text

Frontend (CDN, Synthetic, Browser); Backend (Trace, Profile, Error); Infrastructure (LB, Server, DB…); Events (Deploy, Auto-Scale, SaaS/IaaS); Logs (Frontend ~ Infrastructure); Business metrics (KGI, KPI)

Slide 65

Slide 65 text

Give Dev not only visibility, but also a compass in the monitoring area

Slide 66

Slide 66 text

Monitoring Framework (image source: Datadog blog)

Slide 67

Slide 67 text

Future Plans

Slide 68

Slide 68 text

SLO + Error Budget; Failure Friday (On-Call training); Monitoring Guide (documentation); Processes monitoring (kubelet etc.); Topology Map; End-to-End error / log tracking; Internal status page
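For the first item, the arithmetic behind an error budget can be sketched in a few lines. The talk gives no SLO targets or traffic figures, so the 99.9% target and the request counts below are purely hypothetical.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_failures) for one SLO window."""
    allowed = (1 - slo_target) * total_requests
    return allowed, allowed - failed_requests


# Hypothetical window: 99.9% availability SLO, 1,000,000 requests, 400 failures.
allowed, remaining = error_budget(0.999, 1_000_000, 400)
print(round(allowed), round(remaining))
```

A positive remainder means the service can still spend budget on risky changes. A negative one means reliability work should take priority.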