Microservices Monitoring at mercari

A talk about how mercari is adopting microservices and trying to monitor them.

Monitoring Seminar in mercari
https://mackerelio.connpass.com/event/71256/

Seigo Uchida

November 30, 2017

Transcript

  1. Microservices Monitoring at mercari. Monitoring Seminar in mercari, Nov 29, 2017

  2. @spesnova SRE at mercari

  3. How to monitor Microservices?

  4. but first,

  5. Why Microservices?

  6. We should never forget the purpose

  7. mercari is facing a “scalability” problem

  8. 100+ engineers

  9. Developers have to coordinate a lot of things

  10. Code dependency, other dev teams, deploy schedule, QAs, SREs, …etc

  11. coordination is important, but…

  12. I can’t say this is “fast as possible”

  13. How to go as “fast as possible”?

  14. loosely coupled & bounded context

  15. = Microservices

  16. Key concepts

  17. System and organization redesign, self service, standardization, automation

  18. Key technology

  19. Kubernetes

  20. “fast as possible” in monitoring area

  21. monitoring area

  22. 1. Collecting 2. Alerting 3. Investigating

  23. Make these things as fast as possible

  24. with: Datadog, GCP StackDriver, PagerDuty, Sentry, NewRelic

  25. 1. Collecting

  26. monolith vs microservice

  27. In the monolith world, Dev asks Ops to configure metrics collection

  28. This doesn’t scale in the microservices world

  29. In the microservices world, Devs configure the agent to collect metrics themselves

  30. Dev puts monitoring configurations in the pod manifest, instead of configuring the agent directly

  31. Datadog discovers monitoring configurations in Kubernetes manifest (annotations)

        annotations:
          service-discovery.datadoghq.com/apache.check_names: '["apache","http_check"]'
          service-discovery.datadoghq.com/apache.init_configs: '[{},{}]'
          service-discovery.datadoghq.com/apache.instances: '[{"apache_status_url": "http://%%host%%/server-status?auto"},{"name": "My service", "url": "http://%%host%%", "timeout": 1}]'

  32. Datadog discovers monitoring configurations in Kubernetes manifest (annotations)
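
How the three annotations on slide 31 fit together can be sketched in a few lines of Python. This is only an illustration of the idea (parallel JSON lists, one entry per check, with `%%host%%` filled in per pod). It is not the Datadog Agent's actual code: the helper `resolve_checks` and the pod IP `10.0.0.12` are hypothetical, and the annotation values are written here as valid JSON.

```python
import json

# Annotation values taken from the slide, written as valid JSON.
PREFIX = "service-discovery.datadoghq.com/apache."
annotations = {
    PREFIX + "check_names": '["apache","http_check"]',
    PREFIX + "init_configs": '[{},{}]',
    PREFIX + "instances": (
        '[{"apache_status_url": "http://%%host%%/server-status?auto"},'
        '{"name": "My service", "url": "http://%%host%%", "timeout": 1}]'
    ),
}


def resolve_checks(annotations, prefix, host):
    """Pair each check name with its configs and fill in %%host%%.

    Hypothetical helper for illustration; not the Datadog Agent's code.
    """
    names = json.loads(annotations[prefix + "check_names"])
    init_configs = json.loads(annotations[prefix + "init_configs"])
    # Substitute the template variable before parsing the instances JSON.
    raw_instances = annotations[prefix + "instances"].replace("%%host%%", host)
    instances = json.loads(raw_instances)
    if not (len(names) == len(init_configs) == len(instances)):
        raise ValueError("the three annotation lists must have equal length")
    return [
        {"check": n, "init_config": c, "instance": i}
        for n, c, i in zip(names, init_configs, instances)
    ]


# The pod IP below is a made-up example value.
checks = resolve_checks(annotations, PREFIX, host="10.0.0.12")
for check in checks:
    print(check["check"], check["instance"])
```

The point of the design is that the configuration travels with the pod manifest, so whoever owns the service also owns its monitoring configuration.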

  33. Devs don’t need to coordinate with SRE

  34. Furthermore, SRE runs a monitoring agent on every node

  35. Basic metrics such as CPU and memory are collected by Datadog automatically

  36. Devs don’t need to collect basic metrics themselves

  37. 2. Alerting

  38. In the monolith world, Ops is On-Call

  39. In the microservices world, Dev is also On-Call

  40. Alert accuracy is important

  41. Alert on work metrics

  42. Work metrics & Resource metrics & Events (+logs) (image source: Datadog blog)

  43. Alert on work metrics. NG: alert on CPU usage. OK: alert on server latency.

  44. Alert on work metrics

  45. You can say high latency is a problem. But you can’t say high CPU is a problem.

  46. If you know high CPU usage is a problem, keep it low by using auto-scaling.

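
The "alert on work metrics" rule can be sketched as a small paging check. This is a minimal illustration under stated assumptions, not a Datadog monitor definition: the function names, the 500 ms threshold, the 95th percentile, and all sample values are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[rank - 1]


def should_page(latencies_ms, threshold_ms=500, q=95):
    """Page only when the work metric (latency) breaches its threshold."""
    return percentile(latencies_ms, q) > threshold_ms


# Hypothetical latency samples (ms) for one evaluation window.
healthy = [120, 130, 110, 140, 125, 135, 115, 128, 122, 131]
degraded = [120, 900, 950, 870, 125, 990, 880, 128, 940, 910]

# Resource metric: useful while investigating, but it never pages anyone.
cpu_percent = 92

print(should_page(healthy))
print(should_page(degraded))
```

CPU usage is deliberately absent from `should_page`: in this sketch it is context for the investigation, not a paging condition.
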
  47. PagerDuty service / team per microservice

  48. Boilerplate for microservice

  49. 3. Investigating

  50. In the monolith world, Ops looks at the dashboard and investigates

  51. In the microservices world, Dev looks at the dashboard and investigates

  52. At least 1 dashboard per microservice

  53. Devs need to fix problems themselves, so SRE has to give them enough visibility

  54. Dev can see almost everything: Logs (image source: Datadog blog)

  55. Dev can see almost everything: Events

  56. Dev can see almost everything: Errors (image source: Datadog blog)

  57. Dev can see almost everything: Tracing and Profiling

  58. Dev can see almost everything: Slow query

  59. Multidimensional metrics (image source: Datadog blog)

  60. Dev and SRE can see metrics in any dimension (image source: Datadog blog)

  61. Dev can see metrics only in their context: “avg:docker.cpu.usage{kube_namespace:foo}”

  62. SRE can see metrics across dev teams: “avg:docker.cpu.usage by {kube_namespace}”
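
The difference between the two views on slides 61 and 62 (a filter on one namespace versus a group-by across namespaces) can be sketched over tagged data points. This is an illustration of tag-based aggregation in general, not Datadog's query engine: the sample values, the `pod` tag, and the helper names `avg_filtered` and `avg_by` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tagged samples of a CPU usage metric (values are made up).
points = [
    {"value": 40.0, "tags": {"kube_namespace": "foo", "pod": "foo-1"}},
    {"value": 60.0, "tags": {"kube_namespace": "foo", "pod": "foo-2"}},
    {"value": 10.0, "tags": {"kube_namespace": "bar", "pod": "bar-1"}},
    {"value": 30.0, "tags": {"kube_namespace": "bar", "pod": "bar-2"}},
]


def avg_filtered(points, **tag_filter):
    """Average over points whose tags match every key/value in tag_filter."""
    matched = [
        p["value"] for p in points
        if all(p["tags"].get(k) == v for k, v in tag_filter.items())
    ]
    return mean(matched)


def avg_by(points, tag):
    """Average grouped by the value of one tag."""
    groups = defaultdict(list)
    for p in points:
        groups[p["tags"][tag]].append(p["value"])
    return {key: mean(vals) for key, vals in groups.items()}


dev_view = avg_filtered(points, kube_namespace="foo")  # one team's context
sre_view = avg_by(points, "kube_namespace")            # across all teams
print(dev_view)
print(sre_view)
```

Both views read the same data; only the tag filter or grouping changes, which is why one metric pipeline can serve both Dev and SRE.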

  63. Include everything in one dashboard

  64. Frontend (CDN, Synthetic, Browser); Backend (Trace, Profile, Error); Infrastructure (LB, Server, DB…); Events (Deploy, Auto-Scale, SaaS/IaaS); Logs (Frontend ~ Infrastructure); Business metrics (KGI, KPI)

  65. Give Devs not only visibility, but also a compass in the monitoring area

  66. Monitoring Framework (image source: Datadog blog)

  67. Future Plans

  68. SLO + Error Budget; Failure Friday (On-Call training); Monitoring Guide (documentation); Processes monitoring (kubelet etc); Topology Map; End-to-End error / log tracking; Internal status page

  69. Recap

  70. loosely coupled & bounded context

  71. These principles are also important in the monitoring area

  72. A monitoring framework instead of person-dependent skills

  73. Make monitoring something everyone can do

  74. Thanks