Microservices Monitoring at mercari

Seigo Uchida
November 30, 2017

A talk about how mercari is adopting microservices and how it is trying to monitor them.

Monitoring Seminar in mercari
https://mackerelio.connpass.com/event/71256/

Transcript

  1. Microservices Monitoring
    at mercari
    Monitoring Seminar in mercari, Nov 29, 2017

  2. @spesnova
    SRE at mercari

  3. How to monitor Microservices?

  4. but first,

  5. Why Microservices?

  6. We should never forget the purpose

  7. mercari is facing a “scalability” problem

  8. 100+ engineers

  9. Developers have to coordinate a lot of things

  10. Code dependency
    Other dev teams
    Deploy schedule
    QAs
    SREs
    …etc

  11. coordination is important, but…

  12. I can’t say this is “fast as possible”

  13. How to go as
    “fast as possible”?

  14. loosely coupled & bounded context

  15. = Microservices

  16. Key concepts

  17. System and Organization redesign
    Self service
    Standardization
    Automation

  18. Key technology

  19. Kubernetes

  20. “fast as possible”
    in monitoring area

  21. monitoring area

  22. 1. Collecting
    2. Alerting
    3. Investigating

  23. Make these things as fast as possible

  24. with:
    Datadog
    GCP Stackdriver
    PagerDuty
    Sentry
    New Relic

  25. 1. Collecting

  26. monolith vs microservice

  27. In monolith world,
    Dev asks Ops to configure metric collection

  28. This doesn’t scale in microservices world

  29. In microservices world,
    Dev configures agent to collect metrics themselves

  30. Dev puts monitoring configuration
    in the pod manifest, instead of configuring the agent directly

  31. Datadog discovers monitoring configurations
    in Kubernetes manifest (annotations)
    annotations:
      service-discovery.datadoghq.com/apache.check_names: '["apache","http_check"]'
      service-discovery.datadoghq.com/apache.init_configs: '[{},{}]'
      service-discovery.datadoghq.com/apache.instances: '[{"apache_status_url": "http://%%host%%/server-status?auto"},{"name": "My service", "url": "http://%%host%%", "timeout": 1}]'
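
The annotation values on this slide are JSON strings, and `%%host%%` is a template variable that is filled in per container. As a rough illustration of how the three annotations line up (check name, init config, instance), here is a minimal Python sketch. It is not the Datadog agent's code: the helper `resolve_checks` and the address `10.0.0.12` are hypothetical, chosen only for this example.

```python
import json

# Annotation values as shown on the slide; each value is a JSON string.
PREFIX = "service-discovery.datadoghq.com/apache."
annotations = {
    PREFIX + "check_names": '["apache","http_check"]',
    PREFIX + "init_configs": '[{},{}]',
    PREFIX + "instances": (
        '[{"apache_status_url": "http://%%host%%/server-status?auto"},'
        '{"name": "My service", "url": "http://%%host%%", "timeout": 1}]'
    ),
}


def resolve_checks(annotations, container_name, host):
    """Hypothetical helper: pair each check with its configs and fill in %%host%%."""
    prefix = f"service-discovery.datadoghq.com/{container_name}."
    names = json.loads(annotations[prefix + "check_names"])
    init_configs = json.loads(annotations[prefix + "init_configs"])
    # Substitute the template variable before parsing the instances list.
    raw_instances = annotations[prefix + "instances"].replace("%%host%%", host)
    instances = json.loads(raw_instances)
    if not (len(names) == len(init_configs) == len(instances)):
        raise ValueError("check_names, init_configs and instances must have equal length")
    return [
        {"check": name, "init_config": init, "instance": inst}
        for name, init, inst in zip(names, init_configs, instances)
    ]


# 10.0.0.12 stands in for the pod IP (an assumption for this example).
checks = resolve_checks(annotations, "apache", "10.0.0.12")
for check in checks:
    print(check["check"], check["instance"])
```

Parsing the values this way is also a cheap check that an annotation is valid JSON before it is committed to a manifest.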

  32. Datadog discovers monitoring configurations
    in Kubernetes manifest (annotations)

  33. Devs don’t need to coordinate with SRE

  34. Furthermore,
    SRE runs a monitoring agent on every node

  35. Basic metrics such as CPU and memory are
    collected by Datadog automatically

  36. Devs don’t need to collect basic metrics themselves

  37. 2. Alerting

  38. In monolith world, Ops is On-Call

  39. In microservices world, Dev is also On-Call

  40. Alert accuracy is important

  41. Alert on work metrics

  42. Work metrics & Resource metrics & Events (+logs)
    (Image: alerting chart, from the Datadog blog, datadog-prod.imgix.net)

  43. NG - Alert on CPU usage
    OK - Alert on server latency
    Alert on work metrics

  44. Alert on work metrics

  45. You can say high latency is a problem.
    But you can’t say high CPU is a problem.

  46. If you know high CPU usage is a problem,
    keep it low by using auto-scaling.
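
The rule "alert on work metrics" can be sketched as a paging decision that looks only at request latency and never at CPU usage. The snippet below is an illustrative Python sketch, not a Datadog monitor definition: the 95th percentile, the 500 ms threshold, and the function names are all assumptions made for this example.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def should_page(latencies_ms, threshold_ms=500, pct=95):
    """Page only when the work metric (request latency) breaches its threshold.

    CPU usage is deliberately not an input: a busy CPU is a resource signal
    to investigate or auto-scale on, not a reason to wake someone up.
    """
    return percentile(latencies_ms, pct) > threshold_ms


# Hypothetical latency samples in milliseconds.
healthy = [120] * 95 + [900] * 5     # 5% slow requests
degraded = [120] * 90 + [900] * 10   # 10% slow requests

print(should_page(healthy))   # False
print(should_page(degraded))  # True
```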

  47. PagerDuty service / team per microservice

  48. Boilerplate for microservice

  49. 3. Investigating

  50. In monolith world,
    Ops sees the dashboard and investigates

  51. In microservices world,
    Dev sees the dashboard and investigates

  52. At least 1 dashboard per microservice

  53. Devs need to fix problems themselves,
    so SRE has to give them enough visibility

  54. Dev can see almost everything: Logs
    (Image: annotated log demo, from the Datadog blog, datadog-prod.imgix.net)

  55. Dev can see almost everything: Events

  56. Dev can see almost everything: Errors
    (Image: Sentry event screenboard, from the Datadog blog, datadog-prod.imgix.net)

  57. Dev can see almost everything: Tracing and Profiling

  58. Dev can see almost everything: Slow query

  59. Multidimensional metrics
    (Image: key-value tags, from the Datadog blog post on the power of tagged metrics, datadog-prod.imgix.net)

  60. Dev and SRE can see metrics in any dimension
    (Image: OLAP diagram, from the Datadog blog post on the power of tagged metrics, datadog-prod.imgix.net)

  61. Dev can see metrics only in their context
    “avg:docker.cpu.usage{kube_namespace:foo}”

  62. SRE can see metrics across dev teams
    “avg:docker.cpu.usage by {kube_namespace}”
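
The two views on slides 61 and 62 read the same tagged metric in different ways: a dev team filters to its own `kube_namespace`, while SRE groups by `kube_namespace` to compare teams. The following Python sketch imitates that on a few in-memory data points. The sample values, the `pod` tag, and both helper functions are hypothetical; a real query would run inside Datadog rather than in application code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical CPU usage samples, each a (value, tags) pair.
points = [
    (40.0, {"kube_namespace": "foo", "pod": "foo-1"}),
    (60.0, {"kube_namespace": "foo", "pod": "foo-2"}),
    (10.0, {"kube_namespace": "bar", "pod": "bar-1"}),
]


def avg_filtered(points, **scope):
    """Average over the points whose tags match every key/value in scope."""
    values = [
        value
        for value, tags in points
        if all(tags.get(key) == wanted for key, wanted in scope.items())
    ]
    return mean(values)


def avg_by(points, tag):
    """Average per distinct value of the given tag."""
    groups = defaultdict(list)
    for value, tags in points:
        groups[tags[tag]].append(value)
    return {key: mean(values) for key, values in groups.items()}


dev_view = avg_filtered(points, kube_namespace="foo")  # one team's context
sre_view = avg_by(points, "kube_namespace")            # across all teams

print(dev_view)  # 50.0
print(sre_view)  # {'foo': 50.0, 'bar': 10.0}
```

Because the dimension lives in the tags rather than in the metric name, neither view needs a separate metric to be collected.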

  63. Include everything in one dashboard

  64. Frontend (CDN, Synthetic, Browser)
    Backend (Trace, Profile, Error)
    Infrastructure (LB, Server, DB…)
    Events (Deploy, Auto-Scale, SaaS/IaaS)
    Logs (Frontend ~ Infrastructure)
    Business metrics (KGI, KPI)

  65. Give dev not only visibility,
    but also a compass in monitoring area

  66. Monitoring Framework
    (Image: investigating diagram, from the Datadog blog, datadog-prod.imgix.net)

  67. Future Plans

  68. SLO + Error Budget
    Failure Friday (On-Call training)
    Monitoring Guide (documentation)
    Processes monitoring (kubelet etc)
    Topology Map
    End-to-End error / log tracking
    Internal status page
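
For the first item in the list above, the arithmetic behind an SLO and its error budget can be sketched in a few lines of Python. The slides give no concrete targets, so the 99.9% SLO, the request counts, and the function name are assumptions made purely for illustration.

```python
from fractions import Fraction


def error_budget(slo_target, total_requests, failed_requests):
    """Return the allowed failures, remaining failures, and remaining share.

    slo_target is a Fraction (e.g. Fraction(999, 1000) for 99.9%) so the
    arithmetic stays exact.
    """
    allowed = (1 - slo_target) * total_requests
    remaining = allowed - failed_requests
    return {
        "allowed_failures": allowed,
        "remaining_failures": remaining,
        "remaining_ratio": remaining / allowed,
    }


# Hypothetical month: 1,000,000 requests, 250 failures, 99.9% availability SLO.
budget = error_budget(Fraction(999, 1000), 1_000_000, 250)

print(budget["allowed_failures"])        # 1000
print(budget["remaining_failures"])      # 750
print(float(budget["remaining_ratio"]))  # 0.75
```

A negative remaining budget would mean the SLO has already been missed for the period.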

  69. Recap

  70. loosely coupled & bounded context

  71. These principles are also important in monitoring area

  72. monitoring framework
    instead of dependent skills

  73. Enable everyone to monitor

  74. Thanks
