
Monitoring and Visualizing Your (Micro)services

JCConf Taiwan 2019

Shin Tanimoto

October 04, 2019

Transcript

  1. WHO AM I?
    ➤ Shin Tanimoto (Twitter: @cero_t)
    ➤ Senior Solution Architect / Troubleshooter
    ➤ Everforth Co.,LTD.
    ➤ Acroquest Technology Co.,LTD.
    ➤ Leader of Japan Java User Group (JJUG)
    ➤ Java Champion
    ➤ Oracle Groundbreaker Ambassador
    ➤ Fighting Games / BABYMETAL
  2. MONITORING / VISUALIZING
    ➤ When it comes to monitoring, what do you think of?
      ➤ Tools?
      ➤ Charts?
      ➤ Alerts?
  3. USE-CASE FIRST
    ➤ We should start from use-cases.
      ➤ Monitoring
      ➤ Alerting
      ➤ Troubleshooting
      ➤ Discovering
  4. USE-CASE FIRST
    ➤ We should start from use-cases.
      ➤ Monitoring → What’s going on “now”
      ➤ Alerting → Is it okay “now”?
      ➤ Troubleshooting → What happened in the “past”
      ➤ Discovering → To know the unknown
  5. A METAPHOR
    ➤ The speedometer of a car is an analog dial (or a digital meter showing a single number)
    ➤ It’s enough to get the current status, isn’t it?
  6. A METAPHOR
    ➤ Japanese cars were equipped with a warning sound that went off when the speed exceeded 100 km/h
      ➤ This is alerting
    ➤ When a traffic accident happens, logs such as drive recorder footage are necessary
      ➤ That is troubleshooting
    ➤ Analyzing the driver’s steering and pedal operation may prevent potential traffic accidents
      ➤ That is discovering
  7. MONITORING TARGET
    ➤ E-Commerce Microservices built with Spring Boot.
      ➤ UI (vue.js / Nginx)
      ➤ store-web (Spring Boot)
      ➤ item-service (Spring Boot)
      ➤ stock-service (Spring Boot)
      ➤ cart-service (Spring Boot)
      ➤ order-service (Spring Boot)
      ➤ payment-service (Spring Boot)
  8. WHAT SHOULD BE COLLECTED
    ➤ Metrics
      ➤ Time-series numerical data
      ➤ In most cases, sampling-based
    ➤ Logs
      ➤ Text messages with timestamp
      ➤ In most cases, event-driven
  9. COLLECTING DATA / METRICS
    ➤ Resource metrics
      ➤ Server / Container resources
        ➤ CPU usage, memory usage, disk usage
        ➤ Disk IO, network IO
      ➤ JVM resources
        ➤ Heap usage, non-heap usage
        ➤ GC pause, GC count
        ➤ Thread count
  10. COLLECTING DATA / METRICS
    ➤ Resource metrics
      ➤ Server / Container resources
        ➤ CPU usage, memory usage, disk usage
        ➤ Disk IO, network IO
      ➤ JVM resources
        ➤ Heap usage, non-heap usage
        ➤ GC pause, GC count
        ➤ Thread count
    [Labels on slide: Capacity, Performance, Capacity, Performance, Capacity]
  11. COLLECTING DATA / METRICS
    ➤ Resource metrics
      ➤ Server / Container resources
        ➤ CPU usage, memory usage, disk usage
        ➤ Disk IO, network IO
      ➤ JVM resources
        ➤ Heap usage, non-heap usage
        ➤ GC pause, GC count
        ➤ Thread count
    [Labels on slide: Monitoring / Alerting, Troubleshooting, Monitoring / Alerting, Troubleshooting, Monitoring]
  12. COLLECTING DATA / METRICS
    ➤ Performance metrics
      ➤ Avg. HTTP response time of front-end servers
      ➤ Avg. HTTP response time of each (micro)service
      ➤ Database response time
  13. COLLECTING DATA / METRICS
    ➤ Performance metrics
      ➤ Avg. HTTP response time of front-end servers
      ➤ Avg. HTTP response time of each (micro)service
      ➤ Database response time
    [Labels on slide: Troubleshooting, Troubleshooting, Monitoring / Alerting]
  14. COLLECTING DATA / METRICS
    ➤ Health metrics
      ➤ HTTP access count of front-end servers
      ➤ HTTP access count of each (micro)service
      ➤ HTTP status of front-end servers
      ➤ HTTP status of each (micro)service
  15. COLLECTING DATA / METRICS
    ➤ Health metrics
      ➤ HTTP access count of front-end servers
      ➤ HTTP access count of each (micro)service
      ➤ HTTP status of front-end servers
      ➤ HTTP status of each (micro)service
    [Labels on slide: Monitoring / Alerting, Monitoring, Troubleshooting, Troubleshooting]
  16. COLLECTING DATA / METRICS
    ➤ Health metrics
      ➤ HTTP access count of front-end servers
        ➤ Both too many and too few are problems
      ➤ HTTP access count of each (micro)service
      ➤ HTTP status of front-end servers
      ➤ HTTP status of each (micro)service
    [Labels on slide: Monitoring / Alerting, Monitoring, Troubleshooting, Troubleshooting]
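
The point that both too many and too few requests are problems can be sketched as a two-sided threshold check. This is only an illustration: the class name, the per-minute unit, and the threshold values are assumptions, not something taken from the talk or from any monitoring tool.

```java
/** Hypothetical helper: classifies a request count against a lower and an upper bound. */
final class AccessCountCheck {
    enum Status { TOO_FEW, OK, TOO_MANY }

    // minPerMinute and maxPerMinute are illustrative thresholds chosen by the operator.
    static Status classify(long requestsPerMinute, long minPerMinute, long maxPerMinute) {
        if (requestsPerMinute < minPerMinute) {
            return Status.TOO_FEW;   // traffic may not be reaching the front end at all
        }
        if (requestsPerMinute > maxPerMinute) {
            return Status.TOO_MANY;  // possible overload or abnormal traffic
        }
        return Status.OK;
    }

    public static void main(String[] args) {
        System.out.println(classify(3, 10, 1000));     // prints TOO_FEW
        System.out.println(classify(250, 10, 1000));   // prints OK
        System.out.println(classify(5000, 10, 1000));  // prints TOO_MANY
    }
}
```

A one-sided "greater than" rule would miss the TOO_FEW case, which is why the lower bound is checked first.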
  17. COLLECTING DATA / LOGS
    ➤ Logs
      ➤ Access logs of front-end servers
      ➤ Access logs of each (micro)service
      ➤ Application logs of each (micro)service
      ➤ SQL logs
      ➤ GC logs
    [Label on slide: Troubleshooting]
  18. COLLECTING DATA / LOGS
    ➤ Logs
      ➤ Access logs of front-end servers
      ➤ Access logs of each (micro)service
      ➤ Application logs of each (micro)service
      ➤ SQL logs
      ➤ GC logs
    [Labels on slide: Troubleshooting, Alert ERRORs?]
  19. MONITORING ENVIRONMENT / ARCHITECTURE
    [Diagram: each Server / Container runs a Spring Boot Application and an Agent; the Agents send Metrics and Logs to a Data Store; a Visualizer / Alerting component reads and summarizes the data from the Data Store]
  20. MONITORING ENVIRONMENT / AGENTS
    [Diagram: on a Server / Container running a Spring Boot Application, an In-app Agent sends JVM Metrics, an In-server Agent sends Server Metrics, and a Log Shipper Agent sends the application's Logs to the Data Store]
  21. MONITORING ENVIRONMENT / ELASTICSEARCH
    ➤ Elasticsearch + Kibana
      ➤ Elasticsearch
        ➤ Open source full-text search engine
        ➤ Useful as a data store for logs and metrics
        ➤ Can store, search and aggregate logs quickly
      ➤ Kibana
        ➤ Open source visualizer for Elasticsearch
        ➤ View / create charts, maps, tables, etc. in a web browser
  22. MONITORING ENVIRONMENT / SPOG
    ➤ Single Pane of Glass
      ➤ See all the information in one tool
    ➤ Monitoring environment should be like a baseball scoreboard.
      ➤ Viewed by all members
      ➤ To see the current situation at a glance
    ➤ Don't break up the combination of Monitoring, Alerting, Troubleshooting and Discovering
  23. MONITORING ENVIRONMENT / AGENTS
    ➤ Agents (Data collectors)
      ➤ Metricbeat
        ➤ Server / Container resources
      ➤ Filebeat or Fluentd
        ➤ Log shipper
      ➤ Elastic APM / Elastic APM Java Agent
        ➤ APM (Application Performance Monitoring)
        ➤ JVM resources
  24. MONITORING ENVIRONMENT / AGENTS
    [Diagram (Server): the Elastic APM Java Agent sends JVM Metrics, Metricbeat sends Server Metrics, and Filebeat ships the Spring Boot Application's Logs to Elasticsearch; Kibana reads and summarizes the data]
  25. MONITORING ENVIRONMENT / AGENTS
    [Diagram (Container): the Elastic APM Java Agent sends JVM Metrics, Metricbeat sends Container Metrics, and Fluentd ships the Spring Boot Application's Logs (via the fluentd log-driver) to Elasticsearch; Kibana reads and summarizes the data]
  26. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS
    ➤ Metricbeat
      ➤ Server monitoring
        ➤ Just run metricbeat (out-of-the-box, no configuration)
      ➤ Container (Docker) monitoring
        ➤ Enable metricbeat/modules.d/docker.yml
        ➤ Run metricbeat on docker
        ➤ Metricbeat (auto)discovers containers and retrieves container metrics
  27. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS
    ➤ Elastic APM Java Agent (1/2)
      ➤ Add dependency to the pom.xml of Spring Boot application

          <dependency>
            <groupId>co.elastic.apm</groupId>
            <artifactId>apm-agent-attach</artifactId>
            <version>1.9.0</version>
          </dependency>
  28. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS
    ➤ Elastic APM Java Agent (2/2)
      ➤ Add elasticapm.properties

          service_name=store-web

      ➤ Add “ElasticApmAttacher.attach();” to the main method

          public static void main(String[] args) {
              ElasticApmAttacher.attach();
              SpringApplication.run(StoreApplication.class, args);
          }
  29. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS
    ➤ Fluentd
      ➤ Run (micro)services with fluentd log-driver
      ➤ Run fluentd on Docker

          store-web:
            # snip
            logging:
              driver: "fluentd"
              options:
                tag: "docker.services"
  30. TIPS #1 / JSON FORMAT LOGS
    ➤ Logs should be parsed before storing to Elasticsearch
      ➤ Log is like

          2016-02-26 11:15:47.561 INFO [service1,2485ec27856c56f4,2485ec27856c56f4,true] 68058 --- [nio-8081-exec-1] i.s.c.sleuth.docs.service1.Application : Hello from service1. Calling service2

      ➤ Grok pattern is like

          %{TIMESTAMP_ISO8601:timestamp}\s+%{LOGLEVEL:severity}\s+\[%{DATA:service},%{DATA:trace},%{DATA:span},%{DATA:exportable}\]\s+%{DATA:pid}\s+---\s+\[%{DATA:thread}\]\s+%{DATA:class}\s+:\s+%{GREEDYDATA:rest}

    ➤ Multi line logs are hell
      ➤ Like aggregation of stream events
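
To make the parsing cost concrete, the Grok pattern above can be approximated with a plain regular expression. The following is a minimal sketch using only `java.util.regex`; the class name and the group name `logger` (standing in for the Grok field `class`) are assumptions, and the expression is a hand-written approximation rather than what Grok itself compiles to.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the single-line Sleuth-style log format shown on the slide. */
final class SleuthLogLineParser {
    private static final Pattern LINE = Pattern.compile(
            "^(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3})\\s+"
            + "(?<severity>[A-Z]+)\\s+"
            + "\\[(?<service>[^,\\]]*),(?<trace>[^,\\]]*),(?<span>[^,\\]]*),(?<exportable>[^,\\]]*)\\]\\s+"
            + "(?<pid>\\d+)\\s+---\\s+"
            + "\\[(?<thread>[^\\]]*)\\]\\s+"
            + "(?<logger>\\S+)\\s+:\\s+"
            + "(?<rest>.*)$");

    private static final String[] FIELDS = {
        "timestamp", "severity", "service", "trace", "span",
        "exportable", "pid", "thread", "logger", "rest"
    };

    /** Returns the extracted fields, or empty when the line does not follow the format. */
    static Optional<Map<String, String>> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : FIELDS) {
            fields.put(name, m.group(name));
        }
        return Optional.of(fields);
    }
}
```

A continuation line of a stack trace does not match this expression at all, so it would have to be stitched back onto the preceding event by other means. That is the multi-line problem the slide refers to.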
  31. TIPS #1 / JSON FORMAT LOGS
    ➤ Format logs as JSON at log output time!
      ➤ Application Log
        ➤ logstash-logback-encoder

          <appender name="json-console" class="ch.qos.logback.core.ConsoleAppender">
            <encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
              <providers>
                <timestamp>
                  <timeZone>UTC</timeZone>
                </timestamp>
                <pattern>
                  <pattern>
                    {
                    "severity": "%level",
                    "service": "${springAppName:-}",
                    "type": "application",
  32. TIPS #1 / JSON FORMAT LOGS
    ➤ Format logs as JSON at log output time!
      ➤ Access Log
        ➤ Write configuration by hand

          server.tomcat.accesslog.pattern={"@timestamp":"%{yyyy-MM-dd'T'HH:mm:ss.SSSZ}t","service":"${spring.application.name}","type":"access","method":"%m","remote":"%a","path":"%U","query":"%q","duration":"%D","status":"%s","bytes":"%B","user-agent":"%{User-Agent}i","referer":"%{Referer}i","session-id":"%S"}
  33. TIPS #2 / DISTRIBUTED TRACING
    ➤ In monolithic Java applications, application behaviors can be traced from application logs using “thread-id”.
    ➤ Logs of microservices are micro-partitioned
      ➤ No correlation ids for a single request!
    ➤ Distributed tracing helps us trace the behaviors for a request
      ➤ Trace-ID is like thread-id or request-id in the microservices
      ➤ It is passed via HTTP header or AMQP header etc.
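
How a trace ID travels between services can be sketched in a few lines. This is a simplified illustration, not how Spring Cloud Sleuth or Elastic APM implement it: headers are modeled as a plain `Map` (real HTTP header names are case-insensitive), the class and method names are hypothetical, and the header name `X-B3-TraceId` is borrowed from the B3 propagation convention.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical sketch of trace ID propagation between services. */
final class TraceIdPropagation {
    // Header name assumed for illustration (B3 propagation convention).
    static final String TRACE_HEADER = "X-B3-TraceId";

    /** Reuses the caller's trace ID, or starts a new trace when the request has none. */
    static String resolveTraceId(Map<String, String> incomingHeaders) {
        String existing = incomingHeaders.get(TRACE_HEADER);
        if (existing != null && !existing.isEmpty()) {
            return existing;
        }
        // 16 hex characters taken from a random UUID; real tracers generate IDs differently.
        return UUID.randomUUID().toString().replace("-", "").substring(0, 16);
    }

    /** Builds the headers for a downstream call so the next service logs the same trace ID. */
    static Map<String, String> outgoingHeaders(String traceId) {
        Map<String, String> headers = new HashMap<>();
        headers.put(TRACE_HEADER, traceId);
        return headers;
    }
}
```

Every service on the request path resolves the same ID and writes it into each log record, which restores the correlation that thread-id gave in a monolith.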
  34. TIPS #2 / DISTRIBUTED TRACING
    ➤ Tools
      ➤ Spring Cloud Sleuth
        ➤ Depends on Spring DI, AOP
        ➤ Available only for Spring Boot applications
      ➤ Elastic APM
        ➤ Uses bytecode instrumentation by ByteBuddy
        ➤ Available for all Java applications
  35. TIPS #2 / DISTRIBUTED TRACING
    ➤ Spring Cloud Sleuth vs Elastic APM
      ➤ Spring Cloud Sleuth
        ➤ Supports most Spring modules
        ➤ RestTemplate, WebFlux, Netty …
        ➤ Spring Cloud Stream, Spring RabbitMQ, Spring Kafka …
      ➤ Elastic APM
        ➤ Supports limited libraries
        ➤ Servlet, JAX-RS, Apache HttpClient, JDBC …
        ➤ Spring MVC, RestTemplate …
        ➤ Does NOT support WebFlux, RabbitMQ, Kafka …
  36. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS
    ➤ Spring Cloud Sleuth configuration
      ➤ Add dependency to the pom.xml of Spring Boot application

          <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-starter-sleuth</artifactId>
          </dependency>
  37. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS
    ➤ Spring Cloud Sleuth configuration
      ➤ Modify logback-spring.xml if you customized the log output format

          <pattern>
            <pattern>
              {
              "severity": "%level",
              "service": "${springAppName:-}",
              "type": "application",
              "trace": "%X{X-B3-TraceId:-}",
              "span": "%X{X-B3-SpanId:-}",
              "parent": "%X{X-B3-ParentSpanId:-}",
              "exportable": "%X{X-Span-Export:-}",
  38. From 17:00 at Room 401
    How to Properly Blame Things for Causing Latency: An Introduction to Distributed Tracing and Zipkin
    By Adrian Cole
  39. PITFALL #1 / ALERTS ARE ONLY FOR PREDICTABLE PROBLEMS
    ➤ Threshold-based alerting only detects predictable problems
      ➤ If we only collect CPU usage and memory usage, we cannot notice a full disk.
      ➤ If we only collect disk volume usage, we cannot notice a partition running out of space.
      ➤ If we only collect disk size, we cannot notice inode exhaustion.
    ➤ Not all problems can be predicted from the beginning.
    ➤ If an unknown problem occurs, it should be fed back into the monitoring, because the same problem may occur again in the near future
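
The limitation can be shown with a toy rule evaluator: an alert can only fire for a metric that is both collected and covered by a rule. The class, the metric names, and the threshold values below are all hypothetical and exist only to illustrate the pitfall.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical threshold evaluator: one upper limit per metric name. */
final class ThresholdAlerts {
    /** Returns the names of metrics whose observed value exceeds their threshold, sorted by name. */
    static List<String> firing(Map<String, Double> observed, Map<String, Double> thresholds) {
        List<String> alerts = new ArrayList<>();
        for (Map.Entry<String, Double> rule : new TreeMap<>(thresholds).entrySet()) {
            Double value = observed.get(rule.getKey());
            // A metric that was never collected has no value, so its rule can never fire.
            if (value != null && value > rule.getValue()) {
                alerts.add(rule.getKey());
            }
        }
        return alerts;
    }
}
```

With rules only for CPU and memory, a disk at 100% produces no alert. The alert appears only after the disk metric and its rule are added, which is the feedback step described on the slide.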
  40. PITFALL #1 / ALERTS ARE ONLY FOR PREDICTABLE PROBLEMS
    ➤ To discover unknown problems …
      ➤ Filter out all “known” logs on Kibana
      ➤ Only unknown logs are displayed
      ➤ Unknown logs may indicate unknown problems
      ➤ This leads to more tasks, and to a more stable system
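
The talk does this filtering in Kibana. The same idea, sketched outside Kibana purely for illustration, is an exclusion filter over log messages: anything that matches a pattern already understood is dropped, and whatever remains is worth a look. The class name and the example patterns are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical exclusion filter: keeps only messages that match no known pattern. */
final class UnknownLogFilter {
    static List<String> unknownOnly(List<String> messages, List<Pattern> knownPatterns) {
        List<String> unknown = new ArrayList<>();
        for (String message : messages) {
            boolean known = false;
            for (Pattern pattern : knownPatterns) {
                if (pattern.matcher(message).find()) {
                    known = true;
                    break;
                }
            }
            if (!known) {
                unknown.add(message);  // nobody has classified this message yet
            }
        }
        return unknown;
    }
}
```

Each time an unknown message is investigated, its pattern can be added to the known list, so the remaining set shrinks over time.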
  41. PITFALL #2 / ALERTS ARE ONLY FOR PREDICTABLE PROBLEMS
    ➤ A (micro)service seems healthy but is actually down
      ➤ Returns 200 OK to the health check request
      ➤ Metrics are normal
      ➤ But returns no response at all
    ➤ The last update time of each service's log should be monitored
      ➤ A living application produces some logs
      ➤ If there is no log, it may be dead
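
A check on log freshness can be sketched as follows. It assumes the latest log timestamp per service is already available (for example from a query against the log store); the class name, the map of timestamps, and the silence limit are all hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical freshness check: reports services whose logs have gone quiet. */
final class LogFreshnessCheck {
    /** Returns the services whose last log is older than maxSilence, sorted by name. */
    static List<String> silentServices(Map<String, Instant> lastLogTime, Instant now, Duration maxSilence) {
        List<String> silent = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : new TreeMap<>(lastLogTime).entrySet()) {
            Duration silence = Duration.between(entry.getValue(), now);
            if (silence.compareTo(maxSilence) > 0) {
                silent.add(entry.getKey());  // no recent log: the service may be dead
            }
        }
        return silent;
    }
}
```

The silence limit has to be longer than the quietest normal period of each service, otherwise an idle but healthy service would be reported.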
  42. MONITORING AND ALERTING / METRICS
    ➤ Monitoring KPIs
      (1) The mean time from when a problem occurs to when it is detected
      (2) # of problems detected by alerting / Total # of problems
      (3) # of monitoring improvements / Total # of problems
      (4) (2) + (3) / Total # of problems
    ➤ We can clarify whether the monitoring environment has become better or not
    ➤ Monitoring improvement leads to more stable system operation
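
KPIs (1) to (3) are simple enough to compute by hand. The sketch below shows the arithmetic with made-up numbers; the class name, the choice of minutes as the unit, and the sample figures are assumptions for illustration only.

```java
/** Hypothetical calculator for monitoring KPIs (1) to (3). */
final class MonitoringKpis {
    /** KPI (1): mean delay, in minutes, between a problem occurring and being detected. */
    static double meanDetectionDelayMinutes(long[] delaysInMinutes) {
        if (delaysInMinutes.length == 0) {
            throw new IllegalArgumentException("no problems recorded");
        }
        long sum = 0;
        for (long delay : delaysInMinutes) {
            sum += delay;
        }
        return (double) sum / delaysInMinutes.length;
    }

    /** KPIs (2) and (3): a count divided by the total number of problems. */
    static double ratio(int count, int totalProblems) {
        if (totalProblems <= 0) {
            throw new IllegalArgumentException("totalProblems must be positive");
        }
        return (double) count / totalProblems;
    }

    public static void main(String[] args) {
        // Made-up period: 4 problems, 3 detected by alerting, 2 led to a monitoring improvement.
        System.out.println(meanDetectionDelayMinutes(new long[] {5, 10, 15, 30})); // prints 15.0
        System.out.println(ratio(3, 4)); // prints 0.75
        System.out.println(ratio(2, 4)); // prints 0.5
    }
}
```

Tracking these values per period shows whether detection is getting faster and whether more problems are being caught by alerts rather than by users.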
  43. ENJOY MONITORING YOUR SYSTEM!
    ➤ Demo application and monitoring environment
      ➤ https://github.com/cero-t/spring-store-2019/
    ➤ My Twitter (@cero_t)
      ➤ https://twitter.com/cero_t