
Monitoring and Visualizing Your (Micro)services

JCConf Taiwan 2019

Shin Tanimoto

October 04, 2019

Transcript

  1. MONITORING AND VISUALIZING YOUR (MICRO)SERVICES ➤ Shin TANIMOTO (@cero_t) ➤ Everforth / Acroquest Technology
  2. DEMO SOURCE CODE http://bit.ly/jc2019monitoring

  3. WHO AM I? ➤ Shin Tanimoto (Twitter: @cero_t) ➤ Senior Solution Architect / Troubleshooter ➤ Everforth Co.,LTD. ➤ Acroquest Technology Co.,LTD. ➤ Leader of Japan Java User Group (JJUG) ➤ Java Champion ➤ Oracle Groundbreaker Ambassador ➤ Fighting Games / BABYMETAL
  4. MONITORING / VISUALIZING ➤ When it comes to monitoring, what do you think of? ➤ Tools? ➤ Charts? ➤ Alerts?
  5. USE-CASE FIRST ➤ We should start from use-cases. ➤ Monitoring ➤ Alerting ➤ Troubleshooting ➤ Discovering
  6. USE-CASE FIRST ➤ We should start from use-cases. ➤ Monitoring → What’s going on “now” ➤ Alerting → Is it okay “now”? ➤ Troubleshooting → What happened in the “past” ➤ Discovering → To know the unknown
  7. A METAPHOR ➤ A car's speedometer is a simple dial gauge (or a digital meter showing a single number) ➤ That is enough to get the current status, isn’t it?
  8. A METAPHOR ➤ Japanese cars were equipped with a warning sound when the speed exceeded 100km/h ➤ This is alerting ➤ When a traffic accident happens, logs such as drive recorder footage are necessary ➤ That is troubleshooting ➤ Analyzing the driver’s steering and pedal operation may prevent potential traffic accidents ➤ That is discovering
  9. AGENDA 1. Collecting Data 2. Building monitoring environment 3. Tips and pitfalls 4. Improving monitoring
  10. COLLECTING DATA

  11. MONITORING TARGET ➤ E-Commerce Microservices built with Spring Boot. ➤ [Diagram: UI (vue.js / Nginx), store-web (Spring Boot), item-service (Spring Boot), stock-service (Spring Boot), cart-service (Spring Boot), order-service (Spring Boot), payment-service (Spring Boot)]
  12. DEMO

  13. WHAT SHOULD BE COLLECTED ➤ Metrics ➤ Time-series numerical data ➤ In most cases, sampling-based ➤ Logs ➤ Text messages with timestamp ➤ In most cases, event-driven
  14. WHAT SHOULD BE COLLECTED ➤ Metrics to collect ➤ Resources ➤ Performance ➤ Health
  15. COLLECTING DATA / METRICS ➤ Resource metrics ➤ Server / Container resources ➤ CPU usage, Memory usage, Disk usage ➤ Disk IO, network IO ➤ JVM resource ➤ Heap usage, Non-heap usage ➤ GC pause, GC count ➤ Thread count
  16. COLLECTING DATA / METRICS ➤ Resource metrics (same list as slide 15) ➤ Overlay labels, in slide order: Capacity, Performance, Capacity, Performance, Capacity
  17. COLLECTING DATA / METRICS ➤ Resource metrics (same list as slide 15) ➤ Overlay labels, in slide order: Monitoring / Alerting, Troubleshooting, Monitoring / Alerting, Troubleshooting, Monitoring
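The JVM resource metrics listed above can be sampled from inside the application with the standard `java.lang.management` MXBeans. The following is only a minimal sketch of that idea, not how Elastic APM or Metricbeat collect them; the class name `JvmMetricsSampler` and the metric key names are made up for illustration, and the accumulated GC time reported by the JVM is only an approximation of "GC pause".

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: takes one sample of heap, non-heap, GC and thread metrics.
final class JvmMetricsSampler {

    static Map<String, Long> sample() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

        long gcCount = 0;
        long gcTimeMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            gcCount += Math.max(0, gc.getCollectionCount());
            gcTimeMillis += Math.max(0, gc.getCollectionTime());
        }

        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put("heap.used.bytes", memory.getHeapMemoryUsage().getUsed());
        metrics.put("nonheap.used.bytes", memory.getNonHeapMemoryUsage().getUsed());
        metrics.put("gc.count", gcCount);
        metrics.put("gc.time.millis", gcTimeMillis);
        metrics.put("thread.count", (long) ManagementFactory.getThreadMXBean().getThreadCount());
        return metrics;
    }

    public static void main(String[] args) {
        // A sampling-based collector would call sample() on a fixed interval.
        sample().forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

Calling `sample()` on a timer and shipping the map to a data store gives the sampling-based time series that slide 13 describes.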
  18. COLLECTING DATA / METRICS ➤ Performance metrics ➤ Avg. HTTP response time of front-end servers ➤ Avg. HTTP response time of each (micro)service ➤ Database response time
  19. COLLECTING DATA / METRICS ➤ Performance metrics (same list as slide 18) ➤ Overlay labels, in slide order: Troubleshooting, Troubleshooting, Monitoring / Alerting
  20. COLLECTING DATA / METRICS ➤ Health metrics ➤ HTTP access count of front-end servers ➤ HTTP access count of each (micro)service ➤ HTTP status of front-end servers ➤ HTTP status of each (micro)service
  21. COLLECTING DATA / METRICS ➤ Health metrics (same list as slide 20) ➤ Overlay labels, in slide order: Monitoring / Alerting, Monitoring, Troubleshooting, Troubleshooting
  22. COLLECTING DATA / METRICS ➤ Health metrics (same list and overlay labels as slide 21) ➤ Added under "HTTP access count of front-end servers": both too many and too few accesses are problems
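The point that both too many and too few accesses are problems suggests checking the access count against a band rather than a single upper limit. A minimal sketch, assuming a per-minute request count and hand-picked bounds (the class name `AccessCountCheck` and the numbers in `main` are illustrative only):

```java
// Hypothetical band check for the HTTP access count of a front-end server.
final class AccessCountCheck {

    enum Status { TOO_FEW, OK, TOO_MANY }

    static Status check(long requestsPerMinute, long min, long max) {
        if (min > max) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        if (requestsPerMinute < min) {
            return Status.TOO_FEW;   // e.g. the load balancer stopped routing traffic
        }
        if (requestsPerMinute > max) {
            return Status.TOO_MANY;  // e.g. a traffic spike or a retry storm
        }
        return Status.OK;
    }

    public static void main(String[] args) {
        // The bounds below are placeholders; real values depend on the system.
        System.out.println(check(3, 50, 5_000));
        System.out.println(check(800, 50, 5_000));
        System.out.println(check(12_000, 50, 5_000));
    }
}
```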
  23. COLLECTING DATA / LOGS ➤ Logs ➤ Access logs of front-end server ➤ Access logs of each (micro)service ➤ Application logs of each (micro)service ➤ SQL logs ➤ GC logs ➤ Overlay label: Troubleshooting
  24. COLLECTING DATA / LOGS ➤ Logs (same list as slide 23) ➤ Overlay labels: Troubleshooting, Alert ERRORs?
  25. BUILDING MONITORING ENVIRONMENT

  26. MONITORING ENVIRONMENT / ARCHITECTURE ➤ [Diagram: on each Server / Container, an Agent next to the Spring Boot Application sends Metrics and Logs to a Data Store; a Visualizer / Alerting component reads and summarizes the Data Store]
  27. MONITORING ENVIRONMENT / AGENTS ➤ [Diagram: on a Server / Container running a Spring Boot Application, an In-app Agent sends JVM Metrics, an In-server Agent sends Server Metrics, and a Log Shipper Agent sends Logs to the Data Store]
  28. MONITORING ENVIRONMENT / ELASTICSEARCH ➤ Elasticsearch + Kibana ➤ Elasticsearch ➤ Open source full-text search engine ➤ Useful for logs and metrics data store ➤ Can store, search and aggregate logs quickly ➤ Kibana ➤ Open source visualizer for Elasticsearch ➤ View / create charts, maps, tables, etc. in web browser
  29. MONITORING ENVIRONMENT / SPOG ➤ Single Pane of Glass ➤ See all the information in one tool ➤ Monitoring environment should be like a baseball scoreboard. ➤ Viewed by all members ➤ To see the current situation at a glance ➤ Don't break the link between Monitoring, Alerting, Troubleshooting and Discovering
  30. MONITORING ENVIRONMENT / AGENTS ➤ Agents (Data collectors) ➤ Metricbeat ➤ Server / Container resources ➤ Filebeat or Fluentd ➤ Log shipper ➤ Elastic APM / Elastic APM Java Agent ➤ APM (Application Performance Monitoring) ➤ JVM resources
  31. MONITORING ENVIRONMENT / AGENTS ➤ [Diagram, Server case: the Elastic APM Java Agent in the Spring Boot Application sends JVM Metrics, Metricbeat sends Server Metrics, and Filebeat ships Logs to Elasticsearch; Kibana reads and summarizes]
  32. MONITORING ENVIRONMENT / AGENTS ➤ [Diagram, Container case: the Elastic APM Java Agent in the Spring Boot Application sends JVM Metrics, Metricbeat sends Container Metrics, and Fluentd ships Logs (fluentd log-driver) to Elasticsearch; Kibana reads and summarizes]
  33. DEMO

  34. (APPENDIX) CONFIGURATIONS OF AGENTS

  35. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Metricbeat ➤ Server monitoring ➤ Just run metricbeat (out-of-the-box, no configuration) ➤ Container (Docker) monitoring ➤ Enable metricbeat/module.d/docker.yml ➤ Run metricbeat on docker ➤ Metricbeat (auto)discovers containers and retrieves container metrics
  36. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Elastic APM Java Agent (1/2) ➤ Add dependency to the pom.xml of Spring Boot application
    <dependency>
      <groupId>co.elastic.apm</groupId>
      <artifactId>apm-agent-attach</artifactId>
      <version>1.9.0</version>
    </dependency>
  37. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Elastic APM Java Agent (2/2) ➤ Add elasticapm.properties
    service_name=store-web
    ➤ Add “ElasticApmAttacher.attach();” to the main method
    public static void main(String[] args) {
        ElasticApmAttacher.attach();
        SpringApplication.run(StoreApplication.class, args);
    }
  38. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Fluentd ➤ Run (micro)services with fluentd log-driver ➤ Run fluentd on Docker
    store-web:
      # snip
      logging:
        driver: "fluentd"
        options:
          tag: "docker.services"
  39. TIPS AND PITFALLS

  40. TIPS #1 LOGS SHOULD BE JSON FORMATTED

  41. TIPS #1 / JSON FORMAT LOGS ➤ Logs should be parsed before storing to Elasticsearch ➤ Log is like
    2016-02-26 11:15:47.561 INFO [service1,2485ec27856c56f4,2485ec27856c56f4,true] 68058 --- [nio-8081-exec-1] i.s.c.sleuth.docs.service1.Application : Hello from service1. Calling service2
    ➤ Grok pattern is like
    %{TIMESTAMP_ISO8601:timestamp}\s+%{LOGLEVEL:severity}\s+\[%{DATA:service},%{DATA:trace},%{DATA:span},%{DATA:exportable}\]\s+%{DATA:pid}\s+---\s+\[%{DATA:thread}\]\s+%{DATA:class}\s+:\s+%{GREEDYDATA:rest}
    ➤ Multi line logs are hell ➤ Like aggregation of stream events
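To make the parsing cost concrete, here is a rough Java-regex equivalent of the Grok pattern above, applied to the sample log line from the slide. It is a sketch under assumptions: the class name `PlainTextLogParser` is invented, the `class` field is renamed `logger`, and the character classes are stricter than Grok's `DATA`. Note how a stack-trace continuation line simply fails to match, which is where the multi-line trouble starts.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the plain-text Spring Boot / Sleuth log format on the slide.
final class PlainTextLogParser {

    private static final String[] FIELDS = {
        "timestamp", "severity", "service", "trace", "span",
        "exportable", "pid", "thread", "logger", "rest"
    };

    private static final Pattern LINE = Pattern.compile(
        "^(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3})\\s+"
        + "(?<severity>[A-Z]+)\\s+"
        + "\\[(?<service>[^,]*),(?<trace>[^,]*),(?<span>[^,]*),(?<exportable>[^\\]]*)\\]\\s+"
        + "(?<pid>\\d+)\\s+---\\s+"
        + "\\[(?<thread>[^\\]]*)\\]\\s+"
        + "(?<logger>\\S+)\\s+:\\s+"
        + "(?<rest>.*)$");

    /** Returns the extracted fields, or empty when the line does not fit the format. */
    static Optional<Map<String, String>> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : FIELDS) {
            fields.put(name, m.group(name));
        }
        return Optional.of(fields);
    }

    public static void main(String[] args) {
        String line = "2016-02-26 11:15:47.561 INFO "
            + "[service1,2485ec27856c56f4,2485ec27856c56f4,true] 68058 --- "
            + "[nio-8081-exec-1] i.s.c.sleuth.docs.service1.Application : "
            + "Hello from service1. Calling service2";
        System.out.println(parse(line));
        // A continuation line of a stack trace is not a complete record on its own.
        System.out.println(parse("\tat com.example.Foo.bar(Foo.java:42)"));
    }
}
```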
  42. TIPS #1 / JSON FORMAT LOGS ➤ Format logs as JSON at log output time! ➤ Application Log ➤ logstash-logback-encoder <appender name="json-console" class="ch.qos.logback.core.ConsoleAppender"> <encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder"> <providers> <timestamp> <timeZone>UTC</timeZone> </timestamp> <pattern> <pattern> { "severity": "%level", "service": "${springAppName:-}", "type": "application",
  43. TIPS #1 / JSON FORMAT LOGS ➤ Format logs as JSON at log output time! ➤ Access Log ➤ Write configuration by hand
    server.tomcat.accesslog.pattern={"@timestamp":"%{yyyy-MM-dd'T'HH:mm:ss.SSSZ}t","service":"${spring.application.name}","type":"access","method":"%m","remote":"%a","path":"%U","query":"%q","duration":"%D","status":"%s","bytes":"%B","user-agent":"%{User-Agent}i","referer":"%{Referer}i","session-id":"%S"}
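What "JSON at output time" buys can be shown without any logging library. The sketch below is a hand-rolled stand-in, not logstash-logback-encoder itself; the class name `JsonLogLine` is invented and the field names are borrowed from the slides. Because line breaks inside the message are escaped, a multi-line stack trace still ends up as one record on one line.

```java
import java.time.Instant;

// Hypothetical formatter: one log event becomes exactly one line of JSON.
final class JsonLogLine {

    /** Escapes the characters that JSON strings must not contain literally. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the 4-digit hex escape form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    static String format(Instant timestamp, String severity, String service, String message) {
        return "{\"@timestamp\":\"" + timestamp + "\","
            + "\"severity\":\"" + escape(severity) + "\","
            + "\"service\":\"" + escape(service) + "\","
            + "\"type\":\"application\","
            + "\"message\":\"" + escape(message) + "\"}";
    }

    public static void main(String[] args) {
        String stackTrace = "Payment failed\n\tat com.example.Pay.run(Pay.java:10)";
        System.out.println(format(Instant.now(), "ERROR", "store-web", stackTrace));
    }
}
```

The log shipper can then forward each line as-is, with no Grok pattern and no multi-line aggregation.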
  44. TIPS #2 LOGS SHOULD BE DISTRIBUTED TRACED

  45. TIPS #2 / DISTRIBUTED TRACING ➤ In monolithic Java applications, application behaviors can be traced from application logs using “thread-id”. ➤ Logs of microservices are micro-partitioned ➤ No correlation ids for a single request! ➤ Distributed tracing helps us trace the behaviors for a request ➤ Trace-ID is like thread-id or request-id in the microservices ➤ It is passed via HTTP header or AMQP header etc.
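The propagation rule behind a Trace-ID is small: reuse the ID that arrived with the request, create one only at the edge, and attach it to every outgoing call. The sketch below illustrates that rule with plain maps standing in for HTTP headers. It is not how Spring Cloud Sleuth or Elastic APM implement it; the class name `TraceIdPropagation` is invented, and the header name `X-B3-TraceId` is assumed from the B3 keys that appear in the Sleuth logback pattern later in this deck.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration of passing a trace ID from one service to the next.
final class TraceIdPropagation {

    static final String TRACE_HEADER = "X-B3-TraceId";

    /** Reuses the caller's trace ID, or starts a new trace at the edge of the system. */
    static String resolveTraceId(Map<String, String> incomingHeaders) {
        // Real HTTP header names are case-insensitive; this sketch does an exact lookup.
        String existing = incomingHeaders.get(TRACE_HEADER);
        if (existing != null && !existing.isEmpty()) {
            return existing;
        }
        // 64-bit random value rendered as 16 lower-case hex characters.
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    /** Headers to attach to calls made to downstream services. */
    static Map<String, String> outgoingHeaders(String traceId) {
        Map<String, String> headers = new HashMap<>();
        headers.put(TRACE_HEADER, traceId);
        return headers;
    }

    public static void main(String[] args) {
        // store-web receives a request without a trace ID and starts a trace.
        String traceId = resolveTraceId(new HashMap<>());
        // item-service receives the call from store-web and sees the same ID.
        String downstream = resolveTraceId(outgoingHeaders(traceId));
        System.out.println(traceId.equals(downstream));
    }
}
```

Writing the resolved ID into every log record is what lets the logs of several services be joined on one request.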
  46. TIPS #2 / DISTRIBUTED TRACING ➤ Tools ➤ Spring Cloud Sleuth ➤ Depends on Spring DI, AOP ➤ Available for only Spring Boot applications ➤ Elastic APM ➤ Using bytecode instrumentation by ByteBuddy ➤ Available for all Java applications
  47. TAKE A LOOK AT THE LOGS

  48. TIPS #2 / DISTRIBUTED TRACING ➤ Spring Cloud Sleuth vs Elastic APM ➤ Spring Cloud Sleuth ➤ Supports most of Spring modules ➤ RestTemplate, WebFlux, Netty … ➤ Spring Cloud Stream, Spring RabbitMQ, Spring Kafka … ➤ Elastic APM ➤ Supports limited libraries ➤ Servlet, JAX-RS, Apache HttpClient, JDBC … ➤ Spring MVC, RestTemplate … ➤ Does NOT support WebFlux, RabbitMQ, Kafka …
  49. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Spring Cloud Sleuth configuration ➤ Add dependency to the pom.xml of Spring Boot application
    <dependency>
      <groupId>org.springframework.cloud</groupId>
      <artifactId>spring-cloud-starter-sleuth</artifactId>
    </dependency>
  50. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Spring Cloud Sleuth configuration ➤ Modify logback-spring.xml if you customized the log output format <pattern> <pattern> { "severity": "%level", "service": "${springAppName:-}", "type": "application", "trace": "%X{X-B3-TraceId:-}", "span": "%X{X-B3-SpanId:-}", "parent": "%X{X-B3-ParentSpanId:-}", "exportable": "%X{X-Span-Export:-}",
  51. From 17:00 at Room 401 ➤ How to Properly Blame Things for Causing Latency: An Introduction to Distributed Tracing and Zipkin ➤ By Adrian Cole
  52. PITFALL #1 ALERTING CAN ONLY DETECT PREDICTABLE PROBLEMS

  53. PITFALL #1 / ALERTS ARE ONLY FOR PREDICTABLE PROBLEMS ➤ Threshold-based alerting only detects predictable problems ➤ If we only collect CPU usage and memory usage, we cannot notice a full disk. ➤ If we only collect disk volume usage, we cannot notice a partition running out of space. ➤ If we only collect disk size, we cannot notice a shortage of inodes. ➤ Not all problems can be predicted from the beginning. ➤ If an unknown problem occurs, it should be fed back into the monitoring, because the same problem may occur again in the near future
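The blind spot of threshold-based alerting is easy to reproduce in a few lines: an evaluator can only fire for metrics that are both collected and covered by a rule. This is a minimal sketch with invented names (`ThresholdAlerts`, the metric keys, and the threshold values are all illustrative), not the alerting logic of Kibana or any other tool.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical threshold evaluator: fires only for metrics that have a rule.
final class ThresholdAlerts {

    static List<String> evaluate(Map<String, Double> metrics, Map<String, Double> thresholds) {
        List<String> alerts = new ArrayList<>();
        for (Map.Entry<String, Double> rule : thresholds.entrySet()) {
            Double value = metrics.get(rule.getKey());
            // A metric that was never collected can never trigger its rule.
            if (value != null && value > rule.getValue()) {
                alerts.add(rule.getKey());
            }
        }
        Collections.sort(alerts);
        return alerts;
    }

    public static void main(String[] args) {
        Map<String, Double> thresholds = new HashMap<>();
        thresholds.put("cpu.percent", 90.0);
        thresholds.put("disk.used.percent", 85.0);

        Map<String, Double> metrics = new HashMap<>();
        metrics.put("cpu.percent", 35.0);
        metrics.put("disk.used.percent", 40.0);
        metrics.put("inode.used.percent", 100.0); // exhausted, but nobody wrote a rule

        // No alert fires although the host can no longer create files.
        System.out.println(evaluate(metrics, thresholds));
    }
}
```

Feeding a newly discovered problem back into monitoring then amounts to adding one more collected metric and one more rule.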
  54. PITFALL #1 / ALERTS ARE ONLY FOR PREDICTABLE PROBLEMS ➤ To discover the unknown problem … ➤ Filter out all “known” logs on Kibana ➤ Only unknown logs are displayed ➤ Unknown logs may indicate unknown problems ➤ This leads to more tasks, but also to a more stable system
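The "filter out everything known" idea can be sketched outside Kibana as well. In the example below, the class name `UnknownLogFilter`, the patterns, and the log messages are all made up for illustration; in practice the known list would grow each time a message has been investigated and classified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical filter: keeps only log messages that match no known pattern.
final class UnknownLogFilter {

    static List<String> unknown(List<String> messages, List<Pattern> knownPatterns) {
        List<String> result = new ArrayList<>();
        for (String message : messages) {
            boolean known = false;
            for (Pattern pattern : knownPatterns) {
                if (pattern.matcher(message).find()) {
                    known = true;
                    break;
                }
            }
            if (!known) {
                result.add(message);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Pattern> known = Arrays.asList(
            Pattern.compile("^Started \\w+ in [\\d.]+ seconds"),
            Pattern.compile("^Order \\d+ accepted$"));
        List<String> messages = Arrays.asList(
            "Started StoreApplication in 4.2 seconds",
            "Order 1001 accepted",
            "Connection pool exhausted, waiting for a free connection");
        // Only the message nobody has classified yet remains.
        System.out.println(unknown(messages, known));
    }
}
```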
  55. PITFALL #2 LIVING DEAD PROCESS

  56. PITFALL #2 / LIVING DEAD PROCESS ➤ A (micro)service seems healthy but is actually down ➤ Returns 200 OK to the health check request ➤ Metrics are normal ➤ But returns no response at all ➤ The last update time of each service's log should be monitored ➤ A living application produces some log ➤ If there is no log, it may be dead
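Monitoring the last log time comes down to comparing a per-service timestamp with the current time. A minimal sketch, assuming the latest log timestamp of each service has already been looked up (for example from the log store); the class name `StaleLogDetector`, the timestamps, and the five-minute limit are placeholders.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical check: which services have been silent for longer than allowed?
final class StaleLogDetector {

    static List<String> silentServices(Map<String, Instant> lastLogTime,
                                       Instant now,
                                       Duration maxSilence) {
        List<String> silent = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : lastLogTime.entrySet()) {
            Duration silence = Duration.between(entry.getValue(), now);
            if (silence.compareTo(maxSilence) > 0) {
                silent.add(entry.getKey());
            }
        }
        Collections.sort(silent);
        return silent;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2019-10-04T09:30:00Z");
        Map<String, Instant> lastLogTime = new HashMap<>();
        lastLogTime.put("store-web", Instant.parse("2019-10-04T09:29:40Z"));
        lastLogTime.put("cart-service", Instant.parse("2019-10-04T09:10:00Z"));

        // cart-service may still answer health checks, but it has logged nothing for 20 minutes.
        System.out.println(silentServices(lastLogTime, now, Duration.ofMinutes(5)));
    }
}
```

A low-traffic service needs a longer limit, or a periodic heartbeat log line, to avoid false alarms.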
  57. IMPROVE MONITORING

  58. KPI OF MONITORING ITSELF

  59. MONITORING AND ALERTING / METRICS ➤ Monitoring KPIs ➤ (1) The mean time from when a problem occurs to when it is detected ➤ (2) # of problems detected by alerting / Total # of problems ➤ (3) # of monitoring improvements / Total # of problems ➤ (4) (2) + (3) / Total # of problems ➤ We can clarify if the monitoring environment became better or not ➤ Monitoring improvement leads to more stable system operation
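The KPIs above are simple ratios and an average, so they can be tracked with very little code. The sketch below is one possible reading, not the speaker's tooling: the class name `MonitoringKpi` and all numbers are invented, and KPI (4) is interpreted here as (problems detected by alerting + monitoring improvements) / total problems, which is an assumption about the slide's shorthand.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

// Hypothetical KPI calculator for the monitoring environment itself.
final class MonitoringKpi {

    /** KPI (1): mean time from occurrence to detection, in seconds. */
    static double meanTimeToDetectSeconds(List<Duration> detectionDelays) {
        if (detectionDelays.isEmpty()) {
            throw new IllegalArgumentException("no problems recorded");
        }
        long totalMillis = 0;
        for (Duration delay : detectionDelays) {
            totalMillis += delay.toMillis();
        }
        return totalMillis / 1000.0 / detectionDelays.size();
    }

    /** KPIs (2) to (4): a count divided by the total number of problems. */
    static double ratio(int count, int totalProblems) {
        if (totalProblems <= 0) {
            throw new IllegalArgumentException("totalProblems must be positive");
        }
        return (double) count / totalProblems;
    }

    public static void main(String[] args) {
        // Illustrative numbers for one month: 4 problems in total.
        List<Duration> delays = Arrays.asList(
            Duration.ofMinutes(2), Duration.ofMinutes(10),
            Duration.ofMinutes(45), Duration.ofMinutes(3));
        int total = delays.size();
        int detectedByAlerting = 3;
        int monitoringImprovements = 1;

        System.out.println("(1) MTTD seconds: " + meanTimeToDetectSeconds(delays));
        System.out.println("(2) alert detection ratio: " + ratio(detectedByAlerting, total));
        System.out.println("(3) improvement ratio: " + ratio(monitoringImprovements, total));
        System.out.println("(4) combined: " + ratio(detectedByAlerting + monitoringImprovements, total));
    }
}
```

Comparing these values from one period to the next shows whether the monitoring environment is improving.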
  60. ENJOY MONITORING YOUR SYSTEM! ➤ Demo application and monitoring environment ➤ https://github.com/cero-t/spring-store-2019/ ➤ My Twitter (@cero_t) ➤ https://twitter.com/cero_t