sound when the speed exceeds 100 km/h ➤ This is alerting ➤ When a traffic accident happens, logs such as those from a drive recorder are necessary ➤ That is troubleshooting ➤ Analyzing the driver's steering and pedal operation may prevent a potential traffic accident ➤ That is discovering
Avg. HTTP response time of front-end servers ➤ Avg. HTTP response time of each (micro)service ➤ Database response time ➤ Each of these metrics serves a Monitoring / Alerting or Troubleshooting purpose
HTTP access count of front-end servers ➤ Both too many and too few requests are problems ➤ HTTP access count of each (micro)service ➤ HTTP status of front-end servers ➤ HTTP status of each (micro)service ➤ Each of these metrics serves a Monitoring / Alerting, Monitoring, or Troubleshooting purpose
Elasticsearch ➤ Open source full-text search engine ➤ Useful as a data store for logs and metrics ➤ Can store, search and aggregate logs quickly ➤ Kibana ➤ Open source visualizer for Elasticsearch ➤ View / create charts, maps, tables, etc. in a web browser
See all the information in one tool ➤ The monitoring environment should be like a baseball scoreboard ➤ Viewed by all members ➤ Shows the current situation at a glance ➤ Don't break the combination of Monitoring, Alerting, Troubleshooting and Discovering
➤ Just run Metricbeat (works out of the box, no configuration) ➤ Container (Docker) monitoring ➤ Enable Metricbeat's modules.d/docker.yml (see the snippet below) ➤ Run Metricbeat on Docker ➤ Metricbeat (auto)discovers containers and retrieves container metrics
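As an illustration (not taken from the deck): enabling the Docker module typically means running `metricbeat modules enable docker`, which activates modules.d/docker.yml; its contents look roughly like the following sketch (metricsets and period are the usual defaults, adjust as needed):

  # modules.d/docker.yml -- rough sketch of the stock module configuration
  - module: docker
    metricsets:
      - "container"
      - "cpu"
      - "diskio"
      - "memory"
      - "network"
    hosts: ["unix:///var/run/docker.sock"]   # Metricbeat reads metrics from the Docker socket
    period: 10s
    enabled: true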
(1/2) ➤ Add the dependency to the pom.xml of the Spring Boot application: <dependency> <groupId>co.elastic.apm</groupId> <artifactId>apm-agent-attach</artifactId> <version>1.9.0</version> </dependency>
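The second step (2/2), not shown in this excerpt, is presumably the programmatic attach. A minimal sketch, assuming the standard co.elastic.apm.attach.ElasticApmAttacher API that this dependency provides:

  import co.elastic.apm.attach.ElasticApmAttacher;
  import org.springframework.boot.SpringApplication;
  import org.springframework.boot.autoconfigure.SpringBootApplication;

  @SpringBootApplication
  public class Application {
      public static void main(String[] args) {
          // Attach the Elastic APM agent to this JVM before the application starts
          ElasticApmAttacher.attach();
          SpringApplication.run(Application.class, args);
      }
  }

The agent then reads settings such as the APM Server URL from an elasticapm.properties classpath resource or from environment variables.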
Logs must be parsed before storing to Elasticsearch ➤ A log line looks like: 2016-02-26 11:15:47.561 INFO [service1,2485ec27856c56f4,2485ec27856c56f4,true] 68058 --- [nio-8081-exec-1] i.s.c.sleuth.docs.service1.Application : Hello from service1. Calling service2 ➤ The grok pattern looks like: %{TIMESTAMP_ISO8601:timestamp}\s+%{LOGLEVEL:severity}\s+\[%{DATA:service},%{DATA:trace},%{DATA:span},%{DATA:exportable}\]\s+%{DATA:pid}\s+---\s+\[%{DATA:thread}\]\s+%{DATA:class}\s+:\s+%{GREEDYDATA:rest} ➤ Multi-line logs are hell ➤ Handling them is like aggregating stream events
JSON at the log output time! ➤ Access Log ➤ Write the configuration by hand (a sample entry is shown below): server.tomcat.accesslog.pattern={"@timestamp":"%{yyyy-MM-dd'T'HH:mm:ss.SSSZ}t","service":"${spring.application.name}","type":"access","method":"%m","remote":"%a","path":"%U","query":"%q","duration":"%D","status":"%s","bytes":"%B","user-agent":"%{User-Agent}i","referer":"%{Referer}i","session-id":"%S"}
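With that pattern, each access-log entry comes out as one JSON document per line, roughly like this (all values are made up for illustration):

  {"@timestamp":"2019-06-01T12:00:00.000+0900","service":"service1","type":"access","method":"GET","remote":"127.0.0.1","path":"/hello","query":"","duration":"12","status":"200","bytes":"42","user-agent":"curl/7.64.0","referer":"-","session-id":"-"}

The shipper can then send each line as-is and Elasticsearch indexes the fields without any grok parsing.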
In a monolith, application behaviors can be traced from application logs using the thread-id ➤ Logs of microservices are micro-partitioned ➤ There are no correlation IDs for a single request! ➤ Distributed tracing helps us trace the behavior of a request ➤ A Trace-ID is like a thread-id or request-id for microservices ➤ It is passed via HTTP headers, AMQP headers, etc. (see the sketch below)
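To make the idea concrete, here is a hand-rolled sketch of what trace-ID propagation amounts to; in practice Sleuth or Elastic APM do this automatically. The header name X-B3-TraceId is the B3 propagation format used by Spring Cloud Sleuth; the MDC key "traceId" is an assumption for illustration:

  import java.io.IOException;
  import java.util.UUID;
  import javax.servlet.FilterChain;
  import javax.servlet.ServletException;
  import javax.servlet.http.HttpServletRequest;
  import javax.servlet.http.HttpServletResponse;
  import org.slf4j.MDC;
  import org.springframework.stereotype.Component;
  import org.springframework.web.filter.OncePerRequestFilter;

  @Component
  public class TraceIdFilter extends OncePerRequestFilter {
      @Override
      protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                      FilterChain chain) throws ServletException, IOException {
          // Reuse the trace id propagated by the caller, or start a new one
          String traceId = request.getHeader("X-B3-TraceId");
          if (traceId == null || traceId.isEmpty()) {
              traceId = UUID.randomUUID().toString().replace("-", "");
          }
          MDC.put("traceId", traceId);           // visible to every log statement of this request
          response.setHeader("X-B3-TraceId", traceId);
          try {
              chain.doFilter(request, response);
          } finally {
              MDC.remove("traceId");
          }
      }
  }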
Sleuth ➤ Depends on Spring DI and AOP ➤ Available only for Spring Boot applications ➤ Elastic APM ➤ Uses bytecode instrumentation via Byte Buddy ➤ Available for all Java applications
Elastic APM ➤ Spring Cloud Sleuth ➤ Supports most Spring modules ➤ RestTemplate, WebFlux, Netty … ➤ Spring Cloud Stream, Spring RabbitMQ, Spring Kafka … ➤ Elastic APM ➤ Supports a limited set of libraries ➤ Servlet, JAX-RS, Apache HttpClient, JDBC … ➤ Spring MVC, RestTemplate … ➤ Does NOT support WebFlux, RabbitMQ, Kafka …
➤ Add the dependency to the pom.xml of the Spring Boot application: <dependency> <groupId>org.springframework.cloud</groupId> <artifactId>spring-cloud-starter-sleuth</artifactId> </dependency>
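No tracing-specific code is needed after that. A plain controller like the sketch below (service name, port and URL are assumptions) already produces log lines in the [service,trace,span,exportable] format shown earlier, and the trace id is propagated on the outgoing call as long as the RestTemplate is a Spring bean:

  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;
  import org.springframework.web.bind.annotation.GetMapping;
  import org.springframework.web.bind.annotation.RestController;
  import org.springframework.web.client.RestTemplate;

  @RestController
  public class Service1Controller {
      private static final Logger log = LoggerFactory.getLogger(Service1Controller.class);
      private final RestTemplate restTemplate;   // must be a bean so Sleuth can instrument it

      public Service1Controller(RestTemplate restTemplate) {
          this.restTemplate = restTemplate;
      }

      @GetMapping("/hello")
      public String hello() {
          // Sleuth prefixes this log line with [service1,<traceId>,<spanId>,<exportable>]
          log.info("Hello from service1. Calling service2");
          return restTemplate.getForObject("http://localhost:8082/hello", String.class);
      }
  }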
Threshold-based alerting only detects predictable problems ➤ If we only collect CPU usage and memory usage, we cannot notice a full disk ➤ If we only collect disk volume usage, we cannot notice a partition running out of space ➤ If we only collect disk size, we cannot notice inode exhaustion ➤ Not all problems can be predicted from the beginning ➤ When an unknown problem occurs, it should be fed back into the monitoring, because the same problem may occur again in the near future
To discover unknown problems … ➤ Filter out all "known" logs in Kibana (a hypothetical query is sketched below) ➤ Only unknown logs are displayed ➤ Unknown logs may indicate unknown problems ➤ This creates more work, but makes the system more stable
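As a hypothetical illustration, such a filter can be a saved search in Kibana whose query excludes the patterns already classified as known (the patterns below are made up; real ones come from your own inventory of known logs):

  NOT (message:"health check" OR message:"Started Application" OR message:"Scheduled job finished")

Each time a new log is investigated and understood, its pattern is added to this exclusion list, so the view keeps showing only genuinely unknown logs.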
A (micro)service can seem healthy but actually be down ➤ It returns 200 OK to health check requests ➤ Metrics are normal ➤ But it returns no response at all ➤ The last update time of each service's log should be monitored (see the sketch below) ➤ A living application produces some log output ➤ If there is no log, it may be dead
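In practice this can be an Elasticsearch query for the latest @timestamp per service. As a minimal self-contained sketch of the idea (path and threshold are assumptions), a watchdog can simply check how long a log file has been silent:

  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;
  import java.time.Duration;
  import java.time.Instant;

  public class LogFreshnessCheck {
      public static void main(String[] args) throws IOException {
          Path logFile = Paths.get("/var/log/app/service1.log");   // assumed log location
          Instant lastWrite = Files.getLastModifiedTime(logFile).toInstant();
          Duration silence = Duration.between(lastWrite, Instant.now());
          // A living service keeps writing logs; prolonged silence is a strong hint it is stuck
          if (silence.compareTo(Duration.ofMinutes(5)) > 0) {
              System.out.println("ALERT: no log output from service1 for " + silence.toMinutes() + " minutes");
          }
      }
  }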
(1) Time from when a problem occurs to when it is detected ➤ (2) # of problems detected by alerting / Total # of problems ➤ (3) # of monitoring improvements / Total # of problems ➤ (4) (# of problems detected by alerting + # of monitoring improvements) / Total # of problems ➤ We can clarify whether the monitoring environment has become better or not ➤ Monitoring improvement leads to more stable system operation
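A worked example with hypothetical numbers: if 10 problems occurred in a quarter, 6 were detected by alerting, and 3 of the remaining ones led to monitoring improvements, then (2) = 6/10 = 0.6, (3) = 3/10 = 0.3 and (4) = (6 + 3)/10 = 0.9; (4) trending toward 1.0 over time indicates that the monitoring environment is improving.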